Technical Tuesday: Storing your data effectively with Python and Pandas
Sourcing, cleaning and storing data are often the most time-consuming steps of a data science pipeline. Using and calibrating models has become fairly convenient over the past decade, but the "plumbing" parts of the process can still be quite painful. When you experiment with new data to evaluate a potential project, it is a good idea to keep things very simple. Instead of setting up a full database, it might be worth managing and storing your data in flat files. Today I will describe two simple and effective methods, both of which are far easier and quicker to use than a range of alternatives. Simplicity is the mother of robustness, as they say.
- CSV (comma-separated value) files, which are language and platform agnostic
- Pickle, a Python-native tool to serialise and de-serialise objects
Python has a built-in csv module, which is worth exploring. However, I find myself using Pandas for almost all of my exploratory and experimental data science work, so I would recommend using its CSV methods and Pickle together with Pandas, as they are much more convenient.
First we import standard libraries such as Pandas and NumPy, and install a package to download stock price data from Yahoo Finance (YF). Finally we set some parameters for the data retrieval, including dates, frequency and the tickers of the S&P 500.
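The setup might look roughly like the sketch below. The parameter names, the yfinance package and the Wikipedia source for the ticker list are illustrative assumptions, not necessarily the exact code from the original notebook.

```python
import datetime as dt

import numpy as np
import pandas as pd
import yfinance as yf  # pip install yfinance

# Parameters for the data download (illustrative names)
START_DATE = "2015-01-01"
END_DATE = dt.date.today().isoformat()
INTERVAL = "1d"  # daily prices

# One way to get S&P 500 tickers: the constituents table on Wikipedia
sp500 = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")[0]
TICKERS = sp500["Symbol"].head(100).tolist()
```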
The idea is to use the data storage tools in a real-world setting instead of on randomly generated toy data. We define the function load_price_data to retrieve data from YF for the first 100 tickers of the S&P 500. As YF can be a bit unreliable, we use try/except statements - a defensive programming technique - to make sure we get an error message if there is a problem.
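A minimal sketch of such a function, assuming yfinance's download() call and using the close prices; the exact column handling in the original may differ.

```python
def load_price_data(tickers, start, end, interval="1d"):
    """Download close prices from Yahoo Finance for the given tickers."""
    try:
        raw = yf.download(tickers, start=start, end=end, interval=interval)
        prices = raw["Close"]  # one column of closing prices per ticker
    except Exception as exc:
        # Defensive programming: report a readable error instead of failing silently
        print(f"Could not retrieve data from Yahoo Finance: {exc}")
        return None
    return prices

data = load_price_data(TICKERS, START_DATE, END_DATE, INTERVAL)
```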
Here are the first few rows and columns of the data set. The DataFrame has dimensions of 1525 rows (trading days since our start date 1 January 2015) and 100 columns (for the first 100 components of the S&P 500).
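That kind of inspection can be reproduced with something like the following (variable name assumed):

```python
print(data.iloc[:5, :5])  # first few rows and columns
print(data.shape)         # roughly (1525, 100) for daily data since 1 January 2015
```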
The amount of space/memory this data takes up is around 1.2 MB - a small amount by any standard. In order to benchmark the speed of reading from and writing to our two formats, it is better to work with a larger amount of data. We therefore multiply the number of rows by 1,000, which increases our file size/memory requirement by the same factor.
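A sketch of those two steps, measuring the in-memory size and then enlarging the frame (variable names again illustrative):

```python
# In-memory footprint of the original frame (~1.2 MB)
print(f"{data.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# Repeat the rows 1,000 times to obtain a ~1.2 GB benchmark data set
big_data = pd.concat([data] * 1000)
print(f"{big_data.memory_usage(deep=True).sum() / 1e9:.2f} GB")
```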
We now save the 1.2 GB DataFrame as a CSV file, which is really rather simple. This operation took almost 4 minutes, as you can see below.

Loading the data from the CSV file is almost as easy, but there are a few things to consider when using read_csv. As the name implies, CSV uses commas to separate values, but I have seen semicolons in such files as well. It is therefore worth investigating the data - a rare possibility that CSV files afford, since you can open them and visually inspect the values separated by commas (or other separators). A non-default separator can be specified with an argument if needed. Our file has a different issue: when the data is read back in naively, the date index is replaced with a regular range index (and the dates are pushed into a new "unnamed" column). We need to tell the read_csv() method to look at the first column (index: 0) to find our preferred index and to parse that column so the values are converted to datetime. In terms of performance, reading the roughly 1.5 million rows took about 13.6 seconds.
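In code, the round trip looks roughly like this (file name assumed; in the notebook each step was timed with %%timeit):

```python
# Writing: slow for a frame of this size (almost 4 minutes in the post)
big_data.to_csv("data.csv")

# Reading: take the index from the first column and parse it as datetime;
# pass e.g. sep=";" if the file uses a different separator
restored = pd.read_csv("data.csv", index_col=0, parse_dates=True)
```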
As we can see below, the data has been restored to exactly the same format. We can also see that the index has successfully been converted to "datetime64".
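A quick check along these lines confirms the round trip (variable names assumed):

```python
print(restored.index.dtype)              # datetime64[ns]
print(restored.shape == big_data.shape)  # True: same dimensions as before
```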
Now we move on to the next tool: Pickle. We need to import the module first. Below is also an example of how we should try to wrap procedures we use again and again into functions so we do not repeat the original code (the old DRY principle of programming: Don't Repeat Yourself). The function pickle_save takes as arguments the data we are looking to "pickle" and a file name. We assert that the filename must be a string - assert statements are another example of defensive programming. We then open the file, use pickle's dump method and close the file. Next we call that function, passing it the data and just the name "data_pickle"; as you can see, the function automatically adds the file extension .pkl. The cell magic %%timeit (which works in Jupyter notebooks and IPython, but not in plain Python scripts) tells us that "pickling" the data took 15.8 seconds. That is a massive improvement over writing to CSV, which took almost 4 minutes for the same operation.
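Reconstructed from the description above, pickle_save might look like this:

```python
import pickle

def pickle_save(data, filename):
    """Serialise `data` to `<filename>.pkl`."""
    assert isinstance(filename, str), "filename must be a string"
    f = open(filename + ".pkl", "wb")
    pickle.dump(data, f)
    f.close()

pickle_save(big_data, "data_pickle")  # writes data_pickle.pkl
```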
We now want to load the pickled data back into memory, which is easily done with the code below. Note that this is an alternative to writing a loading counterpart to the pickle_save function above: the statement "with open(...) as f:" does not require us to close() the file at the end - that is done automatically, saving us an extra line of code. Whether you prefer a dedicated loading function in the style of pickle_save or the code below is a matter of preference; both achieve the same outcome. The loading time is less than one second - very fast and impressive compared to read_csv, which took almost 14 seconds.
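The loading snippet, assuming the file name from above:

```python
# The context manager closes the file automatically - no explicit close() needed
with open("data_pickle.pkl", "rb") as f:
    data_from_pickle = pickle.load(f)
```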
The above example used only numeric data; strings or categorical data can change the performance picture dramatically. And there are more dimensions along which one can compare these and other data storage options, e.g. file size. Compared to the 1.23 GB CSV file, the pickle file takes up 1.17 GB, so marginally less space. Memory use is another dimension to consider.
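File sizes can be checked directly from Python; a small sketch assuming the file names used above:

```python
import os

# On-disk sizes (the post reports ~1.23 GB for CSV vs ~1.17 GB for pickle)
for path in ["data.csv", "data_pickle.pkl"]:
    print(path, round(os.path.getsize(path) / 1e9, 2), "GB")
```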
There is a much more comprehensive comparison in this article, which I highly recommend. It shows that for reading and writing string data, CSV is the slowest option. However, the difference in load time between CSV and Pickle (roughly 2x) is not nearly as large as it was in our example above (32x).
There are a variety of other solutions such as Feather, HDF and Parquet; JSON files and other dictionary-like formats are also popular choices. My default is to start with CSV and Pickle. Once the relevant objectives are established and prioritised - read speed, write speed, file size, memory use etc. - a more informed decision can be made as to which solution is most appropriate.