Pandas reads and writes CSV files serially, which makes these operations slow and time-consuming on large files. It's frustrating, because I see ample scope for parallelization here, but Pandas does not provide this functionality (yet). Although I am never in favor of creating CSVs with Pandas in the first place (read my post below to know why), I understand that there are situations where one has no choice but to work with CSVs.
Therefore, in this post, we will explore Dask and DataTable, two popular Pandas-like libraries for Data Scientists. We'll rank Pandas, Dask, and DataTable based on their performance on the following two parameters:
- Time taken to read the CSV and obtain a PANDAS DATAFRAME
If we read a CSV through Dask or DataTable, we get a Dask DataFrame or a DataTable Frame respectively, not a Pandas DataFrame. Assuming that we want to stick to the traditional Pandas syntax and functions (due to familiarity), we would first have to convert the result to a Pandas DataFrame. The time taken for this conversion is therefore counted as part of the read time.
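The read-then-convert step can be sketched roughly as follows. This is a minimal illustration, not necessarily the exact code behind the benchmark: the function names are hypothetical, and Dask and DataTable are imported lazily so that the Pandas baseline runs even if they are not installed.

```python
import pandas as pd


def read_with_pandas(path):
    # Baseline: Pandas parses the CSV directly into a Pandas DataFrame.
    return pd.read_csv(path)


def read_with_dask(path):
    import dask.dataframe as dd  # lazy import; assumes dask is installed

    # dd.read_csv only builds a lazy Dask DataFrame;
    # compute() performs the read and returns a Pandas DataFrame.
    return dd.read_csv(path).compute()


def read_with_datatable(path):
    import datatable as dt  # lazy import; assumes datatable is installed

    # fread returns a datatable Frame; to_pandas() converts it.
    return dt.fread(path).to_pandas()
```

All three functions return a Pandas DataFrame, so whatever follows can use the usual Pandas syntax regardless of which library did the parsing.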
- Time taken to store a PANDAS DATAFRAME to a CSV
The objective is to generate a CSV file from a given Pandas DataFrame. For Pandas, we are already aware of the df.to_csv() method. However, to create a CSV with Dask or DataTable, we first need to convert the given Pandas DataFrame to the library's own DataFrame and then write that to a CSV. Thus, we'll also count the time taken for this DataFrame conversion in the analysis.
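A rough sketch of the convert-then-write step is shown here. The function names and the `npartitions=4` default are assumptions made for illustration; the post does not state which settings were used.

```python
import pandas as pd


def write_with_pandas(df, path):
    # Baseline: write the Pandas DataFrame directly.
    df.to_csv(path, index=False)


def write_with_dask(df, path, npartitions=4):
    import dask.dataframe as dd  # lazy import; assumes dask is installed

    # Convert to a Dask DataFrame (npartitions is an assumed value),
    # then write one CSV rather than one file per partition.
    ddf = dd.from_pandas(df, npartitions=npartitions)
    ddf.to_csv(path, single_file=True, index=False)


def write_with_datatable(df, path):
    import datatable as dt  # lazy import; assumes datatable is installed

    # Convert to a datatable Frame, then write it out.
    dt.Frame(df).to_csv(path)
```

In each case the conversion happens inside the function, so timing a call captures both the conversion and the write.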
- For experimentation purposes, I generated a random dataset in Python with a variable number of rows and thirty columns, covering string, float, and integer data types.
- I repeated each experiment described below five times to reduce noise and draw fair conclusions from the observed results. The figures I report in the section below are averages across the five runs.
- Python environment and libraries:
  - Python 3.9.12
  - Pandas 1.4.2
  - DataTable 1.0.0
  - Dask 2022.02.1
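The setup above could be reproduced along these lines. This is only a sketch under stated assumptions: the even split of the thirty columns across the three data types, the value ranges, the string vocabulary, and the helper names `make_dataset` and `average_time` are all illustrative choices, not details taken from the original experiment.

```python
import time

import numpy as np
import pandas as pd


def make_dataset(n_rows, n_cols=30, seed=0):
    # Build a random DataFrame that cycles through int, float and string columns.
    rng = np.random.default_rng(seed)
    data = {}
    for i in range(n_cols):
        kind = i % 3
        if kind == 0:
            data[f"int_{i}"] = rng.integers(0, 1000, size=n_rows)
        elif kind == 1:
            data[f"float_{i}"] = rng.random(n_rows)
        else:
            data[f"str_{i}"] = rng.choice(["alpha", "beta", "gamma", "delta"], size=n_rows)
    return pd.DataFrame(data)


def average_time(fn, *args, repeats=5):
    # Run fn(*args) `repeats` times and return the mean wall-clock time in seconds.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return sum(timings) / repeats
```

For example, `average_time(pd.read_csv, "data.csv")` would give the mean read time over five runs for a hypothetical file `data.csv`.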
Experiment 1: Time taken to read the CSV
The plot below depicts the time taken (in seconds) by Pandas, Dask, and DataTable to read a CSV file and generate a Pandas DataFrame. The number of rows of the CSV ranges from 100k to 5 million.