Reading and Writing Pandas DataFrames in Chunks

03 Apr 2021
This is a quick example of how to process a large data set in chunks with Pandas when it otherwise won’t fit into memory. In this short example you will see how to apply this to CSV files with pandas.read_csv.
Create Pandas Iterator
First, create a TextFileReader object for iteration. This won’t load the data until you start iterating over it. Here it chunks the data into DataFrames with 10000 rows each:
    df_iterator = pd.read_csv(
        'input_data.csv.gz',
        chunksize=10000,
        compression='gzip')
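To see what the iterator yields, here is a minimal self-contained sketch (the filename example.csv and the column name x are made up for illustration): each item is an ordinary DataFrame with at most chunksize rows.

```python
import pandas as pd

# Build a small example file (25 rows) so the snippet is self-contained;
# 'example.csv' and the column 'x' are made-up example names.
pd.DataFrame({"x": range(25)}).to_csv("example.csv", index=False)

# chunksize=10 yields DataFrames of at most 10 rows each
chunk_lengths = [len(chunk) for chunk in pd.read_csv("example.csv", chunksize=10)]
print(chunk_lengths)  # [10, 10, 5]
```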
Iterate over the File in Batches
Now, you can use the iterator to load the chunked DataFrames one at a time. Here you have a function do_something(df_chunk), which performs some operation that you need done on each chunk:
    for i, df_chunk in enumerate(df_iterator):
        do_something(df_chunk)
        # Set writing mode to append after the first chunk
        mode = 'w' if i == 0 else 'a'
        # Add the header only for the first chunk
        header = i == 0
        df_chunk.to_csv(
            "dst_data.csv.gz",
            index=False,  # Skip index column
            header=header,
            mode=mode,
            compression='gzip')
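A common stand-in for do_something is an aggregation that accumulates a result across chunks, so the full file is never in memory at once. Here is a hedged sketch under the assumption of a made-up file numbers.csv with a single value column:

```python
import pandas as pd

# Self-contained setup: a throwaway file with a made-up 'value' column.
pd.DataFrame({"value": range(100)}).to_csv("numbers.csv", index=False)

# Accumulate a total chunk by chunk instead of loading everything at once
total = 0
for df_chunk in pd.read_csv("numbers.csv", chunksize=30):
    total += df_chunk["value"].sum()

print(total)  # 4950, the same result as summing the file in one go
```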
By default, Pandas infers the compression from the filename. Other supported compression formats include bz2, zip, and xz.
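The inference works in both directions, on write and on read. A small sketch (the filename roundtrip.csv.gz and the column a are made up): because the name ends in .gz, neither call needs an explicit compression argument.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# The '.gz' suffix lets pandas infer gzip on both write and read;
# 'roundtrip.csv.gz' is a made-up example filename.
df.to_csv("roundtrip.csv.gz", index=False)   # compression inferred from suffix
restored = pd.read_csv("roundtrip.csv.gz")   # inferred here as well
print(restored.equals(df))  # True
```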
For more information, have a look at the Pandas documentation on chunking. Another useful tool, when working with data that won’t fit in memory, is Dask. Dask can parallelize the workload across multiple cores or even multiple machines, although it is not a drop-in replacement for Pandas and is better viewed as a wrapper around it.