Reading and Writing Pandas DataFrames in Chunks
03 Apr 2021
This is a quick example of how to chunk a large data set with Pandas when it otherwise won’t fit into memory. You will see how to apply this to CSV files with pandas.read_csv.
Create Pandas Iterator
First, create a TextFileReader object for iteration. This won’t load the data until you start iterating over it. Here the data is chunked into DataFrames with 10,000 rows each:
import pandas as pd

df_iterator = pd.read_csv(
    'input_data.csv.gz',
    chunksize=10000,
    compression='gzip')
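As an aside, a TextFileReader can also hand out a single batch on demand via its get_chunk method, which is handy for peeking at the data before processing everything. A minimal sketch, using a second reader so the one above stays untouched (rows fetched this way are consumed and won’t show up again when iterating):

# Separate reader just for previewing the first chunk
preview_reader = pd.read_csv(
    'input_data.csv.gz',
    chunksize=10000,
    compression='gzip')
# get_chunk() returns the next chunk of chunksize rows
print(preview_reader.get_chunk().shape)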
Iterate over the File in Batches
Now you can use the iterator to load the chunked DataFrames one at a time. Assume you have a function do_something(df_chunk) that performs whatever operation you need on each chunk:
for i, df_chunk in enumerate(df_iterator):
    do_something(df_chunk)
    # Write the first chunk, then append the following ones
    mode = 'w' if i == 0 else 'a'
    # Add the header only for the first chunk
    header = i == 0
    df_chunk.to_csv(
        'dst_data.csv.gz',
        index=False,  # Skip the index column
        header=header,
        mode=mode,
        compression='gzip')
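The do_something function above is just a placeholder for whatever per-chunk work you need. A hypothetical sketch that modifies each chunk in place before it is written (the column names value and value_squared are made up for illustration):

def do_something(df_chunk):
    # Hypothetical transformation: add a derived column in place,
    # so the change is reflected in the chunk written afterwards
    # ('value' is an assumed column name, not from the original data)
    df_chunk['value_squared'] = df_chunk['value'] ** 2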
By default, Pandas infers the compression from the filename. Other supported compression formats include bz2, zip, and xz.
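For instance, a reader for a hypothetical bz2-compressed file could rely entirely on the file extension, since compression='infer' is the default:

# Compression is inferred from the .bz2 extension;
# equivalent to passing compression='bz2' explicitly
df_iterator = pd.read_csv('input_data.csv.bz2', chunksize=10000)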
Resources
For more information on chunking, have a look at the Pandas documentation on chunking. Another useful tool for working with data that won’t fit into memory is Dask. Dask can parallelize the workload across multiple cores or even multiple machines, although it is not a drop-in replacement for Pandas and is better thought of as a wrapper around Pandas.
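As a rough sketch of how the same lazy, partitioned reading might look with Dask (the file and column names are made up, and dask must be installed separately):

import dask.dataframe as dd

# Dask splits the CSV into partitions and reads them lazily
ddf = dd.read_csv('input_data.csv')

# Work is spread across partitions; nothing runs until .compute() is called
mean_value = ddf['value'].mean().compute()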