Image from Wikimedia Commons

Reading and Writing Pandas DataFrames in Chunks

03 Apr 2021

Create Pandas Iterator
Iterate over the File in Batches
Resources

This is a quick example how to chunk a large data set with Pandas that otherwise won’t fit into memory. In this short example you will see how to apply this to CSV files with pandas.read_csv.

Create Pandas Iterator

First, create a TextFileReader object for iteration. This won’t load the data until you start iterating over it. Here it chunks the data in DataFrames with 10000 rows each:

df_iterator = pd.read_csv(
    'input_data.csv.gz', 
    chunksize=10000,
    compression='gzip')

Iterate over the File in Batches

Now, you can use the iterator to load the chunked DataFrames iteratively. Here you have a function do_something(df_chunk), that is some operation that you need to have done on the table:

for i, df_chunk in enumerate(df_iterator)

    do_something(df_chunk)
    
    # Set writing mode to append after first chunk
    mode = 'w' if i == 0 else 'a'
    
    # Add header if it is the first chunk
    header = i == 0

    df_chunk.to_csv(
        "dst_data.csv.gz",
        index=False,  # Skip index column
        header=header, 
        mode=mode,
        compression='gzip')

By default, Pandas infers the compression from the filename. Other supported compression formats include bz2, zip, and xz.

Resources

For more information on chunking, have a look at the documentation on chunking. Another useful tool, when working with data that won’t fit your memory, is Dask. Dask can parallelize the workload on multiple cores or even multiple machines, although it is not a drop-in replacement for Pandas and can be rather viewed as a wrapper for Pandas.

home · about · projects

Reading and Writing Pandas DataFrames in Chunks

Table of Contents

Create Pandas Iterator

Iterate over the File in Batches

Resources

home · about · projects

Reading and Writing Pandas DataFrames in Chunks

Table of Contents

Create Pandas Iterator

Iterate over the File in Batches

Resources

Related Posts

Working with Missing Values in Pandas 19 Feb 2018

Reading and Writing Parquet Files on S3 with Pandas and PyArrow 10 Apr 2022

Working with MultiIndex and Pivot Tables in Pandas and Python 22 Apr 2018

Working with Missing Values in Pandas
19 Feb 2018

Reading and Writing Parquet Files on S3 with Pandas and PyArrow
10 Apr 2022

Working with MultiIndex and Pivot Tables in Pandas and Python
22 Apr 2018