Reading and Writing Parquet Files on S3 with Pandas and PyArrow

When working with large amounts of data, a common approach is to store it in S3 buckets. Instead of dumping the data as CSV or plain text files, a good option is Apache Parquet, a columnar file format with built-in compression. In this short guide you’ll see how to read and write Parquet files on S3 using Python, Pandas and PyArrow.

This guide was tested using Contabo object storage, MinIO, and Linode Object Storage. You should be able to use it on most S3-compatible providers and software.

Prepare Connection

Prepare the S3 environment variables in a file called .env in the project folder with the following contents:

S3_REGION=eu-central-1
S3_ENDPOINT=https://eu-central-1.domain.com
S3_ACCESS_KEY=XXXX
S3_SECRET_KEY=XXXX

Create or choose an S3 bucket to use. This guide uses the bucket s3://s3-example to store and access the data. Next, generate some random example data and save it locally:

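import os

import numpy as np
import pandas as pd

# Create the local output folder first; to_parquet does not create it
os.makedirs("data", exist_ok=True)

df = pd.DataFrame({'data': np.random.random((1000,))})
df.to_parquet("data/data.parquet")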

Load the environment variables in your script with python-dotenv:

from dotenv import load_dotenv
load_dotenv()
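Before creating the connection, it can help to confirm that all four settings were actually loaded. The following is a minimal sketch, not part of python-dotenv: the helper name missing_s3_vars and the constant REQUIRED_S3_VARS are hypothetical, and the variable names are the ones from the .env file above.

```python
import os

# Variable names taken from the .env file shown earlier.
REQUIRED_S3_VARS = ("S3_REGION", "S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY")


def missing_s3_vars(env=None):
    """Return the required S3 variable names that are unset or empty.

    Hypothetical helper: checks os.environ by default, or any mapping
    passed in (handy for trying it out without touching the environment).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


# Example with a partial configuration: the two key variables are absent.
example_env = {
    "S3_REGION": "eu-central-1",
    "S3_ENDPOINT": "https://eu-central-1.domain.com",
}
print(missing_s3_vars(example_env))  # ['S3_ACCESS_KEY', 'S3_SECRET_KEY']
```

Calling missing_s3_vars() without arguments right after load_dotenv() gives an early, readable hint instead of a KeyError later on.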

Now, prepare the S3 connection with:

import os
import s3fs

fs = s3fs.S3FileSystem(
    anon=False,
    use_ssl=True,
    client_kwargs={
        "region_name": os.environ['S3_REGION'],
        "endpoint_url": os.environ['S3_ENDPOINT'],
        "aws_access_key_id": os.environ['S3_ACCESS_KEY'],
        "aws_secret_access_key": os.environ['S3_SECRET_KEY'],
        "verify": True,
    }
)

Write Pandas DataFrame to S3 as Parquet

Save the DataFrame to S3 using s3fs and Pandas:

with fs.open('s3-example/data.parquet', 'wb') as f:
    df.to_parquet(f)

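Alternatively, save the DataFrame to S3 using s3fs and PyArrow. Note that write_to_dataset treats the path as a dataset directory and writes one or more Parquet part files beneath it, rather than a single file: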

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import Table

s3_filepath = 's3-example/data.parquet'

pq.write_to_dataset(
    Table.from_pandas(df),
    s3_filepath,
    filesystem=fs,
    use_dictionary=True,
    compression="snappy",
    version="2.4",
)

You can also upload the local file data/data.parquet with s3cmd by typing:

s3cmd \
  --config ~/.s3cfg \
  put data/data.parquet s3://s3-example

Read Parquet File from S3 as Pandas DataFrame

Now, let’s have a look at the Parquet file by using PyArrow:

s3_filepath = "s3-example/data.parquet"

pf = pq.ParquetDataset(
    s3_filepath,
    filesystem=fs)

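You can now explore the schema with pf.schema and read the dataset into Pandas with pf.read().to_pandas(). The metadata is exposed as pf.metadata, although that attribute may not be available in recent PyArrow releases. Inspecting the metadata and schema looks like this: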

pf.metadata
pf.schema
<pyarrow._parquet.ParquetSchema object at 0x7f1c2fa4a300>
required group field_id=-1 schema {
  optional double field_id=-1 data;
}

ParquetDataset also accepts a list of paths instead of a single path. You can collect those, for example, with:

s3_filepath = 's3://s3-example'
s3_filepaths = [path for path in fs.ls(s3_filepath)
                if path.endswith('.parquet')]
s3_filepaths
['s3-example/data.parquet', 's3-example/data.parquet']
