Reading and Writing Parquet Files on S3 with Pandas and PyArrow
10 Apr 2022
Table of Contents
- Prepare Connection
- Write Pandas DataFrame to S3 as Parquet
- Reading Parquet File from S3 as Pandas DataFrame
- Resources
When working with large amounts of data, a common approach is to store the data in S3 buckets. Instead of dumping the data as CSV files or plain text files, a good option is to use Apache Parquet. In this short guide you’ll see how to read and write Parquet files on S3 using Python, Pandas and PyArrow.
This guide was tested using Contabo object storage, MinIO, and Linode Object Storage. You should be able to use it on most S3-compatible providers and software.
Prepare Connection
Prepare the S3 environment variables in a file called .env in the project folder with the following contents:
S3_REGION=eu-central-1
S3_ENDPOINT=https://eu-central-1.domain.com
S3_ACCESS_KEY=XXXX
S3_SECRET_KEY=XXXX
Prepare an S3 bucket that you want to use. In this case we'll be using the s3://s3-example bucket to store and access our data. Next, prepare some random example data with:
import numpy as np
import pandas as pd

# Create a DataFrame with 1000 random values and save it locally
# (the data/ directory must already exist).
df = pd.DataFrame({'data': np.random.random((1000,))})
df.to_parquet("data/data.parquet")
Load the environment variables in your script with python-dotenv:
from dotenv import load_dotenv

# Read the S3 credentials from the .env file into environment variables.
load_dotenv()
Now, prepare the S3 connection with:
import os
import s3fs
fs = s3fs.S3FileSystem(
    anon=False,
    use_ssl=True,
    client_kwargs={
        "region_name": os.environ['S3_REGION'],
        "endpoint_url": os.environ['S3_ENDPOINT'],
        "aws_access_key_id": os.environ['S3_ACCESS_KEY'],
        "aws_secret_access_key": os.environ['S3_SECRET_KEY'],
        "verify": True,
    }
)
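As a quick check that the credentials and endpoint are correct, you can list the contents of the bucket (assuming the s3-example bucket already exists):
# List the bucket contents to confirm the connection works.
fs.ls('s3-example')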
Write Pandas DataFrame to S3 as Parquet
Save the DataFrame to S3 using s3fs and Pandas:
with fs.open('s3-example/data.parquet', 'wb') as f:
    df.to_parquet(f)
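If you prefer not to manage the file handle yourself, recent Pandas versions (1.2+) can also write directly to S3 by passing the s3fs options via storage_options. A minimal sketch, assuming fsspec and s3fs are installed:
# Sketch: let Pandas build the s3fs filesystem from storage_options.
df.to_parquet(
    's3://s3-example/data.parquet',
    storage_options={
        'key': os.environ['S3_ACCESS_KEY'],
        'secret': os.environ['S3_SECRET_KEY'],
        'client_kwargs': {'endpoint_url': os.environ['S3_ENDPOINT']},
    },
)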
Save the DataFrame to S3 using s3fs and PyArrow:
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import Table
s3_filepath = 's3-example/data.parquet'
pq.write_to_dataset(
    Table.from_pandas(df),
    s3_filepath,
    filesystem=fs,
    use_dictionary=True,
    compression="snappy",
    version="2.4",
)
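Note that write_to_dataset stores the data as one or more part files under the given path. If you want a single Parquet file instead, one option is to write an Arrow Table through an open file handle; a small sketch (the key data_single.parquet is just an example name):
# Sketch: write a single Parquet file instead of a dataset directory.
table = Table.from_pandas(df)
with fs.open('s3-example/data_single.parquet', 'wb') as f:
    pq.write_table(table, f, compression='snappy')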
You can also upload this file with s3cmd by typing:
s3cmd \
--config ~/.s3cfg \
put data/data.parquet s3://s3-example
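You can check that the file arrived with:
s3cmd --config ~/.s3cfg ls s3://s3-example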
Reading Parquet File from S3 as Pandas DataFrame
Now, let’s have a look at the Parquet file by using PyArrow:
s3_filepath = "s3-example/data.parquet"
pf = pq.ParquetDataset(
    s3_filepath,
    filesystem=fs)
Now, you can already explore the metadata with pf.metadata or the schema with pf.schema:
pf.metadata
pf.schema
<pyarrow._parquet.ParquetSchema object at 0x7f1c2fa4a300>
required group field_id=-1 schema {
  optional double field_id=-1 data;
}
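To read the data set into Pandas, you can call read() on the ParquetDataset to get an Arrow Table and convert it to a DataFrame, for example:
# Read the dataset into an Arrow Table, then convert it to a Pandas DataFrame.
df = pf.read().to_pandas()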
When using ParquetDataset, you can also use multiple paths. You can get those, for example, with:
s3_filepath = 's3://s3-example'
s3_filepaths = [path for path in fs.ls(s3_filepath)
                if path.endswith('.parquet')]
s3_filepaths
['s3-example/data.parquet', 's3-example/data.parquet']
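These paths can be passed to ParquetDataset directly, which lets you read several files into a single DataFrame; a short sketch:
# Sketch: read multiple Parquet files from S3 into one DataFrame.
pf = pq.ParquetDataset(s3_filepaths, filesystem=fs)
df = pf.read().to_pandas()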
Resources
- s3fs.readthedocs.io - S3Fs Documentation
- PyArrow - Apache Arrow Python bindings
- Apache Parquet