Querying S3 Object Stores with Presto or Trino

Querying big data on Hadoop can be challenging to get running, but alternatively, many solutions are using S3 object stores which you can access and query with Presto or Trino. In this guide you will see how to install, configure, and run Presto or Trino on Debian or Ubuntu with the S3 object store of your choice and the Hive standalone metastore.

This guide was tested on MinIO, Linode object storage, and AWS S3, with a personal preference for Linode. For installing Trino or Presto I recommend Hetzner VPS instances. To run the queries yourself, you can check out this repository to help you replicate this guide locally with MinIO and docker. If you want to run Presto or Trino on Hadoop HDFS, have a look at this tutorial.

S3 Object Stores

Amazon started AWS Simple Storage Service (S3) in 2006 and it is the most popular object storage to date. In S3 your data is grouped in buckets with a globally unique name and the data can be stored unstructured without a schema. It offers high scalability, high availability, and low latency storage and it is a common alternative to using HDFS and the Hadoop ecosystem. Although S3 is no official standard, many vendors and projects support the interface and are compatible with S3. Depending on your needs and applications, there is a variety of possible S3 compatible object stores to choose from:

Interacting with S3 Buckets using s3cmd

In order to interact with your S3 bucket, you need some tool. Many vendors have their own CLI tooling like the AWS CLI or Linode CLI. A platform independent tool is s3cmd which we will be using here. You can install s3cmd on Ubuntu/Debian with:

sudo apt update
sudo apt install s3cmd

Now, you need to configure the s3cmd environment with:

s3cmd --configure

After following all prompts, you can find the configuration in ~/.s3cfg. If you want to save it as a custom profile you can add the -c/--config argument with the location of the configuration:

s3cmd --config aws.s3cfg --configure

Warning: the access keys are saved in plain text. Here is a list of useful commands when working with s3cmd:

  • s3cmd mb s3://bucket Make bucket
  • s3cmd rb s3://bucket Remove bucket
  • s3cmd ls List available buckets
  • s3cmd ls s3://bucket List folders within bucket
  • s3cmd get s3://bucket/file.txt Download file from bucket
  • s3cmd get -r s3://bucket/folder Download recursively files from bucket/directory
  • s3cmd put file.txt s3://bucket Upload files to bucket
  • s3cmd put -r folder s3://bucket Upload folder to bucket
  • s3cmd del s3://bucket/file.txt Delete file or folder from bucket

For more commands and documentation, have a look at the s3cmd usage.

Hive Standalone Metastore

In any database, you need a place to manage the various tables, schemas, relationships, and views. This is commonly done in a metastore. When using S3 it is common to have the tables stored as CSV, Apache Parquet, and Apache ORC files among others. To store the schemas of those tables Trino/Presto needs Apache Hive for the query engine to access the metadata of those tables. Hive is also commonly used as a metastore in the Hadoop ecosystem in projects like Apache Impala, Apache Spark, and Apache Drill.

Installation

In my previous tutorial, the installation relied on Hadoop and HDFS, but in this case, it will use a standalone version of the Hive metastore which runs without the rest of Hive. Hive metastore requires a database to store the schemas. For this, you can use DerbyDB, MySQL, MariaDB, and PostgreSQL. In this tutorial, you will see how to use it with MariaDB. Further, the metastore and Trino/Presto require Java 11. To install MariaDB and the Java 11 JRE, type:

sudo apt update
sudo apt install -y \
    mariadb openjdk-11-jre-headless

Now make sure it’s enabled and running by typing:

sudo systemctl enable mariadb.service
sudo systemctl start mariadb.service

To check its status, you can type systemctl status mariadb.service. Next, you will need to create a user and a database for the Hive metastore, which you can do with the following command:

sudo mysql -u root -e "
    DROP DATABASE IF EXISTS metastore;
    CREATE DATABASE metastore;

    CREATE USER 'hive'@localhost IDENTIFIED BY 'hive';
    GRANT ALL PRIVILEGES ON *.* TO 'hive'@'localhost';
    FLUSH PRIVILEGES;"

Great, now that the database is set up, we can continue with downloading and extracting the metastore to /usr/local/metastore with:

wget "https://repo1.maven.org/maven2/org/apache/hive/hive-standalone-metastore/3.1.2/hive-standalone-metastore-3.1.2-bin.tar.gz"
tar -zxvf hive-standalone-metastore-3.1.2-bin.tar.gz
sudo mv apache-hive-metastore-3.1.2-bin /usr/local/metastore
sudo chown user:user /usr/local/metastore

If you want to use another version instead of version 3.1.2 have a look at the following list. Hive metastore requires some dependencies from Hadoop, therefore you need to download Hadoop as well with:

wget "https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz"
tar xvf hadoop-3.2.1.tar.gz
sudo mv hadoop-3.2.1 /usr/local/hadoop
sudo chown user:user /usr/local/hadoop

There are a few dependencies that you need to copy and change to make it compatible with S3 and Hadoop. Here are the commands for that:

rm /usr/local/metastore/lib/guava-19.0.jar
cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar \
  /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-3.2.1.jar \
  /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar \
  /usr/local/metastore/lib/

Configuration

Now, Hive needs connection details to your S3 bucket. This can be done in the /usr/local/metastore/conf/metastore-site.xml file. Open the existing metastore-site.xml and add the following properties within the <configuration> section:

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
</property>
<property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
</property>

Further, add those properties to specify the S3 connection:

<property>
    <name>fs.s3a.access.key</name>
    <value>S3_ACCESS_KEY</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>S3_SECRET_KEY</value>
</property>
<property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
</property>
<property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
</property>
<property>
    <name>fs.s3a.endpoint</name>
    <value>S3_ENDPOINT</value>
</property>

Additionally, you will need to define the JAVA_HOME and HADOOP_HOME environment variables, which you can set with:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop

To have those ready every time you open the shell, you need to append those two lines in the ~/.bashrc file. Once you have everything configured, you can initialize the metastore with:

/usr/local/metastore/bin/schematool -initSchema -dbType mysql

After the initialization is finished, you can start the metastore service with:

/usr/local/metastore/bin/start-metastore &

For more information about the metastore configuration, have a look at the documentation and more specifically on Running the Metastore Without Hive.

Trino and Presto

Trino and Presto are both open-source distributed query engines for big data across a large variety of data sources including HDFS, S3, PostgreSQL, MySQL, Cassandra, MongoDB, and Elasticsearch among others. To see the difference between both projects, have a look at this article. In this installation, you will see how to install Trino 352 in particular, but all the steps and queries were also tested on Presto 0.247.

Installation

After you have S3 and the Hive standalone metastore ready, you can proceed with installing and configuring Trino on your server. To install it, download it from here and extract it in /usr/local/trino as outlined here:

wget "https://repo1.maven.org/maven2/io/trino/trino-server/352/trino-server-352.tar.gz"
tar -xzvf trino-server-352.tar.gz
sudo mv trino-server-352 /usr/local/trino
sudo chown $$USER:$$USER /usr/local/trino

Additionally, you will need the CLI in order to access the query engine, which you can download to the bin folder of the same directory and make it executable:

wget "https://repo1.maven.org/maven2/io/trino/trino-cli/352/trino-cli-352-executable.jar"
mv trino-cli-352-executable.jar /usr/local/trino/bin/trino
sudo chmod +x /usr/local/trino/bin/trino

Configuration

Lastly, you need to configure Trino. For this, you have a few configuration files that are required. First you need the configuration for the JVM in /usr/local/trino/etc/jvm.config. You can fill it with the following contents:

-server
-Xmx6G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

Here, make sure that you utilize your RAM memory properly by specifying -Xmx with around 80% of your available memory. This way you will have enough memory for the system as long as you don’t have anything else running on this machine. In this example, it is set to 6 GB. Next up, is the /usr/local/trino/etc/node.properties which contains configuration for each node:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/usr/local/trino/data

Here you need to specify the name of the environment with node.environment, the unique identifier of the node with node.id, and finally the directory of the data directory with node.data-dir. Nest, you need to add the configuration for the Trino server in /usr/local/trino/etc/config.properties. Here is a possible configuration:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

The configuration of how much memory requires some trial and error and depends on the expected workload and the number of queries that will run simultaneously. A good tutorial on the topic that you can follow is Memory Configuration in Presto Cluster. Finally, you need to configure the connection to S3. For this create the file /usr/local/trino/etc/catalog/hive.properties with the following contents:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.path-style-access=true
hive.s3.endpoint=S3_ENDPOINT
hive.s3.aws-access-key=S3_ACCESS_KEY
hive.s3.aws-secret-key=S3_SECRET_KEY
hive.s3.ssl.enabled=false

This should do the trick. For more information on the deployment and the Hive connector have a look at Deploying Trino, the Hive connector documentation, and the Hive connector with Amazon S3 documentation. For multi-node configuration, follow this tutorial.

Now, you should be able to start Trino by running:

/usr/local/trino/bin/launcher start

Once it is running, you should open the Trino web UI at localhost:8080 with the port previously defined in config.properties. Once you run queries, they should be listed there.

Query Data stored on S3

We will work with a previously uploaded parquet file which you can find in this repository, which was converted from the famous Iris data set. In this example the file is stored on the bucket at s3a://iris/iris_parquet/iris.parq. First, you need to create a schema to access the file which you can do by running the following SQL statement:

CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://iris/');

Next, you need to create a table to the existing data set on S3 with:

CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
  sepal_length DOUBLE,
  sepal_width  DOUBLE,
  petal_length DOUBLE,
  petal_width  DOUBLE,
  class        VARCHAR
)
WITH (
  external_location = 's3a://iris/iris_parquet',
  format = 'PARQUET'
);

Now, you should be able to query the data with:

SELECT 
  sepal_length,
  class
FROM hive.iris.iris_parquet 
LIMIT 10;

To show all tables in a particular schema, you can type SHOW TABLES IN hive.iris;.

Conclusion

There you have it. You have seen how to setup Trino or Presto to query data stored on S3 storage like AWS S3, Linode object storage, or MinIO among others. If you are already familiar with S3, this makes it incredibly easy to query large data sets instead of dealing with preparing HDFS on local infrastructure. Trino and Presto provide separation of data and compute which makes it a one-stop-shop to query across multiple data sources with their federated queries in reasonable time and low technical overhead. For more useful information, have a look at the following resources:

Resources

There were also some related podcasts in this topic by the Data Engineering Podcast, which were worth a listen:

Appendix

Running the Hive metastore can be tiresome and for quick tests it can be sometimes useful to use the build-in FileHiveMetastore. Be warned, It is not advised to use it in production and there are barely any documentation about it except for these articles and discussions:

Image from Wikimedia Commons