Reading and Writing Pandas DataFrames in Chunks

This is a quick example of how to process a large data set in chunks with Pandas when it would otherwise not fit into memory. You will see how to apply this to CSV files with pandas.read_csv.
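As a minimal sketch of the idea, assuming a small in-memory CSV stands in for a file that is too large to load at once (the column name `value` and the chunk size of 4 are made up for illustration), passing `chunksize` to pandas.read_csv yields one DataFrame per chunk:

```python
import io

import pandas as pd

# Hypothetical data: an in-memory CSV standing in for a large file on disk.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
n_chunks = 0

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
    n_chunks += 1

print(total, n_chunks)
```

Only one chunk is held in memory at a time, so any aggregation has to be accumulated across chunks, as the running total does here.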

Querying S3 Object Stores with Presto or Trino

Getting queries on big data up and running on Hadoop can be challenging. As an alternative, many solutions use S3 object stores, which you can access and query with Presto or Trino. In this guide you will see how to install, configure, and run Presto or Trino on Debian or Ubuntu with the S3 object store of your choice and the Hive standalone metastore.

Manage Jupyter Notebook and JupyterLab with Systemd

In this article you will see how to easily manage Jupyter Notebook and JupyterLab with the Systemd tooling. This is useful when you want to have an instance running locally or on your server that you can manage and monitor.

Systemd Cheatsheet

Systemd is an init system in Linux used for system initialization and service management. It is useful for managing and monitoring services. In this cheatsheet you will find a collection of common commands used with the command line tools systemctl and journalctl.

How to Install Presto or Trino on a Cluster and Query Distributed Data on Apache Hive and HDFS

Presto is an open source distributed query engine built for Big Data, enabling high performance SQL access to a large variety of data sources, including HDFS, PostgreSQL, MySQL, Cassandra, MongoDB, Elasticsearch, and Kafka, among others.

Classifying the Iris Data Set with PyTorch

In this short article we will look at how to use PyTorch with the Iris data set. We will create and train a neural network with Linear layers, using a Softmax activation function and the Adam optimizer.

Google Analytics Analytics with Python

Google Analytics is a powerful analytics tool found on an astonishing number of websites. In this tutorial, we will look at how to access the Google Analytics API (v4) with Python and Pandas. We will also look at various ways to analyze your tracking data and create custom reports.

How to Manage Apache Airflow with Systemd on Debian or Ubuntu

Apache Airflow is a powerful workflow management system which you can use to automate and manage complex Extract Transform Load (ETL) pipelines. In this tutorial you will see how to integrate Airflow with the systemd system and service manager, which is available on most Linux systems, to help you with monitoring Airflow and restarting it on failure.

How to Create Your Data Science Blog with Pelican and Jupyter Notebooks

Writing articles and tutorials is a great way to learn new things in depth while building a portfolio. In this tutorial, you will find the first steps that you will need to start your data science blog with Pelican and Jupyter Notebooks.

How to Execute Shell Commands with Python

Python is a wonderful language for scripting and automating workflows, and it comes packed with useful tools out of the box in the Python Standard Library. A common task, especially for a sysadmin, is to execute shell commands. What would usually end up in a bash or batch file can also be done in Python. You’ll learn here how to do just that with the os and subprocess modules.
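As a small sketch of what this can look like with the subprocess module, the example below runs a child process and captures its output. The command is only an illustration: it calls the current Python interpreter through `sys.executable` so that it does not depend on a particular shell being available.

```python
import subprocess
import sys

# Run a command as a list of arguments and capture what it prints.
# check=True raises CalledProcessError if the command exits with a non-zero code.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,
    text=True,
    check=True,
)

output = result.stdout.strip()
print(output)
```

Passing the command as a list rather than a single string avoids going through the shell, which sidesteps quoting problems when arguments come from user input.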