How to Manage Apache Airflow with Systemd on Debian or Ubuntu
20 Dec 2019
Apache Airflow is a powerful workflow management system that you can use to automate and manage complex Extract, Transform, Load (ETL) pipelines. In this tutorial you will see how to integrate Airflow with systemd, the system and service manager available on most Linux systems, so that it can monitor Airflow and restart it on failure.
Apache Airflow follows the principle of configuration as code, which lets you programmatically configure, schedule and monitor complex workflows. This is great if you have big data pipelines with lots of dependencies to take care of. If you haven’t installed Apache Airflow yet, have a look at this installation guide and this tutorial, which should bring you up to speed.
Systemd is an init system: it runs as the first process (PID 1), bootstraps the user space and manages user processes. It is used by most Linux distributions and simplifies common sysadmin tasks such as checking and configuring services, mounted devices and system states. To interact with systemd you have a whole suite of command-line tools at your disposal, but for this tutorial you will only need two:
- systemctl starts, stops, restarts and checks the status of systemd services.
- journalctl lets you explore the logs generated by systemd units.
Apache Airflow Unit Files
In systemd, the managed resources are referred to as units, which are configured with unit files stored in the /lib/systemd/system/ folder. These files follow the INI file format. Units come in different types, but for the sake of this tutorial we will focus only on service units, whose files are suffixed with .service. Each unit file consists of sections, which are marked with square brackets and are case-sensitive. Inside the sections you will find the directives, which are defined as key=value pairs. The [Unit] section holds metadata and describes the relationship to other units. The [Service] section provides the main configuration for the service, and finally the [Install] section defines what should happen when the unit is enabled. This is only scratching the surface, but you will find an extensive tutorial covering systemd units in this article.
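To make the three-section layout concrete, here is a minimal sketch that writes a bare-bones service unit to a scratch directory and then reads it back. The unit name hello.service, its Description and the /bin/echo command are hypothetical placeholders and have nothing to do with Airflow; nothing is installed on the system.

```shell
#!/bin/sh
# Write a hypothetical hello.service to a scratch directory (nothing is installed).
workdir=$(mktemp -d)
unit="$workdir/hello.service"

cat > "$unit" <<'EOF'
[Unit]
Description=Hypothetical hello service
After=network.target

[Service]
Type=oneshot
ExecStart=/bin/echo hello

[Install]
WantedBy=multi-user.target
EOF

# Print the section headers, i.e. the lines of the form [Name].
list_sections() {
    grep -E '^\[[A-Za-z]+\]$' "$1"
}

# Print the value of one directive, e.g. "ExecStart".
get_directive() {
    sed -n "s/^$2=//p" "$1"
}

list_sections "$unit"
get_directive "$unit" ExecStart
```

The two helper functions are only there to show that a unit file is plain text made of [Section] headers and key=value lines; systemd itself does the real parsing.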
You can find unit files for Apache Airflow in airflow/scripts/systemd, but those are written for Red Hat Linux systems. If you are not using the distributed task queue Celery or network authentication with Kerberos, you will only need the airflow-webserver.service and airflow-scheduler.service unit files. You will need to make some changes to those files. The unit files shown in this tutorial work for Apache Airflow installed in an Anaconda virtual environment. Here is the unit file for airflow-webserver.service:
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service

[Service]
EnvironmentFile=/home/airflow/airflow.env
User=airflow
Group=airflow
Type=simple
ExecStart=/bin/bash -c 'source /home/user/anaconda3/etc/profile.d/conda.sh; \
    conda activate ENV; \
    airflow webserver'
Restart=on-failure
RestartSec=5s
PrivateTmp=true

[Install]
WantedBy=multi-user.target
This unit file needs a user called airflow, but if you want to use it for a different user, change the directives User= and Group= to the desired user and group. You might notice that the EnvironmentFile= and ExecStart= directives are changed. The EnvironmentFile= directive specifies the path to a file with environment variables that can be used by the service. Here you can define variables like AIRFLOW_CONFIG. Make sure that this file exists even if there are no variables defined. You can find a template that you can copy in airflow/scripts/systemd/airflow. The ExecStart= directive defines the full path (!) and arguments of the command that you want to execute. Have a look at the documentation to see how this directive needs to be formatted. In this case we want to activate the Anaconda environment before starting Airflow.
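As an illustration of what such an environment file might contain, the sketch below writes one with two variables that Airflow reads, AIRFLOW_HOME and AIRFLOW_CONFIG. The paths under /home/airflow/airflow are assumptions made for this example, and by default the file is written to a temporary directory so that nothing on the system is overwritten.

```shell
#!/bin/sh
# Target path; set ENV_FILE=/home/airflow/airflow.env to write the file
# referenced by EnvironmentFile= in the unit files of this tutorial.
ENV_FILE="${ENV_FILE:-${TMPDIR:-/tmp}/airflow.env}"

# EnvironmentFile= expects plain KEY=VALUE lines: no "export" and no
# shell expansion. Lines starting with "#" are treated as comments.
cat > "$ENV_FILE" <<'EOF'
# Assumed locations, adjust them to your installation
AIRFLOW_HOME=/home/airflow/airflow
AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
EOF

# Show the variables that would be passed to the service.
grep -v '^#' "$ENV_FILE"
```

Keeping such settings in a separate file means the unit file itself does not have to change when a path does; after editing the file, restarting the service is enough for it to pick up the new values.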
Here, similar to the previous unit file, is the unit file for airflow-scheduler.service:
[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service

[Service]
EnvironmentFile=/home/airflow/airflow.env
User=airflow
Group=airflow
Type=simple
ExecStart=/bin/bash -c 'source /home/user/anaconda3/etc/profile.d/conda.sh; \
    conda activate ENV; \
    airflow initdb; \
    airflow scheduler'
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
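Note that we have listed multiple services in Wants=. These units don’t have to exist: Wants= is only a weak dependency, so systemd starts the listed units if they are installed and carries on if they are not. You can install them once you need them. For example, postgresql.service becomes available once you install the PostgreSQL database.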
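If you are curious which of these optional units are present on your machine, a small check along the following lines can tell you. It is only a sketch: the function name check_wanted is made up for this example, and the script skips the check when systemd is not the running init system.

```shell
#!/bin/sh
# Units named in the Wants= line of the Airflow unit files.
wanted="postgresql.service mysql.service redis.service rabbitmq-server.service"

# Print, for each unit, whether a unit file for it is installed.
check_wanted() {
    for unit in $wanted; do
        # With --no-legend, list-unit-files prints a row only for matching unit files.
        if systemctl list-unit-files --no-legend "$unit" 2>/dev/null | grep -q .; then
            echo "$unit: installed"
        else
            echo "$unit: not installed"
        fi
    done
}

# /run/systemd/system exists only when systemd is the running init system.
if command -v systemctl >/dev/null 2>&1 && [ -d /run/systemd/system ]; then
    check_wanted
else
    echo "systemd is not running here, skipping the check"
fi
```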
If you need the other services like airflow-worker.service (Celery worker), airflow-flower.service (Celery Flower) or airflow-kerberos.service (Kerberos ticket renewer), you can copy the files from airflow/scripts/systemd/ and adapt the ExecStart= directives as shown here for the webserver and scheduler.
Starting and Managing the Apache Airflow Unit Files
These two unit files now need to be saved (or linked) to the /lib/systemd/system/ folder. To activate them, you first need to reload the systemd manager configuration with:
sudo systemctl daemon-reload
Next, you can start the services with:
sudo systemctl start airflow-webserver.service
sudo systemctl start airflow-scheduler.service
If you did everything right you should see an active service when you check the status with:
sudo systemctl status airflow-webserver.service
sudo systemctl status airflow-scheduler.service
This also shows the most recent logs for that service, which is handy for seeing what has gone wrong. To have the services start when you restart your server or computer, you need to enable them with:
sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-scheduler.service
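To check from a script whether both services are running and enabled, for example in a small health check, the is-active and is-enabled subcommands of systemctl print a single word each. The following is a minimal sketch; the function name airflow_status is made up for this example.

```shell
#!/bin/sh
# Print "<unit>: <active state>, <enabled state>" for both Airflow services.
airflow_status() {
    for unit in airflow-webserver.service airflow-scheduler.service; do
        # is-active prints a state such as "active", "inactive" or "failed".
        active=$(systemctl is-active "$unit" 2>/dev/null)
        # is-enabled prints a state such as "enabled" or "disabled";
        # the output may be empty if the unit file does not exist.
        enabled=$(systemctl is-enabled "$unit" 2>/dev/null)
        echo "$unit: ${active:-unknown}, ${enabled:-unknown}"
    done
}

# /run/systemd/system exists only when systemd is the running init system.
if command -v systemctl >/dev/null 2>&1 && [ -d /run/systemd/system ]; then
    airflow_status
else
    echo "systemd is not running here, nothing to check"
fi
```

Unlike systemctl status, these subcommands produce output that is easy to compare in a script, and their exit codes can be used directly in an if statement.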
To disable a service use sudo systemctl disable your-service.service, and to stop a service use sudo systemctl stop your-service.service. Sometimes you also need to debug a service. This can be done by checking the logs with journalctl. For example, if you want to check the last 10 log entries for a particular unit, you can type:
sudo journalctl -u your-service.service -n 10
If you want to see the last 10 log entries for both services, you can type:
sudo journalctl -u airflow-webserver.service -u airflow-scheduler.service -n 10
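When you are debugging, a time window is often more useful than a fixed number of entries. The sketch below limits the output to the last hour with --since; the function name airflow_logs is made up for this example, and you may need to prefix the journalctl call with sudo if your user is not allowed to read the system journal.

```shell
#!/bin/sh
# Show the log entries of both Airflow services from the last hour.
airflow_logs() {
    # --since limits the time window, --no-pager prints straight to the terminal.
    journalctl -u airflow-webserver.service -u airflow-scheduler.service \
        --since "1 hour ago" --no-pager
}

# /run/systemd/system exists only when systemd is the running init system.
if command -v journalctl >/dev/null 2>&1 && [ -d /run/systemd/system ]; then
    airflow_logs
else
    echo "systemd is not running here, no journal to read"
fi

# To follow new entries live instead, add -f to the journalctl call (stop with Ctrl+C).
```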
For more useful commands and arguments have a look at this Systemd cheatsheet.
In this tutorial you have seen how to run Apache Airflow with systemd on Debian or Ubuntu. We have also scratched the surface on the things you can do with systemd to help you with monitoring and managing services. To delve deeper into this topic, I recommend the following list of articles that were highly helpful in grasping the topics covered here.
- Systemd Cheatsheet
- How To Use Systemctl to Manage Systemd Services and Units
- How To Use Journalctl to View and Manipulate Systemd Logs
- Understanding Systemd Units and Unit Files
- Using environment variables in systemd units
- systemd for Administrators, Part 1, Part 2, Part 3 by Lennart Poettering, the creator of systemd