Apache Airflow is quickly becoming the go-to tool for workflow management for data engineering pipelines. Airflow is easy to use, versatile and open-source, which ensures a degree of independence from vendor-specific solutions. Using the managed Airflow offering from AWS reduces the engineering burden of provisioning hardware and managing the deployed Airflow software.
In this post we will describe how to deploy an Airflow platform on AWS using the CDK.
TL;DR:
Create an AWS account (or use an existing one)
Clone the repo
Update the config to use your AWS account
Deploy the infrastructure using the CDK
Upload DAGs (sample scripts)
Use Airflow to run a sample DAG
The resources are free to use. Reach out to engineering@shellstack.consulting for any questions you may have.
About MWAA (Managed Workflows for Apache Airflow)

"Amazon Managed Workflows for Apache Airflow is a managed orchestration service for Apache Airflow that you can use to setup and operate data pipelines in the cloud at scale. Apache Airflow is an open-source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as workflows. With Amazon MWAA, you can use Apache Airflow and Python to create workflows without having to manage the underlying infrastructure for scalability, availability, and security. Amazon MWAA automatically scales its workflow execution capacity to meet your needs, Amazon MWAA integrates with AWS security services to help provide you with fast and secure access to your data."
Credit: https://docs.aws.amazon.com/mwaa/latest/userguide/what-is-mwaa.html
Airflow is free and easy to use, but as an open source solution the burden of deploying and managing the tool falls on the user, which is where cost and complexity become relevant. Similar to other Apache tools, such as those in the Hadoop ecosystem, creating and supporting a production grade deployment requires hardware and a dedicated DevOps team focussing on aspects such as scalability, security, accessibility and patching/versioning. For most organisations, having a dedicated Airflow DevOps team is a waste of resources, resources that could be better spent solving business problems through development. In those cases, managed Airflow is the answer.
AWS has a managed Airflow solution called MWAA (Managed Workflows for Apache Airflow). It allows the user to deploy a fully managed Airflow platform while abstracting away much of the complexity. Some effort is required up-front to configure the deployment, but after configuration the effort of provisioning and managing resources is transferred to AWS. Configuration can be done in the AWS console, but for the sake of repeatability and standardisation we have created an AWS CDK project which contains the configuration and allows for a simple one-step deployment.
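As a rough illustration of what the entry point of such a CDK project might look like, here is a hedged sketch that wires the account and region config into a stack. The file and stack names are hypothetical, not necessarily those used in the repository:

// Hypothetical CDK app entry point for a one-step deployment.
import { App } from 'aws-cdk-lib';
import { ManagedAirflowStack } from './managed-airflow-stack'; // hypothetical stack module

const app = new App();

new ManagedAirflowStack(app, 'ManagedAirflowStack', {
  env: {
    account: '123456789012', // your AWS account id, normally read from the config file
    region: 'eu-west-1',     // your target region
  },
});

With this in place, a single cdk deploy (wrapped by the repository's npm script) synthesises and deploys the whole stack.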
Deployed Architecture
Executing the CDK deploy command will trigger CloudFormation to deploy the following resources into your account (a simplified sketch of the stack is shown after the list).

The following resources are deployed:
A VPC (Virtual Private Cloud) to ring-fence and secure your infrastructure
A private subnet for resources only accessible from within the VPC. In this case the Managed Airflow deployment is associated with the custom private subnet.
A public subnet for resources accessible from the internet. In this case a bastion host is deployed into the public subnet. Users can create an SSH tunnel to the bastion and thereby securely access the resources in the private subnet.
Security groups further restricting communication between entities. Ingress and egress rules dictate which traffic is allowed / blocked for the associated resources.
An Airflow Execution IAM Role that allows Airflow to access the various resources it needs.
An S3 bucket to host your DAGs as well as logs generated by DAGs
A CloudWatch configuration, which will hold the logs of the various Airflow infrastructure resources, useful for additional observability and debugging.
Other resources - Airflow is a scheduler and will execute DAGs, which in turn will trigger compute execution in other resources. These resources, used to manipulate data, can reside in the same VPC, inside AWS managed services, or outside of AWS.
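To make the architecture above concrete, here is a simplified sketch of what such a CDK stack could look like. This is not the repository's exact code: construct names, the Airflow version and instance sizing are illustrative, and the bastion host is omitted for brevity.

// A hedged sketch of a CDK stack for the architecture described above.
import { Stack, StackProps, RemovalPolicy, aws_ec2 as ec2, aws_s3 as s3, aws_iam as iam, aws_mwaa as mwaa } from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class ManagedAirflowStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // VPC with a public subnet (bastion) and private subnets (MWAA) in two AZs
    const vpc = new ec2.Vpc(this, 'AirflowVpc', {
      maxAzs: 2,
      subnetConfiguration: [
        { name: 'public', subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
        { name: 'private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
      ],
    });

    // Security group allowing the MWAA components to talk to each other
    const airflowSg = new ec2.SecurityGroup(this, 'AirflowSg', { vpc, description: 'MWAA' });
    airflowSg.addIngressRule(airflowSg, ec2.Port.allTraffic(), 'self-referencing rule required by MWAA');

    // Versioned bucket for DAGs (MWAA requires versioning on the source bucket)
    const dagBucket = new s3.Bucket(this, 'DagBucket', {
      versioned: true,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      removalPolicy: RemovalPolicy.RETAIN,
    });

    // Execution role assumed by the Airflow environment
    const executionRole = new iam.Role(this, 'AirflowExecutionRole', {
      assumedBy: new iam.CompositePrincipal(
        new iam.ServicePrincipal('airflow.amazonaws.com'),
        new iam.ServicePrincipal('airflow-env.amazonaws.com'),
      ),
    });
    dagBucket.grantRead(executionRole);

    // The managed Airflow environment itself (L1 CloudFormation construct)
    new mwaa.CfnEnvironment(this, 'AirflowEnvironment', {
      name: 'managed-airflow',
      airflowVersion: '2.5.1', // illustrative version
      environmentClass: 'mw1.small',
      dagS3Path: 'dags',
      sourceBucketArn: dagBucket.bucketArn,
      executionRoleArn: executionRole.roleArn,
      networkConfiguration: {
        securityGroupIds: [airflowSg.securityGroupId],
        subnetIds: vpc.privateSubnets.map((s) => s.subnetId),
      },
      webserverAccessMode: 'PRIVATE_ONLY', // UI reached via the bastion; use 'PUBLIC_ONLY' for direct access
    });
  }
}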
How to deploy
Pre-requisites
An AWS account [https://docs.aws.amazon.com/accounts/latest/reference/manage-acct-creating.html ]
Locally configured AWS credentials allowing you to deploy using the CDK [https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html]
Locally configured Git
Node.js and npm installed [https://docs.npmjs.com/downloading-and-installing-node-js-and-npm]
Steps
Clone the repository.
git clone https://github.com/Shell-Stack-Cloud-Engineers/aws-managed-airflow.git
Edit the config (infra/config/config.ts) to use the account and region you would like to deploy to.
const config: Config = { account:'<Your account number here>', region:'<Your region here>' };
Deploy the stack.
npm run deploy
Wait for the deployment to finish
Navigate to MWAA in the AWS Management Console and confirm that the Airflow environment has been deployed
Access the Airflow UI using the Open Airflow UI link in the console (or fetch a login URL programmatically, as in the sketch below)
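If you prefer to reach the UI programmatically, MWAA exposes a CreateWebLoginToken API. The sketch below uses the AWS SDK for JavaScript v3; the environment name and region are illustrative, and it assumes the webserver is reachable from where you run it (either a public webserver, or via the SSH tunnel through the bastion).

// Fetch a single-sign-on URL for the Airflow UI using the MWAA API.
import { MWAAClient, CreateWebLoginTokenCommand } from '@aws-sdk/client-mwaa';

async function printAirflowUiUrl(environmentName: string): Promise<void> {
  const client = new MWAAClient({ region: 'eu-west-1' }); // your deployment region
  const { WebServerHostname, WebToken } = await client.send(
    new CreateWebLoginTokenCommand({ Name: environmentName }),
  );
  // Single-sign-on URL pattern documented by AWS for MWAA
  console.log(`https://${WebServerHostname}/aws_mwaa/aws-console-sso?login=true#${WebToken}`);
}

printAirflowUiUrl('managed-airflow').catch(console.error);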
Adding DAGs to Airflow
There are no DAGs in Airflow when the stack is first deployed.
To add DAGs to Airflow you must upload them to the S3 bucket that was associated with Airflow as part of the infrastructure deployment.
As part of this project we have included a few sample DAGs. You can work on these DAGs locally and sync them to Airflow using the provided sync.ts script.
To sync the DAGs from local to the Managed Airflow, perform the following steps:
Update the sync config (/airflow/scripts/.env and /airflow/src/dags/.env):
In AWS App Config, find the Config Profile ID.
Set the APPCONFIG_CONFIGURATION_PROFILE_IDENTIFIER value in the .env files to the Config Profile ID value.
Run the sync script (/airflow/scripts/sync.ts); a rough sketch of what such a script does is shown after these steps.
npm run sync
Wait a few minutes for Airflow to recognise the newly uploaded files. They should now be visible in the Airflow DAGs list, ready to be executed.
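For reference, a DAG sync script of this kind essentially copies the local DAG files into the dags/ prefix of the bucket created by the stack. The sketch below is a simplified stand-in, not the repository's sync.ts (which presumably also reads its settings from AWS AppConfig, hence the .env values above); the bucket name, paths and region are illustrative.

// Minimal sketch: upload every local DAG file to the MWAA source bucket.
import { readdirSync, readFileSync } from 'fs';
import { join } from 'path';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const BUCKET = 'my-airflow-dag-bucket';  // hypothetical bucket created by the stack
const LOCAL_DAG_DIR = 'airflow/src/dags';
const REMOTE_PREFIX = 'dags';            // must match the environment's dagS3Path

const s3 = new S3Client({ region: 'eu-west-1' });

async function syncDags(): Promise<void> {
  const files = readdirSync(LOCAL_DAG_DIR).filter((f) => f.endsWith('.py'));
  for (const file of files) {
    await s3.send(new PutObjectCommand({
      Bucket: BUCKET,
      Key: `${REMOTE_PREFIX}/${file}`,
      Body: readFileSync(join(LOCAL_DAG_DIR, file)),
    }));
    console.log(`Uploaded ${file}`);
  }
}

syncDags().catch(console.error);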
Summary
We have shown that it is possible to deploy a fully managed Apache Airflow platform with a single command given an AWS account and some basic configuration. Users can update and configure Airflow in the CDK and easily customise the deployment as they see fit.