Docker: Docker container with Python and Apache Airflow for seamless integration with AWS S3

This guide provides step-by-step instructions for building a Docker container with Python and Apache Airflow installed. The container is configured with access to an AWS S3 bucket so it can read and write files, which is useful for managing data workflows and automating ETL processes.

Prerequisites

  • AWS Account
  • Docker installed on your machine
  • Basic knowledge of Python and Airflow

Step 1: AWS S3 Bucket Configuration

1.1 Create an S3 Bucket

Create an S3 bucket named freshers-in with the data/viewership path using the AWS Management Console.
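
If you would rather script this step than click through the console, a minimal boto3 sketch along the following lines should work. The us-east-1 region is an assumption, and S3 bucket names are globally unique, so the name may need adjusting for your account.

import boto3

# Assumes AWS credentials are already configured locally (e.g. via `aws configure`).
# us-east-1 is an assumed region; other regions need a CreateBucketConfiguration.
s3 = boto3.client('s3', region_name='us-east-1')

# Create the bucket
s3.create_bucket(Bucket='freshers-in')

# S3 has no real folders: writing a zero-byte object with a trailing slash
# makes the data/viewership/ prefix visible in the console.
s3.put_object(Bucket='freshers-in', Key='data/viewership/')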

1.2 Set up IAM Role and Permissions

Create an IAM user with programmatic access (an access key pair), and attach a policy granting read, write, and list permissions on the bucket path. The list permission is needed for the example script in Step 5.

Example policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::freshers-in/data/viewership/*"
        },
        {
            "Sid": "AllowS3ListPrefix",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::freshers-in",
            "Condition": {
                "StringLike": {
                    "s3:prefix": "data/viewership/*"
                }
            }
        }
    ]
}

Step 2: Create the Dockerfile

Create a Dockerfile to build an image with Python, Airflow, and the necessary AWS libraries.

FROM python:3.8-slim-buster

# Install dependencies
RUN pip install apache-airflow boto3

# Set Airflow environment variables
ENV AIRFLOW_HOME=/usr/local/airflow

# Initialize Airflow database
RUN airflow db init

# Copy your Airflow DAGs
COPY ./dags /usr/local/airflow/dags

# Set default command to run Airflow webserver
CMD ["airflow", "webserver"]
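
The Dockerfile copies a local ./dags directory into the image, so you need at least one DAG file next to the Dockerfile. As a minimal sketch, a DAG that writes a file to the bucket with boto3 could look like this; the DAG id, schedule, object key, and CSV content are illustrative assumptions.

from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def upload_viewership_report():
    # boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the container environment
    s3 = boto3.client('s3')
    s3.put_object(
        Bucket='freshers-in',
        Key='data/viewership/daily_report.csv',   # illustrative object key
        Body='date,views\n2024-01-01,100\n',      # illustrative content
    )


with DAG(
    dag_id='s3_viewership_upload',                # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    PythonOperator(
        task_id='upload_report',
        python_callable=upload_viewership_report,
    )

Keep in mind that the CMD above only starts the webserver; for the DAG to actually run on its schedule, an Airflow scheduler must also be running, for example via airflow scheduler in a second container or airflow standalone for quick local testing.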

Step 3: Build the Docker Image

Build the Docker image using the following command:

docker build -t airflow-python-s3 .

Step 4: Run the Docker Container

Run the container, passing the IAM user’s access keys as environment variables:

docker run -e AWS_ACCESS_KEY_ID=<your-access-key> -e AWS_SECRET_ACCESS_KEY=<your-secret-key> -p 8080:8080 airflow-python-s3

Step 5: Access the S3 Bucket from a Python Script

Now you can access the S3 bucket from a Python script or Airflow DAG within the container. Here’s an example script using the boto3 library to list objects in the bucket:

import boto3

# boto3 reads the AWS credentials passed as environment variables in Step 4
s3 = boto3.client('s3')
# List objects under the data/viewership/ prefix and print their keys
response = s3.list_objects_v2(Bucket='freshers-in', Prefix='data/viewership/')
for obj in response.get('Contents', []):
    print(obj['Key'])
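
Since the stated goal is to read and write files in the bucket, a companion sketch for uploading and downloading an object could look like the following; the local file names are illustrative assumptions.

import boto3

s3 = boto3.client('s3')

# Upload a local file to the viewership prefix (local_report.csv is an example name)
s3.upload_file('local_report.csv', 'freshers-in', 'data/viewership/local_report.csv')

# Download the same object back to a local path
s3.download_file('freshers-in', 'data/viewership/local_report.csv', 'downloaded_report.csv')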