This guide walks through building a Docker image with Python and Apache Airflow installed, then running it as a container that can read and write files in an AWS S3 bucket. This setup is useful for managing data workflows and automating ETL processes.
Prerequisites
- AWS Account
- Docker installed on your machine
- Basic knowledge of Python and Airflow
Step 1: AWS S3 Bucket Configuration
1.1 Create an S3 Bucket
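Using the AWS Management Console, create an S3 bucket named freshers-in and add a data/viewership/ folder inside it. S3 has no real directories, so the folder is simply a key prefix shared by the objects stored under it. Bucket names are globally unique, so substitute your own name if freshers-in is not available.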
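The same bucket and prefix could also be created from Python instead of the console. The following is a minimal sketch, not part of the original steps: the function name create_bucket_with_prefix and the default region are assumptions, and the s3 argument is expected to be a boto3 S3 client. The zero-byte object whose key ends in "/" mirrors what the console does when you create a folder.

```python
def create_bucket_with_prefix(s3, bucket, prefix, region="us-east-1"):
    """Create a bucket and a zero-byte placeholder object for the prefix.

    `s3` is expected to be a boto3 S3 client; `region` is an assumed default.
    Returns the placeholder key that was written.
    """
    if region == "us-east-1":
        # us-east-1 is the default location and must not be sent
        # as a LocationConstraint.
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

    # A key ending in "/" with an empty body shows up as a folder in the console.
    key = prefix.rstrip("/") + "/"
    s3.put_object(Bucket=bucket, Key=key, Body=b"")
    return key
```

With boto3 installed and credentials that are allowed to create buckets, it could be called as create_bucket_with_prefix(boto3.client('s3'), 'freshers-in', 'data/viewership'). The call fails if the bucket name is already taken by another account.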
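1.2 Set up an IAM User and Permissions
Create an IAM user and generate an access key for it. (IAM roles issue temporary credentials and have no long-lived access keys of their own, so a user is what Step 4 needs.) Attach a policy that grants read and write access to objects under data/viewership/ and permission to list that prefix. Listing requires s3:ListBucket on the bucket ARN itself, which an object-level statement does not cover.
Example policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::freshers-in/data/viewership/*"
    },
    {
      "Sid": "AllowListingPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::freshers-in",
      "Condition": {
        "StringLike": {
          "s3:prefix": "data/viewership/*"
        }
      }
    }
  ]
}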
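To get a feel for which objects a Resource pattern such as arn:aws:s3:::freshers-in/data/viewership/* covers, the wildcard can be approximated locally. The sketch below uses Python's fnmatchcase as a stand-in for IAM's matching. The helper names are hypothetical, and fnmatchcase also gives special meaning to square brackets, which IAM does not, so treat it as an approximation for simple patterns only.

```python
from fnmatch import fnmatchcase

# Resource pattern taken from the example policy.
RESOURCE = "arn:aws:s3:::freshers-in/data/viewership/*"


def object_arn(bucket, key):
    """Build the ARN that identifies a single S3 object."""
    return f"arn:aws:s3:::{bucket}/{key}"


def arn_is_covered(arn, pattern=RESOURCE):
    """Approximate IAM wildcard matching: "*" matches any run of
    characters, including "/"."""
    return fnmatchcase(arn, pattern)


candidates = [
    object_arn("freshers-in", "data/viewership/2024/jan.csv"),
    object_arn("freshers-in", "data/other/file.csv"),
    "arn:aws:s3:::freshers-in",  # the bucket itself, not an object
]
for arn in candidates:
    print(arn, arn_is_covered(arn))
```

The bare bucket ARN does not match the object pattern, which is why bucket-level actions such as listing need the bucket ARN as their own resource.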
Step 2: Create a Dockerfile
Create a Dockerfile to build an image with Python, Airflow, and the necessary AWS libraries.
FROM python:3.8-slim-buster
# Install dependencies
RUN pip install apache-airflow boto3
# Set Airflow environment variables
ENV AIRFLOW_HOME=/usr/local/airflow
# Initialize the Airflow metadata database
# (Airflow 2.7+ deprecates this command in favor of "airflow db migrate")
RUN airflow db init
# Copy your Airflow DAGs
COPY ./dags /usr/local/airflow/dags
# Set default command to run Airflow webserver
CMD ["airflow", "webserver"]
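The COPY ./dags instruction expects DAG files in a local dags directory next to the Dockerfile. As a minimal sketch of what one such file might contain, the example below defines a single task that lists the keys under the data/viewership/ prefix. The file contents, the DAG id list_viewership_files, and the task id are hypothetical. The schedule argument assumes Airflow 2.4 or later, and the import guard is only there so the task function can be imported and tested outside the container.

```python
from datetime import datetime

try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:
    # Airflow is only installed inside the container; this guard lets the
    # task function below be imported and tested elsewhere.
    DAG = None

BUCKET = "freshers-in"
PREFIX = "data/viewership/"


def list_viewership_keys(s3_client=None):
    """Return the object keys under PREFIX (first page only, up to 1,000 keys)."""
    if s3_client is None:
        import boto3  # imported lazily so the module loads without boto3

        s3_client = boto3.client("s3")
    response = s3_client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    print(f"Found {len(keys)} object(s) under s3://{BUCKET}/{PREFIX}")
    return keys


if DAG is not None:
    with DAG(
        dag_id="list_viewership_files",  # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule=None,  # run only when triggered manually (Airflow 2.4+)
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="list_keys",
            python_callable=list_viewership_keys,
        )
```

Note that the image's default command starts only the webserver. For a triggered run to actually execute, an Airflow scheduler process would also need to be running against the same metadata database.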
Step 3: Build the Docker Image
Build the Docker image using the following command:
docker build -t airflow-python-s3 .
Step 4: Run the Docker Container
Run the container, passing the IAM user’s access keys as environment variables and publishing the webserver port:
docker run -e AWS_ACCESS_KEY_ID=<your-access-key> -e AWS_SECRET_ACCESS_KEY=<your-secret-key> -p 8080:8080 airflow-python-s3
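boto3 picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment, so a quick check inside the container can confirm that the -e flags took effect before any S3 call is made. This is a small sketch using only the standard library, and the helper name missing_aws_vars is hypothetical.

```python
import os

# Variable names passed with -e in the docker run command.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("AWS credential variables are set")
```

The check only confirms that the variables exist; it does not verify that the keys are valid or that the attached policy grants access.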
Step 5: Access the S3 Bucket from a Python Script
Now you can access the S3 bucket from a Python script or Airflow DAG within the container. Here’s an example script using the boto3 library to list objects in the bucket:
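import boto3

# boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment
s3 = boto3.client('s3')
# list_objects_v2 returns at most 1,000 keys per call
response = s3.list_objects_v2(Bucket='freshers-in', Prefix='data/viewership/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])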
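Since the container is meant to write files as well as read them, a round trip is a useful follow-up to listing. The sketch below assumes s3 is a boto3 S3 client. The helper names and the sample key data/viewership/sample.txt are hypothetical.

```python
def write_text(s3, bucket, key, text):
    """Upload a string as a UTF-8 encoded object (uses s3:PutObject)."""
    s3.put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))


def read_text(s3, bucket, key):
    """Download an object and decode it as UTF-8 (uses s3:GetObject)."""
    response = s3.get_object(Bucket=bucket, Key=key)
    # The body is a stream; read() returns the object's bytes.
    return response["Body"].read().decode("utf-8")
```

Inside the container this could be used as write_text(s3, 'freshers-in', 'data/viewership/sample.txt', 'hello') followed by read_text(s3, 'freshers-in', 'data/viewership/sample.txt'), where s3 is the client created by boto3.client('s3'). Keys outside the data/viewership/ prefix would be rejected by the example policy.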