Building an ETL Job: Transferring Data from MySQL to Redshift using Python

Extract, Transform, Load (ETL) is a data pipeline process that extracts data from a source system, transforms it as needed, and loads it into a target system. In this article, we’ll demonstrate how to build an ETL job that extracts data from a MySQL database and loads it into a Redshift data warehouse. We’ll also apply the Change Data Capture (CDC) concept to pick up only the delta changes, and trigger this ETL job every hour.

Using a Plain Python Script

Prerequisites

  • Python 3 installed on your local machine
  • MySQL and AWS Redshift instances up and running
  • The mysql-connector-python and psycopg2 Python libraries installed
  • An orders table in your MySQL database with create_date and update_date columns (a possible schema is sketched after this list)

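If you don’t already have such a table, a minimal schema might look like the following. Only create_date and update_date are required by the examples below; the other columns are hypothetical placeholders:

CREATE TABLE orders (
    order_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT,
    amount DECIMAL(10, 2),
    create_date DATETIME DEFAULT CURRENT_TIMESTAMP,
    update_date DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
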
Step 1: Extract Data from MySQL

First, we will extract the data from the MySQL database using the mysql-connector-python library. Here's a simple Python function that connects to MySQL and fetches rows from the orders table that were created or updated since the last run:

import mysql.connector
from datetime import datetime

# CDC watermark: start at the epoch so the first run extracts every record
last_update = datetime(1970, 1, 1)

def extract_new_records():
    global last_update
    connection = mysql.connector.connect(user='mysql_user', password='mysql_password',
                                         host='mysql_host', database='mysql_database')
    cursor = connection.cursor()

    # Capture the new watermark before querying, so rows updated while the
    # query runs are picked up on the next run instead of being skipped
    new_watermark = datetime.now()

    # Fetch only rows changed since the last run; a parameterized query
    # avoids SQL injection and manual date formatting
    query = "SELECT * FROM orders WHERE update_date > %s"
    cursor.execute(query, (last_update,))

    records = cursor.fetchall()
    last_update = new_watermark
    cursor.close()
    connection.close()
    return records
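
Note that last_update lives only in process memory, so the watermark resets whenever the script restarts and the first run after a restart re-extracts the whole table. In a production CDC setup you would typically persist the watermark between runs, for example in a small control table or a file.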

Step 2: Load Data to Redshift

The next step is to load the extracted data into Redshift. We’ll use the psycopg2 library to insert the records into the Redshift table.
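
Since Redshift is PostgreSQL-compatible, psycopg2 can connect to it directly. The following is a minimal sketch rather than a definitive implementation: the connection parameters are placeholders for your own cluster (5439 is Redshift's default port), and the INSERT assumes the five-column orders layout used illustratively above.

import psycopg2

def load_records(records):
    # Redshift speaks the PostgreSQL wire protocol, so psycopg2 connects
    # directly; host and credentials below are placeholders
    connection = psycopg2.connect(user='redshift_user', password='redshift_password',
                                  host='redshift_host', port=5439, dbname='redshift_database')
    cursor = connection.cursor()

    # Insert the extracted rows; the number of %s placeholders must match
    # the column count of the orders table in Redshift
    insert_query = "INSERT INTO orders VALUES (%s, %s, %s, %s, %s)"
    cursor.executemany(insert_query, records)

    connection.commit()
    cursor.close()
    connection.close()

Two practical notes: because CDC re-extracts updated rows, a production job would usually stage each batch and merge (upsert) it rather than insert blindly, and for anything beyond small batches, loading files from S3 with Redshift's COPY command is much faster than row-by-row inserts. executemany keeps this sketch self-contained.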
