Extract, Transform, Load (ETL) is a data pipeline process that extracts data from a source system, transforms it in some way, and loads it into a target system. In this article, we’ll demonstrate how to build an ETL job that extracts data from a MySQL database and loads it into a Redshift data warehouse. We’ll also apply the Change Data Capture (CDC) concept so that each run picks up only the delta changes, and trigger this ETL job every hour.
Using a Plain Python Script
Prerequisites
- Python 3 installed on your local machine
- MySQL and AWS Redshift instances up and running
- mysql-connector-python and psycopg2 Python libraries installed
- An orders table in your MySQL database with create_date and update_date columns
Step 1: Extract Data from MySQL
First, we will extract the data from the MySQL database using the mysql-connector-python library. Here's a simple Python function that connects to a MySQL database and fetches new records from the orders table:
import mysql.connector
from datetime import datetime

# CDC watermark: start at the epoch so the first run extracts every record
last_update = datetime(1970, 1, 1)

def extract_new_records():
    global last_update
    connection = mysql.connector.connect(user='mysql_user', password='mysql_password',
                                         host='mysql_host', database='mysql_database')
    cursor = connection.cursor()
    # Capture the new watermark before running the query, so rows updated while
    # the query executes are picked up on the next run instead of being missed
    extraction_time = datetime.now()
    # Parameterized query: the connector handles datetime formatting and quoting
    cursor.execute("SELECT * FROM orders WHERE update_date > %s", (last_update,))
    records = cursor.fetchall()
    last_update = extraction_time
    cursor.close()
    connection.close()
    return records
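Note that last_update is held only in memory, so every time the script restarts it falls back to the epoch and the next run re-extracts the whole table. In a real pipeline you would typically persist this watermark between runs, for example in a small metadata table or a local file.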
Step 2: Load Data to Redshift
The next step is to load the extracted data into Redshift. We’ll use the psycopg2 library to insert the records into the Redshift table.
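A minimal sketch of this load step is shown below. It assumes a Redshift table named orders whose columns match the MySQL source table, and placeholder connection details (redshift_user, redshift_host, and so on) that you would replace with your own:
import psycopg2

def load_to_redshift(records):
    # Nothing to do if the extract step returned no new or updated rows
    if not records:
        return
    # Placeholder credentials; substitute your Redshift cluster's values (5439 is the default Redshift port)
    connection = psycopg2.connect(user='redshift_user', password='redshift_password',
                                  host='redshift_host', port=5439, dbname='redshift_database')
    cursor = connection.cursor()
    # Build one %s placeholder per column so the statement matches the row width;
    # this assumes the target orders table has the same column order as the source
    placeholders = ', '.join(['%s'] * len(records[0]))
    cursor.executemany(f"INSERT INTO orders VALUES ({placeholders})", records)
    connection.commit()
    cursor.close()
    connection.close()
For larger batches, psycopg2.extras.execute_values or staging the data in S3 and issuing a COPY command is usually much faster than executemany, but the simple version keeps the example readable.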