Stage 3.3. Extract, Transform, Load (ETL)#
The Evolution of Data Engineering (Don’t Quote Me On This!)#
With the basic understanding of Data Engineering and its essential role in Machine Learning, it’s important to recognize the evolution of data handling practices. Traditional ETL (Extract, Transform, Load) methodologies have long been the backbone of data pipeline design. They set the stage for collecting, processing, and storing data in a structured manner.
However, the modern era of data-driven applications demands a more agile and responsive approach. This is where DataOps, encompassing principles of Continuous Integration/Continuous Deployment (CI/CD), comes into play. The process builds on the ETL framework but now with automation, collaboration, monitoring, and quality assurance.
In traditional ETL or ELT processes, the main focus is on extracting data from various sources, transforming it into the required format, and then loading it into a target system. These processes are typically batch-oriented and can be run on schedules or triggered manually.
In a CI/CD DataOps pipeline, the focus expands to the entire data lifecycle and emphasizes automation, continuous integration, and continuous deployment. This means that the process not only includes the basic ETL or ELT steps but also involves:
Continuous Integration: Automating the process of integrating code changes from multiple contributors into a shared repository, often followed by automated building and testing.
Continuous Deployment: Automating the process of deploying the integrated and tested code to production environments, ensuring that the data pipeline remains stable and updated.
Monitoring and Alerting: Keeping track of the performance and health of the data pipeline, triggering alerts if anomalies or issues are detected.
Testing and Quality Assurance: Embedding rigorous testing within the pipeline to ensure data quality, integrity, and compliance with business rules.
The ETL/ELT Framework#
ETL (Extract, Transform, Load)#
ETL is a process in data handling that involves three main stages:
Extract: Gathering data from various sources.
Transform: Processing this data to fit the desired format, usually outside the target system. This might include cleaning, aggregating, filtering, etc.
Load: Finally, loading the transformed data into the destination data warehouse or database.
ELT (Extract, Load, Transform)#
ELT is a variant of ETL, but with a different order of operations:
Extract: Gathering data from various sources.
Load: Loading the raw data into the destination system.
Transform: Performing transformations within the target system itself, utilizing the processing capabilities of modern data warehouses.
ELTL (Extract, Load, Transform, Load)#
This combination could represent a two-step process:
Extract: Gathering data from various sources.
Load: Loading the raw data into a staging area or temporary storage.
Transform: Performing transformations within this temporary storage.
Load: Loading the transformed data into the final destination, such as a data warehouse or database.
Step 0. Project Scope#
As usual, defining the scope of your project is the first step in any DataOps pipeline. The scope of your project will determine the type of data you need to extract, the tools you’ll use to extract it, and the type of data you’ll need to store and process.
Step 1. Extract (Data Extraction and Defining the Data Source)#
Data extraction refers to the process of retrieving or gathering data from various sources, such as websites, APIs, databases, or files. This process often involves techniques like web scraping, API requests, or database queries. Data extraction is usually the first step in the broader process of data collection.
Intuition#
The key of this step is to scope the data source.
This means identifying the data you need to extract and the sources where you can find it. You can extract data from various sources, such as web pages, APIs, databases, and public/private datasets. You can also extract data in real-time from streaming sources, such as social media feeds, or log files.
For example, let’s say you are a movie streaming provider and you want to build a movie recommendation system. You need to extract data about movies and users. You can extract this data from various sources such as IMDb, Rotten Tomatoes, as well as your internal database of user interactions in order to build your recommendation system. This step is non-trivial in practice because no one is going to give you a dataset with all the cleaned data you need magically.
Definition#
Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources (such as websites, APIs) for further data processing or data storage (data migration).
This version of the definition does not entail loading the extracted data into an intermediate storage. Instead, we are only interested in identifying the key points listed in the next section.
Template#
What (raw) data do I need to extract (relevant to my project)?#
Describe the types of data required for your project (e.g., text data, images, time-series data, etc.).
What information do you need from the data (e.g., sentiment, keywords, trends, etc.)?
Where can I find this data? This means finding all potential data sources#
List potential sources where the required data can be found (e.g., websites, APIs, databases, public or private datasets, etc.).
Are there any restrictions or limitations on accessing the data (e.g., licensing, API rate limits, etc.)?
What format is the data in? Is it in a format that I can use?#
Describe the format of the data in each source (e.g., JSON, CSV, XML, plain text, etc.).
If the data format is not directly usable, what conversion or transformation is needed?
How do I access and extract the data?#
List the methods to access the data (e.g., web scraping, API requests, database queries, file downloads, etc.).
Are there any authentication or authorization requirements to access the data (e.g., API keys, login credentials, etc.)?
What tools do I need to extract the data?#
For each data access method, list the appropriate tools or libraries (e.g., Scrapy for web scraping, Requests for API requests, etc.).
Are there any specific configurations or setups needed for using these tools (e.g., user agents, request headers, etc.)?
Workflow#
Data collection involves gathering raw data from various sources and making it available for further processing in your DataOps pipeline. We see below for some common techniques and tools for data extraction, along with their advantages and disadvantages.
Description (What) |
Data Sources (Where) |
Data Formats |
Access & Extraction (How) |
Tools and Libraries |
Requirements / Configurations |
---|---|---|---|---|---|
Extract text data, images, and other content from web pages, such as movie reviews, product listings, or news articles. |
Websites, Web forums, Blogs |
HTML, CSS, JavaScript |
Web Scraping |
User agents, Request headers, Handling AJAX requests, Managing cookies, Captcha solving |
|
Retrieve data from various APIs provided by platforms like Twitter, Reddit, news organizations, government data, or other services. |
APIs (Twitter, Reddit, IMDb, etc.) |
JSON, XML, CSV |
API Requests |
API keys, Access tokens, Rate limits, Pagination, Error handling |
|
Access large datasets like Common Crawl, Wikipedia dumps, or proprietary data from private sources, including text, images, and multimedia. |
Public/Private Datasets (Common Crawl, Wikipedia) |
CSV, TSV, Excel, JSON, XML, Images, Audio, Video |
Downloading datasets |
Licensing, Data access restrictions, Data storage and management, File format conversion |
|
Collect data in real-time from streaming sources, such as social media feeds, or log files. |
Social media feeds, IoT devices, Log files |
JSON, XML, CSV, Binary, Text |
Data Streaming |
Real-time processing, Stream configuration, Handling data partitioning and replication |
|
Extract data from structured or semi-structured databases, such as relational databases, NoSQL databases, or data warehouses. |
Databases (SQL, NoSQL, Data warehouses) |
SQL, NoSQL, CSV, TSV, JSON, XML |
Database queries |
SQL queries, SQLAlchemy, PyMySQL (MySQL), psycopg2 (PostgreSQL) |
Authentication, Connection strings, Query optimization, Managing database connections and sessions |
After collecting the data, it’s important to ingest it into your DataOps pipeline for further processing, such as storage, preprocessing, and transformation. Data ingestion can be done in real-time or batch mode, depending on the specific requirements of your project.
An Example Of Data Extraction For Collecting Large Volumes Of Text Data#
In this example, we will illustrate the process of data extraction for a ChatGPT equivalent for South East Asian languages using the data collection template as a guide.
Description (What data) |
Data Sources (Where) |
Data Formats |
Access & Extraction (How) |
Tools and Libraries |
Requirements / Configurations |
---|---|---|---|---|---|
Large volumes of text data in South East Asian languages (e.g., Indonesian, Malay, Thai, Vietnamese, Filipino, etc.) from diverse sources. |
- Common Crawl |
- WARC |
- Common Crawl API |
- warcio or WARC Tools |
- Processing and extraction of WARC files |
What (raw) data do I need to extract (relevant to my project)?#
Large volumes of text data in South East Asian languages (e.g., Indonesian, Malay, Thai, Vietnamese, Filipino, etc.).
Data should come from diverse sources to capture different styles, contexts, and domains (e.g., news articles, social media, forums, websites, etc.).
Where can I find this data? This means finding all potential data sources#
Common Crawl: A large and frequently updated dataset of web crawl data that includes text in various languages, including South East Asian languages.
APIs: Platforms like Twitter, Reddit, and news organizations that have content in South East Asian languages.
Public/Private Datasets: Language-specific datasets, government datasets, and domain-specific datasets containing text in South East Asian languages.
Web Scraping: Target websites and forums with a high volume of content in South East Asian languages.
What format is the data in? Is it in a format that I can use?#
Common Crawl data comes in WARC format, which will require processing and extraction to obtain usable text.
API data is typically in JSON format, which can be easily parsed and processed.
Datasets can come in various formats, such as CSV, JSON, XML, or plain text. Ensure that you have the appropriate tools to process each format.
How do I access the data?#
Common Crawl: Access data through the Common Crawl API or download WARC files directly from Amazon S3.
APIs: Request data using platform-specific APIs (e.g., Twitter API, Reddit API, etc.) with proper authentication and rate limit handling.
Public/Private Datasets: Access datasets through their respective providers (e.g., download links, dataset repositories, etc.).
Web Scraping: Crawl and scrape target websites using web scraping tools and libraries, while adhering to the site’s robots.txt and usage policies.
What tools do I need to extract the data?#
Common Crawl: Use libraries like warcio or WARC Tools to process WARC files.
APIs: Use libraries like Requests, Tweepy (Twitter), and PRAW (Reddit) to interact with APIs.
Public/Private Datasets: Use data processing libraries (e.g., Pandas for CSV, JSON for JSON) to handle various dataset formats.
Web Scraping: Use web scraping libraries like Scrapy, Beautiful Soup, and Selenium to scrape target websites.
Step 2. Load (Data Ingestion and Destination)#
Data ingestion refers to the process of loading the extracted data into a storage solution, such as a data warehouse, data lake, or data lakehouse. This step is crucial because it makes the data available for further processing and analysis in your DataOps pipeline.
Intuition#
The key to this step is to choose an appropriate destination for storing the extracted data. The choice of destination depends on your project requirements, the type and size of the data, and your budget constraints. Different destinations have their own advantages and limitations, so it’s essential to choose the right one for your specific needs.
In our case, since we’re dealing with large volumes of text data, a cloud-based storage solution like Google BigQuery (a data warehouse) or Google Cloud Storage (a data lake) would be suitable.
A side note is that these large text data is used for downstream pre-training task, and as such, say we are using SLURM cluster on AWS, naturally we would also need a highly efficient shared file system like Amazon EFS or Amazon FSx for Lustre that plays well with High Performance Computing (HPC) workloads. If you plan to use say FSx, then stick to AWS for the data lake as well as S3 and FSx are integrated well enough for one to dump data to S3 and then “mount” it to FSx.
Definition#
Data loading is the process of importing extracted data into a target storage system, such as a database, data warehouse, data lake, or data lakehouse.
In the context of ETL (Extract, Transform, Load), this process typically involves transforming, validating, and cleaning the extracted data to ensure its compatibility, consistency, and accuracy within the target storage system. However, in this version, we will adopt an approach more similar to ELT (Extract, Load, Transform), where the data is loaded into the destination storage system (i.e., data lake) without any transformations or cleaning.
Loading consists of two main steps:
Store the extracted raw data temporarily on a local machine, on-premises server, or cloud-based storage solution.
Ingest the raw data directly into the destination storage solution for further processing and analysis in your DataOps pipeline.
Template#
To successfully load the extracted data, consider the following key points that helps one to choose the right destination storage for their data.
What are my project requirements?#
What are the main goals and objectives of the project?
What kind of processing, analysis, or transformations will be performed on the data?
Do I need real-time processing and analytics, or is batch processing sufficient?
Where should I store the extracted data?#
Choose an appropriate storage solution based on your project requirements, the type and size of the data, and your budget constraints.
Consider the scalability, performance, and cost of the storage solution.
Ensure the storage solution is compatible with your data processing and analysis tools.
How do I load the extracted data into the destination storage?#
Select an appropriate method for loading the data (e.g., batch loading, real-time streaming, etc.).
Choose the right tools or libraries for loading the data.
Ensure the loaded data is organized and structured in a way that facilitates efficient querying and analysis.
What considerations should I keep in mind when loading the data?#
Consider data security and privacy, especially if you’re dealing with sensitive or personal information.
Ensure data integrity by validating, cleaning, and transforming the data.
Monitor and optimize the performance of the data loading process.
What type of data am I dealing with?#
Is the data structured, semi-structured, or unstructured?
What is the size and complexity of the data?
What are my budget constraints?#
What is the total budget for data storage?
How much will it cost to store the data for the duration of the project, considering factors like storage capacity, data transfer, and data access costs?
Are there any cost-saving options or discounts available, such as long-term storage plans or usage-based pricing?
What are the performance requirements?#
How fast do I need to access and query the data?
Are there any specific latency, throughput, or concurrency requirements?
Does the storage solution provide sufficient performance for my use case?
What are the scalability and flexibility requirements?#
Will the data volume grow over time? If so, how much?
Does the storage solution support scaling storage capacity and performance as needed?
Can the storage solution handle changes in data types, formats, or schemas over time?
What are the data security and compliance requirements?#
What are the security requirements for the data, such as encryption, access control, and auditing?
Does the storage solution meet any regulatory or compliance requirements that apply to my project or industry (e.g., GDPR, HIPAA, etc.)?
Workflow#
Here’s an overview of some common techniques and tools for data loading, along with their advantages and disadvantages.
Description (What) |
Destination (Where) |
Access & Ingestion (How) |
Tools and Libraries |
Considerations |
---|---|---|---|---|
Load extracted data into a relational database for structured data storage, querying, and analysis. |
SQL Databases (MySQL, PostgreSQL, Oracle, MS SQL Server) |
Batch loading, Real-time streaming |
Data schema, Data normalization, Indexing, Transactions, Constraints, Backup and recovery |
|
Load extracted data into a NoSQL database for flexible, schema-less storage and querying of semi-structured or unstructured data. |
NoSQL Databases (MongoDB, Cassandra, Couchbase, Redis) |
Batch loading, Real-time streaming |
PyMongo (MongoDB), Cassandra Driver (Cassandra), Couchbase SDK (Couchbase) |
Data model, Horizontal scaling, Partitioning, Replication, Consistency, Backup and recovery |
Ingest extracted data into a data warehouse for large-scale storage, querying, and analysis of structured or semi-structured data. |
Data Warehouses (Google BigQuery, Amazon Redshift, Snowflake) |
Batch loading, Real-time streaming |
Google Cloud SDK (BigQuery), Amazon Redshift SDK, Snowflake Python Connector |
Data storage format, Data partitioning, Indexing, Data compression, Security, Backup and recovery |
Store extracted data in a data lake for cost-effective, scalable storage and processing of raw, unstructured or semi-structured data. |
Data Lakes (Amazon S3, Google Cloud Storage, Azure Blob Storage, Hadoop HDFS) |
Batch loading, Real-time streaming |
Boto3 (Amazon S3), Google Cloud SDK (Google Cloud Storage), Azure SDK (Azure Blob Storage), Hadoop HDFS SDK (Hadoop HDFS) |
Data storage format, Data partitioning, Data compression, Security, Backup and recovery, Data lake organization and governance |
Load extracted data into a data lakehouse for a hybrid solution that combines the best features of data warehouses and data lakes. |
Data Lakehouses (Databricks, Delta Lake) |
Batch loading, Real-time streaming |
Data storage format, Data partitioning, Indexing, Data compression, Security, Backup and recovery, Data lake organization and governance |
Common Data Destinations#
Destination |
Description |
Examples |
---|---|---|
Data Warehouse |
A centralized storage solution optimized for structured or semi-structured data, designed for fast querying and analytics. Data warehouses typically use SQL-based languages for querying and are ideal for large-scale, long-term data storage. |
|
Data Lake |
A storage solution that can store large volumes of raw, unprocessed data in its native format, typically optimized for cost-effective storage. Data lakes can store structured, semi-structured, and unstructured data. |
Google Cloud Storage, Amazon S3, Microsoft Azure Data Lake Storage |
Data Lakehouse |
A hybrid storage solution that combines the best features of data warehouses and data lakes, providing a unified platform for storing, processing, and analyzing structured, semi-structured, and unstructured data. |
Databricks Delta Lake, Apache Hudi, Snowflake (when combined with data lake storage solutions like Amazon S3) |
Data Lake vs. Data Warehouse#
In the Data Engineering Guide by Goku Mohandas, he explains that storing raw data in data lakes has become popular due to cost-effective storage and cloud-based Software as a Service (SaaS) management options. Data lakes can store unstructured and unprocessed data without the need for justifying its storage for downstream applications. When there is a need to process and transform the data, it can be moved to a data warehouse, where these operations can be performed more efficiently. Data warehouses are optimized for fast querying and analytics and are typically used for storing structured or semi-structured data.
What this really means is that, no matter what kind of data is being extracted from the source, it can be loaded and stored into a data lake first as a copy. Then, when the data needs to be processed and transformed for downstream applications, it can be moved to a data warehouse for further processing and analysis.
Monitoring and Optimizing#
Once the data is loaded into the destination storage, it’s essential to monitor the performance of the data loading process and optimize it as needed. This can help ensure that the data is always up-to-date, accurate, and available for further processing and analysis.
Some tips for monitoring and optimizing the data loading process include:
Regularly check the performance metrics of the data loading process, such as data load times, error rates, and data throughput.
Identify and resolve any bottlenecks or issues that may be impacting the performance of the data loading process.
Optimize the data loading process by adjusting the batch size, parallelizing the process, or using incremental loading.
Monitor the storage utilization and performance of the destination storage system to ensure it meets the requirements of your DataOps pipeline.
Leverage built-in tools, third-party solutions, or custom scripts for monitoring and optimization
Step 3. Transform (Data Processing and Transformation)#
Data transformation is the process of converting, cleaning, and enriching the raw data that has been loaded into your storage solution to make it more suitable for analysis and reporting. This step is essential for improving data quality, reducing inconsistencies, and ensuring that the data is in a format that can be easily understood and consumed by your downstream analytics tools and processes.
However, as Goku puts it, the transform here is not necessarily the same as the preprocessing step in a machine learning task. A typical machine learning pipeline involves steps such as one-hot encoding, feature scaling, and feature extraction. In contrast, the transform step in the DataOps pipeline reflects more about the business logic of the data.
Why? This is because we want the data at this stage to be as agnostic as possible so that downstream tasks are not only restricted to machine learning. For example, we may want to extract data from the data lake and load it into a data warehouse for further processing and analysis. In this case, the data needs to be transformed in a way that’s suitable for the downstream task (and not necessarily for machine learning).
As a result, our DataOps workflows are not specific to any particular downstream application so the transformation must be globally relevant.
Intuition#
The key to this step is to understand the structure and nature of your data and then apply the necessary transformations to make it more valuable and useful for your analysis. The specific transformations you apply will depend on your project requirements, the type of data you’re working with, and the desired output format.
In cases where large volumes of text data are being processed, some common agnostic transformations applicable to various downstream tasks include:
Cleaning: Removing obvious errors, inconsistencies, or irrelevant information.
De-duplication: Removing duplicate records.
Definition#
Data transformation is the process of converting, cleaning, and enriching raw data into a format that is more suitable for analysis, reporting, and other data-driven tasks. This step typically involves applying various data processing techniques, such as filtering, sorting, aggregation, and joining, to modify and enhance the data’s structure, quality, and consistency.
These operations typically include:
Filtering: Removing unwanted data, such as duplicates, irrelevant information, or incomplete records.
Aggregating: Combining data from multiple sources or records to create summary statistics, averages, or other measures of central tendency.
Normalizing: Scaling or standardizing data to a common range or distribution, making it easier to compare and analyze.
In the context of ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform), data transformation is an essential step that prepares the data for loading into a target storage system or for further processing and analysis in your DataOps pipeline.
Template#
What are my project requirements?#
What are the main goals and objectives of the project?
What kind of processing, analysis, or transformations will be performed on the data?
What is the desired output format or structure of the transformed data?
What type of data am I dealing with?#
Is the data structured, semi-structured, or unstructured?
What is the size and complexity of the data?
Are there any specific data quality or consistency issues that need to be addressed?
What are the data processing techniques and tools I should use?#
Select appropriate data processing techniques based on your project requirements and the type of data you’re working with.
Choose the right tools or libraries for implementing the data processing techniques.
Ensure the tools and libraries are compatible with your data storage solution and analytics tools.
What are the performance requirements?#
How fast do I need to process and transform the data?
Are there any specific latency, throughput, or concurrency requirements?
Does the data processing solution provide sufficient performance for my use case?
What are the scalability and flexibility requirements?#
Will the data volume grow over time? If so, how much?
Does the data processing solution support scaling processing capacity and performance as needed?
How well does the data processing solution handle changes in data types, formats, or schemas over time?
Workflow#
Technique (What) |
Description |
Input (Where) |
Output (How) |
Tools and Libraries |
Considerations |
---|---|---|---|---|---|
Remove unwanted or irrelevant data based on specified criteria or conditions. |
Raw data from storage solution |
Filtered data |
Pandas, NumPy, Dask |
Criteria for filtering, handling missing or incomplete data, data quality |
|
Organize the data based on one or more attributes, either in ascending or descending order. |
Filtered or unfiltered data |
Sorted data |
Pandas, NumPy, Dask |
Sorting criteria, handling missing or incomplete data, sorting algorithms and complexity |
|
Combine and summarize data based on specific attributes, such as sum, average, count, or other statistical measures. |
Filtered or unfiltered data |
Aggregated data |
Pandas, NumPy, Dask |
Aggregation functions, handling missing or incomplete data, data granularity |
|
Merge data from multiple sources or tables based on a common attribute or key. |
Data from multiple sources or tables |
Joined data |
Pandas, SQLAlchemy, Apache Spark |
Join types, joining keys, handling missing or incomplete data, join performance |
Batch Processing vs. Stream Processing#
For real-time or near-real-time ML applications, traditional batch processing of ETL might not be suitable. Instead, stream processing frameworks like Apache Kafka or Apache Flink allow for continuous data processing and may be used as alternatives or complements to ETL.