"Extract, Transform, Load" (ETL) and " Extract, Load, Transform" (ELT) - Python Automation and Machine Learning for ICs - - An Online Book: Python Automation and Machine Learning for ICs by Yougui Liao - |
http://www.globalsino.com/ICs/
Extract, load, and transform (ELT) emerged with big data processing. In ELT, all the data resides in a data lake: a pool of raw data whose purpose is not predefined. Each project then forms its own transformation tasks as required, rather than anticipating every transformation scenario up front, as ETL and a data warehouse do. In practice, organizations often use a mixture of ETL and ELT.

Extract, transform, load (ETL) plays a crucial role as the initial phase in any data processing pipeline, supplying data to warehouses for subsequent use in applications, machine learning models, and various services. In the final stage of an ETL pipeline, the data can be saved to disk, for example as a JSON file, or loaded into another database such as PostgreSQL, either directly or through an API. The ETL process is a broad term in data handling and data warehousing that involves:

- Extract: pulling data from one or more sources.
- Transform: converting the data into the desired format or structure.
- Load: writing the transformed data into a target store such as a data warehouse.
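The three steps above can be sketched with a minimal, self-contained pipeline. This is only an illustration using the standard library: the sample CSV, the field names (`sensor_id`, `reading`), and the output filename `etl_output.json` are all hypothetical; a real pipeline would extract from a file, API, or database instead of an in-memory string.

```python
import csv
import io
import json

# Extract: read raw records from a CSV source (an in-memory sample here;
# in practice this could be a file, an API response, or a database query).
raw_csv = "sensor_id,reading\nA1,21.5\nA2,23.0\nA2,\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: cleanse (drop rows with a missing reading) and convert types.
records = [
    {"sensor_id": r["sensor_id"], "reading": float(r["reading"])}
    for r in rows
    if r["reading"]  # skip rows where the reading is empty
]

# Load: serialize the cleaned records to a JSON file on disk.
with open("etl_output.json", "w") as f:
    json.dump(records, f, indent=2)

print(len(records))  # number of records loaded
```

Loading into PostgreSQL instead of a JSON file would only change the final step, for example by executing `INSERT` statements through a database driver, while the extract and transform steps stay the same.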
The primary goal of data transformation is to change the data into a format that is useful to business users. This involves converting data from one format or structure into another, often to make it more suitable for analysis, reporting, or specific business applications, and can include cleansing, aggregating, and reorganizing the data to meet specific needs. Tools and libraries from the Apache ecosystem, such as Spark, Arrow, and Flink (see page 3328), can be involved in various stages of an ETL pipeline, especially the extraction and loading parts; they also support complex transformations within processing workflows, making them suitable for comprehensive ETL tasks. ETL is often a crucial first step in a machine learning (ML) pipeline. ELT, data lakes, and data warehouses are interconnected and integral to modern data architectures, especially in big data and analytics-driven environments.
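A small sketch of the aggregating and reorganizing steps mentioned above, using only the standard library; the records and field names (`region`, `amount`) are hypothetical examples of row-level data being reshaped into a per-region summary for business users.

```python
from collections import defaultdict

# Hypothetical row-level records, as they might arrive from the extract step.
raw = [
    {"region": "east", "amount": 100.0},
    {"region": "west", "amount": 250.0},
    {"region": "east", "amount": 50.0},
]

# Aggregate: total amount per region. This reorganizes row-level data
# into a summary shape that is more useful for reporting and analysis.
totals = defaultdict(float)
for rec in raw:
    totals[rec["region"]] += rec["amount"]

print(dict(totals))  # {'east': 150.0, 'west': 250.0}
```

At scale, the same grouping-and-summing pattern is what frameworks like Spark or Flink perform in a distributed fashion across a cluster.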