Catalyst in Spark - Python Automation and Machine Learning for ICs - An Online Book by Yougui Liao
http://www.globalsino.com/ICs/
Catalyst is Spark's query optimization framework, responsible for optimizing the logical and physical plans of SQL queries and DataFrame operations. It uses a rule-based optimization engine in which rules are applied over four phases: analysis, logical optimization, physical planning, and code generation.

In the analysis phase, Catalyst examines the query, the DataFrame, the unresolved logical plan, and the Catalog to produce a resolved logical plan. This plan is then refined into an optimized logical plan during the logical optimization phase, which applies rule-based transformations in Spark SQL such as constant folding, predicate pushdown, and column pruning.

During the physical planning phase, Catalyst generates multiple physical plans from the optimized logical plan. Each physical plan specifies how the computation will be carried out on the underlying datasets, giving precise execution instructions. The most efficient plan is then selected using a cost model: cost is estimated from the time and memory a query would consume, and Catalyst chooses the query path that minimizes both. Because a query can follow many alternative paths, these evaluations can become complex, especially when large datasets are involved. This selection step is the cost-based optimization.

The final phase is code generation, where Catalyst implements the chosen physical plan by generating Java bytecode, which executes across the cluster nodes.
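To make the rule-based idea concrete, here is a minimal sketch of how an optimizer can rewrite a plan tree by applying rules until a fixed point is reached, in the spirit of Catalyst's logical optimization phase. The tuple-based expression format, the `fold_constants` rule, and the `optimize` driver are hypothetical teaching constructs for illustration only, not Spark's actual internals (in PySpark itself, you can inspect the parsed, analyzed, optimized, and physical plans of a DataFrame with `df.explain(extended=True)`).

```python
# Toy rule-based plan rewriting, illustrating constant folding.
# Expressions are tuples: ("lit", value) or ("add", left, right).
# These names are illustrative assumptions, not Spark's API.

def fold_constants(node):
    """Rule: rewrite addition of two literals into a single literal."""
    if (isinstance(node, tuple) and node[0] == "add"
            and node[1][0] == "lit" and node[2][0] == "lit"):
        return ("lit", node[1][1] + node[2][1])
    return node

def apply_rule(rule, node):
    """Apply a rule bottom-up over the whole expression tree."""
    if isinstance(node, tuple) and node[0] != "lit":
        node = (node[0],) + tuple(apply_rule(rule, c) for c in node[1:])
    return rule(node)

def optimize(plan, rules):
    """Run all rules repeatedly until the plan stops changing."""
    while True:
        new_plan = plan
        for rule in rules:
            new_plan = apply_rule(rule, new_plan)
        if new_plan == plan:
            return plan
        plan = new_plan

# 1 + (2 + 3) folds to the single literal 6.
plan = ("add", ("lit", 1), ("add", ("lit", 2), ("lit", 3)))
print(optimize(plan, [fold_constants]))  # ('lit', 6)
```

Real optimizers such as Catalyst work the same way at a high level: each rule is a pattern-matching tree transformation, and batches of rules are applied until a fixed point, but over far richer plan nodes (filters, joins, projections) and with many more rules.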