SparkSQL - Python Automation and Machine Learning for ICs - An Online Book: Python Automation and Machine Learning for ICs by Yougui Liao
Python Automation and Machine Learning for ICs: http://www.globalsino.com/ICs/
=================================================================================
SparkSQL is a module in Apache Spark for processing structured data through SQL and DataFrame APIs, with a built-in optimizer that improves query performance and efficiency. It allows users to run SQL queries on large datasets stored in various data sources such as Hive, HDFS, and more, as well as directly on Spark DataFrames, and it offers APIs in Java, Scala, Python, and R. The main features and functionalities of SparkSQL are:
SparkSQL makes it easier for users familiar with SQL to start working with Spark, as they can leverage their existing SQL knowledge to perform complex data analysis and processing on large datasets distributed across a cluster. The key goals of SparkSQL optimization include:
Through DataFrame-based APIs, both Apache Spark and Pandas in Python provide:
Creating SQL queries in Spark SQL begins with registering the data as a temporary view, a table-like handle designed for executing SQL queries. Spark SQL supports both temporary and global temporary views. A temporary view is confined to the local scope: it is only available within the Spark session in which it was created. In contrast, a global temporary view is accessible across the broader Spark application, allowing it to be shared among multiple Spark sessions.

Spark SQL memory optimization focuses on enhancing the runtime efficiency of SQL queries by reducing both query duration and memory usage. This optimization helps organizations save both time and resources.

Parquet is a columnar format compatible with many data processing systems. Spark SQL supports reading and writing data to and from Parquet files while preserving the data schema throughout. Other data sources, such as external APIs, MongoDB, and custom file formats, can also be used with Apache Spark SQL.

To create a global temporary view in Spark SQL, use the createGlobalTempView function. This function creates a view that is visible across multiple Spark sessions within the same Spark application. These views are stored in the global temporary database (global_temp) and are tied to the lifecycle of the Spark application rather than that of a single session.
===========================================