Dataset in Apache Spark - Python Automation and Machine Learning for ICs - An Online Book by Yougui Liao
Python Automation and Machine Learning for ICs: http://www.globalsino.com/ICs/
In Apache Spark, a "Dataset" is a distributed collection of data that provides the benefits of both Spark RDDs (Resilient Distributed Datasets) and Spark DataFrames, with optimized execution plans and strong typing. Datasets are part of Spark SQL and are primarily used for structured data processing.
Datasets in Spark are designed to provide an easier, more efficient way to handle structured and semi-structured data at scale, making big data processing tasks more straightforward and less prone to error. They strike a balance between the flexibility of RDDs and the performance optimizations of DataFrames. Datasets are the newest of Spark's data abstractions and, like RDDs and DataFrames, offer APIs for working with distributed collections of data. A Dataset is composed of strongly typed objects within the Java Virtual Machine (JVM); being strongly typed means that Datasets are type-safe, with the data type explicitly defined at the time of their creation. They combine the advantages of RDDs, including lambda functions and type safety, with the SQL optimizations of Spark SQL. The main characteristics of Datasets in Apache Spark are:
* Strong typing and type safety: each record is a JVM object whose type is fixed when the Dataset is created, so many errors are caught at compile time rather than at run time.
* Functional and relational APIs: transformations can be written as type-safe lambda functions (as with RDDs) or as SQL-style relational operations.
* Optimized execution: Dataset queries are planned by Spark SQL, yielding optimized execution plans.
These characteristics are illustrated in the sketch below.
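As a minimal sketch of these characteristics, the following Scala fragment builds a typed Dataset from case-class objects and mixes a type-safe lambda filter with a relational aggregation. Note that the typed Dataset API belongs to Spark's Scala/Java interface (PySpark exposes only the untyped DataFrame); the Wafer case class, its field names, and the sample values here are illustrative assumptions, not part of the original text.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative domain type; any case class with supported field types works.
case class Wafer(id: Long, lot: String, yieldPct: Double)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // brings in the encoders and toDS()/as[T] conversions

    // Strongly typed: each row is a Wafer object, checked at compile time.
    val wafers: Dataset[Wafer] = Seq(
      Wafer(1L, "LOT-A", 92.5),
      Wafer(2L, "LOT-A", 88.1),
      Wafer(3L, "LOT-B", 95.0)
    ).toDS()

    // RDD-style type-safe lambda combined with a SQL-style relational operation,
    // both planned by Spark SQL's optimizer.
    val highYieldPerLot = wafers
      .filter(w => w.yieldPct > 90.0)   // compile-time checked lambda
      .groupBy($"lot")                  // relational aggregation
      .count()

    highYieldPerLot.show()
    spark.stop()
  }
}
```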
toDS() is the function used to create a Dataset from a sequence in Spark's Scala API. After importing spark.implicits._, it converts a local sequence (Seq) or a Resilient Distributed Dataset (RDD) into a Dataset, a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. (A DataFrame, by contrast, is converted to a typed Dataset with as[T] rather than toDS().)
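The short Scala sketch below illustrates these conversions under the same local-session assumptions as the previous example; the Reading case class and its sample values are made-up names used only for demonstration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative record type for the conversion examples.
case class Reading(sensor: String, value: Double)

object ToDSExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ToDSExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // toDS() and as[T] come from these implicits

    // 1) Local sequence -> Dataset
    val fromSeq: Dataset[Reading] = Seq(Reading("s1", 0.42), Reading("s2", 1.37)).toDS()

    // 2) RDD -> Dataset
    val rdd = spark.sparkContext.parallelize(Seq(Reading("s3", 2.11)))
    val fromRdd: Dataset[Reading] = rdd.toDS()

    // 3) DataFrame -> typed Dataset (uses as[T] rather than toDS())
    val df = fromSeq.toDF()
    val fromDf: Dataset[Reading] = df.as[Reading]

    fromSeq.union(fromRdd).show()
    spark.stop()
  }
}
```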