This post will serve as a continuous knowledge dump for the "Learning Spark 2.0" book, where I'll collect quotes and notes that I find relevant (and hopefully you will too :]!)
- In Spark’s supported languages, columns are objects with public methods (represented by the Column type).
Code example that uses `expr()`, `withColumn()`, and `col()`:

```scala
blogsDF.withColumn("Big Hitters", (expr("Hits > 10000"))).show()
```

The above adds a new column, `Big Hitters`, based on the conditional expression. Note that the `expr(...)` part can be replaced with `col("Hits") > 10000`.
- In Scala, prefixing a string with the dollar sign `$` is rendered as a function call that converts the string into type `Column`. Details can be found here.
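A minimal sketch of the `$` notation, reusing the `blogsDF`/`Hits` names from the example above (the session setup and the toy data are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("ColumnDemo").getOrCreate()
import spark.implicits._ // brings the $"..." string-to-Column syntax into scope

val blogsDF = Seq((1, 15000), (2, 800)).toDF("Id", "Hits")

// $"Hits" and col("Hits") both evaluate to a Column object
blogsDF.select($"Hits" * 2).show()
blogsDF.select(col("Hits") * 2).show()
```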
- In Spark, a `Row` is an object consisting of an ordered collection of fields, so you can access its fields by an index starting from 0.
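A quick sketch of index-based access (the field values here are made up for illustration):

```scala
import org.apache.spark.sql.Row

// A Row holds an ordered collection of fields of possibly mixed types
val blogRow = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015")

// Access fields by 0-based index
println(blogRow(1))        // Reynold
println(blogRow.getInt(4)) // 255568
```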
- Parquet does not support some symbols and whitespace characters in column names (source).
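A hedged sketch of the usual workaround, renaming an offending column before writing (the data, column names, and output path are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ParquetNames").getOrCreate()
import spark.implicits._

// A column name containing whitespace is fine in memory...
val pagesDF = Seq(("home", 120), ("about", 45)).toDF("page", "page views")

// ...but should be renamed before writing to Parquet
pagesDF
  .withColumnRenamed("page views", "page_views")
  .write.mode("overwrite").parquet("/tmp/pages.parquet")
```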
- Check the PySpark date and time functions here. Check these docs for all of PySpark's SQL-related functions. Finally, check Spark SQL's built-in functions here; some are equivalent in definition, for example `=` and `==`, or `!=` and `<>`.
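A small sketch of that operator equivalence, using `filter` with SQL expression strings on a toy DataFrame (names and data are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SqlOperators").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// "=" and "==" are interchangeable in Spark SQL expressions
df.filter("value = 1").show()
df.filter("value == 1").show()

// "!=" and "<>" are likewise interchangeable
df.filter("value != 1").show()
df.filter("value <> 1").show()
```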
- From the book: The DataFrame API also offers the collect() method, but for extremely large DataFrames this is resource-heavy (expensive) and dangerous, as it can cause out-of-memory (OOM) exceptions. Unlike count(), which returns a single number to the driver, collect() returns a collection of all the Row objects in the entire DataFrame or Dataset. If you want to take a peek at some Row records you’re better off with take(n), which will return only the first n Row objects of the DataFrame.
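A small sketch contrasting the two on a toy DataFrame (on a genuinely large DataFrame only `take`/`show` are safe for peeking):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TakeVsCollect").getOrCreate()

val bigDF = spark.range(1000000).toDF("id")

// count() returns a single number to the driver
println(bigDF.count())

// take(n) ships only the first n Row objects to the driver
bigDF.take(5).foreach(println)

// collect() ships every Row to the driver and can cause an OOM on large data
// val allRows = bigDF.collect()
```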
- DSL operators are domain-specific language operators.
- "Datasets" -> Java/Scala (as they're compile-time/type-safe languages)
- "Dataframe" -> Python (as types are dynamically inferred/assigned during execution)
- Tricky note: "Dataframe" -> Scala (because in Scala, a DataFrame is just an alias for untyped Dataset[Row])
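A tiny sketch of that alias (the session setup and toy data are assumptions):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder.appName("AliasDemo").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Compiles because DataFrame is a type alias for Dataset[Row]
val alsoDF: Dataset[Row] = df
```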
- A `Row` is a generic object type in Spark, holding a collection of mixed types that can be accessed using an index.
- An example of defining a Scala case class, `DeviceIoTData`, to use instead of `Row` can be found here; a sketch follows below.
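A minimal sketch of that idea, assuming a JSON file of device readings at a hypothetical path and using only a few fields (the path and the exact field set are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DatasetDemo").getOrCreate()
import spark.implicits._

// Case class mirroring (part of) the JSON schema, giving typed fields instead of a generic Row
case class DeviceIoTData(device_id: Long, device_name: String, temp: Long, humidity: Long)

// Read as a DataFrame (Dataset[Row]), then convert to a typed Dataset[DeviceIoTData]
val ds = spark.read.json("/data/iot_devices.json").as[DeviceIoTData]

// Field access is now type-checked at compile time
ds.filter(d => d.temp > 30 && d.humidity > 70).show(5)
```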
- A summary of the above bullet points:
- When we use Datasets, the underlying Spark SQL engine handles the creation/conversion/serialization/deserialization of JVM objects as well as Java heap memory management with the help of Dataset encoders.
DataFrames Versus Datasets:
- Use Datasets if you want:
- strict compile-time type safety and don’t mind creating multiple case classes for a specific Dataset[T].
- Tungsten’s efficient serialization with Encoders.
- Use DataFrames if you want:
- SQL-like queries to perform relational transformations on your data.
- unification, code optimization, and simplification of APIs across Spark components.
- an easy transition from R language.
- to precisely instruct Spark how to do a query using RDDs.
- space and speed efficiency.