This post will serve as a continuous knowledge dump for the "Learning Spark 2.0" book, where I'll collect quotes and notes that I find relevant (and hopefully you will too :]!)
- In Spark’s supported languages, columns are objects with public methods (represented by the Column type).
Code example that uses `expr()`, `withColumn()`, and `col()`:

```scala
blogsDF.withColumn("Big Hitters", (expr("Hits > 10000"))).show()
```

The above adds a new column, `Big Hitters`, based on the conditional expression. Note that the `expr(...)` part can be replaced with `col("Hits") > 10000`.
- In Scala, prefixing a string with the dollar sign `$` is rendered as a function call that converts the string into type `Column`. Details can be found here.
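A minimal sketch of the `$` notation, reusing the `blogsDF`/`Hits` names from the example above (the session setup and the toy data are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("ColumnDemo").getOrCreate()
import spark.implicits._ // brings the $"..." string-to-Column syntax into scope

val blogsDF = Seq((1, 15000), (2, 800)).toDF("Id", "Hits")

// $"Hits" and col("Hits") both evaluate to a Column object
blogsDF.select($"Hits" * 2).show()
blogsDF.select(col("Hits") * 2).show()
```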
- In Spark, a `Row` is an object consisting of an ordered collection of fields, so you can access its fields by an index starting from 0.
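A quick sketch of index-based access (the field values here are made up for illustration):

```scala
import org.apache.spark.sql.Row

// A Row holds an ordered collection of fields of possibly mixed types
val blogRow = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015")

// Access fields by 0-based index
println(blogRow(1))        // Reynold
println(blogRow.getInt(4)) // 255568
```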
- Parquet does not support some symbols and whitespace characters in column names (source).
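A hedged sketch of the usual workaround, renaming an offending column before writing (the data, column names, and output path are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ParquetNames").getOrCreate()
import spark.implicits._

// A column name containing whitespace is fine in memory...
val pagesDF = Seq(("home", 120), ("about", 45)).toDF("page", "page views")

// ...but should be renamed before writing to Parquet
pagesDF
  .withColumnRenamed("page views", "page_views")
  .write.mode("overwrite").parquet("/tmp/pages.parquet")
```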
- Check the PySpark date and time functions here. Check these docs for all of PySpark's SQL-related functions. Finally, check Spark SQL's built-in functions here; some are equivalent in definition, for example `=` and `==`, or `!=` and `<>`.
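A small sketch of that operator equivalence, using `filter` with SQL expression strings on a toy DataFrame (names and data are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SqlOperators").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// "=" and "==" are interchangeable in Spark SQL expressions
df.filter("value = 1").show()
df.filter("value == 1").show()

// "!=" and "<>" are likewise interchangeable
df.filter("value != 1").show()
df.filter("value <> 1").show()
```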
- From the book: The DataFrame API also offers the collect() method, but for extremely large DataFrames this is resource-heavy (expensive) and dangerous, as it can cause out-of-memory (OOM) exceptions. Unlike count(), which returns a single number to the driver, collect() returns a collection of all the Row objects in the entire DataFrame or Dataset. If you want to take a peek at some Row records you’re better off with take(n), which will return only the first n Row objects of the DataFrame.
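A small sketch contrasting the two on a toy DataFrame (on a genuinely large DataFrame only `take`/`show` are safe for peeking):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TakeVsCollect").getOrCreate()

val bigDF = spark.range(1000000).toDF("id")

// count() returns a single number to the driver
println(bigDF.count())

// take(n) ships only the first n Row objects to the driver
bigDF.take(5).foreach(println)

// collect() ships every Row to the driver and can cause an OOM on large data
// val allRows = bigDF.collect()
```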
- DSL operators are domain-specific language operators.
- "Datasets" -> Java/Scala (as they're compile-time/type-safe languages)
- "Dataframe" -> Python (as types are dynamically inferred/assigned during execution)
- Tricky note: "Dataframe" -> Scala (because in Scala, a DataFrame is just an alias for untyped Dataset[Row])
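A tiny sketch of that alias (the session setup and toy data are assumptions):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder.appName("AliasDemo").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Compiles because DataFrame is a type alias for Dataset[Row]
val alsoDF: Dataset[Row] = df
```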
- A `Row` is a generic object type in Spark, holding a collection of mixed types that can be accessed using an index.
- An example of defining a Scala case class, `DeviceIoTData`, to use instead of `Row` can be found here; a sketch follows below.
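A minimal sketch of that idea, assuming a JSON file of device readings at a hypothetical path and using only a few fields (the path and the exact field set are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DatasetDemo").getOrCreate()
import spark.implicits._

// Case class mirroring (part of) the JSON schema, giving typed fields instead of a generic Row
case class DeviceIoTData(device_id: Long, device_name: String, temp: Long, humidity: Long)

// Read as a DataFrame (Dataset[Row]), then convert to a typed Dataset[DeviceIoTData]
val ds = spark.read.json("/data/iot_devices.json").as[DeviceIoTData]

// Field access is now type-checked at compile time
ds.filter(d => d.temp > 30 && d.humidity > 70).show(5)
```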
- A summary of the above bullet points:
- When we use Datasets, the underlying Spark SQL engine handles the creation/conversion/serialization/deserialization of JVM objects as well as Java heap memory management with the help of Dataset encoders.
DataFrames Versus Datasets:
- Use Datasets if you want:
- strict compile-time type safety and don’t mind creating multiple case classes for a specific Dataset[T].
- Tungsten’s efficient serialization with Encoders.
- Use DataFrames if you want:
- SQL-like queries to perform relational transformations on your data.
- unification, code optimization, and simplification of APIs across Spark components.
- an easy transition from R language.
- to precisely instruct Spark how to do a query using RDDs.
- space and speed efficiency.