1. DataFrame is a Dataset
Try searching for a DataFrame API in the Scala Spark documentation, where all functions like withColumn, select, etc., are listed. Surprisingly, you won't find one, because a DataFrame is essentially a Dataset[Row]. You'll only find an API doc for Dataset, as DataFrame is just an alias.
Scala is a statically typed language, and in Spark the DataFrame is considered the untyped API while Dataset is the typed API. Calling a DataFrame untyped is slightly misleading, though: its columns do have types, but Spark only checks them at runtime, whereas for a Dataset the type checking happens at compile time.
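Here is a minimal sketch of that difference (the Person case class and column names are made up for the example): a bad column name on a DataFrame still compiles and only fails when Spark analyzes the query at runtime, while the same typo on a Dataset field is caught by the Scala compiler.
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int) // hypothetical schema, just for illustration

val spark = SparkSession.builder().appName("typed-vs-untyped").master("local[*]").getOrCreate()
import spark.implicits._

// DataFrame is literally defined as a type alias: type DataFrame = Dataset[Row]
val df = Seq(Person("Alice", 30)).toDF()                   // untyped API: Dataset[Row]
val ds: Dataset[Person] = Seq(Person("Alice", 30)).toDS()  // typed API

// Compiles fine, fails only at runtime with an AnalysisException (no such column)
// df.select("agee")

// Does not compile at all: value agee is not a member of Person
// ds.map(p => p.agee)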
2. Physical Plan is Too Abstract? Go Deeper
Meet df.queryExecution.debug.codegen. This is a valuable feature in Spark that prints the generated code, which is a close representation of what Spark will actually execute.
Sometimes the Spark documentation is not enough, and black-box testing doesn't give conclusive answers either. The generated code shows what actually happens under the hood, which makes it a really handy tool. Yes, the output can look cryptic, but thanks to AI, we can decode it.
df.queryExecution.debug.codegen
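For a fuller, runnable illustration (the tiny DataFrame below is made up for the example), here is the call in context; explain(true) is included so you can read the plans side by side with the generated code.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("codegen-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
  .filter(col("id") > 1)
  .withColumn("id_plus_one", col("id") + 1)

// Prints the whole-stage generated Java code to the console
df.queryExecution.debug.codegen

// The usual plans (parsed, analyzed, optimized, physical), useful for comparison
df.explain(true)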
3. Symbol is a Simple Way to Refer to a Column
There are five ways we can refer to a column name:
// df is a DataFrame; spark is the active SparkSession
import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._ // needed for the $ and symbol syntax

// 1. Through the DataFrame itself; throws an error if the column doesn't exist
df("columnName")
// 2. Generic column reference, not bound to any particular DataFrame
col("columnName")
// 3. Parses a SQL expression, so it also accepts things like "colA + colB"
expr("columnName")
// 4. String interpolator, which makes expressions easy to write, e.g. $"colA" + $"colB"
$"columnName"
// 5. The simplest way, which uses Scala's Symbol feature
df.select('columnName)
4. Column's Nullable Property is Not a Constraint
A DataFrame column has three properties: a column name, a data type, and a nullable flag. It's a common misconception that Spark enforces the nullable flag as a constraint, the way databases enforce NOT NULL. In reality, it's just metadata that Spark uses for better execution planning.
For more details, check out https://medium.com/p/1d1b7b042adb
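A minimal sketch of this behaviour (the schema and values are made up for the example): a field declared with nullable = false still accepts a null row. Depending on the Spark version the null simply shows up, or causes wrong results further downstream, but it is not rejected the way a database NOT NULL constraint would reject it.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("nullable-demo").master("local[*]").getOrCreate()

// The schema declares name as non-nullable
val schema = StructType(Seq(StructField("name", StringType, nullable = false)))

// Spark does not validate the rows against the flag when the DataFrame is created
val rows = java.util.Arrays.asList(Row("Alice"), Row(null: String))
val df = spark.createDataFrame(rows, schema)

df.printSchema() // name: string (nullable = false)
df.show()        // the null row is not rejected; the flag is planning metadata, not a check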
5. Adding More Executors Doesn't Always Mean Faster Jobs
While Spark's architecture supports horizontal scaling, simply increasing the number of executors to speed up a slow job doesn't always yield the desired results. In many cases this approach backfires: the job runs no faster, or only slightly faster, while the extra resources significantly raise the cost.
Finding the right balance of executor and core count is crucial for optimizing job performance while controlling costs. Factors such as shuffle partitions, number of cores, number of executors, source file or table size, number of files, scheduler mode, driver's capacity, and network latency all need to be considered. Everything should be in sync to achieve optimal performance. Be cautious about adding more executors, especially in scenarios involving skewed data, as this can exacerbate issues rather than solve them.
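As a rough sketch of the knobs that have to move together (the numbers below are placeholders, not recommendations), this is the kind of configuration that matters far more than executor count alone:
import org.apache.spark.sql.SparkSession

// Placeholder values; the right numbers depend on data volume, skew, file layout, and cluster size
val spark = SparkSession.builder()
  .appName("tuning-sketch")
  .config("spark.executor.instances", "10")           // more executors is not automatically faster
  .config("spark.executor.cores", "4")                // cores per executor
  .config("spark.executor.memory", "8g")
  .config("spark.sql.shuffle.partitions", "200")      // should track shuffle data size, not executor count
  .config("spark.dynamicAllocation.enabled", "false") // or let the cluster scale executors for you
  .getOrCreate()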