Apache Spark Basics

#beginners #productivity #bitesize

This is a basic cheat sheet and glossary for getting started with Apache Spark. Every time we share a new post with terms or code snippets, they will also appear here in a generic form.

If you work with Apache Spark and are looking for a cheat sheet, this is for you as well!

First things first:

-1- The workspace:

First, we need to create the workspace. We are using a Databricks workspace, and here is a tutorial for creating it.

-2- Basic Apache Spark vocabulary:

DataFrame

A distributed collection of data organized into named columns. It provides operations to filter, group, and compute aggregates. DataFrame data is often distributed across multiple machines and can be held in memory or on disk.
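
To make the filter, group and aggregate operations concrete, here is a minimal sketch in Scala. The sample rows, the column names (name, dept, salary) and the object name DataFrameSketch are made up for illustration. The local session setup is an assumption for running outside Databricks, where a session named spark is normally already provided.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting.
    // In a Databricks notebook, getOrCreate() returns the existing session.
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with named columns.
    val employees = Seq(
      ("Ana", "eng", 120.0),
      ("Bo", "eng", 100.0),
      ("Cy", "ops", 80.0)
    ).toDF("name", "dept", "salary")

    // Filter: keep only the rows that match a column condition.
    val wellPaid = employees.filter(col("salary") >= 100.0)

    // Group and aggregate: one output row per department.
    val avgByDept = employees
      .groupBy("dept")
      .agg(avg("salary").as("avg_salary"))

    wellPaid.show()
    avgByDept.show()

    // Stop the session in a standalone program; skip this in a shared notebook.
    spark.stop()
  }
}
```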

Dataset

A strongly typed collection of objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
A Dataset is "lazy": transformations only describe the computation, and the work is triggered only when an action (such as count, collect or show) is invoked.
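
The typed API and the lazy behaviour can be sketched in Scala as follows. The Person case class, the sample values and the object name DatasetSketch are hypothetical, and the local session setup is an assumption for running outside Databricks.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetSketch {
  // Hypothetical domain type. It is declared outside the method
  // so that Spark can derive an encoder for it.
  final case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A strongly typed collection: every element is a Person.
    val people: Dataset[Person] = Seq(Person("Ana", 34), Person("Bo", 17)).toDS()

    // Transformations are checked at compile time and are lazy:
    // these lines only build a plan, no job runs yet.
    val adultNames: Dataset[String] = people.filter(_.age >= 18).map(_.name)

    // Moving between the typed Dataset and the untyped DataFrame view.
    val asRows: DataFrame = people.toDF()
    val typedAgain: Dataset[Person] = asRows.as[Person]

    // Actions trigger the actual computation.
    val names: Array[String] = adultNames.collect()
    println(names.mkString(", "))
    println(typedAgain.count())

    // Stop the session in a standalone program; skip this in a shared notebook.
    spark.stop()
  }
}
```

Note that the typed Dataset API is available in Scala and Java. In Python, Spark exposes only the DataFrame API.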

RelationalGroupedDataset

A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup.
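
As a minimal sketch of where this type shows up: groupBy, rollup and cube do not return a DataFrame directly. They return a RelationalGroupedDataset, and calling an aggregation method on it yields a DataFrame again. The sales data, the column names and the object name GroupedSketch below are made up for illustration, and the local session setup is an assumption for running outside Databricks.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.sum

object GroupedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("grouped-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val sales = Seq(
      ("eu", "books", 10),
      ("eu", "games", 20),
      ("us", "books", 30)
    ).toDF("region", "category", "amount")

    // groupBy returns a RelationalGroupedDataset, not a DataFrame.
    val byRegion: RelationalGroupedDataset = sales.groupBy("region")

    // Aggregation methods turn the grouping back into a DataFrame.
    val totals: DataFrame = byRegion.agg(sum("amount").as("total"))
    val counts: DataFrame = byRegion.count()

    // rollup and cube also return a RelationalGroupedDataset.
    // Their results include subtotal rows, where null marks a rolled-up column.
    val withSubtotals: DataFrame = sales.rollup("region", "category").sum("amount")
    val allCombinations: DataFrame = sales.cube("region", "category").sum("amount")

    totals.show()
    counts.show()
    withSubtotals.show()
    allCombinations.show()

    // Stop the session in a standalone program; skip this in a shared notebook.
    spark.stop()
  }
}
```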

This is an evolving page; more terms, code snippets and architecture designs will be added.
