While researching how to install a JupyterLab instance with Spark support (via PySpark), I ran into a lot of outdated content. That changed when I came across an up-to-date Docker image provided by the Jupyter Docker Stacks project.
https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#apache-spark
Python and Java have good support for the ARM architecture, so we can reasonably expect any framework built on these platforms to run well on a Raspberry Pi. I then just ran the Docker command to start the environment.
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
It couldn't be easier.
Checking CPU and memory
Let's try some simple code to check that PySpark and its SQL API are available.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Python Spark SQL basic example")
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "4")
    .config("spark.eventLog.enabled", "true")
    .config("spark.sql.shuffle.partitions", "50")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
We got a SparkSession (in-memory):
- Version: v3.5.0
- Master: local[*]
- AppName: Python Spark SQL basic example
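To see what the Pi actually exposes to Spark, a quick check can be run right after creating the session. This is only a small sketch: it reads back the configuration we just set and the host's CPU count.

import os

# Cores visible to the host operating system.
print("Host CPUs:", os.cpu_count())

# Parallelism Spark derives from local[*], plus the executor settings set above.
print("Default parallelism:", spark.sparkContext.defaultParallelism)
print("Executor memory:", spark.conf.get("spark.executor.memory"))
print("Executor cores:", spark.conf.get("spark.executor.cores"))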
We can set up a DataFrame.
from pyspark.sql.types import StructType, StructField, FloatType, BooleanType
from pyspark.sql.types import DoubleType, IntegerType, StringType

# Set up the schema
schema = StructType([
    StructField("User ID", IntegerType(), True),
    StructField("Username", StringType(), True),
    StructField("Browser", StringType(), True),
    StructField("OS", StringType(), True),
])

# Add data
data = [
    (1580, "Barry", "FireFox", "Windows"),
    (5820, "Sam", "MS Edge", "Linux"),
    (2340, "Harry", "Vivaldi", "Windows"),
    (7860, "Albert", "Chrome", "Windows"),
    (1123, "May", "Safari", "macOS"),
]

# Build the DataFrame
user_data_df = spark.createDataFrame(data, schema=schema)
user_data_df.show()
+-------+--------+-------+-------+
|User ID|Username|Browser| OS|
+-------+--------+-------+-------+
| 1580| Barry|FireFox|Windows|
| 5820| Sam|MS Edge| Linux|
| 2340| Harry|Vivaldi|Windows|
| 7860| Albert| Chrome|Windows|
| 1123| May| Safari| macOS|
+-------+--------+-------+-------+
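The same data can also be queried through the DataFrame API directly, without SQL; for example:

from pyspark.sql.functions import col

# Equivalent of a WHERE clause, expressed with the DataFrame API.
user_data_df.filter(col("OS") == "Windows").select("Username", "Browser").show()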
We can save this DataFrame to a physical table in a new database.
spark.sql("CREATE DATABASE raspland")
user_data_df.write.saveAsTable("raspland.user_data")
And then run SQL commands over the table.
spark.sql("SELECT * FROM raspland.user_data WHERE OS = 'Linux'").show()
+-------+--------+-------+-----+
|User ID|Username|Browser| OS|
+-------+--------+-------+-----+
| 5820| Sam|MS Edge|Linux|
+-------+--------+-------+-----+
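Note that both statements fail if they are run twice, because the database and the table already exist. To make the notebook re-runnable, the calls can be guarded like this:

# Guard the database creation and overwrite the table on subsequent runs.
spark.sql("CREATE DATABASE IF NOT EXISTS raspland")
user_data_df.write.mode("overwrite").saveAsTable("raspland.user_data")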
From the Terminal, we can inspect the Parquet files that store the data of our new table.
(base) jovyan@0e1d1463f0b0:~/spark-warehouse/raspland.db/user_data$ ls
part-00000-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00001-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00002-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00003-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
_SUCCESS
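Those files can also be read back directly as Parquet, bypassing the table definition. The path below is relative to the notebook's home directory, as shown in the listing above.

# Read the warehouse files directly, without going through the table.
raw_df = spark.read.parquet("spark-warehouse/raspland.db/user_data")
raw_df.show()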
Install the jupyterlab-sql-editor extension to get enhanced functionality for SQL execution.
Here is a post to learn more about the extension:
https://towardsdatascience.com/jupyterlab-sql-cell-editor-e6ac865b42df
You will need to run two commands in the Terminal to install the server prerequisites.
pip install jupyterlab-lsp jupyterlab-sql-editor
sudo npm install -g sql-language-server
Load the extension in the notebook ...
%load_ext jupyterlab_sql_editor.ipython_magic.sparksql
and SparkSQL cells will be enabled.
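Assuming the extension exposes the %%sparksql cell magic described in the linked post, a cell can then run SQL directly against our table, for example:

%%sparksql
SELECT Username, Browser
FROM raspland.user_data
WHERE OS = 'Windows'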
Metastore
Spark is meant to use a Hive metastore to manage databases and tables; we will talk about Hive later. Until now, our data has been physically persisted in Parquet files, while our metadata has lived only in memory, so it disappears when the session ends.
We need to enable Hive Support to persist our metadata. So let's delete our data, recreate our Spark session, and run the sample again.
In the Terminal:
rm -rf spark-warehouse/
Change the code in the notebook to enable Hive:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Python Spark SQL basic example")
    ...
    .enableHiveSupport()
    .getOrCreate()
)
After running our database creation script, certain files and folders appear in the file explorer.
While spark-warehouse houses our Parquet files, metastore_db serves as a Derby repository that stores our database and table definitions.
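To confirm that these definitions now survive a restart, one can restart the kernel, rebuild a Hive-enabled session, and ask the metastore what it still knows about. A minimal sketch:

# After restarting the kernel, rebuild a Hive-enabled session and check
# that the database and table definitions are still registered.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Python Spark SQL basic example")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()           # should include 'raspland'
spark.sql("SHOW TABLES IN raspland").show()  # should list 'user_data'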
Derby is a lightweight relational database management system (RDBMS) implemented in Java. It is often used as a local, embedded metastore for Spark's SQL component when running in standalone mode.
Hive is a data warehousing and SQL query engine for Hadoop, originally developed by Facebook. It allows you to query big data in a distributed storage architecture like Hadoop's HDFS using SQL-like syntax. Hive's architecture includes a metadata repository stored in an RDBMS, which is often referred to as the Hive Metastore.
The standalone installation of Spark does not inherently include Hive, but it has built-in support for connecting to a Hive metastore if you have one set up separately.
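For reference, connecting the same code to a separately hosted metastore is mostly a matter of configuration. The sketch below assumes a running Hive metastore service; the host, port, and warehouse path are placeholders.

from pyspark.sql import SparkSession

# Hypothetical external metastore: host, port and warehouse path are placeholders.
spark = (
    SparkSession.builder
    .appName("Python Spark SQL basic example")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .config("spark.sql.warehouse.dir", "/path/to/shared/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)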
Thank you for reading. Let me know in the comments what else you might be interested in to complement this article.