Akmal Chaudhri for SingleStore

Quick tip: Using SingleStoreDB with Delta Lake

Abstract

This short article shows how to install Delta Lake with Apache Spark on Deepnote. We'll use the SingleStore Spark Connector to read data from SingleStoreDB and write it to Delta Lake, and then read the data from Delta Lake and write it back to SingleStoreDB.

Introduction

Previously, we explored how to use Apache Iceberg with SingleStoreDB via Spark DataFrames. In this article, we'll focus on Delta Lake and describe one possible configuration for using it with SingleStoreDB from a Python notebook environment.

Create a Deepnote account

We'll create a free account on the Deepnote website. Once logged in, we'll create a new Deepnote project to give us a new notebook. We'll also need to create two folders (jars and warehouse).

In the jars folder, we'll store the following files:

- singlestore-jdbc-client-1.1.4.jar
- singlestore-spark-connector_2.12-4.1.2-spark-3.3.0.jar
- spray-json_3-1.3.6.jar
- commons-dbcp2-2.9.0.jar
- commons-pool2-2.11.1.jar

The warehouse folder will be used to store our Delta Lake.
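
Both folders can be created through Deepnote's file browser, or directly from a notebook cell:

! mkdir -p jars warehouse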

Create a SingleStoreDB Cloud account

A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use Iris Demo Group as our Workspace Group Name and iris-demo as our Workspace Name.

Once we've created our database in the following steps, we'll make a note of our password and host name, as we'll need them to connect from our notebook.

Create a Database and Table

In our SingleStoreDB Cloud account, let's use the SQL Editor to create a new database and table, and to populate the table. We can use a GitHub Gist for this task.
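
The Gist contains the full SQL script. As a rough sketch, it does something like the following; the table schema here is inferred from the query output shown later in this article, so the actual Gist may differ:

-- Sketch only: schema inferred from the query output later in this article
CREATE DATABASE iris_db;

USE iris_db;

CREATE TABLE iris (
     sepal_length FLOAT,
     sepal_width FLOAT,
     petal_length FLOAT,
     petal_width FLOAT,
     species VARCHAR(20)
);

-- Populate the table with the Iris dataset, one row per flower, for example:
INSERT INTO iris VALUES (5.1, 3.5, 1.4, 0.2, 'Iris-setosa');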

Deepnote notebook

Let's now start to fill out our notebook.

Install Software

First, we'll install a JDK, followed by pyspark and delta-spark:

# /usr/share/man/man1 must exist for the JDK package to install cleanly
! sudo apt-get update
! sudo mkdir -p /usr/share/man/man1
! sudo apt-get install -y openjdk-11-jdk

# Spark and Delta Lake Python packages
! pip install pyspark==3.3.2
! pip install delta-spark==2.2.0

Once the installation is complete, we'll prepare our SparkSession:

import pyspark
from delta import configure_spark_with_delta_pip

builder = (pyspark.sql.SparkSession.builder.appName("MyApp")
    # JARs for the SingleStore JDBC driver, the Spark Connector and its dependencies
    .config("spark.jars", ",".join([
        "jars/singlestore-jdbc-client-1.1.4.jar",
        "jars/singlestore-spark-connector_2.12-4.1.2-spark-3.3.0.jar",
        "jars/spray-json_3-1.3.6.jar",
        "jars/commons-dbcp2-2.9.0.jar",
        "jars/commons-pool2-2.11.1.jar"
    ]))
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the Delta Lake JARs to the classpath
spark = configure_spark_with_delta_pip(builder).getOrCreate()

We can check the version of Spark as follows:

spark.version

The output should be:

'3.3.2'

Connect to SingleStoreDB

First, we'll provide connection details for SingleStoreDB:

host = "<host>"
password = "<password>"

port = "3306"
cluster = host + ":" + port

We'll replace the <host> and <password> with the values from our SingleStoreDB Cloud account.
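
As an alternative to storing the password in the notebook, we could prompt for it at runtime using Python's standard getpass module:

from getpass import getpass

# Prompt for the password instead of hardcoding it in the notebook
password = getpass("SingleStoreDB password: ")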

We'll now set some parameters for the SingleStore Spark Connector:

# Endpoint, credentials and pushdown setting for the SingleStore Spark Connector
spark.conf.set("spark.datasource.singlestore.ddlEndpoint", cluster)
spark.conf.set("spark.datasource.singlestore.user", "admin")
spark.conf.set("spark.datasource.singlestore.password", password)
spark.conf.set("spark.datasource.singlestore.disablePushdown", "false")

Create DataFrame from SingleStoreDB, Write to Delta Lake

We can create a DataFrame from SingleStoreDB as follows:

iris_df = (spark.read
                .format("singlestore")
                .load("iris_db.iris")
)

Next, we'll check the data:

iris_df.show(5)

The result should be similar to the following:

+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|    species|
+------------+-----------+------------+-----------+-----------+
|         5.7|        3.8|         1.7|        0.3|Iris-setosa|
|         5.1|        3.8|         1.5|        0.3|Iris-setosa|
|         4.8|        3.4|         1.9|        0.2|Iris-setosa|
|           5|        3.4|         1.6|        0.4|Iris-setosa|
|         4.4|        3.2|         1.3|        0.2|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 5 rows
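
We can also confirm the schema that the connector derived from the SingleStoreDB table:

iris_df.printSchema()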

Let's now write the data to the Delta Lake:

(iris_df.write
        .format("delta")
        .save("warehouse/delta-table")
)
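
Each write to a Delta table is recorded in its transaction log. As an optional check, we can inspect the table's history using the DeltaTable API provided by delta-spark:

from delta.tables import DeltaTable

# Load the Delta table from its storage path and show its commit history
delta_table = DeltaTable.forPath(spark, "warehouse/delta-table")
delta_table.history().select("version", "timestamp", "operation").show()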

Create DataFrame from Delta Lake, Write to SingleStoreDB

Now, let's read the data back from the Delta Lake:

new_iris_df = (spark.read
                    .format("delta")
                    .load("warehouse/delta-table")
)

Next, we'll check the data:

new_iris_df.show(5)

The result should be similar to the following:

+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|    species|
+------------+-----------+------------+-----------+-----------+
|         5.7|        3.8|         1.7|        0.3|Iris-setosa|
|         5.1|        3.8|         1.5|        0.3|Iris-setosa|
|         4.8|        3.4|         1.9|        0.2|Iris-setosa|
|         5.0|        3.4|         1.6|        0.4|Iris-setosa|
|         4.4|        3.2|         1.3|        0.2|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 5 rows
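
It's worth noting that Delta Lake also supports time travel. For example, we could read a specific version of the table using the versionAsOf option; our table currently has just one version, 0:

# Time travel: read version 0 of the Delta table
old_iris_df = (spark.read
                    .format("delta")
                    .option("versionAsOf", 0)
                    .load("warehouse/delta-table")
)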

Let's now write the data to SingleStoreDB:

(new_iris_df.write
            .format("singlestore")
            .option("loadDataCompression", "LZ4")  # compress the data as it's loaded
            .mode("overwrite")
            .save("iris_db.new_iris")  # database.table
)
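
We can also verify the write from the notebook by reading the new table back through the connector and counting the rows:

# Read the newly created table back through the connector
check_df = (spark.read
                 .format("singlestore")
                 .load("iris_db.new_iris")
)
check_df.count()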

From SingleStoreDB Cloud, we can check that the new_iris table was created, and we can query the data:

USE iris_db;

SELECT * FROM new_iris LIMIT 5;

Summary

Using Spark DataFrames, we can seamlessly move data between SingleStoreDB and Delta Lake.
