Authors: Wei Wang and Amey Banarse
Originally Published on October 19, 2021 at https://blog.yugabyte.com/
At Distributed SQL Summit 2021 (https://distributedsql.org/), we presented a workshop on how to build an application using the YugabyteDB Spark Connector (https://github.com/yugabyte/spark-cassandra-connector/tree/2.4-yb) and Yugabyte Cloud to deliver business outcomes for our customers.
The YugabyteDB Spark Connector brings together two best-of-breed technologies: Apache Spark, an industry-leading distributed computing engine, and YugabyteDB, a modern, cloud-native distributed SQL database. The connector allows customers to seamlessly and natively read from, perform complex ETL on, and write to YugabyteDB.
This native integration removes the complexity and guesswork in deciding what processing should happen where. With the optimized connector, supported operations such as filters and column projections are pushed down and executed by YugabyteDB directly, making the application much more scalable and performant.
In this blog post, we recap some highlights from the workshop, and show you how to get started with your first application.
Workshop recording and slides
You can check out the complete tutorial by accessing the workshop recording (https://www.yugabyte.com/dss-2021-on-demand/?utm_campaign=DSS21&utm_medium=email&_hsmi=164556599&_hsenc=p2ANqtz-9aPU72_LtnHCEhXrHHR8zWPundujwpXJUq9VLF0tRmlcRdlvqHNLHs1qfJopA_T-TASED8Fk0tHNvKvao3XFi3cTvzyg&utm_content=164556599&utm_source=hs_email) and slides (https://blog.yugabyte.com/wp-content/uploads/2021/10/DSS-2021-%E2%80%94-Yugabyte-Cloud-and-Spark-Workshop-Slides-Final.pdf).
Brief introduction to YugabyteDB and Yugabyte Cloud
Cloud native enterprise applications in the multi-cloud world demand a modern database that is highly resilient, horizontally scalable, geographically distributed, and cloud agnostic. YugabyteDB (https://blog.yugabyte.com/reimagining-the-rdbms-for-the-cloud/) meets these challenges and is the database of choice for organizations building microservices and born-in-the-cloud apps. In addition, YugabyteDB provides multiple APIs by converging SQL and NoSQL, which simplifies the polyglot data architecture needs of the enterprise.
Yugabyte Cloud (http://www.yugabyte.com/cloud) is a fully-managed offering of YugabyteDB and unlocks the power of “any.” Organizations can run any app at any scale, anywhere, accessible at any time and running in any cloud, a perfect addition to a multi-cloud world.
Getting started with the YugabyteDB Spark Connector
The latest version of the YugabyteDB Spark Connector is Spark 3.0 compatible and allows you to expose YugabyteDB tables as Spark RDDs, write Spark RDDs to Yugabyte tables, and execute arbitrary CQL queries in your Spark applications. Key features of the YugabyteDB Spark Connector include:
-Read from YugabyteDB: Exposes YugabyteDB tables as Spark RDDs
-Write to YugabyteDB: Saves RDDs back to YugabyteDB tables via implicit saveToCassandra calls (see the sketch after this list)
-Native JSON data support using the JSONB data type, which Cassandra lacks
-Supports PySpark DataFrames
-Performance optimizations with predicate pushdowns
-Cluster, topology, and partition awareness
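To make the first two features concrete, here is a minimal sketch of reading a YugabyteDB table as an RDD and writing rows back through the connector. It assumes a spark-shell session launched with the connector (so sc is the active SparkContext) and the test.employees_json table created later in the prerequisites; employees_json_copy is a hypothetical copy table with the same schema.
import com.datastax.spark.connector._
//Read: expose a YugabyteDB table as a Spark RDD
val employees = sc.cassandraTable("test", "employees_json")
println(s"rows: ${employees.count()}")
//Write: the implicit saveToCassandra call provided by the connector
employees
  .map(r => (r.getInt("department_id"), r.getInt("employee_id"), r.getString("dept_name")))
  .saveToCassandra("test", "employees_json_copy",
    SomeColumns("department_id", "employee_id", "dept_name"))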
Understanding the application architecture
As shown in the architecture diagram from the workshop, we are building a Scala application using the YugabyteDB Spark Connector to demonstrate:
-How Spark integrates with Yugabyte Cloud to develop applications
-How YugabyteDB models JSON data efficiently using a JSONB data type
-Key features of the connector
Prerequisites
Before you get started, you’ll need to have the following software installed on your machine:
-Java JDK 1.8
-Spark 3.0 and Scala 2.12 (see the build.sbt sketch after this list if you want to package a standalone application)
wget https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
tar xvf spark-3.0.3-bin-hadoop2.7.tgz
cd spark-3.0.3-bin-hadoop2.7
./bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions --packages com.yugabyte.spark:spark-cassandra-connector_2.12:3.0-yb-8
-Yugabyte Cloud access: Create a YugabyteDB cluster from Yugabyte Cloud (https://www.yugabyte.com/cloud/) and follow the instructions to configure and connect to the cluster, as well as create database objects using the script namespace.sql (https://github.com/YB-WangTx/yugabyteSparkDemo/blob/main/namespace.sql)
./bin/ycqlsh -h your_cluster_ip -f namespace.sql
//A DDL example with the JSONB data type
CREATE KEYSPACE test;
CREATE TABLE test.employees_json (
  department_id INT,
  employee_id INT,
  dept_name TEXT,
  salary DOUBLE,
  phone JSONB,
  PRIMARY KEY (department_id, employee_id)
);
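If you prefer to package the demo as a standalone Scala application instead of running it in spark-shell, a minimal build.sbt along these lines should work; the connector coordinates mirror the --packages flag above, while the Scala and Spark patch versions are assumptions you may need to adjust.
//build.sbt (sketch): Scala 2.12, Spark 3.0, and the YugabyteDB Spark Connector
scalaVersion := "2.12.15"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.3" % Provided,
  "org.apache.spark" %% "spark-sql" % "3.0.3" % Provided,
  "com.yugabyte.spark" %% "spark-cassandra-connector" % "3.0-yb-8"
)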
Steps for building your first application
- Import the libraries required to build the application:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnectorConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra._ //provides the cassandraFormat writer used later
- Configure Yugabyte Cloud connectivity:
val host = "748fdee2-aabe-4d75-a698-a6514e0b19ff.aws.ybdb.io"
val user = "admin"
val password = "your password for admin"
val keyStore = "/Users/xxx/Documents/spark3yb/yb-keystore.jks"
- Now create a Spark session and connect to Yugabyte Cloud:
val conf = new SparkConf()
.setAppName("yb.spark-jsonb")
.setMaster("local[1]")
.set("spark.cassandra.connection.localDC", "us-east-2")
.set("spark.cassandra.connection.host", "127.0.0.1")
.set("spark.sql.catalog.ybcatalog",
"com.datastax.spark.connector.datasource.CassandraCatalog")
.set("spark.sql.extensions",
"com.datastax.spark.connector.CassandraSparkExtensions")
val spark = SparkSession.builder()
.config(conf)
.config("spark.cassandra.connection.host", host)
.config("spark.cassandra.connection.port", "9042")
.config("spark.cassandra.connection.ssl.clientAuth.enabled", true)
.config("spark.cassandra.auth.username", user)
.config("spark.cassandra.auth.password", password)
.config("spark.cassandra.connection.ssl.enabled", true)
.config("spark.cassandra.connection.ssl.trustStore.type", "jks")
.config("spark.cassandra.connection.ssl.trustStore.path", keyStore)
.config("spark.cassandra.connection.ssl.trustStore.password", "ybcloud")
.withExtensions(new CassandraSparkExtensions)
.getOrCreate()
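Before moving on, it helps to confirm that the session can actually reach the cluster. A quick sanity check, assuming the test.employees_json table from the prerequisites exists, is to list the keyspaces visible through the registered catalog and print the table's schema:
//Sanity check: both calls go through the "ybcatalog" catalog registered above
spark.sql("SHOW NAMESPACES FROM ybcatalog").show(false)
spark.read.table("ybcatalog.test.employees_json").printSchema()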
- Process the data by reading from YugabyteDB, performing an ETL step in Spark (specifically, a window function), and saving the result back to YugabyteDB:
//Read data into a data frame from a YB table
val df_yb = spark.read.table("ybcatalog.test.employees_json")
//Perform window function
val windowSpec = Window.partitionBy("department_id").orderBy("salary")
df_yb.withColumn("row_number",row_number.over(windowSpec)).show(false)
df_yb.withColumn("rank",rank().over(windowSpec)).show(false)
//Write back to a YB table
df_yb.write.cassandraFormat("employees_json_copy", "test").mode("append").save()
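As a small extension of the window-function step (not part of the original demo), the same approach can keep only the top earner per department before displaying or writing the result:
//Keep only the highest-paid employee per department (ordered by salary descending)
val bySalaryDesc = Window.partitionBy("department_id").orderBy(col("salary").desc)
df_yb.withColumn("rn", row_number().over(bySalaryDesc))
  .filter(col("rn") === 1)
  .drop("rn")
  .show(false)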
- Now query the data in YugabyteDB:
YugabyteDB stores JSON data efficiently in its native JSONB data type, and the connector optimizes query performance with column pruning and predicate pushdown:
//Column Pruning
val query1 = "SELECT department_id, employee_id, get_json_object(phone,'$.code') as code FROM ybcatalog.test.employees_json WHERE get_json_string(phone, '$.key(1)') = '1400' order by department_id limit 2";
val df_sel1 = spark.sql(query1)
df_sel1.explain
//Predicate pushdown
val query2 = "SELECT department_id, employee_id, get_json_object(phone, '$.key[1].m[2].b') as key FROM ybcatalog.test.employees_json WHERE get_json_string(phone, '$.key[1].m[2].b') = '400' order by department_id limit 2";
val df_sel2 = spark.sql(query2)
df_sel2.show(false)
//verify with the explain plan from YB
df_sel2.explain
Conclusion
YugabyteDB is an excellent choice of distributed SQL database for storing critical business data such as systems of record and product catalogs. Spark provides all the capabilities you need to perform complex computations on this data by leveraging the YugabyteDB Spark Connector.
Next steps
Interested in learning more about the YugabyteDB Spark Connector? Download it here (https://github.com/yugabyte/spark-cassandra-connector/tree/2.4-yb) to get started!
The code for the application we just walked through can be found on GitHub (https://github.com/YB-WangTx/yugabyteSparkDemo). You can also sign up for Yugabyte Cloud (https://cloud.yugabyte.com/signup), a fully managed YugabyteDB-as-a-service.