The Data Engineer role requires a variety of skills, including building data pipelines, evaluating databases, and designing and managing schemas. In short, the Data Engineer extracts, loads, manipulates and, in general, manages data. This work demands many skills, and if the process is not automated, the Data Engineer risks making many mistakes and wasting a lot of time resolving unexpected events.
Recently, I tested a very interesting framework that eases the Data Engineer's duties: Versatile Data Kit, released by VMware as open source on GitHub.
Versatile Data Kit allows Data Engineers to perform their tasks semi-automatically. In practice, they only have to focus on the data and on the general configuration of the framework, such as database setup and cron task scheduling, without worrying about manual deployment, versioning and similar chores.
In other words, Versatile Data Kit simplifies the life of Data Engineers: it lets them manage data simply and quickly, and deal with unexpected events promptly.
A Data Engineer can build a full data processing workload (a Data Job, in Versatile Data Kit terminology) in just three steps (a minimal step sketch follows the list):
- Ingest data
- Process data
- Publish data
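In practice, a Data Job is a directory of step files (Python or SQL) that Versatile Data Kit runs in order. As a minimal sketch, a Python step only needs to expose a `run()` function that receives the SDK's `job_input` object; the file name below is hypothetical:

```python
# 10_example_step.py: a hypothetical step file inside a Data Job directory.
# The Versatile Data Kit runner discovers step files and calls their run() function.
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # job_input is the SDK's entry point for ingestion, queries, properties, etc.
    # The ingest -> process -> publish logic of the step goes here.
    print("Data Job step executed")
```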
In this article, I give an overview of Versatile Data Kit, as well as a practical use case that shows its potential. For more information, you can read the full Versatile Data Kit documentation.
1 Overview
Versatile Data Kit is a framework that enables Data Engineers to develop, deploy, run and manage Data Jobs. A Data Job is a data processing workload.
Versatile Data Kit consists of two main components:
- A Data SDK, which provides all the tools for data extraction, transformation and loading, as well as a plugin framework that lets you extend the framework according to the specific requirements of your data application.
- A Control Service, which lets you create, deploy, manage and execute Data Jobs in a Kubernetes runtime environment.
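To give a flavor of the plugin framework, the sketch below shows the general shape of a VDK plugin built on its pluggy-style hooks. The configuration key is made up for illustration, and the exact hook names and signatures should be checked against the VDK plugin documentation:

```python
# my_vdk_plugin.py: an illustrative plugin module (the option name is made up).
from vdk.api.plugin.hook_markers import hookimpl


@hookimpl
def vdk_configure(config_builder) -> None:
    # Register an extra configuration option that Data Jobs can then rely on.
    config_builder.add(
        key="my_plugin_option",
        default_value="some-default",
        description="Example option contributed by this plugin.",
    )
```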
Versatile Data Kit manages three types of Data Jobs, as shown in light green in the following figure:
1.1 Ingestion Jobs
Ingestion Jobs involve pushing data from different sources to the Data Lake, which is the basic container for raw data. Data may be provided in different formats, such as CSV, JSON, SQL and so on.
Ingestion can be defined through different steps, including, but not limited to, the creation of the data schema and the process of loading data into a table. All these steps can be specified either by writing a Data Job (typically in Python or SQL) or through a plugin. Versatile Data Kit provides some prepackaged plugins, for example for CSV and SQL ingestion, but you can also implement your own.
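For instance, a Python ingestion step might push records through the SDK's ingestion API, roughly as in the sketch below; the source data and the destination table name are hypothetical:

```python
# 20_ingest_users.py: a hypothetical ingestion step.
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Raw records to push to the Data Lake (normally read from an API, a file, etc.).
    users = [
        {"id": 1, "name": "Alice", "country": "IT"},
        {"id": 2, "name": "Bob", "country": "DE"},
    ]
    for user in users:
        # Each payload is sent to a destination table in the Data Lake.
        job_input.send_object_for_ingestion(
            payload=user, destination_table="raw_users"
        )
```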
1.2 Processing Jobs
Processing Jobs let you create curated datasets from those contained in the Data Lake. Usually, these jobs involve... continue reading on Towards Data Science
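As a rough, hedged sketch of what a Processing Job step could look like, the example below builds a curated table from a raw one via the SDK's query interface; the table names and the query itself are made up for illustration:

```python
# 30_build_curated_users.py: a hypothetical processing step.
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Derive a curated dataset from the raw table ingested earlier.
    job_input.execute_query(
        "CREATE TABLE IF NOT EXISTS curated_users AS "
        "SELECT country, COUNT(*) AS n_users "
        "FROM raw_users "
        "GROUP BY country"
    )
```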