GCP DataFlow Function Schedule
Overview
This project showcases the integration of Google Cloud services, specifically Dataflow, Cloud Functions, and Cloud Scheduler, to create a highly scalable, cost-effective, and easy-to-maintain data processing solution. It demonstrates how you can automate data pipelines, perform seamless integration with other GCP services like BigQuery, and manage workflows efficiently through CI/CD pipelines with GitHub Actions. This setup provides flexibility, reduces manual intervention, and ensures that the data processing workflows run smoothly and consistently.
Table of Contents
- Technologies Used
- Features
- Architecture Diagram
- Getting Started
- Deploying the Project
- Resources Created After Deployment
- Conclusion
- Documentation Links
Technologies Used
Google Dataflow
Google Dataflow is a fully managed service for stream and batch data processing, built on Apache Beam. It allows for the creation of highly efficient, low-latency, and cost-effective data pipelines. Dataflow can handle large-scale data processing tasks, making it ideal for use cases like real-time analytics and ETL jobs.
Cloud Storage
Google Cloud Storage is a scalable, durable, and secure object storage service designed to handle large volumes of unstructured data. It is ideal for big data analysis, backups, and content distribution, offering high availability and low latency across the globe.
Cloud Functions
Google Cloud Functions is a serverless execution environment that allows you to run code in response to events. In this project, Cloud Functions are used to trigger Dataflow jobs and manage workflow automation efficiently with minimal operational overhead.
Cloud Scheduler
Google Cloud Scheduler is a fully managed cron job service that allows you to schedule tasks or trigger cloud services at specific intervals. It is used in this project to automate the execution of the Cloud Function, ensuring that Dataflow jobs run as needed without manual intervention.
CI/CD Process with GitHub Actions
GitHub Actions enables continuous integration and continuous delivery (CI/CD) workflows directly from your GitHub repository. In this project, it is used to automate the build, testing, and deployment of resources to Google Cloud, ensuring consistent and reliable deployments.
GitHub Secrets and Configuration
GitHub Secrets securely store sensitive information such as API keys, service account credentials, and configuration settings required for deployment. By keeping these details secure, the risk of leaks and unauthorized access is minimized.
Features
- Ingest and transform data from Google Cloud Storage using Google Dataflow.
- Encapsulate the Dataflow process into a reusable Dataflow template.
- Create a Cloud Function that executes the Dataflow template through a REST API.
- Automate the execution of the Cloud Function using Cloud Scheduler.
- Implement a CI/CD pipeline with GitHub Actions for automated deployments.
- Incorporate comprehensive error handling and logging for reliable data processing.
Architecture Diagram
Getting Started
Prerequisites
Before getting started, ensure you have the following:
- A Google Cloud account with billing enabled.
- A GitHub account.
Setup Instructions
- Clone the repository:

```bash
git clone https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template.git
cd gcp-dataproc-bigquery-workflow-template
```
## Set Up Google Cloud Environment
1. **Create a Google Cloud Storage bucket** to store your data.
2. **Set up a BigQuery dataset** where your data will be ingested.
3. **Create a Dataproc cluster** for processing (example commands for these steps are sketched below).
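The exact commands depend on your naming and region choices. The following is a minimal sketch of the three steps above, using hypothetical bucket, dataset, and cluster names:

```bash
# Hypothetical names and region; adjust to your project.
export PROJECT_ID=my-gcp-project
export REGION=us-central1

# 1. Cloud Storage bucket for the data files
gcloud storage buckets create gs://my-bigdata-files-bucket --location=US

# 2. BigQuery dataset where data will be ingested
bq --location=US mk --dataset $PROJECT_ID:my_dataset

# 3. Dataproc cluster for processing (single node keeps costs low for tests)
gcloud dataproc clusters create my-cluster \
  --region=$REGION \
  --single-node
```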
## Create a new service account for deployment purposes
- Create the Service Account:

```bash
gcloud iam service-accounts create devops-dataops-sa \
  --description="Service account for DevOps and DataOps tasks" \
  --display-name="DevOps DataOps Service Account"
```
- Grant Storage Access Permissions (Buckets): Storage Admin (roles/storage.admin): Grants permissions to create, list, and manipulate buckets and files.

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.admin"
```
- Grant Dataflow Permissions: Dataflow Admin (roles/dataflow.admin): To create, run, and manage Dataflow jobs. Dataflow Developer (roles/dataflow.developer): Allows the development and submission of Dataflow jobs.

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataflow.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataflow.developer"
```
- Grant Permissions to Create and Manage Cloud Functions and Cloud Scheduler: Cloud Functions Admin (roles/cloudfunctions.admin): To create and manage Cloud Functions. Cloud Scheduler Admin (roles/cloudscheduler.admin): To create and manage Cloud Scheduler jobs.

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/cloudfunctions.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/cloudscheduler.admin"
```
- Grant Permissions to Manage Service Accounts: IAM Service Account Admin (roles/iam.serviceAccountAdmin): To create and manage other service accounts. IAM Service Account User (roles/iam.serviceAccountUser): To use service accounts in different services.

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountAdmin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"
```

- Grant Permission to Enable API Services: Service Usage Admin (roles/serviceusage.serviceUsageAdmin): To enable the APIs required by the deployment.

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/serviceusage.serviceUsageAdmin"
```

- Grant Permission to Manage Project IAM Policy: Project IAM Admin (roles/resourcemanager.projectIamAdmin): To grant roles to other service accounts during deployment.

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/resourcemanager.projectIamAdmin"
```
- Additional Permissions (Optional): Compute Admin (roles/compute.admin): If your pipeline needs to create compute resources (e.g., virtual machine instances). Viewer (roles/viewer): To ensure the account can view other resources in the project.

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/compute.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/viewer"
```
## Configure Environment Variables and Secrets
Ensure the following environment variables are set in your deployment configuration or within GitHub Secrets:

- `GCP_BUCKET_BIGDATA_FILES`: Secret that stores the name of the Cloud Storage bucket holding the big data files.
- `GCP_BUCKET_DATALAKE`: Secret that stores the name of the Cloud Storage bucket used as the data lake.
- `GCP_BUCKET_DATAPROC`: Secret that stores the name of the Cloud Storage bucket used for Dataproc artifacts.
- `GCP_BUCKET_TEMP_BIGQUERY`: Secret that stores the name of the Cloud Storage bucket used for BigQuery temporary files.
- `GCP_DEVOPS_SA_KEY`: Secret that stores the service account key (JSON). For this project, the default service key was used.
- `GCP_SERVICE_ACCOUNT`: Secret that stores the email of the service account used for deployment.
- `PROJECT_ID`: Secret that stores the project ID value.
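The value for `GCP_DEVOPS_SA_KEY` is typically the JSON key of the service account created earlier. A minimal sketch, assuming the `devops-dataops-sa` account from the previous section:

```bash
# Generate a JSON key for the deployment service account.
# Paste the file contents into the GCP_DEVOPS_SA_KEY secret, then delete the local file.
gcloud iam service-accounts keys create devops-dataops-sa-key.json \
  --iam-account=devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com
```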
### Creating a GitHub secret
To create a new secret:

1. In the project repository, open the **Settings** menu.
2. Under **Security**, select **Secrets and variables**, then click **Actions**.
3. Click **New repository secret**, then enter a **name** and **value** for the secret.
![github secret creation](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i45cicz0q89ije7j70yf.png)
For more details, see: [Using secrets in GitHub Actions](https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions)
## Deploying the Project
Whenever a push to the main branch occurs, GitHub Actions triggers and runs the workflow defined in the YAML file. The workflow contains several jobs, described in detail below. In essence, GitHub Actions uses the service account credentials to authenticate with Google Cloud and execute the necessary steps as described in the YAML file.
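Conceptually, the authentication performed by the workflow is equivalent to activating the service account key locally. A rough sketch, for illustration only:

```bash
# What the CI authentication step boils down to, shown as plain gcloud commands.
gcloud auth activate-service-account \
  devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com \
  --key-file=devops-dataops-sa-key.json
gcloud config set project $PROJECT_ID
```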
## Workflow File (YAML) Explanation
Environment variables needed:
The workflow defines variables for basic settings such as cluster characteristics, bucket paths, process names, and workflow steps. If new steps or scripts are added to the workflow, new variables can easily be added in the same way.
## Workflow Job Steps
- **enable-services**: Enables the APIs required for Cloud Functions, Dataflow, and the build process.
- **deploy-buckets**: Creates the Google Cloud Storage buckets and copies the required data files and scripts into them.
- **build-dataflow-classic-template**: Builds a Dataflow classic template and stores it in a Cloud Storage bucket for future execution.
- **deploy-cloud-function**: Deploys a Cloud Function that triggers the execution of the Dataflow template using the google-api-python-client library.
- **deploy-cloud-schedule**: Creates a Cloud Scheduler job to automate the execution of the Cloud Function, ensuring data is processed at defined intervals.
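The exact commands live in the workflow YAML; the sketch below only illustrates the kind of gcloud and Beam calls each job runs, with hypothetical names, paths, and schedule:

```bash
# enable-services: turn on the required APIs
gcloud services enable dataflow.googleapis.com cloudfunctions.googleapis.com \
  cloudscheduler.googleapis.com cloudbuild.googleapis.com

# deploy-buckets: create buckets and copy data/scripts (bucket name is hypothetical)
gcloud storage buckets create gs://$GCP_BUCKET_DATALAKE --location=US
gcloud storage cp -r ./data gs://$GCP_BUCKET_DATALAKE/transient/

# build-dataflow-classic-template: running the Beam pipeline with
# --template_location stages a classic template instead of executing the job
python pipeline.py \
  --runner=DataflowRunner \
  --project=$PROJECT_ID \
  --region=us-central1 \
  --temp_location=gs://$GCP_BUCKET_DATALAKE/temp \
  --template_location=gs://$GCP_BUCKET_DATALAKE/templates/my_template

# deploy-cloud-function: HTTP-triggered function that launches the template
gcloud functions deploy trigger-dataflow-template \
  --runtime=python310 \
  --trigger-http \
  --entry-point=main \
  --source=./function \
  --region=us-central1 \
  --no-allow-unauthenticated

# deploy-cloud-schedule: call the function on a cron schedule
gcloud scheduler jobs create http dataflow-daily-run \
  --schedule="0 6 * * *" \
  --uri="https://us-central1-$PROJECT_ID.cloudfunctions.net/trigger-dataflow-template" \
  --http-method=POST \
  --location=us-central1 \
  --oidc-service-account-email="devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com"
```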
## Resources Created After Deployment
Upon deployment, the following resources are created:
### Google Cloud Storage Bucket
A Cloud Storage bucket to store data and templates.

![buckets](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5au57kwtdazwa8qovvkl.png)

CSV files of the Olist dataset, stored in the transient layer of the data lake.

![bucket transient](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wcby7lbjl2aqly7fvfgl.png)

CSV file created after Dataflow processing; this file can be used in analysis tools, spreadsheets, databases, etc.

![bucket silver](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdfb9ebtdgp69n6syqwd.png)
### Dataflow Classic Template
A reusable Dataflow template stored in Cloud Storage.

![dataflow classic template](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jj0payqwvuca6pq1vyq1.JPG)
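Once the template is stored, it can also be launched manually for a quick test; bucket, template name, and region below are placeholders:

```bash
# Launch the classic template outside the scheduled flow, e.g. for a smoke test.
gcloud dataflow jobs run manual-test-run \
  --gcs-location=gs://<your-bucket>/templates/<your-template> \
  --region=us-central1
```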
### Cloud Scheduler Job
Automated scheduled jobs for Dataflow executions.

![Cloud Schedule](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x74p6oys3q5n66vx2ojf.JPG)
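The scheduler job can also be forced to run immediately, which is handy for verifying the end-to-end flow without waiting for the cron window; the job name and location below are placeholders:

```bash
# Trigger the Cloud Scheduler job on demand.
gcloud scheduler jobs run dataflow-daily-run --location=us-central1
```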
## Conclusion
This project demonstrates how to leverage Google Cloud services like Dataflow, Cloud Functions, and Cloud Scheduler to create a fully automated and scalable data processing pipeline. The integration with GitHub Actions ensures continuous deployment, while the use of Cloud Functions and Scheduler provides flexibility and automation, minimizing operational overhead. This setup is versatile and can be easily extended to incorporate additional GCP services such as BigQuery.
## Documentation Links

- [GitHub Repo](https://github.com/jader-lima/gcp-dataflow-function-schedule)
- [Cloud Functions](https://cloud.google.com/functions/docs)
- [Dataflow](https://cloud.google.com/dataflow/docs)
- [Cloud Scheduler](https://cloud.google.com/scheduler/docs)
- [GitHub Actions](https://docs.github.com/en/actions)