Are you looking for an easy and efficient way to create and manage data processing pipelines? Look no further! I am excited to introduce dataDisk, a powerful Python package designed to streamline your data processing tasks. Whether you are a data scientist, data engineer, or a developer working with data, dataDisk offers a flexible and robust solution to handle your data transformation and validation needs.
Key Features
- Flexible Data Pipelines: Define a sequence of data processing tasks, including transformations and validations, with ease.
- Built-in Transformations: Use a variety of pre-built transformations such as normalization, standardization, and encoding.
- Custom Transformations: Define and integrate your custom transformation functions.
- Parallel Processing: Enhance performance with parallel execution of pipeline tasks.
- Easy Integration: Simple and intuitive API to integrate dataDisk into your existing projects.
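To make the custom-transformations feature concrete, here is a minimal sketch of what a custom step could look like. Note the assumptions: dataDisk's custom-transformation API is not shown in this post, so the `fill_missing_with_mean` function below is a hypothetical example written in plain pandas, and the idea that `pipeline.add_task` accepts any DataFrame-in, DataFrame-out callable is an inference from how the built-in transformations are registered later in this post.

```python
import pandas as pd

def fill_missing_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical custom transformation: fill numeric NaNs with column means."""
    numeric_cols = df.select_dtypes(include='number').columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    return df

# Quick standalone check on a toy frame
df = pd.DataFrame({'x': [1.0, None, 3.0]})
print(fill_missing_with_mean(df)['x'].tolist())  # [1.0, 2.0, 3.0]

# Assumed integration point (mirrors how built-in tasks are added):
# pipeline.add_task(fill_missing_with_mean)
```

Because the function only depends on pandas, it can be unit-tested on its own before being wired into a pipeline.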
How It Works
- Define Your Data Source and Sink
Specify the source of your data and where you want the processed data to be saved.
```python
from dataDisk.data_sources import CSVDataSource
from dataDisk.data_sinks import CSVSink

source = CSVDataSource('input_data.csv')
sink = CSVSink('output_data.csv')
```
- Create Your Data Pipeline
Initialize the data pipeline and add the desired tasks.
```python
from dataDisk.pipeline import DataPipeline
from dataDisk.transformation import Transformation

pipeline = DataPipeline(source=source, sink=sink)
pipeline.add_task(Transformation.data_cleaning)
pipeline.add_task(Transformation.normalize)
pipeline.add_task(Transformation.label_encode)
```
- Run the Pipeline
Execute the pipeline to process your data.
```python
pipeline.process()
print("Data processing complete.")
```
Get Started
To start using dataDisk, simply install it via pip:
```shell
pip install dataDisk
```
Contribute to dataDisk
I believe in the power of community and open source. dataDisk is still growing, and I need your help to make it even better! Here’s how you can contribute:
Star the Repository: If you find dataDisk useful, please star the GitHub repository. It helps the project gain visibility and attract more contributors.
Submit Issues: Found a bug or have a feature request? Submit an issue on GitHub.
Contribute Code: I welcome pull requests! If you have improvements or new features to add, please fork the repository and submit a PR.
Spread the Word: Share dataDisk with your colleagues and friends who might benefit from it.
Example: Testing Transformations
Here's an example to demonstrate testing all the transformation features available in dataDisk:
```python
import logging

import pandas as pd

from dataDisk.transformation import Transformation

logging.basicConfig(level=logging.INFO)

# Sample DataFrame
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [6, 7, 8, 9, 10],
    'category': ['A', 'B', 'A', 'B', 'A'],
    'feature3': [None, 2.0, None, 4.0, 5.0]
})

logging.info("Original Data:")
logging.info(data)

# Test standardize
logging.info("Testing standardize transformation")
try:
    standardized_data = Transformation.standardize(data.copy())
    logging.info(standardized_data)
except Exception as e:
    logging.error(f"Standardize transformation failed: {str(e)}")

# Test other transformations...
# Add similar blocks for normalize, label_encode, etc.
```
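Rather than repeating that try/except block for every transformation, the same pattern can be collapsed into a loop. A caveat on the sketch below: to keep it self-contained it uses plain pandas stand-ins for `normalize` and `label_encode` (these are illustrative assumptions, not dataDisk's actual implementations); dataDisk's real `Transformation` methods should drop into the same `(name, callable)` list.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

# Illustrative stand-ins for dataDisk's built-in transformations.
def normalize(df):
    """Min-max scale numeric columns to [0, 1]."""
    num = df.select_dtypes(include='number')
    df[num.columns] = (num - num.min()) / (num.max() - num.min())
    return df

def label_encode(df):
    """Replace string columns with integer category codes."""
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].astype('category').cat.codes
    return df

def run_transformations(data, transformations):
    """Apply each transformation to a fresh copy, logging any failures."""
    results = {}
    for name, func in transformations:
        try:
            results[name] = func(data.copy())
            logging.info("%s succeeded", name)
        except Exception as e:
            logging.error("%s failed: %s", name, e)
    return results

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'category': ['A', 'B', 'A', 'B', 'A'],
})
results = run_transformations(data, [('normalize', normalize),
                                     ('label_encode', label_encode)])
```

Each transformation runs against its own copy of the input, so one failing step never corrupts the data seen by the next.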
Join us in making dataDisk the go-to solution for data processing pipelines!
GitHub: dataDisk repository. Please star the project if you find it useful.