Big data is mainly composed of sensor and device data, social media data, enterprise data, and VoIP data. Big data is commonly characterized by five Vs: volume (how much data there is), variety (the type and nature of the data: text, image, video, audio), velocity (how fast the data is generated and processed), veracity (how trustworthy the sources are), and value (how actionable the data is).
Data engineers are responsible for the first step of the process: ingesting collected data and storing it. They carry a great responsibility, because they lay the groundwork for data analysts, data scientists, and machine learning engineers. If the data is scattered around, corrupted, and difficult to access, there's not much to prepare, explore, or experiment with.
Data engineers lay the groundwork that makes data science activity possible.
Data engineers deliver:
- the correct data.
- in the right form.
- to the right people.
- as efficiently as possible.
And that's exactly why you need a data engineer: their job is to deliver the correct data, in the right form, to the right people, as efficiently as possible.
A data engineer's responsibilities:
- Ingest data from different sources.
- Optimize databases for analysis.
- Remove corrupted data.
- Develop, construct, test, and maintain data architectures, such as databases and large-scale processing systems, to process and handle massive amounts of data.
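The first and third responsibilities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory sources, the field names `title` and `duration`, and the validity rule are all hypothetical, not part of any real pipeline.

```python
import csv
import io
import json

def ingest(csv_text, json_text):
    """Merge records from a CSV source and a JSON source into one list of dicts."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    records.extend(json.loads(json_text))
    return records

def is_valid(record):
    """Assumed rule: keep a record only if it has a title and a positive duration."""
    try:
        return bool(record.get("title")) and float(record["duration"]) > 0
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric duration counts as corrupted.
        return False

# Hypothetical sources: one CSV export and one JSON payload.
csv_source = "title,duration\nSong A,210\n,180\nSong B,abc\n"
json_source = '[{"title": "Song C", "duration": 195}, {"title": "Song D"}]'

records = ingest(csv_source, json_source)
clean = [r for r in records if is_valid(r)]
print([r["title"] for r in clean])
```

In a real system the sources would be files, APIs, or message queues, and rejected records would usually be logged rather than silently dropped.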
Data Lakes and Data Warehouses
Data Lake
A data lake stores all the raw data exactly as it was uploaded from the different sources. It is unprocessed and messy.
A data lake can hold petabytes of data.
A data lake can store any kind of data, whether structured, semi-structured, or unstructured. Because it does not enforce any model on how the data is stored, it is cost-effective.
Data lakes are used by data scientists for real-time analytics on big data.
Data Warehouse
A data warehouse stores specific data for a specific use, for example users and their subscription types, or all the listening sessions for behavioral analysis.
A data warehouse is relatively small, but still far bigger than a single hard drive.
A data warehouse enforces a structured format, which makes it more costly to manipulate.
A data warehouse, on the other hand, is optimized for analytics that drive business decisions.
Data warehouses are used by analysts for ad-hoc, read-only queries such as aggregation and summarization.
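As a rough illustration of such a read-only aggregation query, here is a sketch that uses Python's built-in `sqlite3` module as a stand-in for a warehouse. The table `listening_sessions`, its columns, and the sample rows are assumptions made up for this example.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listening_sessions (user_id INTEGER, subscription TEXT, minutes REAL)"
)
conn.executemany(
    "INSERT INTO listening_sessions VALUES (?, ?, ?)",
    [(1, "free", 30.0), (2, "premium", 120.0), (3, "premium", 60.0), (1, "free", 10.0)],
)

# Ad-hoc, read-only summary: sessions and total minutes per subscription type.
summary = conn.execute(
    """
    SELECT subscription, COUNT(*) AS sessions, SUM(minutes) AS total_minutes
    FROM listening_sessions
    GROUP BY subscription
    ORDER BY subscription
    """
).fetchall()

for subscription, sessions, total_minutes in summary:
    print(subscription, sessions, total_minutes)
```

The structured format is what makes this kind of query straightforward: every row has the same columns, so grouping and summing need no cleanup step first.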
A data catalog is a source of truth that compensates for the lack of structure in a data lake. Among other things, it keeps track of where the data comes from, how it is used, who is responsible for maintaining it, and how often it gets updated. It's good practice in terms of data governance (managing the availability, usability, integrity and security of the data), and guarantees the reproducibility of the processes in case anything unexpected happens.
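A catalog entry can be pictured as a small record per dataset. The sketch below is a hypothetical in-memory version: the class `CatalogEntry`, its fields, and the sample values are assumptions chosen to mirror the items listed above, not the schema of any particular catalog tool.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str              # dataset name in the lake
    source: str            # where the data comes from
    owner: str             # who is responsible for maintaining it
    update_frequency: str  # how often it gets updated
    used_by: list = field(default_factory=list)  # how it is used

catalog = {}

def register(entry):
    """Add or overwrite the entry for a dataset, keyed by its name."""
    catalog[entry.name] = entry

# Hypothetical dataset and values, for illustration only.
register(CatalogEntry(
    name="raw_listening_sessions",
    source="mobile app event stream",
    owner="data-engineering team",
    update_frequency="hourly",
    used_by=["behavioral analysis"],
))

entry = catalog["raw_listening_sessions"]
print(f"{entry.name}: from {entry.source}, owned by {entry.owner}, updated {entry.update_frequency}")
```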
**Data warehouses** also have subsets, such as data marts, which are highly curated for a particular community of users, such as a specific team. Data marts are also much smaller: tens of gigabytes, instead of the hundreds of gigabytes to petabytes of data that a data warehouse can hold.
How data engineers process data
- Data manipulation, cleaning, and tidying tasks:
  - that can be automated.
  - that will always need to be done.
- Rejecting corrupt song files.
- Storing data in a sanely structured database.
- Deciding what happens with missing metadata.
- Separating artists and albums into their own tables.
- Creating views on top of database tables.
- Optimizing the performance of the database.
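Several of the steps above can be sketched with Python's built-in `sqlite3` module: separate `artists` and `albums` tables, a default value for missing metadata, and a view on top of the tables. The schema, the `'unknown'` default, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artists (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE albums (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        genre TEXT DEFAULT 'unknown',  -- assumed policy for missing metadata
        artist_id INTEGER NOT NULL REFERENCES artists(id)
    );
    -- A view that joins both tables so readers need not write the join.
    CREATE VIEW album_overview AS
        SELECT albums.title AS album, artists.name AS artist, albums.genre AS genre
        FROM albums JOIN artists ON artists.id = albums.artist_id;
    """
)

conn.execute("INSERT INTO artists (id, name) VALUES (1, 'Artist A')")
# No genre supplied: the column default fills the gap.
conn.execute("INSERT INTO albums (title, artist_id) VALUES ('First Album', 1)")
conn.execute("INSERT INTO albums (title, genre, artist_id) VALUES ('Second Album', 'jazz', 1)")

rows = conn.execute(
    "SELECT album, artist, genre FROM album_overview ORDER BY album"
).fetchall()
print(rows)
```

Filling missing metadata with a default is only one possible policy; rejecting the record or flagging it for review are equally valid choices.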
Scheduling data
Scheduling is the glue of a data engineering system. It holds each small piece together and organizes how the pieces work with one another, by running tasks in a specific order and resolving all dependencies correctly.
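The idea of running tasks in dependency order can be sketched with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). This is a toy illustration, not how any particular scheduler is implemented; the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "load_warehouse": {"clean"},
    "update_catalog": {"clean"},
    "build_report": {"load_warehouse", "update_catalog"},
}

def run_pipeline(dependencies, tasks):
    """Run every task after all of its dependencies; return the order used."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed

# Each placeholder task just records that it ran.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in dependencies}

order = run_pipeline(dependencies, tasks)
print(order)
```

If the dependency graph contains a cycle, `static_order()` raises `graphlib.CycleError`, which is the kind of unresolvable dependency a scheduler has to detect.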
Parallel computing forms the basis of almost all modern data processing tools. It matters mainly because of memory constraints, but also for processing power.
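The usual pattern is to split the data into chunks, process each chunk in its own worker, and combine the partial results. The sketch below shows that pattern with `concurrent.futures`. It uses threads by default to stay simple; for CPU-bound work one would more likely pass `ProcessPoolExecutor` or use a cluster framework. The function names and the sample workload are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, n_chunks):
    """Split a list into at most n_chunks contiguous chunks of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def chunk_total(chunk):
    """Work done by one worker: only its own chunk needs to be in memory."""
    return sum(chunk)

def parallel_total(items, n_workers=4, executor_cls=ThreadPoolExecutor):
    """Sum items by splitting them across workers and combining partial sums."""
    chunks = split(items, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        return sum(pool.map(chunk_total, chunks))

# Hypothetical workload: session lengths in minutes.
minutes = list(range(1, 101))
print(parallel_total(minutes))
```

Because each worker only sees its own chunk, no single worker has to hold the full dataset, which is the memory benefit mentioned above.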
Cloud computing for data storage.
Database reliability: data replication.
Risks with sensitive data.