Introduction:
In this article we'll look at the different concepts behind data analytics. It will provide you with a clearer understanding of what data analytics is and how it allows you to collect, store, review, and analyze data to help drive business decisions through the insights you identify.
Who should read this article?
This article is written to provide you with a foundation in data analytics and is ideal for those looking to become a data analyst or data scientist.
Objectives:
The objective of this article is to provide you with an understanding of the following analytics concepts:
- Data types, including structured, semi-structured, and unstructured data
- When you should use data analytics within your business
- The process behind running analytics against data
Introduction to Data Analytics:
In this article, I'll provide you with a basic understanding of the concepts behind many of the AWS services and architectures used to implement data analysis.
Simply speaking, analytics, or data analytics, is the science of transforming data into meaningful information and insights.
Here we refer to data as any input you have: a spreadsheet, a CSV file, historic sales information, a database, raw research data; essentially, any data you may have.
With these basic concepts in mind, let's explore data analytics concepts a bit more.
It starts with a question:
Everything starts with a question about a problem we have. Using analytics, we want to solve these problems by selecting the right tools to collect and clean the data in an appropriate fashion.
Types of data:
As an input for our analytics, we may have data that can be organized into different categories, for example:
- Qualitative data
- Quantitative data
- Structured data
- Semi-structured data
- Unstructured data
If you are new to these concepts and haven't heard these terms before, don't worry; I'll explain each of them next. Along the way, we will also see which AWS services can be used in the analytics process.
Data types:
Let's talk a little about the input. We have two basic data types by which data is organized.
Quantitative Data:
Quantitative data refers to numbers: the amount of certain values, like the number of citizens in a given geographical area.
Example: number of students in a class = 16 (boys = 8, girls = 8)
Qualitative data:
Qualitative data refers to attributes of a population that are not expressed as direct numbers but instead describe qualities, such as:
- Eye color
- Satisfaction level
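To make the distinction concrete, here is a minimal sketch in Python (assuming pandas is installed; the column names and values are invented for illustration):

```python
import pandas as pd

# A small illustrative dataset mixing both kinds of data.
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],          # quantitative: values expressed as numbers
    "exam_score": [72, 85, 90, 66],      # quantitative: a measured amount
    "eye_color": ["brown", "blue", "green", "brown"],   # qualitative: an attribute
    "satisfaction": ["high", "medium", "high", "low"],  # qualitative: a level, not a number
})

# Quantitative columns support arithmetic...
print(students["exam_score"].mean())     # 78.25

# ...while qualitative columns are summarized by category counts.
print(students["eye_color"].value_counts())
```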
Main data types:
We also have three main classifications for data formats, which are the following.
Structured data:
Structured data refers to data with a defined data model, like SQL databases, where tables have a fixed model and schema. On AWS, for example, Amazon RDS, the Relational Database Service, is a good example of a structured store.
Semi-structured data:
In semi-structured data we basically have a flexible data model or tagging mechanism that allows semantic organization and some kind of hierarchy discovery from the data, without the fixed and rigid rules of a SQL database. NoSQL databases can also hold structured data, but usually they are used in a flexible way to complement the limitations of SQL databases. Amazon DynamoDB, for example, allows each record to have a different number of columns while still providing fixed indexes for searching. This provides a very flexible schema.
E.g., NoSQL stores, XML, JSON, and CSV files are good examples of semi-structured data.
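As a small illustration of what a flexible schema looks like, consider this hypothetical sketch: two records sharing the same indexed key (`user_id`, an invented name) but carrying different sets of attributes, much as DynamoDB allows:

```python
import json

# Two records for the same "table": both have the indexed key (user_id),
# but each carries a different set of attributes -- there is no fixed schema.
records = [
    {"user_id": "u-001", "name": "Alice", "email": "alice@example.com"},
    {"user_id": "u-002", "name": "Bob", "last_login": "2023-05-01", "tags": ["beta", "mobile"]},
]

for record in records:
    print(json.dumps(record, indent=2))
```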
Unstructured data:
And lastly, we have unstructured data, where all kinds of information without a data model are classified. Here we have all kinds of documents without a proper data model or schema, such as books, free-form text, and all sorts of inputs for natural language processing and text processing.
Data generation has exploded exponentially in the past decade. We generate data from the moment we wake up to the moment we go to bed; even while we sleep, sensors can be collecting data from our bodies and environment to improve a series of apps and services. It could be suggested that we are generating far more data than we can possibly analyze.
When to use data analytics:
To help us decide whether we need data analytics services to find the answers to the problems locked within our data, we can look at three different factors.
Volume: The first factor is volume, which refers to the size of the data set, or, as we usually call it, the data size. Size matters when deciding on the right tool to analyze the data. Usually, a big data problem will scale from gigabytes to petabytes of data.
Velocity: There is also what we call the velocity of the data, which indicates how quickly you need to get your answers and is also related to the age of the data, for example, historical records from previous years versus real-time alerting and information. This has a great impact on the tools used to analyze the data, because it depends on the response time you need: are you comfortable waiting, or do you need real-time responses? Knowing this, we can choose the right tool and technique.
Variety: This refers to the variety of the source data classifications, i.e., whether the data is structured or unstructured, as analytics problems will often have sources of several types, such as business intelligence platform data, blogs, CSV data, text, and all sorts of structured or unstructured data.
Analytics types:
The power of analytics grows as we move from batch analytics to real-time analytics and then to predictive analytics, but as always, the problem you are trying to solve will dictate the best method.
Batch analytics: In reporting or BI analysis, data is processed as a job and the results are presented after a period of time. We may have years of data in our data warehouse, in log files, or in spreadsheets, and we want to find interesting patterns in this data, such as potential sales, potential profits, or potential insights from research data.
Real-time analytics: When it comes to real-time analytics, we need to get the answers as soon as possible; losing time can have serious consequences. Examples include fast responses to security alerts from intrusion detection systems or responses to advertising campaigns.
Predictive analytics: The last type of data analytics is predictive analytics, which takes historical data as an input, learns from that history, and gives us predictions of future behavior. This is a common case for machine learning, like spam detection, where based on past behavior we identify malicious messages, predicting and avoiding spam.
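As a minimal, local illustration of the idea (not an AWS service; the messages and labels are invented, and scikit-learn is assumed to be installed), a toy spam classifier might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Historical data: past messages and whether they were spam (1) or not (0).
messages = [
    "win a free prize now", "cheap meds click here",
    "meeting at 10am tomorrow", "project report attached",
]
labels = [1, 1, 0, 0]

# Learn from history: bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# Predict future behavior: classify an unseen message.
print(model.predict(["click here to win a prize"]))  # likely [1] -> spam
```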
Data analytics process:
First of all, when you have a problem, you usually define it as a question to start your journey into the analytics field. With your question ready, you need the source data, which is effectively your starting point. This can be a data warehouse, relational database tables, a NoSQL store, a CSV file, books, or text files. In short, every readable format can be used as an input. Selecting the input will depend on the answers you are trying to get for your problem or question.
For example:
The problem might be to count the words in a book by Shakespeare or, at the other end of the scale, to analyze DNA to find patterns. So the type of problem will dictate the data and also the processing algorithm.
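For the word-count problem, a minimal local sketch might look like the following (the filename `shakespeare.txt` is hypothetical):

```python
import re
from collections import Counter

# Read the book and normalize the text to lowercase words.
with open("shakespeare.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

# Count occurrences and show the ten most frequent words.
for word, count in Counter(words).most_common(10):
    print(f"{word}: {count}")
```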
With your input ready, you need to store it in a place accessible to the tools that will process it, analyze it, and return the results. The separation between processing and analysis is based on the fact that some analytics solutions from AWS require prior cleaning or pre-processing of the data for better results and accuracy.
AWS has structured its portfolio around a collect, store, analyze, and visualize methodology, with integrated services to perform each function.
Collect/ingest:
The first step in your analytics process is to collect the data that you want to use as an input. Data collection is also called ingestion, which is the act of acquiring data and storing it for later usage. In data collection, we have different types of ingested data. We can have transactional data, represented by traditional relational database reads and writes. We can also have file ingestion, reading data from file sources such as logs, text, CSV files, book contents, and so on. And we can have streamed data, represented by any kind of streamed content, like clickstream events on a website, Internet of Things devices, and so on.
The toolset AWS currently offers can ingest data from many different sources. For example, with Kinesis Data Streams or Kinesis Data Firehose, we can easily work with streamed data from almost any source, even if it is on-premises.
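For instance, a producer could push a clickstream event into a Kinesis data stream using boto3, as in this sketch (the stream name `clickstream` is hypothetical, and configured AWS credentials are assumed):

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# A single clickstream event; in practice these arrive continuously.
event = {"user_id": "u-001", "page": "/products/42", "action": "click"}

# PartitionKey determines which shard receives the record.
kinesis.put_record(
    StreamName="clickstream",          # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```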
Store:
After the data is generated or acquired, we need to store it in a place accessible to AWS services. This is usually called a data lake: the big pool where your services go to get the source data and to deliver back the results. Amazon S3 is one of the core storage services from AWS, a highly durable object store that integrates seamlessly with all the other AWS analytics services for data loading and storage. You can also keep data in Amazon RDS if it has a structured format, or in Amazon Redshift. If it has no fixed data model but some basic structure, we can use DynamoDB, the NoSQL solution from AWS, to store it. And if your data is accessed very infrequently, we can use Amazon S3 Glacier, the archive service from AWS.
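Loading a source file into the data lake can be as simple as this boto3 sketch (the bucket, file, and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Upload a local CSV into the data lake bucket so analytics services can reach it.
s3.upload_file(
    Filename="sales_2023.csv",            # hypothetical local file
    Bucket="my-analytics-data-lake",      # hypothetical bucket name
    Key="raw/sales/sales_2023.csv",       # object key organizing the lake by stage
)
```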
Analyze/process:
- Batch
- Real-time
- Predictive

Remember, the right service or tool depends on the type of problem you have and the velocity of the replies you need: whether you can wait for a while, need real-time answers, or want to predict future behaviors.

Amazon EMR is a good choice if your goal is to provide reports based on:
- Batch processing
- Historical data analysis
- Identifying patterns in large data sets, as shown in the sketch below
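As a rough sketch of the kind of batch job EMR typically runs, here is a minimal PySpark script that counts words across text files in S3 (the S3 paths are hypothetical; on EMR this would be submitted as a step):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-word-count").getOrCreate()

# Read every text file under the input prefix as one distributed dataset.
lines = spark.read.text("s3://my-analytics-data-lake/raw/books/").rdd

# Classic batch pattern: split into words, map to (word, 1), reduce by key.
counts = (
    lines.flatMap(lambda row: row.value.split())
         .map(lambda word: (word.lower(), 1))
         .reduceByKey(lambda a, b: a + b)
)

# Write the aggregated results back to the data lake.
counts.toDF(["word", "count"]).write.csv(
    "s3://my-analytics-data-lake/results/word-counts/", header=True
)

spark.stop()
```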
If you need real-time replies to questions, or the results must be displayed on live dashboards, then you might take advantage of stream-based processing with Amazon Kinesis, AWS Lambda, or Amazon OpenSearch Service. Kinesis provides streams to load, store, and easily consume live stream data, and AWS Lambda can react to these stream events, executing functions you define.
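A Lambda function subscribed to such a stream might look like this sketch (the event shape follows the standard Kinesis-to-Lambda record format; the alert condition is just a placeholder):

```python
import base64
import json

def lambda_handler(event, context):
    # Kinesis delivers records base64-encoded inside the event payload.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Placeholder real-time logic: react immediately to suspicious events.
        if payload.get("action") == "intrusion_alert":
            print(f"ALERT: {payload}")
```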
For predictive analytics, where you need to forecast an event based on historical cases, you might take advantage of Amazon Machine Learning services to build highly available predictive applications. Not forgetting AWS Data Pipeline, which can be used to orchestrate all these services: as a framework for data-driven workflows, Data Pipeline can be used to automate your database loads. And for the visualization aspect, to get a nice overview or dashboard of your replies, you can use Amazon QuickSight, which allows you to create rich visualizations from your data.
Conclusion:
That brings an end to this introductory article; you should now have a greater understanding of the basic concepts behind data analytics. We have covered what data analytics is, the types of analytics, when you should use data analytics within your business, and the process behind running analytics against data. Thanks for taking the time to read this article; I hope you found it interesting and informative.
Please share your feedback with us as well.
Thank you!