Now that our server is connected to Elasticsearch hosted on Elastic Cloud, it is time to think about the data that we want to ingest into Elasticsearch.
Review from part 1
For our project, we will be retrieving data from the USGS API and ingesting it into Elasticsearch.
Before ingesting data into Elasticsearch, assessing the data structure and planning the desired mapping are essential steps to ensure efficient storage and search of data.
In this blog, we will assess the data to:
- determine what data we need
- discover if we need to transform the data to fit our use case
- decide on the desired mapping for efficient storage and search of data
Resources
Would you rather watch a video to learn this content? Click on the link below!
The following USGS Earthquake API home page contains the link to the API that serves all earthquake data from the past 30 days (red arrow). We will be retrieving data from this API!
The following documentation explains the terms (field names) included in the API. Refer to it if you need clarification on the acronyms or want details about certain fields.
The current blog builds upon concepts covered in the blog and video above. Refer to these links if some of the jargon doesn't make sense to you or if you need a little refresher.
We will be referring to the following Elasticsearch documentation on field data types and numeric field types while coming up with the desired mapping for our data.
You will find these resources helpful when assigning field types to your data.
Now that you have access to all resources, let's get to work!
Step 1: Review the final outcome
Before we examine the data structure, let's review the final outcome of the app we are building.
Our app allows the user to search for earthquakes using the following criteria.
- type
- magnitude
- location
- date range
When the user clicks on the search button, the search results are displayed as cards.
Each card displays the following information about one earthquake.
The information outlined above is what we need from the USGS API.
We will store this information in Elasticsearch in the form of documents. Each document will contain information about one earthquake.
Step 2: Examine the data structure of earthquake API
Go to the USGS Earthquake Home Page and scroll down to the Output section (red box).
The output shows the data structure of a typical earthquake object found in the API.
Scroll down on the output to view the field `features` (green box).
The `features` field contains an array of objects. Each object contains information about one earthquake, listing the name and data type of the fields found in a typical earthquake object.
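For reference, here is a minimal sketch of what one entry in the `features` array looks like. All values are made up for illustration, and the real response contains many more fields than shown here.

```javascript
// Abbreviated sketch of the USGS GeoJSON response. Each element of
// `features` describes one earthquake; all values below are illustrative.
const sampleResponse = {
  type: "FeatureCollection",
  features: [
    {
      type: "Feature",
      properties: {
        mag: 4.2,
        place: "10 km ENE of Pāhala, Hawaii",
        time: 1651522073266, // Unix epoch time in milliseconds
        url: "https://earthquake.usgs.gov/earthquakes/eventpage/example",
        sig: 271,
        type: "earthquake",
      },
      geometry: {
        type: "Point",
        coordinates: [-155.39, 19.23, 33.5], // [longitude, latitude, depth]
      },
    },
  ],
};

console.log(sampleResponse.features.length); // number of earthquakes retrieved
```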
In step 1, we determined what information we need to store in Elasticsearch.
You will see that the objects `properties` (orange box) and `geometry` (blue box) contain the information we seek (underlined in pink).
If you need clarification on the acronyms or want details about certain fields, refer to the Earthquake Catalog Documentation.
Take a look at the image below.
You will see that the earthquake object from the API contains more information than we need.
To save storage, we will only index the fields `mag`, `place`, `time`, `url`, `sig` (significance), `type`, and the `coordinates` array, which includes `longitude`, `latitude`, and `depth`, in that order.
The API fields that correspond to the info on the card are highlighted in the same colors.
When you compare the two, for the majority of these fields, the info from the API is identical to the info displayed in the search results.
However, a few are not the same.
Step 3: Determine whether we need to transform any data before ingestion
Data transformation task 1: time
Let's compare the field `time` in the API earthquake object with the field `Time` on our card.
You will see that the field `time` in the API earthquake object is in Unix epoch time (1651522073266). However, `Time` on the card displays a human-readable timestamp (2022-05-02T20:07:53.266Z).
To achieve this outcome, we will convert the Unix epoch time in the API field `time` to a human-readable timestamp. Then, we will store the transformed value in the field `@timestamp` in Elasticsearch (more on that later).
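As a sketch, this conversion can be done with JavaScript's built-in Date object (the function name `toIsoTimestamp` is my own, not part of the USGS API or Elasticsearch):

```javascript
// Convert Unix epoch time in milliseconds (as returned by the USGS API)
// into a human-readable ISO 8601 timestamp for the @timestamp field.
function toIsoTimestamp(epochMillis) {
  return new Date(epochMillis).toISOString();
}

console.log(toIsoTimestamp(1651522073266)); // "2022-05-02T20:07:53.266Z"
```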
Data transformation task 2: coordinates
You will see that the search results card has fields called `latitude`, `longitude`, and `depth` (pink boxes).
In the API earthquake object, the values of these fields are contained in an array called `coordinates` (pink box) and are not labeled as such.
To make it easier to identify this information, we will create fields for `lat` (latitude), `lon` (longitude), and `depth` in Elasticsearch. Then, we will store the corresponding info from the API's `coordinates` array in its respective field.
Next, we will store `lat` and `lon` in an object called `coordinates` to keep this information together as a pair, as shown below.
Note that in Elasticsearch, the abbreviation `lat` should be used for latitude and `lon` should be used for longitude.
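Putting both transformation tasks together, a minimal transformation step might look like the following. The function name is my own invention; the field names match the plan described above.

```javascript
// Transform one raw earthquake object from the USGS API into the
// document shape we plan to store in Elasticsearch.
function transformEarthquake(feature) {
  // API order is [longitude, latitude, depth]
  const [lon, lat, depth] = feature.geometry.coordinates;
  return {
    mag: feature.properties.mag,
    place: feature.properties.place,
    "@timestamp": new Date(feature.properties.time).toISOString(),
    url: feature.properties.url,
    sig: feature.properties.sig,
    type: feature.properties.type,
    depth,
    coordinates: { lat, lon }, // Elasticsearch expects the keys lat and lon
  };
}
```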
Step 4: Determine the desired mapping
We just figured out how we should transform the retrieved data from the API before we store it in Elasticsearch.
Next, we will figure out how to store this data using the smallest disk space while maximizing our search performance.
This is when customizing your mapping comes into play!
Mapping defines how a document and its fields are indexed and stored.
It does that by assigning types to the fields being indexed. Depending on the assigned field type, each field is indexed and primed for different types of requests (full-text search, exact searches, aggregations, sorting, etc.).
This is why mapping plays an important role in how Elasticsearch stores and searches for data.
We will be glossing over a lot of the concepts we have covered in Understanding mapping with Elasticsearch and Kibana. Check out this resource if you need a more in-depth review of these concepts!
Take a look at the table below.
To make this process easier, I have created a table (pink box) that displays the name of each field, a description of the field, the typical values it contains, the purpose it will serve, and the desired mapping I have chosen for it.
Let's go through each field and see why I chose a particular field type for it.
When you take a look at the Typical values column in the table, you will see that these fields contain either numeric (green box) or string (orange box) values.
Let's take a look at the fields that contain numeric values first.
Numeric field types
There are various field types that can be assigned to numeric fields. The field type you choose will depend on the type of values a field contains and the purpose the field will serve.
Coordinates
In step 3, we decided that we want to store the fields `lat` (latitude) and `lon` (longitude) within an object called `coordinates`.
We are planning on using the fields `lat` and `lon` for two tasks:
- to display this information on a search results card
- to use these coordinates to mark the locations of earthquakes on a heat map (part 10)
The second task requires running geo-based queries.
Therefore, the field `coordinates` should be typed as `geo_point` in order for this to work.
If you want to learn more about the field type `geo_point`, check out this documentation.
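In mapping terms, that decision can be sketched as the snippet below, shown here as a JavaScript object so it can later be sent with an Elasticsearch client (the surrounding index setup comes in part 6):

```javascript
// Mapping for the coordinates field: typing it as geo_point lets us
// run geo-based queries (e.g., for the heat map in part 10).
const coordinatesMapping = {
  coordinates: { type: "geo_point" },
};

console.log(coordinatesMapping.coordinates.type); // "geo_point"
```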
depth and mag
The typical values for the fields `depth` and `mag` are decimals.
As the values of these fields will only be displayed in the search results card, we will assign the field type `float` to these fields.
sig
When you look at the typical values for the field `sig`, it consists of integers that range from 0 to 1000.
The value of this field will be displayed in the search results card.
We want to choose the field type that will store these integers using the smallest disk space.
If you take a look at the documentation for numeric field types, the field type that will allow us to store this data using the smallest disk space is `short`.
@timestamp
The value of this field will be displayed in the search results card. It will also be used to search for earthquakes that occurred within a chosen date range.
To do so, we will run `range` queries on this field, so we will assign the field type `date`.
Text field types
Let's go over the fields that contain string data types (`place`, `type`, and `url`).
By default, every field that contains a string gets mapped twice: as a `text` field and as a `keyword` multi-field.
Each field type is primed for different types of requests.
The `text` field type is used for full-text search. The `keyword` field type is used for aggregations, sorting, and exact searches.
In scenarios where you do not need both field types, the default setting is wasteful. It will slow down indexing and use up more disk space.
When deciding on a string field type, make sure you know what purpose the field will serve so you can choose the correct field type.
place
The field `place` will be used for three purposes.
- The value of this field will be displayed on the search results card.
- The field will be used for full-text search (when a user types in a location, the input will be searched against this field to retrieve relevant data).
- Aggregations will be performed on this field to yield a table of the 10 locations with the highest frequency of earthquakes (part 10).
Since we need to run both full-text search and aggregations on the field `place`, we will assign both field types, `text` and `keyword`.
type
The value of this field will be displayed on the search results card.
This field will also be used for exact searches: when a user searches for a specific type of quake, the user input is searched against the field `type` to retrieve relevant results.
Since the user is prompted to select a type from a list of options, we can perform exact searches on this field, and we will map it as `keyword`.
url
The value of the field `url` is only displayed on a card; it is not used for search.
Therefore, there is no need to create search data structures (inverted index or doc values) for this field, so we will disable it (`enabled: false`).
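Putting all of the decisions above together, the desired mapping can be sketched as follows. The exact index-creation request comes in part 6; this is just the mapping body.

```javascript
// The complete mapping we arrived at in this blog. Setting
// enabled: false on url skips building search data structures for it.
const earthquakeMapping = {
  mappings: {
    properties: {
      "@timestamp": { type: "date" },
      coordinates: { type: "geo_point" },
      depth: { type: "float" },
      mag: { type: "float" },
      place: {
        type: "text",
        fields: { keyword: { type: "keyword" } },
      },
      sig: { type: "short" },
      type: { type: "keyword" },
      url: { enabled: false },
    },
  },
};
```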
Summary
In this blog, we figured out:
- how we want to transform the retrieved data before ingesting it into Elasticsearch
- the desired mapping to efficiently store and search data in Elasticsearch
Move on to Part 6 to set up Elasticsearch for data transformation and data ingestion!