Amazon Redshift is a great offering by AWS and it is one of the most popular and fastest cloud data warehouses in the market right now. It gives you a great performance, can be scaled, is secure and easy to manage. The connectors and queries are almost identical to Postgres hence it becomes very easy to query the data. Today we will take a look at the architecture of Redshift and understand how it achieves performance and scale even at tera and peta byte levels.
Architecture
AWS Redshift architecture Taken from https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
The above diagram is the architecture of AWS Redshift. We will start from the top and move towards the bottom and understand each part.
- Leader Node: The leader node is responsible for the communication with the connectors. It abstracts the underlying architecture of the compute nodes. When a connector sends a query to the Leader node, it decides on which compute node to fetch the results from and after fetching the results it aggregates the data and sends the query result to the connector. Similarly It decides which data will be stored in which compute node.
- Compute Node: The compute node is responsible for storing and retrieving data. Each compute node has its own CPU, memory and disk. When it receives instruction on what needs to be fetched it uses Node slices to fetch the data. The amount of CPU, memory and disk is dependent on the node type and in order to handle the scale we can easily increase or decrease the amount of CPU,memory and disk by changing the node type providing us with a truly scalable data warehouse.
- Node Slices: Node slices are the smallest part of this architecture. It is not very clear from the AWS docs but imagine them as a way to parallely fetch/store the data. Each node slice is allocated a portion of the node’s memory and disk space where it can process the workload which is assigned to it. Which node slice will process the workload is determined by the leader node.A single query workload can be distributed across node slices in different nodes each one working parallely. The number of node slices per node depends on the node type.
- Cluster - This entire interconnection of a Leader and compute node over a high speed network is called a cluster. One thing to note here is the leader node is created only when there are more than 1 compute nodes in a cluster. So if there is only a single node, the cluster is basically the node itself.
- Network - The network on which the leader and compute node communicates is a high speed private network abstracted from the client. AWS uses close proximity and custom communication protocols to achieve the said high speed network.
How data is stored in Redshift
Now we will look into some of the concepts and components used to store the data in a particular way.
Columnar Storage
Columnar storage is where we store columns together instead of rows. So for example If there is a data like this
A typical relational database would store store the entire row in a block something like this
But in Columnar Storage, data of a single column is combined and stored in blocks something like this. Of course each column will have its set of blocks.
Why would someone use columnar storage? In traditional RDBMS you would be required to create indexes on the fields which your queries need to filter or group by for faster execution. But for that you would be required to know all the queries needed beforehand. Consider an extreme case where you create an index for every column, that would just mean replicating all the data in the table again in indexes. Redshift does the same thing; it does away with the actual row data and stores data in a way very similar to an index for every column of data. Redshift thus removes row-level work by using columnar storage and only performs input and output (I/O) operations for the columns required by a given query.
Number of blocks accessed in row level storage
Number of blocks accessed in Columnar storage
The above illustrations shows suggestive comparisons between traditional row level and columnar storage on how many blocks are accessed if we select only a single column in the data.
Compression
The columnar storage changes the way we can compress the data. It opens up a lot of improvements in compression when compared to traditional row level storage. When each block contains data of the same type and similar values, compression becomes more efficient. Moreover we can use different compression techniques on different columns since they are not stored together with other data types. This is known as Column encoding in Redshift. This is not to be confused by the character set that a column uses, Amazon Redshift always stores data in UTF-8 and unicode. Amazon Redshift can choose a compression technique for you or you can select from any of these options. When compression becomes efficient it increases your I/O throughput.
Massively Parallel Processing
Amazon Redshift taking advantage of compute nodes and node slices can perform parallel processing at a massive scale, hence the term Massive parallel processing. But there is a catch to it, You need to design your tables in such a way that they can take advantage of this parallel processing. This is very critical when you have a large dataset. Amazon Redshift uses distribution style to determine the node slice where the row will be saved.
Distribution Style
Amazon Redshift uses 4 distribution styles to determine the data distribution among the node slices. One thing not to be confused with is the distribution style is defined at a table level and not at a database level. Let’s take a look at them.
- ALL - This is the most simple distribution type. You replicate your table data on all the nodes. This might sound a little non-optimal at first but imagine a small table requiring join with other tables all the time, instead of fetching and transferring data from different compute nodes it makes sense to keep a copy of it in all the nodes. This is not optimal for large datasets.
Recommended for: Small tables requiring join with most tables.
- EVEN - This distribution tries to evenly assign the table rows to all the nodes in a cluster. This distribution style is recommended when the table is queried individually and not with joins. When queried individually it makes maximum use of parallelizing load since each node has data and the execution time is reduced. When queried with a join on other tables, rows may be matched on all the nodes increasing the network traffic which in turns impact query performance. The only exception is that it can be joined with a table with ALL distribution since that table will have data on all nodes the query performance is not compromised.
Recommended for: Tables queried individually or joined with tables having distribution type ALL.
- KEY - This distribution assigns rows among nodes based on the value of a column or columns. Amazon Redshift will make sure that rows with the same value of that column will reside in the same node. When you want to optimize the join between 2 large tables you can use this distribution on the common column or columns so that both tables are co-located on the same nodes.
Recommended for: Large tables having joins together and needs optimization.
- AUTO - This distribution style starts with ALL to the table and then the style is switched to EVEN as the table grows and it does not switch back to ALL when the size reduces.
Recommended for: When you are not sure of which distribution style to use and are just starting.
One additional thing to note is that distribution style is not set in stone. The amazing thing about amazon redshift is that a simple table rewrite can change the distribution style of the table.
Sort Key
Sort keys are used to determine how rows in a table are sorted. When used properly it helps the query optimizer to read fewer chunks of data which improves query performance. When storing the data a set of metadata is created called zone maps. This zone map contains the min-max range of the values stored in each 1 MB block. Zone maps are used during query processing to find the relevant blocks before disk scan. If the columns are not sorted on the disk the min and max range can overlap and for a particular value so there can be 2 or more blocks returned for a query. There are two srot types in Amazon redshift
Compound Sort Key - This is the default sort type. You can specify one or more columns as the sort key. The first column specified has the highest weightage and the weightage keeps on decreasing with the next columns. So use compound sort key when your query includes joins, group by, order by and partition by and your table size is small.
Interleaved Sort Key - In this sort type Amazon Redshift gives equal weightage to each column selected as sort key. Hence the performance improved significantly when the query uses equality operator in where clause on secondary sort columns. Adding rows to an already sorted column affects performance hence vacuum and analyze operation should be used regularly to re-sort the data and the zone maps. So use interleaved sort keys when you have only one column as sort key or when there is an equality operator in queries for the secondary column or when the table is huge. It is not recommended to use interleaved keys for monotonically increasing attributes like dates, timestamp, auto increment ids.
I hope this article will help you in better understanding Amazon Redshift. Do let me know how you like this article.
Top comments (0)