Amazon Redshift is the Amazon Web Services (AWS) service offering a fast, fully managed, secure and scalable data warehouse.
Built on massive parallel processing (MPP) technology, it allows for large-scale dataset processing and database migrations.
Amazon Redshift differs from other databases in that it can process analytical workloads on large datasets, stored via column-oriented database management (DBMS).
Amazon Redshift Architecture
- SQL endpoint.
- Stores metadata.
- Coordinates the execution of queries.
- Local and columnar storage.
- Execute parallel queries.
- Load, backup, restore.
Partitioned areas of the node’s memory and disk space where the allocated load is processed.
Two hardware platforms
Optimized for data processing.
DC: Dense Compute. SSD; scales from 160GB to 256TB.
DS: Dense Storage. HDD; scales from 2TB to 1.6PB.
Types of nodes offered by AWS
As shown in the previous diagram, Amazon Redshift offers two types of nodes, one oriented to higher computing capacity (DC2) and another oriented to storage (DS2).
The DS2 node types are optimized for large data workloads and use Hard Disk Drive (HDD) storage. Each node has a capacity of 2TB of storage, which allows for great storage in a local cluster using nodes of this type.
The other DC2 node types are optimized for high performance workloads. Their use of solid state drive (SSD) storage offers faster I/O (input/output) than DS2 nodes, but they provide less storage space, as each node has a capacity of 160GB.
In both types of nodes we can choose two varieties: “(x)large” and “8xlarge”. Clusters with “large” type nodes consist of between 1 and 32 nodes, while “8xlarge” nodes consist of a minimum of 2 nodes up to 128. This allows our Redshift cluster to be scaled according to our needs, always keeping the focus on the size of the dataset and the desired query speed.
Data Modeling in Amazon Redshift
To take full advantage of the benefits offered by Amazon Redshift as a distributed columnar database, it is very important to define the data model correctly. In general, as a relational database, modeling is conceptually similar to other more traditional databases. Nonetheless, it is very important to define certain parameters in Amazon Redshift. In addition, we need to change the way we view data collection, although that will be discussed in more detail in the next section.
In this section we will discuss the main parameter keys to get the most out of our data warehouse at Amazon Redshift:
- Distribution keys (dist key): these allow you to tell Amazon Redshift how your data is grouped so that distribution between nodes is more effective.
- Sorting keys (sort key): These allow you to tell Amazon Redshift how you sort your data, making sorting more efficient.
- Column compression: This tells Amazon Redshift how to compress columns, resulting in better use of space.
- Constraints: Allows you to tell the Amazon Redshift planner the features of your model to improve your consultation plans.
The first three concepts revolve around the way Amazon Redshift stores data, while the last deals with the way Amazon Redshift queries the data.
When a table is loaded into Amazon Redshift, it is distributed among nodes based on the distribution style. Later, when performing a query, the planner defines which data blocks have to be moved between nodes to generate the result. There are two common problems associated with this movement: a performance problem, as a large amount of data has to be moved; and a full disk issue, as the query exceeds the storage capacity of the node.
Due to these characteristics, it is very important to define a dist key that distributes the data in the most efficient way. And with this in mind, what kind of distribution styles does Amazon Redshift offer? Only three. One of these is ALL, which distributes a copy of the table in all nodes, and as we said it is recommended for widely used tables with very few records (such as dimension or look up tables). Another distribution style is EVEN, defined by default, which distributes the table among all nodes in a round-robin mode, that is, in equal chunks. This is useful when there is no clear distribution key. And the crown jewel is distribution style KEY, recommended for fact tables of high volume and which allows the user to define which column acts as the distribution key.
A good distribution key is usually chosen so that it meets, in part or in full, the following conditions:
- Uses a column that is frequently used to join different large tables (by means of a join).
- Considers how the data volume will look after filtering ie it is better to select columns that segment the data well.
- Gives high cardinality after filtering. In other words, you should avoid columns that appear to segment data well, but when joined with other tables, the resulting volumes are always unbalanced (some categories are very heavy and others less so).
- In dimension tables (with few records) it is preferable to use the ALL distribution.
In order to optimize the planner-generated query plan, it is also important that the data is arranged in the way they are usually accessed. This means that the query does not have to continually retrieve data that are physically far from each other. Amazon Redshift uses the sort key for this purpose.
Usually you select the column by which it is sorted. Based on this column, the data is stored and re-sorted each time a Vacuum is performed. For example, if you usually access the most recent data, a good sort option would be the timestamp column.
In case you join a column very often, it may be useful to select that column as a sort and distribution key. This allows Amazon Redshift to skip the sorting step and merge the tables directly, so improving performance.
However, it should be noted that Amazon Redshift allows multi-column sorting in two modes: compound and interleaved. In the first case, the set of columns by which it is sorted is in turn sorted (the first column indicated has more weight than the second, and so on). In the second case, however, all sorting columns have the same weight. For more information about sort keys, we recommend that you consult the documentation.
Column compression allows Redshift to store data in less space, thus increasing the amount of data that can be stored. Although compression can be defined during the creation of the tables, it is recommended to run the Amazon Redshift analysis (ANALYZE COMPRESSION) to automatically evaluate which compression is best.
Although Amazon Redshift stores compressed data, this is not the case when running queries. The temporary tables generated as a result of the query plan are not compressed and because of this, it is important to define the columns with the smallest possible size in Amazon Redshift eg avoid setting very large VARCHAR columns as is the case with other databases.
Definition of constraints
The constraints in Amazon Redshift (unique, primary key) are merely informative ie if a column is defined as unique, Amazon Redshift does not guarantee that it will be (nor does it check, in fact). However, it is very helpful to define them, since these limitations are taken into account when defining the query plan.
For example, when consulting a SELECT DISTINCT on a column marked as UNIQUE, the planner saves the execution of DISTINCT since that column is unique. An important detail is that, as mentioned previously, Amazon Redshift does not check the constraints. So in the previous example, if there were duplicates, what would happen? Indeed, our SELECT DISTINCT would have some brand new duplicates.
Consultations in Amazon Redshift
Due to the columnar distribution method of storing data, there are certain points to bear in mind when running queries on Amazon Redshift. Amazon Redshift is a database designed for analytics. This is one of the main reasons why storage is columnar. And as a result, it is not recommended to undertake SELECT *, but instead to select exactly the columns you want to retrieve. Having columnar storage makes it much more expensive to access the entire record than just some columns. It is also preferable to use the CASE statement when performing complex aggregations, rather than selecting the same table many times.
With regard to joining and filtering tables, it is not advisable to use cross-joins, unless it is essential. It is by far the least efficient operation and the one that has the most impact on performance. In addition, the use of filters that limit the number of records is recommended as far as possible, especially when joining, even if they are applied again later. This allows the planner to scan only the part of the table that will actually be used, avoiding accessing many disk blocks. Furthermore, it is preferable to perform filtering based on native comparisons, rather than on complex statements such as LIKE.
Finally, in terms of grouping and sorting, it is convenient to group using the sort keys defined above, to make aggregation more efficient. The order of the columns in the sort clause, if used in conjunction with a grouping clause, must be the same.
Extra ball: Amazon Redshift Spectrum
Amazon Redshift Spectrum allows only compute nodes to be employed in a local cluster, and the storage in other external sources (like S3). This makes the volume of data that can be stored much greater. There are some disadvantages, such as lower performance since access is not direct to the cluster nodes but must be connected to external sources; or that the compression and the way of storing the data is much more important.
It is a solution for organisations that have such a large volume of data that they are unable to store in Amazon Redshift, but want to make use of its benefits.
We will talk about this functionality in more detail in future posts.
We hope this in-depth review of Amazon Redshift will be a useful guide for those of you seeking to better understand how this AWS service works or to get started. If you would like to add anything, you can leave your comments below or contact us. We will be delighted to exchange opinions.
Image: pixbay.com + aws