Not long ago we organised a cloud introduction meetup with the idea of bringing this vague concept into focus for people. During the meeting we answered many questions about the cloud and the main providers, but the concept that attracted the most attention was Big Data.
So what is this Big Data? Far from wishing to embark on a theoretical post about definitions, advantages and disadvantages, I would like to give a brief definition and then move on to the mechanics.
Big Data itself is nothing more than, as the term suggests, large volumes of data; but the important thing is not so much what data we have as how we take advantage of it. What at first glance appears unrelated or of little value can become a rough gem if well polished. So how do we set about polishing it?
Generally, the process consists of the following phases:
Ingestion: often this data is scattered across the various storage systems that support our processes and applications. The first step, therefore, is to build a single point where all of this data coexists and which becomes our source of truth.

Transformation: most probably this raw data is full of noise and/or elements that add no value to the specific use case of the solution. By refining the data we improve the performance of the different processes that exploit it.

Analysis: this consists of exploring the data in order to draw conclusions from it. Given the amount of data being handled, specific tools and components are needed to carry out this phase.

Visualisation: finally, once the data has been refined, it is presented in a way that facilitates interpretation.
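These phases can be sketched end to end on toy, in-memory data — a minimal illustration of the flow, not tied to any particular service (record fields here are invented for the example):

```python
# A toy end-to-end sketch of the four phases on in-memory data.
# In a real pipeline each function is replaced by a managed service.

raw_events = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": None},   # noise: missing value
    {"user": "a", "amount": "4.5"},
]

def ingest(sources):
    """Phase 1: bring scattered records into one place (our 'source of truth')."""
    return list(sources)

def transform(records):
    """Phase 2: drop noise and cast fields to useful types."""
    return [
        {"user": r["user"], "amount": float(r["amount"])}
        for r in records
        if r["amount"] is not None
    ]

def analyse(records):
    """Phase 3: draw conclusions -- here, total spend per user."""
    totals = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

def visualise(totals):
    """Phase 4: present the result in an easily interpreted form."""
    return "\n".join(f"{user}: {total:.2f}" for user, total in sorted(totals.items()))

report = visualise(analyse(transform(ingest(raw_events))))
print(report)  # a: 15.00
```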
Now that we know how things work, let’s analyse the different phases using a practical example.
We will consider a typical architecture known as the Big Data Lambda Architecture. When processing large quantities of data, we often have to compromise between the speed at which data is made available (real time) and its accuracy (scheduled loads). This architecture covers both use cases, as it is able to make data productive in real time as well as through scheduled loads.
There are several ways to perform real-time ingestion, but in this case we used Kinesis Streams and Firehose, the services AWS recommends for use cases such as these. Kinesis Data Streams is a service that allows data to be collected continuously and in a highly scalable manner. When we want to join this data with the data loaded on a scheduled basis, we use Kinesis Firehose, which packages the data so that it can be delivered to the datalake.
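As a rough sketch of what a producer looks like (the stream name and event fields are hypothetical), boto3 lets us push each event into the stream with `put_record`:

```python
import json

def build_record(event: dict) -> dict:
    """Serialise an event into the shape Kinesis Data Streams expects.

    The partition key controls shard assignment; using the user id keeps
    one user's events ordered within a single shard.
    """
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event["user_id"]),
    }

def send(event: dict, stream_name: str = "clicks-stream") -> None:
    """Push one event into the stream ("clicks-stream" is a hypothetical name).

    Requires AWS credentials; boto3 is imported lazily so the serialisation
    logic above can be exercised locally without it.
    """
    import boto3  # AWS SDK for Python
    kinesis = boto3.client("kinesis")
    kinesis.put_record(StreamName=stream_name, **build_record(event))
```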
In the case of scheduled ingestion we can use services such as AWS Glue or AWS Batch, or AWS Data Migration Service if we want to replicate data from systems such as RDBMSs, data warehouses or NoSQL databases. It is worth analysing the different data sources in order to select the best service for each case.
When it comes to real-time data transformation in this AWS architecture, Kinesis Firehose is king. We have already seen how Firehose allows data to be packaged at intake so that it can be joined with data ingested via scheduled processes; however, Firehose can do more than just package data: it can apply real-time transformations whose output feeds either conventional analytics and visualisation tools or real-time services such as Kinesis Analytics.
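The transformation itself is typically a Lambda function that Firehose invokes on each batch of records. A minimal sketch, assuming JSON events with a `user` field (the field name and the drop rule are hypothetical); Firehose hands records over base64-encoded and expects each one back with a `recordId`, a `result` and the re-encoded `data`:

```python
import base64
import json

def handler(event, context):
    """Transformation Lambda invoked by Kinesis Data Firehose on each batch.

    Decodes each record, drops events without a user, appends a newline so
    the files delivered to S3 are line-delimited JSON, and re-encodes.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if payload.get("user") is None:
            # Dropped records are excluded from delivery but must still be
            # acknowledged by recordId.
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
            continue
        transformed = json.dumps(payload) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```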
Using S3 as a datalake allows us to reduce costs (it is not necessary to keep all data permanently in warehouses such as Redshift), while decoupling the architecture, as we do not always want to use the same technologies for all data.
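A common convention when laying the datalake out in S3 is Hive-style date partitioning, which later lets engines such as Athena or Glue read only the partitions a query needs instead of scanning everything; a small sketch (the prefix layout and names are hypothetical):

```python
from datetime import datetime, timezone

def datalake_key(dataset: str, event_time: datetime, filename: str) -> str:
    """Build a Hive-style partitioned S3 key, e.g.
    raw/clicks/year=2019/month=06/day=03/part-0001.json,
    so query engines can prune by date at read time."""
    return (
        f"raw/{dataset}/year={event_time.year}"
        f"/month={event_time.month:02d}/day={event_time.day:02d}/{filename}"
    )

key = datalake_key("clicks", datetime(2019, 6, 3, tzinfo=timezone.utc), "part-0001.json")
print(key)  # raw/clicks/year=2019/month=06/day=03/part-0001.json
```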
If we choose to accumulate data in the datalake (for both real-time and scheduled loads), we have several alternatives for treating this data and preparing it for later production. These alternatives (AWS Batch, AWS Glue, AWS EMR) target different use cases and levels of customisation and administration, providing a wide range of possibilities to cover different clients’ needs. In cases where the transformation is not too demanding in terms of CPU, memory or time, it is even possible to carry out this phase with AWS Lambda, in a pattern known as micro-batching.
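Micro-batching with Lambda might look like the following sketch: an S3 put triggers the function, which refines the small object and writes the result back under a different prefix (the bucket layout, prefix and field names are hypothetical):

```python
import json

def refine(lines):
    """The actual micro-batch transformation: keep only valid JSON lines
    that carry an 'amount' field, discarding noise."""
    out = []
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # not JSON: noise
        if "amount" in rec:
            out.append(rec)
    return out

def handler(event, context):
    """Lambda entry point for an S3 put notification.

    boto3 is imported lazily so refine() stays testable without AWS.
    """
    import boto3
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        refined = refine(body.splitlines())
        s3.put_object(
            Bucket=bucket,
            Key="refined/" + key,  # hypothetical output prefix
            Body="\n".join(json.dumps(r) for r in refined).encode("utf-8"),
        )
```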
The following example applies transformations with AWS Glue, a service that lets us exploit Spark (in Python or Scala) in a fully managed way, and that also offers some canonical transformations in its catalogue.
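Glue scripts themselves run inside its managed Spark environment, but the kind of canonical rename-and-cast that its ApplyMapping transform performs can be sketched locally in plain Python. The field names below are hypothetical, and the 4-tuples mirror ApplyMapping's (source, source type, target, target type) shape in simplified form:

```python
def apply_mapping(records, mappings):
    """Local sketch of Glue's ApplyMapping: each mapping is a 4-tuple
    (source_field, source_type, target_field, target_type); fields are
    renamed and cast in one pass, and unmapped fields are dropped."""
    casts = {"int": int, "double": float, "string": str}
    out = []
    for rec in records:
        new = {}
        for src, _src_type, dst, dst_type in mappings:
            if src in rec:
                new[dst] = casts[dst_type](rec[src])
        out.append(new)
    return out

rows = [{"userId": "42", "total": "19.90", "debug": "x"}]
mapped = apply_mapping(rows, [
    ("userId", "string", "user_id", "int"),
    ("total", "string", "amount", "double"),
])
print(mapped)  # [{'user_id': 42, 'amount': 19.9}]
```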
When it comes to analytics, several AWS tools and services come into play. The example above shows a Redshift database, which allows us to run analyses with PostgreSQL-compatible SQL queries and take advantage of this technology’s great computing capacity. Redshift and Athena allow us to analyse data that has already been transformed, while for specific real-time cases we can use Kinesis Analytics, which lets us run queries on data as it is received.
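For the Athena case, queries can also be submitted from code via boto3; the table, column and result-bucket names below are hypothetical:

```python
def spend_per_user_sql(table: str, since: str) -> str:
    """Build the analytical query (hypothetical table and column names)."""
    return (
        "SELECT user_id, SUM(amount) AS total "
        f"FROM {table} WHERE event_date >= DATE '{since}' "
        "GROUP BY user_id ORDER BY total DESC"
    )

def run_on_athena(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena; results land in the given S3 location.

    Requires AWS credentials; boto3 is imported lazily so the query
    builder above can be tested without it.
    """
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return resp["QueryExecutionId"]

sql = spend_per_user_sql("events", "2019-01-01")
# run_on_athena(sql, "datalake_db", "s3://my-athena-results/")  # hypothetical names
```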
The end goal is to be able to interpret this data quickly and easily in order to draw conclusions that will help us improve our business. To meet this need, Amazon Web Services offers a service called QuickSight, which is easily integrated with AWS storage services. This service allows us to quickly build data dashboards enabling us to make the best decision in each situation, whilst only paying for the sessions used.
The result is an architecture built from interchangeable, reusable blocks that allows data to be received and processed both in real time and on a scheduled basis. And because processing capacity is unified, a single processing point can serve both cases.
Image: unsplash | Nate Grant