Hadoop Processing in AWS using EMR

Amazon EMR is the best solution for large-scale data processing using Hadoop or Spark in AWS. It is a managed service that makes it easy to deploy data processing clusters integrated with highly reliable, low-cost AWS storage services. Because it supports multiple versions of Hadoop and Spark, it simplifies the migration of processes previously developed in on-premises environments. Keepler uses Amazon EMR to process large volumes of information quickly: in data transformation and loading processes, in Data Lake transformations, in the calculation of indicators, and in mathematical and statistical models for machine learning.

What is Amazon EMR?

Amazon EMR is a managed AWS service for creating and scaling Hadoop clusters. In addition to MapReduce processing, EMR can run applications based on Spark, Impala, Presto, Flink, Hive, Pig and HBase. Amazon EMR also integrates with other AWS services that can act as data stores for the cluster, alongside or in place of HDFS: Amazon S3, Amazon Kinesis, Amazon Redshift and Amazon DynamoDB.
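As a rough illustration of what creating a cluster looks like in practice, the sketch below builds the request that could be passed to the `run_job_flow` call of the AWS SDK for Python (boto3). The cluster name, release label, instance types, region and bucket are assumptions chosen for the example, and the role names are the EMR defaults, which may not exist in every account.

```python
def build_cluster_request(name, release_label, core_nodes, log_uri):
    """Build keyword arguments for boto3's EMR `run_job_flow` call.

    All concrete values (instance types, applications, roles) are
    illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # selects the Hadoop/Spark versions
        "LogUri": log_uri,              # cluster logs are written to S3
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
            ],
            # Transient cluster: terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="eu-west-1"):
    """Submit the request to EMR. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("emr", region_name=region)
    return client.run_job_flow(**request)["JobFlowId"]
```

Calling `launch_cluster(build_cluster_request("demo", "emr-6.15.0", 2, "s3://example-bucket/logs/"))` would return the identifier of the new cluster. Building the request separately from submitting it keeps the configuration easy to review and test.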

Easy to use

It is a managed service that makes it easy to create and manage clusters of Hadoop servers without provisioning them manually.

Inexpensive to use 

It allows the use of transient servers and low-cost AWS storage services. The cost of the service depends on how much it is used and on the computing and storage characteristics required.
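The effect of paying only for what is used can be sketched with simple arithmetic. The function below assumes that each node is billed an EC2 hourly rate plus a separate EMR hourly fee, and that a Spot discount applies only to the EC2 part. Every rate and discount in the example is hypothetical, not an actual AWS price.

```python
def estimate_cluster_cost(nodes, hours, ec2_hourly, emr_hourly, spot_discount=0.0):
    """Estimate the cost of running a cluster for a given number of hours.

    ec2_hourly and emr_hourly are per-node hourly rates (hypothetical).
    spot_discount is the fraction saved on the EC2 rate, from 0.0 up to
    (but not including) 1.0.
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0.0, 1.0)")
    ec2_cost = nodes * hours * ec2_hourly * (1.0 - spot_discount)
    emr_cost = nodes * hours * emr_hourly
    return round(ec2_cost + emr_cost, 2)


# A 10-node transient cluster that runs 3 hours a day for 30 days...
transient = estimate_cluster_cost(10, 3 * 30, ec2_hourly=0.20, emr_hourly=0.05)
# ...compared with the same cluster left running around the clock.
always_on = estimate_cluster_cost(10, 24 * 30, ec2_hourly=0.20, emr_hourly=0.05)
```

With these made-up rates the transient cluster costs one eighth of the always-on one, simply because it runs for one eighth of the hours.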

Separation of Computing and Storage 

It lets you adjust processing capacity independently of storage, so the cost of scaling the service matches the specific needs of the workloads being executed.


Secure

It integrates with AWS security mechanisms, including virtual private clouds (VPC), security groups that limit access to the machines in a cluster, and data encryption in EMR-compatible AWS storage services such as DynamoDB and S3.


Reliable

It constantly monitors the cluster, retries failed tasks and automatically replaces instances that perform poorly. Amazon EMR clusters are highly available and fail over automatically when a node fails.


Monitoring and auditing

It integrates with AWS monitoring and audit services, enabling precise control of the health and performance of processes as well as a full audit trail of them.

Which use cases can be addressed with AWS EMR?

Big Data

Bank Operations Data Lake and Dashboarding

The digital bank wanted a holistic view of its main processes, including digital enrollment and the life cycle of its main products, such as mortgages, loans and credit cards. The information was dispersed and operational KPIs had not been defined. Amazon EMR was used to aggregate information from different sources and calculate complex KPIs.

Big Data
Data Science

Cloud Migration of Data Exploration Environment 

The exploration environment in an on-premises Cloudera installation had to be upgraded to accommodate more data and more users. Amazon EMR was used to migrate Impala processes from the on-premises Cloudera platform.

Big Data

Customer 360 Vision Platform

The client has a Big Data 360 platform where different use cases have been deployed (reporting and data science models), which are accessed by various business and project areas. With the deployment complete, the customer wants to reduce the cost of support services and platform evolution. Amazon EMR was used to optimize the ingestion and transformation of data, accelerating processing and making the data available in a Redshift-based Data Warehouse.

Big Data

Platform for Selling Data Solutions

The client has an on-premises platform based on relational database technology that cannot scale and lacks the flexibility to manage different data types. Amazon EMR was used to support data processes that previously ran on the on-premises relational database system. This made it possible to optimize the data transformation processes using Spark, improve data ingestion and processing, and accelerate the calculation of key indicators.

Benefits of the AWS EMR service

Migrating on-premises Hadoop

Amazon EMR supports a large number of popular data processing frameworks, such as Impala and Spark, which makes it easier to move existing on-premises Hadoop workloads to AWS.

Processing data in Data Lakes

Amazon EMR is the best solution for processing large volumes of data in S3 Data Lakes, whether to convert data to formats such as Parquet or to generate datasets of business value.
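A conversion to Parquet of this kind could look roughly like the sketch below. It assumes the raw data is CSV with a header row, that a PySpark script containing `csv_to_parquet` has been uploaded to S3, and that the job is submitted as an EMR step through `command-runner.jar`. The bucket, paths and step name are hypothetical.

```python
def csv_to_parquet(input_path, output_path):
    """Spark job body: read CSV files and rewrite them as Parquet.

    Meant to run on the cluster, where pyspark is available.
    """
    from pyspark.sql import SparkSession  # lazy import: only needed on the cluster

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
    df = spark.read.option("header", "true").csv(input_path)
    df.write.mode("overwrite").parquet(output_path)
    spark.stop()


def build_spark_step(script_uri, input_path, output_path):
    """Build an EMR step that runs the script with spark-submit."""
    return {
        "Name": "Convert raw CSV to Parquet",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                script_uri, input_path, output_path,
            ],
        },
    }


step = build_spark_step(
    "s3://example-bucket/jobs/csv_to_parquet.py",
    "s3://example-bucket/raw/",
    "s3://example-bucket/curated/",
)
```

The resulting dictionary could be passed in the `Steps` list of boto3's `add_job_flow_steps` call for a running cluster. The script at `script_uri` would be expected to read the two paths from its command-line arguments and call `csv_to_parquet`.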

Calculating key business indicators

Amazon EMR can precalculate indicators along multiple analytical axes, so that the business intelligence tool no longer has to perform those calculations while data is being visualized.
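The idea of calculating an indicator along several analytical axes ahead of time can be illustrated in plain Python. The sketch below sums a measure for every combination of axes, so a dashboard only needs to look up a stored total. The field names and figures are invented for the example; on a cluster the same aggregation would more likely be written with Spark.

```python
from collections import defaultdict
from itertools import combinations


def precompute_indicator(rows, axes, measure):
    """Sum `measure` for every combination of the given analytical axes.

    Keys have the form (axes_used, values), e.g.
    (("region",), ("north",)) or ((), ()) for the grand total.
    """
    totals = defaultdict(float)
    for size in range(len(axes) + 1):
        for combo in combinations(axes, size):
            for row in rows:
                key = (combo, tuple(row[axis] for axis in combo))
                totals[key] += row[measure]
    return dict(totals)


# Hypothetical sample data: amounts by region and product.
rows = [
    {"region": "north", "product": "loan", "amount": 100.0},
    {"region": "north", "product": "card", "amount": 50.0},
    {"region": "south", "product": "loan", "amount": 70.0},
]
cube = precompute_indicator(rows, ("region", "product"), "amount")

grand_total = cube[((), ())]                   # 220.0
north_total = cube[(("region",), ("north",))]  # 150.0
```

Because every slice is stored in advance, the visualization layer reads a single value instead of scanning the detailed rows each time a filter changes.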

If you want to make the move to the AWS public cloud, contact us and we’ll talk.