T3chFest 2019 is approaching and in a preliminary warm-up event on February 25th, I had the pleasure of giving a talk about the Data Analysis Lifecycle. This preliminary event was organised jointly with Developer Circles at the Facebook offices in Madrid.
In the talk we shared views on the different phases that make up a data analysis cycle, emphasising the importance of each individually and how these stages are related to each other throughout the analytical process. In this article we will review the most important points from this preliminary meeting. Let’s start!
The life cycle of data analysis is divided into the following stages:
- Collection of data requirements and sources
- EDA (Exploratory Data Analysis)
- Data model and architecture
- Data visualization
1. COLLECTION OF DATA REQUIREMENTS AND SOURCES
We start from the scenario in which a product owner needs to analyse a set of data that is available from several sources.
This first stage requires the participation of a Data Analyst, who will lead the analysis and visualization of the data, and a Data Engineer, who will be responsible for the transformation and input into the data model.
The data can be made available via different types of sources or origins: text files in CSV or tabular format (which could be accessible through cloud repositories such as Amazon S3), relational databases (PostgreSQL, SQL Server, MySQL), analytical databases (Redshift, BigQuery) or non-relational databases (MongoDB, Cassandra).
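As a minimal sketch, these heterogeneous sources could all be loaded into the same analysis environment with pandas (the file contents and connection details below are illustrative, not from the talk):

```python
import io

import pandas as pd

# A CSV source; in practice the path could point to a cloud repository,
# e.g. pd.read_csv("s3://bucket/athletes.csv") with the s3fs package installed.
csv_data = io.StringIO("athlete,year\nAlice,2016\nBob,2012\n")
df_csv = pd.read_csv(csv_data)

# A relational source would be read in a similar way, e.g.:
# df_sql = pd.read_sql("SELECT * FROM athletes", connection)

print(df_csv.shape)  # (2, 2)
```

Whatever the origin, the goal is the same: get every source into a common tabular structure before exploration begins.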
Having noted the analysis requirements and ensured the different sources for analysis are available, we can get down to work and start transforming and ingesting the data from the sources into a data model for later visualization, right? Wrong.
We must be skeptical from the outset about the requirements, the sources and even the analysis/problem itself. Questioning the format, structure and quality of the sources, and reviewing the approach and requirements of the exercise, is what leads to a robust, high-quality analysis. However, being skeptical does not imply distrust: it is fundamental that the Product Owner, Data Analyst and Data Engineer work together from the beginning, communicatively and transparently, through all stages of the cycle to achieve a successful analysis.
How are we going to work through this skepticism? By means of EDA (Exploratory Data Analysis).
2. Exploratory Data Analysis (EDA)
This stage is fundamental to executing a good analysis. Using exploratory analysis techniques, we will analyse the different sources, measure the quality and consistency of the data and confirm (or otherwise) that the analysis proposed by the product owner is correct or makes sense. In addition, we can propose other complementary angles of analysis that will add value to the whole exercise.
This will enable us to be efficient and discover very early on in the cycle if the analysis is not correctly set out or if the sources are inconsistent or incomplete. This means that we will not need to wait to ingest the transformed data into a data model or to visualise the data in this model to realise that something is not right (with all the costs that this would entail).
How do we perform this EDA? Using languages like R or Python and platforms like RStudio or Jupyter Notebook.
This exercise will be divided into two parts: firstly, the quality/consistency of the data will be measured by means of data cleansing techniques, and secondly, we will check whether the analysis approach is correct by means of visualizations/plots from libraries such as matplotlib (Python) or ggplot2 (R).
Inconsistencies and formatting errors in the data are classified into two groups:
- Data with format errors: NAs, wrong data, wrong formats, constants, duplicates, wrong categorisation.
- Data with intuitive errors: outliers, erroneous segmentation. These are data that contradict the initial approach, scenario or hypothesis of the analysis.
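As an illustration of both groups, a few of these checks might look like this in pandas (the DataFrame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical raw data containing typical errors
df = pd.DataFrame({
    "athlete":   ["Alice", "Bob", "Bob", None],     # a missing value
    "year":      [2016, 2012, 2012, 2016],
    "height_cm": [170.0, 185.0, 185.0, -1.0],       # -1 is intuitively wrong
})

na_count = df["athlete"].isna().sum()        # format error: NAs
dup_count = df.duplicated().sum()            # format error: duplicate rows
bad_heights = (df["height_cm"] <= 0).sum()   # intuitive error: impossible value

print(na_count, dup_count, bad_heights)  # 1 1 1
```

Counting these problems per column early gives a quick, quantified picture of source quality before any modelling work starts.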
The context of the data will be checked using plots to see if data are aligned with the analysis.
As a simple example, consider a visualization of a dataset with historical information about the Olympic Games. The exercise may initially have been planned as an analysis of the role of men and women throughout the history of the Games, devised on the assumption that there has always been participation by both sexes. However, a ggplot2 chart of participation by year shows that women only started to participate in the Games from 1912, contradicting the initial premise, so the exercise would need to be reconsidered.
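The same sanity check can be done numerically before plotting anything. A sketch with a tiny synthetic sample standing in for the real Olympics dataset (column names are assumptions):

```python
import pandas as pd

# Synthetic stand-in for the historical Olympic Games dataset
games = pd.DataFrame({
    "Year": [1896, 1900, 1912, 1912, 1936],
    "Sex":  ["M",  "M",  "M",  "F",  "F"],
})

# Earliest year with female participation
first_female_year = games.loc[games["Sex"] == "F", "Year"].min()
print(first_female_year)  # 1912

# The equivalent visual check, in the spirit of the ggplot2 chart:
# games.groupby(["Year", "Sex"]).size().unstack().plot(kind="bar")
```

A single aggregate like this is often enough to confirm or refute the premise the product owner started from.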
What did we achieve through the EDA?
- We analysed the different data sources.
- We cleaned and normalised the data: we improved data quality.
- We determined whether the analysis is viable or not.
- We went deeper into the data: we added value by proposing a complementary, "2.0" analysis.
- We increased our knowledge of the data and of its context.
- We paved the way for defining the model and its architecture.
Great, isn’t it?
Having taken note of the analysis to be carried out and its requirements, we have explored, cleaned and studied the sources available to us and can now move forward to the next stage: the data model.
3. DATA MODEL AND ARCHITECTURE
Depending on the amount of data to be exploited, type of analysis to be performed and nature of the sources, we must now choose: our data model, the database we will consult for information and the architecture on which the model will be based.
Our model can be relational (star or snowflake schema), focused on analysis with a low volume of data, or columnar if our analysis is on a large scale (terabytes of data).
Analysis on relational models is usually done through an OLTP (OnLine Transaction Processing) database such as MySQL or PostgreSQL. In order to analyse large volumes of data we must use analytical databases (OLAP, On-Line Analytical Processing) such as Amazon Redshift or Google BigQuery.
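As a minimal sketch of the kind of aggregate query such databases answer, here is the Python stdlib's sqlite3 standing in for an OLTP engine like MySQL or PostgreSQL (the table and its contents are invented for illustration):

```python
import sqlite3

# In-memory database standing in for an OLTP server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, golds INTEGER)")
conn.executemany(
    "INSERT INTO medals VALUES (?, ?)",
    [("ESP", 7), ("USA", 46), ("ESP", 3)],
)

# A typical analytical aggregate: total golds per country, highest first
row = conn.execute(
    "SELECT country, SUM(golds) FROM medals "
    "GROUP BY country ORDER BY 2 DESC"
).fetchone()
print(row)  # ('USA', 46)
```

On an OLAP engine such as Redshift or BigQuery the SQL would look much the same; what changes is the columnar storage and the scale the engine can handle.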
Depending on volume of data, there are three types of architecture: Datamart, Datawarehouse (DWH) and Datalake:
- A Datamart is a structure built on a low-volume relational model that we exploit through OLTP databases.
- A DWH is a set of interrelated Datamarts (in a single database) whose volume requires analysis through an analytical database.
- A Datalake is an infrastructure holding a large amount of data that may or may not be structured. Although Datamarts and DWHs can coexist within a Datalake, its main foundation is a file system; an example of such a file system is AWS's S3 repository.
From an analytical point of view, working with cloud platforms such as AWS, Google Cloud Platform or Microsoft’s Azure has the following advantages:
- Cost per use: cost control, you only pay for the infrastructure you actually use.
- Privacy and security: physically very secure platforms where responsibility is shared with the provider. The user is responsible for managing access security and data distribution (data governance).
- Scalability: more or less computing power is available according to the volume of data on which to perform the analysis, in many cases automatically.
We have already transformed and ingested our data in the model and defined architecture. Next step: data visualization.
We can choose one of three methods to analyse and visualise our information:
BUSINESS INTELLIGENCE (BI) TOOLS
This refers to commercial Business Intelligence tools (such as MicroStrategy, Power BI or Tableau) that can carry out analysis at different levels of detail. Designed for all types of users, whether from a technical or business background, these tools come as all-in-one suites with which we can work without having to make custom developments in the first instance. It is therefore a somewhat "rigid" model, although these tools are increasingly complete and offer more and more functionality.
Depending on the level of analysis and expertise of the end user, visualizations may be defined in three groups:
- Dashboards: High level analysis for all types of users.
- Documents/Reports: Intermediate level of analysis with more detailed information. Medium expertise of the tool and data/business model is required.
- Self-service: focused on custom analysis of our data model/environment. A high level of knowledge of the tool, the model and the data context is required.
DO IT YOURSELF (DIY) MODEL
In contrast to BI tools, the DIY model focuses on ad hoc visualizations that require full custom development (e.g., D3.js). It is designed more for a particular web front end or one-off analysis than for a query-focused report whose data evolves over time.
HYBRID MODEL
Finally, there is a methodology for carrying out visualizations and analyses that lies somewhere between the two approaches above. With libraries like Shiny from RStudio, we can develop custom interactive visualizations in a platform environment through which we can manage sources and users.
AND NOW, WHAT?
Although we have successfully reached the end of our analysis and answered the questions that the product owner initially asked us, the analysis itself need not have reached its end.
In the future, new data sources could become available, changing the initial scenario at every level, or the focus of the analysis could shift to a new data context, directly impacting some of the phases and reactivating the whole process.
In short, an analysis can be dynamic, changing and evolving over time.
On March 15th we'll be at T3chFest! Don't miss the talks by our fellow data scientists:
- Ramiro Manso: Understanding your own predictive models
- Nicolás Forteza: Resources for the super data scientist, turning anomalies into value