Every phase of analysis matters, but data visualization (dataviz for short) plays a fundamental role across the whole data analysis life cycle. There are two main reasons why we rely on graphs or plots throughout the process:
1 – Exploratory Analysis: Visualizations through which we look for insights and check data quality or completeness. The data analyst uses these graphs to answer questions and deepen their view of the raw data, confirm or reformulate hypotheses, and enrich the analysis or approach it from different angles. These visualizations therefore do not have to be perfect or aesthetically appealing.
2 – Explanatory Analysis: Conclusions or results are presented, so these visualizations have to support our message and convince the audience of the work undertaken. In this case visualizations must show our conclusions or insights clearly, and be visually powerful and precise.
In the EDA (Exploratory Data Analysis) stage, for data cleansing and exploration tasks, we use the exploratory visualizations described in point 1. To present results we produce explanatory graphs, and in the intermediate phase of pure analysis we use both exploratory and explanatory visualizations.
BEFORE WE GO ON…
Two fundamental principles help us judge whether our visualizations are good or not. They may seem obvious, but we must keep both on our radar:
- The graphics must convey the desired message.
- The information shown in the visualizations must not be misleading.
Those of us who work with data must be rigorous, objective and honest when presenting information. For example, encoding results as area (2D) or volume (3D) in a pie chart can be misleading:
In this case, displaying information in a 3D pie chart, with multiple categories and manual sorting does not help the transparency of our desired message.
Depending on the variables to be analyzed, we will opt for one type of display or another. There are two types of variables:
QUALITATIVE OR CATEGORICAL (NOT NUMERICAL)
Nominal data: labels without inherent order; no label is intrinsically greater or less than any other. Country (Finland, Israel, Belize), Gender (Male or Female) or Profession (Teacher, Potter, Engineer) are examples of nominal data.
Ordinal data: labels with an intrinsic order or ranking; values can be compared, but the magnitude of the differences between them is not well defined. An example would be a satisfaction scale: Very Dissatisfied – Dissatisfied – Neutral – Satisfied – Very Satisfied
QUANTITATIVE (NUMERICAL)
Interval data: numerical values where absolute differences are meaningful (addition and subtraction can be performed), but there is no true zero. Temperature in degrees Celsius is an example.
Ratio data: numerical values where relative differences are also meaningful (multiplication and division can be performed), because zero means the absence of the quantity. Weight and price are examples.
All quantitative variables are in turn divided into two types: discrete and continuous.
Discrete quantitative variables can only take specific, separate values, typically whole numbers. For example: Number of pets a person has: 2, 5, 6 or more.
Continuous quantitative variables can (hypothetically) take values at any level of precision; between any two values there can always be another one (decimal values). Example: a person’s height (1.72m, 1.719m, 1.7186m…).
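To make these variable types more concrete, here is a minimal sketch using only Python's standard library. The `Satisfaction` enum and all sample values are hypothetical and exist only for illustration; the integer codes are arbitrary and only their order matters.

```python
from enum import IntEnum


class Satisfaction(IntEnum):
    """Ordered labels: the order is meaningful, the distances are not."""
    VERY_DISSATISFIED = 1
    DISSATISFIED = 2
    NEUTRAL = 3
    SATISFIED = 4
    VERY_SATISFIED = 5


# Ordered labels can be compared with each other...
is_happier = Satisfaction.SATISFIED > Satisfaction.NEUTRAL

# ...and the median is valid too, because it only relies on order.
answers = [Satisfaction.NEUTRAL, Satisfaction.SATISFIED,
           Satisfaction.VERY_SATISFIED, Satisfaction.DISSATISFIED,
           Satisfaction.SATISFIED]
median_answer = sorted(answers)[len(answers) // 2]

# Discrete quantitative data: counts take whole values only.
pets_per_person = [2, 5, 0, 1]

# Continuous quantitative data: any precision is possible in principle.
heights_m = [1.72, 1.719, 1.7186]

print(is_happier, median_answer.name)
```

Averaging the integer codes of the satisfaction scale would be technically possible here, but the result would not be meaningful, which is exactly the limitation of ordered labels.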
FORMAT: WHAT SHOULD MY CHART LOOK LIKE?
Regardless of the data we are going to represent, there are a series of minimum criteria to take into account when representing our variables in a graph.
- Avoid thick lines on the axes.
- Do not add unnecessary text.
- Clean display: no background/images.
- Do not use 3D effect.
- Simple format for axes.
In short: less is more.
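As a rough illustration of the criteria listed above, the following sketch applies them to a simple bar chart. It assumes matplotlib is installed; the data, the colour choice and the variable names are hypothetical, and this is one possible styling rather than the only correct one.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical data, used only to illustrate the formatting.
categories = ["A", "B", "C", "D"]
values = [23, 17, 12, 5]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, values, color="steelblue")  # one soft colour, no palette

# Remove the frame lines that carry no information.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

# Thin axis lines and simple ticks.
for side in ("left", "bottom"):
    ax.spines[side].set_linewidth(0.8)
ax.tick_params(length=3, width=0.8)

ax.grid(False)             # no background grid
ax.set_facecolor("white")  # no background colour or image
ax.set_ylabel("Count")     # only the text the reader needs
```

Calling `fig.savefig(...)` or `plt.show()` afterwards would write or display the figure.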
In our visualizations, colour should only be used when it adds value to the analysis, and the colours chosen should always be soft.
In the pair of charts above, the use of one colour per category adds no extra information to the display. In cases like this, the colour palette should be replaced by a single colour.
In the scatter plots above, two variables are compared: length and width of a flower sepal. The addition of colour by species type (right graph) in this case does provide relevant information in our analysis.
HOW SHOULD I REPRESENT THE VARIABLES IN MY ANALYSIS?
Depending on whether you want to represent or analyse one, two or more variables in a graph, there are different approaches. Below are examples of some of the cases that may come up over the course of an analysis.
To visualize a variable we can choose from the following types of graphs:
Barplots are used to represent the distribution of a categorical variable. In a bar chart, each level or value of the categorical variable is represented by a bar, whose height indicates the frequency of data points in the category to which it belongs.
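The frequencies behind such a chart can be sketched with the standard library alone. The `professions` list below is hypothetical sample data, and the text rendering is only a stand-in for a real plotting call.

```python
from collections import Counter

# Hypothetical observations of a categorical variable.
professions = ["Teacher", "Engineer", "Potter", "Engineer",
               "Teacher", "Engineer", "Potter", "Engineer"]

# Frequency of each level; most_common() returns them from highest to lowest.
frequencies = Counter(professions).most_common()

# Text rendering of the bar chart: one bar per level, length = frequency.
for level, count in frequencies:
    print(f"{level:<10}{'#' * count} {count}")
```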
Histograms are used to analyse the distribution of a numerical variable. They are the quantitative version of the bar chart: instead of drawing a bar for each categorical value, values are grouped into numerical ranges, or bins, and each bar shows how many values fall into its bin.
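The grouping into bins can be sketched as follows. This is a minimal version assuming equal-width bins and at least two distinct values; `histogram_counts` and the `ages` sample are hypothetical names and data, not part of any library.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("need at least two distinct values")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of a new one.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical ages in years.
ages = [18, 21, 22, 25, 29, 31, 34, 35, 41, 58]
edges, counts = histogram_counts(ages, n_bins=4)
print(edges, counts)
```

The number of bins is a choice: too few hide the shape of the distribution, too many turn it into noise, so it is worth trying several values during exploration.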
Pie charts are very commonly used and are one of the preferred formats for displaying data or showing results, but we have to be careful with this type of display. Here are three reasons why pie charts or ring charts should be avoided:
1. Areas: A > B or B >A?
Human eyesight is not sharp enough to compare areas or volumes reliably. If we have categories with similar values, we cannot distinguish them clearly in a pie chart. The best practice for charting categorical variables is to use bar charts, ordering the categories from highest to lowest (or vice versa) according to their associated numerical variable:
B > A! We can see it much more clearly here, can’t we?
2. Categorisation: How many categories can I include in my pie-chart? Look at the following chart:
Can you tell anything from such a tidal wave of slices? Certainly not. If we choose to use a pie chart, it should include 2-3 categories at most.
3. Pie chart + 3D: Can I use them? When? No. Never. The 3D perspective distorts the areas and volumes and, as we saw in the first point, the human eye already struggles to compare areas in plain 2D:
To visualize two variables, we can choose from the following standard charts:
A scatter plot is a two-dimensional data display that uses points to represent the values obtained from two different variables: one represented along the x-axis and the other represented along the y-axis. Through this visualization we can check the correlation between two variables.
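The correlation that a scatter plot lets us judge by eye can also be quantified. The sketch below computes the Pearson coefficient, which only captures linear relationships; the `pearson` helper and the sample values are hypothetical and not taken from any real dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values: x-axis variable and y-axis variable of the scatter plot.
carat = [0.3, 0.5, 0.7, 1.0, 1.5]
price = [400, 1100, 2300, 4800, 9500]
r = pearson(carat, price)
print(round(r, 3))
```

A coefficient close to 1 or -1 suggests a strong linear relationship, while a value near 0 does not rule out a non-linear one, which is why looking at the plot itself remains important.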
We see in the matrix above that price and carat variables have a higher correlation than the rest of the variables.
Box plots allow us to analyse groups of numerical data through their quartiles. They show how symmetrical the distribution is, the interquartile range (IQR) and any outliers, and give a graphic representation of descriptive statistics such as the median, the 25th and 75th percentiles, and the maximum and minimum.
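The quartile-based summary described above can be computed with the standard library. This sketch assumes the common 1.5 × IQR rule for flagging outliers, which is a convention rather than something this article prescribes; `box_plot_summary` and the sample data are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Quartiles, IQR, whisker ends and outliers (1.5 * IQR rule)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical measurements with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
summary = box_plot_summary(data)
print(summary)
```

Different tools use slightly different quantile methods, so the exact quartile values may differ a little from what a plotting library draws.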
Representation of raw data of a numerical variable. The greater the density of points in a given numerical range, the greater the amplitude. Through this type of visualization we have a better understanding of our data distribution, asymmetry and kurtosis.
This gives us a more concrete picture of how the points are distributed, eliminating the long “tails” that exist in violin plots (bottom right graphic).
We can add one more variable to our bar chart allowing us to see a categorical comparison at various levels.
Heat maps enable us to identify, for example, how one quantitative variable behaves against two categorical ones. The following example represents the use of a bike sharing system by day of the week (x-axis) and hour of the day (y-axis). A darker cell indicates greater use. From this graph, it can be seen that from Monday to Friday the service is used much more to travel to work in the mornings (6 – 9am) and to return home after the working day (4 – 6pm).
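The grid of cells behind this kind of chart is just a count per combination of the two categorical variables. Here is a minimal sketch using the standard library; the `trips` list is hypothetical and far smaller than a real usage log.

```python
from collections import Counter

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical trips as (day, hour) pairs.
trips = [("Mon", 8), ("Mon", 8), ("Mon", 17), ("Tue", 8),
         ("Tue", 17), ("Sat", 12), ("Mon", 8), ("Sun", 12)]

# Count trips per (day, hour) cell.
cell_counts = Counter(trips)

# Build the grid: one row per hour (y-axis), one column per day (x-axis).
grid = [[cell_counts[(day, hour)] for day in DAYS] for hour in range(24)]

# The busiest cell is the one that would be drawn in the darkest shade.
busiest_cell, busiest_count = cell_counts.most_common(1)[0]
print(busiest_cell, busiest_count)
```

A plotting library would then map each count in `grid` to a colour intensity.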
At this point, we may want to go further and deepen our analysis of how the variables behave. Analysis is a game: an iterative and intuitive process in which we go deeper as we obtain insights, and ask ourselves more questions as we learn more about the data and its context. In the following example, we take a two-dimensional analysis (price and carat), split it by the cut variable and display it as a heat map.
I hope that this review of good practices, covering the most appropriate type of visual representation for each situation and the minimum characteristics it should have in order to provide value, has been useful. If you have any questions, leave a comment 🙂
Image: unsplash | @rayhands