Statistics 101 - Essential Data Visualization

This article is the 2nd part of the "Statistics 101" series.

If you haven't read the previous articles, I recommend going through them, especially if you are not familiar with those topics. Below are the links to all the articles published in this series so far.

1 - Types of Data
2 - Essential Data Visualization (You are here)
3 - Central Tendencies

Introduction

As we discussed in our Types of Data article, understanding the type of data you are dealing with is very important in statistics. It lets you choose the right statistical tools for your data: if you understand the data type and how you are going to use the data, you can choose the right type of graph for it and the right kind of descriptive or inferential analysis.

In this post of our Statistics 101 series, we are going to take a deep dive into some of the most commonly used types of graphs in statistics for visually describing and understanding data.

💡
NOTE: Although we are discussing most of the commonly used graphs in this post, this is not an exhaustive list. There are many more graphs that can be used depending on the scenario and the type of representation required.

However, understanding the graphs discussed in this post and how they work will give you a solid foundation that will help you in most visualization situations.

Before we take a look at different types of graphs and plots, it's crucial to understand that data can be broadly categorized into two types, Quantitative and Qualitative, and that the graphs used for these two types are generally different. In other words, in most scenarios you won't be able to use a graph meant for quantitative data with qualitative data, and vice versa. We will discuss the different types of graphs in this series using an example, and I will also tag each graph as either a Quantitative or a Qualitative Data graph in the heading, so look out for that. Alright, now let's take a look at which graphs are available for each type of data.

🧠
If you are not familiar with quantitative data and qualitative data, I recommend going through our Types of Data article in this Statistics 101 series.

To understand different types of graphs for representing our data, we will consider an example.

Let's say you want to get fit in the upcoming months, so you start tracking your daily water intake, your calorie intake, whether you exercised, and the exercise category. You collect the data for 60 days and now want to look at a graphical representation of it. You have 5 specific requirements, and thus you want 5 different graphs that show:

  • how many times you drank a specific amount of water.
  • how your water and calorie intake has changed over time.
  • if there's a relation between how many calories you eat and how much water you drink.
  • the number of times (relative frequency) you perform exercises for different muscle groups.
  • the overall distribution between the number of days you do exercise and the number of days you don't.

Considering your requirements, let's take a look at different types of graphs that can help you with this. You can download the raw data used in this example as a CSV file from the link provided below.

Graphs for Quantitative Data

Graphs for Frequency Distribution

Graph for Requirement 1: Let's first handle your requirement of showing how many times you drank a specific amount of water.

We can use one of the most commonly used graphs in statistics to achieve this: the one and only Histogram.

Histogram

With a histogram, we can plot the frequency, i.e. the number of times you drank a specific amount of water. However, before we can plot a histogram, we need to understand something called binning.

In our example, we have water intake values for 60 days. If we start plotting the frequency of these values, i.e. the number of times the person drank the exact same amount of water, we will realize that it's highly unlikely for a person to drink exactly the same amount of water on multiple days. For example, the person might have drunk 3,211 ml of water on Day 1 and 3,212 ml on Day 2. These values look almost the same in our context, but mathematically speaking they are obviously not equal. Thus, if we plot the histogram using the individual data values, each value will have a frequency of 1 most of the time. A higher frequency is possible now and then, but it would usually be 1. This makes sense because we are dealing with quantitative data, where in theory infinitely many values are possible.

Binning solves this by grouping individual values into bins (or ranges), which gives us a better and more useful graphical interpretation of our data. With bins that are 300 ml wide, for instance, 3,211 ml and 3,212 ml land in the same bin and are counted together.

Now let's do this on our entire water intake data and see what the histogram looks like.
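To make binning concrete, here is a minimal sketch in plain Python of how values could be grouped into equal-width bins. The function name `bin_counts` and the sample values are made up for illustration (they are not the dataset used in this post), and plotting libraries typically do this step for you.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values to form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land at index n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # n_bins bins are delimited by n_bins + 1 edges.
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Made-up water intake values in ml (hypothetical, for illustration only).
water_ml = [3211, 3212, 3450, 3600, 3705, 3390, 3550, 3810, 3100, 3640]
edges, counts = bin_counts(water_ml, 5)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.0f}-{right:.0f} ml: {count}")
```

Notice that 3,211 ml and 3,212 ml end up in the same bin, so they are counted together instead of each getting a frequency of 1.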

Wow, that looks great. With 5 bins we can see that the person's daily water intake most often fell in the range of 3.4 to 3.7 litres during the 60-day period, with the other ranges occurring at roughly similar frequencies. The above graph is an interactive plot, so you can play around with the number of bins to see how the histogram changes as you increase it.

Frequency Polygon

A Frequency Polygon is a close relative of the histogram. The only difference is that instead of rectangular bars, the frequency of each bin is marked as a point, and the points are connected with a line, as shown in the interactive plot below. The plot below is the frequency polygon for our Water Intake example. Notice how similar the above histogram and this frequency polygon are. That's because they represent the same data, just using different shapes.
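As a rough sketch of that relationship, the points of a frequency polygon can be derived from the same bins a histogram uses by taking the midpoint of each bin. The helper name `frequency_polygon_points`, the bin edges and the counts below are hypothetical and are not taken from the post's data.

```python
def frequency_polygon_points(edges, counts):
    """Return (bin midpoint, frequency) pairs: the vertices of a frequency polygon."""
    if len(edges) != len(counts) + 1:
        raise ValueError("expected exactly one more edge than counts")
    midpoints = [(left + right) / 2 for left, right in zip(edges, edges[1:])]
    return list(zip(midpoints, counts))


# Hypothetical bin edges (in litres) and frequencies, for illustration only.
edges = [2.5, 2.8, 3.1, 3.4, 3.7, 4.0]
counts = [10, 11, 12, 15, 12]
points = frequency_polygon_points(edges, counts)
for midpoint, frequency in points:
    print(f"{midpoint:.2f} l -> {frequency}")
```

Connecting these points with straight lines gives the polygon, and drawing a bar over each bin with the same heights gives the histogram.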

Graph for Trend Analysis

Line Plot

Graph for Requirement 2: Alright, now let's take a look at the second requirement for graphs. The second requirement stated that the graph should display how the water intake has varied over time.

In statistics and data terminology, we call this a trend plot. A trend plot is simply a plot that has time on the X axis and the quantity that you want to track over time on the Y-axis. In this case, the quantity is water intake.

Now that you know what a trend plot is, let's plot it for our data and see what the result looks like.

That looks like a pretty consistent water intake, averaging about 3,562 ml, as can be seen from the graph. Here's an interesting thing to notice though: you can change the look of this graph by manipulating the upper and lower limits of the Y-axis. In other words, changing the values at which the Y-axis for water intake starts and ends can significantly warp the visual perception of this graph. You can try it out yourself and see the results.
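One rough way to reason about this effect is to compute how much of the Y-axis height the data actually covers for a given pair of axis limits. This is only an illustrative sketch: the function name `fraction_of_axis_used`, the sample values and the axis limits are all made up.

```python
def fraction_of_axis_used(values, y_min, y_max):
    """Share of the Y-axis height covered by the spread (max - min) of the data."""
    if y_max <= y_min:
        raise ValueError("y_max must be greater than y_min")
    return (max(values) - min(values)) / (y_max - y_min)


# Made-up daily water intake values in ml (hypothetical).
water_ml = [3400, 3550, 3600, 3500, 3700, 3620]

wide = fraction_of_axis_used(water_ml, 0, 4000)      # axis starts at zero
tight = fraction_of_axis_used(water_ml, 3350, 3750)  # axis hugs the data

print(f"Axis 0-4000 ml: data spans {wide:.1%} of the height")
print(f"Axis 3350-3750 ml: data spans {tight:.1%} of the height")
```

The same 300 ml of day-to-day variation looks almost flat on the first axis and very jumpy on the second, even though the data has not changed.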

Graph for Relationship Analysis

Scatter Plot

Graph for Requirement 3: For the third requirement, you need a graph that visualizes whether calorie intake is related to water intake.

A Scatter Plot is a great choice when we want to understand whether one quantitative variable (i.e. calories) has any relationship with another quantitative variable (i.e. water intake). Technically, we put one variable on the X-axis and the other on the Y-axis and then plot all the data points on the graph. If the points form a visible pattern, that suggests a relationship between the variables. That being said, even if no pattern is visible in the plot, there might still be a valid relationship between the variables. Thus it's important to use other statistical analyses before drawing conclusions. We will discuss these methods later in this series.

Alright, now that we have an idea of what a scatter plot does, let's plot it for our data and see if the calorie intake is related to the water intake levels.

According to our scatter plot above, it looks like water intake rises with calorie intake: the more calories you eat, the more water you drink.
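As a small preview of what such an analysis can look like, one common numeric companion to a scatter plot is the Pearson correlation coefficient, which measures how strongly two variables move together along a straight line. The sketch below is illustrative only: it is not necessarily the method covered later in this series, and the sample values are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily values (hypothetical, not the dataset from this post).
calories = [1800, 2000, 2200, 2400, 2600]
water_ml = [3000, 3200, 3300, 3600, 3700]

r = pearson_r(calories, water_ml)
print(f"Pearson r = {r:.3f}")
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship. Keep in mind that a strong correlation on its own does not tell you that one variable causes the other.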

Graphs for Qualitative Data

Alright, now let's take your last two requirements of visualizing and understanding the exercise data.

Graph for Requirement 4: A graph to see the relative frequency of different types of exercises.

Bar Plot

A Bar plot is a graph used for qualitative data to visually represent and analyze the relative frequency of the different values of a qualitative variable. It's very similar to a histogram, but instead of having a quantitative variable (i.e. water intake or calorie intake) on the X-axis, we have a qualitative variable (i.e. exercise type). This difference is important to remember, since it changes the nature of the graph itself and what it actually represents. We will talk more about this point later in this series, when we discuss Distributions.

The Y-axis of a bar plot is the same as that of a histogram and represents the frequency, i.e. the number of times that particular category occurred.

Alright, let's create a bar plot for your exercise data to understand which muscle groups you target the most and the least.
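Here is a minimal sketch of how the heights of the bars could be computed as relative frequencies. The function name `relative_frequencies` and the list of exercise categories are made up for illustration and are not the post's data.

```python
from collections import Counter


def relative_frequencies(categories):
    """Map each category to its share of all observations (shares sum to 1)."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Made-up exercise log (hypothetical categories, one entry per workout day).
exercises = ["arms", "chest", "arms", "legs", "chest", "arms", "back", "chest"]
rel = relative_frequencies(exercises)
for category, share in sorted(rel.items(), key=lambda item: -item[1]):
    print(f"{category}: {share:.1%}")
```

Unlike histogram bins, these categories have no natural numeric order, so the bars can be arranged in any order, for example from most to least frequent as above.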

Wow, that's interesting, it seems like you really love working out your arms and chest while legs and back not so much.

Pie Chart

Graph for Requirement 5: For your fifth and last requirement, you want to see the proportion of days on which you exercised and on which you didn't.

Alright, now for the last requirement, we are going to introduce another type of graph that's useful for qualitative data: the Pie chart.

A Pie chart is also typically used when we want to understand the distribution of the different values of a particular qualitative variable (in our case, "Exercise"). However, there's a difference between the use of a Bar plot and a Pie chart that you should always be aware of.

While a Bar plot is quite general and can be used freely for almost any qualitative variable, a Pie chart has a limiting factor that makes it a bit more restrictive in its use. Because a Pie chart shows the data as a circle and each category is represented as a slice of this circle, a high number of categories makes the slices small and makes it really hard to understand the true distribution of these categories.

Alright, now let's plot the pie chart for our data and take a look at it.

Wow, that's interesting. It looks like you exercised on almost 70% of the tracked days, which is quite remarkable, considering that rest between exercise sessions is also important.
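To see why many categories are a problem, it helps to look at how slice sizes are computed: each category gets a share of the 360 degrees of the circle in proportion to its count. The sketch below is a simple illustration, and the function name `pie_slice_angles` and the sample data are made up.

```python
from collections import Counter


def pie_slice_angles(categories):
    """Map each category to the angle (in degrees) of its pie slice."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: 360 * n / total for category, n in counts.items()}


# Made-up log of 10 days (hypothetical): did you exercise or not?
exercised = ["yes"] * 7 + ["no"] * 3
angles = pie_slice_angles(exercised)
print(angles)

# With 12 equally frequent categories, every slice shrinks to 30 degrees.
many = pie_slice_angles([f"category_{i}" for i in range(12)])
print(many["category_0"])
```

With two categories the slices are easy to compare at a glance, while a dozen thin slices of similar size are much harder to tell apart than a dozen bars.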

Alright, that was fun, wasn't it? We got to look at different types of graphs in statistics and how they can be applied to understanding and exploring data. In the next post of our Statistics 101 series, we will discuss the concept of central tendencies. I will see you guys in the next one.