Complete guide to Pandas library (Python Code): Part 4/4

Fundamental Statistics.

Vijay yadav
Analytics Vidhya


All the hard work we put into understanding statistics serves one purpose: making sense of our data. In data science projects we are responsible for exploring large datasets and drawing meaning from them in many situations, and that is where statistical measures help us understand the distribution and variation within the data. Let’s explore this in detail.

Index

  1. Understanding Descriptive analysis
  2. Measures in Descriptive analysis
  3. Exploring Python libraries for statistics
  4. Visualization by using charts

What is Descriptive Statistics?

Source: A descriptive statistic is a summary statistic that quantitatively describes or summarizes features from a collection of information.


There are two categories to know when starting out: descriptive and inferential statistics. The distinguishing factor is scope. When you want to describe or visualize the dataset you actually have, you can do so with numbers or with charts such as bar graphs, histograms, pie charts or line charts. You can also check the shape of the data: whether it is symmetrical or skewed and, if skewed, whether it is skewed to the left or to the right. All of this information falls under descriptive statistics.

With inferential statistics you work with a small sample of a dataset to draw an inference about the whole population. To judge how far an inference from a sample can be trusted for the whole population, we use probability distributions, which quantify how much confidence we have in our conclusions. E.g. analysis of census data.
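As a rough sketch of what "confidence" means here, the snippet below estimates a population mean from a small sample and attaches an approximate 95% confidence interval. The sample values are made up for illustration, and the normal approximation is an assumption; for a sample this small a t-distribution would normally be preferred.

```python
import statistics

# Hypothetical sample drawn from a larger population (illustration only)
sample = [52, 61, 45, 58, 63, 49, 55, 60, 57, 50]

n = len(sample)
sample_mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n)
standard_error = statistics.stdev(sample) / n ** 0.5

# z-value that leaves 2.5% in each tail of a standard normal distribution
z = statistics.NormalDist().inv_cdf(0.975)

ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(sample_mean, round(ci_low, 2), round(ci_high, 2))
```

The interval is the range within which the population mean plausibly lies, given the sample and the stated approximation.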

Commonly used methods to describe data in descriptive statistics are

  • measures of central tendency, which include the mean, median and mode, and
  • measures of variability or dispersion, which include the standard deviation (or variance) and the minimum and maximum values of the variables, together with measures of shape such as kurtosis and skewness.
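As a minimal sketch of the measures listed above, the snippet below computes each one with pandas on a small made-up Series. The values are hypothetical and are not taken from the heart dataset used later.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
values = pd.Series([2, 4, 4, 5, 7, 9, 11])

# Measures of central tendency
mean_value = values.mean()          # arithmetic average
median_value = values.median()      # middle value after sorting
mode_value = values.mode().iloc[0]  # most frequent value

# Measures of variability
std_value = values.std()            # sample standard deviation (ddof=1)
var_value = values.var()            # sample variance (ddof=1)
value_range = values.max() - values.min()

# Measures of shape
skew_value = values.skew()          # > 0 means a longer right tail
kurt_value = values.kurt()          # excess kurtosis

print(mean_value, median_value, mode_value)
print(std_value, var_value, value_range)
print(skew_value, kurt_value)
```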

Exploring Python libraries for statistics

Python comes with its own built-in statistics library, which is practical in many situations. After looking at some basic functions in this library, we will focus on Pandas for our statistics needs in this article.

Let’s download a dataset to practice these statistical functions and calculate the measures mentioned above.

Measures of central tendency

A measure of central tendency is a single value that represents the center point of a dataset. This value can also be referred to as “the central location” of a dataset.

mean(), median(), mode()

Import necessary libraries and load the dataset.

import pandas as pd
import statistics
# Load the heart disease dataset from a comma-separated file
data = pd.read_csv("heart.csv", sep=",")
# Preview the first five rows
data.head()

The most commonly used measure of central tendency is the mean. It is simply the average of the values, and it is only sensible to take the mean when the data is fairly symmetrical. If the dataset is skewed or has heavy outliers (extreme values), the mean is not the best choice.

What is the mean age of the people in our heart disease dataset?

# using the statistics library, we can use mean()
statistics.mean(data['age'])

What is the middle value of the age column in our dataset?

To answer this question we use the second measure, median(). It returns the middle value of the sorted data, and it is the best choice if your dataset is skewed or has outliers.
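A minimal sketch of the difference, using a made-up list of ages with one extreme value (the numbers are hypothetical, not from heart.csv):

```python
import statistics

# Hypothetical ages with one extreme value, for illustration only
ages = [29, 35, 41, 44, 52, 58, 120]

mean_age = statistics.mean(ages)      # pulled upward by the outlier 120
median_age = statistics.median(ages)  # middle value of the sorted list

print(mean_age, median_age)
```

The single outlier drags the mean above 54 while the median stays at 44. On the dataset loaded earlier, the equivalent call would be statistics.median(data['age']).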

Which gender occurs more often, and what is the most frequently occurring age in our dataset?

This type of question helps us understand the distribution of the dataset itself. mode() returns the most frequently occurring value, so it is used mostly with categorical values. If you want to see how the data is distributed by gender, the mode is a good option.

# the two most frequent ages and their counts
data.age.value_counts().head(2)
# the number of rows for each sex
data.sex.value_counts().head(2)
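The same kind of question can also be answered with the statistics library itself. The sketch below uses made-up lists rather than the heart dataset; multimode() requires Python 3.8 or later.

```python
import statistics

# Hypothetical values, for illustration only
sexes = ["male", "female", "male", "male", "female"]
ages = [54, 58, 54, 61, 58, 47]

most_common_sex = statistics.mode(sexes)  # single most frequent value
tied_ages = statistics.multimode(ages)    # every value tied for most frequent

print(most_common_sex, tied_ages)
```

multimode() is the safer choice when several values may share the top frequency, since it returns all of them in order of first appearance.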

As you can see, the statistics library is very helpful for getting this information if, for any reason, you don’t have access to other libraries.

Statistics using Pandas

Now let’s see how all these values can be obtained using the Pandas library.

Pandas has several methods built on top of the numpy library that make it simple to perform statistical analysis on both numerical and categorical values. Pandas commonly works with two data structures, the DataFrame and the Series, and you can call these methods directly on either of them.
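As a small sketch of calling these methods directly, the snippet below uses a made-up DataFrame as a stand-in for the heart dataset (the column values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the heart dataset, for illustration only
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 57],
    "sex": ["male", "male", "female", "male", "female", "male"],
})

# Methods on a Series (a single column)
age_mean = df["age"].mean()
age_median = df["age"].median()
sex_mode = df["sex"].mode().iloc[0]  # mode() returns a Series because ties are possible

# The same method on a DataFrame works column by column
numeric_means = df.mean(numeric_only=True)

print(age_mean, age_median, sex_mode)
print(numeric_means)
```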

describe()

By far, this is the most helpful function in pandas for viewing almost every important statistical value at a glance. By default it gives the statistical summary for numerical columns only.
For categorical columns, specify the include='object' parameter explicitly.

# for numerical columns (the default)
data.describe()
# for categorical (object dtype) columns; needs at least one such column
data.describe(include='object')

The summary contains the following results:

  • count: The number of non-null values in the column
  • mean: The mean of each column
  • std: The standard deviation
  • min and max: The minimum and maximum values
  • 25%, 50%, and 75%: The percentiles, i.e. the values below which that percentage of the observations fall (50% is the median)
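To make the entries above concrete, here is a short sketch that runs describe() on a made-up column containing one missing value and compares the result with the dedicated pandas methods. The data is hypothetical.

```python
import pandas as pd

# Hypothetical values with one missing entry, for illustration only
df = pd.DataFrame({"age": [40, 50, 60, 70, None]})

summary = df["age"].describe()

# Each entry of the summary corresponds to a dedicated method
non_null_count = df["age"].count()         # missing values are not counted
median_age = df["age"].median()            # same as the "50%" row
first_quartile = df["age"].quantile(0.25)  # same as the "25%" row

print(summary)
print(non_null_count, median_age, first_quartile)
```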

Additionally, if you don’t wish to apply the describe function to the whole dataset, you can apply it to specific columns by selecting them first.

data[['age' , 'sex' , 'cp']].describe()

Visualizing Data

Python has the matplotlib and seaborn libraries for visualization, and both are very widely used. In this example we will mostly explore seaborn because it is simpler to use and its plots are more aesthetically pleasing than matplotlib's defaults.

Bar Charts

Bar charts are ideally used to display categorical labels or discrete numerical values along with a frequency or summary value for each, using bars. A bar chart also lets you place multiple bars next to each other or stack them on top of each other for the various categories in a dataset.

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(8, 8))
# By default barplot draws the mean of y (age) for each x category (cp)
sns.barplot(x="cp", y="age", data=data, color="b", ax=ax)
plt.title("Bar plot using SNS")
plt.xlabel("Chest Pain Value")
plt.ylabel("Age Range")
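Seaborn's barplot aggregates with the mean by default, so the bar heights in a plot like this can be reproduced with a pandas groupby. The sketch below uses made-up cp and age values, not the real dataset.

```python
import pandas as pd

# Hypothetical chest pain types and ages, for illustration only
df = pd.DataFrame({
    "cp":  [0, 0, 1, 1, 2, 2],
    "age": [60, 50, 45, 55, 40, 48],
})

# Mean age per chest pain category: the heights barplot would draw
mean_age_by_cp = df.groupby("cp")["age"].mean()

print(mean_age_by_cp)
```

Checking the numbers this way is a quick sanity test before reading values off a chart.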

Box Plot Chart

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(8, 8))
# One box per chest pain value, split by sex
sns.boxplot(x="cp", y="age", hue="sex", palette=["m", "g"], data=data, ax=ax)
sns.despine(offset=10, trim=True)
plt.title("Box plot using SNS")
plt.xlabel("Chest Pain Value")
plt.ylabel("Age Range")

To understand and interpret a boxplot statistically, read the points below.

  • The line in the middle of the box is the median line.
    The median is more useful than the mean when your data is skewed or prone to outliers.
  • Most of the statistical details from the pandas describe() method can be seen here:
    the lower end of the box is the 25% quartile and the top of the box is the 75% quartile, so the box itself holds the middle 50% of the datapoints and 75% of them lie at or below its top edge.
  • The ends of the whiskers at the top and bottom are the min and max values in the dataset, excluding any points seaborn draws separately as outliers (by default, points more than 1.5 times the box height beyond the box).
  • If the box is small, the middle half of the data has low variance, i.e. the values are close together.
  • If the box is large, the datapoints are widely spread.
  • The position of the median line tells you whether the majority of the data lies in the lower end or the higher end of the box, or whether the data is roughly symmetric if the line is in the middle.
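The quantities a boxplot draws can be computed directly with pandas. The sketch below uses made-up values with one extreme point and assumes the common 1.5 times IQR whisker rule, which is the default in matplotlib and seaborn.

```python
import pandas as pd

# Hypothetical values with one extreme point, for illustration only
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30])

q1 = values.quantile(0.25)   # lower edge of the box
q3 = values.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # height of the box (interquartile range)

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = values[(values >= lower_limit) & (values <= upper_limit)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything beyond the limits is drawn as an individual outlier point
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, values.median(), q3, iqr)
print(lower_whisker, upper_whisker, outliers.tolist())
```

Here the value 30 falls outside the upper limit, so the upper whisker stops at 8 and 30 would appear as a separate point.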

Scatter Plot

import seaborn as sns
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.despine(f, left=True, bottom=True)
# One point per row, colored by sex
sns.scatterplot(x="age", y="chol", hue="sex", sizes=(1, 8), linewidth=0, data=data, ax=ax)
plt.title("Scatter plot to show relationship")
plt.xlabel("age")
plt.ylabel("Cholesterol level Range")

A scatter plot helps us see the relationship between two variables, which we can then confirm with statistical values such as a correlation coefficient.
Here, the plot suggests some relationship between the two plotted variables, age and cholesterol level.

At a glance, the relationship looks positive, which means that as age increases, the cholesterol level tends to increase.
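A visual impression like this can be checked with a correlation coefficient. The sketch below computes the Pearson correlation with pandas on made-up age and cholesterol values; the numbers are hypothetical and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical age and cholesterol values, for illustration only
df = pd.DataFrame({
    "age":  [35, 42, 50, 58, 63, 70],
    "chol": [180, 210, 200, 240, 235, 260],
})

# Pearson correlation coefficient: +1 is a perfect positive linear
# relationship, 0 is no linear relationship, -1 is a perfect negative one
r = df["age"].corr(df["chol"])

print(round(r, 3))
```

On the dataset loaded earlier, the equivalent call would be data["age"].corr(data["chol"]).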

Rel plot

sns.relplot(x="age", y="chol", size="cp", sizes=(15, 200), data=data);

Heat map

This is probably the best graph for showing relationships and correlations among multiple variables at once. It is especially helpful when you are working with a dataset that has many numerical columns and you need to understand the collinearity of variables for machine learning modeling.

Example and dataset taken directly from the Seaborn package.

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# Load the example flights dataset (long-form) and pivot it to wide-form
flights_long = sns.load_dataset("flights")
# Keyword arguments are required by recent pandas versions
flights = flights_long.pivot(index="month", columns="year", values="passengers")
# Draw a heatmap with the numeric values in each cell
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5, ax=ax)
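For the collinearity use case mentioned earlier in this section, the usual input to a heatmap is a correlation matrix. The sketch below builds one with pandas from made-up columns; the column names and values (including max_hr) are hypothetical, and the seaborn call is left as a comment so the snippet runs with pandas alone.

```python
import pandas as pd

# Hypothetical numeric columns, for illustration only
df = pd.DataFrame({
    "age":    [35, 42, 50, 58, 63, 70],
    "chol":   [180, 210, 200, 240, 235, 260],
    "max_hr": [190, 175, 170, 150, 148, 130],
})

# Pairwise Pearson correlations between all columns
corr_matrix = df.corr()

print(corr_matrix.round(2))

# If seaborn is installed, the matrix can be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr_matrix, annot=True, fmt=".2f", vmin=-1, vmax=1)
```

Cells close to +1 or -1 off the diagonal point to pairs of variables that carry much the same information.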

Conclusion


Although statistical analysis is a very wide topic and the methods used depend on the type of dataset, I hope this article and the things we discussed help a little in getting started.

I myself am still learning and exploring these methods to get a deeper understanding and become better at statistical concepts.

I hope you enjoyed this 4-part article series, Complete guide to Pandas.
Read the other ones after this; links to them are attached below.

Other articles.


Vijay yadav
Analytics Vidhya

Data Science Consultant @Tata Consultancy Services. Currently in USA