YouTubeVideo('AiDqx1eZzTo', width=672, height=378)

With all of the above background around analytics, we are ready to jump right in! We will start with descriptive statistics, which are key summary attributes of a dataset that help describe or summarize it in a meaningful way.
Descriptive statistics help us understand the data at the highest level, and are generally what we seek when we perform exploratory analysis on a dataset for the first time. (We will cover exploratory data analysis next, after a quick review of descriptive statistics.)
Descriptive statistics include measures that summarize the:
- central tendency of the data (e.g., mean, median, mode),
- dispersion, or spread, of the data (e.g., standard deviation), and
- association between variables (e.g., correlation).
Descriptive statistics do not allow us to make conclusions or predictions beyond the data we have analyzed, or reach conclusions regarding any hypotheses we might have made.
Below is a summary listing of the commonly used descriptive statistics. We cover them only briefly, because we will rarely have to calculate any of these by hand; the software will almost always do it for us.
Mean: The mean is the most commonly used measure of central tendency. It is simply the average of all observations, which is obtained by summing all the values in the dataset and dividing by the total number of observations.
Geometric Mean: The geometric mean is calculated by multiplying all the values in the data and taking the n-th root, where n is the number of observations. The geometric mean can be useful when values compound over time (for example, growth rates), but is otherwise not very commonly used.
Median: The median is the middle value in the dataset. By definition, half the data will be greater than the median value, and the other half less than it. There are rules around how to compute the median when the count of data values is odd or even, but those nuances don’t really matter much when one is dealing with thousands or millions of observations.
Mode: Mode is the most commonly occurring value in the data.
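As an illustration, all four of these measures can be computed directly with Python's standard library. A minimal sketch, using the built-in `statistics` module and a small made-up sample:

```python
import statistics

# A small invented sample, purely for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))            # arithmetic mean: 5.0
print(statistics.geometric_mean(data))  # n-th root of the product of values: ~4.60
print(statistics.median(data))          # middle value (average of 4 and 5 here): 4.5
print(statistics.mode(data))            # most frequently occurring value: 4
```

In practice you would more likely call `df.describe()` on a pandas DataFrame, which reports the mean, median (50th percentile), and other summary statistics in one step.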
YouTubeVideo('Ddkfq9fT62U', width=672, height=378)

What is Standard Deviation useful for?
When you see a number for standard deviation, the question is: how do you interpret it? A useful way to think about standard deviation is as a way to bound how far data points lie on either side of the mean.
If you know your data is normally distributed (or is bell shaped), the empirical rule (below) applies. However, most of the time we have no way of knowing whether the distribution is normal or not. In such cases, we can use Chebyshev’s rule, also listed below.
I personally find Chebyshev’s rule to be very useful: if I know the mean, and someone tells me the standard deviation, then I know that at least 75% of the data lies within two standard deviations of the mean.
Empirical Rule
For a normal distribution:
- Approximately 68.27% of the data values will be within 1 standard deviation of the mean.
- Approximately 95.45% of the data values will be within 2 standard deviations of the mean.
- Approximately 99.73% (almost all) of the data values will be within 3 standard deviations of the mean.
Chebyshev’s Theorem
For any distribution:
- At least 3/4 (75%) of the data lie within 2 standard deviations of the mean.
- At least 8/9 (about 88.9%) of the data lie within 3 standard deviations of the mean.
- In general, at least \(1 - \frac{1}{k^2}\) of the data lie within \(k\) standard deviations of the mean, for any \(k > 1\).
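A quick sanity check of Chebyshev's guarantee on a deliberately non-normal (skewed) sample. The data here is simulated from an exponential distribution purely for illustration:

```python
import random
import statistics

random.seed(0)
# A skewed, clearly non-normal sample (exponentially distributed)
data = [random.expovariate(1.0) for _ in range(10_000)]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation

# Fraction of observations within 2 standard deviations of the mean
within_2sd = sum(abs(x - mu) <= 2 * sigma for x in data) / len(data)
print(f"fraction within 2 standard deviations: {within_2sd:.3f}")
# Chebyshev guarantees at least 0.75 for ANY distribution; the actual
# fraction for a given dataset is usually considerably higher.
```

Note that Chebyshev's bound is deliberately conservative: it holds for every possible distribution, so the fraction it promises is a floor, not an estimate.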
While Pearson’s correlation coefficient is generally the default, it works only when both variables are numeric. This becomes an issue when the variables are categorical, for example, when one variable is nationality and the other education.
There are multiple ways to calculate correlation. Below is an extract from the ydata-profiling library (formerly pandas_profiling), which calculates several types of correlations between variables.
Note: The `pandas_profiling` package was renamed to `ydata-profiling` in 2023. Install with `pip install ydata-profiling`; import as `from ydata_profiling import ProfileReport`.
(Source: https://ydata-profiling.ydata.ai/)
Pearson’s r (generally the default, can calculate using pandas)
The Pearson’s correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
Spearman’s \(\rho\) (supported by pandas)
The Spearman’s rank correlation coefficient (\(\rho\)) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson’s r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation. To calculate \(\rho\) for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
Kendall’s \(\tau\) (supported by pandas)
Similarly to Spearman’s rank correlation coefficient, the Kendall rank correlation coefficient (\(\tau\)) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation. To calculate \(\tau\) for two variables \(X\) and \(Y\), one determines the number of concordant and discordant pairs of observations. \(\tau\) is given by the number of concordant pairs minus the discordant pairs, divided by the total number of pairs.
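The three pandas-supported coefficients above can all be computed via the `method` argument of `Series.corr`. A minimal sketch, with a tiny invented dataset where `y = x**2` (nonlinear but perfectly monotonic, so the rank-based coefficients reach 1 while Pearson's r does not):

```python
import pandas as pd

# Invented data: y grows monotonically with x, but not linearly
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
})

pearson = df["x"].corr(df["y"], method="pearson")    # ~0.981: strong but not perfect linear fit
spearman = df["x"].corr(df["y"], method="spearman")  # 1.0: ranks agree exactly
kendall = df["x"].corr(df["y"], method="kendall")    # 1.0: every pair is concordant
print(pearson, spearman, kendall)
```

`DataFrame.corr(method=...)` accepts the same three options and returns the full pairwise correlation matrix for all numeric columns.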
Phik (\(\phi k\)) (use library phik)
Phik (\(\phi k\)) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. (Interval variables are a special case of ordinal variables where the ordered points are equidistant.)
Cramér’s V (\(\phi c\)) (use custom function, or PyCorr library)
Cramér’s V is an association measure for nominal random variables (nominal random variables are categorical variables with no order, e.g., country names). The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér’s V have been shown to be biased, even for large samples.
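Since a custom function is one of the suggested routes, here is one possible sketch using only NumPy, computing the chi-squared statistic from a contingency table of counts. The tables in the usage lines are made up for illustration, and this is the plain (biased) estimator mentioned above:

```python
import numpy as np

def cramers_v(confusion: np.ndarray) -> float:
    """Cramér's V from a contingency table of counts (rows x columns)."""
    n = confusion.sum()
    row_totals = confusion.sum(axis=1, keepdims=True)
    col_totals = confusion.sum(axis=0, keepdims=True)
    # Expected counts under the independence assumption
    expected = row_totals @ col_totals / n
    chi2 = ((confusion - expected) ** 2 / expected).sum()
    r, k = confusion.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Perfect association -> V = 1.0; perfect independence -> V = 0.0
print(cramers_v(np.array([[10, 0], [0, 10]])))  # 1.0
print(cramers_v(np.array([[5, 5], [5, 5]])))    # 0.0
```

In practice the contingency table itself can be built from two categorical columns with `pd.crosstab(df["a"], df["b"]).to_numpy()`.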