Kinnu

Advanced Properties of Your Data

Kurtosis and Its Types

Kurtosis

Kurtosis is a statistical measurement that tells us about the shape of a distribution. It specifically tells us how "peaked" or "flat" a distribution is compared to a normal distribution. Excess kurtosis is simply kurtosis minus 3, which sets the normal distribution as the zero point.

A normal distribution is a symmetric bell-shaped curve with an excess kurtosis of 0. Its mean, median, and mode coincide, and it has the same amount of data on either side of that central point.

Fat vs. thin-tailed kurtosis

If a distribution has a positive excess kurtosis, it means that it is more peaked than a normal distribution. This is often referred to as a "fat-tailed" distribution because the tails (or extremes) of the distribution are "fatter" than in a normal distribution.

On the other hand, if a distribution has a negative excess kurtosis, it means that it is flatter than a normal distribution. This is often referred to as a "thin-tailed" distribution because the tails are "thinner" than in a normal distribution.

In conclusion, kurtosis can help you identify the shape of a distribution and tell you whether it is fat-tailed, thin-tailed, or normal.
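To make this concrete, excess kurtosis can be computed directly from its definition: the fourth standardized moment minus 3. The sketch below uses NumPy and made-up random samples (libraries such as SciPy offer an equivalent `kurtosis` function); the Laplace and uniform distributions stand in for fat-tailed and thin-tailed shapes.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    z = (x - np.mean(x)) / np.std(x)
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
ek_normal = excess_kurtosis(rng.normal(size=100_000))    # close to 0
ek_laplace = excess_kurtosis(rng.laplace(size=100_000))  # positive: fat tails
ek_uniform = excess_kurtosis(rng.uniform(size=100_000))  # negative: thin tails
print(ek_normal, ek_laplace, ek_uniform)
```

With large samples, the Laplace result lands near its theoretical excess kurtosis of 3, and the uniform near -1.2.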

Labelling Kurtosis

Different kinds of kurtosis

If kurtosis is greater than 3, then the distribution is Leptokurtic. A Leptokurtic distribution has a high peak, declines rapidly as you move away from the mean, and has heavy tails – more outliers.

If kurtosis is less than 3, then it is Platykurtic. It will have a flatter top – not always as flat as the uniform distribution – and it will be mostly body, no long tails.

And what if kurtosis is exactly 3 (an excess kurtosis of 0)? Then it is Mesokurtic. It has a moderate peak, and it’s best represented by the normal distribution.
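These three labels can be applied in code. The helper below is a hypothetical illustration: the `tol` band is an assumption added here so that sampling noise around 3 still labels a normal-distribution sample as Mesokurtic.

```python
import numpy as np

def kurtosis(x):
    """Plain (non-excess) kurtosis: the fourth standardized moment."""
    z = (x - np.mean(x)) / np.std(x)
    return np.mean(z ** 4)

def label(x, tol=0.1):
    """Label a sample Leptokurtic, Platykurtic, or Mesokurtic."""
    k = kurtosis(x)
    if k > 3 + tol:
        return "Leptokurtic"
    if k < 3 - tol:
        return "Platykurtic"
    return "Mesokurtic"

rng = np.random.default_rng(0)
print(label(rng.laplace(size=100_000)))  # heavy tails: Leptokurtic
print(label(rng.uniform(size=100_000)))  # flat top: Platykurtic
print(label(rng.normal(size=100_000)))   # bell curve: Mesokurtic
```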

For a real-world application of interpreting kurtosis: it is often used as a measure of financial risk. The higher the kurtosis, the higher the risk, because the asset's returns are more prone to extreme moves. You can make high returns, but it can also generate large losses.

Clearing up the confusion about Kurtosis and fat tails

While a Platykurtic distribution might look like it has fatter tails, it is actually a thin-tailed distribution because outliers are infrequent. Many people get confused because the tails can look thicker – the density curve can sit higher at moderate distances from the mean. However, a Platykurtic distribution is like an elephant – a very small proportion of its weight is in the tail.

Different kinds of Kurtosis

In contrast, a Leptokurtic distribution is fat-tailed because there are a lot of outliers – not to mention these outliers can be very large and far away from the mean. A Leptokurtic distribution is like a leaping kangaroo – a large proportion of its weight is in the tail.

Regression Analysis and Line of Best Fit

Line of best fit

A line of best fit

The line of best fit is drawn on scatter plots and represents the best prediction of the dependent variable that could be made, based on the value of the independent variable.

Consider for example that we have two different dependent variable values for the exact same value of the independent variable across two observations; any estimate must therefore fall somewhere in between the two points.

When we only have two values, we can estimate simply by taking their average. For example, suppose we have age on the X-axis (our independent variable) and length of commute on the Y-axis (our dependent variable), with two data points for age 30: one person commutes 30 minutes, the other 60 minutes. Our estimate must fall somewhere between these two values – the average would be 45 minutes.

However, when we have many values, we need to create a reliable rule for estimation and prediction. That reliable rule is the line of best fit. In a regression analysis, it is called the regression line.

Regression analysis

Regression analysis is a statistical technique that is used to model the relationship between a dependent variable and one or more independent variables. The dependent variable is the variable that is being predicted, while the independent variable is the variable that is used to make the prediction.

The goal of regression analysis is to find the best fitting model to describe the relationship between the dependent and independent variables.

For example, regression analysis can be used to understand how the price of a house (the dependent variable) is influenced by multiple independent variables like the size of the house, the area, the age and the number of bedrooms.

Simple and multiple linear regression analysis

Simple linear regression is used when we want to predict a single dependent variable using a single independent variable. For example, we might use the number of hours studied to predict a student's test score. In this case, the number of hours studied would be the independent variable, and the test score would be the dependent variable.

Multiple linear regression is used when we want to predict a single dependent variable using multiple independent variables. For example, we might use a student's number of hours studied, their class attendance, and their previous test scores to predict their next test score. In this case, the number of hours studied, class attendance, and previous test scores would be the independent variables, and the next test score would be the dependent variable.

In both simple and multiple linear regression, we use statistical analysis to find the best-fit line (or equation) that describes the relationship between the independent variables and the dependent variable. This line can then be used to make predictions about the dependent variable, given a set of values for the independent variables.
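Both kinds of regression can be sketched with NumPy; the study-hours and attendance numbers below are made up purely for illustration. `np.polyfit` handles the simple case, and `np.linalg.lstsq` the multiple case.

```python
import numpy as np

# Simple linear regression: hours studied -> test score (illustrative data)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 60.0, 67.0, 71.0, 80.0])
slope, intercept = np.polyfit(hours, scores, 1)
pred_6h = slope * 6 + intercept  # predicted score after 6 hours of study

# Multiple linear regression: add attendance rate as a second predictor.
# The column of ones lets lstsq fit an intercept term.
attendance = np.array([0.6, 0.8, 0.7, 0.9, 1.0])
X = np.column_stack([hours, attendance, np.ones_like(hours)])
coefs, *_ = np.linalg.lstsq(X, scores, rcond=None)  # [b_hours, b_attendance, intercept]
print(slope, intercept, pred_6h, coefs)
```

Given new values for the independent variables, plugging them into the fitted coefficients yields a prediction, exactly as described above.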

Residuals - how the line of best fit is found in regression analysis

Residuals, in statistics, are the difference between the actual value of a data point and the predicted value of that data point. The line of best fit is the line that minimizes the sum of the squared residuals.

You may hear the term ‘error’ when discussing residuals. The error is, as you might have guessed, the difference between the actual and predicted value, otherwise called the residual.

The residuals are illustrated by the red and green lines shown in the image above. A residual is the distance between our line of best fit and the actual value.

You will notice the blue line intercepts the y-axis at 3. This is called our ‘intercept’ and the steepness of the line is our ‘slope’. They are represented in the regression equation as follows:

y = mx + b

where:

m = slope

b = intercept

x = the value of our data point

y = the predicted value

In minimizing the residuals via the ‘least squares method’, as it is called, we are finding the values of m and b that minimize the sum of squared residuals.

Our line of best fit must be straight – it cannot curve. That is why it is further away from some points than from others. But overall, its position and slope are the ones that minimize the sum of the squared errors across all points.
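In the simple case, the least squares method has a closed-form solution. This sketch (with made-up points) computes it and checks that nudging the slope away from the fitted value only increases the sum of squared residuals.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Closed-form least squares: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

residuals = y - (m * x + b)          # actual minus predicted, for each point
sse = np.sum(residuals ** 2)         # the quantity least squares minimizes

# Any other slope gives a larger sum of squared residuals
sse_nudged = np.sum((y - ((m + 0.1) * x + b)) ** 2)
print(m, b, sse, sse_nudged)
```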

Homoscedasticity and Heteroscedasticity

Homoscedasticity

Homoscedasticity, or 'homogeneity of variance', means that the variance is constant across groups or across the range of the data. If the variances are not homogeneous, the results of your tests may be biased.

In homoscedastic data, the data points will be evenly distributed. You can see an example of this in the image below. There's as much variance between the data at the start of the curve as there is in the middle or at the end. This suggests consistent data, which is easier to work with.

Heteroscedasticity

The opposite of homoscedastic data is heteroscedastic data. This term describes data whose variance is non-constant. You can see two examples of heteroscedastic data in the image below.

Unlike in the homoscedastic example, the variance between the data in the heteroscedastic examples is non-constant. In the 'bow tie' example, the variance starts wide, then narrows in the middle, before widening again. In the 'fan' example, the variance starts wide, then gradually narrows.
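The 'fan' pattern is easy to simulate by letting the noise grow with the predictor. A minimal sketch with made-up data, comparing the residual spread in the low and high halves of the range:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 500)
y = 2 * x + rng.normal(scale=0.5 * x)  # noise std grows with x: a "fan"

# Fit a line, then inspect how the residuals spread out
m, b = np.polyfit(x, y, 1)
resid = y - (m * x + b)

half = len(x) // 2
spread_low = resid[:half].std()   # residual spread for small x
spread_high = resid[half:].std()  # residual spread for large x: wider
print(spread_low, spread_high)
```

A homoscedastic dataset would show roughly equal spread in both halves; here the second half is clearly wider.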

Causes or sources of heteroscedasticity

Heteroscedasticity can be caused by several different factors. It can result from differences in time series data – like seasonal fluctuations – or inaccuracies in your measurement tool – for example, your measurement tool might become more and more inaccurate due to changes in the external environment over time.

It could also be that your measurement tool exhibits greater variance as the inputs it is supposed to measure become greater. For example, a device might measure the wattage of batteries, but be less accurate and exhibit higher variance in readings for higher wattages.

Implications of heteroscedasticity in predictive statistics

Heteroscedasticity occurs when the spread or dispersion of the residuals differs systematically from one part of the dataset to another.

When conducting predictive statistics, for example by using a regression analysis, this means that your model may provide more accurate predictions at one end of the data range, while at the other end of the data range, the predictions are less accurate.

You can still perform a regression analysis on such data, but your results will be less reliable in the parts of the range where the variance is larger.