Understanding Key Statistical Tests for Data Scientists


Introduction

Choosing and applying the right statistical test is crucial for analyzing data effectively, and data science use cases are no exception. The right test helps you determine whether a difference between groups, or an association between variables, is statistically significant. This blog post will delve into five fundamental statistical tests I have used throughout my data science and analytics career: the t-test, ANOVA, the Chi-squared test, Pearson correlation, and Fisher’s exact test. We will explore some of the mathematics behind each test and recommend when to use it, depending on your data and project.

(1) T-Test: Comparing Means of Two Groups

The t-test is a statistical test used to compare the means of two groups. It’s useful when dealing with small sample sizes and when the population’s standard deviation is unknown.

Math for the t-test:
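
One common form of the two-sample statistic is sketched below. It assumes both groups share the same variance, so a single pooled s^2 is used; that is my assumption, and the unequal-variance (Welch) form replaces the denominator with the square root of s1^2/n1 + s2^2/n2.

```latex
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
```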

Where X bar 1 and X bar 2 are the sample means, s^2 is the sample variance, and n1 and n2 are the sample sizes.

When to Use:

Use a t-test when comparing the means of two independent groups (e.g., comparing the heights of men vs. women) when the sample size is small. I have rarely needed a t-test in a work environment, but it is still useful to know in case you run into small sample sizes. With a large sample size, the t-distribution is close to the normal distribution, so you can use a z-test instead.
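
Here is a minimal sketch of the calculation using only the Python standard library. It assumes equal variances (the pooled form), and the function name and the height values are made up for illustration. In practice, scipy.stats.ttest_ind computes the same statistic along with a p-value.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic, assuming equal population variances."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    s2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    # Return the statistic and its degrees of freedom.
    return t, n1 + n2 - 2


# Made-up heights in cm, purely illustrative.
men = [178, 182, 175, 180, 185]
women = [165, 170, 162, 168, 172]

t_stat, dof = pooled_t_statistic(men, women)
print(f"t = {t_stat:.3f}, df = {dof}")
```

The statistic is then compared against a t-distribution with the returned degrees of freedom to get a p-value.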

(2) ANOVA: Analyzing Variance Across Multiple Groups

ANOVA (Analysis of Variance) compares the means across three or more groups. It is ideal when you have to test if any significant differences exist among group means.

Math for the ANOVA:

ANOVA calculates the F-statistic using the formula:
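
In its standard one-way form, the statistic is the ratio of the between-group mean square to the within-group mean square. The symbols below are my own notation: k groups, n_j observations in group j, N observations in total, group means X bar j, and grand mean X bar.

```latex
F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}
  = \frac{\sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2 / (k - 1)}
         {\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2 / (N - k)}
```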

The main point here is that the higher the F-statistic, the stronger the evidence that at least one group mean differs from the others.

When to Use:

Use ANOVA when comparing more than two groups, like testing whether exam performance varies across different teaching methods. You will commonly use ANOVA in practice when you have to compare means across a few categories.
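
As a rough sketch of the one-way calculation, the snippet below computes the F-statistic from scratch with the standard library. The function name and the exam scores are hypothetical. In practice, scipy.stats.f_oneway returns the same statistic plus a p-value.

```python
from statistics import mean


def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up exam scores for three teaching methods.
lecture = [70, 72, 68, 74, 71]
online = [78, 80, 76, 82, 79]
hybrid = [85, 88, 84, 90, 83]

f_stat = one_way_f([lecture, online, hybrid])
print(f"F = {f_stat:.2f}")
```

A large F only says that some group differs; a follow-up comparison is needed to find out which one.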

(3) Chi-Squared Test: Relationship Between Categorical Variables

The Chi-squared test assesses if there is a significant association between two categorical variables.

Math for the Chi-squared test:
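
The standard Pearson form of the statistic sums over every cell of the contingency table (the index i over cells is my notation):

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```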

Where O is the observed frequency, and E is the expected frequency under the null hypothesis.

When to Use:

You can utilize this test when examining relationships between categorical variables, such as gender and clothing buying preferences.
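
To make the observed-versus-expected idea concrete, here is a minimal standard-library sketch. The function name and the counts are invented for illustration. The p-value shortcut at the end only holds for one degree of freedom (a 2x2 table), and no continuity correction is applied, so the result can differ from scipy.stats.chi2_contingency, which applies Yates' correction to 2x2 tables by default.

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts: rows = gender, columns = preferred clothing category.
observed = [[30, 20],
            [15, 35]]

chi2, dof = chi_squared_statistic(observed)

# With exactly 1 degree of freedom, the upper-tail probability
# reduces to erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2)) if dof == 1 else None
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p_value}")
```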

(4) Pearson Correlation: Measuring the Relationship Between Continuous Variables

The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables.

Math for Pearson Correlation:

The coefficient, denoted as r, ranges from -1 to 1, where:

  • r = 1 indicates a perfect positive linear relationship

  • r = -1 indicates a perfect negative linear relationship

  • r = 0 indicates no linear relationship
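
For reference, the sample coefficient is commonly written as the co-variation of the two variables divided by the product of their individual variations. The symbols x_i, y_i, and their means are my own notation:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```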

When to Use:

Use Pearson correlation to examine the relationship between two continuous variables, like height and weight. This statistic is also used frequently in practice.
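
A small sketch of the calculation with the standard library follows. The function name and the height and weight values are made up. In practice, scipy.stats.pearsonr returns r together with a p-value.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n

    # Sum of products of deviations (numerator of r).
    co_dev = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    dev_x = sum((a - mean_x) ** 2 for a in x)
    dev_y = sum((b - mean_y) ** 2 for b in y)

    return co_dev / math.sqrt(dev_x * dev_y)


# Made-up heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 90]

r = pearson_r(heights, weights)
print(f"r = {r:.3f}")
```

Keep in mind that r only captures linear relationships, so a strongly curved relationship can still produce a value near 0.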

(Bonus) Fisher’s Exact Test: Analyzing Small Sample Sizes

Fisher’s exact test is an alternative to the Chi-squared test when sample sizes are small. Without diving into complex formulas, Fisher’s exact test calculates the probability of observing data as extreme as or more extreme than what was observed, assuming no association between the variables.
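
The "as extreme or more extreme" idea can be sketched for a 2x2 table with the standard library. This is one common two-sided definition (sum the probabilities of all tables that are no more likely than the observed one, with the row and column totals held fixed). The function name and the counts are hypothetical. In practice, scipy.stats.fisher_exact does this for you.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)

    # Add up every table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Made-up small-sample table: rows = group, columns = outcome.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p = {p_value:.4f}")
```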

Conclusion

Choosing the correct statistical test is crucial for drawing valid conclusions in data science work. The t-test and ANOVA are great choices for comparing means, while the Chi-squared test is meant for relationships between categorical variables. The Pearson correlation assesses linear relationships between continuous variables. Fisher’s exact test is used in place of the Chi-squared test when the sample sizes are small. By understanding the context and requirements of your data, you can select the most appropriate statistical method to uncover meaningful insights for your data science use case.
