Commonly Used Statistical Tests in Data Science

A Comprehensive Guide to Essential Statistical Tests and Their Applications

In the age of data, making informed decisions relies on the subtle art of interpreting numbers and patterns.

This article introduces the statistical tests commonly used in data science to aid in making these informed decisions.

For each statistical test, such as the T-test and Chi-square test, you will find explanations, calculations, Python implementations, and project suggestions. Let’s start with the T-test.

If you want a beginner’s guide to the most popular types of statistical tests, check out these basic types of statistical tests in data science.

T-Test

The T-test checks whether the difference in means between two groups is statistically significant. It is based on the t-distribution, a type of probability distribution.

There are mainly three types of T-tests.

  • Independent samples T-test: Compares the means of two unrelated groups.
  • Paired sample T-test: Compares means from the same group at different times.
  • One-sample T-test: Tests the mean of a single group against a known mean.
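The paired and one-sample variants can also be run with SciPy. A minimal sketch with hypothetical before/after measurements (the data and the reference mean of 5.0 are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Paired sample T-test: the same 20 subjects measured before and after
before = rng.normal(5.0, 1.0, 20)
after = before + rng.normal(0.5, 0.5, 20)  # hypothetical improvement
t_paired, p_paired = stats.ttest_rel(before, after)
print(f"Paired: t={t_paired:.2f}, p={p_paired:.4f}")

# One-sample T-test: does the 'after' mean differ from a known value of 5.0?
t_one, p_one = stats.ttest_1samp(after, popmean=5.0)
print(f"One-sample: t={t_one:.2f}, p={p_one:.4f}")
```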

Calculation

In simple terms, the T-test takes the difference between the two group means and divides it by the variability in the data. It generates a t-score and compares it against a critical value from the t-distribution to determine significance.

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}}}

Here, \bar{x}_1 and \bar{x}_2 are the sample means, s^2 is the pooled variance, and n_1 and n_2 are the sample sizes.
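The formula can be evaluated by hand to see where the t-score comes from. A small sketch with two hypothetical samples, using a single pooled variance estimate as in the formula above:

```python
import numpy as np

group_a = np.array([4.1, 5.3, 4.8, 5.0, 4.6])
group_b = np.array([5.9, 6.4, 6.1, 5.7, 6.3])

x1, x2 = group_a.mean(), group_b.mean()
n1, n2 = len(group_a), len(group_b)

# Pooled variance across both samples (the single s^2 in the formula)
s2 = ((n1 - 1) * group_a.var(ddof=1) + (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2)

t = (x1 - x2) / np.sqrt(s2 / n1 + s2 / n2)
print(f"t = {t:.3f}")
```

This matches `scipy.stats.ttest_ind` with its default assumption of equal variances.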

Simple Python Implementation

Let’s see a simple implementation of it.

import numpy as np
from scipy import stats

# Sample data: Group A and Group B
group_a = np.random.normal(5.0, 1.5, 30)
group_b = np.random.normal(6.0, 1.5, 30)

# Performing an Independent T-test
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"T-Statistic: {t_stat}, P-Value: {p_val}")


Since the data is randomly generated, results vary between runs. One sample run produced:

  • T-Statistic: -3.06
  • P-Value: 0.003

This P-value is less than the common alpha level of 0.05, suggesting a statistically significant difference between the two groups' means at the 5% significance level. The negative T-statistic indicates that group A's mean is lower than group B's.

Further Project Suggestion

Effectiveness of Sleep Aids: Compare the average sleep duration of subjects taking a new herbal sleep aid versus a placebo.

Educational Methods: Evaluate students' test scores using traditional methods against those taught via e-learning platforms.

Fitness Program Results: Assess the impact of two different 8-week fitness programs on weight loss among similar demographic groups.

Productivity Software: Compare the average task completion time for two groups using different productivity apps.

Food Preference Study: Measure and compare the perceived taste rating of a new beverage with a standard competitor's product across a sample of consumers.

Chi-Square Test

The Chi-Square Test shows whether there is a significant association between two categorical variables. It compares the observed frequencies in each category against the frequencies expected if no association exists.

  • Chi-Square Test of Independence: Assesses if two categorical variables are independent.
  • Chi-Square Goodness of Fit Test: Determines if a sample distribution fits a population distribution.
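For the goodness-of-fit variant, SciPy's `chisquare` compares one observed distribution against expected counts. A small sketch with hypothetical counts from 120 rolls of a die, where a fair die would give 20 of each face:

```python
from scipy.stats import chisquare

# Hypothetical counts from 120 die rolls; a fair die expects 20 of each face
observed = [18, 22, 25, 15, 21, 19]
expected = [20, 20, 20, 20, 20, 20]

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi2: {chi2:.2f}, P-value: {p:.3f}")
```

A high P-value here would mean the observed counts are consistent with a fair die.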

Calculation

The formula for the Chi-Square statistic is:

\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}

Here, O_i is the observed frequency and E_i is the expected frequency for each category.

Simply, it involves calculating a value that summarizes the difference between observed and expected frequencies. The larger this value, the more likely the observed differences are not due to random chance.
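The same summary value can be computed directly from the formula. A sketch using the 2×2 table from the example below, with expected counts derived from the row and column totals:

```python
import numpy as np

observed = np.array([[30, 10], [5, 25]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
print(f"Chi2 (no continuity correction): {chi2:.2f}")
```

Note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so its statistic is slightly smaller than this uncorrected value; passing `correction=False` makes the two match.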

Simple Python Implementation

Let’s see a simple implementation of it.

from scipy.stats import chi2_contingency
import numpy as np

# Example data: Gender vs. Movie Preference
data = np.array([[30, 10], [5, 25]])
chi2, p, dof, expected = chi2_contingency(data)
print(f"Chi2 Statistic: {chi2}, P-value: {p}")

The Python code's output for the Chi-Square Test is:

  • Chi-Square Statistic: 21.06
  • P-Value: 0.00000446

The Chi-Square statistic is 21.06 with a P-value of approximately 0.00000446. This very low P-value suggests a significant association between gender and movie preference at a 5% significance level.

Further Project Suggestion

Voter Preferences: Analyze the association between voter age groups and their preference for specific political issues.

Marketing Campaign Effectiveness: Investigate if there is a significant difference in the response to two different marketing campaigns across regions.

Education Level and Technology Use: Explore the relationship between education levels and adopting new technology in a community.

Disease Outbreak: Study the correlation between the spread of disease and the population density of affected areas.

Customer Satisfaction: Assess the link between customer satisfaction levels and the time of day they receive service in a retail setting.

ANOVA (Analysis of Variance)

ANOVA is used to compare averages in three or more groups. It helps determine if at least one group's mean is statistically different.

  • One-Way ANOVA: Compares means across one independent variable with three or more levels (groups).
  • Two-Way ANOVA: Compares means considering two independent variables.
  • Repeated Measures ANOVA: Used when the same subjects are used in all groups.

Calculation

The formula for ANOVA is:

F = \frac{\text{Between-Group Variability}}{\text{Within-Group Variability}}


In simpler terms, ANOVA calculates an F-statistic, a ratio of the variance between groups to the variance within groups. A higher F-value indicates a more significant difference between the group means.
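That ratio can be computed by hand from the standard sums of squares. A sketch with three small hypothetical groups:

```python
import numpy as np

groups = [
    np.array([4.5, 5.1, 4.8, 5.3]),
    np.array([6.0, 5.8, 6.4, 6.1]),
    np.array([7.2, 6.9, 7.0, 7.4]),
]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()
k = len(groups)    # number of groups
n = len(all_data)  # total observations

# Between-group variability: spread of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variability: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F = {f:.2f}")
```

Dividing each sum of squares by its degrees of freedom (k - 1 between, n - k within) gives the two variance estimates whose ratio is F.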

Simple Python Implementation

Let’s see a simple implementation of it.

from scipy import stats
import numpy as np

# Sample data: Three different groups
group1 = np.random.normal(5.0, 1.5, 30)
group2 = np.random.normal(6.0, 1.5, 30)
group3 = np.random.normal(7.0, 1.5, 30)

# Performing One-Way ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)
print(f"F-Statistic: {f_stat}, P-Value: {p_val}")


Since the data is randomly generated, results vary between runs. One sample run produced:

  • F-Statistic: 15.86
  • P-Value: 0.00000134

The F-statistic is 15.86 with a P-value of approximately 0.00000134. This extremely low P-value indicates that at least one group mean differs significantly from the others at the 5% significance level.

Further Project Suggestion

Agricultural Crop Yields: Compare the average yields of different wheat varieties grown in multiple regions to determine the most productive.

Employee Productivity: Analyze employees' productivity across various company departments to see if there's a significant difference.

Therapeutic Techniques: Evaluate the effectiveness of multiple therapeutic techniques in reducing anxiety levels among different patient groups.

Gaming Platforms: Test if there are significant differences in average frame rates across multiple gaming consoles when running the same video game.

Dietary Effects on Health: Examine the impact of different diets (vegan, vegetarian, omnivorous) on specific health markers in participants over six months.

Pearson Correlation

Pearson Correlation evaluates the linear relationship between two continuous variables. It produces a value between -1 and 1, indicating the strength and direction of the association.

The Pearson Correlation is a specific type of correlation; it measures linear association, whereas rank-based alternatives such as Spearman's correlation capture monotonic relationships that need not be linear.
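The distinction shows up on curved but monotonic data. A quick sketch comparing the two coefficients on the made-up relationship y = x²:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = x ** 2  # monotonic but not linear

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
# Spearman reaches 1.0 because the relationship is perfectly monotonic;
# Pearson stays below 1.0 because the points do not lie on a straight line.
```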

Calculation

The formula for the Pearson Correlation coefficient is:

r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}

Simply put, it calculates how much one variable changes with another.

A value close to 1 indicates a strong positive correlation, and close to -1 indicates a strong negative correlation.
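The formula translates almost directly into NumPy. A small manual computation on made-up data:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.9, 6.2, 7.8, 10.1])

dx = x - x.mean()
dy = y - y.mean()
# Numerator: how the deviations move together; denominator: product of the spreads
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(f"r = {r:.4f}")
```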

Simple Python Implementation

Let’s see a simple implementation of it.

import numpy as np
from scipy.stats import pearsonr

# Sample data
x = np.array([10, 20, 30, 40, 50])
y = np.array([15, 25, 35, 45, 55])

# Calculating Pearson Correlation
corr, _ = pearsonr(x, y)
print(f"Pearson Correlation Coefficient: {corr}")


The Python code's output for the Pearson Correlation test is:

  • Pearson Correlation Coefficient: 1.0

The correlation coefficient of 1.0 indicates a perfect positive linear relationship between the two variables. This means when one variable goes up, the other also rises at a steady pace.

Further Project Suggestion

Economic Indicators: Investigate the relationship between consumer confidence levels and the volume of retail sales.

Healthcare Analysis: Explore the correlation between the number of hours spent on physical activity and blood pressure levels among adults.

Educational Achievement: Study the relationship between the amount of time students spend on homework and their overall academic performance.

Technology Usage: Examine the correlation between the time spent on social media and reported levels of stress or happiness.

Real Estate Pricing: Assess the strength of the linear relationship between the size of homes and their selling prices in a particular region.

Mann-Whitney U Test

The Mann-Whitney U Test is a non-parametric test used to compare differences between two independent groups when the data doesn't follow a normal distribution.

It is a standalone test, primarily used as an alternative to the T-test when data doesn't meet the normality assumption.
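One common workflow is to check normality first (for example with the Shapiro-Wilk test) and fall back to the Mann-Whitney U Test when the assumption fails. A sketch with deliberately skewed, hypothetical data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.exponential(2.0, 40)  # skewed, non-normal data
group2 = rng.exponential(3.0, 40)

# Shapiro-Wilk: a low p-value suggests the data is not normally distributed
_, p_norm1 = stats.shapiro(group1)
_, p_norm2 = stats.shapiro(group2)

if p_norm1 < 0.05 or p_norm2 < 0.05:
    stat, p = stats.mannwhitneyu(group1, group2)
    print(f"Mann-Whitney U: {stat:.1f}, p={p:.4f}")
else:
    stat, p = stats.ttest_ind(group1, group2)
    print(f"T-test: {stat:.2f}, p={p:.4f}")
```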

Calculation

The Mann-Whitney U statistic is calculated based on the ranks of the data in the combined dataset.

U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \quad U_2 = R_2 - \frac{n_2(n_2+1)}{2}, \quad U = \min(U_1, U_2)

Where

  • U is the Mann-Whitney U statistic, the smaller of U_1 and U_2.
  • R_1 and R_2 are the sums of ranks for the first and second groups, respectively.
  • n_1 and n_2 are the sample sizes of the two groups.
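The rank sums can be obtained with `scipy.stats.rankdata`. A manual sketch for two tiny hypothetical groups:

```python
import numpy as np
from scipy.stats import rankdata

group1 = np.array([3.1, 4.5, 2.8, 5.0])
group2 = np.array([6.2, 5.9, 7.1, 4.8])

combined = np.concatenate([group1, group2])
ranks = rankdata(combined)  # rank of every value in the pooled sample

n1, n2 = len(group1), len(group2)
r1 = ranks[:n1].sum()
r2 = ranks[n1:].sum()

u1 = r1 - n1 * (n1 + 1) / 2
u2 = r2 - n2 * (n2 + 1) / 2
print(f"U1 = {u1}, U2 = {u2}")
```

As a sanity check, U_1 + U_2 always equals n_1 × n_2.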

Simple Python Implementation

Let’s see a simple implementation of it.

from scipy.stats import mannwhitneyu
import numpy as np

# Sample data: Two groups
group1 = np.random.normal(5.0, 1.5, 30)
group2 = np.random.normal(6.0, 1.5, 30)

# Performing Mann-Whitney U Test
u_stat, p_val = mannwhitneyu(group1, group2)
print(f"U Statistic: {u_stat}, P-Value: {p_val}")


Since the data is randomly generated, results vary between runs. One sample run produced:

  • U Statistic: 305.0
  • P-Value: 0.032

This P-value is below the typical alpha level of 0.05, indicating a statistically significant difference between the two groups at the 5% significance level. The Mann-Whitney U Test result suggests that the two groups' distributions are not equal.

Further Project Suggestion

Medication Response: Compare the change in symptom severity before and after using two different medications in non-normally distributed patient data.

Job Satisfaction: Investigate the job satisfaction levels between employees in high-stress and low-stress departments of a company.

Teaching Materials: Evaluate the effectiveness of two teaching materials on student engagement in a classroom where data is not normally distributed.

E-Commerce Delivery Times: Assess the difference in delivery times for two courier services for e-commerce packages.

Exercise Impact on Mood: Study the effects of two different types of short-term exercise on mood improvement in participants, focusing on non-parametric data.

Conclusion

From the T-test to the Mann-Whitney U Test, we've explored their concepts and applications and walked through Python implementations and real-life project ideas.

Remember, the path to becoming a proficient data scientist is paved with practice. Diving into these tests through hands-on projects solidifies your understanding and sharpens your analytical skills.

To do that, visit our platform and work through data projects like Student Performance Analysis, where you'll have a chance to run Chi-square tests.


Become a data expert. Subscribe to our newsletter.