The mean (average) is calculated by summing all the values in a dataset and dividing that sum by the number of values. It is one measure of the central tendency of the data and is sensitive to extreme values.
Formula: Mean = (Σx) / n
Where:
- Mean is the average
- Σx is the sum of all values in the dataset
- n is the total number of values in the dataset
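As a minimal sketch of the formula above, the mean can be computed directly from the sum and the count. The function name `mean` and the sample data are illustrative assumptions, not part of the original text.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


# Illustrative data (an assumption for this example).
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data))  # 5.0
```

Python's standard library offers `statistics.mean` for the same calculation.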
The median is the middle value in a dataset when the values are arranged in ascending order.
If there is an even number of values, the median is the average of the two middle values.
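Formula (Odd number of values): Median = Value at position (n + 1) / 2
Formula (Even number of values): Median = (Value at position n/2 + Value at position (n/2 + 1)) / 2
Positions are counted from 1 in the sorted data, and n is the total number of values.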
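The two cases can be sketched in a single function. The formulas count positions from 1, while Python lists are indexed from 0, so the indices below are shifted by one. The function name `median` is an illustrative choice.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```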
The minimum is the smallest value in a dataset.
Formula: Minimum = Smallest Value
The maximum is the largest value in a dataset.
Formula: Maximum = Largest Value
The range is the difference between the maximum and minimum values in a dataset. It provides a measure of the spread or variability in the data.
Formula: Range = Maximum - Minimum
The midrange is the average of the maximum and minimum values in a dataset.
Formula: Midrange = (Maximum + Minimum) / 2
The count is the total number of values in a dataset, usually denoted n.
The sum is the total of all values in a dataset.
Formula: Sum = Σx
Where:
- Σx is the sum of all values in the dataset
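The simple summary measures described so far (minimum, maximum, range, midrange, count, and sum) map directly onto Python built-ins. The following is a small sketch with made-up data; the variable names are illustrative.

```python
# Illustrative data (an assumption for this example).
data = [2, 4, 4, 4, 5, 5, 7, 9]

minimum = min(data)                  # smallest value: 2
maximum = max(data)                  # largest value: 9
value_range = maximum - minimum      # range: 7
midrange = (maximum + minimum) / 2   # midrange: 5.5
count = len(data)                    # number of values: 8
total = sum(data)                    # sum of all values: 40

print(minimum, maximum, value_range, midrange, count, total)
```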
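The pth percentile is the value below which p percent of the data falls; for example, 90% of the values lie below the 90th percentile. Percentiles are often used to locate a data point within a distribution.
Quartiles are the three values (Q1, Q2, Q3) that divide a sorted dataset into four parts, each containing about 25% of the data. Q1 is the 25th percentile, Q2 is the median (50th percentile), and Q3 is the 75th percentile. Quartiles are often used to assess the spread of data.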
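Several conventions exist for computing percentiles from a finite dataset, and they can give slightly different answers. The sketch below assumes one common convention, linear interpolation between the two closest ranks; the function name `percentile` is illustrative.

```python
def percentile(values, p):
    """p-th percentile (0 <= p <= 100) by linear interpolation between closest ranks.

    This is one convention among several; others may return slightly different values.
    """
    if not values:
        raise ValueError("percentile requires at least one value")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)   # fractional 0-based position
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3)  # 3.0 5.0 7.0
```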
The sum of squares is the sum of the squares of the differences between each data point and the mean. It is a key component in calculating variance and standard deviation.
Formula: Sum of Squares = Σ(x - Mean)²
Where:
- Σ represents the summation symbol
- x is each data point
- Mean is the mean (average) of the dataset
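A short sketch of the sum of squares as defined above: compute the mean, then add up the squared deviations from it. The function name and data are illustrative assumptions.

```python
def sum_of_squares(values):
    """Sum of squared deviations of each value from the mean."""
    if not values:
        raise ValueError("sum_of_squares requires at least one value")
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


# Mean is 5, so the squared deviations are 9, 1, 1, 1, 0, 0, 4, 16.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sum_of_squares(data))  # 32.0
```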
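The standard deviation measures the amount of variation or dispersion in a dataset. It indicates how spread out the data points are around the mean, and it is the square root of the variance. The formula below is the sample form; for a full population, divide by n instead of (n - 1).
Formula (Sample Standard Deviation): Standard Deviation = √(Σ(x - Mean)² / (n - 1))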
Where:
- √ represents the square root
- Σ represents the summation symbol
- x is each data point
- Mean is the mean (average) of the dataset
- n is the total number of values in the dataset
The variance is a measure of the spread or dispersion of a dataset. It is the average of the squared differences between each data point and the mean.
Formula (Population Variance): Variance (σ²) = Σ(x - Mean)² / N
Where:
- Σ represents the summation symbol
- x is each data point
- Mean is the mean (average) of the dataset
- N is the total number of values in the population
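Note: When working with a sample rather than the full population, use the sample variance formula, which divides by (n - 1) instead of N, where n is the sample size. This adjustment (Bessel's correction) compensates for the fact that deviations measured from the sample's own mean tend to underestimate the population variance.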
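The difference between the population and sample forms comes down to the divisor, and the standard deviation is the square root of whichever variance is chosen. The following is a minimal sketch; the function names and the `sample` flag are illustrative assumptions.

```python
import math


def variance(values, sample=True):
    """Variance: divide the sum of squares by n - 1 (sample) or n (population)."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough values")
    m = sum(values) / n
    ss = sum((x - m) ** 2 for x in values)
    return ss / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]                # sum of squares is 32
print(variance(data, sample=False))            # 4.0  (32 / 8)
print(standard_deviation(data, sample=False))  # 2.0
print(variance(data))                          # 32 / 7, about 4.571
```

The standard library's `statistics` module provides `pvariance`, `variance`, `pstdev`, and `stdev` for the same four quantities.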
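The Z-score measures how many standard deviations a data point lies above or below the mean. It is used to standardize data so that values measured on different scales can be compared. Computing a Z-score does not require the data to be normally distributed, although translating Z-scores into probabilities does assume a normal distribution.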
Formula: Z-Score = (x - Mean) / Standard Deviation
Where:
- x is the data point
- Mean is the mean (average) of the dataset
- Standard Deviation is the standard deviation of the dataset
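As a small worked example of the formula above, assume a hypothetical test with a mean score of 70 and a standard deviation of 10. These numbers and the function name are invented for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev == 0:
        raise ValueError("standard deviation must be nonzero")
    return (x - mean) / std_dev


print(z_score(85, 70, 10))  # 1.5   (one and a half standard deviations above)
print(z_score(55, 70, 10))  # -1.5  (one and a half standard deviations below)
```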
The interquartile range (IQR) is the distance between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile) of a dataset. It measures the spread of the middle 50% of the data and is less affected by extreme values than the range.
Formula: IQR = Q3 - Q1
Where:
- Q1 is the first quartile (25th percentile)
- Q3 is the third quartile (75th percentile)
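One way to sketch the IQR in Python is with `statistics.quantiles` (available from Python 3.8). The choice of `method="inclusive"` is an assumption made here; the default `"exclusive"` method uses a different quartile convention and can give a different result on the same data.

```python
import statistics


def interquartile_range(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(interquartile_range(data))  # 4.0  (Q1 = 3, Q3 = 7)
```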
The coefficient of variation (CV) is a relative measure of variability, usually expressed as a percentage. It states the standard deviation as a proportion of the mean, which makes it useful for comparing variability between datasets with different means or units. It is only meaningful when the mean is not zero.
Formula: CV = (Standard Deviation / Mean) * 100%
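A minimal sketch of the CV calculation follows, assuming the sample standard deviation (n - 1 divisor) is the one intended. The function name and data are illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """CV as a percentage: sample standard deviation divided by the mean, times 100."""
    m = statistics.mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / m * 100


# Mean is 20 and the sample standard deviation is 10, so the CV is about 50%.
print(coefficient_of_variation([10, 20, 30]))
```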
Skewness measures the asymmetry of a distribution around its mean. It indicates whether the longer tail of the distribution lies to the right or to the left.
A positive skew (right-skewed) indicates that the longer tail extends to the right, meaning the extreme values lie mostly on the high side of the distribution.
A negative skew (left-skewed) indicates that the longer tail extends to the left, meaning the extreme values lie mostly on the low side of the distribution.
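The text gives no formula for skewness, so the sketch below assumes one common definition, the moment coefficient of skewness: the third central moment divided by the second central moment raised to the power 1.5. Other definitions (for example, with a small-sample correction) give somewhat different values.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form, no correction)."""
    n = len(values)
    if n == 0:
        raise ValueError("skewness requires at least one value")
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


print(skewness([1, 2, 3, 4, 5]))     # 0.0 (symmetric)
print(skewness([1, 1, 1, 1, 10]))    # approximately 1.5 (long right tail)
print(skewness([1, 10, 10, 10, 10])) # negative (long left tail)
```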
Kurtosis measures the "tailedness" of a distribution, that is, how much of its variability comes from extreme values. It is usually reported as excess kurtosis, which subtracts 3 so that a normal distribution scores 0.
A positive excess kurtosis (leptokurtic) indicates tails heavier than those of a normal distribution, meaning extreme values are more likely; such distributions often also show a sharper central peak.
A negative excess kurtosis (platykurtic) indicates tails lighter than those of a normal distribution, meaning extreme values are less likely; such distributions often look flatter.
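No formula for kurtosis appears in the text, so this sketch assumes the common moment-based definition of excess kurtosis: the fourth central moment divided by the squared second central moment, minus 3. The function name and both datasets are invented for illustration.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (population form, no small-sample correction)."""
    n = len(values)
    if n == 0:
        raise ValueError("kurtosis requires at least one value")
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - m) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                     # evenly spread, light tails
heavy = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # mostly central, two extreme values
print(excess_kurtosis(flat))   # approximately -1.3 (platykurtic)
print(excess_kurtosis(heavy))  # approximately 2.0 (leptokurtic)
```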
Covariance measures the degree to which two variables change together. It indicates whether the variables have a positive or negative linear relationship.
Formula: Cov(X, Y) = Σ((X - Mean(X)) * (Y - Mean(Y))) / (n - 1)
Where:
- Σ represents the summation symbol
- X and Y are variables
- Mean(X) and Mean(Y) are the means of X and Y, respectively
- n is the total number of observations
If the covariance is positive, it indicates a positive relationship (X tends to increase when Y increases).
If the covariance is negative, it indicates a negative relationship (X tends to decrease when Y increases).
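The sample covariance formula above can be sketched as follows. The function name and the paired data are illustrative assumptions.

```python
def covariance(xs, ys):
    """Sample covariance: sum of products of paired deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("covariance requires at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4, 5]
print(covariance(xs, [2, 4, 6, 8, 10]))  # 5.0  (Y rises with X)
print(covariance(xs, [10, 8, 6, 4, 2]))  # -5.0 (Y falls as X rises)
```

The size of the covariance depends on the units of X and Y, so it is mainly the sign that is easy to interpret.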
The correlation coefficient measures the strength and direction of the linear relationship between two variables. It is a normalized version of covariance that ranges from -1 to 1.
Formula: r = Cov(X, Y) / (Standard Deviation(X) * Standard Deviation(Y))
Where:
- Cov(X, Y) is the covariance between X and Y
- Standard Deviation(X) and Standard Deviation(Y) are the standard deviations of X and Y, respectively
If |r| is close to 1, it indicates a strong linear relationship, with positive r indicating a positive correlation and negative r indicating a negative correlation. If |r| is close to 0, it indicates a weak or no linear relationship.
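As a sketch of the formula above, the correlation coefficient can be computed by dividing the sample covariance by the product of the two sample standard deviations. The function name and data are illustrative assumptions.

```python
import math


def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("correlation requires at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sd_x * sd_y)


xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2, 4, 6, 8, 10]))  # approximately 1.0 (perfect positive)
print(correlation(xs, [10, 8, 6, 4, 2]))  # approximately -1.0 (perfect negative)
```

Because the same divisor appears in the covariance and in both standard deviations, using n instead of n - 1 throughout would give the same r.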