INTRODUCTION
Statistics is a science that studies how to plan, collect, analyze, and draw conclusions from data. It plays a crucial role in various fields, such as economics, medicine, social sciences, and government. For example, a population census is a real-world application of statistics, where data is gathered about individuals in a country to provide insights into demographics, employment, education, and more.
Mean, median, and mode are fundamental concepts in statistics used to summarize and describe data distributions. These measures help in understanding the central tendencies of a dataset, making it easier to interpret large amounts of information. Here's how each relates to statistics:
1. Mean (Average):
The mean is the sum of all the values in a dataset divided by the number of values. It provides a mathematical average, which gives an idea of the general magnitude of the values. In statistics, the mean is commonly used to analyze normally distributed data but can be sensitive to outliers (extremely high or low values).
Here's the sigma notation for the mean (average) in statistics:
μ = (1/N) ∑_{i=1}^{N} xᵢ
Where:
μ (mu) represents the mean
N is the number of values in the dataset
xᵢ represents each individual value
∑ (sigma) indicates the sum from i=1 to N
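As a rough illustration of the definition above, the following sketch sums the values and divides by their count. The class name MeanSketch and the choice to return NaN for empty input are assumptions made for this example, not part of the article's later code.

```java
class MeanSketch {
    // Arithmetic mean: sum of all values divided by how many there are.
    // Returning NaN for an empty array is an assumption of this sketch.
    static double mean(double[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double x : values) {
            sum += x; // one term of the sigma sum
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] typical = {2, 4, 6, 8};
        double[] withOutlier = {2, 4, 6, 8, 1000};
        System.out.println(mean(typical));     // 5.0
        System.out.println(mean(withOutlier)); // 204.0
    }
}
```

The second call hints at the outlier sensitivity mentioned earlier: a single extreme value moves the mean from 5.0 to 204.0.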
The History of the Mean in Statistics
The concept of the mean, a fundamental statistical measure, has a rich and long history that spans centuries. While there isn't a single, definitive inventor, its evolution can be traced back to ancient civilizations and through various mathematical advancements.
Early Origins
- Ancient Greeks: The ancient Greeks were among the earliest to use the concept of the average. They calculated the average position of celestial bodies and used it to predict their future movements.
- Medieval astronomers: Astronomers in the Middle Ages continued to employ the mean to refine their astronomical calculations.
Modern Developments
- Simon Stevin (1548-1620): The Flemish mathematician Simon Stevin popularized decimal fractions in Europe, which greatly simplified calculations involving averages.
- Pierre-Simon Laplace (1749-1827): Laplace's work on probability theory laid the groundwork for understanding the mean and its properties within statistical distributions.
- Carl Friedrich Gauss (1777-1855): Gauss's contributions to statistics, including his work on the normal distribution, solidified the mean's role as a central measure of location.
- Statistical inference: The development of statistical inference methods, such as hypothesis testing and confidence intervals, further elevated the importance of the mean in data analysis.
The Mean Today
The mean remains a cornerstone of modern statistics. It is used in various fields, including:
- Social sciences: To analyze demographic data, economic indicators, and social trends.
- Natural sciences: To study biological populations, physical phenomena, and environmental factors.
- Engineering: To assess quality control, analyze performance data, and optimize processes.
- Finance: To evaluate investment returns, calculate risk metrics, and make financial decisions.
In conclusion, the history of the mean is a testament to the enduring value of this statistical concept. Its development, influenced by mathematicians and scientists across centuries, has shaped our understanding of data and its applications in various fields.
2. Median:
The median is the middle value when a dataset is arranged in ascending or descending order. If the dataset has an odd number of values, the median is the middle one. If it has an even number, the median is the average of the two middle values. The median is useful in cases where the data contains outliers or skewed distributions, as it is not affected by extreme values.
Unlike the mean, the median does not have a straightforward sigma notation because it involves sorting the data and finding the middle value, rather than summing values.
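- If the data set has n observations, sort the data in ascending order, so that x_k denotes the k-th smallest value.
- The median is the value at the middle position, which depends on whether n is odd or even:
a. Odd number of observations: Median = x_{(n+1)/2}
b. Even number of observations: Median = (x_{n/2} + x_{n/2+1}) / 2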
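The odd and even cases can be sketched in a few lines of Java. This is only an illustration: the class name MedianSketch is hypothetical, and rejecting empty input with an exception is an assumption of this sketch. Note that the 1-based position (n+1)/2 corresponds to the zero-based index n / 2.

```java
import java.util.Arrays;

class MedianSketch {
    // Median by position in the sorted data. Empty input is rejected,
    // which is an assumption made for this sketch.
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("median of empty data is undefined");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        int mid = n / 2; // zero-based index of the (upper) middle element
        if (n % 2 == 1) {
            return sorted[mid]; // odd: the single middle value
        }
        return (sorted[mid - 1] + sorted[mid]) / 2.0; // even: average of the two middle values
    }

    public static void main(String[] args) {
        System.out.println(median(new double[]{8, 2, 6, 4}));       // 5.0
        System.out.println(median(new double[]{2, 4, 6, 8, 1000})); // 6.0
    }
}
```

Adding the extreme value 1000 only shifts the median from 5.0 to 6.0, which is the robustness to outliers described above.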
The History of the Median
The median, a statistical measure representing the middle value in a dataset, has a long and rich history. While there isn't a single, definitive inventor, its origins can be traced back to ancient civilizations and through various mathematical advancements.
Early Origins
- Ancient civilizations: The concept of the middle value likely emerged in ancient times, possibly in conjunction with early forms of data analysis. For example, ancient Egyptians might have used the median to determine the average height of soldiers.
- Medieval astronomers: Astronomers in the Middle Ages may have employed the median to analyze astronomical data and identify central tendencies.
Modern Developments
- Early statisticians: Pioneers in the field of statistics, such as John Graunt (1620-1674) and William Petty (1623-1687), likely used the median in their analyses of population data.
- 19th and 20th centuries: The median gained prominence in the 19th and 20th centuries as statistical methods became more sophisticated. It was particularly useful in analyzing skewed distributions, where the mean might be influenced by extreme values (outliers).
- Robust statistics: The median's resistance to outliers has led to its increased use in robust statistics, a branch of statistics that focuses on methods that are less sensitive to data anomalies.
The Median Today
The median remains a widely used statistical measure in various fields, including:
- Social Sciences: To analyze income distribution, educational attainment, and other social indicators.
- Economics: To study economic inequality, poverty rates, and the distribution of wealth.
- Medicine: To evaluate health outcomes, such as median survival times for patients with certain diseases.
- Environmental science: To assess environmental pollution levels and ecological changes.
In conclusion, the history of the median is a testament to its enduring value as a statistical tool. Its ability to provide a robust measure of central tendency, even in the presence of outliers, has made it a valuable asset in various fields of study.
3. Mode:
The mode is the value that occurs most frequently in a dataset. It is especially helpful for categorical data or data with repeated values. There can be more than one mode (bimodal or multimodal) or none at all if no values repeat. In statistics, the mode is useful for understanding the most common values in a dataset, particularly in non-numerical contexts like survey responses.
Like the median, the mode doesn't have a standard sigma notation formula. It can, however, be described with a mathematical representation.
For a dataset {x₁, x₂, ..., xₙ}, the mode can be expressed as:
Mode = arg max f(x)
Where:
- arg max means "the argument that maximizes"
- f(x) is the frequency function that returns the number of occurrences of x in the dataset
In other words, the mode is the value of x for which f(x) is the largest.
For a more detailed representation, we could write:
Mode = {x | f(x) = max(f(x₁), f(x₂), ..., f(xₙ))}
Or, with a small mathematical trick, the frequency f(x) can be written as a summation of the Kronecker delta function over the index i, from 1 to n:
f(x) = ∑_{i=1}^{n} δ(x, xᵢ)
The Kronecker delta function is defined as:
δ(a, b) = 1 if a = b, and 0 otherwise
Taken together, these representations say that the mode is the value x for which the frequency f(x) is equal to the maximum frequency of any value in the dataset.
It's important to note that:
- A dataset can have multiple modes if two or more values share the highest frequency.
- The mode doesn't involve summing or averaging the values themselves, which is why it has no direct sigma notation the way the mean does.
- The mode can be applied to non-numeric data as well, unlike the mean (and unlike the median, which needs data that can at least be ordered).
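To make the frequency function and the arg max idea concrete, here is a small, deliberately naive sketch. The names ModeSketch, frequency, and modes are hypothetical. It recounts occurrences for every element, so it runs in O(n²) time and is meant to mirror the notation rather than to be efficient.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

class ModeSketch {
    // f(x): number of occurrences of x, written as a sum of delta(x, x_i) terms.
    static <T> int frequency(List<T> data, T x) {
        int count = 0;
        for (T xi : data) {
            count += Objects.equals(xi, x) ? 1 : 0; // delta is 1 on a match, 0 otherwise
        }
        return count;
    }

    // Every value whose frequency equals the maximum frequency.
    static <T> List<T> modes(List<T> data) {
        int max = 0;
        for (T x : data) {
            max = Math.max(max, frequency(data, x));
        }
        Set<T> result = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        for (T x : data) {
            if (frequency(data, x) == max) {
                result.add(x);
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        // Works for non-numeric data too.
        System.out.println(modes(List.of("red", "blue", "red", "green", "blue"))); // [red, blue]
    }
}
```

Because "red" and "blue" both occur twice, the sketch reports both, matching the multimodal case in the notes above.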
The History of the Mode
The mode, a statistical measure representing the most frequently occurring value in a dataset, has a long and rich history. While there isn't a single, definitive inventor, its origins can be traced back to ancient civilizations and through various mathematical advancements.
Early Origins
- Ancient civilizations: The concept of the most frequent value likely emerged in ancient times, possibly in conjunction with early forms of data analysis. For example, ancient Egyptians might have used the mode to determine the most common type of grain harvested.
- Medieval astronomers: Astronomers in the Middle Ages may have employed the mode to identify the most common observation or measurement.
Modern Developments
- Early statisticians: Pioneers in the field of statistics, such as John Graunt (1620-1674) and William Petty (1623-1687), likely used the mode in their analyses of population data.
- 19th and 20th centuries: The mode gained prominence in the 19th and 20th centuries as statistical methods became more sophisticated. It was particularly useful in analyzing categorical data, such as the most common color of cars or the most popular brand of ice cream.
- Data analysis: The mode has become an essential tool in data analysis, especially in fields like market research, quality control, and social sciences.
The Mode Today
The mode remains a widely used statistical measure in various fields, including:
- Market research: To identify the most popular products or services.
- Quality control: To detect defects or inconsistencies in manufacturing processes.
- Social sciences: To analyze voting patterns, consumer preferences, and other social phenomena.
In conclusion, the history of the mode is a testament to its enduring value as a statistical tool. Its ability to identify the most common value in a dataset has made it a valuable asset in various fields of study.
Together, these measures provide a comprehensive understanding of data's central tendency and are key tools in statistical analysis.
LET’S CODE
Now, you might wonder, how does this concept relate to programming, and more specifically, how can it be implemented in Java? Let's dive straight into the practical side of things and explore how to calculate mean, median, and mode using Java.
We start with a class named Statistics:
public class Statistics {
}
Inside it, we declare a NumberOperation interface that has a single compute method:
interface NumberOperation<T extends Number> {
    List<T> compute(T[] numbers);
}
The operations themselves are written directly in the main method:
public static void main(String[] args) {
}
One by one, we implement the NumberOperation interface for Mean, Median, and Mode.
Mean
NumberOperation<Integer> mean = numbers -> {
    double mean1 = Arrays.stream(numbers)
            .mapToInt(Integer::intValue)
            .average()
            .orElse(Double.NaN);
    return Double.isNaN(mean1) ? List.of() : List.of((int) mean1);
};
Code Explanation
The code is creating a lambda expression that calculates the mean (average) of an array of integers and returns the result in a List. It uses Java Streams to perform this operation.
1. Lambda Expression:
NumberOperation<Integer> mean = numbers -> { ... };
This defines a lambda expression that implements the NumberOperation interface (the custom functional interface declared earlier), where numbers is the input parameter.
The lambda takes an array of integers (numbers) as input and returns a List containing the mean of those integers.
2. Streaming the Array:
Arrays.stream(numbers)
- numbers is an array of Integer.
- Arrays.stream(numbers) converts this array into a stream of Integer elements. Streams allow for functional-style operations on sequences of elements, such as mapping, filtering, and reducing.
3. Mapping to Primitive Int:
.mapToInt(Integer::intValue)
- The .mapToInt(Integer::intValue) step converts the Integer elements into int (primitive int values). This is necessary because the .average() method works on IntStream, which operates on primitive int.
4. Calculating the Average:
.average()
- This calculates the average of the elements in the stream and returns an OptionalDouble, which may or may not contain a result. If the stream is empty (no elements), it returns an empty OptionalDouble.
5. Handling Empty Arrays:
.orElse(Double.NaN)
- If the stream is empty (i.e., no numbers are provided), .orElse(Double.NaN) ensures that the mean value returned is NaN (Not-a-Number), represented as Double.NaN, instead of an empty optional.
6. Casting to Integer:
(int) mean1
- Since the result of .average() is a double, it is cast to an int. This cast truncates the decimal part of the mean (for example, if the mean is 3.7, it will be truncated to 3).
7. Not a Number handling:
- Casting Double.NaN to int results in 0, so casting blindly would report a mean of 0 for an empty array, which might not be what you want. The code therefore checks for NaN before casting and returns an empty list in that case.
8. Returning a List:
return Double.isNaN(mean1) ? List.of() : List.of((int) mean1);
- If mean1 is NaN, this returns an empty list. Otherwise, it casts the mean to int, which is auto-boxed to Integer, and wraps it in a list using List.of().
- List.of() is a method introduced in Java 9 to create an immutable list with the provided elements. Here, it's creating a list containing just the single value of the mean.
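If the truncation described in step 6 is not wanted, one possible variation is to return the average as a double instead of forcing it back into an Integer. The sketch below steps outside the NumberOperation interface to do that. The class name PreciseMean and the use of OptionalDouble as the return type are assumptions made for illustration, not the article's design.

```java
import java.util.Arrays;
import java.util.OptionalDouble;

class PreciseMean {
    // Returns the untruncated mean, or an empty OptionalDouble for empty input.
    static OptionalDouble mean(Integer[] numbers) {
        return Arrays.stream(numbers)
                .mapToInt(Integer::intValue)
                .average();
    }

    public static void main(String[] args) {
        Integer[] data = {1, 2, 4, 8};
        double exact = mean(data).getAsDouble();
        System.out.println(exact);       // 3.75
        System.out.println((int) exact); // 3, the value the truncating version would report
        System.out.println(mean(new Integer[0]).isPresent()); // false
    }
}
```

The trade-off is that this version no longer returns a List, so it gives up the uniform return type that the three lambdas share.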
Median
NumberOperation<Integer> median = numbers -> {
    var listSorted = Arrays.stream(numbers).sorted().toList();
    var index = listSorted.size() / 2;
    var value = switch (listSorted.size() % 2) {
        case 0 -> (int) ((listSorted.get(index - 1) + listSorted.get(index)) / 2.0);
        default -> listSorted.get(index);
    };
    return List.of(value);
};
Code Explanation
This code defines a lambda expression that calculates the median of an array of integers. It sorts the array, computes the median, and returns the result in a List. Let's go through the code step by step.
- Lambda Expression:
NumberOperation<Integer> median = numbers -> { ... };
o This defines a lambda that implements the NumberOperation interface, where numbers is the input parameter (an array of Integer values).
o The lambda takes an array of integers (numbers) as input and returns a List containing the median value.
- Sorting the Array:
var listSorted = Arrays.stream(numbers).sorted().toList();
o Arrays.stream(numbers) creates a stream from the input numbers array.
o .sorted() sorts the integers in ascending order.
o .toList() converts the sorted stream into an unmodifiable list, which is stored in the listSorted variable. Stream.toList() is available from Java 16 onward.
o Now, listSorted contains the sorted integers, which are needed to calculate the median.
- Finding the Middle Index:
var index = listSorted.size() / 2;
o The median is the middle value in a sorted list.
o The size() method returns the number of elements in the sorted list.
o Dividing the size by 2 (integer division) gives the index of the middle element, or of the upper of the two middle elements if the list size is even.
o This index is used to access the median value(s) from the sorted list.
- Switch Expression for Median Calculation:
var value = switch (listSorted.size() % 2) {
    case 0 -> (int) ((listSorted.get(index - 1) + listSorted.get(index)) / 2.0);
    default -> listSorted.get(index);
};
o This switch expression checks whether the list size is even or odd using the modulus operator (listSorted.size() % 2).
o Case 0 (Even number of elements):
If the list size is even (listSorted.size() % 2 == 0), the median is the average of the two middle elements. The two middle elements are at index - 1 and index.
The average of these two elements is calculated as:
(int) ((listSorted.get(index - 1) + listSorted.get(index)) / 2.0);
Dividing by 2.0 produces a double, and the (int) cast truncates it, so a median of 2.5 is reported as 2.
o Default (Odd number of elements):
If the list size is odd, the median is simply the element at the middle index (index). This is:
listSorted.get(index);
- Returning the Result:
return List.of(value);
o The median value (calculated above) is stored in value.
o List.of(value) creates an immutable list containing this median value.
o This list is returned as the result of the lambda expression.
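The median lambda as written has two edge cases worth a look. An empty array takes the even branch and fails on get(index - 1), and adding two very large int values can overflow before the division. The following is one possible hardened variant, not the article's implementation. The class name SafeMedian and the constant MEDIAN are hypothetical, and the interface is repeated only so the sketch compiles on its own.

```java
import java.util.Arrays;
import java.util.List;

class SafeMedian {
    // Same shape as the article's interface, repeated to keep the sketch self-contained.
    interface NumberOperation<T extends Number> {
        List<T> compute(T[] numbers);
    }

    static final NumberOperation<Integer> MEDIAN = numbers -> {
        if (numbers.length == 0) {
            return List.of(); // assumption: an empty input yields an empty result
        }
        Integer[] sorted = numbers.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int index = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return List.of(sorted[index]);
        }
        // Widen to long so the sum of two large ints cannot overflow.
        long sum = sorted[index - 1].longValue() + sorted[index];
        return List.of((int) (sum / 2)); // truncates toward zero, like the (int) cast
    };

    public static void main(String[] args) {
        System.out.println(MEDIAN.compute(new Integer[]{}));           // []
        System.out.println(MEDIAN.compute(new Integer[]{4, 1, 3, 2})); // [2]
        System.out.println(MEDIAN.compute(
                new Integer[]{Integer.MAX_VALUE, Integer.MAX_VALUE})); // [2147483647]
    }
}
```

Returning an empty list for empty input mirrors what the mean and mode lambdas do, which keeps the three operations consistent.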
Mode
NumberOperation<Integer> mode = numbers -> {
    if (numbers.length == 0) return List.of(); // Handle empty array
    Map<Integer, Long> frequencyMap = Arrays.stream(numbers)
            .collect(Collectors.groupingBy(n -> n, Collectors.counting()));
    long maxFrequency = frequencyMap.values().stream()
            .mapToLong(count -> count)
            .max()
            .orElse(0);
    return frequencyMap.entrySet().stream()
            .filter(entry -> entry.getValue() == maxFrequency)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
};
Code Explanation
This code defines a lambda expression to calculate the mode of an array of integers. The mode is the value that appears most frequently in a dataset. If multiple numbers have the same maximum frequency, the code returns all such numbers in a list.
- Lambda Expression:
NumberOperation<Integer> mode = numbers -> { ... };
o This lambda implements the NumberOperation interface, with numbers as the input (an array of integers).
o It calculates the mode(s) of the input numbers and returns them as a List.
- Creating the Frequency Map:
Map<Integer, Long> frequencyMap = Arrays.stream(numbers)
        .collect(Collectors.groupingBy(n -> n, Collectors.counting()));
o Arrays.stream(numbers) creates a stream from the input array numbers.
o .collect(Collectors.groupingBy(n -> n, Collectors.counting())) groups the numbers based on their values and counts how often each number appears. Here's how it works:
groupingBy(n -> n): Groups the numbers by their value (n -> n means each number is its own key).
Collectors.counting(): Counts how many times each number occurs.
o The result is stored in frequencyMap, a Map where:
The key is the integer from the input array.
The value is the frequency (count) of that integer.
- Finding the Maximum Frequency:
long maxFrequency = frequencyMap.values().stream()
        .mapToLong(count -> count)
        .max()
        .orElse(0);
o frequencyMap.values() returns the collection of all frequency values (i.e., the counts of each number).
o .stream() creates a stream from these frequency values.
o .mapToLong(count -> count) converts the frequency values (which are Long) to primitive long for efficient numerical operations.
o .max() finds the maximum frequency in the stream, which is the highest number of occurrences of any integer.
o .orElse(0) sets maxFrequency to 0 if the stream is empty. Because an empty input array already returns early, this is only a fallback.
- Finding the Numbers with Maximum Frequency:
return frequencyMap.entrySet().stream()
        .filter(entry -> entry.getValue() == maxFrequency)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
o frequencyMap.entrySet() returns a set of Map.Entry representing the key-value pairs in the frequencyMap (where the key is the number, and the value is its frequency).
o .stream() creates a stream from this set of entries.
o .filter(entry -> entry.getValue() == maxFrequency) filters the entries to keep only those whose frequency equals the maxFrequency. This ensures that only the numbers with the maximum frequency are considered.
o .map(Map.Entry::getKey) maps each entry to its key, which is the integer (the number from the original array).
o .collect(Collectors.toList()) collects the resulting keys (the mode values) into a List and returns that list.
Key Concepts:
- Grouping and Counting:
o The Collectors.groupingBy() method is used to group the numbers by their value and then count how often each number appears using Collectors.counting().
- Finding Maximum:
o The maximum frequency is calculated by finding the largest value in the collection of counts using .max(). If there are no counts, .orElse(0) handles this by setting the maximum frequency to 0.
- Filtering by Maximum Frequency:
o After finding the maximum frequency, the code filters the entries in the map to find which numbers occur that many times (i.e., those with the maximum frequency).
- Multiple Modes:
o If there are multiple numbers with the same frequency (multiple modes), the code returns all of them in the list. For example, if two numbers both appear the most times, they will both be returned.
Example Walkthrough:
Let’s assume the input array is: {100, 125, 70, 120, 100, 200, 150, 100, 80, 120, 120}
• Step 1: Create a frequency map: After applying Collectors.groupingBy(), the frequencyMap would look like: {100=3, 125=1, 70=1, 120=3, 200=1, 150=1, 80=1}. This means:
• 100 appears 3 times.
• 120 appears 3 times.
• 125, 70, 200, 150, and 80 each appear 1 time.
• Step 2: Find the maximum frequency: The maximum frequency is 3 (both the numbers 100 and 120 appear 3 times).
• Step 3: Filter the numbers that have this maximum frequency: The numbers 100 and 120 both have this maximum frequency.
• Result: The result is a list containing [100, 120], which are the modes of the array.
Finally, here is the entire code of everything we discussed:
package com.takatws;

import java.util.*;
import java.util.stream.Collectors;

public class Statistics {

    interface NumberOperation<T extends Number> {
        List<T> compute(T[] numbers);
    }

    public static void main(String[] args) {
        NumberOperation<Integer> mean = numbers -> {
            double mean1 = Arrays.stream(numbers)
                    .mapToInt(Integer::intValue)
                    .average()
                    .orElse(Double.NaN);
            return Double.isNaN(mean1) ? List.of() : List.of((int) mean1);
        };

        NumberOperation<Integer> median = numbers -> {
            var listSorted = Arrays.stream(numbers).sorted().toList();
            var index = listSorted.size() / 2;
            var value = switch (listSorted.size() % 2) {
                case 0 -> (int) ((listSorted.get(index - 1) + listSorted.get(index)) / 2.0);
                default -> listSorted.get(index);
            };
            return List.of(value);
        };

        NumberOperation<Integer> mode = numbers -> {
            if (numbers.length == 0) return List.of(); // Handle empty array
            Map<Integer, Long> frequencyMap = Arrays.stream(numbers)
                    .collect(Collectors.groupingBy(n -> n, Collectors.counting()));
            long maxFrequency = frequencyMap.values().stream()
                    .mapToLong(count -> count)
                    .max()
                    .orElse(0);
            return frequencyMap.entrySet().stream()
                    .filter(entry -> entry.getValue() == maxFrequency)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        };

        int[] data = {100, 125, 70, 120, 100, 200, 150, 100, 80, 120, 120};
        /* compute expects an Integer[] (the generic T[]), so the primitive int[] is
           boxed into a new array; the 'data' variable itself is kept unchanged */
        Integer[] data1 = Arrays.stream(data)
                .boxed() // Convert each int to Integer (boxing)
                .toArray(Integer[]::new);

        System.out.println("Mean: " + mean.compute(data1));
        System.out.println("Median: " + median.compute(data1));
        System.out.println("Mode: " + mode.compute(data1));
    }
}
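One detail the listing leaves open is the order in which multiple modes come back. Collectors.groupingBy with two arguments makes no promise about the kind of Map it returns, so the order of the result list should not be relied on. A possible variation, sketched here under the hypothetical class name SortedModes, passes TreeMap::new as the map factory so that the modes come back in ascending order.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SortedModes {
    // Returns all modes in ascending order; an empty input yields an empty list.
    static List<Integer> modes(Integer[] numbers) {
        if (numbers.length == 0) {
            return List.of();
        }
        // The three-argument groupingBy lets us choose the map; a TreeMap keeps keys sorted.
        Map<Integer, Long> counts = Arrays.stream(numbers)
                .collect(Collectors.groupingBy(n -> n, TreeMap::new, Collectors.counting()));
        long max = Collections.max(counts.values());
        return counts.entrySet().stream()
                .filter(e -> e.getValue() == max)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Integer[] data = {100, 125, 70, 120, 100, 200, 150, 100, 80, 120, 120};
        System.out.println(modes(data));                       // [100, 120]
        System.out.println(modes(new Integer[]{9, 3, 9, 3}));  // [3, 9]
    }
}
```

The sorted map costs O(n log n) in the worst case instead of the expected linear time of a hash-based map, so this is a trade of speed for a predictable order.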
SUMMARY
In the above code snippets, we have implemented three common statistical operations—mean, median, and mode—using Java's functional programming constructs, specifically lambda expressions and streams.
CONCLUSION:
These implementations use Java's powerful stream API and functional programming techniques to process collections of numbers in a clear and efficient way. The following aspects are worth noting:
• Consistency: Each statistical operation returns a result as a List, ensuring consistent handling of results.
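• Handling Edge Cases:
o The mean calculation handles empty arrays by returning an empty list; the intermediate NaN is never cast to int.
o The mode calculation handles empty arrays with an early return, and handles cases where multiple numbers share the same highest frequency by returning all such numbers.
o The median calculation does not guard against an empty array; as written, it would fail with an index-out-of-bounds exception on empty input.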
• Performance Consideration:
o Sorting for the median operation has a time complexity of O(n log n), which may affect performance for large datasets.
o The mode calculation requires counting the frequency of each element, but the overall complexity is linear (O(n)).
These implementations are flexible, reusable, and leverage Java's functional programming constructs to compute fundamental statistical measures efficiently.
Author Bio:
Takat Wicaksono has over 26 years of experience across a wide range of technologies, specializing in software development and system architecture. He is currently focused on backend development, multiplatform mobile applications, and data analysis. His project experience relevant to this article includes developing psychometric applications with various models for multiple companies, as well as desktop and web-based HRMS systems. He has also served as a Startup Founder, IT Advisor / Consultant, Trainer, and System Development/IT Lead. For system development consulting or building a startup system, you can contact him at LinkedIn.
to be continued...
(This article is currently in draft form)