2022-04-09 #apache_math #java

This note shows how to compute descriptive statistics from a list of data points using the Apache Commons Math library.

Example: Iris Dataset

The example used here is the Iris dataset from the UCI Machine Learning Repository, distributed as the file iris.data.

The four attributes are stored as double values in a Java record, and the label as a string:

record IrisInstance(
    double sepalLength,
    double sepalWidth,
    double petalLength,
    double petalWidth,
    String label
  ) {}

Assume the data have been stored in a List<IrisInstance>, e.g. by using the technique in Reading CSV Files.
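
The technique in Reading CSV Files is not reproduced here. As a rough stand-in, the sketch below shows one way such a list might be built using only the JDK, assuming each non-blank line of iris.data holds four comma-separated numbers followed by the label. The class name IrisParser and its parse method are hypothetical, and the record is repeated only so that the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

record IrisInstance(
    double sepalLength,
    double sepalWidth,
    double petalLength,
    double petalWidth,
    String label
  ) {}

// Hypothetical helper, not part of the original code.
class IrisParser {
  // Converts lines such as "5.1,3.5,1.4,0.2,Iris-setosa" into records.
  static List<IrisInstance> parse(List<String> lines) {
    List<IrisInstance> data = new ArrayList<>();
    for (String line : lines) {
      if (line.isBlank()) {
        continue; // skip empty lines, e.g. at the end of the file
      }
      String[] fields = line.trim().split(",");
      data.add(new IrisInstance(
          Double.parseDouble(fields[0]),
          Double.parseDouble(fields[1]),
          Double.parseDouble(fields[2]),
          Double.parseDouble(fields[3]),
          fields[4]));
    }
    return data;
  }
}
```

With the file on disk, a call along the lines of IrisParser.parse(Files.readAllLines(Path.of("iris.data"))) would then yield the list; the file location is an assumption.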

Apache Commons Math

The Apache Commons Math library provides the class DescriptiveStatistics, which accepts a collection of values and provides methods to access statistical properties, such as arithmetic mean, standard deviation, etc.

Using an attribute-accessor method reference, the following method can return an instance of DescriptiveStatistics for any of the attributes:

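  // Imports needed at the top of the file; the package name assumes Commons Math 3.x.
  // import java.util.List;
  // import java.util.function.Function;
  // import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

  public static DescriptiveStatistics statistics (List<IrisInstance> data,
      Function<IrisInstance, Double> accessor) {                  // (1)
    DescriptiveStatistics ds = new DescriptiveStatistics ();      // (2)

    for (IrisInstance instance : data) {                          // (3)
      ds.addValue(accessor.apply(instance));                      // (4)
    }

    return ds;
  }

(1) The second argument is a method reference to access an attribute.
(2) The DescriptiveStatistics instance is created.
(3) Loop through every instance in the data, and ...
(4) ... add the value obtained by applying the attribute accessor to that instance.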
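
To see the accessor mechanism in isolation, without the Commons Math dependency, the sketch below swaps in the JDK's own DoubleSummaryStatistics. This is a different class from DescriptiveStatistics, and the names JdkSummary and summary are hypothetical. It also uses ToDoubleFunction rather than Function<IrisInstance, Double>, which avoids boxing each value; the same method references still fit.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Repeated here so the snippet compiles on its own.
record IrisInstance(
    double sepalLength,
    double sepalWidth,
    double petalLength,
    double petalWidth,
    String label
  ) {}

// Hypothetical JDK-only counterpart of the statistics method.
class JdkSummary {
  static DoubleSummaryStatistics summary(List<IrisInstance> data,
      ToDoubleFunction<IrisInstance> accessor) {
    return data.stream()
        .mapToDouble(accessor)   // apply the accessor to every instance
        .summaryStatistics();    // collect count, sum, min, max and average
  }
}
```

A call such as JdkSummary.summary(data, IrisInstance::petalWidth).getAverage() gives the mean of one attribute. DoubleSummaryStatistics offers no standard deviation, which is one reason to prefer DescriptiveStatistics here.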

Printing Statistics on All Attributes

Having obtained an instance of DescriptiveStatistics for an attribute, some information can be printed, e.g. the minimum and maximum values, and the mean and standard deviation:

  public static void displayStatistics (String attributeName, DescriptiveStatistics statistics) {
    System.out.println(attributeName);
    System.out.println(" -- Minimum: " + statistics.getMin());
    System.out.println(" -- Maximum: " + statistics.getMax());
    System.out.printf(" -- Mean:    %.2f%n", statistics.getMean());
    System.out.printf(" -- Stddev:  %.2f%n", statistics.getStandardDeviation());
  }

Finally, the code to analyse each attribute in turn and display its information:

      displayStatistics ("Sepal Length", statistics(data, IrisInstance::sepalLength));
      displayStatistics ("Sepal Width", statistics(data, IrisInstance::sepalWidth));
      displayStatistics ("Petal Length", statistics(data, IrisInstance::petalLength));
      displayStatistics ("Petal Width", statistics(data, IrisInstance::petalWidth));

This produces the following output:

Sepal Length
 -- Minimum: 4.3
 -- Maximum: 7.9
 -- Mean:    5.84
 -- Stddev:  0.83
Sepal Width
 -- Minimum: 2.0
 -- Maximum: 4.4
 -- Mean:    3.05
 -- Stddev:  0.43
Petal Length
 -- Minimum: 1.0
 -- Maximum: 6.9
 -- Mean:    3.76
 -- Stddev:  1.76
Petal Width
 -- Minimum: 0.1
 -- Maximum: 2.5
 -- Mean:    1.20
 -- Stddev:  0.76
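
For readers who want to check the Mean and Stddev figures by hand: as far as the library's Javadoc indicates, getStandardDeviation() is derived from the bias-corrected sample variance, i.e. it divides by n - 1 rather than n. The sketch below spells out that calculation in plain Java. The class and method names are hypothetical, and it is not the library's implementation.

```java
// Hypothetical helper showing the arithmetic behind the mean and
// the bias-corrected (n - 1) sample standard deviation.
class SampleStats {
  static double mean(double[] values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  // Assumes at least two values; with fewer, the n - 1 divisor is zero.
  static double sampleStdDev(double[] values) {
    double m = mean(values);
    double sumOfSquares = 0.0;
    for (double v : values) {
      double deviation = v - m;
      sumOfSquares += deviation * deviation;
    }
    return Math.sqrt(sumOfSquares / (values.length - 1));
  }
}
```

Feeding one attribute's values into SampleStats.sampleStdDev as a double[] should agree with the corresponding Stddev line above to two decimal places, provided the assumption about the n - 1 divisor holds.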