Class KMeansPlusPlusClusterer<T extends Clusterable>

java.lang.Object
org.apache.commons.math3.ml.clustering.Clusterer<T>
org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer<T>
Type Parameters:
T - type of the points to cluster

public class KMeansPlusPlusClusterer<T extends Clusterable> extends Clusterer<T>
Clustering algorithm based on David Arthur and Sergei Vassilvitski k-means++ algorithm.
Since:
3.2
See Also:
  • Field Details

    • k

      private final int k
      The number of clusters.
    • maxIterations

      private final int maxIterations
      The maximum number of iterations.
    • random

      private final RandomGenerator random
      Random generator for choosing initial centers.
    • emptyStrategy

      private final KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy
      Selected strategy for empty clusters.
  • Constructor Details

    • KMeansPlusPlusClusterer

      public KMeansPlusPlusClusterer(int k)
      Build a clusterer.

      The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

      The euclidean distance will be used as default distance measure.

      Parameters:
      k - the number of clusters to split the data into
    • KMeansPlusPlusClusterer

      public KMeansPlusPlusClusterer(int k, int maxIterations)
      Build a clusterer.

      The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

      The euclidean distance will be used as default distance measure.

      Parameters:
      k - the number of clusters to split the data into
      maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
    • KMeansPlusPlusClusterer

      public KMeansPlusPlusClusterer(int k, int maxIterations, DistanceMeasure measure)
      Build a clusterer.

      The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

      Parameters:
      k - the number of clusters to split the data into
      maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
      measure - the distance measure to use
    • KMeansPlusPlusClusterer

      public KMeansPlusPlusClusterer(int k, int maxIterations, DistanceMeasure measure, RandomGenerator random)
      Build a clusterer.

      The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

      Parameters:
      k - the number of clusters to split the data into
      maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
      measure - the distance measure to use
      random - random generator to use for choosing initial centers
    • KMeansPlusPlusClusterer

      public KMeansPlusPlusClusterer(int k, int maxIterations, DistanceMeasure measure, RandomGenerator random, KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy)
      Build a clusterer.
      Parameters:
      k - the number of clusters to split the data into
      maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
      measure - the distance measure to use
      random - random generator to use for choosing initial centers
      emptyStrategy - strategy to use for handling empty clusters that may appear during algorithm iterations
  • Method Details

    • getK

      public int getK()
      Return the number of clusters this instance will use.
      Returns:
      the number of clusters
    • getMaxIterations

      public int getMaxIterations()
      Returns the maximum number of iterations this instance will use.
      Returns:
      the maximum number of iterations, or -1 if no maximum is set
    • getRandomGenerator

      public RandomGenerator getRandomGenerator()
      Returns the random generator this instance will use.
      Returns:
      the random generator
    • getEmptyClusterStrategy

      public KMeansPlusPlusClusterer.EmptyClusterStrategy getEmptyClusterStrategy()
      Returns the KMeansPlusPlusClusterer.EmptyClusterStrategy used by this instance.
      Returns:
      the KMeansPlusPlusClusterer.EmptyClusterStrategy
    • cluster

      Runs the K-means++ clustering algorithm.
      Specified by:
      cluster in class Clusterer<T extends Clusterable>
      Parameters:
      points - the points to cluster
      Returns:
      a list of clusters containing the points
      Throws:
      MathIllegalArgumentException - if the data points are null or the number of clusters is larger than the number of data points
      ConvergenceException - if an empty cluster is encountered and the emptyStrategy is set to ERROR
    • assignPointsToClusters

      private int assignPointsToClusters(List<CentroidCluster<T>> clusters, Collection<T> points, int[] assignments)
      Adds the given points to the closest Cluster.
      Parameters:
      clusters - the Clusters to add the points to
      points - the points to add to the given Clusters
      assignments - points assignments to clusters
      Returns:
      the number of points assigned to different clusters as the iteration before
    • chooseInitialCenters

      private List<CentroidCluster<T>> chooseInitialCenters(Collection<T> points)
      Use K-means++ to choose the initial centers.
      Parameters:
      points - the points to choose the initial centers from
      Returns:
      the initial centers
    • getPointFromLargestVarianceCluster

      private T getPointFromLargestVarianceCluster(Collection<CentroidCluster<T>> clusters) throws ConvergenceException
      Get a random point from the Cluster with the largest distance variance.
      Parameters:
      clusters - the Clusters to search
      Returns:
      a random point from the selected cluster
      Throws:
      ConvergenceException - if clusters are all empty
    • getPointFromLargestNumberCluster

      private T getPointFromLargestNumberCluster(Collection<? extends Cluster<T>> clusters) throws ConvergenceException
      Get a random point from the Cluster with the largest number of points
      Parameters:
      clusters - the Clusters to search
      Returns:
      a random point from the selected cluster
      Throws:
      ConvergenceException - if clusters are all empty
    • getFarthestPoint

      private T getFarthestPoint(Collection<CentroidCluster<T>> clusters) throws ConvergenceException
      Get the point farthest to its cluster center
      Parameters:
      clusters - the Clusters to search
      Returns:
      point farthest to its cluster center
      Throws:
      ConvergenceException - if clusters are all empty
    • getNearestCluster

      private int getNearestCluster(Collection<CentroidCluster<T>> clusters, T point)
      Returns the nearest Cluster to the given point
      Parameters:
      clusters - the Clusters to search
      point - the point to find the nearest Cluster for
      Returns:
      the index of the nearest Cluster to the given point
    • centroidOf

      private Clusterable centroidOf(Collection<T> points, int dimension)
      Computes the centroid for a set of points.
      Parameters:
      points - the set of points
      dimension - the point dimension
      Returns:
      the computed centroid for the set of points