hierclust {multiv} | R Documentation |
Agglomerates the closest pair of observations, and replaces them with a cluster. Agglomerates the next closest pair of observations or clusters. Continues through `n-1' agglomerations, assuming there are `n' observations.
hierclust(a, method = 1, bign = FALSE, diagnostics = FALSE, sim = " ", movie = FALSE, option = "thru", alpha = 1.0, repres = "chull", show = "change") hierclust(dst, method="compact", sim=)
a |
data matrix to be analyzed, the rows representing observations and the
columns variables; or, alternatively, lower-diagonal dissimilarity matrix in
vector or list format. In latter case, the number of values is n(n-1)/2
where n is the number of observations.
(See below for second variant of usage of this function.) First variant: |
method |
First possibility: integer taking the value between 1 and 7. These methods
correspond to:
Ward's minimum variance or error sum of squares method (1); single linkage
or nearest neighbor method (2); complete linkage or diameter (3); average
linkage, group average, or UPGMA method (4); McQuitty's or WPGMA method
(5); median, Gower's or WPGMC method (6); and centroid or UPGMC method (7).
If the movie option is used, only method s 1 to 4 are supported.
Second possibility: string compact (default; complete linkage),
average (average linkage), or connected (single linkage).
|
sim |
structure giving similarities rather than dissimilarities. This can either
be a symmetric matrix or a vector with a Size attribute. Only one of
dissimilarities or similarities may be specified (or, of course, as an
alternative, a data matrix.
|
bign |
Is n big? If storage is problemsome, a different implementation of the
Ward criterion may be tried. This determines dissimilarities on the fly, and
hence requires O(n) storage.
|
diagnostics |
show, or suppress, some information directed to the command window. Suppression is useful if functional programming capabilities are availed of, leading to lazy evaluation; this often results in multiple evaluations, with consequent repetition of diagnostic information. |
movie |
whether interactive display of construction of hierarchy is wanted. This
necessitates an open plot window. It does not produce an output object.
All remaining arguments apply only in the case of the movie=TRUE .
|
option |
(Only for movie .) "thru" (default) or "prompt": Go right through all n-1
agglomerations, or ask
the user whether to continue at each agglomeration? Here is what we
recommend: On the first occasion, go right through. Check out the cluster
criterion values reported on in the command window. Then using the 'prompt'
option, go as far as the desired partition, and print it out.
|
alpha |
(Only for movie .) 1.0 (default) or 0.0 < alpha <= 1.0.
Only used by the minimum variance agglomerative criterion. Alpha downweights
the variance component on the principal axis of the cluster, and thereby
redefines the cluster criterion such that linear clusters are preferred. A
value of alpha = 1.0 gives exactly the traditional Ward's minimum variance
method. See Murtagh and Raftery (1984) for more on this altered criterion.
|
repres |
(Only for movie .) repres = "chull" (default), "lines", "gauss":
Representation of clusters: For input data which has dimensionality greater
than 2, only "chull" is allowed (since the representations were found to be
too problematic, given that the clustering takes place in m-dimensions, and
the the representation is necessarily a 2-dimensional projection of this).
Representation "chull" = convex hull, or straight line in case
of collinear points. (We test for collinearity using principal
components analysis.)
Representation "lines" = lines of best fit (least squares fit is used,
and the extremities are defined by the intersection of the min. and max. x
values on this LS fit line). "gauss" = 1-sigma circles. Side-effect of
latter: warning messages indicate whenever circles overlap plot boundaries.
|
show |
(Only for movie .) "change" or "all":
In this method's present implementation, the clusters at each stage are
highlighted by means of a different symbol. Option show = "change" has
this highlighting changed at each agglomeration, so that only the most recent
agglomerands are shown. Option show = "all" does not turn off the
highlighting in this way.
Second variant (which re-wraps S function hclust : cf. help file on hclust ):
Exactly one of dst or sim must be specified. |
dst |
a distance structure or distance matrix. Normally this
will be the result of the function dist , but it can be any
data of the form returned by dist , or a full, symmetric
matrix. Missing values are not allowed.
|
method |
character string giving the clustering method. The three methods currently implemented are "average", "con- nected" (single linkage) and "compact" (complete linkage). (The first three characters of the method are sufficient.) |
sim= |
structure giving similarities rather than distances. This can either be a symmetric matrix or a vector with a "Size" attribute. Missing values are not allowed. |
When movie
is not specified in the first variant, this is a
list describing the hierarchical clustering of the observations.
The movie
option is considerably slower than alternative options, due
to the processing involved. Hence its use is recommended only on small
data sets (less than 100 observations).
merge |
an n-1 by 2 matrix. Row i of merge describes the merging of clusters
at step i of the clustering. If an element j in the row is negative,
then observation -j was merged at this stage. If j is positive
then the merge
was with the cluster formed at the (earlier) stage j of the algorithm. Thus
negative entries in merge indicate agglomerations of singletons, and
positive entries indicate agglomerations of non-singletons.
|
height |
a set of n-1 non-decreasing real values. The
clustering "height": that is, the value of the criterion associated with
the clustering method for the particular agglomeration.
|
order |
a vector giving the permutation of the original observations suitable for
plotting,
in the sense that a cluster plot using this ordering and matrix merge will
not have crossings of the branches.
In hierarchical cluster displays, a decision is needed at each merge to specify which subtree should go on the left and which on the right. Since, for n observations there are n-1 merges, there are 2^(n-1) possible
orderings for the leaves in a cluster tree, or dendrogram. The default
algorithm in hc is to order the subtree so that the tighter cluster is on
the left (the last, i.e. most recent, merge of the left subtree is at a lower
value than the last
merge of the right subtree). Observations are the tightest clusters possible,
and merges involving two observations place them in order by their
observation sequence number.
|
The squared Euclidean distance is used in the movie
case, and in the case
when a data matrix is input. Note that the movie
option returns
invisibly.
In the case of all methods when bign = FALSE
(default),
the Lance-Williams dissimilarity update formula is used to allow the 7 methods
to be implemented.
In the direct data matrix case,
a list of nearest neighbors is maintained, which
allows the closest pair of clusters or observations to be determined by
scanning this list to find the pair of agglomerands on each iteration.
Hence O(n)
operations are required for each of n-1
iterations, i.e.
O(n^2)
computation time in total. The
initial setting up of the n
nearest neighbors also requires O(n^2)
time.
The processing to be carried out following each agglomeration is O(n)
.
In total, all methods implemented in hc
are of time complexity O(n^2)
.
Unless bign = TRUE
(see below), storage
complexity is O(n^2)
. Typical times on a Sun SPARCstation
1 for 400x4 and 800x4 arrays: 12 and 48 secs.
Ward's minimum variance method aims at finding compact, spherical clusters.
The complete link method finds similar clusters. The single link method,
which is closely related to the minimal spanning tree, adopts a
friends of friends
clustering strategy.
The other methods can be regarded as aiming
for clusters with characteristics somewhere between the single and complete
link methods.
In the case of the Ward method, and bign = TRUE
, the nearest-neighbor
chain algorithm is used, which gives an O(n^2)
computation time algorithm. The input matrix is stored, but not the
dissimilarities produced. Computation time is marginally inferior to the
implementation of Ward's method when bign = FALSE
(default). Sample
times for a 400x4 and an 800x4
array on a Sun SPARCstation 1: 12.5 and 50 secs.
A small sample:
B. Everitt, Cluster Analysis Heinemann Educ. Books, London 1974;
P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy Freeman, San Francisco, 1973;
M.R. Anderberg, Cluster Analysis for Applications Academic Press, New York, 1973;
A.D. Gordon, Classification Chapman and Hall, London, 1981; and
F. Murtagh, Multidimensional Clustering Algorithms COMPSTAT Lectures 4, Physica-Verlag, Wuerzburg, 1985 (for algorithmic details of algorithms used).
For finding assignments of objects to clusters, use members
.
'Dist' is used for creating (dis)similarities. Functions
cutree
, and dist
are already available in R. Function members
is available in this function suite.
data(iris) iris <- as.matrix(iris[,1:4]) # Default Ward criterion hierclust(iris, movie = TRUE) h <- hierclust(iris) k <- members(h)