pca {multiv}		R Documentation

Principal Components Analysis

Description:

Finds a new coordinate system for multivariate data such that the first coordinate has maximal variance, the second coordinate has maximal variance subject to being orthogonal to the first, and so on.
Usage:

pca(a, method=3)
Arguments:

a: data matrix to be decomposed, with rows representing observations and columns representing variables. Missing values are not supported.

method: integer taking values between 1 and 8, selecting the matrix on which the singular value decomposition (SVD) is carried out (an illustrative sketch of some of these matrices follows this list):

method = 1: no transformation of the data matrix; the SVD is carried out on a sums of squares and cross-products matrix.
method = 2: the observations are centered to zero mean; the SVD is carried out on a variance-covariance matrix.
method = 3 (the default): the observations are centered to zero mean and additionally reduced to unit standard deviation, i.e. standardized; the SVD is carried out on a correlation matrix.
method = 4: the observations are normalized by being range-divided; the variance-covariance matrix is then used.
method = 5: the SVD is carried out on a Kendall (rank-order) correlation matrix.
method = 6: the SVD is carried out on a Spearman (rank-order) correlation matrix.
method = 7: the SVD is carried out on the sample covariance matrix.
method = 8: the SVD is carried out on the sample correlation matrix.
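For illustration only, the matrices decomposed under some of these settings can be approximated with standard R functions. The matrix X below is hypothetical, and the package's own scaling constants may differ from those shown.

X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)   # hypothetical data: 100 observations, 5 variables
crossprod(X)                                        # method = 1: sums of squares and cross-products
Xc <- scale(X, center = TRUE, scale = FALSE)        # centering to zero mean
crossprod(Xc) / nrow(X)                             # method = 2: variance-covariance (divisor assumed; may differ)
cor(X)                                              # method = 3: correlation matrix
cor(X, method = "kendall")                          # method = 5: Kendall rank-order correlation matrix
cor(X, method = "spearman")                         # method = 6: Spearman rank-order correlation matrix
cov(X)                                              # method = 7: sample covariance matrix (divisor n - 1)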
Value:

A list describing the principal components analysis, with the following components:

rproj: projections of the row points on the new axes.

cproj: projections of the column points on the new axes.

evals: eigenvalues associated with the new axes. These provide figures of merit for the variance explained by the new axes, and are usually quoted as a percentage of the total, or as a cumulative percentage of the total (see the short example after this list).

evecs: eigenvectors associated with the new axes. This orthogonal matrix describes the rotation: the first column is the linear combination of the columns of a defining the first principal component, and so on.
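For example, the percentage and cumulative-percentage figures of merit can be obtained directly from evals. This is an illustrative sketch using the iris data (also used in the examples below); the object name pc is arbitrary.

data(iris)
pc <- pca(as.matrix(iris[, 1:4]))          # default method = 3
pc$evals * 100 / sum(pc$evals)             # percentage of total variance per axis
cumsum(pc$evals) * 100 / sum(pc$evals)     # cumulative percentage of total variance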
Details:

When carrying out a PCA of a hierarchy object, the partition is specified by lev. The level plus the associated number of groups always equals the number of observations.

In the case of method = 3, if any column has zero standard deviation, a value of 1 is substituted for the standard deviation.
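As a sketch of the consequence of that substitution, assuming the behaviour described above: a constant column becomes all zeros after centering and is divided by the substituted value of 1, so it simply contributes nothing to the analysis.

data(iris)
X <- cbind(as.matrix(iris[, 1:4]), constant = 7)   # last column has zero standard deviation
pc <- pca(X, method = 3)                           # a standard deviation of 1 is substituted for it
pc$evals                                           # the constant column adds no variance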
Up to 7 principal axes are determined. The inherent dimensionality of either of the dual spaces is ordinarily min(n, m), where n and m are respectively the numbers of rows and columns of a. The centering transformation which is part of methods 2 and 3 introduces a linear dependency, causing the inherent dimensionality to be min(n-1, m). Hence the number of columns returned in rproj, cproj, and evecs will be the lesser of this inherent dimensionality and 7.
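As a small sketch of the resulting dimensions, assuming the iris data used in the examples below:

data(iris)
X <- as.matrix(iris[, 1:4])     # n = 150 rows, m = 4 columns
pc <- pca(X, method = 3)        # centering applies, so the inherent dimensionality is min(n - 1, m) = 4
ncol(pc$rproj)                  # the lesser of that inherent dimensionality and 7, i.e. 4
dim(pc$evecs)                   # rotation matrix with the same number of columns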
In the case of methods 1 to 4, very small negative eigenvalues, if they arise, are an artifact of the SVD algorithm used and may be treated as zero. In the case of PCA using rank-order correlations (methods 5 and 6), negative eigenvalues indicate that a Euclidean representation of the data is not possible. The approximate Euclidean representation given by the axes associated with the positive eigenvalues can often be quite adequate for practical interpretation of the data.
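A hedged sketch of how such eigenvalues might be screened in practice; the tolerance used here is arbitrary and not taken from the package.

data(iris)
X <- as.matrix(iris[, 1:4])
pc1 <- pca(X, method = 1)                       # methods 1 to 4: tiny negative values are artifacts
pc1$evals <- ifelse(pc1$evals < 0 & abs(pc1$evals) < 1e-8, 0, pc1$evals)
pc6 <- pca(X, method = 6)                       # methods 5 and 6: negative eigenvalues are meaningful
if (any(pc6$evals < 0))
  message("no exact Euclidean representation; interpret the positive-eigenvalue axes")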
The routine prcomp is identical, to within small numerical precision differences, to method = 7 here. The examples below show how to map the outputs of the present implementation (pca) onto the outputs of prcomp.
Note that a very large number of columns in the input data matrix will cause dynamic memory problems: the matrix to be diagonalized requires O(m^2) storage where m is the number of variables.
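For instance, a rough storage estimate, assuming 8-byte double-precision values for the m x m matrix to be diagonalized:

m <- 5000              # number of variables (columns)
m^2 * 8 / 2^20         # approximate storage in megabytes (about 190 MB here)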
A singular value decomposition is carried out.
Principal components analysis defines the axis which provides the best fit to both the row points and the column points. A second axis is determined which best fits the data subject to being orthogonal to the first. Third and subsequent axes are similarly found. Best fit is in the least squares sense. The criterion which optimizes the fit of the axes to the points is, by virtue of Pythagoras' theorem, simultaneously a criterion which optimizes the variance of projections on the axes.
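The equivalence of the least-squares and maximal-variance criteria can be checked numerically on centered data. The following is a sketch of that identity, not package code; the direction v is an arbitrary unit vector.

data(iris)
Xc <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
v <- c(1, 0, 0, 0)                  # any unit-length direction
proj <- Xc %*% v                    # projections of the row points onto the axis
resid <- Xc - proj %*% t(v)         # residuals, orthogonal to the axis
sum(Xc^2)                           # total sum of squares ...
sum(proj^2) + sum(resid^2)          # ... equals projected plus residual sums of squares (Pythagoras)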
Principal components analysis is often used as a data reduction technique. In the pattern recognition field, it is often termed the Karhunen-Loeve expansion, since the data matrix a may be written as a series expansion using the eigenvectors and eigenvalues found.
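A sketch of that series expansion, using base R's eigen decomposition of the covariance matrix (rather than the package internals) and assuming centered data:

data(iris)
Xc <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
e <- eigen(cov(Xc))                                  # eigenvectors and eigenvalues
scores <- Xc %*% e$vectors                           # projections on all axes
approx2 <- scores[, 1:2] %*% t(e$vectors[, 1:2])     # rank-2 term of the expansion
max(abs(Xc - scores %*% t(e$vectors)))               # the full expansion recovers Xc exactly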
References:

Many multivariate statistics and data analysis books include a discussion of principal components analysis. A few examples:

C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis, Chapman and Hall, 1980 (a good, all-round introduction).

M. Kendall, Multivariate Analysis, Griffin, 1980 (dated in relation to computing techniques, but exceptionally clear and concise in the treatment of many practical aspects).

F.H.C. Marriott, The Interpretation of Multiple Observations, Academic Press, 1974 (a short, very readable textbook).

L. Lebart, A. Morineau, and K.M. Warwick, Multivariate Descriptive Statistical Analysis, Wiley, 1984 (an excellent geometric treatment of PCA).

I.T. Jolliffe, Principal Component Analysis, Springer, 1986.
Examples:

data(iris)
iris <- as.matrix(iris[, 1:4])
pcprim <- pca(iris)

# Plot of first and second principal components
plot(pcprim$rproj[, 1], pcprim$rproj[, 2])

# Variance explained by the principal components
pcprim$evals * 100.0 / sum(pcprim$evals)

# In the implementation of the S function `prcomp', different results are
# produced.  Here is how to obtain these results, using the function `pca'.
library(mva)

# Consider the following result of `prcomp':
old <- prcomp(iris)

# With `pca', one would do the following:
new <- pca(iris, method = 7)
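The following sketch compares the two sets of outputs produced above; the agreement up to column signs is an assumption based on the equivalence stated earlier, not a documented guarantee.

# Eigenvalues from pca(method = 7) versus squared standard deviations from prcomp
cbind(sqrt(new$evals), old$sdev)
# Eigenvectors versus the rotation matrix, which may differ in the sign of each column
max(abs(abs(new$evecs) - abs(old$rotation)))
# Row projections versus prcomp scores, again up to column signs
max(abs(abs(new$rproj) - abs(old$x)))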