Class KolmogorovSmirnovTest
The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follow a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
Two-sample tests are also supported, evaluating the null hypothesis that the two samples x and y come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t | F_n(t)-F_m(t)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values. The default 2-sample test method, kolmogorovSmirnovTest(double[], double[]), works as follows:
- For small samples (where the product of the sample sizes is less than 10000), the method presented in [4] is used to compute the exact p-value for the 2-sample test.
- When the product of the sample sizes exceeds 10000, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.
If the product of the sample sizes is less than 10000 and the sample data contains ties, random jitter is added to the sample data to break ties before applying the algorithm above. Alternatively, the bootstrap(double[], double[], int, boolean) method, modeled after ks.boot in the R Matching package [3], can be used if ties are known to be present in the data.
In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \) by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean strict parameter. This parameter is ignored for large samples.
The methods used by the 2-sample default implementation are also exposed directly:
- exactP(double, int, int, boolean) computes exact 2-sample p-values
- approximateP(double, int, int) uses the asymptotic distribution
The boolean arguments in the first two methods allow the probability used to estimate the p-value to be expressed using strict or non-strict inequality. See kolmogorovSmirnovTest(double[], double[], boolean).
References:
- [1] Evaluating Kolmogorov's Distribution by George Marsaglia, Wai Wan Tsang, and Jingbo Wang
- [2] Computing the Two-Sided Kolmogorov-Smirnov Distribution by Richard Simard and Pierre L'Ecuyer
- [3] Jasjeet S. Sekhon. 2011. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7): 1-52.
- [4] Wilcox, Rand. 2012. Introduction to Robust Estimation and Hypothesis Testing, Chapter 5, 3rd Ed. Academic Press.
Note that [1] contains an error in computing h; refer to MATH-437 for details.
- Since:
- 3.3
-
Field Summary
Fields (modifier and type, field, description):
- protected static final double KS_SUM_CAUCHY_CRITERION: Convergence criterion for ksSum(double, double, int).
- protected static final int LARGE_SAMPLE_PRODUCT: When the product of sample sizes exceeds this value, 2-sample K-S test uses asymptotic distribution to compute the p-value.
- protected static final int MAXIMUM_PARTIAL_SUM_COUNT: Bound on the number of partial sums in ksSum(double, double, int).
- protected static final int MONTE_CARLO_ITERATIONS: Deprecated.
- protected static final double PG_SUM_RELATIVE_ERROR: Convergence criterion for the sums in pelzGood(double, int).
- private final RandomGenerator rng: Random data generator used by monteCarloP(double, int, int, boolean, int).
- protected static final int SMALL_SAMPLE_PRODUCT: Deprecated.
-
Constructor Summary
Constructors (constructor, description):
- KolmogorovSmirnovTest(): Construct a KolmogorovSmirnovTest instance with a default random data generator.
- KolmogorovSmirnovTest(RandomGenerator rng): Deprecated.
-
Method Summary
Methods (modifier and type, method, description):
- double approximateP(double d, int n, int m): Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- double bootstrap(double[] x, double[] y, int iterations): Computes bootstrap(x, y, iterations, true).
- double bootstrap(double[] x, double[] y, int iterations, boolean strict): Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
- private static int c(int i, int j, int m, int n, long cmn, boolean strict): The function C(i, j) defined in [4] (class javadoc), formula (5.5).
- private static long calculateIntegralD(double d, int n, int m, boolean strict): Given a d-statistic in the range [0, 1] and the two sample sizes n and m, an integral d-statistic in the range [0, n*m] is calculated, that can be used for comparison with other integral d-statistics.
- double cdf(double d, int n): Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
- double cdf(double d, int n, boolean exact): Calculates P(D_n < d) using method described in [1] with quick decisions for extreme values given in [2] (see above).
- double cdfExact(double d, int n): Calculates P(D_n < d).
- private void checkArray(double[] array): Verifies that array has length at least 2.
- private FieldMatrix<BigFraction> createExactH(double d, int n): Creates H of size m x m as described in [1] (see above).
- private RealMatrix createRoundedH(double d, int n): Creates H of size m x m as described in [1] (see above) using double-precision.
- private double exactK(double d, int n): Calculates the exact value of P(D_n < d) using the method described in [1] (reference in class javadoc above) and BigFraction (see above).
- double exactP(double d, int n, int m, boolean strict): Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- (package private) static void fillBooleanArrayRandomlyWithFixedNumberTrueValues(boolean[] b, int numberOfTrueValues, RandomGenerator rng): Fills a boolean array randomly with a fixed number of true values.
- private static void fixTies(double[] x, double[] y): If there are no ties in the combined dataset formed from x and y, this method is a no-op.
- private static boolean hasTies(double[] x, double[] y): Returns true iff there are ties in the combined sample formed from x and y.
- private long integralKolmogorovSmirnovStatistic(double[] x, double[] y): Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
- private double integralMonteCarloP(long d, int n, int m, int iterations): Uses Monte Carlo simulation to approximate \(P(D_{n,m} \ge d/(n m))\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- private static void jitter(double[] data, RealDistribution dist): Adds random jitter to data using deviates sampled from dist.
- double kolmogorovSmirnovStatistic(double[] x, double[] y): Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
- double kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data): Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
- double kolmogorovSmirnovTest(double[] x, double[] y): Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
- double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict): Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
- double kolmogorovSmirnovTest(RealDistribution distribution, double[] data): Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- double kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact): Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- boolean kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha): Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- double ksSum(double t, double tolerance, int maxIterations): Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed.
- double monteCarloP(double d, int n, int m, boolean strict, int iterations): Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- private static double n(int i, int j, int m, int n, long cnm, boolean strict): The function N(i, j) defined in [4] (class javadoc).
- double pelzGood(double d, int n): Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
- private double roundedK(double d, int n): Calculates P(D_n < d) using method described in [1] and doubles (see above).
-
Field Details
-
MAXIMUM_PARTIAL_SUM_COUNT
protected static final int MAXIMUM_PARTIAL_SUM_COUNT
Bound on the number of partial sums in ksSum(double, double, int).
-
KS_SUM_CAUCHY_CRITERION
protected static final double KS_SUM_CAUCHY_CRITERION
Convergence criterion for ksSum(double, double, int).
-
PG_SUM_RELATIVE_ERROR
protected static final double PG_SUM_RELATIVE_ERROR
Convergence criterion for the sums in pelzGood(double, int).
-
SMALL_SAMPLE_PRODUCT
Deprecated. No longer used.
-
LARGE_SAMPLE_PRODUCT
protected static final int LARGE_SAMPLE_PRODUCT
When the product of sample sizes exceeds this value, 2-sample K-S test uses asymptotic distribution to compute the p-value.
-
MONTE_CARLO_ITERATIONS
Deprecated. Default number of iterations used by monteCarloP(double, int, int, boolean, int). Deprecated as of version 3.6, as this method is no longer needed.
-
rng
Random data generator used by monteCarloP(double, int, int, boolean, int).
-
-
Constructor Details
-
KolmogorovSmirnovTest
public KolmogorovSmirnovTest()
Construct a KolmogorovSmirnovTest instance with a default random data generator.
-
KolmogorovSmirnovTest
Deprecated. Construct a KolmogorovSmirnovTest with the provided random data generator. The monteCarloP(double, int, int, boolean, int) method that uses the generator supplied to this constructor is deprecated as of version 3.6.
- Parameters:
rng - random data generator used by monteCarloP(double, int, int, boolean, int)
-
-
Method Details
-
kolmogorovSmirnovTest
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. If exact is true, the distribution used to compute the p-value is computed using extended precision. See cdfExact(double, int).
- Parameters:
distribution - reference distribution
data - sample being evaluated
exact - whether or not to force exact computation of the p-value
- Returns:
- the p-value associated with the null hypothesis that data is a sample from distribution
- Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
-
kolmogorovSmirnovStatistic
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
- Parameters:
distribution - reference distribution
data - sample being evaluated
- Returns:
- Kolmogorov-Smirnov statistic \(D_n\)
- Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
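The definition of \(D_n\) can be illustrated with a minimal standalone sketch. It is not the library's implementation: the class name, the method names and the uniform reference cdf are hypothetical, and a plain function stands in for RealDistribution. It assumes a continuous reference cdf, so the supremum is reached just before or at one of the jumps of \(F_n\).

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

final class OneSampleKsSketch {

    /** Hypothetical reference cdf: the uniform distribution on [0, 1]. */
    static double uniformCdf(double x) {
        return Math.max(0.0, Math.min(1.0, x));
    }

    /** Returns D_n = sup_x |F_n(x) - F(x)| for a continuous reference cdf F. */
    static double ksStatistic(DoubleUnaryOperator cdf, double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0.0;
        for (int i = 1; i <= n; i++) {
            double f = cdf.applyAsDouble(sorted[i - 1]);
            // F_n jumps from (i - 1) / n to i / n at the i-th order statistic,
            // so both sides of the jump are compared with F.
            double gapBefore = f - (i - 1) / (double) n;
            double gapAfter = i / (double) n - f;
            d = Math.max(d, Math.max(gapBefore, gapAfter));
        }
        return d;
    }

    public static void main(String[] args) {
        double[] data = {0.1, 0.4, 0.7};
        System.out.println(ksStatistic(OneSampleKsSketch::uniformCdf, data));
    }
}
```

For the three points in main, the largest gap is at 0.7, where \(F_n\) reaches 1 while \(F\) is 0.7, so the statistic is about 0.3.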
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly selected partition of the combined sample into subsamples of sizes x.length and y.length will strictly exceed (if strict is true) or be at least as large as (if strict is false) kolmogorovSmirnovStatistic(x, y).
- For small samples (where the product of the sample sizes is less than 10000), the exact p-value is computed using the method presented in [4], implemented in exactP(double, int, int, boolean).
- When the product of the sample sizes exceeds 10000, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.
If x.length * y.length < 10000 and the combined set of values in x and y contains ties, random jitter is added to x and y to break ties before computing \(D_{n,m}\) and the p-value. The jitter is uniformly distributed on (-minDelta / 2, minDelta / 2) where minDelta is the smallest pairwise difference between values in the combined sample.
If ties are known to be present in the data, bootstrap(double[], double[], int, boolean) may be used as an alternative method for estimating the p-value.
- Parameters:
x - first sample dataset
y - second sample dataset
strict - whether or not the probability to compute is expressed as a strict inequality (ignored for large samples)
- Returns:
- p-value associated with the null hypothesis that x and y represent samples from the same distribution
- Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(double[] x, double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. See kolmogorovSmirnovTest(double[], double[], boolean).
- Parameters:
x - first sample dataset
y - second sample dataset
- Returns:
- p-value associated with the null hypothesis that x and y represent samples from the same distribution
- Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
-
kolmogorovSmirnovStatistic
public double kolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
- Parameters:
x - first sample
y - second sample
- Returns:
- test statistic \(D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
- Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
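One way to evaluate \(D_{n,m}\) is a merge-style walk over the two sorted samples, tracking the integer quantity \(|i m - j n| = n m |F_n(t) - F_m(t)|\). The following is a sketch under that assumption, with hypothetical class and method names; it is not taken from the library source.

```java
import java.util.Arrays;

final class TwoSampleKsSketch {

    /** Returns D_{n,m} = sup_t |F_n(t) - F_m(t)| for samples x (length n) and y (length m). */
    static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;        // number of x values <= t
        int j = 0;        // number of y values <= t
        long maxDev = 0;  // running maximum of |i*m - j*n|
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            // Consume every value equal to t in both samples before comparing,
            // so tied values are handled as a single jump.
            while (i < n && sx[i] == t) {
                i++;
            }
            while (j < m && sy[j] == t) {
                j++;
            }
            maxDev = Math.max(maxDev, Math.abs((long) i * m - (long) j * n));
        }
        // Once one sample is exhausted the gap can only shrink, so the loop may stop.
        return maxDev / ((double) n * m);
    }

    public static void main(String[] args) {
        System.out.println(ksStatistic(new double[] {1, 3}, new double[] {2, 4}));
    }
}
```

Working with the integer numerator avoids rounding when two statistics are compared, which is the same motivation given for the integral statistic \(n m D_{n,m}\).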
-
integralKolmogorovSmirnovStatistic
private long integralKolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values. Finally, \(n m D_{n,m}\) is returned as a long value.
- Parameters:
x - first sample
y - second sample
- Returns:
- test statistic \(n m D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
- Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
-
kolmogorovSmirnovTest
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- Parameters:
distribution - reference distribution
data - sample being evaluated
- Returns:
- the p-value associated with the null hypothesis that data is a sample from distribution
- Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
-
kolmogorovSmirnovTest
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- Parameters:
distribution - reference distribution
data - sample being evaluated
alpha - significance level of the test
- Returns:
- true iff the null hypothesis that data is a sample from distribution can be rejected with confidence 1 - alpha
- Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
-
bootstrap
public double bootstrap(double[] x, double[] y, int iterations, boolean strict)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. When strict is true, this is equivalent to the algorithm implemented in the R function ks.boot, described in Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.' Journal of Statistical Software, 42(7): 1-52.
- Parameters:
x - first sample
y - second sample
iterations - number of bootstrap resampling iterations
strict - whether or not the null hypothesis is expressed as a strict inequality
- Returns:
- estimated p-value
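The resampling idea described above can be sketched in standalone form as follows. This is an illustration only: the class name, the explicit seed parameter and the use of java.util.Random are assumptions, and the statistic helper is a simple tie-safe version written for this example rather than the library's code.

```java
import java.util.Arrays;
import java.util.Random;

final class KsBootstrapSketch {

    /** Tie-safe two-sample statistic D_{n,m}, computed from integer counts. */
    static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        long maxDev = 0;
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            while (i < n && sx[i] == t) {
                i++;
            }
            while (j < m && sy[j] == t) {
                j++;
            }
            maxDev = Math.max(maxDev, Math.abs((long) i * m - (long) j * n));
        }
        return maxDev / ((double) n * m);
    }

    /** Estimates the p-value by resampling with replacement from the combined sample. */
    static double bootstrap(double[] x, double[] y, int iterations, boolean strict, long seed) {
        double[] combined = new double[x.length + y.length];
        System.arraycopy(x, 0, combined, 0, x.length);
        System.arraycopy(y, 0, combined, x.length, y.length);
        double observed = ksStatistic(x, y);
        Random rng = new Random(seed);
        double[] bx = new double[x.length];
        double[] by = new double[y.length];
        int hits = 0;
        for (int it = 0; it < iterations; it++) {
            for (int k = 0; k < bx.length; k++) {
                bx[k] = combined[rng.nextInt(combined.length)];
            }
            for (int k = 0; k < by.length; k++) {
                by[k] = combined[rng.nextInt(combined.length)];
            }
            double d = ksStatistic(bx, by);
            // strict counts only resampled statistics that exceed the observed one.
            if (strict ? d > observed : d >= observed) {
                hits++;
            }
        }
        return hits / (double) iterations;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 2, 3, 5};
        double[] y = {2, 4, 4, 6, 7};
        System.out.println(bootstrap(x, y, 1000, true, 42L));
    }
}
```

Because resampling is done with replacement, ties in the resampled sets are expected, which is why the statistic helper consumes all equal values before comparing the two empirical distributions.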
-
bootstrap
public double bootstrap(double[] x, double[] y, int iterations)
Computes bootstrap(x, y, iterations, true). This is equivalent to ks.boot(x,y, nboots=iterations) using the R Matching package function. See bootstrap(double[], double[], int, boolean).
- Parameters:
x - first sample
y - second sample
iterations - number of bootstrap resampling iterations
- Returns:
- estimated p-value
-
cdf
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above). Unlike the result of cdfExact(double, int), the result is not exact, because calculations are based on double rather than BigFraction.
- Parameters:
d - statistic
n - sample size
- Returns:
- \(P(D_n < d)\)
- Throws:
MathArithmeticException - if algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
-
cdfExact
Calculates P(D_n < d). The result is exact in the sense that BigFraction/BigReal is used everywhere at the expense of very slow execution time. Almost never choose this in real applications unless you are very sure; this is almost solely for verification purposes. Normally, you would choose cdf(double, int). See the class javadoc for definitions and algorithm description.
- Parameters:
d - statistic
n - sample size
- Returns:
- \(P(D_n < d)\)
- Throws:
MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
-
cdf
Calculates P(D_n < d) using method described in [1] with quick decisions for extreme values given in [2] (see above).
- Parameters:
d - statistic
n - sample size
exact - whether the probability should be calculated exactly using BigFraction everywhere at the expense of very slow execution time, or if double should be used in convenient places to gain speed. Almost never choose true in real applications unless you are very sure; true is almost solely for verification purposes.
- Returns:
- \(P(D_n < d)\)
- Throws:
MathArithmeticException - if algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\).
-
exactK
Calculates the exact value of P(D_n < d) using the method described in [1] (reference in class javadoc above) and BigFraction (see above).
- Parameters:
d - statistic
n - sample size
- Returns:
- the two-sided probability of \(P(D_n < d)\)
- Throws:
MathArithmeticException - if algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\).
-
roundedK
private double roundedK(double d, int n)
Calculates P(D_n < d) using method described in [1] and doubles (see above).
- Parameters:
d - statistic
n - sample size
- Returns:
- \(P(D_n < d)\)
-
pelzGood
public double pelzGood(double d, int n)
Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
- Parameters:
d - value of d-statistic (x in [2])
n - sample size
- Returns:
- \(P(D_n < d)\)
- Since:
- 3.4
-
createExactH
private FieldMatrix<BigFraction> createExactH(double d, int n) throws NumberIsTooLargeException, FractionConversionException
Creates H of size m x m as described in [1] (see above).
- Parameters:
d - statistic
n - sample size
- Returns:
- H matrix
- Throws:
NumberIsTooLargeException - if fractional part is greater than 1
FractionConversionException - if algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\).
-
createRoundedH
Creates H of size m x m as described in [1] (see above) using double-precision.
- Parameters:
d - statistic
n - sample size
- Returns:
- H matrix
- Throws:
NumberIsTooLargeException - if fractional part is greater than 1
-
checkArray
private void checkArray(double[] array)
Verifies that array has length at least 2.
- Parameters:
array - array to test
- Throws:
NullArgumentException - if array is null
InsufficientDataException - if array is too short
-
ksSum
public double ksSum(double t, double tolerance, int maxIterations)
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. If the sum does not converge before maxIterations iterations, a TooManyIterationsException is thrown.
- Parameters:
t - argument
tolerance - Cauchy criterion for partial sums
maxIterations - maximum number of partial sums to compute
- Returns:
- Kolmogorov sum evaluated at t
- Throws:
TooManyIterationsException - if the series does not converge
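The stopping rule can be sketched in a few lines of standalone Java. The class name is hypothetical, IllegalStateException stands in for TooManyIterationsException, and returning 0 for t = 0 is an assumption made here because the series does not converge at that point.

```java
final class KsSumSketch {

    /** Evaluates 1 + 2 * sum_{i>=1} (-1)^i exp(-2 i^2 t^2) with a Cauchy-style stopping rule. */
    static double ksSum(double t, double tolerance, int maxIterations) {
        if (t == 0.0) {
            return 0.0; // assumption: the alternating series has no limit at t = 0
        }
        double x = -2.0 * t * t;
        int sign = -1;
        long i = 1;
        double partialSum = 0.5; // half of the leading 1, doubled at the end
        double delta = 1.0;      // magnitude of the most recent term
        while (delta > tolerance && i < maxIterations) {
            delta = Math.exp(x * i * i);
            partialSum += sign * delta;
            sign = -sign;
            i++;
        }
        if (i == maxIterations) {
            // stand-in for TooManyIterationsException
            throw new IllegalStateException("no convergence within " + maxIterations + " partial sums");
        }
        return 2.0 * partialSum;
    }

    public static void main(String[] args) {
        System.out.println(ksSum(1.0, 1e-20, 100000));
    }
}
```

Successive partial sums differ by exactly the latest term, so comparing that term with tolerance is equivalent to the Cauchy criterion described above. For t = 1 the terms fall below 1e-20 after five iterations.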
-
calculateIntegralD
private static long calculateIntegralD(double d, int n, int m, boolean strict)
Given a d-statistic in the range [0, 1] and the two sample sizes n and m, an integral d-statistic in the range [0, n*m] is calculated, that can be used for comparison with other integral d-statistics. Depending on whether strict is true or not, the returned value divided by (n*m) is greater than (resp. greater than or equal to) the given d value (allowing some tolerance).
- Parameters:
d - a d-statistic in the range [0, 1]
n - first sample size
m - second sample size
strict - whether the returned value divided by (n*m) is allowed to be equal to d
- Returns:
- the integral d-statistic in the range [0, n*m]
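A possible way to realize this conversion is shown below. It is a sketch only: the tolerance of 1e-12, the class name and the rounding strategy are assumptions chosen to satisfy the contract described above, not a statement of how the private method is written.

```java
final class IntegralDSketch {

    /** Assumed tolerance for deciding whether d * n * m is an integer. */
    private static final double TOLERANCE = 1e-12;

    /** Smallest integer k with k / (n*m) greater than d (strict) or at least d (non-strict). */
    static long calculateIntegralD(double d, int n, int m, boolean strict) {
        long nm = (long) n * m;
        long upperBound = (long) Math.ceil((d - TOLERANCE) * nm);
        long lowerBound = (long) Math.floor((d + TOLERANCE) * nm);
        // When both bounds agree, d * n * m is (numerically) an integer, and a
        // strict comparison has to move to the next attainable value.
        if (strict && lowerBound == upperBound) {
            return upperBound + 1L;
        }
        return upperBound;
    }

    public static void main(String[] args) {
        System.out.println(calculateIntegralD(0.5, 2, 2, false));
        System.out.println(calculateIntegralD(0.5, 2, 2, true));
    }
}
```

With n = m = 2 and d = 0.5, the non-strict result is 2 (2/4 equals d) and the strict result is 3 (3/4 exceeds d). For d = 0.6 both modes return 3, because no attainable value equals d.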
-
exactP
public double exactP(double d, int n, int m, boolean strict)
Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] (class javadoc).
- Parameters:
d - D-statistic value
n - first sample size
m - second sample size
strict - whether or not the probability to compute is expressed as a strict inequality
- Returns:
- probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than (resp. greater than or equal to) d
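The exact probability can be understood as a lattice-path count: each partition corresponds to a path from (0, 0) to (n, m), and the statistic stays below the threshold only if \(|i m - j n|\) stays small at every point of the path. The sketch below follows that idea with a simple row-by-row recurrence. It is an illustration with hypothetical names that takes the integral statistic k = d n m directly, and it is not the library's unwound recursion.

```java
final class ExactPSketch {

    /**
     * Returns P(D > k/(n*m)) if strict, otherwise P(D >= k/(n*m)),
     * where k is the integral statistic n*m*d.
     */
    static double exactP(long k, int n, int m, boolean strict) {
        // paths[j] holds the number of admissible paths from (0,0) to (i, j)
        // for the current row i; it is a double because counts can overflow a long.
        double[] paths = new double[m + 1];
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                long dev = Math.abs((long) i * m - (long) j * n);
                // A point is admissible when it does not reach the tail event.
                boolean admissible = strict ? dev <= k : dev < k;
                if (!admissible) {
                    paths[j] = 0.0;
                } else if (i == 0 && j == 0) {
                    paths[j] = 1.0;
                } else {
                    double fromSameRow = j > 0 ? paths[j - 1] : 0.0; // (i, j-1)
                    double fromPrevRow = i > 0 ? paths[j] : 0.0;     // (i-1, j)
                    paths[j] = fromSameRow + fromPrevRow;
                }
            }
        }
        // Total number of paths is the binomial coefficient C(n+m, n).
        double total = 1.0;
        for (int t = 1; t <= n; t++) {
            total = total * (m + t) / t;
        }
        return 1.0 - paths[m] / total;
    }

    public static void main(String[] args) {
        System.out.println(exactP(4, 2, 2, false)); // P(D >= 1) for n = m = 2
        System.out.println(exactP(4, 2, 2, true));  // P(D > 1) for n = m = 2
    }
}
```

For n = m = 2 there are six partitions, two with D = 1 and four with D = 0.5. The sketch therefore gives 1/3 for the non-strict probability at d = 1 and 0 for the strict one, a difference equal to the mass of the distribution at d.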
-
approximateP
public double approximateP(double d, int n, int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See ksSum(double, double, int) for details on how convergence of the sum is determined. This implementation passes ksSum 1.0E-20 as tolerance and 100000 as maxIterations.
- Parameters:
d - D-statistic value
n - first sample size
m - second sample size
- Returns:
- approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
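As a worked instance of the formula (the sample sizes and statistic are illustrative values chosen here, not taken from the library documentation), take \(n = m = 100\) and \(d = 0.2\):

\[
t = d\sqrt{\frac{mn}{m+n}} = 0.2\sqrt{\frac{10000}{200}} = 0.2\sqrt{50} = \sqrt{2},
\qquad
k(t) = 1 + 2\sum_{i=1}^\infty (-1)^i e^{-4 i^2} \approx 1 - 2\left(e^{-4} - e^{-16}\right) \approx 0.963369
\]

so the approximate p-value is \(1 - k(t) \approx 0.036631\). The series converges quickly: the third term, \(e^{-36}\), is already far below double precision relative to the sum.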
-
fillBooleanArrayRandomlyWithFixedNumberTrueValues
static void fillBooleanArrayRandomlyWithFixedNumberTrueValues(boolean[] b, int numberOfTrueValues, RandomGenerator rng)
Fills a boolean array randomly with a fixed number of true values. The method uses a simplified version of the Fisher-Yates shuffle algorithm. By processing first the true values followed by the remaining false values, fewer random numbers need to be generated. The method is optimized for the case that the number of true values is larger than or equal to the number of false values.
- Parameters:
b - boolean array
numberOfTrueValues - number of true values the boolean array should finally have
rng - random data generator
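For comparison, the same outcome can be obtained with a plain, unoptimized Fisher-Yates shuffle. The sketch below deliberately uses the full shuffle rather than the simplified variant mentioned above, and it uses java.util.Random with hypothetical class and method names.

```java
import java.util.Random;

final class FixedTrueCountSketch {

    /** Fills b with exactly numberOfTrueValues true entries in uniformly random positions. */
    static void fillRandomly(boolean[] b, int numberOfTrueValues, Random rng) {
        // Start from a deterministic layout: all true values first.
        for (int i = 0; i < b.length; i++) {
            b[i] = i < numberOfTrueValues;
        }
        // Full Fisher-Yates shuffle: position i receives a uniformly chosen
        // element from the not-yet-fixed prefix 0..i.
        for (int i = b.length - 1; i > 0; i--) {
            int r = rng.nextInt(i + 1);
            boolean tmp = b[i];
            b[i] = b[r];
            b[r] = tmp;
        }
    }

    /** Counts the true entries, used here only to check the invariant. */
    static int countTrue(boolean[] b) {
        int count = 0;
        for (boolean value : b) {
            if (value) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        boolean[] b = new boolean[10];
        fillRandomly(b, 4, new Random(1L));
        System.out.println(countTrue(b));
    }
}
```

The full shuffle draws one random number per array position, which is the cost the simplified version is described as reducing.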
-
monteCarloP
public double monteCarloP(double d, int n, int m, boolean strict, int iterations)
Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The simulation generates iterations random partitions of m + n into an n set and an m set, computing \(D_{n,m}\) for each partition and returning the proportion of values that are greater than d, or greater than or equal to d if strict is false.
- Parameters:
d - D-statistic value
n - first sample size
m - second sample size
strict - whether or not the probability to compute is expressed as a strict inequality
iterations - number of random partitions to generate
- Returns:
- proportion of randomly generated m-n partitions of m + n that result in \(D_{n,m}\) greater than (resp. greater than or equal to) d
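The simulation can be sketched as follows. Each random partition is represented as a shuffled list of labels, and the statistic is read off by walking the labels in order. The class name, the seed parameter and the use of Collections.shuffle are choices made for this illustration, not the library's approach.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class MonteCarloPSketch {

    /** Proportion of random n/m partitions whose D exceeds d (strict) or is at least d. */
    static double monteCarloP(double d, int n, int m, boolean strict, int iterations, long seed) {
        // true marks a position assigned to the n set, false to the m set.
        List<Boolean> labels = new ArrayList<>();
        for (int k = 0; k < n + m; k++) {
            labels.add(k < n);
        }
        Random rng = new Random(seed);
        double threshold = d * n * m; // compare on the scale of |i*m - j*n|
        int tail = 0;
        for (int it = 0; it < iterations; it++) {
            Collections.shuffle(labels, rng);
            long i = 0;
            long j = 0;
            long maxDev = 0;
            for (boolean inFirstSet : labels) {
                if (inFirstSet) {
                    i++;
                } else {
                    j++;
                }
                maxDev = Math.max(maxDev, Math.abs(i * m - j * n));
            }
            if (strict ? maxDev > threshold : maxDev >= threshold) {
                tail++;
            }
        }
        return tail / (double) iterations;
    }

    public static void main(String[] args) {
        System.out.println(monteCarloP(0.3, 20, 30, true, 10000, 42L));
    }
}
```

The result is an estimate that varies with the seed and the number of iterations, which is one reason an exact or asymptotic method may be preferable when available.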
-
integralMonteCarloP
private double integralMonteCarloP(long d, int n, int m, int iterations)
Uses Monte Carlo simulation to approximate \(P(D_{n,m} \ge d/(n m))\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
Here d is the D-statistic represented as a long value. The real D-statistic is obtained by dividing d by n*m. See also monteCarloP(double, int, int, boolean, int).
- Parameters:
d - integral D-statistic
n - first sample size
m - second sample size
iterations - number of random partitions to generate
- Returns:
- proportion of randomly generated m-n partitions of m + n that result in \(D_{n,m}\) greater than or equal to d/(n*m)
-
fixTies
private static void fixTies(double[] x, double[] y)
If there are no ties in the combined dataset formed from x and y, this method is a no-op. If there are ties, a uniform random deviate in (-minDelta / 2, minDelta / 2) - {0} is added to each value in x and y, where minDelta is the minimum difference between unequal values in the combined sample. A fixed seed is used to generate the jitter, so repeated invocations with the same input arrays result in the same values. NOTE: if there are ties in the data, this method overwrites the data in x and y with the jittered values.
- Parameters:
x - first sample
y - second sample
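The tie-breaking step can be sketched as below. This is an illustration under stated assumptions: java.util.Random with an explicit seed replaces the library's generator, the fallback minDelta of 1.0 when all values are equal is a choice made here, and the retry loop restarts from the original values so the total jitter never exceeds minDelta / 2.

```java
import java.util.Arrays;
import java.util.Random;

final class TieJitterSketch {

    /** Returns true if the combined sample formed from x and y contains equal values. */
    static boolean hasTies(double[] x, double[] y) {
        double[] all = combine(x, y);
        for (int k = 1; k < all.length; k++) {
            if (all[k] == all[k - 1]) {
                return true;
            }
        }
        return false;
    }

    /** Breaks ties in place by adding uniform jitter of magnitude at most minDelta / 2. */
    static void fixTies(double[] x, double[] y, long seed) {
        if (!hasTies(x, y)) {
            return; // no-op when all values are distinct
        }
        double[] all = combine(x, y);
        double minDelta = Double.POSITIVE_INFINITY;
        for (int k = 1; k < all.length; k++) {
            double delta = all[k] - all[k - 1];
            if (delta > 0 && delta < minDelta) {
                minDelta = delta;
            }
        }
        if (Double.isInfinite(minDelta)) {
            minDelta = 1.0; // assumption: every value is identical
        }
        double[] x0 = x.clone();
        double[] y0 = y.clone();
        Random rng = new Random(seed); // fixed seed gives repeatable jitter
        do {
            for (int k = 0; k < x.length; k++) {
                x[k] = x0[k] + (rng.nextDouble() - 0.5) * minDelta;
            }
            for (int k = 0; k < y.length; k++) {
                y[k] = y0[k] + (rng.nextDouble() - 0.5) * minDelta;
            }
        } while (hasTies(x, y)); // retry in the unlikely event that new ties appear
    }

    /** Returns the sorted concatenation of x and y. */
    private static double[] combine(double[] x, double[] y) {
        double[] all = new double[x.length + y.length];
        System.arraycopy(x, 0, all, 0, x.length);
        System.arraycopy(y, 0, all, x.length, y.length);
        Arrays.sort(all);
        return all;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 2.0, 3.0};
        double[] y = {2.0, 4.0};
        fixTies(x, y, 42L);
        System.out.println(Arrays.toString(x) + " " + Arrays.toString(y));
    }
}
```

Because the jitter is bounded by half the smallest gap, values that were distinct before the call keep their relative order.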
-
hasTies
private static boolean hasTies(double[] x, double[] y)
Returns true iff there are ties in the combined sample formed from x and y.
- Parameters:
x - first sample
y - second sample
- Returns:
- true if x and y together contain ties
-
jitter
Adds random jitter to data using deviates sampled from dist.
Note that jitter is applied in-place, i.e., the array values are overwritten with the result of applying jitter.
- Parameters:
data - input/output data array, entries overwritten by the method
dist - probability distribution to sample for jitter values
- Throws:
NullPointerException - if either of the parameters is null
-
c
private static int c(int i, int j, int m, int n, long cmn, boolean strict)
The function C(i, j) defined in [4] (class javadoc), formula (5.5), defined to return 1 if |i/n - j/m| <= c; 0 otherwise. Here c is scaled up and recoded as a long to avoid rounding errors in comparison tests, so what is actually tested is |im - jn| <= cmn.
- Parameters:
i - first path parameter
j - second path parameter
m - first sample size
n - second sample size
cmn - integral D-statistic (see calculateIntegralD(double, int, int, boolean))
strict - whether or not the null hypothesis uses strict inequality
- Returns:
- C(i,j) for given m, n, c
-
n
private static double n(int i, int j, int m, int n, long cnm, boolean strict)
The function N(i, j) defined in [4] (class javadoc). Returns the number of paths over the lattice {(i,j) : 0 <= i <= n, 0 <= j <= m} from (0,0) to (i,j) satisfying C(h,k, m, n, c) = 1 for each (h,k) on the path. The return value is integral, but subject to overflow, so it is maintained and returned as a double.
- Parameters:
i - first path parameter
j - second path parameter
m - first sample size
n - second sample size
cnm - integral D-statistic (see calculateIntegralD(double, int, int, boolean))
strict - whether or not the null hypothesis uses strict inequality
- Returns:
- number of paths to (i, j) from (0,0) representing D-values as large as c for given m, n
-