Using Data Compressors to Construct Rank Tests

We propose nonparametric rank tests for homogeneity and component independence that are based on data compressors. For homogeneity testing, the idea is to pool the two samples, order the pooled elements, and write 0 for each element from the first sample and 1 for each element from the second, breaking ties by randomization; the resulting binary string is then compressed (the extension to multiple samples is straightforward). $H_0$ is rejected if the string can be compressed to a certain degree and accepted otherwise. We show that such a test obtained from an ideal data compressor is valid against all alternatives. Testing component independence is reduced to homogeneity testing by constructing two samples: one is the first half of the original sample, and the other is the second half with one of the components randomly permuted.
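As an illustration of the homogeneity test, the following minimal sketch builds the 0/1 string from two real-valued samples and compresses it with zlib. Several choices here are assumptions rather than part of the abstract: zlib stands in for the ideal compressor, the function names (`rank_string`, `log2_binom`, `compression_test`) are hypothetical, and the rejection threshold comes from a standard counting argument. Under $H_0$ all $\binom{N}{n}$ label strings are equally likely, and a lossless compressor can map fewer than $2^{L+1}$ of them to codes of at most $L$ bits, so rejecting when the code length is at most $\log_2 \binom{N}{n} - \log_2(1/\alpha) - 1$ keeps the level below $\alpha$.

```python
import math
import random
import zlib


def rank_string(x, y, rng):
    """Order the pooled sample and emit b'0' for x-elements, b'1' for y-elements.

    Ties are broken by an independent random key, as the abstract prescribes.
    """
    tagged = [(v, rng.random(), 0) for v in x] + [(v, rng.random(), 1) for v in y]
    tagged.sort(key=lambda t: (t[0], t[1]))
    return bytes(48 + label for _, _, label in tagged)  # 48 == ord('0')


def log2_binom(total, k):
    """log2 of the binomial coefficient C(total, k), computed via lgamma."""
    ln = math.lgamma(total + 1) - math.lgamma(k + 1) - math.lgamma(total - k + 1)
    return ln / math.log(2)


def compression_test(x, y, alpha=0.05, seed=0):
    """Return (reject, code_bits, threshold) for H0: both samples share one law.

    Assumption: zlib is used as a practical stand-in for an ideal compressor,
    and the threshold is the counting bound described in the lead-in.
    """
    rng = random.Random(seed)
    z = rank_string(x, y, rng)
    code_bits = 8 * len(zlib.compress(z, 9))
    threshold = log2_binom(len(z), len(x)) - math.log2(1.0 / alpha) - 1.0
    return code_bits <= threshold, code_bits, threshold


if __name__ == "__main__":
    # Two clearly separated samples: the label string is 000...0111...1.
    first = [float(i) for i in range(1000)]
    second = [float(i) for i in range(1000, 2000)]
    print(compression_test(first, second))
```

Because zlib carries fixed header overhead and is far from an ideal compressor, this sketch is conservative: it only rejects when the label string is highly regular, and it needs fairly large samples before the threshold can be reached at all.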
