Fundamental Limits of Lossless Data Compression With Side Information

The problem of lossless data compression with side information available to both the encoder and the decoder is considered. The finite-blocklength fundamental limits of the best achievable performance are defined in two versions of the problem: reference-based compression, in which a single side information string is used repeatedly to compress different source messages, and pair-based compression, in which a different side information string is used for each source message. General achievability and converse theorems are established for arbitrary pairs of source and side information. For memoryless sources, nonasymptotic normal approximation expansions are proved for the optimal rate in both the reference-based and pair-based settings. These are stated in terms of explicit finite-blocklength bounds that are tight up to third-order terms. Extensions that go significantly beyond the class of memoryless sources are also obtained. The relevant source dispersion is identified and its relationship with the conditional varentropy rate is established. Interestingly, the dispersion differs between reference-based and pair-based compression, and it is proved that the reference-based dispersion is in general smaller.
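
To make the comparison of dispersions concrete, the following is a minimal numerical sketch for a memoryless pair, not the paper's own construction. It assumes, for illustration only, that the pair-based dispersion is the conditional varentropy Var(-log P(X|Y)) and that the reference-based dispersion is the average over side information symbols of the per-symbol variance, E[Var(-log P(X|Y) | Y)]. Under these assumptions the ordering follows from the law of total variance. The function names `dispersion_terms` and `normal_approx_rate` are hypothetical, and the rate approximation keeps only the first two terms of a standard normal approximation.

```python
import math
from statistics import NormalDist


def dispersion_terms(joint):
    """Return (H, var_pair, var_ref) in bits for a joint pmf {(x, y): P(x, y)}.

    H        : conditional entropy H(X|Y)
    var_pair : Var(-log2 P(X|Y)), assumed here to be the pair-based dispersion
    var_ref  : E[Var(-log2 P(X|Y) | Y)], assumed here to be the reference-based one
    """
    # Marginal distribution of the side information Y.
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p

    # First and second moments of the conditional information density.
    h = 0.0
    second = 0.0
    for (_, y), p in joint.items():
        if p > 0:
            info = -math.log2(p / p_y[y])
            h += p * info
            second += p * info * info
    var_pair = second - h * h

    # Average over y of the variance of the information density given Y = y.
    var_ref = 0.0
    for y0, py in p_y.items():
        if py == 0:
            continue
        mean = 0.0
        sq = 0.0
        for (_, y), p in joint.items():
            if y == y0 and p > 0:
                q = p / py
                info = -math.log2(q)
                mean += q * info
                sq += q * info * info
        var_ref += py * (sq - mean * mean)

    return h, var_pair, var_ref


def normal_approx_rate(h, variance, n, eps):
    """Two-term normal approximation: h + sqrt(variance / n) * Qinv(eps)."""
    q_inv = NormalDist().inv_cdf(1.0 - eps)  # inverse Gaussian tail function
    return h + math.sqrt(variance / n) * q_inv


# Toy pair: Y is a fair bit; X is a fair bit when Y = 0 and constant when Y = 1.
joint = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.5}
h, var_pair, var_ref = dispersion_terms(joint)
rate_pair = normal_approx_rate(h, var_pair, n=1000, eps=0.01)
rate_ref = normal_approx_rate(h, var_ref, n=1000, eps=0.01)
```

In this toy pair the information density is constant given each side information symbol, so the averaged per-symbol variance is zero while the conditional varentropy is not; the gap between the two is Var(E[-log P(X|Y) | Y]).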
