Exact and approximate Boolean matrix decomposition with column-use condition

An arbitrary $$m\times n$$ Boolean matrix M can be decomposed exactly as $$M = U\circ V$$, where U (resp. V) is an $$m\times k$$ (resp. $$k\times n$$) Boolean matrix and $$\circ$$ denotes Boolean matrix multiplication. The minimum such k is called the Boolean rank of M, and finding it is known to be NP-hard. With the interpretability issue in data mining applications in mind, we impose the column-use condition, which requires that the columns of U form a subset of the columns of the given M, and employ commonly used heuristics to find as small a k as possible.

To this end, we first derive an exact closed-form formula, $$J=\overline{\overline{M}^\mathrm{T}\circ M}$$, such that $$M = M\circ J^\mathrm{T}$$ holds, where J is maximal in the sense that if any 0 element of J is changed to a 1, this equality no longer holds. We measure the performance (in minimizing k) of our algorithms on several real benchmark datasets. The results demonstrate that, on all but one of these datasets, one of our proposed algorithms performs as well as or better than other representative heuristic algorithms, which do not impose the column-use condition and thus should in theory be able to find a smaller k.

Boolean matrix decomposition with the column-use condition has wide applications. In educational databases, for example, the “ideal item response matrix” R, the “knowledge state matrix” A, and the “Q-matrix” Q play important roles. As they are related exactly by $$\overline{R}=\overline{A}\circ Q^\mathrm{T}$$, given R we can use our heuristics to find A and Q with a small number (k) of interpretable “knowledge states.”
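
The closed-form formula for J can be checked by hand on a small matrix. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names (`bool_matmul`, `complement`, `transpose`, `maximal_J`, `is_maximal`) and the 3×3 example matrix are assumptions chosen for illustration. It computes J as the complement of the Boolean product of the transposed complement of M with M, confirms that M equals the Boolean product of M with the transpose of J, and tests maximality by flipping each 0 entry of J in turn.

```python
def bool_matmul(A, B):
    """Boolean product: (A o B)[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    inner, cols = len(B), len(B[0])
    return [[int(any(A[i][k] and B[k][j] for k in range(inner)))
             for j in range(cols)]
            for i in range(len(A))]


def complement(A):
    """Element-wise Boolean complement of a 0/1 matrix."""
    return [[1 - x for x in row] for row in A]


def transpose(A):
    """Matrix transpose for a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]


def maximal_J(M):
    """J = complement( complement(M)^T o M ), an n x n matrix for m x n M."""
    return complement(bool_matmul(transpose(complement(M)), M))


def is_maximal(M, J):
    """True if changing any single 0 of J to 1 breaks M == M o J^T."""
    for i, row in enumerate(J):
        for j, value in enumerate(row):
            if value == 0:
                flipped = [r[:] for r in J]  # copy J, then flip one entry
                flipped[i][j] = 1
                if bool_matmul(M, transpose(flipped)) == M:
                    return False
    return True


# Hypothetical example matrix (not taken from the paper's datasets).
M = [[1, 1, 0],
     [0, 1, 0],
     [1, 1, 1]]

J = maximal_J(M)
print("J =", J)
print("M == M o J^T:", bool_matmul(M, transpose(J)) == M)
print("J is maximal:", is_maximal(M, J))
```

Read entry-wise, J has a 1 in position (i, j) exactly when column j of M is contained in column i of M, which is why the product of M with the transpose of J reproduces M.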
