Biclustering

from Wikipedia, the free encyclopedia

Biclustering , co-clustering or two- mode clustering is a data mining technique that enables the simultaneous clustering of rows and columns of a matrix . The term was introduced by Mirkin, but the technique itself was originally introduced much earlier (e.g. by JA Hartigan under the term Direct Clustering ).

definition

Let be a set of data in the form of a rectangular matrix composed of columns and rows. Each column stands for a sample (data sample) and each row for a feature (component of a sample). A series of features is thus assigned to each sample. The element represents the expression of the -th feature in the -th sample.

Furthermore, let the classes be a classification of the samples and the classes a classification of the features. A submatrix of , which is represented as a pair with , is called a bicluster .

When biclustering one created partition from consisting of Biclustern. Some biclustering methods allow biclusters that contain sample and feature classes with different indices (these are pairs with and ).

complexity

The complexity of the biclustering problem depends on the exact formulation of the problem, in particular on the quality function that is used to evaluate the quality of the given bicluster. However, the most interesting variants of this problem are NP-complete and require either a high computational effort or the use of a lossy heuristic to shorten the computation.

Types of biclusters

Different biclustering algorithms have different definitions for the bicluster:

  1. Bicluster with constant values ​​(a),
  2. Bicluster with constant row values ​​(b) or column values ​​(c),
  3. Bicluster with coherent values ​​(d, e).
a) Bicluster with constant values
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
b) Bicluster with constant line values
1.0 1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0 4.0
4.0 4.0 4.0 4.0 4.0
c) Bicluster with constant column values
1.0 2.0 3.0 4.0 5.0
1.0 2.0 3.0 4.0 5.0
1.0 2.0 3.0 4.0 5.0
1.0 2.0 3.0 4.0 5.0
1.0 2.0 3.0 4.0 5.0
d) Bicluster with coherent values ​​(additive)
1.0 4.0 5.0 0.0 1.5
4.0 7.0 8.0 3.0 4.5
3.0 6.0 7.0 2.0 3.5
5.0 8.0 9.0 4.0 5.5
2.0 5.0 6.0 1.0 2.5
e) Bicluster with coherent values ​​(multiplicative)
1.0 0.5 2.0 0.2 0.8
2.0 1.0 4.0 0.4 1.6
3.0 1.5 6.0 0.6 2.4
4.0 2.0 8.0 0.8 3.2
5.0 2.5 10.0 1.0 4.0

The relationship between these cluster models and other types of clustering, such as correlation clustering, is discussed in.

Algorithms

Numerous biclustering algorithms have been developed for bioinformatics , including: Block Clustering, CTWC (Coupled Two-Way Clustering), ITWC (Interrelated Two-Way Clustering), δ-Bicluster, δ-pCluster, δ-Pattern, FLOC, OPC, Plaid Model, OPSMs (Order-preserving Submatrices), Gibbs, SAMBA (Statistical-Algorithmic Method for Bicluster Analysis), Robust Biclustering Algorithm (RoBA), Crossing Minimization, cMonkey, PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (Bi-Correlation Clustering Algorithm) and FABIA (Factor Analysis for Bicluster Acquisition). Biclustering algorithms are also proposed and used in other application areas. There they can be found under the names co-clustering, bidimensional clustering and subspace clustering.

Biclustering has been shown to be relevant for recognizing local patterns in time series data . Therefore, the latest approaches to time series data of gene expressions have addressed the biclustering problem, with which interesting biclusters can be restricted to biclusters with adjacent columns. This limitation creates a problem that can be solved in polynomial time. It also enables the development of efficient, exhaustive enumeration algorithms such as CCC-Biclustering and e -CCC-Biclustering. Some of the latest algorithms have attempted to include additional support for biclustering rectangular matrices with other data types , including cMonkey.

There is an ongoing debate about how to evaluate the results of these methods, as biclustering allows clusters to overlap and some algorithms allow the exclusion of difficult-to-negotiate columns / conditions. Not all of the available algorithms are deterministic. The analyst must pay attention to the extent of those results that represent stable minima . Since the problem relates to unsupervised learning , it is difficult to detect errors in the results due to the lack of a gold standard . One approach is to apply multiple biclustering algorithms, the majority or qualified majority of which determine the best result. Another possibility is a quality analysis of the shifting and scaling patterns of biclusters. Biclustering is also used in the field of text mining (or classification) under the term co-clustering. Text corpora are represented in vector form as a matrix D, the lines of which stand for the documents and the columns for the words of a dictionary. The matrix elements D ij designate the occurrence of the word j in the document i. Co-clustering algorithms are then used to identify blocks in D that correspond to a group of documents (rows) identified by a group of words (columns).

Several approaches have been proposed based on the informational content of the blocks obtained: matrix-based approaches such as SVD and BVD as well as graph-based. Information theoretic algorithms iteratively assign a cluster of documents to each row and a cluster of words to each column, so that mutual information is maximized. Matrix-based methods focus on the decomposition of matrices into blocks in order to minimize the decomposition error between the original matrix and the newly formed matrices. Graph-based methods tend to minimize the overlap between the clusters. Given two groups of documents d 1 and d 2 , the number of overlaps can be measured from the number of words occurring in documents of groups d 1 and d 2 .

Bisson and Hussain proposed a new approach to using similarity between words and similarity between documents to co-cluster a matrix. Their method χ-Sim (Cross Similarity) is based on finding a similarity between documents and words and the subsequent use of classic clustering methods, such as hierarchical clustering .

See also

literature

  • Amos Tanay, Roded Sharan, Ron Shamir: Biclustering Algorithms: A Survey . In: Srinivas Aluru (Ed.): Handbook of Computational Molecular Biology . Chapman, 2004.
  • Yuval Kluger, Ronen Basri, Joseph T. Chang, Mark Gerstein: Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions . In: Genome Research . 13, No. 4, 2003, pp. 703-716. doi : 10.1101 / gr.648603 . PMID 12671006 . PMC 430175 (free full text).

Individual evidence

  1. ^ Iven Van Mechelen, Hans-Hermann Bock, Paul De Boeck: Two-mode clustering methods: a structured overview . In: Statistical Methods in Medical Research . 13, No. 5, 2004, pp. 363-94. doi : 10.1191 / 0962280204sm373ra . PMID 15516031 .
  2. a b Boris Mirkin: Mathematical Classification and Clustering . Kluwer Academic Publishers, 1996, ISBN 0-7923-4159-7 .
  3. JA Hartigan: Direct clustering of a data matrix . In: American Statistical Association (Ed.): Journal of the American Statistical Association . 67, No. 337, 1972, pp. 123-9. doi : 10.2307 / 2284710 .
  4. a b c d e Stanislav Busygin, Oleg Prokopyev, Panos M. Pardalos: Biclustering in data mining . In: Computers & Operations Research . 35, No. 9, 2008, pp. 2964-2987. doi : 10.1016 / j.cor.2007.01.005 .
  5. a b c Antonio Mucherino, Sonia Cafieri: A New Heuristic for Feature Selection by Consistent biclustering . In: Cornell University . 2010. arxiv : 1003.3279v1 .
  6. a b c Sara C. Madeira, Arlindo L. Oliveira: Biclustering Algorithms for Biological Data Analysis: A Survey . In: IEEE Transactions on Computational Biology and Bioinformatics . 1, No. 1, 2004, pp. 24-45. doi : 10.1109 / TCBB.2004.2 . PMID 17048406 .
  7. Hans-Peter Kriegel, Peer Kröger, Arthur Zimek: Clustering High Dimensional Data: A Survey on Subspace Clustering, Pattern-based Clustering, and Correlation Clustering . In: ACM Transactions on Knowledge Discovery from Data (TKDD) . 3, No. 1, March 2009, pp. 1-58. doi : 10.1145 / 1497577.1497578 .
  8. Amos Tanay, Roded Sharan, Martin Kupiec, Ron Shamir: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data . In: Proc Natl Acad Sci USA . 101, No. 9, 2004, pp. 2981-2986. doi : 10.1073 / pnas.0308661100 . PMID 14973197 . PMC 365731 (free full text).
  9. Ahsan Abdullah, Amir Hussain: A new biclustering technique based on crossing minimization . In: Neurocomputing . 69, No. 16-18, 2006, pp. 1882-1896. doi : 10.1016 / j.neucom.2006.02.018 .
  10. David J. Reiss, Nitin S. Baliga, Richard Bonneau: Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks . In: BMC Bioinformatics . 2, 2006, pp. 280-302. doi : 10.1186 / 1471-2105-7-280 . PMID 16749936 . PMC 1502140 (free full text).
  11. Sepp Hochreiter, Ulrich Bodenhofer, Martin Heusel, Andreas Mayr, Andreas Mitterecker, Adetayo Kasim, Tatsiana Khamiakova, Suzy Van Sanden, Dan Lin, Willem Talloen, Luc Bijnens, Hinrich WH Gohlmann, Ziv Shkedy, Djork-Arné Clevert: FABIA: factor analysis for bicluster acquisition . In: Bioinformatics . 26, No. 12, 2010, pp. 1520-1527. doi : 10.1093 / bioinformatics / btq227 . PMID 20418340 . PMC 2881408 (free full text).
  12. ^ Sara C. Madeira, Miguel C. Teixeira, Isabel Sá-Correia, Arlindo L. Oliveira: Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm . In: IEEE Transactions on Computational Biology and Bioinformatics . 1, No. 7, 2010, pp. 153-165. doi : 10.1109 / TCBB.2008.34 .
  13. ^ Sara C. Madeira, Arlindo L. Oliveira: A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series . In: Algorithms for Molecular Biology . 4, No. 8, 2009.
  14. Jesús S. Aguilar-Ruiz: Shifting and scaling patterns from gene expression data . In: Bioinformatics . 21, No. 10, 2005, pp. 3840-3845. doi : 10.1093 / bioinformatics / bti641 . PMID 16144809 .
  15. ^ A b Gilles Bisson, Fawad Hussain: Chi-Sim: A new similarity measure for the co-clustering task . In: ICMLA . 2008, pp. 211-217. doi : 10.1109 / ICMLA.2008.103 .