A Clustering Algorithm Based on an Ensemble of Dissimilarities: An Application in the Bioinformatics Domain

  1. Manuel Martín Merino 1
  2. Alfonso José López Rivero 1
  3. Vidal Alonso 1
  4. Marcelo Vallejo 1
  5. Antonio Ferreras 1
  1. Universidad Pontificia de Salamanca, Salamanca, Spain (ROR: https://ror.org/02jj93564)

Journal: IJIMAI

ISSN: 1989-1660

Year of publication: 2022

Issue title: Special Issue on New Trends in Disruptive Technologies, Tech Ethics and Artificial Intelligence

Volume: 7

Issue: 6

Pages: 6-13

Type: Article

DOI: 10.9781/IJIMAI.2022.09.007

Abstract

Clustering algorithms such as k-means depend heavily on the choice of a distance metric that accurately reflects the object proximities. A wide range of dissimilarities can be defined, and they often lead to different clustering results. Choosing the best dissimilarity is an ill-posed problem, and learning a general distance from the data is a complex task, particularly for high-dimensional problems. An appealing alternative, therefore, is to learn an ensemble of dissimilarities. In this paper, we develop a semi-supervised clustering algorithm that learns a linear combination of dissimilarities from incomplete knowledge expressed as pairwise constraints. The loss function is minimized by a robust and efficient quadratic optimization algorithm, and a regularization term controls the complexity of the learned distance metric, avoiding overfitting. The algorithm has been applied to the identification of tumor samples from gene expression profiles, a domain in which experts often provide incomplete knowledge in the form of pairwise constraints. We report that the proposed algorithm outperforms a standard semi-supervised clustering technique from the literature as well as clustering based on any single dissimilarity, with the improvement being particularly relevant in applications with high levels of noise.
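To make the approach concrete, below is a minimal sketch in Python/NumPy of the kind of procedure the abstract describes: given several precomputed dissimilarity matrices over the same samples and two sets of pairwise constraints (must-link and cannot-link), it learns non-negative combination weights by minimizing a constraint-based loss with an L2 regularization term, then clusters with the combined dissimilarity. This is not the authors' implementation: the function names, the specific linear loss, the projected-gradient solver (standing in for the paper's quadratic optimization algorithm), and the average-linkage clustering step are all illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def learn_weights(dissims, must_link, cannot_link, lam=1.0, lr=1e-3, n_iter=1000):
    """Learn non-negative weights for a linear combination of dissimilarities.

    Hypothetical sketch: minimizes the sum of combined distances over
    must-link pairs minus the sum over cannot-link pairs, plus an L2
    penalty lam * ||w||^2 that controls the complexity of the learned
    metric, using projected gradient descent in place of the paper's
    quadratic optimizer.
    """
    m = len(dissims)
    w = np.full(m, 1.0 / m)
    # c[k] = (sum of D_k over must-link pairs) - (sum over cannot-link pairs)
    c = np.array([
        sum(D[i, j] for i, j in must_link) - sum(D[i, j] for i, j in cannot_link)
        for D in dissims
    ])
    for _ in range(n_iter):
        grad = c + 2.0 * lam * w              # gradient of the regularized loss
        w = np.maximum(w - lr * grad, 0.0)    # keep weights non-negative
        total = w.sum()
        # Rescale to sum to one (a simple surrogate for a simplex projection).
        w = w / total if total > 0 else np.full(m, 1.0 / m)
    return w

def ensemble_clustering(dissims, must_link, cannot_link, n_clusters, lam=1.0):
    """Cluster samples using the learned combination of dissimilarities."""
    w = learn_weights(dissims, must_link, cannot_link, lam=lam)
    D = sum(wk * Dk for wk, Dk in zip(w, dissims))  # combined dissimilarity matrix
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust"), w
```

In a gene-expression setting, `dissims` could hold, for example, Euclidean, correlation-based, and cosine dissimilarity matrices computed from the same expression profiles, while the constraint sets would come from the domain expert's partial labeling of sample pairs.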
