Efficient mining for structurally diverse subgraph patterns in large molecular databases

Cited by: 7
Authors
Maunz, Andreas [1]
Helma, Christoph [2]
Kramer, Stefan [3]
Affiliations
[1] Univ Freiburg, Machine Learning Lab, D-79110 Freiburg, Germany
[2] In Silico Toxicol, CH-4054 Basel, Switzerland
[3] Inst Informat I12, D-85748 Garching, Germany
Keywords
Correlated graph mining; Backbone; Dynamic upper bound pruning; Structural diversity
DOI
10.1007/s10994-010-5187-6
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a new approach to large-scale graph mining based on so-called backbone refinement classes. The method efficiently mines tree-shaped subgraph descriptors under minimum frequency and significance constraints, using classes of fragments to reduce feature set size and running times. The classes are defined in terms of fragments sharing a common backbone. The method is able to optimize structural inter-feature entropy as opposed to purely occurrence-based criteria, which are characteristic of open or closed fragment mining. We first give an intuitive explanation of why backbone refinement class features lead to a set of relevant features that are suitable for classification, in particular in the area of structure-activity relationships (SARs). We then show that backbone refinement classes yield a high compression in the search space of rooted perfect binary trees. We conduct several experiments to evaluate our theoretical insights in practice. A visualization suggests low co-occurrence and high entropy of backbone refinement class features. In comparison with a class of patterns sampled from the maximal patterns previously introduced by Al Hasan et al., we find a favorable tradeoff between the structural similarity and the resources needed to compute the descriptors. Cross-validation shows that classification accuracy is similar to that of the complete set of trees but significantly better than that of open trees, while feature set size is reduced by more than 90% and more than 30% compared to complete tree mining and open tree mining, respectively. Furthermore, compared to open or closed pattern mining, a large part of the search space can be pruned due to an improved statistical constraint (dynamic upper bound adjustment). This is confirmed experimentally by running times reduced by more than 60% compared to ordinary (static) upper bound pruning.
The application of our method to the largest datasets that have been used in correlated graph mining so far indicates robustness against the minimum frequency parameter, and a cross-validation run on this data confirms that the novel descriptors render large training sets feasible, which previously might have been intractable. A C++ implementation of the mining algorithm is available at http://www.maunz.de/libfminer-doc. Animated figures, links to datasets, and further resources are available at http://www.maunz.de/mlj-res.
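The abstract contrasts its dynamic upper bound adjustment with ordinary (static) upper bound pruning but does not spell out either. As a rough illustration of the static baseline only, the following Python sketch uses the convexity-based chi-square bound commonly attributed to Morishita and Sese. It is not the authors' dynamic adjustment and not their C++ code. The function names and the parameters `x`, `y`, `n`, `m` and `threshold` are assumptions made for this example.

```python
def chi_square(x, y, n, m):
    """Chi-square value of the 2x2 table for a pattern that occurs in x of
    n graphs, y of them among the m active graphs (hypothetical notation)."""
    if not (0 < x < n and 0 < m < n):
        return 0.0  # degenerate table: no association can be measured
    return n * (y * n - x * m) ** 2 / (x * (n - x) * m * (n - m))


def static_upper_bound(x, y, n, m):
    """Upper bound on chi-square over all refinements of a pattern.

    A refinement can only lose occurrences, so its counts lie between two
    extremes: it keeps only the active occurrences (y, y) or only the
    inactive ones (x - y, 0). Chi-square is convex in these counts, so the
    larger of the two extreme values bounds every refinement.
    """
    return max(chi_square(y, y, n, m), chi_square(x - y, 0, n, m))


def can_prune(x, y, n, m, threshold):
    """True if no refinement of the pattern can reach the threshold."""
    return static_upper_bound(x, y, n, m) < threshold


if __name__ == "__main__":
    # Toy numbers: 20 graphs, 8 active; the pattern occurs in 10, 7 active.
    print(chi_square(10, 7, 20, 8))
    print(static_upper_bound(10, 7, 20, 8))
    print(can_prune(10, 7, 20, 8, threshold=3.84))
```

In a depth-first miner, a check like `can_prune` would be applied before a pattern is refined further. The bound here depends only on the current pattern's counts, which is what makes it static.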
Pages: 193-218
Page count: 26