Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning

Cited by: 1
Authors
Lee, Kyubin [1, 2]
Hyung, Daejin [1]
Cho, Soo Young [3]
Yu, Namhee [1]
Hong, Sewha [1]
Kim, Jihyun [1, 4]
Kim, Sunshin [1]
Han, Ji-Youn [1]
Park, Charny [1, 5]
Affiliations
[1] Natl Canc Ctr, Res Inst, 232 Ilsan Ro, Goyang Si 10408, Gyeonggi Do, South Korea
[2] Univ Virginia, Ctr Publ Hlth Genom, Sch Med, Charlottesville, VA 22908 USA
[3] Hanyang Univ, Dept Mol & Life Sci, 55 Hanyangdaehak Ro, Ansan 15588, Gyeonggi Do, South Korea
[4] Dept Precis Med, Natl Inst Hlth, Korea Dis Control & Prevent Agcy, Osong Hlth Technol Adm Complex, 187 Osongsaengmyeong 2 Ro, Cheongju 28159, Chungcheongbug, South Korea
[5] 323 Ilsan Ro, Goyang Si 10408, Gyeonggi Do, South Korea
Funding
National Research Foundation, Singapore
Keywords
Machine learning; Alternative splicing; Tumor transcriptome; Database; Gene signature; ESRP1; GENES; ATLAS
DOI
10.1016/j.csbj.2023.02.052
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular biology]
Discipline classification code
071010; 081704
Abstract
Alternative splicing (AS) events modulate specific pathways and phenotypic plasticity in cancer. Although previous studies have analyzed splicing events computationally, identifying the biological functions driven by reliable AS events among the vast number of candidates remains a challenge. To provide essential splicing event signatures for assessing pathway regulation, we developed a database from two data sources: (i) the published literature and (ii) cancer transcriptome profiles. The former comprises knowledge-based splicing signatures for 202 pathways, extracted from 63,229 PubMed abstracts using natural language processing. The latter comprises machine learning-based splicing signatures identified from pan-cancer transcriptomes for 16 cancer types and 42 pathways. We built six different learning models to classify pathway activities, using splicing profiles as the learning dataset. The AS events ranked highest by model feature importance became the signature for each pathway. To validate the learning results, we evaluated them using (i) performance metrics, (ii) differential AS sets obtained from external datasets, and (iii) our knowledge-based signatures. The area under the receiver operating characteristic curve (AUROC) values did not differ drastically among the learning models. However, the random-forest model clearly performed best when compared against the AS sets identified from external datasets and against our knowledge-based signatures. Therefore, we used the signatures obtained from the random-forest model. Our database provides the clinical characteristics of the AS signatures, including survival tests, molecular subtypes, and the tumor microenvironment. Regulation by splicing factors was also investigated. The database supports retrieval and visualization of the developed signatures. © 2023 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
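The signature-selection and validation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each learning model has already produced a per-event feature-importance score, takes the top-ranked AS events as that model's signature, and scores the signature by the fraction of its events found in a reference set (for example, a knowledge-based signature). The function names, the cutoff k, the model names, the event identifiers, and all numbers are hypothetical placeholders.

```python
from typing import Dict, List, Set


def top_k_signature(importances: Dict[str, float], k: int) -> List[str]:
    """Return the k AS events with the highest importance (ties broken by name)."""
    ranked = sorted(importances.items(), key=lambda kv: (-kv[1], kv[0]))
    return [event for event, _ in ranked[:k]]


def overlap_fraction(signature: List[str], reference: Set[str]) -> float:
    """Fraction of signature events that also appear in the reference set."""
    if not signature:
        return 0.0
    hits = sum(1 for event in signature if event in reference)
    return hits / len(signature)


# Hypothetical feature importances for one pathway from two models (toy values).
model_importances = {
    "model_a": {"event_1": 0.40, "event_2": 0.30, "event_3": 0.20, "event_4": 0.10},
    "model_b": {"event_4": 0.50, "event_5": 0.30, "event_1": 0.15, "event_2": 0.05},
}

# Hypothetical reference set, e.g. a knowledge-based signature for the same pathway.
reference_events = {"event_1", "event_2", "event_3", "event_7"}

# Score each model by how well its top-3 signature agrees with the reference set.
scores = {
    name: overlap_fraction(top_k_signature(importances, 3), reference_events)
    for name, importances in model_importances.items()
}
best_model = max(scores, key=scores.get)
```

In this toy setup the comparison plays the role that the abstract assigns to the external AS sets and knowledge-based signatures: models with similar classification metrics can still be separated by how well their top-ranked events agree with independent evidence.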
Pages: 1978-1988
Page count: 11