MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning

Cited by: 32

Authors:

Alkhalifah, Tariq [1]

Wang, Hanchen [1]

Ovcharenko, Oleg [1]

Affiliations:

[1] King Abdullah Univ Sci & Technol, Phys Sci & Engn, Mail Box 1280, Thuwal 23955-6900, Saudi Arabia

Source:

ARTIFICIAL INTELLIGENCE IN GEOSCIENCES | 2022 / Vol. 3

Keywords:

Neural networks; Induced seismicity; Image processing; Computational seismology; Waveform inversion; INVERSION;

DOI:

10.1016/j.aiig.2022.09.002

Chinese Library Classification (CLC) number:

P [Astronomy, Earth Sciences];

Discipline classification code:

07 ;

Abstract:

Among the biggest challenges we face in utilizing neural networks trained on waveform (i.e., seismic, electromagnetic, or ultrasound) data is their application to real data. The requirement for accurate labels often forces us to train our networks using synthetic data, where labels are readily available. However, synthetic data often fail to capture the reality of the field/real experiment, and we end up with poor performance of the trained neural networks (NNs) at the inference stage. This is because synthetic data lack many of the realistic features embedded in real data, including an accurate waveform source signature, realistic noise, and accurate reflectivity. In other words, the real data set is far from being a sample from the distribution of the synthetic training set. Thus, we describe a novel approach to enhance our supervised neural network (NN) training on synthetic data with real data features (domain adaptation). Specifically, for tasks in which the absolute values of the vertical axis (time or depth) of the input section are not crucial to the prediction, like classification, or can be corrected after the prediction, like velocity model building using a well, we suggest a series of linear operations on the network's input data so that the training and application data have similar distributions. This is accomplished by applying two operations on the input data to the NN, whether the input is from the synthetic or real data subset domain: (1) The crosscorrelation of the input data section (i.e., shot gather, seismic image, etc.) with a fixed-location reference trace from the input data section. (2) The convolution of the resulting data with the mean (or a random sample) of the autocorrelated sections from the other subset domain.
In the training stage, the input data are from the synthetic subset domain and the autocorrelated sections (we crosscorrelate each trace with itself) are from the real subset domain, and the random selection of sections from the real data is implemented at every epoch of the training. In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections is from the synthetic data subset domain. Example applications on passive seismic data for microseismic event source location determination and on active seismic data for predicting low frequencies are used to demonstrate the power of this approach in improving the applicability of our trained NNs to real data.
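
The two linear operations described in the abstract can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation. It assumes that each section is a 2D array of shape (traces, time samples), that both operations act trace by trace along the time axis, and that zero-padding to four times the trace length is acceptable to avoid wrap-around. The function names `mean_autocorrelation_spectrum` and `mlreal_transform` are hypothetical.

```python
import numpy as np


def mean_autocorrelation_spectrum(sections, nfft):
    """Spectrum of the trace-wise autocorrelation, averaged over sections.

    sections: array of shape (n_sections, n_traces, n_time) from the
    OTHER domain (real data when training, synthetic data at inference).
    """
    spec = np.fft.rfft(sections, n=nfft, axis=-1)
    # Autocorrelation in time is the power spectrum |Y(f)|^2 in frequency.
    return np.mean(np.abs(spec) ** 2, axis=0)


def mlreal_transform(section, other_domain_sections, ref_index=0):
    """Apply the two operations from the abstract to one input section.

    (1) crosscorrelate every trace with a fixed-location reference trace;
    (2) convolve the result with the mean autocorrelation of the sections
        from the other domain (assumed here to be trace by trace).
    Returns an array of shape (n_traces, 4 * n_time), zero lag at the centre.
    """
    nt = section.shape[-1]
    nfft = 4 * nt  # long enough for correlation followed by convolution

    spec = np.fft.rfft(section, n=nfft, axis=-1)
    # Step 1: crosscorrelation with the reference trace, X(f) * conj(R(f)).
    xcorr_spec = spec * np.conj(spec[ref_index])

    # Step 2: convolution is a product of spectra in the frequency domain.
    auto_spec = mean_autocorrelation_spectrum(other_domain_sections, nfft)
    out = np.fft.irfft(xcorr_spec * auto_spec, n=nfft, axis=-1)

    # Move zero lag from index 0 to the centre of the time axis.
    return np.fft.fftshift(out, axes=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    synthetic = rng.standard_normal((3, 8))        # one synthetic section
    real_sections = rng.standard_normal((2, 3, 8))  # two real sections
    transformed = mlreal_transform(synthetic, real_sections, ref_index=1)
    print(transformed.shape)
```

Under these assumptions, training would call `mlreal_transform` with a synthetic section and real sections as `other_domain_sections` (passing a single randomly chosen real section per epoch reproduces the random-sample variant), while inference would swap the roles and pass the synthetic sections as `other_domain_sections`.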


Pages: 101-114

Page count: 14
