Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

Cited by: 5
Authors
Chen, Yanzhe [1, 2]
Zhong, Huasong [3]
He, Xiangteng [1, 2]
Peng, Yuxin [1, 2]
Cheng, Lele [3]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China
[2] Peking Univ, Natl Key Lab Multimedia Informat Proc, Beijing, Peoples R China
[3] Kuaishou Technol, Beijing, Peoples R China
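Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation; National Key Research and Development Program of China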
Keywords
Large-scale data collection; E-commerce datasets; Cross-domain retrieval
DOI
10.1145/3581783.3612408
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In e-commerce, products and micro-videos serve as two primary carriers. Introducing cross-domain retrieval between these carriers can establish associations, thereby leading to the advancement of specific scenarios, such as retrieving products based on micro-videos or recommending relevant videos based on products. However, existing datasets only focus on retrieval within the product domain while neglecting the micro-video domain and often ignore the multimodal characteristics of the product domain. Additionally, these datasets strictly limit their data scale through content alignment and use a content-based data organization format that hinders the inclusion of user retrieval intentions. To address these limitations, we propose the PKU Real20M dataset, a large-scale e-commerce dataset designed for cross-domain retrieval. We adopt a query-driven approach to efficiently gather over 20 million e-commerce products and micro-videos, including multimodal information. Additionally, we design a three-level entity prompt learning framework to align inter-modality information from coarse to fine. Moreover, we introduce the Query-driven Cross-Domain retrieval framework (QCD), which leverages user queries to facilitate efficient alignment between the product and micro-video domains. Extensive experiments on two downstream tasks validate the effectiveness of our proposed approaches. The dataset and source code are available at https://github.com/PKU-ICST-MIPL/Real20M_ACMMM2023.
Pages: 4939-4948
Number of pages: 10