Multilingual Image Corpus - Towards a Multimodal and Multilingual Dataset

Authors
Koeva, Svetla [1]
Stoyanova, Ivelina [1]
Kralev, Jordan [1, 2]
Affiliations
[1] Bulgarian Acad Sci, Inst Bulgarian Language, Sofia, Bulgaria
[2] Tech Univ, Sofia, Bulgaria
Funding
EU Horizon 2020
Keywords
multilingual image corpus; multilingual dataset; multimodal dataset
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
One of the processing tasks for large multimodal data streams is automatic image description (image classification, object segmentation and classification). Although the number and diversity of image datasets are constantly expanding, there is still a great demand for datasets that cover a wider variety of domains and object classes. The goal of the Multilingual Image Corpus (MIC21) project is to provide a large image dataset with annotated objects and object descriptions in 25 languages. The Multilingual Image Corpus relies on an Ontology of Visual Objects (based on WordNet) and comprises a collection of thematically related images whose objects are annotated with segmentation masks and labels linked to the ontology classes. The dataset is designed for image classification, object detection and semantic segmentation. The main contributions of our work are: a) the provision of a large collection of high-quality images licensed for commercial and non-commercial use; b) the compilation of the Ontology of Visual Objects based on WordNet noun hierarchies; c) the automatic object segmentation within the images followed by precise manual editing and the annotation of object classes; and d) the mapping of objects and images to extended multilingual descriptions based on WordNet inner- and interlingual relations. The dataset can also be used for multilingual image caption generation, image-to-text alignment and automatic question answering for multimedia content.
Pages: 1509-1518
Page count: 10