From Captions to Visual Concepts and Back

Cited by: 0
Authors
Fang, Hao [1 ]
Deng, Li [1 ]
Mitchell, Margaret [1 ]
Gupta, Saurabh [1 ]
Dollar, Piotr [1 ]
Platt, John C. [1 ]
Iandola, Forrest [1 ]
Gao, Jianfeng [1 ]
Zitnick, C. Lawrence [1 ]
Srivastava, Rupesh K. [1 ]
He, Xiaodong [1 ]
Zweig, Geoffrey [1 ]
Affiliation
[1] Microsoft Res, Redmond, WA 98052 USA
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learned directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, spanning many parts of speech, including nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions are of equal or better quality 34% of the time.
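The multiple-instance-learning step the abstract describes can be illustrated with the noisy-OR rule commonly used in this setting: per-region detection scores for a word are combined into a single image-level probability, so a word counts as present if at least one region fires. This is a minimal sketch, not the paper's implementation; the word and the per-region scores below are hypothetical.

```python
import numpy as np

def noisy_or_word_probability(region_probs):
    """Combine per-region word probabilities into one image-level
    probability via noisy-OR: P(word | image) = 1 - prod_j (1 - p_j).
    The word is deemed present if at least one region detects it."""
    region_probs = np.asarray(region_probs, dtype=float)
    return 1.0 - np.prod(1.0 - region_probs)

# Hypothetical per-region scores for the word "dog" in one image.
scores = [0.05, 0.10, 0.80]
p_word = noisy_or_word_probability(scores)
print(round(p_word, 3))  # 1 - 0.95 * 0.90 * 0.20 = 0.829
```

Note that the combined probability (0.829) exceeds the best single-region score (0.80): under noisy-OR, several weak detections reinforce a strong one, which suits caption words whose image regions are never annotated directly.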
Pages: 1473-1482
Page count: 10
Related papers
50 records in total
  • [1] Tactile Captions: Augmenting Visual Captions
    Kushalnagar, Raja
    Ramachandran, Vignesh
    Oh, Tae
    [J]. COMPUTERS HELPING PEOPLE WITH SPECIAL NEEDS, ICCHP 2014, PT I, 2014, 8547 : 25 - 32
  • [2] Transforming Visual Scene Graphs to Image Captions
    Yang, Xu
    Peng, Jiawei
    Wang, Zihua
    Xu, Haiyang
    Ye, Qinghao
    Li, Chenliang
    Huang, Songfang
    Huang, Fei
    Li, Zhangzikang
    Zhang, Yu
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 12427 - 12440
  • [3] Capitulating to Captions - The Verbal Transformation of Visual Images
    Signorile, V.
    [J]. HUMAN STUDIES, 1987, 10 (3-4) : 281 - 310
  • [4] From concepts to cures-and back
    Hunt, Tim
    [J]. EMBO MOLECULAR MEDICINE, 2009, 1 (01) : 4 - 4
  • [5] Learning visual representations using images with captions
    Quattoni, Ariadna
    Collins, Michael
    Darrell, Trevor
    [J]. 2007 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-8, 2007, : 1553 - 1560
  • [6] StyleNet: Generating Attractive Visual Captions with Styles
    Gan, Chuang
    Gan, Zhe
    He, Xiaodong
    Gao, Jianfeng
    Deng, Li
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 955 - 964
  • [7] Top-down Visual Saliency Guided by Captions
    Ramanishka, Vasili
    Das, Abir
    Zhang, Jianming
    Saenko, Kate
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3135 - 3144
  • [8] Porting Concepts from DNNs Back to GMMs
    Demuynck, Kris
    Triefenbach, Fabian
    [J]. 2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2013, : 356 - 361
  • [9] Generating Diverse and Descriptive Image Captions Using Visual Paraphrases
    Liu, Lixin
    Tang, Jiajun
    Wan, Xiaojun
    Guo, Zongming
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4239 - 4248
  • [10] Learning Video Preferences Using Visual Features and Closed Captions
    Brezeale, Darin
    Cook, Diane J.
    [J]. IEEE MULTIMEDIA, 2009, 16 (03) : 39 - 47