Context-Aware Attention Network for Image-Text Retrieval
Cited by: 181
Authors:
Zhang, Qi [1,2]
Lei, Zhen [1,2]
Zhang, Zhaoxiang [1,2]
Li, Stan Z. [3]
Affiliations:
[1] Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Westlake Univ, Ctr AI Res & Innovat, Hangzhou, Peoples R China
Source:
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)
2020
DOI:
10.1109/CVPR42600.2020.00359
Chinese Library Classification (CLC) number:
TP18 [Artificial intelligence theory]
Discipline classification codes:
081104; 0812; 0835; 1405
Abstract:
As a typical cross-modal problem, bidirectional image-text retrieval relies heavily on joint embedding learning and on the similarity measure for each image-text pair. It remains challenging because prior works seldom explore semantic correspondences between modalities and semantic correlations within a single modality at the same time. In this work, we propose a unified Context-Aware Attention Network (CAAN), which selectively focuses on critical local fragments (regions and words) by aggregating the global context. Specifically, it simultaneously utilizes global inter-modal alignments and intra-modal correlations to discover latent semantic relations. Considering the interactions between images and sentences in the retrieval process, intra-modal correlations are derived from the second-order attention of region-word alignments instead of intuitively comparing the distance between original features. Our method achieves fairly competitive results on two generic image-text retrieval datasets, Flickr30K and MS-COCO.
Meng, Lingtao
Affiliation: Tianjin Univ Technol, Sch Comp Sci & Engn, Binshui West St, Tianjin 300380, Tianjin, Peoples R China
Zhang, Feifei
Affiliation: Tianjin Univ Technol, Sch Comp Sci & Engn, Binshui West St, Tianjin 300380, Tianjin, Peoples R China
Zhang, Xi
Affiliation: Chinese Acad Sci, Inst Automat, East Zhongguancun Rd, Beijing 100080, Peoples R China
Xu, Changsheng
Affiliation: Chinese Acad Sci, Inst Automat, East Zhongguancun Rd, Beijing 100080, Peoples R China