All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Cited by: 6
Authors
Scribano, Carmelo [1, 3]
Sapienza, Davide [1, 3]
Franchini, Giorgia [1, 2]
Verucchi, Micaela [1]
Bertogna, Marko [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
[2] Univ Ferrara, Ferrara, Italy
[3] Univ Parma, Parma, Italy
DOI
10.1109/CVPRW53098.2021.00481
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, and (ii) a convolutional backbone along with a Transformer model to embed the visual information. For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
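
The abstract mentions a variation of the Triplet Margin Loss without giving its form. As a rough illustration only, the sketch below shows the standard triplet margin loss on toy embeddings, with the text embedding as anchor and track embeddings as positive and negative. It is not the authors' variant; the function names, the Euclidean distance, and the margin values are all assumptions made for this example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (NOT the paper's variation):
    max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def rank_tracks(text_emb, track_embs):
    """Hypothetical retrieval step: track indices sorted by distance to the query."""
    return sorted(range(len(track_embs)), key=lambda i: euclidean(text_emb, track_embs[i]))

# Toy 2-D embeddings (made up for illustration, not real model outputs).
text_emb = [0.0, 1.0]          # embedding of a textual description
matching_track = [0.0, 0.8]    # visual embedding of the described vehicle
other_track = [1.0, 0.0]       # visual embedding of a different vehicle

loss = triplet_margin_loss(text_emb, matching_track, other_track, margin=0.5)
ranking = rank_tracks(text_emb, [other_track, matching_track])
print(loss, ranking)
```

In this toy case the matching track is already closer to the text embedding than the other track by more than the margin, so the loss is zero and the matching track is ranked first.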
Pages: 4248-4257
Page count: 10