All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Cited by: 6
Authors
Scribano, Carmelo [1, 3]
Sapienza, Davide [1, 3]
Franchini, Giorgia [1, 2]
Verucchi, Micaela [1 ]
Bertogna, Marko [1 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
[2] Univ Ferrara, Ferrara, Italy
[3] Univ Parma, Parma, Italy
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021 | 2021
DOI
10.1109/CVPRW53098.2021.00481
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, (ii) a convolutional backbone along with a Transformer model to embed the visual information. For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
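The abstract says the retrieval model is trained with a variation of the Triplet Margin Loss, but it does not say what the variation is. As a point of reference only, the following minimal sketch shows the standard triplet margin loss applied to a text embedding (anchor), a matching track embedding (positive), and a non-matching track embedding (negative). The function names, the Euclidean distance, the margin value, and the toy 2-D vectors are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    This is the textbook form, NOT the paper's variation, which the
    abstract does not describe.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings with made-up values (hypothetical, for illustration).
text_emb = [0.0, 0.0]        # embedding of a textual description
matching_track = [3.0, 4.0]  # distance 5 from text_emb
other_track = [6.0, 8.0]     # distance 10 from text_emb

# The matching track is already closer by more than the margin, so the loss is zero.
loss_satisfied = triplet_margin_loss(text_emb, matching_track, other_track)

# Swapping the roles violates the margin: 10 - 5 + 1 = 6.
loss_violated = triplet_margin_loss(text_emb, other_track, matching_track)

# At retrieval time, candidate tracks can be ranked by distance to the text embedding.
candidates = {"track_a": other_track, "track_b": matching_track}
ranking = sorted(candidates, key=lambda name: euclidean(text_emb, candidates[name]))
```

In this sketch a zero loss means the triplet already satisfies the margin, so only violating triplets would contribute a training signal; ranking by distance then places the closest track first.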
Pages: 4248-4257
Number of pages: 10
Related papers
50 in total
  • [21] Contrastive Learning for Natural Language-Based Vehicle Retrieval
    Tam Minh Nguyen
    Quang Huu Pham
    Linh Bao Doan
    Hoang Viet Trinh
    Viet-Anh Nguyen
    Viet-Hoang Phan
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 4240 - 4247
  • [22] Clustering-Based Decision Tree for Vehicle Routing Spatio-Temporal Selection
    Liu, Yixiao
    Zhang, Lei
    Zhou, Yixuan
    Xu, Qin
    Fu, Wen
    Shen, Tao
    ELECTRONICS, 2022, 11 (15)
  • [23] STHARNet: spatio-temporal human action recognition network in content based video retrieval
    Sowmyayani, S.
    Rani, P. Arockia Jansi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 82 (24) : 38051 - 38066
  • [24] Posterior-Based Analysis of Spatio-Temporal Features for Sign Language Assessment
    Tarigopula, Neha
    Tornay, Sandrine
    Sincan, Ozge Mercanoglu
    Bowden, Richard
    Magimai-Doss, Mathew
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2025, 6 : 284 - 292
  • [25] A study of semantic retrieval system based on geo-ontology with spatio-temporal characteristic
    Song, Jia
    Zhu, Yunqiang
    Wang, Juanle
    DCABES 2007 PROCEEDINGS, VOLS I AND II, 2007, : 1029 - 1034
  • [26] Content-based video retrieval by integrating spatio-temporal and stochastic recognition of events
    Petkovic, M
    Jonker, W
    IEEE WORKSHOP ON DETECTION AND RECOGNITION OF EVENTS IN VIDEO, PROCEEDINGS, 2001, : 75 - 82
  • [27] A Kinect Based Sign Language Recognition System Using Spatio-temporal Features
    Memis, Abbas
    Albayrak, Songul
    SIXTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2013), 2013, 9067
  • [28] STHARNet: spatio-temporal human action recognition network in content based video retrieval
    S. Sowmyayani
    P. Arockia Jansi Rani
    Multimedia Tools and Applications, 2023, 82 : 38051 - 38066
  • [29] STOQL: An ODMG-Based spatio-temporal object model and query language
    Huang, B
    Claramunt, C
    ADVANCES IN SPATIAL DATA HANDLING, 2002, : 225 - 237
  • [30] Pixel is All You Need: Adversarial Spatio-Temporal Ensemble Active Learning for Salient Object Detection
    Wu, Zhenyu
    Wang, Wei
    Wang, Lin
    Li, Yacong
    Lv, Fengmao
    Xia, Qing
    Chen, Chenglizhao
    Hao, Aimin
    Li, Shuo
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (02) : 858 - 877