All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Cited by: 6

Authors:

Scribano, Carmelo [1,3]

Sapienza, Davide [1,3]

Franchini, Giorgia [1,2]

Verucchi, Micaela [1]

Bertogna, Marko [1]

Affiliations:

[1] Univ Modena & Reggio Emilia, Modena, Italy

[2] Univ Ferrara, Ferrara, Italy

[3] Univ Parma, Parma, Italy

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021 | 2021

DOI:

10.1109/CVPRW53098.2021.00481

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT, which provides an embedding of the textual descriptions, and (ii) a convolutional backbone paired with a Transformer model, which together embed the visual information. To train the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
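For readers unfamiliar with the loss family the abstract mentions, the following is a minimal sketch of the standard Triplet Margin Loss, not the variation proposed in the paper (the abstract does not give its form). The function names, the Euclidean distance, the margin value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical toy embeddings: a language embedding as the anchor, the
# matching vehicle track as the positive, another track as the negative.
text_emb = [0.0, 0.0]
matching_track = [3.0, 4.0]   # distance 5 from the anchor
other_track = [0.0, 1.0]      # distance 1 from the anchor

hard_case = triplet_margin_loss(text_emb, matching_track, other_track)
easy_case = triplet_margin_loss(text_emb, other_track, matching_track)
print(hard_case, easy_case)
```

In the first call the matching track is farther from the text embedding than the non-matching one, so the loss is positive and training would pull the pair together; in the second call the ordering is already satisfied with room to spare, so the loss is zero.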

Citation:

Pages: 4248-4257

Number of pages: 10
