MatchFormer: Interleaving Attention in Transformers for Feature Matching

Cited by: 23
Authors
Wang, Qing [1 ]
Zhang, Jiaming [1 ]
Yang, Kailun [1 ]
Peng, Kunyu [1 ]
Stiefelhagen, Rainer [1 ]
Affiliations
[1] Karlsruhe Institute of Technology, Karlsruhe, Germany
Keywords
Feature matching; Vision transformers
DOI
10.1007/978-3-031-26313-2_16
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Local feature matching is a computationally intensive task at the sub-pixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to exploit the matching capacity of the encoder and tend to overburden the decoder with matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer uses only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running-speed boost. The large MatchFormer reaches the state of the art on four different benchmarks: indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
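To make the extract-and-match scheme concrete, the following is a minimal PyTorch sketch (not the authors' code) of one encoder stage that interleaves self-attention within each image with cross-attention between the two images. The class name InterleavedStage, the pattern string, and all dimensions are illustrative assumptions; standard multi-head attention stands in for whatever efficient attention variant the published model uses.

```python
import torch
import torch.nn as nn

class InterleavedStage(nn.Module):
    """Hypothetical encoder stage interleaving self- and cross-attention.

    "self" blocks extract features within an image; "cross" blocks match
    features across the two images, making the encoder match-aware.
    """

    def __init__(self, dim: int, num_heads: int = 4, pattern: str = "self-cross"):
        super().__init__()
        self.pattern = pattern.split("-")
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in self.pattern
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in self.pattern)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (batch, tokens, dim) token sequences of two images.
        for kind, attn, norm in zip(self.pattern, self.attns, self.norms):
            if kind == "self":
                # Extraction: each image attends within itself.
                out_a, _ = attn(feat_a, feat_a, feat_a)
                out_b, _ = attn(feat_b, feat_b, feat_b)
            else:
                # Matching: queries from one image, keys/values from the other.
                out_a, _ = attn(feat_a, feat_b, feat_b)
                out_b, _ = attn(feat_b, feat_a, feat_a)
            # Residual connection plus normalization after every block.
            feat_a = norm(feat_a + out_a)
            feat_b = norm(feat_b + out_b)
        return feat_a, feat_b

if __name__ == "__main__":
    stage = InterleavedStage(dim=128)
    a = torch.randn(1, 32 * 32, 128)  # flattened 32x32 feature map, image A
    b = torch.randn(1, 32 * 32, 128)  # flattened 32x32 feature map, image B
    a, b = stage(a, b)
    print(a.shape, b.shape)  # torch.Size([1, 1024, 128]) twice
```

Sharing one attention module between the two images keeps the stage symmetric; stacking such stages over progressively downsampled feature maps gives the hierarchical, multi-scale behavior the abstract describes.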
Pages: 256-273
Page count: 18