Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging

Cited by: 3
Authors
Cao, Miao [1, 2, 3]
Wang, Lishun [2, 3]
Zhu, Mingyu [2, 3]
Yuan, Xin [2, 3]
Affiliations
[1] Zhejiang Univ, Hangzhou 310058, Zhejiang, Peoples R China
[2] Westlake Univ, Sch Engn, Hangzhou 310030, Zhejiang, Peoples R China
[3] Westlake Univ, Res Ctr Ind Future, Hangzhou 310030, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Computational imaging; Snapshot compressive imaging; Compressive sensing; Deep learning; Convolutional neural networks; Transformer;
DOI
10.1007/s11263-024-02101-y
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video snapshot compressive imaging (SCI) uses a low-speed 2D detector to capture high-speed scenes: the dynamic scene is modulated by different masks and then compressed into a single snapshot measurement. A reconstruction algorithm is then needed to recover the high-speed video frames. Although state-of-the-art (SOTA) deep learning-based reconstruction algorithms have achieved impressive results, they still face the following challenges due to excessive model complexity and GPU memory limitations: (1) they incur high computational cost, and (2) they are usually unable to reconstruct large-scale video frames at high compression ratios. To address these issues, we develop an efficient network for video SCI, dubbed EfficientSCI++, that uses hierarchical residual-like connections and a hybrid CNN-Transformer structure within a single residual block. The EfficientSCI++ network exploits spatial-temporal correlation by using convolution in the spatial domain and a Transformer in the temporal domain. We demonstrate for the first time that a UHD color video (1644 × 3840 × 3) with a high compression ratio (40) can be reconstructed from a snapshot 2D measurement using a single end-to-end deep learning model with PSNR above 34 dB. Moreover, a mixed-precision model is trained to further accelerate the video SCI reconstruction process and reduce the memory footprint. Extensive results on both simulation and real data demonstrate that, compared with previous SOTA methods, our proposed EfficientSCI++ and EfficientSCI can achieve comparable reconstruction quality with much lower computational cost and better real-time performance.
Code is available at https://github.com/mcao92/EfficientSCI-plus-plus.
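The measurement process described in the abstract (each frame modulated by its own mask, then all frames summed into one 2D snapshot) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name sci_forward, the random binary masks, and the toy sizes are assumptions chosen only for illustration, and sensor noise is ignored.

```python
import numpy as np


def sci_forward(frames, masks):
    """Compress T mask-modulated frames into one 2D snapshot measurement.

    frames, masks: arrays of shape (T, H, W). Returns an array of shape (H, W).
    (Hypothetical helper for illustration; not part of the released code.)
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must share the shape (T, H, W)")
    # Element-wise modulation, then summation over the time axis.
    return (frames * masks).sum(axis=0)


rng = np.random.default_rng(0)
T, H, W = 8, 4, 6  # toy sizes (assumed); T plays the role of the compression ratio
frames = rng.random((T, H, W))               # stand-in for a high-speed grayscale clip
masks = rng.integers(0, 2, size=(T, H, W))   # assumed random binary masks
measurement = sci_forward(frames, masks)
print(measurement.shape)  # (4, 6)
```

In this picture, a compression ratio of 40 corresponds to T = 40 frames folded into one measurement, and the reconstruction network has to invert this many-to-one mapping.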
Pages: 4521-4540
Page count: 20