VidQ: Video Query Using Optimized Audio-Visual Processing

Times cited: 0
Authors
Felemban, Noor [1]
Mehmeti, Fidan [2]
Porta, Thomas F. [3]
Affiliations
[1] Imam Abdulrahman Bin Faisal Univ, Dept Comp Engn, Dammam 34212, Saudi Arabia
[2] Tech Univ Munich, Chair Commun Networks, Munich D-80333, Germany
[3] Penn State Univ, Dept Comp Sci & Engn, State Coll, PA 16801 USA
Keywords
Mobile networks; deep learning; convolutional neural networks; performance optimization; heuristics; speech recognition
DOI
10.1109/TNET.2022.3215601
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
As mobile devices become more prevalent in everyday life and the amount of recorded and stored video increases, efficient techniques for searching video content become more important. When a user sends a query searching for a specific action in a large amount of data, the goal is to respond to the query accurately and quickly. In this paper, we address the problem of responding in a timely manner to queries that search for specific actions on mobile devices, utilizing both visual and audio processing approaches. We build a system, called VidQ, which consists of several stages and uses various Convolutional Neural Networks (CNNs) and Speech APIs to respond to such queries. Because state-of-the-art computer vision and speech algorithms are computationally intensive, we use servers with GPUs to assist mobile users in the process. After a query is issued, we identify the different stages of processing that will take place and then determine their order. Finally, by solving an optimization problem that captures the system behavior, we distribute the processing among the available network resources to minimize the processing time. Results show that VidQ reduces the completion time by at least 50% compared to other approaches.
Pages: 1338-1352
Number of pages: 15