63 references in total
- [1] Pei W, Zhang J, Wang X, Ke L, Shen X, Tai Y-W., Memory-attended recurrent network for video captioning, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8347-8356, (2019)
- [2] Peng-Jie Tang, Han-Li Wang, From video to language: survey of video captioning and description, Acta Automatica Sinica, 47, pp. 1-23, (2021)
- [3] Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick C L, Parikh D., VQA: Visual question answering, Proceedings of the IEEE International Conference on Computer Vision, pp. 2425-2433, (2015)
- [4] Jang Y, Song Y, Yu Y, Kim Y, Kim G., TGIF-QA: toward spatio-temporal reasoning in visual question answering, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1359-1367, (2017)
- [5] Lei J, Yu L, Bansal M, Berg T L., TVQA: localized, compositional video question answering, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1369-1379, (2018)
- [6] Zhen Dong, Ming-Tao Pei, Cross-modality face retrieval based on heterogeneous hashing network, Chinese Journal of Computers, 42, 1, pp. 75-86, (2019)
- [7] Shuang-Yong Yan, Chang-Hong Liu, Ai-Wen Jiang, Ji-Hua Ye, Ming-Wen Wang, Discriminative cross-modal hashing with coupled semantic correlation, Chinese Journal of Computers, 42, 1, pp. 164-175, (2019)
- [8] Qi-Lu Zhao, Zong-Min Li, Cross-modal social image clustering, Chinese Journal of Computers, 41, 1, pp. 100-113, (2018)
- [9] Fan C, Zhang X, Zhang S, Wang W, Zhang C, Huang H., Heterogeneous memory enhanced multimodal attention model for video question answering, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1999-2007, (2019)
- [10] Jiang P, Han Y., Reasoning with heterogeneous graph alignment for video question answering, Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11109-11116, (2020)