33 entries in total
- [1] Mao J, Xu W, Yang Y, et al., Deep captioning with multimodal recurrent neural networks (m-RNN), (2014)
- [2] Vinyals O, Toshev A, Bengio S, et al., Show and tell: A neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156-3164, (2015)
- [3] Karpathy A, Fei-Fei L, Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128-3137, (2015)
- [4] Cho K, Van Merrienboer B, Gulcehre C, et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, (2014)
- [5] Bahdanau D, Cho K, Bengio Y, Neural machine translation by jointly learning to align and translate, (2014)
- [6] Sutskever I, Vinyals O, Le Q V, Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems (NIPS), pp. 3104-3112, (2014)
- [7] Xu K, Ba J, Kiros R, et al., Show, attend and tell: Neural image caption generation with visual attention, Proceedings of the International Conference on Machine Learning (ICML), pp. 2048-2057, (2015)
- [8] Chen L, Zhang H, Xiao J, et al., SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6298-6306, (2017)
- [9] Anderson P, He X, Buehler C, et al., Bottom-up and top-down attention for image captioning and visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6077-6086, (2018)
- [10] Lu J, Yang J, Batra D, et al., Neural baby talk, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7219-7228, (2018)