22 references in total
- [1] ZHANG H, WU F, ZHUANG Y T., Research on cross-media correlation inference and retrieval, Computer Research and Development, 45, 5, (2008)
- [2] DONG J, LI X, SNOEK C G M., Predicting visual features from text for image and video caption retrieval, IEEE Transactions on Multimedia, 20, 12, pp. 3377-3388, (2018)
- [3] MITHUN N C, LI J, METZE F, et al., Learning joint embedding with multimodal cues for cross-modal video-text retrieval, Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 19-27, (2018)
- [4] DONG J, LI X, SNOEK C G M., Word2VisualVec: Image and video to sentence matching by visual feature prediction
- [5] TORABI A, TANDON N, SIGAL L., Learning language-visual embedding for movie understanding with natural-language
- [6] RASIWASIA N, COSTA P J, COVIELLO E, et al., A new approach to cross-modal multimedia retrieval, Proceedings of the 18th ACM International Conference on Multimedia, pp. 251-260, (2010)
- [7] FAGHRI F, FLEET D J, KIROS J R, et al., VSE++: Improving visual-semantic embeddings with hard negatives
- [8] GONG Y, KE Q, ISARD M, et al., A multi-view embedding space for modeling internet images, tags, and their semantics, International Journal of Computer Vision, 106, 2, pp. 210-233, (2014)
- [9] HODOSH M, YOUNG P, HOCKENMAIER J., Framing image description as a ranking task: Data, models and evaluation metrics, Journal of Artificial Intelligence Research, 47, 24, pp. 853-899, (2013)
- [10] LI Z X, SHI Z P, CHEN H C, et al., Multi-modal image retrieval based on semantic learning, Computer Engineering, 39, 3, pp. 258-263, (2013)