Recommending Themes for Ad Creative Design via Visual-Linguistic Representations

Cited by: 7
Authors
Zhou, Yichao [1 ]
Mishra, Shaunak [2 ]
Verma, Manisha [2 ]
Bhamidipati, Narayan [2 ]
Wang, Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[2] Yahoo Res, Sunnyvale, CA USA
Keywords
Online advertising; transformers; visual-linguistic representation
DOI
10.1145/3366423.3380001
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
There is a perennial need in the online advertising industry to refresh ad creatives, i.e., the images and text used to entice online users towards a brand. Such refreshes are required to reduce the likelihood of ad fatigue among online users and to incorporate insights from other successful campaigns in related product categories. Given a brand, coming up with themes for a new ad is a painstaking and time-consuming process for creative strategists, who typically draw inspiration from the images and text used in past ad campaigns, as well as world knowledge on the brands. To automatically infer ad themes from such multimodal sources of information in past ad campaigns, we propose a theme (keyphrase) recommender system for ad creative strategists. The theme recommender aggregates results from a visual question answering (VQA) task, which ingests: (i) ad images, (ii) text associated with the ads as well as Wikipedia pages on the brands in the ads, and (iii) questions around the ad. We leverage transformer-based cross-modality encoders to train visual-linguistic representations for our VQA task. We study two formulations of the VQA task, classification and ranking; via experiments on a public dataset, we show that cross-modal representations lead to significantly better classification accuracy and ranking precision-recall metrics than separate image and text representations, and that using multimodal information yields a significant lift over using textual or visual information alone.
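To make the ranking formulation concrete, the following minimal Python sketch scores candidate theme keyphrases against an ad image with a pretrained cross-modal encoder. The paper trains LXMERT-style cross-modality encoders on ad-specific VQA data; here an off-the-shelf CLIP model from Hugging Face stands in as a substitute, and the image path and candidate themes are illustrative placeholders, not artifacts from the paper.

# A minimal sketch of the ranking formulation: rank candidate theme
# keyphrases for an ad by image-text similarity. CLIP is used here as a
# stand-in for the ad-trained cross-modality encoders described above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("ad_creative.jpg")  # hypothetical past ad image
candidate_themes = ["family safety", "luxury", "fuel efficiency", "adventure"]
# Phrase each candidate as an answer about the ad, mirroring the VQA
# framing in the abstract (questions around the ad, themes as answers).
texts = [f"an ad about {theme}" for theme in candidate_themes]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; rank themes by them.
scores = outputs.logits_per_image.squeeze(0)
for theme, score in sorted(zip(candidate_themes, scores.tolist()),
                           key=lambda pair: -pair[1]):
    print(f"{theme}: {score:.2f}")

In the paper's setting the encoder would be fine-tuned on past ad campaigns (images, associated text, and brand Wikipedia pages), so this off-the-shelf scoring should be read as an illustration of the ranking interface, not of the reported results.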
Pages: 2521-2527
Number of pages: 7
Related Papers
5 items in total
  • [1] Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations
    Chiou, Meng-Jiun
    Zimmermann, Roger
    Feng, Jiashi
    [J]. IEEE ACCESS, 2021, 9 : 50441 - 50451
  • [2] Compressing Visual-linguistic Model via Knowledge Distillation
    Fang, Zhiyuan
    Wang, Jianfeng
    Hu, Xiaowei
    Wang, Lijuan
    Yang, Yezhou
    Liu, Zicheng
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1408 - 1418
  • [3] CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model
    Yin, Pengwei
    Zeng, Guanzhong
    Wang, Jingjing
    Xie, Di
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 6729 - 6737
  • [4] Automated Construction of Visual-Linguistic Knowledge via Concept Learning from Cartoon Videos
    Ha, Jung-Woo
    Kim, Kyung-Min
    Zhang, Byoung-Tak
    [J]. PROCEEDINGS OF THE TWENTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2015, : 522 - 528
  • [5] Faster Zero-shot Multi-modal Entity Linking via Visual-Linguistic Representation
    Zheng, Qiushuo
    Wen, Hao
    Wang, Meng
    Qi, Guilin
    Bai, Chaoyu
    [J]. DATA INTELLIGENCE, 2022, 4 (03) : 493 - 508