White-Box Transformers via Sparse Rate Reduction

Cited by: 0
Authors
Yu, Yaodong [1 ]
Buchanan, Sam [2 ]
Pai, Druv [1 ]
Chu, Tianzhe [1 ]
Wu, Ziyang [1 ]
Tong, Shengbang [1 ]
Haeffele, Benjamin D. [3 ]
Ma, Yi [1 ]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] TTIC, Chicago, IL USA
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
CONNECTION;
DOI
None available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT. Code is at https://github.com/Ma-Lab-Berkeley/CRATE.
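The abstract describes each transformer block as two alternating optimization steps: a multi-head self-attention operator that compresses tokens (a gradient step on a lossy coding rate against learned subspaces), followed by a sparsification step (the MLP, viewed as one proximal-gradient/ISTA step toward a sparse code). The sketch below illustrates this two-step structure in plain NumPy; it is an illustrative simplification, not the authors' exact CRATE implementation, and the function names (`mssa`, `ista_step`) and parameters (`kappa`, `eta`, `lam`) are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mssa(Z, U_heads, kappa=1.0):
    """Multi-head subspace self-attention (compression step, sketched).
    Z: tokens as columns (d x n); U_heads: one orthonormal basis (d x p)
    per head. Each head projects tokens onto its subspace, aggregates
    similar tokens via softmax attention, and adds the result back."""
    out = np.zeros_like(Z)
    for U in U_heads:
        V = U.T @ Z                    # project tokens onto the subspace (p x n)
        A = softmax(V.T @ V, axis=-1)  # pairwise similarity of projected tokens
        out += U @ (V @ A)             # aggregate within the subspace, lift back
    return Z + kappa * out             # residual compression update

def ista_step(Z, D, eta=0.1, lam=0.1):
    """One ISTA step toward a sparse code of Z in dictionary D (d x d):
    gradient step on ||Z - D X||^2 followed by non-negative
    soft-thresholding (the ReLU), mirroring the MLP's sparsification role."""
    X = Z + eta * D.T @ (Z - D @ Z)
    return np.maximum(X - eta * lam, 0.0)

# One block applied to random tokens (all sizes are illustrative).
rng = np.random.default_rng(0)
d, n, p, heads = 16, 8, 4, 2
Z = rng.standard_normal((d, n))
U_heads = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(heads)]
D = np.linalg.qr(rng.standard_normal((d, d)))[0]

Z_half = mssa(Z, U_heads)      # compression (attention-like) step
Z_next = ista_step(Z_half, D)  # sparsification (MLP-like) step
print(Z_next.shape)            # (16, 8)
```

Stacking such blocks, with the bases and dictionary learned by backpropagation, gives the family of white-box architectures the abstract refers to: every layer has an explicit objective it incrementally optimizes.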
Pages: 36
Related Papers
50 records in total
  • [1] Damage Reduction via White-Box Failure Shaping
    Jones, Thomas B.
    Ackley, David H.
    [J]. SEARCH-BASED SOFTWARE ENGINEERING, SSBSE 2018, 2018, 11036 : 213 - 228
  • [2] Emergence of Segmentation with Minimalistic White-Box Transformers
    Yu, Yaodong
    Chu, Tianzhe
    Tong, Shengbang
    Wu, Ziyang
    Pai, Druv
    Buchanan, Sam
    Ma, Yi
    [J]. CONFERENCE ON PARSIMONY AND LEARNING, VOL 234, 2024, 234 : 72 - 93
  • [3] ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction
    Chan, Kwan Ho Ryan
    Yu, Yaodong
    You, Chong
    Qi, Haozhi
    Wright, John
    Ma, Yi
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23 : 1 - 103
  • [5] White-box benchmarking
    Hernández, E
    Hey, T
    [J]. EURO-PAR '98 PARALLEL PROCESSING, 1998, 1470 : 220 - 223
  • [6] White-box testing
    Cole, O
    [J]. DR DOBBS JOURNAL, 2000, 25 (03): 23+
  • [7] A White-Box Implementation of IDEA
    Pang, Siyu
    Lin, Tingting
    Lai, Xuejia
    Gong, Zheng
    [J]. SYMMETRY-BASEL, 2021, 13 (06):
  • [8] Opportunities in White-Box Cryptography
    Michiels, Wil
    [J]. IEEE SECURITY & PRIVACY, 2010, 8 (01) : 64 - 67
  • [9] White-Box Program Tuning
    Lee, Wen-Chuan
    Liu, Yingqi
    Liu, Peng
    Ma, Shiqing
    Choi, Hongjun
    Zhang, Xiangyu
    Gupta, Rajiv
    [J]. PROCEEDINGS OF THE 2019 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO '19), 2019, : 122 - 135
  • [10] White-Box Atomic Multicast
    Gotsman, Alexey
    Lefort, Anatole
    Chockler, Gregory
    [J]. 2019 49TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN 2019), 2019, : 176 - 187