White-Box Transformers via Sparse Rate Reduction

Cited by: 0
Authors
Yu, Yaodong [1 ]
Buchanan, Sam [2 ]
Pai, Druv [1 ]
Chu, Tianzhe [1 ]
Wu, Ziyang [1 ]
Tong, Shengbang [1 ]
Haeffele, Benjamin D. [3 ]
Ma, Yi [1 ]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] TTIC, Chicago, IL USA
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
CONNECTION;
DOI
None available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT. Code is at https://github.com/Ma-Lab-Berkeley/CRATE.
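The abstract describes each transformer block as two alternating optimization steps: a multi-head self-attention operator that compresses tokens (a gradient step on a lossy coding rate against learned subspaces), followed by a sparsification step (the MLP, viewed as one proximal-gradient/ISTA step toward a sparse code). The sketch below illustrates this two-step structure in plain NumPy; it is an illustrative simplification, not the authors' exact CRATE implementation, and the function names (`mssa`, `ista_step`) and parameters (`kappa`, `eta`, `lam`) are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mssa(Z, U_heads, kappa=1.0):
    """Multi-head subspace self-attention (compression step, sketched).
    Z: tokens as columns (d x n); U_heads: one orthonormal basis (d x p)
    per head. Each head projects tokens onto its subspace, aggregates
    similar tokens via softmax attention, and adds the result back."""
    out = np.zeros_like(Z)
    for U in U_heads:
        V = U.T @ Z                    # project tokens onto the subspace (p x n)
        A = softmax(V.T @ V, axis=-1)  # pairwise similarity of projected tokens
        out += U @ (V @ A)             # aggregate within the subspace, lift back
    return Z + kappa * out             # residual compression update

def ista_step(Z, D, eta=0.1, lam=0.1):
    """One ISTA step toward a sparse code of Z in dictionary D (d x d):
    gradient step on ||Z - D X||^2 followed by non-negative
    soft-thresholding (the ReLU), mirroring the MLP's sparsification role."""
    X = Z + eta * D.T @ (Z - D @ Z)
    return np.maximum(X - eta * lam, 0.0)

# One block applied to random tokens (all sizes are illustrative).
rng = np.random.default_rng(0)
d, n, p, heads = 16, 8, 4, 2
Z = rng.standard_normal((d, n))
U_heads = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(heads)]
D = np.linalg.qr(rng.standard_normal((d, d)))[0]

Z_half = mssa(Z, U_heads)      # compression (attention-like) step
Z_next = ista_step(Z_half, D)  # sparsification (MLP-like) step
print(Z_next.shape)            # (16, 8)
```

Stacking such blocks, with the bases and dictionary learned by backpropagation, gives the family of white-box architectures the abstract refers to: every layer has an explicit objective it incrementally optimizes.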
Pages: 36
Related Papers
50 records in total
  • [1] Damage Reduction via White-Box Failure Shaping
    Jones, Thomas B.
    Ackley, David H.
    [J]. SEARCH-BASED SOFTWARE ENGINEERING, SSBSE 2018, 2018, 11036 : 213 - 228
  • [2] Emergence of Segmentation with Minimalistic White-Box Transformers
    Yu, Yaodong
    Chu, Tianzhe
    Tong, Shengbang
    Wu, Ziyang
    Pai, Druv
    Buchanan, Sam
    Ma, Yi
    [J]. CONFERENCE ON PARSIMONY AND LEARNING, VOL 234, 2024, 234 : 72 - 93
  • [3] ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction
    Chan, Kwan Ho Ryan
    Yu, Yaodong
    You, Chong
    Qi, Haozhi
    Wright, John
    Ma, Yi
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23 : 1 - 103
  • [5] White-box benchmarking
    Hernández, E
    Hey, T
    [J]. EURO-PAR '98 PARALLEL PROCESSING, 1998, 1470 : 220 - 223
  • [6] White-box testing
    Cole, O
    [J]. DR DOBBS JOURNAL, 2000, 25 (03): 23+
  • [7] A White-Box Implementation of IDEA
    Pang, Siyu
    Lin, Tingting
    Lai, Xuejia
    Gong, Zheng
    [J]. SYMMETRY-BASEL, 2021, 13 (06):
  • [8] Opportunities in White-Box Cryptography
    Michiels, Wil
    [J]. IEEE SECURITY & PRIVACY, 2010, 8 (01) : 64 - 67
  • [9] White-Box Program Tuning
    Lee, Wen-Chuan
    Liu, Yingqi
    Liu, Peng
    Ma, Shiqing
    Choi, Hongjun
    Zhang, Xiangyu
    Gupta, Rajiv
    [J]. PROCEEDINGS OF THE 2019 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO '19), 2019, : 122 - 135
  • [10] White-Box Atomic Multicast
    Gotsman, Alexey
    Lefort, Anatole
    Chockler, Gregory
    [J]. 2019 49TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN 2019), 2019, : 176 - 187