The Transformer has recently achieved impressive success in image super-resolution due to its ability to model long-range dependencies with multi-head self-attention (MHSA). However, most existing MHSA mechanisms focus only on the dependencies among individual tokens and ignore the dependencies among token clusters, each containing several tokens, which prevents the Transformer from adequately exploring global features. Moreover, the Transformer neglects local features, which inevitably hinders accurate detail reconstruction. To address these issues, we propose a lightweight image super-resolution method with cluster and match attention (CMASR). Specifically, a Token Clustering block is designed to divide input tokens into token clusters of different sizes using depthwise separable convolution. Subsequently, we propose an efficient axial matching self-attention (AMSA) mechanism, which introduces an axial matrix to extract local features, including axial similarities and symmetries. Further, by combining AMSA and Window Self-Attention, we construct a Hybrid Self-Attention block that captures the dependencies among token clusters of different sizes, so that both axial local features and global features are sufficiently extracted. Extensive experiments demonstrate that the proposed CMASR outperforms state-of-the-art methods at lower computational cost (i.e., with fewer parameters and FLOPs).
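As a rough illustration of how depthwise separable convolution can group tokens into clusters of different sizes, the following minimal sketch merges each k x k patch of a token grid into one cluster token. It is not the CMASR implementation: the function names (`depthwise_conv`, `pointwise_conv`, `cluster_tokens`), the use of stride equal to kernel size, and the fixed averaging and identity weights (standing in for learned ones) are all assumptions made for illustration.

```python
# Illustrative sketch only: names, shapes, and weights are assumptions,
# not the CMASR implementation.

def depthwise_conv(x, kernels, stride):
    """Depthwise step: each channel is filtered by its own k x k kernel.

    x:       list of C channels, each an H x W list of lists
    kernels: list of C kernels, each k x k
    stride:  step between windows; stride == k gives non-overlapping clusters
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        rows = []
        for i in range(0, h - k + 1, stride):
            row = []
            for j in range(0, w - k + 1, stride):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        acc += channel[i + di][j + dj] * kernel[di][dj]
                row.append(acc)
            rows.append(row)
        out.append(rows)
    return out


def pointwise_conv(x, weights):
    """Pointwise (1 x 1) step: mix channels at every spatial position.

    weights: C_out x C_in matrix
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(w_row)) for j in range(w)]
         for i in range(h)]
        for w_row in weights
    ]


def cluster_tokens(x, cluster_size):
    """Merge each cluster_size x cluster_size patch of tokens into one cluster token.

    Assumption: averaging kernels and identity channel mixing stand in for
    learned weights.
    """
    c = len(x)
    k = cluster_size
    avg = [[1.0 / (k * k)] * k for _ in range(k)]
    identity = [[1.0 if a == b else 0.0 for b in range(c)] for a in range(c)]
    return pointwise_conv(depthwise_conv(x, [avg] * c, stride=k), identity)


# A 2-channel 4 x 4 token grid (hypothetical input).
tokens = [
    [[float(4 * i + j) for j in range(4)] for i in range(4)],
    [[1.0] * 4 for _ in range(4)],
]
clusters_2 = cluster_tokens(tokens, 2)  # 2 x 2 grid of cluster tokens per channel
clusters_4 = cluster_tokens(tokens, 4)  # 1 x 1 grid of cluster tokens per channel
```

Calling `cluster_tokens` with several cluster sizes yields token grids at several granularities, which is the kind of multi-size input over which attention among clusters could then be computed. In a trained model the kernels and the channel-mixing matrix would be learned rather than fixed.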