With the proliferation of multimedia applications, data are frequently collected from multiple sources, which has accelerated the development of multi-view clustering (MVC) methods. In this paper, we propose a novel MVC method, termed GLSEF, to handle the inconsistency among multiple spectral embeddings. To this end, GLSEF adopts a two-level learning mechanism. On the global level, GLSEF accounts for the diversity of features and selectively assigns smooth weights to a subset of more discriminative features that are conducive to clustering. On the local level, GLSEF resorts to the Grassmann manifold to preserve the spatial and topological information and the local structure of each view, thereby improving its suitability and accuracy for clustering. Moreover, unlike most previous methods, which learn a low-dimensional embedding and then run the k-means algorithm to obtain the final cluster labels, GLSEF directly learns the discrete indicator matrix, preventing potential information loss during post-processing. To solve the resulting optimization problem, we present an efficient alternating optimization algorithm, accompanied by convergence and time complexity analyses. Extensive empirical results on nine real-world datasets demonstrate the effectiveness and efficiency of GLSEF compared to existing state-of-the-art MVC methods. The code is publicly available here.
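To make the notion of comparing spectral embeddings on the Grassmann manifold concrete, the following is a minimal sketch and not the GLSEF formulation itself. It assumes a Gaussian (RBF) affinity per view and uses the projection metric, one common distance between subspaces; the helper names `rbf_affinity`, `spectral_embedding`, and `projection_distance_sq` and the toy two-view data are hypothetical choices made for illustration only.

```python
import numpy as np

def rbf_affinity(X, gamma=1.0):
    """Gaussian affinity matrix for one view (an assumed choice of graph)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def spectral_embedding(W, k):
    """Top-k eigenvectors of the normalized affinity D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    S = (S + S.T) / 2.0              # enforce exact symmetry before eigh
    _, vecs = np.linalg.eigh(S)      # eigenvalues are returned in ascending order
    return vecs[:, -k:]              # orthonormal basis of the leading subspace

def projection_distance_sq(U1, U2):
    """Squared projection metric between span(U1) and span(U2).

    For orthonormal n-by-k bases this equals 0.5 * ||U1 U1^T - U2 U2^T||_F^2,
    so it depends only on the subspaces, not on the particular bases.
    """
    k = U1.shape[1]
    return k - np.linalg.norm(U1.T @ U2, "fro") ** 2

# Toy data: two noisy views of the same two-cluster structure (hypothetical).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
view1 = centers[labels] + rng.normal(scale=0.5, size=(40, 2))
view2 = centers[labels] + rng.normal(scale=0.5, size=(40, 2))

U1 = spectral_embedding(rbf_affinity(view1), k=2)
U2 = spectral_embedding(rbf_affinity(view2), k=2)

# Rotating a basis leaves the subspace, and hence the distance, unchanged.
R, _ = np.linalg.qr(rng.normal(size=(2, 2)))
d_same = projection_distance_sq(U1, U1 @ R)
d_views = projection_distance_sq(U1, U2)
print(f"rotated copy: {d_same:.2e}, across views: {d_views:.4f}")
```

The point of the sketch is the rotation invariance: a spectral embedding is only defined up to an orthogonal transformation, so comparing embeddings as points on the Grassmann manifold avoids penalizing differences that carry no clustering information.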