It is well known that polar codes achieve capacity, but exactly how fast they approach channel capacity as a function of their blocklength is so far unknown. More precisely, let us fix a binary-input memoryless symmetric channel $W$ of capacity $I(W)$ and a desired probability of error $P_e$. Given $W$ and $P_e$, suppose we wish to communicate at rate $I(W) - \Delta$ using a polar code of length $n$. It has recently been shown that the required blocklength $n$ scales as $O(\Delta^{-\mu})$, where the constant $\mu$ is known as the scaling exponent. In particular, if $W$ is the binary erasure channel (BEC), then $\mu = 3.627$. This is somewhat disappointing, since random codes achieve the (optimal) scaling exponent $\mu^* = 2$. As shown by Arikan, channel polarization can be induced via a simple linear transformation: the iterated Kronecker product of a $2 \times 2$ binary matrix $G$, called the polarization kernel, with itself. Is it possible to improve the scaling exponent of polar codes (on the BEC) if $G$ is replaced by an $l \times l$ binary kernel matrix $K$ for some integer $l \ge 3$? This is the question we address in the present paper. It was conjectured by Hassani that, as $l \to \infty$, the scaling exponent of a randomly chosen polarization kernel $K$ approaches the optimal value $\mu^* = 2$. Herein, however, we are primarily interested in small values of $l$. We begin with the fact that a given $l \times l$ polarization kernel $K$ transforms $l$ copies of the underlying channel $W$ into $l$ bit-channels $W_1, W_2, \ldots, W_l$. Notably, if $W$ is a BEC with erasure probability $z$, then each of $W_1, W_2, \ldots, W_l$ is also a BEC. The erasure probabilities of $W_1, W_2, \ldots, W_l$ are polynomials in $z$ with integer coefficients and degree at most $l$. We refer to the corresponding set of polynomials $\{p_1(z), p_2(z), \ldots, p_l(z)\}$ as the polarization behavior of $K$; the scaling exponent of $K$ is completely determined by its polarization behavior.
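As a concrete instance of the polarization behavior, consider Arikan's $2 \times 2$ kernel. The sketch below assumes the common convention $x = uG$ with successive-cancellation decoding in the order $u_1, u_2$; the abstract itself does not fix a convention, so the indexing is an assumption.

```latex
\[
G = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \qquad
p_1(z) = 2z - z^2, \qquad
p_2(z) = z^2 .
\]
```

Both erasure probabilities are integer polynomials of degree at most $l = 2$, and $p_1(z) + p_2(z) = 2z$, which reflects the fact that an invertible kernel preserves the total capacity of the $l$ channel copies.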
We show that the polarization behavior can be characterized in terms of a nested chain of linear codes, $\{0\} = C_0 \subset C_1 \subset \cdots \subset C_{l-1} \subset C_l = \{0,1\}^l$, and use this nested chain of codes to prove that computing the polarization behavior is NP-hard. We further prove that an arbitrary $l \times l$ polarization kernel $K$ can be transformed into a lower-triangular form without altering its polarization behavior. We then use this result to answer the following question: what is the smallest value of $l$ for which Arikan's scaling exponent $\mu(G) = 3.627$ can be improved? We show that $\mu(K) \ge 3.627$ for all $l \times l$ kernels with $l \le 7$. On the other hand, we explicitly construct an $8 \times 8$ matrix $K_8$ with $\mu(K_8) = 3.577$ (and prove that it is optimal for $l \le 8$). We extend our construction of $K_8$ into a general heuristic design method. Guided by this design method, we employ the coset structure of Reed-Muller codes and bent functions in order to explicitly construct a $16 \times 16$ kernel $K_{16}$ with $\mu(K_{16}) = 3.356$. We conjecture that this is optimal for $l \le 16$.
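For small $l$, the polarization behavior of a kernel can be computed by brute force over all $2^l$ erasure patterns. The following is a minimal sketch and not the method of the paper: it assumes the convention $x = uK$ with decoding order $u_1, \ldots, u_l$, and uses the fact that $u_i$ is recoverable from the unerased positions exactly when row $i$ of $K$, restricted to those positions, lies outside the span of the restricted rows $i+1, \ldots, l$. The function names `gf2_rank` and `polarization_behavior` are hypothetical.

```python
from math import comb


def gf2_rank(vectors):
    """Rank over GF(2) of bit vectors given as non-negative integers."""
    pivots = {}  # leading-bit position -> reduced vector with that leading bit
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]
    return len(pivots)


def polarization_behavior(kernel):
    """Return [p_1, ..., p_l], each a list of integer coefficients in
    ascending powers of z (index k holds the coefficient of z**k).

    Assumed convention (not fixed by the abstract): x = u K, and u_i is
    decoded with u_1..u_{i-1} known and u_{i+1}..u_l unknown.
    """
    l = len(kernel)
    rows = [sum(bit << j for j, bit in enumerate(row)) for row in kernel]
    # counts[i][w] = number of weight-w erasure patterns that erase u_{i+1}
    counts = [[0] * (l + 1) for _ in range(l)]
    for kept in range(1 << l):  # bitmask of unerased output positions
        w = l - bin(kept).count("1")
        # ranks[i] = rank of rows i..l-1 restricted to the kept positions
        ranks = [gf2_rank([r & kept for r in rows[i:]]) for i in range(l + 1)]
        for i in range(l):
            if ranks[i] == ranks[i + 1]:  # row i adds nothing: u_i is erased
                counts[i][w] += 1
    # Expand sum_w counts[w] * z^w * (1 - z)^(l - w) into monomial coefficients.
    polys = []
    for c in counts:
        coeffs = [0] * (l + 1)
        for w, a in enumerate(c):
            for k in range(l - w + 1):
                coeffs[w + k] += a * comb(l - w, k) * (-1) ** k
        polys.append(coeffs)
    return polys


if __name__ == "__main__":
    arikan = [[1, 0], [1, 1]]
    print(polarization_behavior(arikan))
```

The running time is exponential in $l$, which is consistent with the hardness result stated above and limits this approach to small kernels. For any invertible kernel the returned polynomials should sum to $l z$, which gives a quick sanity check.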