At CHES 2010, Rivain and Prouff (RP) introduced an elegant masking technique to protect the Advanced Encryption Standard (AES) against power analysis attacks. RP masking is provably secure in the probing model, but this solid theoretical underpinning comes at the cost of a massive increase in execution time. In this paper, we describe software optimization methods to accelerate the low-level arithmetic in the field F_{2^8}, which has a significant impact on the overall performance of a masked implementation of the AES. Among these optimizations is an improved technique for table-based multiplication in F_{2^8} that avoids the special treatment of zero values and thereby speeds up the multiplication of masked operands. Furthermore, we introduce a novel exponentiation-based algorithm for inversion in F_{2^8}, which reduces both the overall number of table look-ups and the amount of randomness needed for refreshing masks compared to the original RP inversion. The new inversion also provides some advanced (theoretical) security properties for the composition of gadgets, e.g., Strong Non-Interference (SNI) and Probe-Isolating Non-Interference (PINI). We also describe a prototype implementation of a first-order masked inversion and AES encryption in ARMv7-M assembly language. According to our simulation results, the first-order masked AES has an execution time of about 25,000 clock cycles per block when using a generic Cortex-M3 as the target platform, which is roughly twice as fast as the RP-masked AES assembly implementation presented at EUROCRYPT 2017 by Goudarzi and Rivain.
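
For orientation, the following minimal Python sketch shows the textbook, unmasked baseline that the optimizations above start from: multiplication in F_{2^8} via log/antilog tables, which requires an explicit check for zero operands, and inversion as the exponentiation x^254. It is not the improved multiplication or the new inversion algorithm of this paper. The AES reduction polynomial 0x11B and the generator 0x03 are standard; the function names, the doubled antilog table, and the particular four-multiplication addition chain are assumptions chosen for illustration.

```python
# Textbook (unmasked) arithmetic in F_{2^8} with the AES polynomial
# x^8 + x^4 + x^3 + x + 1 (0x11B). Illustrative baseline only.

def xtime(a):
    """Multiply a field element by x (0x02) modulo the AES polynomial."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a

# Antilog table EXP[i] = g^i and log table LOG[g^i] = i for generator g = 0x03.
# EXP is stored twice in a row so that LOG[a] + LOG[b] needs no reduction mod 255.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    EXP[_i + 255] = _x
    LOG[_x] = _i
    _x = xtime(_x) ^ _x  # multiply by 0x03 = x + 1

def gf_mul(a, b):
    """Table-based multiplication; zero has no logarithm, hence the special case."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_sq(a):
    """Squaring, expressed through the table-based multiplication."""
    return gf_mul(a, a)

def gf_inv(a):
    """Inversion as a^254, using an addition chain with four multiplications.
    By convention, the 'inverse' of 0 is 0, as in the AES S-box."""
    x2 = gf_sq(a)                            # a^2
    x3 = gf_mul(x2, a)                       # a^3
    x12 = gf_sq(gf_sq(x3))                   # a^12
    x15 = gf_mul(x12, x3)                    # a^15
    x240 = gf_sq(gf_sq(gf_sq(gf_sq(x15))))   # a^240
    x252 = gf_mul(x240, x12)                 # a^252
    return gf_mul(x252, x2)                  # a^254
```

The data-dependent branch in `gf_mul` is the kind of zero handling that becomes awkward once operands are masked, which is why avoiding it pays off in a masked implementation. In a masked inversion, every multiplication in the chain is replaced by a secure multiplication gadget, so both the number of multiplications and the number of table look-ups per multiplication directly determine the cost.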