Polynomial multiplication over rings is a significant bottleneck of ring learning with errors (RLWE)-based encryption. Three algorithms are widely used to speed it up: the number theoretic transform (NTT), schoolbook multiplication, and Toom-Cook. Compared with schoolbook multiplication and the NTT, Toom-Cook achieves a better trade-off between performance and flexibility. However, Toom-Cook postprocessing still contains many redundant steps and calculations. We therefore propose an efficient, compressed, and fused Toom-Cook postprocessing algorithm that reduces the number of postprocessing steps and removes at least 33.33% of the postprocessing arithmetic operations. We also propose a highly reconfigurable and efficient Toom-Cook-based polynomial multiplier (TCPM) to speed up polynomial multiplication over rings. In TCPM, a high-throughput and efficient heterogeneous processing element (PE) array exploits the parallelism of Toom-Cook, and the compressed algorithm allows the PE array for postprocessing to be scaled down. In addition, equipped with a reconfigurable evaluation module, a flexible polynomial data storage module, and a universal PE array, TCPM can efficiently map and execute Toom-Cook-2, Toom-Cook-3, and Toom-Cook-4 on a unified hardware architecture. Implemented on the Xilinx VC709 field-programmable gate array (FPGA) platform, TCPM performs a Toom-Cook-4-based 256 × 256 polynomial multiplication over rings, with a modulus that is either a power of two or a prime, every 3.28 μs at a 360-MHz clock frequency. It achieves a 2.47× to 50.11× speedup compared with previous designs.
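
For readers unfamiliar with the structure of Toom-Cook, the following Python sketch illustrates its three phases (evaluation, pointwise multiplication, and postprocessing by interpolation and recomposition) on the simplest variant, Toom-Cook-2. It is a minimal software sketch only: it does not reproduce the compressed and fused postprocessing algorithm or the TCPM hardware mapping. It assumes the ring Z_q[x]/(x^n + 1), which is common in RLWE schemes but is not specified above, and all function names are hypothetical.

```python
def schoolbook_mul(a, b):
    """Plain O(n^2) product of two coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def toom_cook_2_mul(a, b):
    """One level of Toom-Cook-2 over the integers for equal, even-length inputs."""
    n = len(a)
    assert len(b) == n and n % 2 == 0
    h = n // 2
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]

    # Evaluation at the points 0, 1, and infinity.
    a_sum = [x + y for x, y in zip(a0, a1)]
    b_sum = [x + y for x, y in zip(b0, b1)]

    # Pointwise multiplication: three half-size products instead of four.
    p0 = schoolbook_mul(a0, b0)
    p1 = schoolbook_mul(a_sum, b_sum)
    p_inf = schoolbook_mul(a1, b1)

    # Postprocessing, part 1: interpolation of the middle coefficient block.
    mid = [x - y - z for x, y, z in zip(p1, p0, p_inf)]

    # Postprocessing, part 2: recomposition with overlapping additions.
    out = [0] * (2 * n - 1)
    for i in range(2 * h - 1):
        out[i] += p0[i]
        out[i + h] += mid[i]
        out[i + 2 * h] += p_inf[i]
    return out


def reduce_negacyclic(c, n, q):
    """Reduce a product modulo x^n + 1 and modulo q (assumed ring)."""
    out = [0] * n
    for i, ci in enumerate(c):
        if i < n:
            out[i] += ci
        else:
            out[i - n] -= ci  # x^n is congruent to -1
    return [x % q for x in out]


def ring_mul(a, b, q):
    """Multiply a and b in Z_q[x]/(x^n + 1) using one Toom-Cook-2 level."""
    return reduce_negacyclic(toom_cook_2_mul(a, b), len(a), q)


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [5, 6, 7, 8]
    print(ring_mul(a, b, 17))
```

Because the intermediate products are kept as exact integers and reduced modulo q only at the end, the sketch works for a power-of-two or a prime modulus alike. Toom-Cook-3 and Toom-Cook-4 follow the same three-phase pattern with more evaluation points and a longer interpolation sequence, which is where postprocessing cost grows.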