Travel time estimation (TTE) is a fundamental and challenging problem in navigation and travel planning. Although much effort has been devoted to this task, most previous research has focused on extracting useful route features to improve estimation accuracy. We argue that the key issue in TTE is how to handle the rich spatiotemporal information underlying a route and how to model the multi-faceted factors that affect travel time. Along this line, we propose a multi-faceted route representation learning framework that divides a route into three sequences: a trajectory sequence consisting of GPS coordinates, which describes spatial information; an attribute sequence, which encodes the features of each road segment; and a semantic sequence consisting of road segment IDs, which captures the context information of the route. We then design a sequential learning module and a transformer encoder to obtain representations of the three sequences for each route. Finally, we fuse the multi-faceted route representations and introduce a self-supervised learning module to improve the generalization of the final representation. Experiments on two real-world datasets demonstrate that our method provides more accurate travel time estimates than the baselines, and that each of the multi-faceted route representations contributes to the improvement in estimation accuracy.
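To make the three-sequence view of a route concrete, the following is a minimal illustrative sketch, not the framework's actual implementation. The `RoadSegment` class, its fields (`length_m`, `speed_limit_kmh`), the function name `split_route`, and the sample values are all hypothetical and chosen only for illustration; the abstract does not specify which segment attributes are used or how GPS points are associated with segments.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RoadSegment:
    """Hypothetical container for one road segment of a route."""
    segment_id: int                         # ID used in the semantic sequence
    length_m: float                         # assumed attribute (not from the source)
    speed_limit_kmh: float                  # assumed attribute (not from the source)
    gps_points: List[Tuple[float, float]]   # (lat, lon) samples on this segment


def split_route(route: List[RoadSegment]):
    """Divide a route into trajectory, attribute, and semantic sequences."""
    # Trajectory sequence: GPS coordinates in travel order (spatial information).
    trajectory = [point for seg in route for point in seg.gps_points]
    # Attribute sequence: one feature tuple per road segment.
    attributes = [(seg.length_m, seg.speed_limit_kmh) for seg in route]
    # Semantic sequence: road segment IDs, capturing route context.
    semantic = [seg.segment_id for seg in route]
    return trajectory, attributes, semantic


# Made-up example route with two segments.
route = [
    RoadSegment(101, 250.0, 50.0, [(39.90, 116.40), (39.91, 116.40)]),
    RoadSegment(205, 400.0, 60.0, [(39.91, 116.41)]),
]
trajectory, attributes, semantic = split_route(route)
```

Under these assumptions, each of the three sequences would then be passed to its own encoder before the resulting representations are fused.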