Haptic technology enables robots to touch objects and understand the interactions between objects in the real world. Advanced haptic sensing systems can not only measure the pressure, temperature, and stiffness of touched objects, but also help robots avoid destructive operations and assist in navigation and posture control. To interact smoothly with different types of objects, a haptic system needs haptic object recognition methods that provide effective haptic perception. However, compared with RGB images, haptic images collected by optical haptic sensors are similar in appearance, which makes conventional convolutional neural networks (e.g., ResNet and VGG) ineffective. Therefore, inspired by the popular attention mechanism and multi-scale strategies, we propose a cross-scale attention-based haptic object recognition network for object-robot interaction. On the one hand, we design a cross-scale attention module within a convolutional neural network to capture spatial contextual features. On the other hand, we design a learnable bilinear fusion strategy to integrate these spatial contextual features with the original haptic features, so as to effectively discriminate between haptic images. Experimental results on the ViTac dataset demonstrate the effectiveness of our approach.
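The two components named above can be illustrated with a minimal NumPy sketch. It is not the network proposed in this paper: the abstract does not specify the layer design, so every detail here is an assumption made for illustration. Specifically, the coarse scale is produced by 2x average pooling, fine-scale positions act as queries over coarse-scale keys and values, and the bilinear fusion operates on globally pooled feature vectors through a weight tensor `w_b`. All function names, shapes, and the random weights are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool2x(feat):
    # Downsample a (C, H, W) map by 2 with average pooling (H and W even).
    c, h, w = feat.shape
    return feat.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def cross_scale_attention(feat, w_q, w_k, w_v):
    # Assumed design: each fine-scale position attends to all coarse-scale
    # positions. feat is (C, H, W); w_q, w_k, w_v are (C, C) projections.
    c, h, w = feat.shape
    fine = feat.reshape(c, -1).T                    # (H*W, C)
    coarse = avg_pool2x(feat).reshape(c, -1).T      # (H*W/4, C)
    q, k, v = fine @ w_q, coarse @ w_k, coarse @ w_v
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)   # (H*W, H*W/4)
    context = attn @ v                              # (H*W, C)
    return context.T.reshape(c, h, w), attn


def bilinear_fusion(feat, context, w_b):
    # Assumed design: pool both maps to vectors x and z, then compute
    # out[d] = x^T W_b[d] z with a weight tensor w_b of shape (D, C, C).
    x = feat.mean(axis=(1, 2))
    z = context.mean(axis=(1, 2))
    return np.einsum("i,dij,j->d", x, w_b, z)


# Toy usage with random values standing in for learned weights.
rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 5
feat = rng.standard_normal((C, H, W))
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))
w_b = rng.standard_normal((D, C, C))

context, attn = cross_scale_attention(feat, w_q, w_k, w_v)
fused = bilinear_fusion(feat, context, w_b)
print(context.shape, attn.shape, fused.shape)
```

In a trained model the projection matrices and `w_b` would be learned end to end, and the fused vector would feed a classifier over the haptic object categories. The pooled bilinear form is only one possible choice; a per-position bilinear product would be an equally plausible reading of "learnable bilinear fusion".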