

The key issue in RGB-T tracking is obtaining an effective multimodal target representation by exploiting complementary information from the RGB and thermal infrared (TIR) modalities. Previous methods that fuse templates or perform bidirectional search-template interaction can diminish the target representation, since noise from both templates and search regions contaminates the fused features. Meanwhile, directly fusing search features alone, without any interaction with templates, cannot fully exploit target-relevant contextual information. To mitigate these issues, we present UCTrack, which fuses complementary multimodal search features conditioned on undisturbed RGB and TIR template features. Specifically, we design a Unidirectional Cross-modal Fusion (UCF) module that prunes the unnecessary template-to-search cross-modal interaction, minimizing the influence of background noise on the templates, and mutually enhances the RGB and TIR search features with target-relevant information through multimodal spatial fusion. This module is seamlessly integrated into multiple layers of a ViT backbone to perform joint feature extraction and cross-modal fusion for RGB-T tracking. Benefiting from the UCF module, UCTrack represents multimodal target features effectively and accurately, without unnecessary template-to-search interaction or direct template fusion; to the best of our knowledge, this is the first unidirectional cross-modal fusion paradigm for RGB-T tracking. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves state-of-the-art performance.
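To make the described data flow concrete, the PyTorch sketch below shows one plausible realization of a unidirectional fusion block under our own assumptions; it is not the paper's actual implementation. The class name `UCFSketch`, the shared attention weights, the embedding size, and the token counts are all illustrative. In this reading of the design, search tokens of each modality query the other modality's search tokens (multimodal spatial fusion) together with their own read-only template tokens (conditioning on undisturbed templates), while template tokens are never updated and never participate in cross-modal attention. In UCTrack, such a block would presumably be interleaved between ViT backbone layers.

```python
import torch
import torch.nn as nn

class UCFSketch(nn.Module):
    """Illustrative UCF-style block (hypothetical, not the paper's code).

    Search tokens of each modality are enhanced by attending to the other
    modality's search tokens and to their own read-only template tokens;
    template tokens are passed through unchanged, so no flow ever updates
    the templates and templates never join cross-modal interaction.
    """

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def _enhance(self, search, own_template, other_search):
        # Queries come only from search tokens; keys/values combine the
        # undisturbed own-modality template with the other modality's
        # search features (the assumed multimodal spatial fusion step).
        kv = torch.cat([own_template, other_search], dim=1)
        fused, _ = self.attn(self.norm(search), kv, kv)
        return search + fused  # residual update of search tokens only

    def forward(self, rgb_t, rgb_s, tir_t, tir_s):
        # Templates are returned untouched (kept "undisturbed").
        rgb_s_out = self._enhance(rgb_s, rgb_t, tir_s)
        tir_s_out = self._enhance(tir_s, tir_t, rgb_s)
        return rgb_t, rgb_s_out, tir_t, tir_s_out


# Toy usage: batch of 2, 64 template tokens, 256 search tokens, dim 768.
ucf = UCFSketch()
rgb_t, tir_t = torch.randn(2, 64, 768), torch.randn(2, 64, 768)
rgb_s, tir_s = torch.randn(2, 256, 768), torch.randn(2, 256, 768)
outs = ucf(rgb_t, rgb_s, tir_t, tir_s)
assert outs[0] is rgb_t  # template features are untouched by the fusion
```

A design note on this sketch: restricting queries to search tokens is what makes the interaction unidirectional, since attention outputs only ever overwrite search features; whether the real UCF module conditions on templates this way, or through a different mechanism, is an assumption here.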