

Multimodal learning has become a critical focus in computer science, particularly for robotic perception, where integrating diverse sensory data such as vision, audio, and tactile information is essential for interpreting complex environments. This study proposes a self-supervised multimodal learning framework that integrates spatiotemporal transformers and cross-modal attention mechanisms to address challenges in temporal modeling and feature fusion. The spatiotemporal transformer captures sequential dependencies within each modality, while the cross-modal attention module dynamically assigns importance weights across modalities, enabling robust feature integration. Unlike traditional supervised approaches, the proposed framework eliminates the need for extensive labeled data, improving scalability and adaptability. Experimental results on benchmark datasets demonstrate that the framework significantly outperforms CNN-, LSTM-, and state-of-the-art Transformer-based models in accuracy, F1 score, and robustness, particularly under noisy conditions or with incomplete modalities. Ablation studies validate the contributions of the transformer and attention modules, and qualitative analysis highlights the model’s ability to adaptively prioritize relevant features. This research advances self-supervised multimodal learning and provides a scalable, efficient, and robust solution for real-world robotic systems, with potential for further optimization to support additional modalities and real-time processing.
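
To make the cross-modal attention idea concrete, the sketch below shows one minimal way such a fusion module could be written in PyTorch: each modality's temporal features are pooled, a learned scoring head assigns an importance weight to every modality, and the weighted embeddings are summed into a fused representation. The module name, pooling strategy, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Illustrative sketch of attention-weighted modality fusion (hypothetical design)."""

    def __init__(self, dim: int):
        super().__init__()
        # Shared scoring head that maps a pooled modality embedding to a scalar weight.
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, modality_feats: list[torch.Tensor]) -> torch.Tensor:
        # modality_feats: list of (batch, time, dim) tensors, one per modality.
        # Mean-pool over time, then stack modalities: (batch, num_modalities, dim).
        pooled = torch.stack([f.mean(dim=1) for f in modality_feats], dim=1)
        # Softmax over the modality axis yields per-modality importance weights.
        weights = torch.softmax(self.score(pooled), dim=1)  # (batch, num_modalities, 1)
        # Weighted sum across modalities produces the fused feature vector.
        return (weights * pooled).sum(dim=1)  # (batch, dim)


# Usage with three equal-width streams (vision, audio, tactile); shapes are assumed.
vision, audio, tactile = (torch.randn(4, 16, 256) for _ in range(3))
fusion = CrossModalAttentionFusion(dim=256)
fused = fusion([vision, audio, tactile])
print(fused.shape)  # torch.Size([4, 256])
```

In a full pipeline of the kind the abstract describes, the per-modality inputs to such a module would come from spatiotemporal transformer encoders rather than raw features, and the learned weights would let the model down-weight a noisy or missing modality at fusion time.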