

Skeleton joint data, as a representation closely tied to human actions, have received considerable attention in human behavior recognition in recent years. However, current skeleton-based recognition methods face several challenges, including limited accuracy and reliance on a single modality. To address these challenges, a multi-stream network for behavior recognition based on multiple skeleton joints is proposed. First, a top-down approach is used to extract the coordinates of 133 whole-body skeleton keypoints, which are then encoded as heatmaps. In the proposed PoseRgb model, RGB frames are fed into the RGB channel to capture spatial features, while the heatmap representations of the skeleton data are fed into the Pose channel to capture temporal features. Exploiting the fusion-friendly structure of the heatmaps, the two channels are merged through lateral connections to produce the final recognition result. Experimental results show that the proposed method reaches 99.05% and 86.1% accuracy on the widely used UCF101 and HMDB51 datasets, improvements of 5% and 15%, respectively, over the Temporal Shift Module (TSM) network. The proposed method effectively combines the spatial and temporal information in videos, yielding more accurate behavior recognition.
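To make the heatmap encoding concrete, the following is a minimal sketch, not the authors' exact implementation, of how per-frame keypoint coordinates can be rendered as per-joint Gaussian heatmaps; the grid size, Gaussian sigma, and array layout are assumptions for illustration.

```python
import numpy as np

def joints_to_heatmaps(joints, img_h, img_w, sigma=2.0):
    """Render one frame's keypoints as per-joint Gaussian heatmaps.

    joints : (K, 3) array of (x, y, confidence) per keypoint
             (K = 133 for whole-body pose in this setting).
    Returns a (K, img_h, img_w) float32 array; each channel is a
    Gaussian centered at the joint, scaled by its confidence.
    """
    K = joints.shape[0]
    heatmaps = np.zeros((K, img_h, img_w), dtype=np.float32)
    ys, xs = np.mgrid[0:img_h, 0:img_w]
    for k, (x, y, conf) in enumerate(joints):
        if conf <= 0:  # skip undetected joints
            continue
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmaps[k] = conf * g
    return heatmaps

# Example: 133 illustrative keypoints rendered on a 64x64 grid.
dummy_joints = np.random.rand(133, 3) * [64, 64, 1]
hm = joints_to_heatmaps(dummy_joints, 64, 64)
print(hm.shape)  # (133, 64, 64)
```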
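The lateral-connection fusion between the two channels can likewise be sketched as follows; this is an illustrative dual-pathway block under assumed channel widths and matched feature resolutions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class TwoStreamBlock(nn.Module):
    """One stage of a dual-pathway network with a lateral connection.

    Channel widths and kernel sizes are illustrative assumptions.
    """
    def __init__(self, rgb_ch=64, pose_ch=32):
        super().__init__()
        self.rgb_conv = nn.Conv3d(rgb_ch, rgb_ch, kernel_size=3, padding=1)
        self.pose_conv = nn.Conv3d(pose_ch, pose_ch, kernel_size=3, padding=1)
        # Lateral connection: project Pose features to the RGB width
        # and add them into the RGB pathway.
        self.lateral = nn.Conv3d(pose_ch, rgb_ch, kernel_size=1)

    def forward(self, rgb_feat, pose_feat):
        rgb_feat = self.rgb_conv(rgb_feat)
        pose_feat = self.pose_conv(pose_feat)
        rgb_feat = rgb_feat + self.lateral(pose_feat)  # fuse the pathways
        return rgb_feat, pose_feat

# Example: RGB features (B, 64, T, H, W) fused with Pose features (B, 32, T, H, W).
block = TwoStreamBlock()
rgb = torch.randn(1, 64, 8, 56, 56)
pose = torch.randn(1, 32, 8, 56, 56)
out_rgb, out_pose = block(rgb, pose)
print(out_rgb.shape, out_pose.shape)
```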