

The rise of deep learning has substantially improved the accuracy of voiceprint (speaker) recognition, as exemplified by the X-vector architecture, which uses Time Delay Neural Networks (TDNNs) to map variable-length speech segments to fixed-length embeddings. However, currently popular speaker recognition models degrade markedly in noisy environments. To address this issue, this study analyzes the limitations of the X-vector architecture and proposes an improved TDNN-based speaker verification model. The model incorporates Long Short-Term Memory (LSTM) layers to model the input speech features while retaining information from previous time steps. Following the ECAPA-TDNN design, we introduce a one-dimensional Res2Net module with a channel attention mechanism (SE-Res2Block) at the frame level, which strengthens inter-channel correlation and rescales channels according to recorded global properties, thereby extending the temporal context of the frame layers. Finally, multi-layer feature aggregation enhances the model's representational capacity. Experiments show that the system achieves a recognition accuracy of 96.32% under 15 dB noise and outperforms the widely used ECAPA-TDNN model, demonstrating good accuracy and robustness.
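To make the channel-attention step concrete, the sketch below shows the squeeze-excitation (SE) rescaling at the heart of an SE-Res2Block: channel statistics are pooled over time ("squeeze"), passed through a small bottleneck ("excitation"), and the resulting per-channel gates rescale the frame-level features. This is a minimal NumPy illustration of the general SE mechanism, not the authors' implementation; the weight shapes, bottleneck size, and random inputs are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_rescale(x, w1, b1, w2, b2):
    """Squeeze-Excitation channel rescaling on frame-level features.

    x : (channels, time) feature map from the frame layers.
    w1, b1 : bottleneck projection (squeeze -> excitation).
    w2, b2 : expansion back to per-channel gates.
    """
    s = x.mean(axis=1)                 # squeeze: global average over time, per channel
    z = np.maximum(0.0, w1 @ s + b1)   # excitation: bottleneck + ReLU
    g = sigmoid(w2 @ z + b2)           # per-channel gates in (0, 1)
    return x * g[:, None]              # rescale each channel by its gate

# Toy example with assumed sizes: 8 channels, 50 frames, bottleneck of 4.
rng = np.random.default_rng(0)
C, T, B = 8, 50, 4
x = rng.standard_normal((C, T))
w1, b1 = rng.standard_normal((B, C)), np.zeros(B)
w2, b2 = rng.standard_normal((C, B)), np.zeros(C)
y = se_rescale(x, w1, b1, w2, b2)
assert y.shape == x.shape              # rescaling preserves the feature-map shape
```

Because the gates are computed from statistics pooled over the whole utterance, each frame's channels are modulated by global context, which is how the SE step extends the effective temporal context of the frame layers.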