The rise of deep learning has significantly improved the recognition rate of voiceprint recognition technology, exemplified by the X-vector architecture, which uses Time Delay Neural Networks (TDNNs) to map variable-length speech segments to fixed-length embeddings. However, the applicability of current popular voiceprint recognition models degrades significantly in noisy environments. To address this issue, this study examines the limitations of the X-vector architecture and proposes an improved TDNN-based speaker verification model. The model incorporates Long Short-Term Memory (LSTM) layers to model the input speech features while retaining information from previous time steps. As in the ECAPA-TDNN model, we introduce a one-dimensional Res2Net module with a channel attention mechanism (SE-Res2Block) at the frame level; it strengthens channel correlation and rescales channels according to recorded global properties, thereby extending the temporal context of the frame layers. Finally, multi-layer feature aggregation enhances the model's representation capacity. The results show that the system reaches a recognition accuracy of 96.32% in a 15 dB noise environment and outperforms the widely used ECAPA-TDNN model, demonstrating good accuracy and robustness.
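To illustrate the SE-Res2Block idea described above, the sketch below implements a one-dimensional Res2Net-style convolutional block with squeeze-and-excitation channel attention in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names (`SEBlock1d`, `SERes2Block`), the bottleneck size, and the `scale` hyperparameter are hypothetical choices for demonstration; the actual ECAPA-TDNN-style block also includes 1x1 convolutions and batch normalization that are omitted here for brevity.

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Channel attention: squeeze (global pooling over time), then excite
    (rescale each channel by a learned weight in [0, 1])."""
    def __init__(self, channels, bottleneck=32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (batch, channels, time)
        s = x.mean(dim=2, keepdim=True)      # squeeze: global temporal mean
        return x * self.fc(s)                # excite: per-channel rescaling

class SERes2Block(nn.Module):
    """Res2Net-style 1-D block: the channel dimension is split into `scale`
    groups processed hierarchically, widening the effective temporal context,
    followed by SE channel attention and a residual connection."""
    def __init__(self, channels, scale=4, kernel_size=3, dilation=1):
        super().__init__()
        assert channels % scale == 0
        width = channels // scale
        self.scale = scale
        pad = dilation * (kernel_size - 1) // 2
        self.convs = nn.ModuleList(
            nn.Conv1d(width, width, kernel_size, padding=pad, dilation=dilation)
            for _ in range(scale - 1)
        )
        self.se = SEBlock1d(channels)

    def forward(self, x):
        chunks = torch.chunk(x, self.scale, dim=1)
        out = [chunks[0]]                    # first split passes through
        prev = None
        for i, conv in enumerate(self.convs):
            # each later split also sees the previous split's output,
            # so deeper splits cover a larger temporal receptive field
            inp = chunks[i + 1] if prev is None else chunks[i + 1] + prev
            prev = torch.relu(conv(inp))
            out.append(prev)
        y = torch.cat(out, dim=1)
        return self.se(y) + x                # SE rescaling plus residual
```

A block like this preserves the input shape `(batch, channels, time)`, so several of them can be stacked at the frame level and their outputs concatenated for the multi-layer aggregation step mentioned above.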