Voice recognition by deep transfer learning and vision transformers to secure voice authentication

Nayem Uddin Prince 1 *, Abdullah Al Masum 2, Salman Mohammad Abdullah 3 and Touhid Bhuiyan 4

1 Information Technology (2022), Washington University of Science and Technology, USA.
2 Information Technology (2024), Westcliff University, Irvine, USA.
3 Information Technology (2023), Washington University of Science and Technology, USA.
4 Cyber Security, School of Information Technology, Washington University of Science and Technology, Virginia, USA.
 
Research Article
World Journal of Advanced Research and Reviews, 2024, 23(03), 1365–1377
Article DOI: 10.30574/wjarr.2024.23.3.2781
 
Publication history: 
Received on 02 August 2024; revised on 10 September 2024; accepted on 12 September 2024
 
Abstract: 
Voice recognition is crucial for securing personal devices and financial transactions, yet achieving high accuracy and robustness in voice authentication remains challenging because of speaker and environmental variability. Recent advances in deep learning, particularly transfer learning and Vision Transformers, offer a way to strengthen voice recognition systems. This study applies deep transfer learning techniques, including Vision Transformers (ViT), VGG16, and a customized Convolutional Neural Network (CNN), to improve the accuracy and security of voice authentication. The objective is to evaluate and compare the voice recognition and authentication accuracy of these approaches. The experiment included 3000 voice samples, evenly split between 1500 male and 1500 female participants. The dataset was used to train a Vision Transformer, VGG16 with transfer learning, and a custom CNN, and the models were assessed on their accuracy in identifying and authenticating voice samples. VGG16 achieved the highest voice recognition accuracy, at 95%; the Vision Transformer and custom CNN performed satisfactorily but were less accurate. The transfer-learning-based VGG16 model was therefore the most accurate voice authentication model studied. These results suggest that the security and reliability of voice recognition systems can be enhanced through the use of deep learning techniques.
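The image-based models named in the abstract (ViT, VGG16, custom CNN) operate on a 2-D time-frequency representation of each voice sample rather than on the raw waveform. As a minimal, self-contained sketch of that front end, the snippet below frames a waveform, applies a Hann window, and takes the log power spectrum; the frame length, hop size, and 16 kHz rate are illustrative assumptions, not values from the paper, and a real pipeline would typically use a library such as librosa to compute MFCCs or mel-spectrograms.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    """Log power spectrogram: a stand-in for the MFCC/spectrogram
    front end that turns audio into a CNN/ViT-ready 2-D array."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # per-frame spectrum
    return np.log(power + 1e-10)                       # avoid log(0)

# One second of synthetic 16 kHz "audio" stands in for a voice sample.
rng = np.random.default_rng(0)
spec = log_spectrogram(rng.standard_normal(16000))
print(spec.shape)  # (98, 201): 98 frames x 201 frequency bins
```

Arrays of this shape (resized and stacked to three channels) are what a pretrained VGG16 or ViT backbone would consume during transfer learning.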
 
Keywords: 
Voice recognition; VGG16; Custom CNN; ViT; Honey trap; Webform; Cybercrime; Vision Transformer; MFCCs
 