Deepfake Image Detection: From CNN to Vision Transformer

Chaitali Charandas Daware; V. K. Shandilya; N. P. Mohod

doi:10.30574/wjarr.2026.30.2.1169

Chaitali Charandas Daware *, V. K. Shandilya and N. P. Mohod

Department of Computer Science and Engineering, Sipna College of Engineering and Technology Amravati, Maharashtra, India.

Research Article

World Journal of Advanced Research and Reviews, 2026, 30(02), 1140-1151

DOI url: https://doi.org/10.30574/wjarr.2026.30.2.1169

Publication history

Received on 22 March 2026; revised on 06 May 2026; accepted on 09 May 2026

Abstract

The rapid proliferation of synthetic media, colloquially known as "deepfakes," driven by advanced Generative Adversarial Networks (GANs) and diffusion models, presents a formidable challenge to digital forensics, personal privacy, and societal trust. While Convolutional Neural Networks (CNNs) have historically served as the cornerstone for detecting such manipulations, they often generalize poorly to unseen manipulation algorithms and lack robustness against real-world distortions. This paper introduces DeepShield, an industry-grade, full-stack deepfake detection web application powered by a fine-tuned SigLIP2 (Sigmoid Loss for Language-Image Pre-training) vision-language encoder. Unlike traditional CNN-based approaches that rely solely on hierarchical spatial feature extraction, the proposed model uses a transformer-based architecture pre-trained with a sigmoid loss, enabling it to capture global semantic context and subtle texture inconsistencies.
The system was evaluated on the prithivMLmods/OpenDeepfake-Preview dataset, achieving an overall accuracy of 94.44%. The model achieved a precision of 97.18% for the "Fake" class and a recall of 97.34% for the "Real" class. Both figures are lowered by the same error, a real image flagged as fake, so high values indicate few false accusations in forensic scenarios. Furthermore, this research bridges the gap between theoretical modeling and practical application by implementing a user-centric forensic interface featuring an interactive Region of Interest (ROI) selector and temporal video analysis. Comparative analysis reveals that the proposed SigLIP2 model outperforms standard CNN architectures and existing Convolutional Vision Transformer (CViT) benchmarks, offering a robust, scalable solution for digital media authentication.
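To make the reported metrics concrete, the following is a minimal sketch of how per-class precision and recall are derived from predicted labels in a two-class ("Fake" / "Real") setting. The function names and the toy labels are hypothetical and chosen only for illustration. They are not the authors' evaluation code and do not reproduce the paper's figures.

```python
from typing import Dict, Sequence, Tuple


def class_counts(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn


def report(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute overall accuracy plus per-class precision and recall."""
    out: Dict[str, float] = {}
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    out["accuracy"] = correct / len(y_true)
    for cls in ("Fake", "Real"):
        tp, fp, fn = class_counts(y_true, y_pred, cls)
        # Guard against division by zero when a class is never predicted or never present.
        out[f"precision_{cls}"] = tp / (tp + fp) if (tp + fp) else 0.0
        out[f"recall_{cls}"] = tp / (tp + fn) if (tp + fn) else 0.0
    return out


# Toy labels (hypothetical): 4 fake and 6 real images.
y_true = ["Fake"] * 4 + ["Real"] * 6
y_pred = ["Fake", "Fake", "Fake", "Real", "Real", "Real", "Real", "Real", "Real", "Fake"]

metrics = report(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this sketch the single real image predicted as "Fake" counts as a false positive for the "Fake" class and as a false negative for the "Real" class, which is why "Fake" precision and "Real" recall both track the rate of false accusations.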

Keywords

Deepfake Detection; SigLIP2; Vision Transformers; Digital Forensics; Flask; Web Application; Generative Adversarial Networks

Download Article PDF

https://wjarr.com/sites/default/files/fulltext_pdf/WJARR-2026-1169.pdf

How to cite this article

Chaitali Charandas Daware, V. K. Shandilya and N. P. Mohod. Deepfake Image Detection: From CNN to Vision Transformer. World Journal of Advanced Research and Reviews, 2026, 30(02), 1241-1255. Article DOI: https://doi.org/10.30574/wjarr.2026.30.2.1169
