DeepSense: An Explainable AI Multi-Modal Platform for Deepfake Detection Across Image, Audio, and Video
Keywords:
AI

Abstract
The rapid proliferation of generative AI has given rise to highly realistic synthetic media, commonly known as
deepfakes, posing severe threats to personal identity, democratic processes, and digital trust. Existing detection
systems are predominantly uni-modal and opaque, offering little forensic evidence to support their binary
classifications. This paper presents DeepSense, a comprehensive, explainable AI-powered multi-modal deepfake
detection platform capable of concurrently analyzing static images, digital audio recordings, and video files. The
system integrates XceptionNet for image analysis, a hybrid XceptionNet+LSTM for video, and a CNN-BiLSTM
architecture for audio, achieving detection accuracies of 90.83%, 95.25%, and 98.32% respectively. Explainable
AI (XAI) techniques -- specifically Gradient-weighted Class Activation Mapping (Grad-CAM) for visual media and
high-resolution spectral feature visualization for audio -- are deeply integrated into the inference pipeline. The
Google Gemini 3.1 Flash LLM is employed to translate raw algorithmic outputs into natural-language forensic
narratives. DeepSense is deployed via an interactive Streamlit web interface, democratizing access to digital
forensics for non-technical users, journalists, and legal professionals.
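
To make the explainability claim concrete, the sketch below shows the standard Keras Grad-CAM pattern applied to an Xception backbone, which is one way the heatmaps described in the abstract could be produced. The ImageNet weights and the layer name are stand-ins for the paper's fine-tuned deepfake classifier, whose exact pipeline is not reproduced here.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import Xception
from tensorflow.keras.applications.xception import preprocess_input

# Stand-in backbone: the paper's fine-tuned Xception deepfake classifier is not public here.
model = Xception(weights="imagenet")
last_conv = model.get_layer("block14_sepconv2_act")  # final convolutional activation in Xception

# Maps an input image to (conv feature maps, class predictions)
grad_model = tf.keras.Model(model.input, [last_conv.output, model.output])

def grad_cam(image_299x299x3, class_index=None):
    """Return a Grad-CAM heatmap highlighting regions that drive the prediction."""
    x = preprocess_input(image_299x299x3[np.newaxis].astype("float32"))
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(x)
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_maps)            # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))    # global-average-pool the gradients
    cam = tf.reduce_sum(conv_maps[0] * weights, axis=-1)
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)  # keep positive evidence, normalize to [0, 1]
    return cam.numpy()  # upscale and overlay on the input image for display
```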
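Similarly, the following is a minimal, hypothetical Keras sketch of a CNN-BiLSTM audio classifier operating on log-mel spectrograms, illustrating the general architecture named in the abstract. The input shape, layer widths, and training configuration are illustrative assumptions, not the paper's reported design.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(time_frames=300, n_mels=64):
    # Assumed input: log-mel spectrogram of shape (time_frames, n_mels, 1)
    inp = layers.Input(shape=(time_frames, n_mels, 1))
    # CNN front end: capture local time-frequency artifacts (e.g., vocoder traces)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Collapse frequency and channel axes so each time step becomes one feature vector
    x = layers.Reshape((time_frames // 4, (n_mels // 4) * 64))(x)
    # BiLSTM: model long-range temporal consistency of the recording
    x = layers.Bidirectional(layers.LSTM(128))(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # probability the clip is synthetic
    return models.Model(inp, out)

model = build_cnn_bilstm()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```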
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.