Trustworthy Deepfake Defense — Muhammad Ahmad Amin

№ P1

ABSTRACT

What is the problem and what did we do?

Deepfakes pose a growing threat to consumer security and trust, enabling identity theft, financial fraud, and misinformation across social and mobile platforms. Reliable detection pipelines are essential for trustworthy consumer applications, yet many current systems resize their inputs with conventional operators such as bilinear and bicubic interpolation. These operators behave as low-pass filters and systematically attenuate manipulation-relevant high-frequency (HF) forensic cues during preprocessing.

We propose a proactive two-stage forensic defense framework. The first stage, the Frequency-Aware Information Preserving Network (FA-IPNet), is the first rescaling module designed with an explicit forensic objective: it retains manipulation-sensitive HF cues rather than maximizing perceptual reconstruction quality, at only 0.304K parameters and 2.91M FLOPs. The second stage, the Adaptive Frequency-Enhanced Forgery Detection Network (AFE-FDNet), is a dual-branch detector. Its spatial branch employs retentive vision transformers with distance-decayed dual-axis attention; its frequency branch applies learnable multi-scale spectral band decomposition with per-band windowed attention enhancement and temperature-scaled pyramid cross-band aggregation.

Evaluations on FaceForensics++, Celeb-DF-V2, DFDC, and the diffusion-based DF40 dataset show that FA-IPNet preserves forensic signals better than interpolation-based resizing, and that AFE-FDNet generalizes robustly, including to diffusion-era forgeries under a zero-shot protocol. Together, the two stages establish a robust and trustworthy defense for consumer-facing applications against deepfake-driven security and privacy risks.

№ P2

FRAMEWORK OVERVIEW

Two stages, one pipeline: preserve, then detect.

Fig. 1. Overview of the proposed two-stage deepfake defense framework. Stage 1 (FA-IPNet) preserves forensic HF cues. Stage 2 (AFE-FDNet) extracts complementary evidence through spatial (RVT) and frequency (FET) branches.

© 2026 IEEE. Figure reproduced with permission from IEEE Transactions on Consumer Electronics. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses.

№ P3

METHOD

Stage I preserves evidence; Stage II exploits it.

01

FA-IPNet: Frequency-Aware Information Preserving Network

FA-IPNet decomposes the input into three directional Haar sub-bands (Horizontal, Vertical, Diagonal) via fixed stride-2 convolutions, so the decomposition itself has zero learnable parameters. Bilinear interpolation provides the low-frequency path. Multi-Spectral Channel Attention (MSCA) uses rFFT2 spectral statistics with bounded affine recalibration (α ∈ (−1,1), β ∈ [−1,1]) to selectively emphasize forensically informative channels. Attention-Aware Channel Reduction (AACR) performs joint selection-and-projection from 3C → C channels, and the result is fused with the low-frequency path to produce the HF-preserved output x_out. The whole module uses only 0.304K parameters and 2.91M FLOPs.
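The fixed stride-2 decomposition described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the kernel scaling (±0.5), the mapping of kernels to the names "horizontal" and "vertical", and the 2×2 mean used as a stand-in for the bilinear low-frequency path are all assumptions made for illustration.

```python
# Fixed 2x2 Haar detail kernels; nothing here is learned.
# Which kernel is called "horizontal" vs "vertical" is a naming assumption.
HAAR_KERNELS = {
    "horizontal": ((0.5, 0.5), (-0.5, -0.5)),  # difference between rows
    "vertical": ((0.5, -0.5), (0.5, -0.5)),    # difference between columns
    "diagonal": ((0.5, -0.5), (-0.5, 0.5)),    # checkerboard difference
}

# Assumed stand-in for the low-frequency path: a plain 2x2 mean.
BOX_MEAN = ((0.25, 0.25), (0.25, 0.25))


def stride2_filter(image, kernel):
    """Apply a 2x2 kernel with stride 2 to a 2-D list of floats."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(2) for j in range(2))
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def haar_subbands(image):
    """Return the three directional detail sub-bands at half resolution."""
    return {name: stride2_filter(image, k) for name, k in HAAR_KERNELS.items()}


# A 4x4 pixel-level checkerboard: purely diagonal high-frequency content.
checker = [[float((r + c) % 2) for c in range(4)] for r in range(4)]
bands = haar_subbands(checker)
low = stride2_filter(checker, BOX_MEAN)

print(bands["diagonal"])  # [[-1.0, -1.0], [-1.0, -1.0]]
print(low)                # [[0.5, 0.5], [0.5, 0.5]]
```

In this toy case the averaged path flattens the checkerboard to a constant 0.5, while the diagonal sub-band still carries the pattern. This is the kind of information a separate high-frequency path is meant to keep.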

02

AFE-FDNet: Spatial Branch (RVT)

The spatial branch uses a Retentive Vision Transformer with distance-decayed dual-axis attention, which differs from both Swin's binary window masks and the fixed decay of standard retentive networks. Head-specific log-decay masks (δ_m = log(1 − 2^{−(α+β·m/h)})) provide a principled locality bias, so a single forward pass can model both near-neighbor forensic artifacts (blending seams) and long-range dependencies (global identity consistency).
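The head-specific decay formula can be made concrete with a small sketch. Several things here are assumptions rather than details from the paper: m is read as the head index and h as the number of heads, the values α = 2 and β = 4 are placeholders, and the spatial distance is taken to be the Manhattan distance, which splits into one term per axis.

```python
import math


def head_log_decays(num_heads, alpha, beta):
    """delta_m = log(1 - 2^-(alpha + beta * m / h)) for m = 0 .. h-1."""
    return [
        math.log(1.0 - 2.0 ** -(alpha + beta * m / num_heads))
        for m in range(num_heads)
    ]


def dual_axis_log_mask(height, width, log_decay):
    """Log-space mask over a height x width grid, flattened row-major.

    Entry [p][q] = log_decay * (|row_p - row_q| + |col_p - col_q|),
    i.e. the sum of one decay term per axis (assumed Manhattan distance).
    """
    positions = [(r, c) for r in range(height) for c in range(width)]
    return [
        [log_decay * (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in positions]
        for (r1, c1) in positions
    ]


# Placeholder hyperparameters, chosen only for illustration.
deltas = head_log_decays(num_heads=4, alpha=2.0, beta=4.0)
mask = dual_axis_log_mask(2, 2, deltas[0])

# Multiplicative weight between opposite corners of a 2x2 grid (distance 2).
corner_weight = math.exp(mask[0][3])
```

Under these placeholder values every δ_m is negative, and heads with a larger index have a log-decay closer to zero, so they attend over longer distances. Adding such a mask to the attention logits is equivalent to multiplying the attention weights by a per-head factor raised to the distance.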

03

AFE-FDNet: Frequency Branch (FET)

The frequency branch decomposes the signal into three learnable spectral bands (Low, Mid, High) via strided convolutions with residual upsampling. Each band is independently enhanced by Multi-Dconv Head Transposed Attention (MDTA) and a Gated Depthwise Feedforward Network (GDFN). Cross-band pyramid aggregation (FAM) uses a learnable temperature τ = exp(tmp) to adaptively weight the frequency scales that carry the most manipulation-specific information. This lets the same architecture handle both GAN checkerboard artifacts and diffusion mid-frequency noise.
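A minimal sketch of temperature-scaled band weighting follows. The text only states that τ = exp(tmp) is learnable; dividing the band scores by τ before a softmax, and the names `band_weights` and `aggregate`, are assumptions made for illustration.

```python
import math


def band_weights(scores, tmp):
    """Softmax over per-band scores with temperature tau = exp(tmp).

    Parameterizing tau through exp keeps it positive for any real tmp.
    Dividing by tau (rather than multiplying) is an assumption.
    """
    tau = math.exp(tmp)
    scaled = [s / tau for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(band_features, weights):
    """Weighted sum of equally sized per-band feature vectors."""
    dim = len(band_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, band_features))
        for d in range(dim)
    ]


# Hypothetical scores for the Low, Mid, and High bands.
scores = [0.2, 1.0, 0.5]
sharp = band_weights(scores, tmp=-1.0)  # small tau: peaked weights
soft = band_weights(scores, tmp=1.0)    # large tau: flatter weights
fused = aggregate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], soft)
```

With this convention a smaller tmp concentrates the weight on the highest-scoring band, while a larger tmp spreads it more evenly across bands.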

04

Feature Fusion & Classification

Spatial (128-D) and frequency (128-D) embeddings are concatenated into a 256-D fused vector, passed through a two-layer MLP with ReLU and dropout 0.3, and classified with a label-smoothing cross-entropy loss (ε = 0.05). The entire pipeline is trained end-to-end with AdamW, cosine annealing, and EMA (decay 0.999). AFE-FDNet totals 13.5M parameters and 6.68 GFLOPs; inference takes ≈17 ms per image on a single RTX 3090.
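The loss and the weight averaging named above can be sketched in a few lines. The smoothing convention used here, a target of (1 − ε) on the true class plus ε/K spread over all K classes, is an assumption; the text gives only ε = 0.05 and the EMA decay of 0.999.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def smoothed_cross_entropy(logits, target, epsilon=0.05):
    """Cross-entropy against a label-smoothed target distribution."""
    k = len(logits)
    log_probs = log_softmax(logits)
    targets = [
        epsilon / k + (1.0 - epsilon) * (1.0 if i == target else 0.0)
        for i in range(k)
    ]
    return -sum(t * lp for t, lp in zip(targets, log_probs))


def ema_update(shadow, current, decay=0.999):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * s + (1.0 - decay) * c for s, c in zip(shadow, current)]


# Hypothetical real/fake logits for one image whose true class is index 1.
confident = smoothed_cross_entropy([-4.0, 4.0], target=1)
unsmoothed = smoothed_cross_entropy([-4.0, 4.0], target=1, epsilon=0.0)
shadow = ema_update([1.0, 0.0], [0.0, 1.0])
```

For a confident correct prediction the smoothed loss is slightly higher than the unsmoothed one, which discourages the detector from pushing its logits to extremes.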

Fig. 2. HF–HVD diagnostic: bilinear loses 89.5% more directional HF signal than FA-IPNet at eye contours and lip boundaries. © 2026 IEEE.

№ P4

RESULTS

Best average cross-dataset AUC among all evaluated methods.

Cross-Dataset Generalization (Table II)

The model is trained on FF++ (HQ) at 224×224 and tested without fine-tuning on CDF-V2 and DFDC, and zero-shot on DF40 (diffusion-era). AFE-FDNet achieves 99.76% AUC on FF++ HQ, 84.17% on CDF-V2, 78.45% on DFDC, and 78.94% on DF40, outperforming 11 SOTA baselines including FCG (CVPR'25), TCLF (TPRIVS'25), and DiffusionFacial (MM'24). Average cross-dataset AUC: 85.33%.
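The reported average can be reproduced from the four per-dataset figures. The short check below uses only the numbers quoted above; the second average, restricted to the three datasets not used for training, is derived here for context and is not a figure from the paper.

```python
# AUC values (%) as quoted in the text above.
auc = {"FF++ (HQ)": 99.76, "CDF-V2": 84.17, "DFDC": 78.45, "DF40": 78.94}

# Average over all four datasets, including the in-domain FF++ result.
average = sum(auc.values()) / len(auc)

# Average over the three datasets that were not used for training.
unseen = [value for name, value in auc.items() if name != "FF++ (HQ)"]
unseen_average = sum(unseen) / len(unseen)

print(f"{average:.2f}")         # 85.33
print(f"{unseen_average:.2f}")  # 80.52
```

The 85.33% figure therefore includes the in-domain FF++ score alongside the three unseen datasets.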

Resizer Comparison (Table I)

At 512×512, FA-IPNet achieves 72.32% AUC on CDF-V2 and 67.43% on DF40 (zero-shot), outperforming bilinear (+2.6% / +5.9%), bicubic (+5.8% / +7.6%), LRI (+2.3% / +4.3%), LIIF (+5.1% / +7.2%), and LSAID (+1.9% / +3.7%). Learned perceptual resizers (LRI, LIIF, LSAID) do not outperform classical interpolation on forensic tasks, which confirms that perceptual quality and forensic-signal preservation are distinct objectives.

Cross-Backbone Validation (Table IV)

FA-IPNet improves all downstream backbones under LQ compression: Xception (+6–12% AUC), EfficientNet-B4 (+2–4%), ViT-B/16 (+2–3%). On DF40 zero-shot, gains are a consistent +3–4% across all architectures. This confirms that FA-IPNet is detector-agnostic.

Fig. 4. Conventional bilinear resizing erodes forensic HF cues. Downscaling to 128×128 causes a marked AUC drop (FF++: 99.62% → 95.45%). © 2026 IEEE.

Fig. 5. RVT concentrates on geometric inconsistencies (eyes, mouth); FET highlights spectral anomalies (compression banding, periodic noise). © 2026 IEEE.

№ P5

LIMITATIONS & FUTURE WORK

What we could not solve yet.

Resolution scaling extremes. FA-IPNet is trained on fixed input resolutions (224×224). Extreme upsampling (>4×) may reintroduce artifacts that bypass the detector. The HF–HVD diagnostic is most reliable at moderate scale changes.

Video temporal coherence. This work focuses on single-frame detection. Temporal inconsistencies across video frames are not explicitly modeled, though the framework could be extended with a third, temporal branch.

Generative model evolution. As diffusion-based face synthesis improves, artifact signatures shift to lower frequencies. The current spectral priors may need periodic retraining to remain effective against next-generation synthesis methods.

Ethical considerations. Deepfake detection models carry dual-use risk; responsible release and controlled weight sharing are essential. Consumer-facing applications should ensure transparency and informed disclosure when content is flagged. Given the asymmetric harm of false positives in identity-sensitive contexts, our framework is intended as one component in a multi-stage review pipeline rather than a sole decision authority.

№ P6

BIBTEX

Cite this paper.

@article{amin2026trustworthy,
  author    = {Amin, Muhammad Ahmad and Ni, Jiangqun and Xu, Ping and Yu, Zeqin and Fu, Dahao},
  title     = {Trustworthy Deepfake Defense for Consumer Applications via Frequency-Preserving Resizing and Dual-Branch Forensics},
  journal   = {IEEE Transactions on Consumer Electronics},
  year      = {2026},
  volume    = {xx},
  number    = {xx},
  pages     = {xx--xx},
  publisher = {IEEE},
  doi       = {10.1109/TCE.2026.3699978}
}

№ P7

IEEE Copyright & Usage.

© 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

This page is a personal academic landing page. The full paper is available via IEEE Xplore. Figures are reproduced with permission from IEEE Transactions on Consumer Electronics.

№ P8

RELATED WORK