PUBLICATION · IJIGSP · 2020
ABSTRACT
In today's era of information explosion, videos are easily captured and can go viral within a short time, while editing software has made video tampering easier. Verifying the authenticity of videos has therefore become increasingly important. Inter-frame forgeries are the most common type of video forgery and are difficult to detect with the naked eye. Several algorithms based on handcrafted features have been proposed for detecting inter-frame forgeries, but their accuracy and processing speed remain limited.
In this paper, we propose a method for detecting video inter-frame forgeries that retrains convolutional neural network (CNN) models pre-trained on the ImageNet dataset. The retrained state-of-the-art CNN models capture spatial-temporal relationships in video and detect inter-frame forgeries robustly. We also propose a confidence score, used in place of the raw output score, to increase the accuracy of the method.
In experiments, the detection accuracy of the proposed method reaches 99.17% when residual and optical flow features are combined. This result demonstrates that the proposed method is significantly more efficient and accurate than other recent methods.
METHOD
We propose four methods to construct training datasets from original videos. Negative samples are built from adjacent frames (as residuals or optical flow), which exhibit natural temporal consistency. Positive samples are built from non-adjacent frames separated by at least 15 frames, which exhibit inconsistency. Dataset_1 uses two-frame residuals; Dataset_2 uses three grey-value residuals from four frames; Dataset_3 uses two-frame optical flow; Dataset_4 uses three optical-flow magnitudes from four frames.
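The residual-based construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the channel-mean grey conversion, and the use of an absolute difference as the "residual" are all assumptions. Only the 15-frame minimum gap and the adjacent/non-adjacent labelling come from the text above. Optical-flow variants (Dataset_3, Dataset_4) are not shown, and the frame arrangement of Dataset_2 positives is not specified here, so only the adjacent-frame case is sketched for it.

```python
import numpy as np

MIN_GAP = 15  # from the text: positive pairs are at least 15 frames apart


def to_grey(frame):
    """H x W x 3 uint8 frame -> float32 grey values (channel mean; an assumption)."""
    return frame.astype(np.float32).mean(axis=2)


def residual(frame_a, frame_b):
    """Absolute grey-value difference between two frames (assumed residual form)."""
    return np.abs(to_grey(frame_a) - to_grey(frame_b))


def two_frame_samples(frames, gap=MIN_GAP):
    """Dataset_1-style (residual, label) pairs: 0 = negative, 1 = positive."""
    samples = []
    n = len(frames)
    for i in range(n - 1):
        # Adjacent frames: natural temporal consistency -> negative sample.
        samples.append((residual(frames[i], frames[i + 1]), 0))
        # Frames `gap` apart: simulated discontinuity -> positive sample.
        if i + gap < n:
            samples.append((residual(frames[i], frames[i + gap]), 1))
    return samples


def three_residual_negative(frames, start):
    """Dataset_2-style negative: three residuals from four consecutive frames,
    stacked as the three channels of one H x W x 3 sample."""
    chans = [residual(frames[start + k], frames[start + k + 1]) for k in range(3)]
    return np.stack(chans, axis=2)
```

Stacking three residuals as channels gives a sample with the same shape as an RGB image, which is what lets an ImageNet-pretrained network consume it without changing the input layer.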
We fine-tune state-of-the-art CNN models pre-trained on ImageNet (GoogLeNet, ResNet, DenseNet, InceptionV3, InceptionResNetV2, MobileNetV2, and NASNet) by replacing the last three layers with a fully connected layer, a softmax layer, and a classification output layer. Training uses SGD with momentum 0.9, an initial learning rate of 0.001, a mini-batch size of 10, a maximum of 20 epochs, and L2 regularization of 0.0001.
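To make the listed hyperparameters concrete, the following sketch collects them in a config object and shows a single SGD-with-momentum update. The `TrainConfig` and `sgdm_step` names are hypothetical, the learning rate is held constant at its initial value, and L2 regularization is folded into the gradient. Frameworks differ in these details, so this illustrates the update rule rather than the training code used in the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the text; field names are illustrative.
    momentum: float = 0.9
    initial_lr: float = 0.001
    mini_batch_size: int = 10
    max_epochs: int = 20
    l2_regularization: float = 0.0001


def sgdm_step(weights, velocity, grad, cfg):
    """One SGD-with-momentum update; returns (new_weights, new_velocity)."""
    # L2 regularization adds lambda * w to the loss gradient.
    reg_grad = grad + cfg.l2_regularization * weights
    # The velocity keeps a decaying memory of previous updates.
    new_velocity = cfg.momentum * velocity - cfg.initial_lr * reg_grad
    return weights + new_velocity, new_velocity


cfg = TrainConfig()
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgdm_step(w, v, grad=np.array([0.5, 0.5]), cfg=cfg)
```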
Instead of using raw CNN output scores, we define a confidence score f_con(i) that incorporates temporal context from neighboring samples. A video is declared original if max(f_con(i)) < 0.5; otherwise it is declared forged. This temporal aggregation reduces isolated misclassifications and boosts overall accuracy.
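The decision rule can be sketched as follows. The exact definition of f_con(i) is not given on this page, so the confidence function is left pluggable; `neighbour_mean` is a purely hypothetical placeholder (a moving average over a small window) used only to show how temporal context can suppress an isolated high score. Only the 0.5 threshold and the max-over-samples rule come from the text.

```python
import numpy as np

THRESHOLD = 0.5  # from the text: original if max(f_con(i)) < 0.5


def neighbour_mean(raw_scores, radius=1):
    """Hypothetical stand-in for f_con: mean score over +/- radius samples."""
    scores = np.asarray(raw_scores, dtype=np.float64)
    padded = np.pad(scores, radius, mode="edge")  # repeat edge values
    kernel = np.full(2 * radius + 1, 1.0 / (2 * radius + 1))
    return np.convolve(padded, kernel, mode="valid")


def classify_video(raw_scores, confidence_fn=neighbour_mean):
    """Apply the max-confidence rule to the per-sample scores of one video."""
    if len(raw_scores) == 0:
        raise ValueError("a video must yield at least one sample")
    f_con = confidence_fn(raw_scores)
    return "original" if np.max(f_con) < THRESHOLD else "forged"


# An isolated spike is damped below the threshold by the placeholder f_con.
print(classify_video([0.1, 0.9, 0.1, 0.2, 0.1]))
# A run of high scores survives the averaging.
print(classify_video([0.1, 0.8, 0.9, 0.8, 0.1]))
```

Note that any smoothing of this kind would also damp a genuine single-sample peak, so the window size and the form of the aggregation are illustrative choices, not the paper's.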
During testing, each video is decomposed into samples following the negative-sample construction strategy. Each sample is classified by the retrained model, producing output scores that are transformed into confidence scores. The final decision is based on the maximum confidence score across all samples in the video.
RESULTS
On Dataset_1, ResNet18, DenseNet201, InceptionV3, and ResNet50 all achieve 97.5% detection accuracy. MobileNetV2 achieves 96.67% with only 3.5M parameters, offering the best accuracy-to-parameter trade-off. SqueezeNet reaches 94.17% with only 1.24M parameters.
Individually, Dataset_1 (two-frame residual) achieves 96.67% and Dataset_4 (optical-flow magnitudes) achieves 95.0%. Combining Dataset_1 and Dataset_4 yields 99.17% accuracy with a 3.33% false positive rate (FPR) and a 100% true positive rate (TPR), significantly outperforming single-feature approaches.
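For readers checking how accuracy, FPR, and TPR relate, here is a small worked computation. The test-set composition is not stated on this page; the counts below (90 forged and 30 original videos, with one false alarm) are a hypothetical split chosen only because it reproduces the reported figures. Forged videos are treated as the positive class.

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, false positive rate, and true positive rate from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # correct decisions over all videos
        "fpr": fp / (fp + tn),          # originals wrongly flagged as forged
        "tpr": tp / (tp + fn),          # forged videos correctly flagged
    }


# Hypothetical counts, not taken from the paper.
m = detection_metrics(tp=90, fp=1, tn=29, fn=0)
print({name: f"{value * 100:.2f}%" for name, value in m.items()})
```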
Transfer learning from ImageNet is crucial: MobileNetV2 improves from 84.17% (trained from scratch) to 96.67% (transfer learning), and ResNet18 improves from 93.33% to 97.5%. This confirms that pre-trained visual features generalize effectively to forensic tasks despite domain differences.
The proposed method achieves 99.17% accuracy, outperforming optical-flow consistency [18] at 91.12%, spatial-temporal correlation [17] at 86.67%, DCT mean correlation [3] at 93.34%, and grey-value correlation [4] at 88.89%. It detects copy-move, insertion, and deletion forgeries within a unified framework.
LIMITATIONS & FUTURE WORK
Forgery-type classification. The method detects inter-frame forgeries but does not explicitly classify the specific manipulation type (insertion, deletion, or duplication). Future work will extend the framework to classify forgery types.
Dataset scale. The training set comprises 270 original videos from five surveillance cameras. Larger and more diverse datasets may further improve generalization across different capture conditions.
Computational cost of optical flow. Optical flow extraction increases computational cost compared to residual-only methods. Future architectures may integrate learnable motion estimation to reduce this overhead.
Lightweight forensic architectures. Future work includes designing lightweight CNN architectures specialized for video forensic tasks, balancing detection accuracy with real-time processing requirements.
BIBTEX
@article{nguyen2020detecting,
author = {Nguyen, Xuan Hau and Hu, Yongjian and Amin, Muhammad Ahmad and Hayat, Khan Gohar and Le, Van Thinh and Truong, Dinh-Tu},
title = {Detecting Video Inter-Frame Forgeries Based on Convolutional Neural Network Model},
journal = {International Journal of Image, Graphics and Signal Processing},
year = {2020},
volume = {12},
number = {3},
pages = {1--12},
publisher = {MECS Press},
doi = {10.5815/ijigsp.2020.03.01}
}
COPYRIGHT NOTICE
© 2020 MECS Press. Personal use of this material is permitted. Permission from MECS Press must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
This page is a personal academic landing page. The full paper is available via MECS Press. The dataset is published online at Mendeley Data (VIFFD).