PUBLICATION · IJIGSP · 2020

Detecting Video Inter-Frame Forgeries Based on Convolutional Neural Network Model

X. H. Nguyen¹,², Y. Hu¹, M. A. Amin¹, K. G. Hayat¹, V. T. Le², D.-T. Truong³
¹ Research Centre of Multimedia Information Security Detection and Intelligent Processing, School of Electronics and Information Engineering, South China University of Technology, Guangzhou, P.R. China
² Faculty of Electronics and Informatics Engineering, Mien Trung Industrial and Trade College, Phu Yen, Vietnam
³ Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam

ABSTRACT

What is the problem and what did we do?

Videos are now easy to capture and can go viral within a short time, while widely available editing software has made tampering easier. Verifying the authenticity of videos has therefore become increasingly important. Inter-frame forgeries, the most common type of video forgery, are difficult to detect with the naked eye. Several algorithms based on handcrafted features have been proposed for detecting them, but their accuracy and processing speed remain limited.

In this paper, we propose a method for detecting video inter-frame forgeries based on convolutional neural network (CNN) models, obtained by retraining available CNN models pre-trained on the ImageNet dataset. The retrained state-of-the-art models capture spatial-temporal relationships in video and can therefore detect inter-frame forgeries robustly. We also propose a confidence score, used in place of the raw output score, to increase the accuracy of the method.

In experiments, the detection accuracy of the proposed method reaches 99.17% when residual and optical flow features are combined. This result indicates that the proposed method is significantly more efficient and accurate than other recent methods.


FRAMEWORK OVERVIEW

Retraining and testing pipeline for inter-frame forgery detection.

The proposed framework consists of a retraining stage and a testing stage. In retraining, state-of-the-art CNN models pre-trained on ImageNet are fine-tuned on forensic datasets built from residuals and optical flow. In testing, each input video is decomposed into frame-difference samples, the samples are classified by the retrained network, and the per-sample scores are aggregated via a temporal confidence score to reach a final authenticity verdict.


METHOD

Dataset construction, retraining, and confidence scoring.

01

Training Dataset Construction from Residuals and Optical Flow

We propose four methods to construct training datasets from original videos. Negative samples are built from adjacent frames (as residuals or optical flow), which exhibit natural temporal consistency. Positive samples are built from non-adjacent frames separated by at least 15 frames, which exhibit inconsistency. Dataset_1 uses two-frame residuals; Dataset_2 uses three grey-value residuals from four frames; Dataset_3 uses two-frame optical flow; Dataset_4 uses three optical-flow magnitudes from four frames.
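The two-frame residual construction can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the residual is the absolute grey-value difference between two frames, and the function names, the synthetic clip, and the fixed gap of exactly 15 frames are all hypothetical choices made for the example.

```python
import numpy as np

MIN_GAP = 15  # minimum frame separation for positive samples, per the text


def residual(frame_a, frame_b):
    """Absolute grey-value difference between two frames (assumed definition)."""
    diff = frame_a.astype(np.int16) - frame_b.astype(np.int16)
    return np.abs(diff).astype(np.uint8)


def build_two_frame_samples(frames, gap=MIN_GAP):
    """Return (negatives, positives) for a Dataset_1-style residual set.

    frames: array of shape (num_frames, height, width) holding uint8 grey values.
    Negatives pair frame i with frame i + 1 (adjacent, temporally consistent).
    Positives pair frame i with frame i + gap (non-adjacent, inconsistent).
    """
    if gap < MIN_GAP:
        raise ValueError("positive samples need frames at least 15 apart")
    n = len(frames)
    negatives = [residual(frames[i], frames[i + 1]) for i in range(n - 1)]
    positives = [residual(frames[i], frames[i + gap]) for i in range(n - gap)]
    return negatives, positives


# Synthetic clip (illustrative only): a bright 8x8 square moving one pixel per frame.
frames = np.zeros((40, 32, 64), dtype=np.uint8)
for t in range(40):
    frames[t, 12:20, t:t + 8] = 255

negatives, positives = build_two_frame_samples(frames)
```

On this toy clip the adjacent-frame residual is non-zero only along the leading and trailing edges of the square, while the 15-frame residual contains both full squares, which is the kind of inconsistency the positive samples are meant to expose.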

02

Transfer Learning and Fine-Tuning

We fine-tune state-of-the-art CNN models pre-trained on ImageNet (GoogleNet, ResNet, DenseNet, InceptionV3, InceptionResNetV2, MobileNetV2, and NasNet) by replacing the last three layers of each with a fully connected layer, a softmax layer, and a classification output layer. Training uses SGD with momentum 0.9, an initial learning rate of 0.001, a mini-batch size of 10, a maximum of 20 epochs, and L2 regularization of 0.0001.
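To make the optimizer settings concrete, the sketch below applies one common form of the SGD-with-momentum update, with the L2 penalty added to the gradient, using the hyperparameter values quoted above. The exact update rule of the authors' training framework is not given here, so the form of `sgdm_step` and the toy quadratic objective are assumptions for illustration.

```python
import numpy as np

# Hyperparameters quoted in the text.
MOMENTUM = 0.9
LEARNING_RATE = 0.001
L2_REG = 0.0001
MINI_BATCH_SIZE = 10  # listed for completeness; unused in this toy example
MAX_EPOCHS = 20       # listed for completeness; unused in this toy example


def sgdm_step(weights, velocity, grad):
    """One SGD-with-momentum update with L2 regularization (assumed form)."""
    reg_grad = grad + L2_REG * weights  # the L2 penalty contributes L2_REG * w
    velocity = MOMENTUM * velocity - LEARNING_RATE * reg_grad
    return weights + velocity, velocity


# Toy objective (illustrative): 0.5 * ||w - target||^2, whose gradient is w - target.
target = np.array([1.0, -2.0])
w = np.zeros(2)
v = np.zeros(2)
for _ in range(5000):
    w, v = sgdm_step(w, v, w - target)
```

With the regularizer included, the iterates settle at `target / (1 + L2_REG)` rather than at `target` itself, which shows how small the pull of a 0.0001 penalty is relative to the data term.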

03

Confidence Score Transformation

Instead of using raw CNN output scores, we define a confidence score f_con(i) for each sample i that incorporates temporal context from neighboring samples. A video is declared original if max(f_con(i)) < 0.5 over all of its samples; otherwise it is declared forged. This temporal aggregation reduces isolated misclassifications and boosts overall accuracy.
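The decision rule can be written down directly. The sketch below takes the per-sample confidence scores as given, because the formula for f_con(i) is not reproduced in this summary; the function names and the example score lists are hypothetical.

```python
THRESHOLD = 0.5  # decision threshold stated in the text


def classify_video(confidence_scores, threshold=THRESHOLD):
    """Return 'original' if every confidence score is below the threshold, else 'forged'."""
    if not confidence_scores:
        raise ValueError("need at least one sample score")
    return "original" if max(confidence_scores) < threshold else "forged"


def most_suspicious_sample(confidence_scores):
    """Index of the sample with the highest confidence score."""
    return max(range(len(confidence_scores)), key=confidence_scores.__getitem__)


# Made-up scores for illustration only.
clean_scores = [0.02, 0.10, 0.31, 0.07]
tampered_scores = [0.03, 0.08, 0.94, 0.05]
clean_verdict = classify_video(clean_scores)
tampered_verdict = classify_video(tampered_scores)
```

Because the rule uses a strict inequality, a maximum score of exactly 0.5 is treated as forged.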

04

Testing and Decision

During testing, each video is decomposed into samples following the negative-sample construction strategy, that is, from adjacent frames. Each sample is classified by the retrained model, and the resulting output scores are transformed into confidence scores. The final decision is based on the maximum confidence score across all samples in the video.


RESULTS

Best detection accuracy among all evaluated methods.

Cross-Model Comparison (Table III)

On Dataset_1, ResNet18, DenseNet201, InceptionV3, and ResNet50 all achieve 97.5% detection accuracy. MobileNetV2 achieves 96.67% with only 3.5M parameters, offering the best accuracy-to-parameter trade-off. SqueezeNet reaches 94.17% with merely 1.24M parameters.

Feature Fusion Analysis (Table IV)

Individually, Dataset_1 (two-frame residual) achieves 96.67% and Dataset_4 (optical-flow magnitudes) achieves 95.0%. Combining Dataset_1 and Dataset_4 yields 99.17% accuracy with a 3.33% false positive rate (FPR) and a 100% true positive rate (TPR), significantly outperforming the single-feature approaches.
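As a check on how the three reported figures relate, the sketch below computes accuracy, FPR, and TPR from confusion counts. The counts used (90 forged and 30 original test videos, with one false positive) are an assumption chosen because they reproduce the reported percentages; the actual test-set composition is not stated in this summary.

```python
def detection_metrics(tp, fn, fp, tn):
    """Return (accuracy, false positive rate, true positive rate) as fractions."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    fpr = fp / (fp + tn)  # share of original videos flagged as forged
    tpr = tp / (tp + fn)  # share of forged videos correctly flagged
    return accuracy, fpr, tpr


# Hypothetical counts consistent with the reported figures (not from the source).
accuracy, fpr, tpr = detection_metrics(tp=90, fn=0, fp=1, tn=29)
report = f"accuracy={accuracy:.2%} FPR={fpr:.2%} TPR={tpr:.2%}"
```

Under these assumed counts, a single misclassified original video accounts for both the 3.33% FPR and the gap between 99.17% and perfect accuracy.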

Transfer Learning vs. Training from Scratch (Table V)

Transfer learning from ImageNet is crucial: MobileNetV2 improves from 84.17% (scratch) to 96.67% (transfer), a gain of 12.5 percentage points, and ResNet18 improves from 93.33% to 97.5%, a gain of 4.17 points. This indicates that pre-trained visual features generalize effectively to forensic tasks despite the domain difference.

Comparison with Recent Methods (Table VI)

The proposed method achieves 99.17% accuracy, outperforming optical-flow consistency [18] at 91.12%, spatial-temporal correlation [17] at 86.67%, DCT mean correlation [3] at 93.34%, and grey-value correlation [4] at 88.89%. It detects copy-move, insertion, and deletion forgeries within a unified framework.


LIMITATIONS & FUTURE WORK

What we could not solve yet.

Forgery-type classification. The method detects inter-frame forgeries but does not explicitly classify the specific manipulation type (insertion, deletion, or duplication). Future work will extend the framework to classify forgery types.

Dataset scale. The training set comprises 270 original videos from five surveillance cameras. Larger and more diverse datasets may further improve generalization across different capture conditions.

Computational cost of optical flow. Optical flow extraction increases computational cost compared to residual-only methods. Future architectures may integrate learnable motion estimation to reduce this overhead.

Lightweight forensic architectures. Future work includes designing lightweight CNN architectures specialized for video forensic tasks, balancing detection accuracy with real-time processing requirements.


BIBTEX

Cite this paper.

@article{nguyen2020detecting,
  author    = {Nguyen, Xuan Hau and Hu, Yongjian and Amin, Muhammad Ahmad and Hayat, Khan Gohar and Le, Van Thinh and Truong, Dinh-Tu},
  title     = {Detecting Video Inter-Frame Forgeries Based on Convolutional Neural Network Model},
  journal   = {International Journal of Image, Graphics and Signal Processing},
  year      = {2020},
  volume    = {12},
  number    = {3},
  pages     = {1--12},
  publisher = {MECS Press},
  doi       = {10.5815/ijigsp.2020.03.01}
}