Scalable Content Authentication in H.264/SVC Videos Using Perceptual Hashing based on Dempster-Shafer theory

Abstract The content authenticity of multimedia delivery is an important issue given the rapid development and wide use of multimedia technology. Many authentication solutions have been proposed so far, such as cryptography-based and watermarking-based methods. However, in the latest heterogeneous networks, where video streams are coded in a scalable way such as H.264/SVC, there is still no good authentication solution. In this paper, we first summarize related works and then propose a scalable content authentication scheme using perceptual hashing based on the ratio of different energy (RDE) in the Q/S dimension, which uses Dempster-Shafer theory and is combined with the structure of the latest scalable video coding standard (H.264/SVC). The idea of “sign once and verify in a scalable way” can thus be realized. Compared with previous methods, the proposed perceptual-hashing scheme performs better in uncertainty (robustness) and efficiency on H.264/SVC video streams. At last, the experiment results verified t...


Introduction
Multimedia data is among the most important kinds of information that humans can recognize and understand. With the development of multimedia services and network communication technology, we can conveniently acquire multimedia information such as video, audio, and other media at any time and in any place. Thanks to new scalable video coding and transmission mechanisms (H.264/SVC), people can even consume the same broadcast video content on various terminals, such as home TVs, PCs, or mobile TVs, without transcoding [1, 2] (see Figure 1). However, the security of delivering this video content has become a crucial problem, covering confidentiality, originality, integrity, and so on. Traditional security methods such as encryption and data hashing (for example, the AES and SHA algorithms) are usually adopted to protect data content, and in later years watermarking methods have been developed, especially for copyright applications. These two approaches are the main research directions in multimedia security so far. Although they achieve good performance in traditional settings, they do not seem well adapted to the latest heterogeneous multimedia networks in which scalable video coding has been adopted.
In this paper we focus on multimedia content authentication. Multimedia authentication techniques are used to protect the originality and integrity of multimedia content and can tell us whether the received content is authentic or has been tampered with [3]. Related works fall into two groups: digital signatures and watermarking. The former uses an encryption algorithm to extract hash codes from an image or other multimedia data as a signature, and the hash code is saved in the file header or other extra space and transmitted with the image. When authentication is needed, a signature produced in the same way is compared with the saved signature. If they match, the received multimedia data is authentic.
The latter uses semi-fragile watermarking to verify the originality of the received content, and how to design the watermarking algorithm is the key point in authentication. As is well known, a cryptographic hash or signature changes completely when even a single bit of the input is modified. This is obviously not practical in real multimedia transmission or processing applications.
Pure watermarking-based authentication schemes, in turn, seem more vulnerable to attacks, such as various watermarking attacks or delicate authentication attacks, because watermarks are inherently visually redundant information in multimedia data. Perceptual hashing (P-Hashing) is a recent technique for authenticating multimedia content [12]. It works by computing hash values from features of the multimedia data. Compared with conventional hashing and signatures, an ideal perceptual hash has the advantage that it does not change when the multimedia data undergoes common processing, while still detecting malicious modifications. Perceptual hashing therefore seems a good option for multimedia content authentication, especially over public transmission channels. For video authentication, several video hashing methods have been proposed that achieve a trade-off between robustness and discrimination [19][20][21][22][23][24][25].
However, as mentioned above, there is still no good authentication solution for scalable video coding (H.264/SVC) stream content in the latest heterogeneous networks. In an SVC broadcast system, the server encodes once and receivers can decode in many ways. How to secure the authenticity of delivered SVC content and make it possible to "sign once and verify in a scalable way" is therefore the key point of our work. Some work has been done so far, including cryptographic hashing and watermarking methods, but a perceptual-hashing-based scalable authentication method for SVC videos has seldom been proposed. The main contribution of this paper is a new scalable perceptual video hash that keeps its output invariant when the scalable visual content has not changed and can discriminate malicious modifications. Our scheme is adapted to the three-dimensional structure of H.264/SVC, comprising spatial layers, quality layers, and temporal layers.
The features extracted from the quality layers and spatial layers belonging to the same frame content are combined to produce invariant perceptual hash bits, which are used for authentication at different receivers. Because our hash bits are generated from visual content features, different layers of the same content can be authenticated using only one hash. If some layer or packet is discarded, the authentication result is not affected. Regarding tampering along the temporal axis, such as frame dropping and reordering, the time stamp and NAL packet IDs give the original order of frames and GOPs and allow malicious temporal tampering to be discriminated from normal frame rate changes.
The rest of this paper is organized as follows. The next section summarizes recent related works on scalable video authentication. Section 3 describes our proposed quality or spatial layer based perceptual hashing (Q/S based P-Hashing) scheme and how to obtain invariant features from different spatial layers and quality layers. The last sections give experimental results on H.264/SVC streams, followed by conclusions and future work.

Related Works
In [3], different approaches to multimedia authentication were summarized, including conventional cryptography, fragile and semi-fragile watermarking, and digital signatures based on the image content. Although the author gives a detailed classification and performance comparison of the different authentication methods, authentication for scalable multimedia distribution is not mentioned. Among early related works on scalable image coding such as JPEG 2000, C Peng et al proposed a flexible and scalable authentication scheme based on the Merkle hash tree and digital signatures [8,19]. It allows users to verify the authenticity and integrity of different sub-images extracted from a single compressed codestream protected with a single digital signature. To the best of our knowledge, the earliest paper on scalable video authentication was by Sun QB et al [5]. They considered three common MPEG video transcoding manipulations and combined error correction coding (ECC) and watermarking to design a content authentication scheme for scalable video streaming. Yan WQ et al proposed a scalable video signature that can authenticate video content at three hierarchical levels (key-frame, shot and video) [7], but it is implemented in the pixel domain of video frames and is not suitable for practical scalable video coding modes, and its authentication precision is somewhat coarse. For MPEG-2 video authentication, Ye et al used multiple features extracted from macro-block coefficients, time order information, and motion information to produce a signature. MPEG-4 is a more recent media stream processing standard with the important "compress once, decompress many ways" property. Wu YD et al presented three scalable authentication schemes for MPEG-4 streams in multicast and lossy networks [9], which share the novel property of "sign once, verify many ways".
For heterogeneous wireless networks, Gabriele et al proposed a loss-tolerant video streaming authentication scheme based mainly on video feature extraction [26]. In their experiments, a feature difference indicator (FDI) and two attacks were used to verify the sensitivity of the content feature extraction algorithm. That work serves as a reference for our feature-hashing-based scalable authentication scheme. However, none of the above works is implemented in H.264/SVC, which is the focus of this paper and is more complex, so it is questionable whether they are suitable for scalable authentication of H.264/SVC.
We next summarize related works that target H.264/SVC streams specifically. In [16,17], Su-Wan Park et al embedded reversible watermarks into the intra prediction modes (IPMs) of H.264/SVC bit streams for authentication, and encrypted the IPMs of 4×4 luma blocks, the sign bits of texture, and the sign bits of MV difference values in intra and inter frames. If the watermark is correctly detected, the cipher content is decrypted. Although the visual quality of the video is not degraded much, the watermarking scheme adds a small bit overhead, and it is questionable whether watermarking only the IPMs can protect the whole content of SVC video streams. In [10,11], Mokhtarian et al proposed an authentication scheme that accounts for the full scalability of video streams and enables verification of all possible sub-streams that can be extracted from the original stream, considering adaptation in the spatial, quality, and temporal dimensions.
In that scheme, hashes are generated from MGS (medium-grained scalable) packets up to layers, then to CGS (coarse-grained scalable) and spatial layers, and from video frames up to GOPs according to their temporal relation. Together they form a complex hash tree that authenticates video data in any layer, resists packet loss, and adapts to the discarding of different layers. Although this kind of cryptographic-hash-based scheme achieves security, adaptivity, and computational performance to some extent, some problems remain that cannot be easily resolved.
Because the hash bits generated from every MGS packet and from the different layers must be attached at their respective locations, the communication overhead is too high, even after improvements such as generating hash bits from a group of truncation units rather than from individual ones.
Because a cryptographic hash cannot tolerate even a one-bit change in the video data, any bit modification in MGS blocks can cause the final frame or GOP to fail authentication even though the video content is still original. To address these problems, we combine perceptual video hashing and watermarking to design an authentication scheme adapted to the three dimensions of scalability (S, Q and T) of H.264/SVC video streams.

Proposed Scheme
Our scheme belongs to content-based authentication. The question is therefore how to design a perceptual hash that both meets the scalability requirement and is sensitive to content changes. In addition, the system needs some robustness to noise, such as bit changes that do not affect the video content, which may often occur in real applications. Our scheme was partly inspired by the works in [5, 6, 13] and is based on robust DCT features.
Before describing our scheme in detail, we introduce the structure of H.264/SVC streams. A frame in SVC is constituted by different spatial and quality layers, denoted by DID and QID. Temporal scalability is needed to support changes in frame rate. A frame in a higher temporal layer can only be predicted from frames in lower temporal layers, and the TID is the ID of the temporal layer. In content authentication, ideal perceptual hashes extracted from the same content should be identical, no matter what the resolution or quality is. It is therefore natural to extract the same features from the different spatial (resolution) or CGS (quality) layers included in one frame, producing several identical hash bit sequences. This is what we call Q/S based Hashing in our scheme. Noting that H.264/SVC allows up to 8 spatial and 16 quality layers for one frame, we select the base layer frame to extract the main features and select an enhancement frame to extract enhancement features for authentication. This reduces the hash computation while not missing the significant content that needs to be authenticated. Regarding temporal tampering, the TID gives the correct temporal order. If attackers forge a TID and reorder or replace frames, the Q/S based hashing can perform the authentication. In [5,6], three transcoding operations were considered: requantization, frame resizing, and frame dropping.
In H.264/SVC we first consider key frames in two cases: one in which the resolution changes (spatial change) and one in which the resolution is the same but the quality differs (quality change). The latter case is further divided into different CGS layers and different MGS layers.
a) The invariance feature between frames with same resolution in different CGS layers
In H.264/SVC, 4x4 blocks are adopted and transformed into a matrix of 16 DCT coefficients. Each coefficient is quantized with the same quantization step. In a key frame, intra 4x4 blocks (denoted by Io) are intra coded using different intra prediction modes; the intra prediction pixel block is denoted by Po, and the residual blocks are transformed into DCT coefficients, denoted by r in Figure 2. The coefficients r are then quantized by Q1, and the quantization noise is denoted as in equation (2). When decoding different quality layers at the receiver, [Po+IDCT(a+b)] should have the same visual features as [Po+IDCT(b)], apart from a different quantization distortion (Figure 3). Because many AC coefficients are zero after quantization, especially the high frequency ones, features extracted from a group of DCT coefficients are more robust than those from a single DCT coefficient and are the preferred units for producing the perceptual hash. It can also be predicted that, as the number of blocks increases, the invariance relation of the DCT frequency energy in the group pair is maintained, because the quantization distortion can be regarded as zero-mean noise; this can be obtained through Dempster-Shafer evidence theory (D-S theory).
D-S theory considers one or more pieces of evidence to enhance the information and decrease the uncertainty. In this paper, we use it to ensure the robustness of the features and to enhance image quality and energy. D-S theory first considers the frame of discernment; in this paper, the frame of discernment is composed of groups of DCT coefficients, which we combine to obtain better features. Dempster-Shafer theory has five related concepts: basic probability assignment (BPA), combination rules, belief, plausibility, and belief interval [27]. The sum of all BPAs equals 1.0.
In this paper, we use the rate of the visual feature in the total layer as the input of the BPA.

Combine Functions
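Multiple pieces of evidence can be combined using the D-S combination rule shown in equations (1) and (2), which is also called the orthogonal sum of evidences or mass functions. In its standard form, for two mass functions m_1 and m_2 the rule reads:
\[ m(A) = (m_1 \oplus m_2)(A) = \frac{1}{1-K} \sum_{B \cap C = A} m_1(B)\, m_2(C), \quad A \neq \emptyset \qquad (1) \]
\[ K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C) \qquad (2) \]
Through the combine functions, we obtain comprehensive features from the group of DCT coefficients and make the features more robust.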
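The combination step can be illustrated with a minimal sketch of the standard Dempster rule. The two-hypothesis frame of discernment ("invariant" vs. "changed") and the mass values below are hypothetical and chosen only for illustration; the paper does not specify how its BPAs are populated.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function is a dict mapping a frozenset (a hypothesis)
    to its basic probability assignment.
    """
    combined = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
        else:
            # Mass that falls on the empty set is the conflict K.
            conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("evidence is totally conflicting; rule is undefined")
    # Renormalize by 1 - K so the combined masses sum to 1.
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}


# Hypothetical frame of discernment (an assumption, not from the paper).
INV = frozenset({"invariant"})
CHG = frozenset({"changed"})
THETA = INV | CHG  # total ignorance

# Hypothetical evidence from two groups of DCT coefficients.
m_group1 = {INV: 0.6, CHG: 0.1, THETA: 0.3}
m_group2 = {INV: 0.7, CHG: 0.1, THETA: 0.2}

fused = combine(m_group1, m_group2)
```

In this example the fused mass on "invariant" is larger than in either input and the mass left on the whole frame shrinks, which matches the intended effect of combining evidence from several coefficient groups.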

Belief Function
Belief represents the total support for a hypothesis and is drawn from the BPAs of all subsets of that hypothesis. The belief function is defined as equation (3):
\[ Bel(A) = \sum_{B \subseteq A} m(B) \qquad (3) \]
Through the belief function, we obtain the feature extracted from a single DCT coefficient, expressed as a probability.

Plausibility Function
In contrast to belief, plausibility represents the degree to which a hypothesis cannot be disbelieved. Unlike the case in Bayesian probability theory, disbelief is not the complement of belief, but represents the degree of support for all hypotheses that do not intersect with that hypothesis. For each A ∈ 2^Θ, the plausibility function is defined as:
\[ Pls(A) = \sum_{B \cap A \neq \emptyset} m(B) = 1 - Bel(\bar{A}) \]
Through the plausibility function, we obtain the degree of the feature under different quantization distortions.

Belief Interval
The above measures provide DS theory with an explicit measurement of ignorance about a hypothesis and its complement sets. The belief interval is defined as the interval [Bel(A), Pls(A)] [28]. It can also be interpreted as the imprecision on the 'true probability' of A. The final belief, plausibility, and belief interval for each hypothesis can then be calculated from the basic probability assignment using the above equations. Ignorance for the whole set can also be derived. In most cases, the ignorance is reduced after new evidence is added [29]. With the DS combination rule, more than two hypotheses can be integrated. When hypotheses are supported by multiple pieces of evidence, DS theory can be used to fuse them, and the final result represents the synthetic effect of all the evidence [30].
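As a small illustration of these quantities, the sketch below computes belief, plausibility, and the belief interval from a mass function using the standard D-S definitions. The frame of discernment and the mass values are hypothetical and are not taken from the paper's experiments.

```python
def belief(masses, hypothesis):
    """Sum of the masses of all non-empty subsets of the hypothesis."""
    return sum(m for b, m in masses.items() if b and b <= hypothesis)


def plausibility(masses, hypothesis):
    """Sum of the masses of all sets that intersect the hypothesis."""
    return sum(m for b, m in masses.items() if b & hypothesis)


def belief_interval(masses, hypothesis):
    """Return the pair (Bel(A), Pls(A)); its width measures ignorance about A."""
    return belief(masses, hypothesis), plausibility(masses, hypothesis)


# Hypothetical frame of discernment and masses (assumptions for illustration).
INV = frozenset({"invariant"})
CHG = frozenset({"changed"})
THETA = INV | CHG

masses = {INV: 0.6, CHG: 0.1, THETA: 0.3}

bel_inv, pls_inv = belief_interval(masses, INV)
ignorance = pls_inv - bel_inv
```

Here the interval for "invariant" is [0.6, 0.9]; its width, 0.3, is exactly the mass assigned to the whole frame, and the upper bound equals one minus the belief in "changed".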
b) The invariance feature between frames with same resolution with different MGS quality
We divide n blocks into 2 groups of n/2 blocks each. These n blocks generate one feature bit according to the different energy (DE) between the two groups (the basic idea is described in [18]). The relation between the energies of the two groups can be expected to be invariant as long as a proper value of n is chosen, even if some higher frequency coefficients in a certain MGS packet are discarded when the quality is switched, because the DC coefficients are the most important and usually the largest coefficients in 4x4 intra blocks.
c) The relation of residual DCT energy in same key frames with different resolutions
The energy of different block groups reflects the content of the video frame, such as edges, textures, and flat regions. It can be expected that the DE relation is invariant to the scaling of video frames, just as it is across different quality layers. However, in H.264/SVC bit streams, the residual DCT coefficients in a higher spatial layer are based on the difference from the frame predicted by up-sampling the base layer, and the quantization factors also differ between spatial layers. The statistical relation between the residual DCT energy and the intra residual DCT energy in different spatial layers is therefore complex and difficult to describe directly from the existing compressed data. In this paper we only consider the invariance feature based on different quality layers; the one based on different resolutions will be studied in future work.
The hash bits are generated according to formula (7), and the final perceptual hash bit sequence is produced by an encryption algorithm E(·), where r denotes a given resolution. When authenticating an MGS frame, all MGS layers should be put together to construct one complete sequence of block coefficients, including every coefficient, such as the first 3, the following 3, and finally the last 10 in zigzag order (Figure 3). The video frame based authentication scheme is shown in Figure 5, in which one hash bit sequence represents the content of one frame resolution. The results from all hash bit sequences are finally analyzed and voted on to reach the decision.
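The group-energy idea described in this section can be sketched as follows. This is only an illustrative sketch under stated assumptions: the paper's exact formulas are not available in this text, so the energy definition (weighted sum of squared coefficients), the normalized ratio (E1 - E2) / (E1 + E2), and the threshold test are assumptions, as are the function names and the sample coefficient values. The weighting factor and threshold are the values reported in the experiments.

```python
def group_energy(blocks, weights):
    """Weighted energy of a group of 4x4 blocks (16 zigzag-ordered coefficients each).

    Assumption: energy is the weighted sum of squared coefficients.
    """
    return sum(w * c * c for block in blocks for w, c in zip(weights, block))


def rde_bit(blocks, weights, tau):
    """One feature bit from n blocks split into two groups of n/2 blocks.

    Assumption: the bit is 1 when the normalized energy difference exceeds tau.
    """
    half = len(blocks) // 2
    e1 = group_energy(blocks[:half], weights)
    e2 = group_energy(blocks[half:], weights)
    total = e1 + e2
    rde = 0.0 if total == 0 else (e1 - e2) / total
    return 1 if rde > tau else 0


def truncate(block, keep):
    """Model a discarded MGS packet by zeroing coefficients after the first `keep`."""
    return block[:keep] + [0] * (len(block) - keep)


# Weighting factor and threshold as reported in the experiment section.
weights = [0.8, 1.1, 1.1, 1.1] + [1.0] * 12
tau = 0.1

# Hypothetical coefficient values for four blocks (two per group).
full_quality = [
    [40, 5, 4, 3] + [1] * 12,
    [38, 6, 3, 2] + [1] * 12,
    [10, 2, 1, 1] + [0] * 12,
    [12, 1, 2, 1] + [1] * 12,
]
reduced_quality = [truncate(b, 6) for b in full_quality]  # last 10 coefficients dropped
tampered = [[0] * 16, [0] * 16] + full_quality[2:]        # first group wiped out

bit_full = rde_bit(full_quality, weights, tau)
bit_reduced = rde_bit(reduced_quality, weights, tau)
bit_tampered = rde_bit(tampered, weights, tau)
```

Because the DC and low-frequency coefficients dominate the group energy, dropping the last ten coefficients leaves the bit unchanged in this example, while wiping out one group flips it.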

Simulation and Analysis
We simulated three types of test video, including "Foreman" and "Mobile" (Figure 4), and encoded them using the H.264/SVC reference software, called JSVM. Each encoded stream consists of four temporal layers (GoP size 8) and two spatial layers providing CIF and QCIF resolutions. The first frame (frame 0) is the key I frame of the GOP, which was used to extract the perceptual hash.

A) The invariance of RDE based Q-hashing in different quality layers
From the feature extraction experiment, the 4 quality frames reconstructed from the 4 layers (base layer, CGS, MGS1, and MGS2) give 4 sequences of RDE feature values. In Figure 6, the thresholds τ_l are all 0.1, and the weighting factor is λ_j = {0.8, 1.1, 1.1, 1.1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}. It can be seen that for every group index the four feature points all lie on the same side of the threshold line. This is the best possible result, so all hash bit sequences generated by formula (7) match. The results verify that our RDE feature based P-hashing is quality scalable and adapted to SVC coding. We further encoded and decoded the test video "Mobile" (Figure 6), setting the weighting factor λ_j = {1, 0.7, 0.7, 0.7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1} and all thresholds τ_l to 0.09, and obtained the same best result (all hash bits match). However, when these settings were applied to the CIF version of "Foreman", two groups did not match. We then generated the corresponding feature hash bits from several key frames of "Foreman" (frames 1, 9, 17, 25, 33, 41 and 49) without tampering. Comparing the feature hash bits extracted from the base layer with those from the different quality layers (CGS, MGS1, and MGS2) and the spatial layer (CIF), the mean verification ratios are shown in Table 2. The verification ratio is the proportion of video authentication units that can be verified in the absence of any malicious modification.
The experimental results maintain a high ratio to some extent, which verifies the efficiency of the proposed scheme.
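The verification ratio defined above can be computed as in the following minimal sketch. It assumes one hash bit per authentication unit and uses made-up bit sequences; the function names are hypothetical and the values are not the paper's data.

```python
def verification_ratio(reference_bits, received_bits):
    """Fraction of authentication units whose hash bits match the reference."""
    if len(reference_bits) != len(received_bits):
        raise ValueError("hash sequences must have the same length")
    if not reference_bits:
        raise ValueError("hash sequences must not be empty")
    matched = sum(1 for a, b in zip(reference_bits, received_bits) if a == b)
    return matched / len(reference_bits)


def mean_verification_ratio(pairs):
    """Mean ratio over several key frames, given (reference, received) pairs."""
    ratios = [verification_ratio(ref, rec) for ref, rec in pairs]
    return sum(ratios) / len(ratios)


# Hypothetical hash bits of two untampered key frames:
# base layer (reference) versus a CGS layer (received).
base_frame_a = [1, 0, 1, 1, 0, 0, 1, 0]
cgs_frame_a = [1, 0, 1, 1, 0, 1, 1, 0]   # one unit differs because of coding noise
base_frame_b = [0, 1, 1, 0, 1, 0, 0, 1]
cgs_frame_b = [0, 1, 1, 0, 1, 0, 0, 1]   # all units match

ratio_a = verification_ratio(base_frame_a, cgs_frame_a)
mean_ratio = mean_verification_ratio(
    [(base_frame_a, cgs_frame_a), (base_frame_b, cgs_frame_b)]
)
```

A ratio below 1.0 on untampered content, as for the first frame here, corresponds to the residual mismatch that the scheme has to tolerate when separating coding noise from tampering.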

B) The sensitivity of RDE to content tampering
First, we verified the effectiveness of our scheme with two classical tampering operations: deleting and replacement attacks on two blocks of content in the base layer frame, and deleting attacks in the respective quality frames (CGS, MGS 1 and MGS 2). The attacks and authentication results are shown in Figures 7 and 8. In Figure 7 (a) and (b), the two top-left blocks were directly deleted or delicately replaced with sophisticated content, and both modified locations are marked by our proposed scheme in Figure 7 (c) and (d).
Second, following the two forging attacks proposed in [26] (the superimposing attack and the picture-in-picture attack), we tested the sensitivity of our RDE based perceptual hashing to these attack methods, whose intensity can be varied by adjusting a factor. The authenticated video is Foreman and the attack video is Akiyo. Table 3 gives the bit error rate of the RDE feature based hash (FER) in different quality and spatial layers under the different attacks. To better describe the performance of RDE hashing under Pic-in-Pic attacks, part b) of Table 3 uses the global FER and the local FER to indicate how the feature hash bits change under attacks of different intensity. Clearly, the size a of the Pic-in-Pic attack can be used to convert between the global FER and the local FER. Increasing mf and a increases the global FER values without exception. However, the local FER decreased when the size of the Pic-in-Pic attack increased. This is because the modified area became smaller, so that the rate became smaller for the same number of mismatched feature bits. One authentication unit is a 16x16 block. Compared with the normal verification ratio, superimposing attacks can be discriminated by their FER values, because the lowest FER is 13.97%, which is higher than the normal mismatch rate (100% - 94.81% = 5.19%).
Nevertheless, for Pic-in-Pic attacks the situation is more complex when the attack area decreases to 1/16 and below. Here both the global and the local FER, together with the authentication results, should be considered. The local RDE hash bits are therefore further extracted to judge whether some area contains noise or intended modifications. The local FER was generally high (above 50%) under local attacks such as Pic-in-Pic, deleting, or replacement (Figure 7 and Figure 8).
Robustness and sensitivity are the two key points of perceptual hashing when designing an authentication scheme. Both depend on feature selection and extraction. In this paper we used the ratio of different energy of groups of blocks to extract a robust feature and computed one hash sequence for each quality frame. The experimental results show that this energy ratio based hashing has good perceptual characteristics and is robust to frame quality changes in H.264/SVC while remaining sensitive to content tampering such as deleting and replacement. At the same time, the P-hashing can also show the location of the tampering. In previous methods, even a one-bit modification changes the final signature and the authentication result; this is the major difference between content based authentication and cryptographic schemes. Because one hash is generated for the different quality layers, the computation cost and the overhead caused by hash bit data are greatly reduced (Table 2). However, we should note that the discrimination of RDE based P-Hashing between robustness and sensitivity is still not sharp enough when all tampering cases are considered. In future work we will consider a dynamic adaptive threshold and revise the weighting factors, and we expect better results, especially across different spatial layers.
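The FER-based decision discussed in this section can be sketched as follows. This is a simplified illustration, assuming one hash bit per 16x16 authentication unit, a global FER taken over all units, and a local FER taken over the units inside a suspected region. The function name, the bit sequences, and the region are hypothetical; only the 94.81% normal verification ratio comes from the text.

```python
def fer(reference_bits, received_bits, region=None):
    """Feature hashing bit error rate.

    With region=None the rate is global (all units); otherwise it is local,
    computed only over the unit indices listed in `region`.
    """
    indices = list(range(len(reference_bits))) if region is None else list(region)
    mismatched = sum(1 for i in indices if reference_bits[i] != received_bits[i])
    return mismatched / len(indices)


# Mismatch rate expected without tampering: 100% - 94.81% = 5.19%.
NORMAL_MISMATCH = 1.0 - 0.9481

# Hypothetical frame of 64 authentication units; a Pic-in-Pic style attack
# covers the first 16 units and flips the hash bits of 10 of them.
reference = [0] * 64
received = list(reference)
attacked_region = list(range(16))
for i in attacked_region[:10]:
    received[i] ^= 1

global_fer = fer(reference, received)                  # 10 of 64 units
local_fer = fer(reference, received, attacked_region)  # 10 of 16 units
flagged = global_fer > NORMAL_MISMATCH
```

In this example the global FER is about 15.6% and the local FER is 62.5%, so the frame is flagged and the local rate points to the modified region. When the attacked area is small, the global FER can fall close to the normal mismatch rate, which is why the local FER is examined as well.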

CONCLUSION
In this paper we proposed a scalable, robust perceptual hashing (P-Hashing) scheme that can be used for scalable video stream authentication. Our feature based P-Hashing was designed to adapt to quality scalability and resolution scalability simultaneously, so that some invariance can be achieved between different quality layers and spatial layers in the latest H.264/SVC coding structure, as verified in the experiments. The simulations show that malicious modifications and coding noise can be discriminated using this scalable P-hashing. Compared with previous works on scalable video authentication [10,11], the proposed scheme is more efficient in computation cost, robustness, and communication overhead. In the experiments we used the ratio of different energy as the feature for constructing the P-Hashing. As noted in the experimental analysis, improved feature extraction and optimization algorithms need to be applied to this scheme in future work. Achieving a better trade-off between robustness and sensitivity is the key point of scalable video authentication, and feature extraction is the most important work to be done first. Consequently, a set of DCT or other suitable features extracted from H.264/SVC coded data will be applied and analyzed in comparative experiments in the next step.

ACKNOWLEDGMENTS
This work was funded by the A*STAR SERC Grant No.1021010027 of Singapore, and Natural Science Foundation of China (60903197, 61272453).