# A Method for Secure Communication Using a Discrete Wavelet Transform for Audio Data and Improvement of Speaker Authentication

Kouhei Nishimura^{1}, Yasunari Yoshitomi^{2} (yoshitomi@kpu.ac.jp), Taro Asada^{2} (t_asada@mei.kpu.ac.jp), Masayoshi Tabuse^{2} (tabuse@kpu.ac.jp)

^{1}Nippon Telegraph and Telephone West Corp., 3-15 Bamba-cho, Chuo-ku, Osaka 540-8511, Japan

^{2}Graduate School of Life and Environmental Sciences, Kyoto Prefectural University, Nakaragi-cho, Shimogamo, Sakyo-ku, Kyoto 606-8522, Japan

- DOI
- 10.2991/jrnal.2018.5.2.4
- Keywords
- Secure communication; Audio data processing; Wavelet transform; Encoding
- Abstract
We developed a secure communication method using a discrete wavelet transform. Two users must each have a copy of the same piece of music to be able to communicate with each other. The message receiver can produce audio data similar to the sending user's speech by using our previously proposed method and the given recording of music. To improve the accuracy of speaker authentication, the quantization level for the scaling coefficients is increased. Furthermore, the amount of data sent to the message receiver can be remarkably reduced by exploiting the characteristics of this data.

- Copyright
- Copyright © 2018, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article under the CC BY-NC license (http://creativecommons.org/licences/by-nc/4.0/).

## 1. Introduction

The elderly are often targets of telephone fraud. The fraudster pretends to be a grandchild of the elderly person while talking on the phone, and appeals to the elderly person to send money, for example, through a bank transfer. In the present study, we propose a method for secure communication using a discrete wavelet transform (DWT) and thus improve speaker authentication; this is an enhancement of our previously proposed method.^{1} It can be used with Internet protocol (IP) telephones, and it has the potential to help prevent telephone fraud.

## 2. Proposed Method

## 2.1. Encoding

## 2.1.1. Phenomenon exploited for the coding algorithm for audio data

In the course of our research,^{1} we found that the histogram of the scaling coefficients for each domain of a multiresolution analysis (MRA) sequence is centered at approximately zero when a DWT is performed on audio data. Exploiting this phenomenon, we have developed a secure communication method using audio data.^{1}
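The phenomenon can be seen with a minimal sketch: a one-level Haar DWT (chosen here purely for illustration; the paper does not commit to a particular wavelet) applied to a zero-mean, audio-like signal yields scaling coefficients whose distribution is centered near zero.

```python
import math
import random

def haar_dwt_level(samples):
    """One level of a Haar DWT: returns (scaling, wavelet) coefficients."""
    scaling = [(samples[2 * i] + samples[2 * i + 1]) / math.sqrt(2)
               for i in range(len(samples) // 2)]
    wavelet = [(samples[2 * i] - samples[2 * i + 1]) / math.sqrt(2)
               for i in range(len(samples) // 2)]
    return scaling, wavelet

# Zero-mean "audio-like" signal: a 440 Hz tone at 44.1 kHz plus noise.
random.seed(0)
audio = [math.sin(2 * math.pi * 440 * t / 44100) + random.gauss(0, 0.1)
         for t in range(4096)]

scaling, wavelet = haar_dwt_level(audio)
mean = sum(scaling) / len(scaling)
print(round(mean, 3))  # close to 0: the histogram is centered near zero
```

In a full MRA, the same check can be repeated on the scaling coefficients of each successive level.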

## 2.1.2. Use of five quantization levels for scaling coefficients

## (1) Parameter setting

In our previously reported study,^{1} we set the coding parameters as follows.

The values of *Th*(minus) and *Th*(plus) in Fig. 1 are chosen such that the nonpositive scaling coefficients (*S*_{m} in total frequency) are equally divided into two groups by *Th*(minus), and the positive scaling coefficients (*S*_{p} in total frequency) are equally divided into two groups by *Th*(plus). Next, the values of *T*1, *T*2, *T*3, and *T*4, which are the parameters for controlling the authentication precision, are chosen to satisfy the following conditions:

- 1) *T*1 < *Th*(minus) < *T*2 < 0 < *T*3 < *Th*(plus) < *T*4
- 2) The value of *S*_{T1}, which is the number of scaling coefficients in (*T*1, *Th*(minus)), is equal to *S*_{T2}, which is the number of scaling coefficients in [*Th*(minus), *T*2), i.e., *S*_{T1} = *S*_{T2}.
- 3) The value of *S*_{T3}, the number of scaling coefficients in (*T*3, *Th*(plus)], is equal to *S*_{T4}, the number of scaling coefficients in (*Th*(plus), *T*4), i.e., *S*_{T3} = *S*_{T4}.
- 4) *S*_{T1}/*S*_{m} = *S*_{T3}/*S*_{p}.

In the present study, the values of both *S*_{T1}/*S*_{m} and *S*_{T3}/*S*_{p} are set to 0.3, a value that was determined experimentally.
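The threshold selection above amounts to taking quantiles of the coefficient distribution. The sketch below assumes a simple sorted-index quantile rule (the paper does not specify one), and `pick_thresholds` is a hypothetical helper name, not the authors' implementation.

```python
import random

def quantile(sorted_vals, q):
    """Value below which roughly a fraction q of sorted_vals lies (simple sketch)."""
    idx = min(int(q * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]

def pick_thresholds(coeffs, ratio=0.3):
    """Choose Th(minus), Th(plus), and T1..T4 so that the fraction of
    nonpositive coefficients in (T1, Th(minus)) and in [Th(minus), T2)
    is `ratio` each, and likewise on the positive side."""
    neg = sorted(v for v in coeffs if v <= 0)   # the S_m coefficients
    pos = sorted(v for v in coeffs if v > 0)    # the S_p coefficients
    th_minus = quantile(neg, 0.5)               # halves the nonpositive side
    th_plus = quantile(pos, 0.5)                # halves the positive side
    t1 = quantile(neg, 0.5 - ratio)
    t2 = quantile(neg, 0.5 + ratio)
    t3 = quantile(pos, 0.5 - ratio)
    t4 = quantile(pos, 0.5 + ratio)
    return t1, th_minus, t2, t3, th_plus, t4

random.seed(1)
coeffs = [random.gauss(0, 1) for _ in range(10000)]
t1, thm, t2, t3, thp, t4 = pick_thresholds(coeffs)
assert t1 < thm < t2 < 0 < t3 < thp < t4  # condition 1)
```

With `ratio=0.3`, about 30% of the nonpositive coefficients fall in (*T*1, *Th*(minus)), matching the setting used in the paper.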

## (2) Encoding

In the preprocessing of the audio data prior to encoding, the scaling coefficients *V* of the MRA sequence are separated into five sets (*G*_{0} to *G*_{4}), as shown in Fig. 1, according to the following criteria:

- *G*_{0} = {*V* | *V* ∈ *V*^{SC}, *V* ≤ *T*1},
- *G*_{1} = {*V* | *V* ∈ *V*^{SC}, *T*1 < *V* < *T*2},
- *G*_{2} = {*V* | *V* ∈ *V*^{SC}, *T*2 ≤ *V* ≤ *T*3},
- *G*_{3} = {*V* | *V* ∈ *V*^{SC}, *T*3 < *V* < *T*4},
- *G*_{4} = {*V* | *V* ∈ *V*^{SC}, *T*4 ≤ *V*},

where *V*^{SC} is the set of scaling coefficients in the audio data file.

The scaling coefficients of the MRA sequence are encoded according to the following rules, where *V*_{i} denotes scaling coefficient *i*: when *V*_{i} ∈ *G*_{0}, *c*_{i} = 0; when *V*_{i} ∈ *G*_{1}, *c*_{i} = 1; when *V*_{i} ∈ *G*_{2}, *c*_{i} = 2; when *V*_{i} ∈ *G*_{3}, *c*_{i} = 3; and when *V*_{i} ∈ *G*_{4}, *c*_{i} = 4. We represent the scaling coefficients in each set *G*_{j} by their average value, *m*_{j}. For the formation of audio data, we use a code *C*, which is the sequence of the *c*_{i} and *m*_{j} defined above.
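The five-level coding rule can be sketched directly from the set definitions above; `encode` is a hypothetical helper name, and the thresholds here are illustrative values rather than ones derived from real audio.

```python
def encode(coeffs, t1, t2, t3, t4):
    """Quantize scaling coefficients into the five sets G0..G4 and record
    each set's average m_j (a sketch of the 5-level coding rule)."""
    def level(v):
        if v <= t1:    return 0   # G0: V <= T1
        elif v < t2:   return 1   # G1: T1 < V < T2
        elif v <= t3:  return 2   # G2: T2 <= V <= T3
        elif v < t4:   return 3   # G3: T3 < V < T4
        else:          return 4   # G4: T4 <= V
    c = [level(v) for v in coeffs]
    m = []
    for j in range(5):
        members = [v for v, cj in zip(coeffs, c) if cj == j]
        m.append(sum(members) / len(members) if members else 0.0)
    return c, m  # the code C is the pair of sequences (c_i, m_j)

c, m = encode([-3.0, -1.0, 0.0, 1.0, 3.0], -2.0, -0.5, 0.5, 2.0)
print(c)  # [0, 1, 2, 3, 4]
```

Each coefficient contributes one index *c*_{i}, and each set contributes one representative *m*_{j}.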

## 2.1.3. Use of eight quantization levels for scaling coefficients

Here, we define eight sets, *G*_{8,0} to *G*_{8,7}, as follows:

- *G*_{8,0} = {*V* | *V* ∈ *V*^{SC}, *V* ≤ *T*1},
- *G*_{8,1} = {*V* | *V* ∈ *V*^{SC}, *T*1 < *V* < *Th*(minus)},
- *G*_{8,2} = {*V* | *V* ∈ *V*^{SC}, *Th*(minus) ≤ *V* ≤ *T*2},
- *G*_{8,3} = {*V* | *V* ∈ *V*^{SC}, *T*2 < *V* < 0},
- *G*_{8,4} = {*V* | *V* ∈ *V*^{SC}, 0 ≤ *V* ≤ *T*3},
- *G*_{8,5} = {*V* | *V* ∈ *V*^{SC}, *T*3 < *V* < *Th*(plus)},
- *G*_{8,6} = {*V* | *V* ∈ *V*^{SC}, *Th*(plus) ≤ *V* ≤ *T*4},
- *G*_{8,7} = {*V* | *V* ∈ *V*^{SC}, *T*4 < *V*}.

Again, we let the representative value for each set *G*_{8,i} be its average, *m*_{8,i}. For the formation of audio data, we use the code *C*_{8}, which is the sequence of the *c*_{8,i}, defined for eight quantization levels in the same manner as the *c*_{i} described in Section 2.1.2, and of the *m*_{8,j} defined above.

## 2.1.4. Use of 16 quantization levels for scaling coefficients

## (1) Parameter setting

The values of *T*1*m*, *T*1*p*, *T*2*m*, *T*2*p*, *T*3*m*, *T*3*p*, *T*4*m*, and *T*4*p*, which are the parameters for controlling the authentication precision, are chosen to satisfy the following conditions:

- 1) *T*1*m* < *T*1 < *T*1*p* < *Th*(minus) < *T*2*m* < *T*2 < *T*2*p* < 0 < *T*3*m* < *T*3 < *T*3*p* < *Th*(plus) < *T*4*m* < *T*4 < *T*4*p*
- 2) The value of *T*1*m* is defined so that it equally divides the scaling coefficients in [*V*min, *T*1] into two groups; *T*1*p*, *T*2*m*, ..., *T*4*p* are defined similarly to *T*1*m*.

## (2) Encoding

Sixteen sets, *G*_{16,0} to *G*_{16,15}, are defined as follows:

- *G*_{16,0} = {*V* | *V* ∈ *V*^{SC}, *V* ≤ *T*1*m*},
- *G*_{16,1} = {*V* | *V* ∈ *V*^{SC}, *T*1*m* < *V* < *T*1},
- *G*_{16,2} = {*V* | *V* ∈ *V*^{SC}, *T*1 ≤ *V* ≤ *T*1*p*},
- *G*_{16,3} = {*V* | *V* ∈ *V*^{SC}, *T*1*p* < *V* < *Th*(minus)},
- *G*_{16,4} = {*V* | *V* ∈ *V*^{SC}, *Th*(minus) ≤ *V* ≤ *T*2*m*},
- *G*_{16,5} = {*V* | *V* ∈ *V*^{SC}, *T*2*m* < *V* < *T*2},
- *G*_{16,6} = {*V* | *V* ∈ *V*^{SC}, *T*2 ≤ *V* ≤ *T*2*p*},
- *G*_{16,7} = {*V* | *V* ∈ *V*^{SC}, *T*2*p* < *V* < 0},
- *G*_{16,8} = {*V* | *V* ∈ *V*^{SC}, 0 ≤ *V* ≤ *T*3*m*},
- *G*_{16,9} = {*V* | *V* ∈ *V*^{SC}, *T*3*m* < *V* < *T*3},
- *G*_{16,10} = {*V* | *V* ∈ *V*^{SC}, *T*3 ≤ *V* ≤ *T*3*p*},
- *G*_{16,11} = {*V* | *V* ∈ *V*^{SC}, *T*3*p* < *V* < *Th*(plus)},
- *G*_{16,12} = {*V* | *V* ∈ *V*^{SC}, *Th*(plus) ≤ *V* ≤ *T*4*m*},
- *G*_{16,13} = {*V* | *V* ∈ *V*^{SC}, *T*4*m* < *V* < *T*4},
- *G*_{16,14} = {*V* | *V* ∈ *V*^{SC}, *T*4 ≤ *V* ≤ *T*4*p*},
- *G*_{16,15} = {*V* | *V* ∈ *V*^{SC}, *T*4*p* < *V*}.
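Since the 16 sets are delimited by 15 ordered edges, set membership reduces to a binary search over the edge list. The sketch below uses illustrative integer edges; note that the paper mixes strict and non-strict inequalities at the boundaries, whereas this simple `bisect` version puts every boundary value into the lower set.

```python
import bisect

def level_index(v, bounds):
    """Return the index (0 .. len(bounds)) of the quantization set containing v,
    where `bounds` lists the interval edges in increasing order.
    Boundary values are assigned to the lower set (a simplification)."""
    return bisect.bisect_left(bounds, v)

# 15 illustrative edges -> 16 sets, mirroring T1m < T1 < ... < T4p around zero.
bounds = [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
levels = [level_index(v, bounds) for v in (-8, -7, -6.5, 0, 7.5)]
print(levels)  # [0, 0, 1, 7, 15]
```

The same function covers the five- and eight-level cases with four and seven edges, respectively.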

As before, the representative value for each set *G*_{16,i} is its average, *m*_{16,i}. For the formation of audio data, we use the code *C*_{16}, which is the sequence of the *c*_{16,i}, defined for 16 quantization levels in the same manner as the *c*_{i} described in Section 2.1.2, and of the *m*_{16,j} defined above.

## 2.2. Audio data formation using code replacement

In this subsection, the formation of sound data is explained; for this example, we use five quantization levels for the scaling coefficients.^{1} The scaling coefficient sequence for audio data *A* is expressed as *S*(*A*)_{k} = {*x*_{1}, *x*_{2}, *x*_{3},..., *x*_{k}}, where *k* is the total number of scaling coefficients of *A* at this level. Then, the sequence *C*(*A*)_{k} = {*X*_{1}, *X*_{2}, *X*_{3},..., *X*_{k}} is determined, where *X*_{i} ∈ {0,1,2,3,4} is the element index, which indicates to which of the five sets the scaling coefficient *x*_{i} of *A* belongs. Next, the audio data *A*′ is defined as having the scaling coefficient sequence *S*(*A*′)_{k} and a value of zero for all wavelet coefficients at every level. *S*(*A*′)_{k} is defined as *S*(*A*′)_{k} = {*a*_{1}, *a*_{2}, *a*_{3},..., *a*_{k}}, where *a*_{i} is the average of the scaling coefficients of *A* in the range denoted by *X*_{i} ∈ {0,1,2,3,4} and is obtained from *A*. Then, the audio data *B*′_{A} is defined as having the scaling coefficient sequence *S*(*B*′_{A})_{k} and a value of zero for all wavelet coefficients at every level. *S*(*B*′_{A})_{k} is defined as *S*(*B*′_{A})_{k} = {*b*_{A,1}, *b*_{A,2}, *b*_{A,3},..., *b*_{A,k}}, where *b*_{A,i} is the average of the scaling coefficients of *B* in the range denoted by *X*_{i} ∈ {0,1,2,3,4} obtained from *A*. *S*(*B*′_{A})_{k} is obtained by replacing *Y*_{i} with *X*_{i} when *Y*_{i} ≠ *X*_{i}, and then replacing *b*_{i} with *b*_{A,i}, where *b*_{i} is the average of the scaling coefficients of *B* in the range denoted by *Y*_{i}. Therefore, *C*(*B*′_{A})_{k} = *C*(*A*)_{k}. As a result, *B*′_{A} is expected to be similar to *A*.

## 2.3. Data for communication

A sequence *D*1(*B*′_{A})_{n} is defined as *D*1(*B*′_{A})_{n} = {*z*_{1}, *z*_{2},..., *z*_{n}}, where *n* is the total number of cases where *Y*_{i} ≠ *X*_{i}, *z*_{p} = [|*y*_{i}|] mod 256, and the integer *p* is increased from 1 to *n*, in steps of size 1, when *Y*_{i} ≠ *X*_{i}.^{1} Here, [*x*] signifies the maximum integer that is not greater than *x*. Then, a sequence *D*2(*B*′_{A})_{n} is defined as *D*2(*B*′_{A})_{n} = {*Z*_{1}, *Z*_{2},..., *Z*_{n}}, where *n* is the total number of cases for which *Y*_{i} ≠ *X*_{i} and *Z*_{p} = *X*_{i}.^{1}

In communications between two users, the message sender and the receiver each have the secret key *B*, and the sender sends *D*1(*B*′_{A})_{n} and *D*2(*B*′_{A})_{n} to the receiver.^{1} Then, the receiver composes *B*″_{A}, which is defined in Section 2.4 and is expected to be similar to *A*.

## 2.4. Audio data composition

In this subsection, the processing of sound data composition is likewise explained using five quantization levels for the scaling coefficients as an example.^{1} The scaling coefficient sequence for audio data *B* is expressed as *S*(*B*)_{k} = {*y*_{1}, *y*_{2}, *y*_{3},..., *y*_{k}}, where *k* is the total number of scaling coefficients of *B* at this level. Then, a sequence *C*(*B*)_{k} = {*Y*_{1}, *Y*_{2}, *Y*_{3},..., *Y*_{k}} is determined, where *Y*_{i} ∈ {0,1,2,3,4} is the element index, which indicates to which of the five sets the scaling coefficient *y*_{i} of *B* belongs. *S*(*B*′)_{k} is defined as *S*(*B*′)_{k} = {*b*_{1}, *b*_{2}, *b*_{3},..., *b*_{k}}, where *b*_{i} is the average of the scaling coefficients of *B* in the range denoted by *Y*_{i} ∈ {0,1,2,3,4} and is obtained from *B*.

A sequence *D*3(*B*)_{k} is defined as *D*3(*B*)_{k} = {*z*_{B,1}, *z*_{B,2},..., *z*_{B,k}}, where *k* is the total number of scaling coefficients of *B* at this level and *z*_{B,q} = [|*y*_{q}|] mod 256. *B*″_{A} is determined as follows: *S*(*B*″_{A})_{k} is calculated from *S*(*B*′)_{k} by replacing *b*_{q}, where *z*_{B,q} = *z*_{p}, for *p* = 1,…, *n*; then the audio data *B*″_{A} is composed by applying the inverse DWT (IDWT) to the scaling coefficient sequence *S*(*B*″_{A})_{k} with a value of zero for all wavelet coefficients at every level. The receiver composes *B*″_{A} from *D*1(*B*′_{A})_{n} and *D*2(*B*′_{A})_{n}, which are determined by both *A* and *B* and are sent by the sender, and from *B*, which the receiver has obtained prior to the conversation. *B*″_{A} is expected to be similar to *A*.

## 2.5. Data reduction

## 2.5.1. Processing for *D*1

Because *z*_{p} = [|*y*_{i}|] mod 256, *z*_{p} is in the range from 0 to 255, and thus it can be expressed using 8 bits. In our computer, an integer is represented by 32 bits. Therefore, four values of *z*_{p}, each expressed using 8 bits, can be integrated into a single value expressed by 32 bits. For *D*1(*B*′_{A})_{n} = {*z*_{1}, *z*_{2},..., *z*_{n}}, *z*′_{j} is defined as *z*′_{j} = *z*_{4i–3} + *z*_{4i–2} × 256 + *z*_{4i–1} × 256^{2} + *z*_{4i} × 256^{3}, where *i* and *j* are natural numbers. As a result, we obtain the sequence *D*1′(*B*′_{A})_{m} = {*z*′_{1}, *z*′_{2},..., *z*′_{m}}; when *n* mod 4 ≠ 0, the final group is padded with *z*_{4m+k–2} = 0 (*k* = 0,..., |*n* mod 4 – 3|). Here, [*x*] is defined as in Section 2.3. In the first case of the above formula on *m*, the total amount of data of *D*1′ stored in a computer is thus one quarter of that stored for *D*1. However, the total amount of data sent to a receiver depends on the way in which the data are expressed.
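A sketch of the 4-into-1 packing. The paper states only that four 8-bit values occupy one 32-bit integer with zero padding; placing the lowest-indexed value in the least significant byte is our assumption, by analogy with the *D*2 packing formulas.

```python
def pack_d1(z):
    """Pack four 8-bit values into one 32-bit integer, lowest index in the
    least significant byte (byte order is an assumption, not from the paper)."""
    padded = z + [0] * (-len(z) % 4)  # zero-pad to a multiple of four
    return [padded[i] | padded[i + 1] << 8 | padded[i + 2] << 16 | padded[i + 3] << 24
            for i in range(0, len(padded), 4)]

packed = pack_d1([1, 2, 3, 4, 5])
print(packed)  # [67305985, 5] i.e. [0x04030201, 0x00000005]
```

Unpacking is the reverse: mask out successive bytes with `& 0xFF` after shifting.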

## 2.5.2. Processing for *D*2

## (1) Case of five quantization levels

*D*2(*B*′_{A})_{n} = {*Z*_{1}, *Z*_{2},..., *Z*_{n}} and *D*2′(*B*′_{A})_{l} = {*Z*′_{1}, *Z*′_{2},..., *Z*′_{l}}, where *Z*′_{j} = *Z*_{13i–12} + *Z*_{13i–11} × 5 + *Z*_{13i–10} × 5^{2} + ∙∙∙ + *Z*_{13i} × 5^{12}, are defined as described in Section 2.5.1.

## (2) Case of eight quantization levels

*D*2(*B*′_{A})_{n} = {*Z*_{1}, *Z*_{2},..., *Z*_{n}} and *D*2″(*B*′_{A})_{r} = {*Z*″_{1}, *Z*″_{2},..., *Z*″_{r}}, where *Z*″_{j} = *Z*_{10i–9} + *Z*_{10i–8} × 8 + *Z*_{10i–7} × 8^{2} + ∙∙∙ + *Z*_{10i} × 8^{9}, are defined as described in Section 2.5.1.

## (3) Case of 16 quantization levels

*D*2(*B*′_{A})_{n} = {*Z*_{1}, *Z*_{2},..., *Z*_{n}} and *D*2‴(*B*′_{A})_{s} = {*Z*‴_{1}, *Z*‴_{2},..., *Z*‴_{s}}, where *Z*‴_{j} = *Z*_{8i–7} + *Z*_{8i–6} × 16 + *Z*_{8i–5} × 16^{2} + ∙∙∙ + *Z*_{8i} × 16^{7}, are defined as described in Section 2.5.1.

## 3. Numerical Experiment

We applied the proposed method using several voice recordings for *A* and, for *B*, two recordings of music, one classical and the other hip-hop. The music was taken from a copyright-free database.^{2} In all cases, all of the produced *B*″_{A} were audible and sounded similar to *A*; each *B*″_{A} was made with five, eight, or 16 quantization levels. An increase in the quantization level improved the sound quality, because a waveform made from *B*″_{A} with a higher quantization level was more similar to the original waveform than one made with a lower quantization level, as shown in Fig. 2. For (1), (2), and (3) in Section 2.5.2, the data reduction for one minute of audio data at 44.1 kHz, 16 bits, a single channel, and a volume of 87 KB was as follows:

- (1) *D*1 (75 KB) → *D*1′ (48 KB), *D*2 (49 KB) → *D*2′ (9 KB)
- (2) *D*1 (86 KB) → *D*1′ (55 KB), *D*2 (57 KB) → *D*2″ (21 KB)
- (3) *D*1 (92 KB) → *D*1′ (59 KB), *D*2 (65 KB) → *D*2‴ (29 KB)

## 4. Conclusion

We developed a secure communication method using a discrete wavelet transform for audio data; we used an increased number of quantization levels for the scaling coefficients along with a data reduction technique. The waveform produced by the proposed method was more similar to the original one than that produced by our previously proposed method.^{1}

## References

### Cite this article

Kouhei Nishimura, Yasunari Yoshitomi, Taro Asada, Masayoshi Tabuse, "A Method for Secure Communication Using a Discrete Wavelet Transform for Audio Data and Improvement of Speaker Authentication," Journal of Robotics, Networking and Artificial Life, Vol. 5, No. 2, pp. 93–96, 2018. DOI: 10.2991/jrnal.2018.5.2.4