Blur2Sharp: A GAN-based Model for Document Image Deblurring

Model for document deblurring using cycle-consistent adversarial networks


Introduction
Thanks to mobile technology, we can capture documents easily and at any moment. Text documents such as bank documents, advertisements, courier receipts, hand-written notes, digitized receipts, public information signboards and information panels captured by portable cameras are very common in our daily lives. Portable cameras offer great convenience for acquiring and recording information, as they provide an alternative for document acquisition in less constrained imaging environments than personal scanners. However, due to variations in the imaging conditions as well as in the target document type, many factors can degrade the images, such as image blurring, sharing distortions, geometrical warping or noise pollution. Frequently, the motion blur caused by camera shake and the out-of-focus blur affect the quality of the images obtained by mobile devices. Although blur may not seem very relevant at first sight, it can cause problems in subsequent processing tasks such as text segmentation or Optical Character Recognition (OCR). Thus, traditional scanner-based OCR systems cannot be directly applied to camera-captured images, and an additional level of processing is needed. The objective of this work is to study a new model for removing different types of blur from real blurry document images and generating the corresponding sharp images. Fig. 1 illustrates the difference between source blurred images (left) and sharp images (right).
Various deblurring techniques based on blur kernel estimation have been proposed so far. They fall into two groups: blind deconvolution methods and non-blind deconvolution methods. In non-blind deconvolution methods, some knowledge about the blur kernel is available. In contrast, in blind deconvolution methods no information about the blur kernel is known. Blind deblurring estimates a latent sharp image and a blur kernel, namely the Point Spread Function (PSF), from a blurred image. This problem has been extensively studied (see section 2). However, generic methods have not been effective with real-world blurred images.
The proposal of this work is to remove blur in order to restore an image that was captured with a handheld camera or smartphone. We want to show how mobile phones, whose image quality and computing power keep improving, can replace desktop scanners. For this purpose, we propose to extend a Cycle-Consistent Generative Adversarial Network (CycleGAN) for translating a blurred input text image into a sharp one. Recent methods based on Generative Adversarial Networks (GAN) for tasks such as image-to-image translation 5 depend on the availability of training examples where the same data is available in both domains. However, CycleGAN 18 is able to learn such pair information without a one-to-one mapping between training data in the source and target domains. The challenge of this work is to propose a new architecture based on CycleGAN, which we call 'Blur2Sharp CycleGAN', for the task of text document deblurring.
The rest of this work is organized as follows. Section 2 reviews the state of the art in blind deconvolution methods. Section 3 describes our suggested system. The evaluation and experimental results are provided in section 4. Finally, section 5 provides some concluding remarks.

State of the art
Given its wide range of applications, document image deblurring has attracted considerable attention in recent years, and various approaches have been proposed, especially in the field of blind deconvolution methods.
In blind deconvolution methods, the first task is to estimate the blur kernel. After this, the second part of the method is to restore the final latent image thanks to a non-blind deblurring algorithm. With the assumption that the blur is uniform and spatially invariant, the mathematical formulation of the blurry image can be modeled as Y = X * K + ε, where X is a latent sharp image, K is an unknown blur kernel, * denotes the convolution operator, ε is the additive noise and Y is the blurred observation.
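The uniform blur model above can be illustrated with a short simulation. The following is only a sketch under stated assumptions: the function name `blur`, the zero padding at the borders, the restriction to odd-sized kernels and the Gaussian form of the noise ε are illustrative choices that are not taken from the methods reviewed here.

```python
import numpy as np


def blur(x, k, noise_std=0.0, rng=None):
    """Simulate Y = X * K + eps for a 2D grayscale image.

    x: sharp image [H, W]; k: blur kernel with odd height and width.
    Borders are zero-padded so that Y has the same size as X (assumption).
    """
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    kf = k[::-1, ::-1]  # flip the kernel to obtain a true convolution
    h, w = x.shape
    y = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            y += kf[i, j] * xp[i:i + h, j:j + w]
    if noise_std > 0:
        if rng is None:
            rng = np.random.default_rng(0)
        y = y + rng.normal(0.0, noise_std, size=y.shape)  # additive noise eps
    return y


# A single bright pixel blurred by a horizontal 3-pixel motion kernel
x = np.zeros((5, 5))
x[2, 2] = 1.0
k = np.ones((1, 3)) / 3.0
y = blur(x, k)
```

In this toy case the energy of the bright pixel is spread evenly over three neighbouring pixels of the same row, which is the effect that blind deblurring has to invert without knowing K.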
Chen et al. 1 suggested an effective document image deblurring algorithm based on a Bayesian method that analyzes the local structure of a document image and uses an intensity probability density function as a prior for deblurring. However, it is not generally applicable to blurred text images because it depends on text segmentation. Cho et al. 2 proposed another Bayesian method that takes into account more specific properties of text documents. Pan et al. 13 proposed another approach for text deblurring that exploits L0-regularized intensity and gradient priors. Nayef et al. 10 suggested a method for document image deblurring that uses sparse representations improved by non-local means. Zhang et al. 17 used a gradient histogram preservation strategy for document image deblurring.
More recently, Ljubenovic et al. 8 proposed the use of a dictionary-based prior for class-adapted blind deblurring of document images. Using a large set of motion blurred images with the associated ground-truth blur kernels, Pan et al. 12 proposed a method to learn data fitting functions. Last, Jiang et al. 6 proposed a method based on the two-tone prior for text image deblurring.
Convolutional Neural Networks (CNN) have also been applied to various image enhancement tasks 7,15 . Focusing on text image deblurring, Hradiš et al. 4 proposed an end-to-end method to generate sharp images from blurred ones using a CNN. Their network, consisting of 15 convolutional layers, is trained on 3 classes (blur, sharp image and blur kernel), but it has some disadvantages: on the one hand, pairwise images are required to train the network; on the other hand, the result of deblurring is not appropriate for natural images that have a color background. Pan et al. 14 presented a generic approach with two main ideas: modifying the prior to assume that the dark channel of natural images is sparse instead of zero, and imposing sparsity for kernel estimation. They used a CNN to predict the blur kernel.
In the past few years, Generative Adversarial Networks (GAN) have been used in different image-related applications 3 such as generating synthetic data, style transfer, super-resolution, denoising, deblurring or text-to-image generation. Using this approach, Xu et al. 16 proposed a method for jointly deblurring and super-resolving images that are typically degraded by out-of-focus blur. However, they focus on two classes of images: low-resolution blurred face and text images. Based on disentangled representations, Lu et al. 9 presented an unsupervised method for domain-specific, single-image deblurring. Nimisha et al. 11 also used a GAN to build an end-to-end deblurring network. Using an adversarial loss, the network learns a strong prior on the clean image domain and maps the blurred image to its clean equivalent.
In summary, the previous approaches cannot be generalized to deal with different scenarios in document text deblurring. Therefore, building a deep learning method that learns a general image prior able to handle different scenarios remains a challenging task.

Proposed Method
Image deblurring aims to restore a clean image from a blurred one in order to improve the result of OCR. As mentioned in the introduction, this work explores the feasibility of applying CycleGAN 18 with a new architecture to the challenging task of cleaning blurred text images. We want to demonstrate that the CycleGAN approach is suitable for text image deblurring and yields good results. Our idea is that if blur and sharpness are treated as a kind of image style, successful image deblurring may be achieved with an unpaired image dataset based on CycleGAN. The main advantage of CycleGAN is that it does not require a pairwise image dataset. To solve the issue of not having a ground-truth document for every input, CycleGAN proposes a reverse mapping function Gy. If a generator function Gx translates a document from domain X to domain Y by changing only its appearance, it should be possible to map the image back to domain X with another generator function Gy and reconstruct the initial image: X → Gx(X) → Gy(Gx(X)) ≈ X. The functions can also be applied in reverse, Y → Gy(Y) → Gx(Gy(Y)) ≈ Y, so no ground-truth documents are needed for deblurring. The model also contains the associated adversarial discriminators Dy and Dx. Dy encourages Gx to translate X into outputs indistinguishable from domain Y, and vice versa for Dx and Gy. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again, we should arrive where we started. Fig. 2 shows a schema of this model, which allows us to train our network with unpaired images to translate from blurred to sharp or vice versa. The model thus comprises two generators and two discriminators. The main goal of the generators is to learn the mapping between the two image domains.
In addition, the two adversarial discriminators aim to distinguish real images from translated images in each domain; for instance, one of them discriminates between real sharp images and generated sharp images. The following subsections explain the details of the generative and discriminator networks.

Generative network
As shown in Fig. 4, both generators consist of three parts: an encoder, a transformer and a decoder. This figure summarizes the architecture of these three parts, which has shown good performance for image deblurring.
Encoder The input of the network has shape [256, 256, 3]. Three convolutional layers extract the features from the image. In the first layer, each filter is 7x7, the stride is 1x1 and the activation function is ReLU; thanks to padding, the shape of the output is [256, 256, 64]. The output of the first layer is then passed to the following layers. The hyperparameters of the second layer are: 128 filters of size 3x3 with stride 2; ReLU activation; and an output of shape [128, 128, 128]. The hyperparameters of the third layer are: 256 filters of size 3x3 with stride 2; ReLU activation; and an output of shape [64, 64, 256].
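The output shapes quoted for the encoder can be checked with a few lines of arithmetic. This sketch assumes 'same'-style padding, for which the spatial size after a convolution is ceil(size / stride), and 64 filters in the first layer (implied by its output shape); the helper names are hypothetical.

```python
import math


def conv_same_shape(shape, filters, stride):
    """Output shape [H, W, C] of a convolution with 'same' padding (assumed)."""
    height, width, _ = shape
    return [math.ceil(height / stride), math.ceil(width / stride), filters]


def trace_encoder(input_shape):
    """Return the output shape of each of the three encoder layers."""
    layers = [(64, 1), (128, 2), (256, 2)]  # (filters, stride) per layer
    shapes = []
    shape = list(input_shape)
    for filters, stride in layers:
        shape = conv_same_shape(shape, filters, stride)
        shapes.append(shape)
    return shapes


encoder_shapes = trace_encoder([256, 256, 3])
```

Note that the filter size (7x7 or 3x3) does not affect the output shape under this padding assumption; only the stride and the number of filters do.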
Transformer The transformer aims to transform the feature vectors of an image from one domain to the other. Its input is [64, 64, 256]. We built our model from scratch, starting with 9 ResNet blocks. However, as we did not obtain good results, we changed the model structure to improve them. After testing different possibilities, we chose the hyperparameters and structure best suited to the deblurring task: 12 ResNet blocks in which all filters have 3x3 size and the stride is 1. Therefore, the output of the transformer is [64, 64, 256].
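To illustrate why a stack of ResNet blocks leaves the feature shape unchanged, the following NumPy sketch implements a residual block with 3x3 filters and stride 1. The internal layout of the block (two convolutions with a ReLU in between, no normalization layers) is an assumption based on common ResNet designs and is not specified in the text; small tensors are used instead of [64, 64, 256] to keep the example fast.

```python
import numpy as np


def conv3x3_same(x, w):
    """3x3 convolution, stride 1, zero padding.

    x: features [H, W, C_in]; w: weights [3, 3, C_in, C_out].
    """
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out


def residual_block(x, w1, w2):
    """Residual block: x + conv(relu(conv(x))) (assumed layout)."""
    y = np.maximum(conv3x3_same(x, w1), 0.0)
    y = conv3x3_same(y, w2)
    return x + y  # skip connection keeps the input shape


def transformer(x, block_weights):
    """Apply a sequence of residual blocks, one per (w1, w2) pair."""
    for w1, w2 in block_weights:
        x = residual_block(x, w1, w2)
    return x


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 4))
weights = [(0.1 * rng.normal(size=(3, 3, 4, 4)),
            0.1 * rng.normal(size=(3, 3, 4, 4))) for _ in range(12)]
transformed = transformer(features, weights)  # same shape as features
```

Because every block adds its result to its own input, the number of channels and the spatial size must be preserved, which is consistent with the transformer input and output both being [64, 64, 256].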
Decoder The decoder works in the opposite way to the encoder. We reconstruct the image from the feature vectors by applying three deconvolution layers that mirror the parameters of the encoding step. Therefore, the output of the decoder is [256, 256, 3].

Discriminative network
The main goal of the discriminator is to extract features. It contains 5 convolutional layers with the following hyperparameters: the numbers of filters are [64,128,256,5,1]; the strides are [2,2,2,2,1]; and the activation function is LeakyReLU (slope=0.2) for all the layers except the last one, whose output is used with the Least Squares GAN loss. All the filters have 4x4 size. The principal role of the discriminator is to decide whether these features belong to one particular category or not by means of the last layer, which is a convolutional layer with output 1x1.

Loss function
There are two components of the CycleGAN objective function: an adversarial loss and a cycle consistency loss. Both are essential to obtaining good results. We train the models using the Least Squares GAN loss. In what follows, $G_x$ denotes the generator X → Y, $G_y$ the generator Y → X, and $D_x$, $D_y$ the discriminators of domains X and Y. To achieve our goal, discriminator X should return values as close as possible to 1 for real images from domain X and values close to 0 for fake images. Thus, discriminator X would minimize $(D_x(x) - 1)^2$. In the same way, it should be able to distinguish original images from generated ones: it must return 0 for images generated by the generator Y → X, so discriminator X would also minimize $(D_x(G_y(y)))^2$.
On the other hand, generator X → Y should be able to deceive discriminator Y about the authenticity of its generated images. This is achieved if the score returned by discriminator Y for the generated images is as close to 1 as possible. Therefore, generator X → Y would minimize $(D_y(G_x(x)) - 1)^2$. Finally, the most important term, the cycle loss, captures that we should be able to recover the original image using the other generator; thus, the difference between the original image and the cyclic image, $\lVert G_y(G_x(x)) - x \rVert + \lVert G_x(G_y(y)) - y \rVert$, should be as small as possible.
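The loss terms described above can be written compactly as follows. This is a minimal NumPy sketch rather than the training code of the model: the function names are hypothetical, the discriminator outputs are passed in as plain arrays, and the use of the mean absolute (L1) difference for the cycle term is an assumption borrowed from the original CycleGAN formulation.

```python
import numpy as np


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares loss for a discriminator: real -> 1, fake -> 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)


def lsgan_generator_loss(d_fake):
    """Least-squares loss for a generator: its fakes should be scored as 1."""
    return np.mean((d_fake - 1.0) ** 2)


def cycle_consistency_loss(x, x_cycled, y, y_cycled):
    """Mean absolute difference between images and their cyclic versions.

    x_cycled stands for Gy(Gx(x)) and y_cycled for Gx(Gy(y)).
    """
    return np.mean(np.abs(x_cycled - x)) + np.mean(np.abs(y_cycled - y))


# Toy scores: a discriminator that is unsure about the generated images
d_fake = np.full(4, 0.5)
generator_loss = lsgan_generator_loss(d_fake)  # (0.5 - 1)^2 = 0.25
```

In the full objective, the cycle term is typically multiplied by a weighting factor before being added to the adversarial terms; the value of that factor is not stated in the text and is therefore left out of the sketch.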

Experiments
The following subsections describe the experiments that have been performed using the proposed 'Blur2Sharp CycleGAN' method: the corpus of documents for training and testing together with the implementation details (see section 4.1), and the results that have been obtained (see section 4.2).

Datasets and implementation
With respect to the unpaired dataset used for the experiments, we use the dataset proposed by Hradiš et al. 4 . It consists of images with both defocus blur and motion blur, the latter generated by a random walk. We randomly cropped 2000 blurred image patches and 2001 clean image patches of size 256x256 from the dataset for training. For the testing phase, depicted in Fig. 5, we used the test dataset proposed by Hradiš et al. 4 , which includes 100 pairs of images (blurred images and their corresponding sharp versions) to evaluate the quality of image restoration.
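The random cropping of training patches can be sketched as follows. The function `random_patches` and its parameters are hypothetical; the actual sampling procedure (for example, how many patches are taken from each source image) is not described in the text.

```python
import numpy as np


def random_patches(image, n_patches, size=256, rng=None):
    """Crop n_patches square patches at uniformly random positions.

    image: array [H, W] or [H, W, C] with H >= size and W >= size.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    height, width = image.shape[:2]
    if height < size or width < size:
        raise ValueError("image is smaller than the requested patch size")
    patches = []
    for _ in range(n_patches):
        top = rng.integers(0, height - size + 1)   # upper bound is exclusive
        left = rng.integers(0, width - size + 1)
        patches.append(image[top:top + size, left:left + size])
    return np.stack(patches)


# Example with a synthetic 300x400 RGB image
page = np.zeros((300, 400, 3), dtype=np.uint8)
crops = random_patches(page, n_patches=5)
```

Since the blurred and clean patches are cropped independently, no correspondence between the two sets is required, which matches the unpaired training setting.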
We have implemented our Blur2Sharp CycleGAN for document deblurring in Python with the help of Keras, running the code on top of the Google TensorFlow framework. In addition, it must be noted that all computations were run on an Ubuntu server with NVIDIA Quadro P6000 GPUs. Fig. 6 shows two examples of the application of our 'Blur2Sharp CycleGAN' method on the test dataset. Although the visual result of the sharp images seems satisfactory, it is difficult to distinguish visually a real image from a synthetic one when the differences between them are small. Therefore, we have compared our method quantitatively using two metrics: Structural Similarity Index (SSIM) and Peak Signal to Noise Ratio (PSNR).

Results
The equation for SSIM is:
$$I(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where $I$ is the Structural Similarity Index; $(x, y)$ are coordinates indicating a nearby NxN window; $\sigma_x^2$, $\sigma_y^2$ are the variances of intensities in the x, y directions; $\sigma_{xy}$ is the covariance; $\mu_x$, $\mu_y$ are the average intensities in the x, y directions; and $C_1$, $C_2$ are small constants that stabilize the division.
The equation for PSNR is:
$$\mathrm{PSNR}(I_{ori}, I_{deblur}) = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}(I_{ori}, I_{deblur})}\right)$$
where MSE stands for the Mean Square Error; $I_{ori}$ is the original image; $I_{deblur}$ is the deblurred image; and MAX is the maximum possible pixel value (255 for 8-bit images). MSE is computed as follows:
$$\mathrm{MSE}(I_{ori}, I_{deblur}) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(I_{ori}(i, j) - I_{deblur}(i, j)\right)^2$$
where m, n is the size of the image. Based on the two metrics defined above, we have compared the results obtained with our proposed Blur2Sharp CycleGAN architecture and the state-of-the-art methods using the test dataset previously described in section 4.1: the 100 images extracted from the dataset proposed by Hradiš et al. 4 . The results are shown in Table 1. Our method achieves an average PSNR of 32.52 dB and an average SSIM of 0.7689. Therefore, we can conclude that our 'Blur2Sharp CycleGAN' achieves SSIM and PSNR values comparable to the state of the art, with the advantage of not having to use pairwise images for the training phase.
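For reference, the two metrics can be computed with a few lines of NumPy. This is an illustrative sketch and not the evaluation code used in the experiments: the SSIM is computed over a single window covering the whole image instead of averaging over local NxN windows, and the stabilizing constants (0.01·MAX)² and (0.03·MAX)² are the values commonly used in the literature, assumed here.

```python
import numpy as np


def mse(original, deblurred):
    """Mean square error between two images of the same size."""
    a = np.asarray(original, dtype=float)
    b = np.asarray(deblurred, dtype=float)
    return np.mean((a - b) ** 2)


def psnr(original, deblurred, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(original, deblurred)
    if err == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / err)


def ssim_global(x, y, max_val=255.0):
    """SSIM over one window spanning the whole image (simplification)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (0.01 * max_val) ** 2  # assumed stabilizing constants
    c2 = (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Two constant images differing by one gray level: MSE = 1
reference = np.full((4, 4), 100.0)
restored = reference + 1.0
score = psnr(reference, restored)  # equals 20 * log10(255)
```

Values obtained with the single-window simplification are not directly comparable with windowed SSIM scores such as those reported in Table 1.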
We have also compared the quality of the OCR output obtained with the test dataset. After applying the Tesseract software * to obtain OCR from the sharp images in the test dataset (used as reference text) and from the images returned by the methods, we have computed the average cosine similarity between the character frequency vectors derived from each pair of corresponding OCR output files. We have considered only the detection of letters and digits within the range of printable characters in ASCII encoding. As shown in Table 2, our method performs better than the methods of Pan et al. 13,14 and slightly worse than the method of Hradiš et al. 4 . It is logical that the method of Hradiš et al. performs better because it is a non-blind deconvolution method. It must also be noted that the obtained OCR is not perfect, even for the sharp images in the testing dataset, because we are using cropped images with incomplete words and lines. In addition, the results are good in terms of visual comparison with respect to the state of the art. For instance, Fig. 8 compares the results obtained with different methods on an input image that belongs to the text image deblurring dataset proposed by Hradiš et al. 4 . It can be observed that our Blur2Sharp CycleGAN method generates a sharp image with much clearer characters.
* https://opensource.google/projects/tesseract
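The OCR comparison described above can be sketched in plain Python. The sketch assumes that the OCR text has already been extracted into strings (the call to Tesseract is omitted) and that upper-case and lower-case letters are counted separately; both points, as well as the function names, are assumptions.

```python
import math
import string
from collections import Counter

# ASCII letters and digits, the only characters taken into account
ALPHABET = string.ascii_letters + string.digits


def char_frequency_vector(text):
    """Count how often each letter or digit of ALPHABET occurs in text."""
    counts = Counter(ch for ch in text if ch in ALPHABET)
    return [counts[ch] for ch in ALPHABET]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ocr_similarity(reference_text, candidate_text):
    """Similarity between the character frequencies of two OCR outputs."""
    return cosine_similarity(char_frequency_vector(reference_text),
                             char_frequency_vector(candidate_text))


# Hypothetical OCR outputs for a sharp image and a deblurred image
similarity = ocr_similarity("Invoice 2021", "Invoice 2O21")
```

Because the measure only compares character counts, it is insensitive to the order of the characters; it rewards recognizing the right characters rather than the right words.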

Conclusions
We presented a novel model that uses cycle-consistent adversarial networks for document deblurring. Our proposed 'Blur2Sharp CycleGAN' architecture adapts CycleGAN to the task of text image deblurring. We can both deblur images and blur sharp images because CycleGAN has the property of cycle-consistency. It is worth noting that, since we use unpaired images as the training dataset, we do not need the ground-truth sharp images. Our results suggest that successful image deblurring can be achieved with an unpaired image dataset using CycleGAN.
In addition, it must be noted that our model for document image deblurring obtains results comparable to the current state of the art. Moreover, the obtained PSNR values are at the same level as those of the best methods found in the research literature. This demonstrates that the idea of treating blur and sharpness as a kind of image style actually works. As a by-product, CycleGAN also allows us to generate synthetic blurred images. Last, it must be noted that our model significantly improved the speed of the deblurring process thanks to GPU acceleration.
As future work, we want to test the feasibility of 'Blur2Sharp CycleGAN' for the problem of document denoising. This is especially relevant for the processing of historical documents from the early days of printing offices (e.g., incunabula from the 15th and 16th centuries) and for the correct application of Optical Character Recognition and other text analysis processes. The documents of this period present a wide range of noise derived from the use of hand-made typefaces, varying concentrations of ink, and the natural degradation of paper.