Journal of Robotics, Networking and Artificial Life

Volume 7, Issue 1, June 2020, Pages 16 - 21

Evaluation of the Relationships between Saliency Maps and Keypoints

Authors
Ryuugo Mochizuki*, Kazuo Ishii
Center for Socio-Robotic Synthesis, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu-ku, Kitakyushu 808-0196, Japan
*Corresponding author. Email: rmochizuki@lsse.kyutech.ac.jp
Received 26 November 2019, Accepted 19 February 2020, Available Online 18 May 2020.
DOI
10.2991/jrnal.k.200512.004
Keywords
Saliency map; binary robust invariant scalable keypoint; keypoint stability
Abstract

The saliency map was proposed by Itti et al. to represent the conspicuity, or saliency, at every location in the visual field and to guide the selection of attended locations based on the spatial distribution of saliency, which acts as the trigger of bottom-up attention. If a certain location in the visual field is sufficiently different from its surroundings, we naturally pay attention to that characteristic part of the visual scene. In computer vision research, image feature extraction methods such as Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and Binary Robust Invariant Scalable Keypoints (BRISK) have been proposed to extract keypoints robust to changes in the size or rotation of target objects. These feature extraction methods are indispensable techniques for image mosaicking and Visual SLAM (Simultaneous Localization and Mapping); on the other hand, they are strongly affected by changes in photographing conditions such as luminance and defocusing. However, the relation between the human attention model, the saliency map, and feature extraction methods in computer vision has not been well discussed. In this paper, we propose a new saliency map and discuss the stability of keypoint extraction and keypoint locations using BRISK, comparing our map with other saliency maps.

Copyright
© 2020 The Authors. Published by Atlantis Press SARL.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

In recent years, many attempts have been made to select desired information from input information [1,2]. If attention models can be constructed to perform such selection, aspects of human intelligence and awareness can be implemented in computers.

According to Itti et al., saliency is defined as the property of an image that triggers bottom-up attention. Saliency arises from local conspicuity over the entire visual scene [3]. In this model (Figure 1), the input image is decomposed into luminance, color, and orientation components, and each component is then processed individually with Gaussian filters.

Figure 1

Itti’s saliency map.

When the saliency map is applied to environment recognition by mobile robots, various changes in photographing conditions are expected to affect the input image. Such changes alter the spatial frequency components of the image. If the spatial frequency changes, the response of the Gaussian filters also changes, and this effect is reflected in the saliency map. When the saliency map is then used to select keypoints of the image, changes in the saliency map affect the results of feature selection, and the input data of the detectors vary accordingly. Thus, recognition results are influenced by changes in photographing conditions.

Keypoint extraction should be affected as little as possible by changes in object size, angle, and luminance. When keypoints are applied to object detection, keypoints that are extracted repeatedly are the ideal ones to select.

In our research, we propose a method for generating saliency maps that can absorb the effect of spatial frequency changes. If the parameters of the filters can be determined automatically, the effect of spatial frequency changes on the saliency map can be diminished (Figure 2, bottom). We then evaluated the relationship between saliency and keypoints.

Figure 2

Adjustment of smoothing filters during saliency map generation against changing spatial frequency.

2. RELATED WORK

2.1. Saliency Map

Itti et al. [3] simulated human eye movement and expressed the result as saliency maps (Figure 1). In the saliency map creation process, the input image is reduced by 1/2^n, yielding nine resolutions of the image. The Center and the Surround are obtained through smoothing with a common Gaussian filter. This signal processing is analogous to the different responses of the fovea and its neighboring region in the retina to a common stimulus. All the reduced images are enlarged to the same size, and the across-scale difference image of the two components is normalized and added to obtain a map for each component (i.e., luminance, color, orientation). The saliency map is obtained by adding the maps of the three components.
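
As a rough illustration of this center-surround mechanism, the following Python/OpenCV sketch computes an across-scale difference for the luminance component only. The pyramid depth, the chosen levels, and the normalization are simplified assumptions for illustration, not the exact parameters of Itti et al. [3].

```python
import cv2
import numpy as np

def luminance_center_surround(img_bgr, center_level=2, surround_level=5):
    """Simplified across-scale difference for the luminance channel.

    The image is repeatedly halved with a Gaussian pyramid; the 'Center' is a
    fine level and the 'Surround' a coarser level. Both are resized back to
    the original resolution before taking the absolute difference.
    """
    lum = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    pyramid = [lum]
    for _ in range(max(center_level, surround_level)):
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    size = (lum.shape[1], lum.shape[0])
    center = cv2.resize(pyramid[center_level], size, interpolation=cv2.INTER_LINEAR)
    surround = cv2.resize(pyramid[surround_level], size, interpolation=cv2.INTER_LINEAR)

    diff = np.abs(center - surround)
    return cv2.normalize(diff, None, 0.0, 1.0, cv2.NORM_MINMAX)

# Example: conspicuity map of the luminance component only
# saliency_lum = luminance_center_surround(cv2.imread("lenna.png"))
```

The color and orientation components would be processed analogously before the per-component maps are normalized and combined.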

According to Frintrop et al. [4], the saliency map changes if the parameters of the Gaussian filters are changed. The ratio of the filter parameters, σc/σs, is crucial for determining saliency. Arbitrary selection of σc/σs enables fine granularity in the saliency map. However, in Itti et al. [3] and Frintrop et al. [4], the parameters cannot be adjusted according to variations in spatial frequency. As a result, the saliency map can be affected when the spatial frequency changes (Figure 2, top).

2.2. Keypoint Extraction

Keypoint extraction is often applied in robot vision for object recognition tasks [5], image stitching tasks [6], and so on. A keypoint has a coordinate and a descriptor that describes the brightness gradient in its neighborhood. In an object recognition task, keypoint correspondences are searched between a database image and a newly observed image. Scale-invariant keypoint extraction methods such as SIFT [7] and Binary Robust Invariant Scalable Keypoints (BRISK) [8] (Figure 3) have been proposed, and as a result the stability of object detection has improved. However, if photographing conditions (brightness of the environment, size of the observed object, focusing conditions, camera internal parameters, etc.) change, the number of extracted keypoints changes significantly. Stably extracted keypoints are desirable for object detection tasks in robot vision.

Figure 3

BRISK keypoint extraction process.
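
For reference, BRISK keypoints and their binary descriptors can be obtained with OpenCV as in the sketch below. The threshold value mirrors the TFAST = 20 used later in Section 4.3; the remaining parameters are library defaults rather than a confirmed reproduction of our experimental setup.

```python
import cv2

def extract_brisk_keypoints(image_path, fast_threshold=20):
    """Detect BRISK keypoints and compute their binary descriptors."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # thresh corresponds to the FAST/AGAST score threshold (T_FAST in Section 4.3)
    brisk = cv2.BRISK_create(thresh=fast_threshold)
    keypoints, descriptors = brisk.detectAndCompute(img, None)
    return keypoints, descriptors

# keypoints, descriptors = extract_brisk_keypoints("lenna.png")
# Each descriptor row is a 64-byte (512-bit) binary vector suitable for
# Hamming-distance matching (cv2.NORM_HAMMING).
```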

3. PROPOSAL OF SALIENCY MAP

3.1. Outline

In this research, we extended the approach of Frintrop et al. [4] to mitigate the effect of spatial frequency variation. The strategy is the automatic adjustment of σc and σs. In the saliency map generation process (Figure 4), the input image is first decomposed into luminance, color, and orientation components. For each component, the Center and the Surround are generated by a combination of an integral image and box filters. The parameters of the filters are automatically adjusted so that the pixel values of the across-scale difference are maximized. The across-scale differences of all three components are merged to form the saliency map.

Figure 4

Overview of proposed saliency map method.

3.2. Decomposition of Input

We utilize the CIE-Lab color system to simplify the differencing of complementary color channels. IL, Ia, and Ib denote the luminance, color (red-green), and color (blue-yellow) components, respectively. To obtain the orientation component, Haar-like filters (Figure 5) [9,10] are convolved with IL. The operation is expressed as Equation (1).

\( I_\theta(p_\rho) = \left|\, \overline{I_1(p_\rho)} - \tfrac{1}{2}\,\overline{I_2(p_\rho)} - \tfrac{1}{2}\,\overline{I_3(p_\rho)} \,\right| \)   (1)

Figure 5

Haar-like filters. (a) 0°, (b) 45°, (c) 90°, (d) 135°.
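
Equation (1) compares the mean intensity of a central filter region (I1; the overlined terms denote region means) with two flanking regions (I2, I3). The sketch below assumes a horizontal (0°) three-band layout with bands of equal height; the actual region geometry of the filters in Figure 5 may differ, so treat this only as an illustration of the computation.

```python
import numpy as np

def haar_response_0deg(lum, y, x, width=9, band=3):
    """Orientation response at (y, x) following Eq. (1), assuming a 0-degree
    Haar-like 'line' layout: a central horizontal band I1 flanked by I2 (above)
    and I3 (below), each of height `band` and width `width`.
    Assumes (y, x) lies far enough from the image border."""
    half_w = width // 2
    cols = slice(x - half_w, x + half_w + 1)
    i1 = lum[y - band // 2 : y + band // 2 + 1, cols]             # central band
    i2 = lum[y - band - band // 2 : y - band // 2, cols]          # band above
    i3 = lum[y + band // 2 + 1 : y + band + band // 2 + 1, cols]  # band below
    # Eq. (1): |mean(I1) - 1/2 * mean(I2) - 1/2 * mean(I3)|
    return abs(i1.mean() - 0.5 * i2.mean() - 0.5 * i3.mean())

# The 45, 90, and 135 degree responses follow by rotating the band layout.
```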

3.3. The Center and Surround

We align two box filters FBs and FBc centered at point pρ, as shown in Figure 6. The filters are convolved with the image to generate the Center and the Surround. The filter widths WBs and WBc are variable up to Wpmax and satisfy WBs > WBc. This arrangement is the same as in Mochizuki et al. [11].

Figure 6

Alignments of box filters.

3.4. Filter Adjustment

To obtain the across-scale differences of the luminance and color components, we maximize the pixel value of the difference Ics(pρ), as in Mochizuki et al. [11], by changing WBs(pρ) and WBc(pρ) according to Equation (2) and Figure 6.

\( \widehat{I_{cs}(p_\rho)} = \max_{W_{Bc}(p_\rho),\, W_{Bs}(p_\rho)} I_{cs}(p_\rho) = \max_{W_{Bc}(p_\rho),\, W_{Bs}(p_\rho)} \left| I_c(p_\rho) - I_s(p_\rho) \right| \)   (2)

Here, \( \widehat{W_{Bs}(p_\rho)} \) and \( \widehat{W_{Bc}(p_\rho)} \) denote the widths WBs(pρ) and WBc(pρ) that give the maximum in Equation (2).

For the orientation component, on the other hand, the sizes of the Haar-like filters are set to \( \widehat{W_{Bs}(p_\rho)} \) and \( \widehat{W_{Bc}(p_\rho)} \) to obtain the across-scale differences, and the filters are convolved with IL. The responses of the Center and the Surround are denoted as Iθ_c(pρ) and Iθ_s(pρ). The across-scale difference for each direction θ is obtained and stored in the map MO_θ(pρ).
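
A minimal sketch of the width search in Equation (2), assuming square box filters whose means are read from a summed-area table (integral image). The search step, the starting widths, and the border handling are illustrative assumptions rather than the exact procedure of the proposed method.

```python
import numpy as np

def box_mean(integral, y, x, half):
    """Mean over a (2*half+1) x (2*half+1) box centered at (y, x),
    read from a zero-padded summed-area table."""
    y0, y1 = y - half, y + half + 1
    x0, x1 = x - half, x + half + 1
    s = (integral[y1, x1] - integral[y0, x1]
         - integral[y1, x0] + integral[y0, x0])
    return s / float((2 * half + 1) ** 2)

def best_center_surround(channel, y, x, w_pmax, step=2):
    """Search widths W_Bc < W_Bs <= W_pmax maximizing |I_c - I_s| at (y, x), cf. Eq. (2).
    Assumes the largest box stays inside the image."""
    integral = np.pad(np.cumsum(np.cumsum(channel.astype(np.float64), axis=0), axis=1),
                      ((1, 0), (1, 0)), mode="constant")
    best = (0.0, None, None)                      # (|I_c - I_s|, W_Bc, W_Bs)
    for w_bs in range(3, w_pmax + 1, step):       # Surround width W_Bs
        i_s = box_mean(integral, y, x, w_bs // 2)
        for w_bc in range(1, w_bs, step):         # Center width W_Bc < W_Bs
            i_c = box_mean(integral, y, x, w_bc // 2)
            diff = abs(i_c - i_s)
            if diff > best[0]:
                best = (diff, w_bc, w_bs)
    return best
```

Because every box mean costs only four look-ups in the integral image, the exhaustive search over widths remains inexpensive even when repeated at every pixel.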

3.5. Saliency Map Generation

The maps MC (for color) and MO (for orientation) are obtained by Equations (4) and (5). The saliency map MSal is formed by merging MI, MC, and MO with Equation (6). The merging functions fmix, gmix, and hmix can be selected arbitrarily.

\( M_C = f_{mix}(M_{Ca}, M_{Cb}) \)   (4)

\( M_O = g_{mix}(M_{O\_0}, M_{O\_45}, M_{O\_90}, M_{O\_135}) \)   (5)

\( M_{Sal} = h_{mix}(M_I, M_C, M_O) \)   (6)

4. EVALUATION OF THE RELATIONSHIP BETWEEN SALIENCY AND KEYPOINT EXTRACTION

4.1. Outline of the Experiment

In this experiment, we assume that the selected image keypoints are used for object detection. We therefore evaluate the relationship between the saliency MSal and the feature stability FStb. If the number of small regions is Nq, FStb and MSal are expressed as row vectors of Nq dimensions. For illustration, however, we depict MSal and FStb in two dimensions (Figure 7). We then calculate the relationship ϕi from the inner product FStb · MSal. Saliency maps were generated by the conventional methods (the Itti method and VOCUS2) and by our proposal, and the resulting ϕi were compared. The source codes used in the experiment are Simpsal [12] by Caltech for the Itti method and [13] for VOCUS2. We chose BRISK [8] as the keypoint extraction method because its descriptor is binary; binary descriptors are reported to require shorter matching time than SIFT [7]. Furthermore, the descriptor is rotation and scale invariant.

Figure 7

Relationship between keypoint stability FStb and saliency MSal, i.

4.2. Evaluation Function

We consider two conditions for keypoints that remain stable under varying photographing conditions. First, the keypoints must be extracted at the same location; we call this property repeatability. Second, the descriptors must remain the same; we call this property similarity.

To evaluate keypoint stability, keypoint displacement has to be considered because of image flicker and changes in the observed object size. For example, under different photographing conditions the same keypoint may correspond to combination (I) or (II) in Figure 8. We therefore define small regions of Wq × Hq [Pixels] within which identical keypoints are searched.

Figure 8

Ambiguity in keypoint identification under changing photographing condition.
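
One straightforward way to realize this search, sketched below, is to assign each keypoint to the grid cell (small region) containing it and to group descriptors by region and photographing condition. The region size of 32 × 32 pixels and the data layout are assumptions for illustration, not values taken from the paper.

```python
from collections import defaultdict

def bin_keypoints(keypoints_per_condition, wq=32, hq=32):
    """Group keypoint descriptors by small region (grid cell) and by
    photographing condition i.

    keypoints_per_condition: list over conditions i, each a list of
        (x, y, descriptor) tuples.
    Returns: dict {(row, col): {i: [descriptor, ...]}}.
    """
    regions = defaultdict(lambda: defaultdict(list))
    for i, keypoints in enumerate(keypoints_per_condition):
        for x, y, desc in keypoints:
            cell = (int(y) // hq, int(x) // wq)   # small region index n_q
            regions[cell][i].append(desc)
    return regions
```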

Suppose Nkp,nq,i keypoints are extracted in the nq-th small region under the i-th photographing condition. The variance σkp,nq of the keypoint number is obtained by Equation (7), and the average number of extracted keypoints over the N variations of a parameter is obtained by Equation (8).

\( \sigma_{kp,n_q} = \frac{1}{N} \sum_{i=1}^{N} \left( N_{kp,n_q,i} - \overline{N_{kp,n_q}} \right)^2 \)   (7)

\( \overline{N_{kp,n_q}} = \frac{1}{N} \sum_{i=1}^{N} N_{kp,n_q,i} \)   (8)
The repeatability rftr,nq is obtained by normalizing σkp,nq; rftr,nq becomes larger as σkp,nq becomes smaller, as shown in Equation (9).
\( r_{ftr,n_q} = 1 - \frac{\sigma_{kp,n_q}}{\max_{n_q} \sigma_{kp,n_q}} \)   (9)
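
Equations (7)-(9) translate directly into a few lines of NumPy once the per-region, per-condition keypoint counts are available (for example, from the binning sketch above); the array layout is an assumption for illustration.

```python
import numpy as np

def repeatability(counts):
    """counts: array of shape (Nq, N) holding N_kp[nq, i], the number of keypoints
    in small region nq under photographing condition i.
    Returns r_ftr of shape (Nq,), following Eqs. (7)-(9)."""
    counts = np.asarray(counts, dtype=float)
    mean_kp = counts.mean(axis=1)                                # Eq. (8)
    sigma_kp = ((counts - mean_kp[:, None]) ** 2).mean(axis=1)   # Eq. (7)
    return 1.0 - sigma_kp / sigma_kp.max()                       # Eq. (9), assumes max > 0
```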

To obtain the similarity, we select two keypoints from the same small region (as in Figure 8) but from different photographing conditions, and calculate the Hamming distance between their descriptors. The average Hamming distance over all combinations of keypoint pairs is obtained by Equation (10). The similarity sDsc,nq is then calculated with the normalization of Equation (11) so that its range is [0,1]; sDsc,nq becomes smaller as the distance becomes larger.

\( r_{Derr,n_q} = \min_{K} \frac{\sum_{l=1}^{N-1} \sum_{m=l+1}^{N} d_H\!\left( d_{n_q,l,k_l},\, d_{n_q,m,k_m} \right)}{{}_{N}C_{2}} \cdot \frac{1}{L_D} \)   (10)

\( \text{s.t.}\; K = [k_1, k_2, k_3, \ldots, k_{N_i}], \quad l, m = 1, 2, \ldots, N_i, \quad l \neq m \)

\( s_{Dsc,n_q} = 1 - \frac{r_{Derr,n_q}}{\max_{n_q} r_{Derr,n_q}} \)   (11)
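
The core of Equation (10) is the pairwise Hamming distance between binary descriptors, normalized by the descriptor length LD. The sketch below omits the minimization over the keypoint assignment K and simply averages the pairwise distances of one descriptor per condition, so it is an illustrative simplification rather than the full procedure; LD = 512 bits corresponds to the standard BRISK descriptor.

```python
from itertools import combinations
import numpy as np

def hamming(d1, d2):
    """Hamming distance between two binary descriptors stored as uint8 arrays."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def descriptor_error(descriptors, descriptor_bits=512):
    """Simplified form of Eq. (10): mean pairwise Hamming distance over one
    descriptor per photographing condition, normalized by the length L_D."""
    pairs = list(combinations(descriptors, 2))
    if not pairs:
        return 0.0
    total = sum(hamming(a, b) for a, b in pairs)
    return total / (len(pairs) * descriptor_bits)

def descriptor_similarity(r_derr_per_region):
    """Eq. (11): normalize the error so that the similarity lies in [0, 1]."""
    r = np.asarray(r_derr_per_region, dtype=float)
    return 1.0 - r / r.max()
```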

The keypoint stability FStb,nq is calculated as a weighted sum of rftr,nq and sDsc,nq, as shown in Equation (12).

\( F_{Stb,n_q} = w\, r_{ftr,n_q} + (1 - w)\, s_{Dsc,n_q} \)   (12)

For the saliency MSal,i,nq, the maximum response of MSal is searched within each small region. The maximum saliency and the feature stability are expressed as row vectors of Nq dimensions (denoted MSal,i and FStb, respectively), and ϕi is calculated as the angle between the two vectors [Equation (13)]. Note that rftr and sDsc are calculated only for the regions where keypoints are extracted two or more times during the N variations of photographing conditions.

\( \phi_i = \cos^{-1}\!\left( \frac{M_{Sal,i} \cdot F_{Stb}}{\| M_{Sal,i} \| \, \| F_{Stb} \|} \right) \)   (13)

\( F_{Stb} = [F_{Stb,1}, F_{Stb,2}, \ldots, F_{Stb,n_q}, \ldots, F_{Stb,N_q}] \)

\( M_{Sal,i} = [M_{Sal,i,1}, M_{Sal,i,2}, \ldots, M_{Sal,i,n_q}, \ldots, M_{Sal,i,N_q}] \)

The average ϕ¯ is obtained according to Equation (14).

\( \bar{\phi} = \frac{1}{N} \sum_{i=1}^{N} \phi_i \)   (14)
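
Equations (13) and (14) amount to the angle between two non-negative vectors, averaged over the photographing conditions; a minimal NumPy version (variable names are ours) is given below.

```python
import numpy as np

def angle_deg(m_sal_i, f_stb):
    """Eq. (13): angle in degrees between the saliency vector M_Sal,i and
    the keypoint stability vector F_Stb."""
    cos_phi = np.dot(m_sal_i, f_stb) / (np.linalg.norm(m_sal_i) * np.linalg.norm(f_stb))
    return float(np.degrees(np.arccos(np.clip(cos_phi, -1.0, 1.0))))

def mean_angle(m_sal_rows, f_stb):
    """Eq. (14): average of phi_i over the N photographing conditions."""
    return float(np.mean([angle_deg(row, f_stb) for row in m_sal_rows]))
```

A smaller angle means the saliency vector points in nearly the same direction as the stability vector, i.e., salient regions tend to be the ones whose keypoints are stable.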

4.3. Method

Figure 9 shows the experimental images (Lenna, Flower, Tree, Things), which were selected from the Caltech database [14] and the Standard Image Data BAse (SIDBA) [15]. The spatial frequency spectra of the images are shown in Figure 10. Lenna is a well-known test image for image analysis. Flower has a wider spectrum than Lenna, with more high-frequency components. Likewise, Things has more high-frequency components than Tree.

Figure 9

Experimental images (*Cited from [14,15]). (a) Lenna. (b) Flower. (c) Tree. (d) Things.

Figure 10

Spatial frequency spectra returned by the FFT. The DC component is at u = v = 256. Brightness indicates the intensity of each component.
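
Spectra of this kind can be reproduced with a shifted two-dimensional FFT; for a 512 × 512 image the DC component then lies at (u, v) = (256, 256). The log scaling below is an assumption for display purposes only.

```python
import cv2
import numpy as np

def log_magnitude_spectrum(image_path):
    """2-D FFT magnitude spectrum with the DC component shifted to the center."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    spectrum = np.fft.fftshift(np.fft.fft2(img))   # DC moves to (H/2, W/2), e.g. (256, 256)
    return np.log1p(np.abs(spectrum))              # log scale for display
```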

The photographing conditions varied to change the number of extracted keypoints are IMax,i/IMax,1 for luminance and WObj,i/WObj,1 for object size, respectively, each ranging from 0.5 to 1.0 in steps of 0.1.

For changing WObj,i, we selected images without a white background (i.e., Tree and Things; see Table 2). We set TFAST = 20 (TFAST: threshold of the FAST score [8]) and IMax,1 = 255 during the adjustment of IMax,i and WObj,i. For the proposed method, Setting 1 uses Wpmax = WIM/4 and Setting 2 uses Wpmax = WIM/2, where WIM denotes the image width. The image resolution is WIM × HIM = 512 × 512 [Pixel].

4.4. Results and Discussion

Tables 1 and 2 show the relationship (ϕ¯) between MSal and rftr [part (a)] and between MSal and sDsc [part (b)], under variable IMax,i and variable WObj,i, respectively. Flower has more high-frequency components than Lenna, and Things has more high-frequency components than Tree.

(a) w = 1 (FStb = rftr)

Image    VOCUS2 (1/10)   VOCUS2 (5/10)   Itti    Proposal (Set 1)   Proposal (Set 2)
Lenna    18.13           20.63           31.61   19.94              18.04
Flower   29.00           29.64           41.39   19.72              18.98
Tree     21.85           24.34           33.97   20.76              20.10
Things   31.24           29.77           28.36   19.60              18.37

(b) w = 0 (FStb = sDsc)

Image    VOCUS2 (1/10)   VOCUS2 (5/10)   Itti    Proposal (Set 1)   Proposal (Set 2)
Lenna    17.20           18.45           33.15   20.05              17.65
Flower   28.85           29.32           42.18   19.66              18.82
Tree     18.97           21.53           33.90   19.24              17.27
Things   28.58           27.40           26.67   15.21              13.11

Bold type indicates the best value of ϕ¯ among all the settings in the table.

Table 1

Comparison of ϕ¯ (variable IMax,i)

(a) w = 1 (FStb = rftr)

Image    VOCUS2 (1/10)   VOCUS2 (5/10)   Itti    Proposal (Set 1)   Proposal (Set 2)
Tree     22.62           23.22           34.06   21.87              22.16
Things   28.74           25.83           29.56   18.96              17.68

(b) w = 0 (FStb = sDsc)

Image    VOCUS2 (1/10)   VOCUS2 (5/10)   Itti    Proposal (Set 1)   Proposal (Set 2)
Tree     24.57           25.37           35.30   23.24              23.88
Things   28.30           27.06           29.74   21.52              20.09

Bold type indicates the best value of ϕ¯ among all the settings in the table.

Table 2

Comparison of ϕ¯ (variable WObj,i)

We first discuss the comparison of ϕ¯ under variable IMax,i. Referring to Tables 1 and 2, for VOCUS2 and the Itti method, ϕ¯ was larger for images with higher spatial frequency, whereas for the proposed method ϕ¯ was less influenced by the change in spatial frequency than for the conventional methods.

Figures 11 (Lenna) and 12 (Flower) show the keypoint locations on the saliency map (left) and the histogram of saliency responses at those locations, respectively. Although the two images differ in frequency content, for the proposed method the peak of the histogram lies at higher saliency than for the other saliency maps; thus the inner product in Equation (13) becomes larger. Changing WObj,i corresponds to reducing the image, which increases the high-frequency components, and the proposed method again recorded smaller ϕ¯ than the others. As a result, the MSal of our proposal turned out to have a larger correlation between feature stability and saliency, which means that our proposal is more suitable for keypoint selection.

Figure 11

Keypoint location (Lenna). (a) Itti. (b) Proposed method (setting 2).

Figure 12

Keypoint location (Flower). (a) Itti. (b) Proposed method (setting 2).

5. CONCLUSION

In this research, we proposed a saliency map generation method that adaptively adjusts its filters to the spatial frequency of the input, aiming to prevent fluctuations in saliency caused by different input images and changes in photographing conditions. Our saliency map method turned out to be more suitable than the conventional methods for selecting keypoints that are less affected by changes in photographing conditions.

CONFLICTS OF INTEREST

The authors declare they have no conflicts of interest.

AUTHORS INTRODUCTION

Dr. Ryuugo Mochizuki

He received his Master of Engineering degree from Kyushu Institute of Technology in 2008. He was then engaged in specification testing and design of integrated circuits at Shikino High-Tech Co., Ltd. until 2013. His research topic during the doctoral course was the relationship between scale-invariant keypoint extraction and saliency in the domain of image processing. He received his PhD degree in September 2019.

Dr. Kazuo Ishii

He is a Professor at the Kyushu Institute of Technology, where he has been since 1996. He received his PhD degree in engineering from the University of Tokyo, Tokyo, Japan, in 1996. His research interests span ship and marine engineering and intelligent mechanics. He holds five patents derived from his research. His laboratory won first place in the RoboCup 2011 Middle Size League Technical Challenge. He is a member of the Institute of Electrical and Electronics Engineers, the Japan Society of Mechanical Engineers, the Robotics Society of Japan, the Society of Instrument and Control Engineers, and other societies.

REFERENCES

[5] M. Aly, M. Munich, and P. Perona, Bag of words for large scale object recognition: properties and benchmark, in Proceedings of the Sixth International Conference on Computer Vision Theory and Applications (VISAPP), 2011, pp. 299-306.
[12] MATLAB Saliency (Simpsal), Caltech Vision homepage. Available from: http://www.vision.caltech.edu/~harel/share/gbvs.php.
[13] Saliency System VOCUS2, Universität Bonn homepage. Available from: http://pages.iai.uni-bonn.de/frintrop_simone/vocus2.html.
[14] Caltech database homepage. Available from: http://www.vision.caltech.edu/Image_Datasets.
[15] Kanagawa Institute of Technology, Standard Image Data BAse (SIDBA). Available from: http://www.ess.ic.kanagawa-it.ac.jp/app_images_j.html.