International Journal of Computational Intelligence Systems

Volume 13, Issue 1, 2020, Pages 1483 - 1497

Contextualizing Support Vector Machine Predictions

Authors
Marcelo Loor1, 2, *, ORCID, Guy De Tré1, ORCID
1Department of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41 B-9000, Ghent, 9000, Belgium
2Department of Electrical and Computer Engineering, ESPOL Polytechnic University, Campus Gustavo Galindo V., Km. 30.5 Via Perimetral, Guayaquil, 09015863, Ecuador
*Corresponding author. Email: Marcelo.Loor@UGent.be
Received 10 May 2020, Accepted 6 September 2020, Available Online 22 September 2020.
DOI
10.2991/ijcis.d.200910.002
Keywords
Explainable artificial intelligence; Augmented appraisal degrees; Context handling; Support vector machine classification
Abstract

Classification in artificial intelligence is usually understood as a process whereby several objects are evaluated to predict the class(es) those objects belong to. Aiming to improve the interpretability of predictions resulting from a support vector machine classification process, we explore the use of augmented appraisal degrees to put those predictions in context. A use case, in which the classes of handwritten digits are predicted, illustrates how the interpretability of such predictions benefits from their contextualization.

Copyright
© 2020 The Authors. Published by Atlantis Press B.V.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

As the ubiquity of artificial intelligence (AI) grows, computer applications like word processors that translate documents, or videoconference applications that generate transcripts of meetings, are thoroughly satisfying business or user needs. Nevertheless, AI applications like profiling tools that predict capabilities of people without providing any explanation have to be banned in situations where transparency and accountability are mandatory [1,2]. An existing challenge in this regard is to find suitable mechanisms by which the reasons and reasoning behind computer predictions involving complex techniques can be explained with ease [3,4].

To address that challenge in predictions made by a support vector machine (SVM) classification process [5,6], we explore the use of augmented appraisal degrees (AADs) [7] for the contextualization of the evaluations that yield such predictions. Since an AAD has been conceived as a mathematical representation of a connotative meaning in an experience-based evaluation, it can be used for recording not only the level to which an object belongs (or not) to a particular class, but also the object’s features that support that level assignment. Hence, we propose a novel variant of an SVM classification process whereby the resulting predictions are augmented in such a way that those predictions are put in context and an explanation is provided. Our main motivation here is to obtain predictions that expose the aspects deemed to be relevant to the classification.

An important facet of the proposed variant is that, by explicitly representing context, it yields predictions that are more interpretable. Hence, our variant, named explainable SVM classification (XSVMC), can be used within an explainable artificial intelligence (XAI) system [8], in which users can take advantage of such interpretable predictions to make better informed decisions.

A key component of XSVMC is a novel evaluation procedure in which the most influential support vector (MISV) is used for identifying what has been relevant to the classification. This evaluation procedure, which is the main contribution of this work, contextualizes the evaluations in such a way that the forthcoming predictions can be explained with ease.

To describe how XSVMC works, we develop a process whereby handwritten numbers are evaluated to predict the class(es) those handwritten numbers belong to. A visual representation of a resulting prediction is shown in Figure 1: while the left side of this figure shows a handwritten number, which is used as input, the right side of the figure shows a representation of why the proposition “the handwritten number is a ‘3’” is true up to a specific level. The resulting prediction has also been used within an XAI system to produce the following output: “The green part suggests that the drawing is a ‘3’ with a computed grade of 0.16; yet, the red part, which a ‘3’ should have, and the gray part, which a ‘3’ should not have, indicate that it is not a ‘3’ with a computed grade of 0.64.” Notice in this example that the output not only indicates why a proposition (or prediction) is true, but also why it is not. This provides the system and users with extra information and illustrates an advantage of including explainability into AI systems.

Figure 1

Predicting handwritten numbers.

In the next section, we introduce the AAD concept and briefly describe how it can be integrated into the intuitionistic fuzzy set (IFS) concept. Then, we provide a comprehensive explanation of our novel XSVMC in Section 3 and illustrate in Section 4 how to use it. After that, other existing techniques for explaining individual predictions are reviewed in Section 5. We conclude the paper in Section 6.

2. PRELIMINARIES

As indicated previously, classification in AI is commonly understood as a process in which several objects are evaluated in order to predict the class(es) those objects belong to [9]. In this regard, a classification algorithm can look into the features of an object to evaluate the level to which this object is a member of one or more well-known classes. Using these evaluations, the algorithm can provide the best evaluated class(es) as a prediction. It is worth mentioning that herein by ‘feature’ is meant a distinctive aspect that is relevant for the classification. For instance, the level of illumination of either one pixel or a group of pixels of the handwritten number shown on the left side of Figure 1 can be deemed to be relevant for the classification of this number.

In situations where an object, say x, has features suggesting a partial membership of this object in a given class, say A, the aforementioned classification algorithm can use the framework of fuzzy set theory [10] to model the evaluation of the level to which x belongs to A. In that framework, the evaluation of a proposition having the canonical form ‘x BELONGS TO A’, meaning that x is a member of A, can mathematically be denoted by a membership grade. A membership grade is a number μA(x) in the unit interval [0,1], where 0 and 1 represent respectively the lowest and the highest membership grades. For instance, if x represents the handwritten number shown on the left side of Figure 1 and A denotes (what has been learned about) the class of handwritten 3’s, then μA(x) = 0.16 indicates the level to which this handwritten number belongs to the class of handwritten 3’s. Moreover, if B represents the class of handwritten 2’s and μB(x) denotes the level to which x belongs to B, then μA(x) < μB(x) means that the level to which x belongs to the class of handwritten 3’s is less than the level to which x belongs to the class of handwritten 2’s. In this manner, the classification algorithm can perform a numeric comparison to determine what class should be offered as a prediction.

As shown in Figure 1, the handwritten number can also have features suggesting that it does not belong to the class of handwritten 3’s – see, e.g., the right side of Figure 1, in which the gray and the red parts suggest the handwritten number is not a ‘3’. In this case, the evaluation of the proposition ‘x BELONGS TO A’ can be better described in the IFS framework [11,12] by means of an IFS element. An IFS element, say ⟨x, μA(x), νA(x)⟩, consists of the evaluated object x, a membership grade μA(x) and a nonmembership grade νA(x), where μA(x), νA(x) ∈ [0,1] must satisfy the consistency condition 0 ≤ μA(x) + νA(x) ≤ 1. For example, the proposition “the handwritten number depicted on the left side of Figure 1 is a ‘3’” can be represented by the canonical form ‘x BELONGS TO A’, where x and A denote in that order the handwritten number and the class of handwritten 3’s; thus, the evaluation of this proposition can be denoted by the IFS element ⟨x, μA(x), νA(x)⟩ = ⟨x, 0.16, 0.64⟩. In addition, the buoyancy [13] of this IFS element, i.e., ρA(x) = μA(x) − νA(x), can be used for comparing IFS elements to each other. For example, if the IFS element ⟨x, μB(x), νB(x)⟩ represents the evaluation of the proposition “the handwritten number depicted on the left side of Figure 1 is a ‘2’”, then ρA(x) < ρB(x) means that the level to which x belongs to the class of handwritten 3’s is less than the level to which x belongs to the class of handwritten 2’s. Such a comparison can be used by a classification algorithm for making a prediction.

As noticed above, while a membership grade and an IFS element make it possible to record the level(s) to which an object belongs (or not) to a given class, none of these representations enables the recording of the object’s characteristics that lead to and hence explain this (these) level(s). To record those characteristics, the idea of AADs [7] has been introduced. An AAD of an object x, say μ̂A@K(x), can be seen as a pair ⟨μA@K(x), FμA@K(x)⟩ that denotes the level μA@K(x) to which x belongs to the class A, as well as the particular collection FμA@K(x) of x’s features that are deemed to be relevant to the evaluation according to the knowledge K. For instance, the evaluation depicted on the right side of Figure 1 can be denoted by the AAD μ̂A@K(x) = ⟨0.16, FμA@K(x)⟩, where: (i) x and A represent the handwritten number on the left of Figure 1 and a class of handwritten 3’s respectively; (ii) K symbolizes the knowledge about handwritten 3’s used for the evaluation of x; and (iii) FμA@K(x) represents a collection consisting of the green pixels that indicate why x should be a ‘3’ according to K.¹

To record the characteristics that indicate why the aforementioned handwritten number should not be a ‘3’, the augmentation of IFS elements with AADs has been proposed [7]. An augmented IFS element, say ⟨x, μ̂A@K(x), ν̂A@K(x)⟩, consists of a membership AAD, μ̂A@K(x), and a nonmembership AAD, ν̂A@K(x): while μ̂A@K(x) = ⟨μA@K(x), FμA@K(x)⟩ indicates the level μA@K(x) to which x belongs to A and the collection FμA@K(x) of x’s features considered for quantifying this membership level, ν̂A@K(x) = ⟨νA@K(x), FνA@K(x)⟩ indicates the level νA@K(x) to which x does not belong to A and the collection FνA@K(x) of x’s features considered for quantifying this nonmembership level. For instance, keeping x, A, K and FμA@K(x) as given in the previous example, one can represent the evaluation depicted in Figure 1 by ⟨x, μ̂A@K(x), ν̂A@K(x)⟩ = ⟨x, ⟨0.16, FμA@K(x)⟩, ⟨0.64, FνA@K(x)⟩⟩, where FνA@K(x) represents a collection consisting of the red and the gray pixels that indicate why x should not be a ‘3’ according to K. In the next section, we explain how to use these concepts to explain predictions made by an SVM classification process.
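For readers who prefer code, the following minimal Python sketch (ours, not part of the original proposal) shows one way an AAD and an augmented IFS element could be represented; the class and field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class AAD:
    """Augmented appraisal degree: a level plus the features supporting it."""
    level: float                                      # e.g., muA@K(x) or nuA@K(x), in [0, 1]
    features: Set[str] = field(default_factory=set)   # e.g., FmuA@K(x) or FnuA@K(x)

@dataclass
class AugmentedIFSElement:
    """Augmented IFS element <x, mu_hat, nu_hat> for an object x and a class A."""
    obj: str
    membership: AAD
    nonmembership: AAD

    def buoyancy(self) -> float:
        # rhoA(x) = muA(x) - nuA(x), used to compare evaluations with each other
        return self.membership.level - self.nonmembership.level

# Example: the evaluation depicted in Figure 1
x_eval = AugmentedIFSElement(
    obj="handwritten number",
    membership=AAD(0.16, {"green pixels"}),
    nonmembership=AAD(0.64, {"red pixels", "gray pixels"}),
)
print(x_eval.buoyancy())   # -0.48
```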

3. EXPLAINABLE SVM CLASSIFICATION

As was mentioned earlier, our aim is the contextualization of SVM predictions to make them better interpretable. For that purpose, in this section we describe our novel XSVMC process, by which SVM predictions are augmented with AADs. As depicted in Figure 2, the main components of XSVMC are a learning process, an evaluation process and a prediction step. Among these components, the fundamental contribution of this work is the novel evaluation process that makes use of the MISV to contextualize the evaluations. In what follows we give details of each component.

Figure 2

A contextual view of XSVMC [14].

3.1. Learning Process

The aim of the learning process in XSVMC is to obtain a knowledge model for each class in a collection of well-known classes. To describe how it works, we make use of a process that mimics a learning behavior where a person learns about a concept (or class) by studying objects that satisfy or dissatisfy an evaluation criterion related to the concept. The process is based on the feature-influence representational model [15], which is summarized below.

3.1.1. Feature-influence representational model

Let ℱ be an m-dimensional feature space in which each dimension corresponds to a feature fj in a collection ℱ = {f1, …, fm}. Let x be an object with a collection of features ℱx. And let pA be a proposition having the canonical form ‘x BELONGS TO A’ (see Section 2). Under these considerations, the influence of the features of x on the appraisal of pA is modeled as follows:

  • The overall influence x of the features of x on the classification is given by the vector

    x = ∑_{j=1}^{m} βj f̂j,    (1)
    where βj denotes the overall importance (or weight) on the classification of fj among the features in ℱ, and f̂j is the unit vector representing the dimension related to fj in ℱ. For instance, Figure 3 depicts the overall influence of x in a 2-dimensional feature space, where ℱ = {f1, f2}. In this case, if f1 and f2 represent, e.g., two pixels in a digitized image, β1 and β2 might represent their respective levels of illumination.
    Figure 3

    Overall influence of the features of an object x.

  • A particular knowledge model about A, say KA, is represented by a line in ℱ and described by a pair ⟨ûA, tA⟩ such that: (i) ûA represents a unit vector that points to a location in ℱ where the fulfillment of pA is favored; and (ii) tA is a point on the line defined by ûA where the fulfillment of pA is neither favored nor disfavored. For instance, Figure 4 shows a particular knowledge model KA in the aforementioned 2-dimensional feature space. In this case, while the zone with the label ‘+’ represents a location where the fulfillment of pA is favored, the zone with the label ‘−’ represents a location where the fulfillment of pA is disfavored.

    Figure 4

    Characterization of a particular knowledge about A.

  • The specific influence of the features of x on the appraisal of pA is given by the vector

    xA = (x ⋅ ûA) ûA = ∑_{j=1}^{m} βjA ûA,    (2)
    where βjA ûA denotes the specific influence of fj on the appraisal of pA, and ‘⋅’ denotes the inner product. Notice that xA is the vector projection of the overall influence vector on the line that represents KA, i.e., xA corresponds to the vector projection of x on ûA. For instance, Figure 5 depicts the specific influence of f1 (i.e., the projection of β1 f̂1 onto ûA) on the appraisal of pA according to the particular knowledge model about A characterized in Figure 4. In this case, if f1 and β1 represent respectively the aforementioned pixel and level of illumination, then β1A represents the specific influence of that pixel on the appraisal of pA.
    Figure 5

    Specific influence of one of the features of x.

  • The level to which x satisfies (or dissatisfies) pA is determined by the magnitude of the vector lA defined by

    lA = xA − tA ûA,    (3)
    i.e., this level is given by
    ||lA|| = √(lA ⋅ lA).    (4)

    If the directions of lA and ûA are the same, x satisfies pA to the extent ||lA||. On the contrary, if the direction of lA is opposite to the direction of ûA, x dissatisfies pA to the extent ||lA||. For example, Figure 6 shows the vector lA that represents the resulting specific influence of x on the appraisal of pA according to the particular knowledge model about A characterized in Figure 4. Since in this case the directions of lA and ûA are the same, x satisfies pA to the extent ||lA|| – a numerical sketch of Eqs. (1)–(4) is given after Figure 6.

    Figure 6

    Resulting specific influence xA of the features of x.
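The bullet points above can be traced numerically with a short sketch (ours); the values of beta, u_A and t_A below are illustrative stand-ins for βj, ûA and tA in a 2-dimensional feature space.

```python
import numpy as np

beta = np.array([0.8, 0.3])        # overall importances beta_1, beta_2 of f_1, f_2
x_vec = beta.copy()                # Eq. (1): overall influence vector x
u_A = np.array([0.6, 0.8])         # unit directional vector of the model K_A
t_A = 0.25                         # threshold point of K_A on the line defined by u_A

x_A = (x_vec @ u_A) * u_A          # Eq. (2): projection of x on u_A
l_A = x_A - t_A * u_A              # Eq. (3): resulting specific influence vector
level = np.linalg.norm(l_A)        # Eq. (4): magnitude of l_A

# x satisfies p_A if l_A points in the same direction as u_A; otherwise it
# dissatisfies p_A to the same extent.
satisfies = (l_A @ u_A) > 0
print(level, "satisfies" if satisfies else "dissatisfies")
```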

3.1.2. Obtaining knowledge models

At this point, the feature-influence model can be used for explaining how to extract a model of the knowledge about a class A, say KA = ⟨ûA, tA⟩,² by looking into the features of each object xi in a training collection, say X0 = {x1, …, xn} (see Figure 7). Such a training collection consists of objects that satisfy the proposition pA (positive examples), as well as objects that dissatisfy that proposition (negative examples). The main steps of the algorithm proposed in a previous work [15] to extract KA are the following – the interested reader is referred to that work for a detailed description of this algorithm:

Figure 7

Training and test collections consisting of positive examples, denoted by black circles, and negative examples, denoted by white circles.

  1. For each xi ∈ X0, identify its features and put them into ℱX0.

  2. Assign an overall importance βi,j to each feature fj ∈ ℱX0 based on its overall influence on the appraisal of pA for each xi ∈ X0.

  3. Compute ⟨ûA, tA⟩ in such a way that (i) the correspondence between each xi ∈ X0 satisfying or dissatisfying pA and the resulting specific influence of its features is preserved, and (ii) both the aggregate of the specific influences that favor the fulfillment of pA and the aggregate of the specific influences that disfavor such fulfillment are maximized.

In the first step, the objects’ features that will be considered during the learning process are identified. It is worth mentioning that a feature can represent something about one or more characteristics of an object. For example, a feature can represent the presence of either one pixel or a group of pixels in a digitized handwritten number.

In the second step, an overall weight for each of the features identified in the first step is assigned based on its relative influence on the classification. For example, the level of illumination can be considered as the overall weight of a feature consisting of one pixel.

In the third step, the components of KA, i.e., ûA and tA, are adjusted in such a way that the following two conditions are (mostly) satisfied: (i) the resulting specific influence of the features of each object in the training collection is in agreement with the label assigned to the object (i.e., positive or negative example); and (ii) both the aggregate of the specific influence of positive examples and the aggregate of the specific influence of negative examples are maximized. For instance, Figures 8 and 9 illustrate, in that order, how the adjustments of tA and ûA can modify the resulting specific influence xA of x shown in Figure 6.

Figure 8

Adjusting the component tA of KA.

Figure 9

Adjusting the component ûA of KA.

The problem of finding an optimal couple ⟨ûA, tA⟩ in the third step can be related to the problem of finding an optimal separating hyperplane with an SVM, which is stated as follows [5,6]:

  • Suppose that a hyperplane H separates positive examples from negative ones. Let H+ be a hyperplane that is parallel to H and contains the nearest positive example(s) and let H− be another hyperplane that is also parallel to H and contains the nearest negative example(s). Find H such that the distance between H+ and H− is the largest.

The hyperplane H is defined by w ⋅ xi + b = 0, where w and b represent in that order the normal vector to H and the intercept term, and xi denotes any vector related to an object xi ∈ X0. An illustration of a hyperplane H is shown in Figure 10 along with the hyperplane H+, defined by w ⋅ xi + b = 1, and the hyperplane H−, defined by w ⋅ xi + b = −1. Notice that, while the normal vector w and the directional vector (DV) ûA are parallel to each other and point to the same side, the intercept term b corresponds to tA. Hence, the equations

ûA = w / ||w||    (5)
and
tA = −b / ||w||    (6)
hold and (2), (3) and (4) can be rewritten as
xA = ((x ⋅ w) / ||w||) ûA,    (7)
lA = ((x ⋅ w + b) / ||w||) ûA,    (8)
and
||lA|| = (x ⋅ w + b) / ||w||    (9)
respectively.

Figure 10

An optimal couple <ûA, tA> in relation to an optimal separating hyperplane H.

To find the values of w and b, the Euclidean distance between H+ and H−, i.e., d(H+, H−) = 2 / ||w||, should be maximized subject to the following constraints: if xi is a positive example, then w ⋅ xi + b ≥ 1; and if xi is a negative example, then w ⋅ xi + b ≤ −1. However, minimizing (1/2)||w||² is preferred. Thus, w and b are computed by the Lagrangian formulation of the linearly separable case of an SVM classifier [16], in which the value of Λ, given by the equation

Λ = (1/2)||w||² − ∑_{i=1}^{n} λi [yi(w ⋅ xi + b) − 1],    (10)
is minimized subject to yi(w ⋅ xi + b) − 1 ≥ 0 and (∀λi ∈ {λ1, …, λn})(λi ≥ 0). In this equation, yi ∈ {−1, 1} is a label that indicates whether xi is a positive example (yi = 1) or a negative one (yi = −1); and λ1, …, λn are Lagrange multipliers.

The previous problem is reformulated to an equivalent dual problem [16], which consists in finding the Lagrange multipliers such that the gradient of Λ with respect to w and b yields zero, and Λ is maximized. The conditions for the gradient of Λ, i.e., ∂Λ/∂w = 0 and ∂Λ/∂b = 0, result in

w = ∑_{i=1}^{n} λi yi xi    (11)
and
∑_{i=1}^{n} λi yi = 0,    (12)
which are introduced in (10) to obtain
Λ = ∑_{i=1}^{n} λi − (1/2) ∑_{i=1,k=1}^{n} λi λk yi yk (xi ⋅ xk).    (13)

In this equation, Λ is formulated as a function of the Lagrange multipliers only, and is maximized subject to the constraints (12) and λi ≥ 0, i = 1, …, n. In the linearly non-separable case, the last constraint is generalized to 0 ≤ λi ≤ C, i = 1, …, n, where C is called the regularization parameter [16]. The solution is given by (11) and

b = yi − (w ⋅ xi)    (14)
for any vector xi associated with 0 < λi < C, i = 1, …, n.³ The objects related to these vectors are deemed to be crucial elements in X0 since any of them can change the direction of H if removed. Because of this, these vectors are named support vectors.

In situations where the vectors are not linearly separable, those vectors can be mapped to another space in which they can be separated by a linear hyperplane. This means that a vector xi in the feature space ℱ can be mapped to a higher dimensional space, say ℝN, through a mapping ϕ: ℱ → ℝN, such that (13) can be written as

Λ = ∑_{i=1}^{n} λi − (1/2) ∑_{i=1,k=1}^{n} λi λk yi yk (ϕ(xi) ⋅ ϕ(xk)).    (15)

Instead of computing the inner product between ϕ(xi) and ϕ(xk) in a higher dimensional space, the use of a kernel function K(xi, xk) = ϕ(xi) ⋅ ϕ(xk) is preferred [5,6] – notice that K computes the inner product (or a reflection of similarity) between xi and xk in ℝN. Hence, (15) becomes

Λ = ∑_{i=1}^{n} λi − (1/2) ∑_{i=1,k=1}^{n} λi λk yi yk K(xi, xk).    (16)

Among others, the polynomial kernel of degree d, defined by

K(xi, xk) = (xi ⋅ xk + 1)^d,    (17)
and the radial basis function (RBF) kernel with parameter γ > 0, defined by
K(xi, xk) = exp(−γ ||xi − xk||²),    (18)
are examples of such kernel functions.

As noticed, SVMs can be used for the computation of the optimal couple KA = ⟨ûA, tA⟩ even if the objects in the training collection are not linearly separable.
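As an illustration of how (5), (6) and (9) connect the feature-influence model with a trained SVM, the following sketch (ours) derives ⟨ûA, tA⟩ from a linear SVM; it assumes scikit-learn's SVC and its attributes coef_, intercept_ and decision_function, which are tooling choices of ours and not part of the paper.

```python
import numpy as np
from sklearn.svm import SVC

# A toy, linearly separable training collection with +1/-1 labels
X0 = np.array([[2.0, 1.0], [1.5, 2.0], [0.2, 0.4], [0.5, 0.1]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X0, y)
w = clf.coef_[0]                   # normal vector of the hyperplane H
b = clf.intercept_[0]              # intercept term of H
w_norm = np.linalg.norm(w)

u_A = w / w_norm                   # Eq. (5)
t_A = -b / w_norm                  # Eq. (6)

x = np.array([1.0, 1.2])
level = (x @ w + b) / w_norm       # Eq. (9): signed level ||l_A||
assert np.isclose(level, clf.decision_function([x])[0] / w_norm)
print(u_A, t_A, level)
```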

3.2. Augmented Evaluation Process

A conventional classification algorithm can use the knowledge model resulting from the above-described learning process to evaluate the level to which an object is a member of a given class. For example, after obtaining a feature-influence model of the knowledge about class A, i.e., KA = ⟨ûA, tA⟩, the conventional classification algorithm can use KA to evaluate, by means of (3) and (4), the level to which an object x is a member of A. In a similar way, the algorithm can use a model about class B, say KB = ⟨ûB, tB⟩, to evaluate the level to which x is a member of B. After that, the resulting levels can be used for making a prediction about the class of x: if the level to which x is a member of A, i.e., ||lA||, is greater than the level to which x is a member of B, i.e., ||lB||, A can be returned as the predicted class of x.

If a user would like to know in the previous example why the predicted class is A, the conventional classification algorithm is limited to offering an answer like “x is more A than B because ||lA|| > ||lB||”. As noticed, nothing is mentioned in this answer about the relevant features of x that support that prediction. In this regard, the purpose of the evaluation process in XSVMC is to put those evaluations in context and, thus, help explain the forthcoming predictions. Two procedures that put those evaluations in context are explained below.

Consider a class A, a collection ℱ consisting of the features in an m-dimensional feature space, an object x in a test collection X (see Figure 7) and a proposition pA: ‘x BELONGS TO A’. Consider also a collection ℱx consisting of the features of x, as well as a collection ℱX0 consisting of the features identified after following the previous learning process with a training collection X0 (see Figure 11). Assume that KA = ⟨ûA, tA⟩ is a feature-influence representation of a particular knowledge about A. Assume also that ûA = ω1 f̂1 + ⋯ + ωm f̂m and x = β1 f̂1 + ⋯ + βm f̂m respectively represent the DV and the overall influence vector. Under these considerations, a procedure for performing an augmented evaluation of pA that yields an augmented IFS element ⟨μ̂A(x), ν̂A(x)⟩ = ⟨⟨μA(x), FμA(x)⟩, ⟨νA(x), FνA(x)⟩⟩ as a result consists of the following steps [14]:

Figure 11

Feature collections.

  1. For each feature fj in ℱx ∩ ℱX0, compute its specific influence on the appraisal of pA, i.e., compute fjA = βjA ûA = βj ωj ûA – recall from (2) that xA = β1A f̂1A + ⋯ + βmA f̂mA. If βjA > 0, include fj in FμA(x); else, include fj in FνA(x) if βjA < 0; otherwise, exclude fj from both FμA(x) and FνA(x). For instance, if the specific influence of f9 in Figure 11 is positive (i.e., β9A > 0), f9 will be included in FμA(x); likewise, if the specific influence of f7 is negative (i.e., β7A < 0), f7 will be included in FνA(x); and if the specific influence of f5 is zero (i.e., β5A = 0), f5 will be excluded from both FμA(x) and FνA(x).

  2. For each feature fj in ℱX0 ∖ ℱx, take into consideration the following rationale to decide whether or not fj should be included in either FμA(x) or FνA(x): (i) if ωj > 0, it can be considered that the nonexistence of fj in x will be against the membership of x in A and, thus, fj should be included in FνA(x); (ii) else, if ωj < 0, it can be considered that the nonexistence of fj in x will favor the membership of x in A and, thus, fj should be included in FμA(x); (iii) otherwise, it can be considered that fj neither favors nor disfavors the membership of x in A and, thus, fj should be excluded from both FμA(x) and FνA(x). For instance, if ω6 > 0 and it is considered that the nonexistence of f6 in x will be against the membership of x in A, f6 should be included in FνA(x). It is worth mentioning that, even though the features considered in this step are not part of x, their inclusion in FμA(x) or FνA(x) can help the users to be aware of what has been focused on during the evaluation of pA.

  3. For each feature fj in ℱx ∖ ℱX0, if it is considered that the existence of fj in x will be against the membership of x in A, fj should be included in FνA(x). Otherwise, fj should be excluded from both FμA(x) and FνA(x). For instance, if it is considered that the existence of f8 in x (see Figure 11) will disfavor the membership of x in A, f8 should be included in FνA(x). In a similar way to the previous step, the inclusion of the features considered in this step can help the users to get insights into what features of x have not been focused on during the evaluation of pA due to the absence of those features from the model.

  4. Compute μA(x) and νA(x) by means of the equations

    μA(x) = μ̌A(x) / ηA(x)    (19)
    and
    νA(x) = ν̌A(x) / ηA(x)    (20)
    respectively, where
    μ̌A(x) = (|tA| + ∑_{j=1}^{m} βjA) / ||x||   iff βjA > 0 ∧ tA < 0;
             (∑_{j=1}^{m} βjA) / ||x||          iff βjA > 0 ∧ tA ≥ 0;
             0                                   otherwise;    (21)
    ν̌A(x) = (|tA| + ∑_{j=1}^{m} |βjA|) / ||x||  iff βjA < 0 ∧ tA > 0;
             (∑_{j=1}^{m} |βjA|) / ||x||         iff βjA < 0 ∧ tA ≤ 0;
             0                                    otherwise;    (22)
    and
    ηA(x) = max(1, μ̌A(x) + ν̌A(x)).    (23)

It is worth mentioning that (21) and (22) are obtained as follows. Using (2), (3) can be rewritten as

lA = (∑_{j=1}^{m} βjA − tA) ûA    (24)
and, thus, (4) can be rewritten as
||lA|| = ∑_{j=1}^{m} βjA − tA.    (25)

The first term of (25) can be split into the sum of positive specific influences and the sum of negative specific influences. Thus, (25) can be rewritten as

||lA|| = ∑_{j=1, βjA>0}^{m} βjA + ∑_{j=1, βjA<0}^{m} βjA − tA.    (26)

Notice in (26) that, while the sum of positive specific influences will be increased by |tA| if tA < 0, this sum will be decreased by |tA| if tA > 0. Thus, if tA < 0, the first term of (26) along with |tA| will be taken into account for the computation of μ̌A(x) in (21). Likewise, if tA > 0, the second term of (26) along with |tA| will be taken into account for the computation of ν̌A(x) in (22). Since μA(x) and νA(x) are considered to be numbers in the unit interval [0,1], the sums of the specific influences are first divided by ||x|| in (21) and (22); then, μ̌A(x) and ν̌A(x) are divided by the result of (23) in (19) and (20) respectively.

The idea behind (21) and (22) is to quantify the levels to which each of the features of x favors or disfavors the membership of x in A. Notice in (21) that the membership level μ̌A(x) increases when a feature fj has a positive specific influence βjA. For instance, consider the specific influence of the feature f1 depicted by f1A = β1A ûA in Figure 12. Since f1A and ûA point to the same location, f1 favors the fulfillment of pA, i.e., f1 has a positive specific influence on the appraisal of pA. Likewise, notice in (22) that the nonmembership level ν̌A(x) increases when fj has a negative specific influence. This case is illustrated in Figure 12 by the specific influence f2A = β2A ûA of the feature f2. Since f2A points in the opposite direction of ûA, f2 is against the fulfillment of pA, i.e., f2 has a negative specific influence on the appraisal of pA. In this regard, the resulting specific influence vector lA is given by lA = f1A + f2A − tA ûA = (β1A + β2A − tA) ûA, where β1A > 0, β2A < 0 and tA > 0. Thus, the membership level μ̌A(x) and the nonmembership level ν̌A(x) will be μ̌A(x) = β1A / ||x|| and ν̌A(x) = (|β2A| + tA) / ||x|| respectively in this case.

Figure 12

Specific influence of two features, f1 and f2, on the appraisal of a proposition pA: ‘x BELONGS TO A’.
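The DV-based procedure can be condensed into code roughly as follows (our illustration; only step 1 and Eqs. (19)–(23) are covered, and the handling in steps 2 and 3 of features absent from x or from the model is omitted for brevity).

```python
import numpy as np

def augmented_evaluation_dv(beta, omega, t_A):
    """Sketch of the DV-based augmented evaluation.

    beta  : overall importances of the features of x (coordinates of x)
    omega : coordinates of the directional vector u_A
    t_A   : threshold of the knowledge model K_A
    Returns (mu, F_mu), (nu, F_nu) following Eqs. (19)-(23).
    """
    beta = np.asarray(beta, dtype=float)
    omega = np.asarray(omega, dtype=float)
    specific = beta * omega                    # beta_j^A = beta_j * omega_j
    F_mu = set(np.flatnonzero(specific > 0))   # features that favour p_A
    F_nu = set(np.flatnonzero(specific < 0))   # features that disfavour p_A

    norm_x = np.linalg.norm(beta)
    pos = specific[specific > 0].sum()
    neg = np.abs(specific[specific < 0]).sum()
    mu_check = (pos + (abs(t_A) if t_A < 0 else 0.0)) / norm_x   # Eq. (21)
    nu_check = (neg + (abs(t_A) if t_A > 0 else 0.0)) / norm_x   # Eq. (22)
    eta = max(1.0, mu_check + nu_check)                          # Eq. (23)
    return (mu_check / eta, F_mu), (nu_check / eta, F_nu)        # Eqs. (19), (20)

# Illustrative call with two features, in the spirit of Figure 12
(mu, F_mu), (nu, F_nu) = augmented_evaluation_dv([0.9, 0.4], [0.8, -0.6], t_A=0.1)
print(mu, F_mu, nu, F_nu)
```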

In contrast to a conventional classification algorithm, XSVMC can use the above procedure to perform a contextualized evaluation of the membership (and nonmembership) of an object in a given class. For example, to evaluate the membership of an object x in a class A, XSVMC makes use of the evaluation procedure with a model of the knowledge about A, say KA = ⟨ûA, tA⟩, to obtain ⟨μ̂A(x), ν̂A(x)⟩ as a result. Likewise, XSVMC uses the procedure with the knowledge model about another class, say KB = ⟨ûB, tB⟩, to evaluate the membership of x in B and, so, obtain ⟨μ̂B(x), ν̂B(x)⟩ as a result. Then, XSVMC can compare those evaluations to predict whether the class of x is A or B: if the buoyancy of ⟨μ̂A(x), ν̂A(x)⟩, i.e., ρA(x) = μA(x) − νA(x) (see Section 2), is greater than the buoyancy of ⟨μ̂B(x), ν̂B(x)⟩, i.e., ρB(x) = μB(x) − νB(x), the predicted class will be A. In this case, if a user would like to know why the predicted class of x is A, an XAI system that incorporates XSVMC (see Section 1) can use the previous prediction to offer an answer such as “the features in FμA(x) suggest that x belongs to A with a grade of μA(x); yet, the features in FνA(x) indicate that x does not belong to A with a grade of νA(x).”

In some situations, the collections FμA(x) and FνA(x) might include features having complex arrangements of attributes – e.g., when a polynomial kernel has been used for obtaining the knowledge model KA = ⟨ûA, tA⟩. To reduce that complexity, XSVMC includes the following alternative evaluation procedure in which the MISV is used for obtaining features having simplified arrangements of attributes.

Consider the following variant of (9)

||lA|| = (∑_{i=1}^{n} λi yi K(xi, x) + b) / ||∑_{i=1}^{n} λi yi xi||,  λi > 0,    (27)
in which w and xi ⋅ x have been replaced by (11) and K(xi, x) respectively. Recall that xi denotes any of the support vectors since λi > 0, and yi represents the value, 1 or −1, associated with it. Notice that the influence of xi on the evaluation of x is given by λi yi K(xi, x). In this regard, the support vector having the greatest positive influence on the evaluation of x can be obtained by
v = arg max_{xi} {λi yi K(xi, x) | xi ∈ S},    (28)
where S represents the collection of support vectors. From a semantic point of view, v represents the support vector most similar to x. Hence, v can be used for the identification of the features that have been relevant to the evaluation. With this consideration and representing v and x by means of v = ∑_{j=1}^{m} αj f̂j and x = ∑_{j=1}^{m} βj f̂j respectively, we compute the specific influence αj βj for each fj in the collection of x’s features, i.e., fj ∈ ℱx. In a similar way to the first step of the previous evaluation procedure, we include fj in FμA(x) if αj βj > 0; else, we include fj in FνA(x) if αj βj < 0; otherwise, we exclude fj from FμA(x) and FνA(x). It might also be assumed that fj is against the membership if αj = 0 or βj = 0.
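A rough sketch (ours) of the MISV selection in (28) and of the feature split based on αjβj is given below; it assumes access to the support vectors and to the products λi yi, for which scikit-learn, e.g., exposes the attributes support_vectors_ and dual_coef_, and it uses the polynomial kernel of Section 4.

```python
import numpy as np

def poly_kernel(a, b, d=5):
    # Polynomial kernel (a . b)^d, i.e., Eq. (17) with no constant term
    return (a @ b) ** d

def misv_feature_split(support_vectors, dual_coefs, x, kernel=poly_kernel):
    """Sketch of Eq. (28) and the MISV-based feature split.

    support_vectors : the support vectors x_i
    dual_coefs      : the products lambda_i * y_i
    x               : the object under evaluation
    """
    support_vectors = np.asarray(support_vectors, dtype=float)
    dual_coefs = np.asarray(dual_coefs, dtype=float)
    x = np.asarray(x, dtype=float)
    influences = np.array([c * kernel(sv, x)
                           for sv, c in zip(support_vectors, dual_coefs)])
    misv = support_vectors[int(np.argmax(influences))]   # Eq. (28): v
    specific = misv * x                                   # alpha_j * beta_j
    F_mu = set(np.flatnonzero(specific > 0))
    F_nu = set(np.flatnonzero(specific < 0))
    # Optionally, per the assumption mentioned in the text, features with
    # alpha_j = 0 or beta_j = 0 could also be treated as being against membership.
    return misv, F_mu, F_nu

# Illustrative call with made-up support vectors and dual coefficients
svs = [[0.9, 0.0, 0.4], [0.1, 0.8, 0.0]]
duals = [0.7, -0.5]
print(misv_feature_split(svs, duals, [0.8, 0.1, 0.3]))
```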

To obtain μA(x) and νA(x), we compute μ̌A(x) and ν̌A(x) by means of

μ̌A(x) = (∑_{i=1}^{n} λi yi K(xi, x) + b) / (||x|| ||∑_{i=1}^{n} λi yi xi||)   iff λi > 0 ∧ b > 0 ∧ λi yi K(xi, x) > 0;
         (∑_{i=1}^{n} λi yi K(xi, x)) / (||x|| ||∑_{i=1}^{n} λi yi xi||)       iff λi > 0 ∧ b ≤ 0 ∧ λi yi K(xi, x) > 0;
         0                                                                      otherwise;    (29)
and
ν̌A(x) = (∑_{i=1}^{n} |λi yi K(xi, x)| + |b|) / (||x|| ||∑_{i=1}^{n} λi yi xi||)   iff λi > 0 ∧ b < 0 ∧ λi yi K(xi, x) < 0;
         (∑_{i=1}^{n} |λi yi K(xi, x)|) / (||x|| ||∑_{i=1}^{n} λi yi xi||)          iff λi > 0 ∧ b ≥ 0 ∧ λi yi K(xi, x) < 0;
         0                                                                           otherwise;    (30)
and replace them in (19) and (20) respectively.

To obtain (29) and (30), we split (27) into the sum of positive specific influences and the sum of negative specific influences. Thus, we rewrite (27) as

||lA|| = (s+ + s− + b) / ||∑_{i=1}^{n} λi yi xi||,  λi > 0,    (31)
where
s+ = ∑_{i=1}^{n} λi yi K(xi, x),  λi yi K(xi, x) > 0,    (32)
and
s− = ∑_{i=1}^{n} λi yi K(xi, x),  λi yi K(xi, x) < 0.    (33)

Notice in (31) that, while the sum of positive specific influences increases if b > 0, this sum decreases if b < 0. Hence, (32) along with b are taken into account for the computation of μ̌A(x) in (29) if b > 0. In a similar way, the absolute value of (33) along with |b| are taken into account for the computation of ν̌A(x) in (30) if b < 0. As was done with (21) and (22), to obtain μA(x) and νA(x) the sums of the specific influences are divided by ||x|| in (29) and (30) and, then, μ̌A(x) and ν̌A(x) are divided by the result of (23) in (19) and (20) respectively.

Notice in (29) that the membership level μ̌A(x) increases when a support vector xi has a positive influence, which is computed by λi yi K(xi, x), or when the intercept term b is positive. Likewise, notice in (30) that the nonmembership level ν̌A(x) increases when a support vector has a negative influence or when the intercept term is negative. For instance, in Figure 13, while the support vector x1 has a positive specific influence x1A = (λ1 y1 K(x1, x) / ||λ1 y1 x1 + λ2 y2 x2||) ûA on the appraisal of pA, the support vector x2 has a negative specific influence x2A = (λ2 y2 K(x2, x) / ||λ1 y1 x1 + λ2 y2 x2||) ûA on the appraisal. In this case, the resulting specific influence vector lA is given by lA = x1A + x2A + bA = ((λ1 y1 K(x1, x) + λ2 y2 K(x2, x) + b) / ||λ1 y1 x1 + λ2 y2 x2||) ûA, where λ1 y1 K(x1, x) > 0, λ2 y2 K(x2, x) < 0 and b > 0. Hence, the membership level μ̌A(x) and the nonmembership level ν̌A(x) will be μ̌A(x) = (λ1 y1 K(x1, x) + b) / (||x|| ||λ1 y1 x1 + λ2 y2 x2||) and ν̌A(x) = |λ2 y2 K(x2, x)| / (||x|| ||λ1 y1 x1 + λ2 y2 x2||) respectively.
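The grades in (29) and (30) can be sketched in the same setting (our illustration; the support vectors, the products λi yi and the intercept b are assumed to be obtained as in the previous sketch).

```python
import numpy as np

def misv_grades(support_vectors, dual_coefs, b, x, kernel):
    """Sketch of Eqs. (29), (30), (23), (19) and (20) for the MISV procedure."""
    support_vectors = np.asarray(support_vectors, dtype=float)
    dual_coefs = np.asarray(dual_coefs, dtype=float)
    x = np.asarray(x, dtype=float)
    contrib = np.array([c * kernel(sv, x)
                        for sv, c in zip(support_vectors, dual_coefs)])
    denom = np.linalg.norm(x) * np.linalg.norm(
        np.sum(dual_coefs[:, None] * support_vectors, axis=0))

    pos = contrib[contrib > 0].sum() + (b if b > 0 else 0.0)               # Eq. (29)
    neg = np.abs(contrib[contrib < 0]).sum() + (abs(b) if b < 0 else 0.0)  # Eq. (30)
    mu_check, nu_check = pos / denom, neg / denom
    eta = max(1.0, mu_check + nu_check)                                    # Eq. (23)
    return mu_check / eta, nu_check / eta                                  # Eqs. (19), (20)

# e.g., misv_grades(svs, duals, b=0.2, x=[0.8, 0.1, 0.3], kernel=poly_kernel)
# with the objects defined in the previous sketch
```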

Figure 13

Specific influence of support vectors x1 and x2 on the appraisal of a proposition pA: ‘x BELONGS TO A’.

It is worth mentioning that the main difference between both aforementioned evaluation procedures lies in the strategy to identify which features have been relevant to the evaluation: while the first evaluation procedure makes use of the DV ûA, which incorporates all the support vectors, the second procedure uses only the support vector with the greatest positive influence on the evaluation.

4. USE CASE

Aiming to illustrate how the novel XSVMC works, in this section we implement a use case where the classes of handwritten digits are predicted. This use case consists of a successful scenario, in which the right class is predicted, and an unsuccessful scenario, in which a wrong class is predicted.

To implement the use case, we use two collections of digitized handwritten numbers: the first one is a very small collection consisting of the handwritten numbers depicted in Figures 14–16, and the second one is the MNIST collection [18], which contains 70000 handwritten numbers.

Figure 14

User 1’s training collection (X0@usr1).

Figure 15

User 2’s training collection (X0@usr2).

Figure 16

User 1’s test collection (X@usr1).

As shown in Figure 17, such a digitized handwritten number consists of 784 pixels, each associated with a value between 0 and 1, where 0 and 1 denote, in that order, no strength and the maximum strength of a pen while handwriting on that pixel.

Figure 17

Characterization of a handwritten ‘5’.

To use the learning and evaluation procedures included in XSVMC, each digitized handwritten number has been modeled in a 784-dimensional feature space as a feature-influence vector x = β1 f̂1 + ⋯ + β784 f̂784, such that βj denotes the strength of the pen on pixel fj. For instance, while in Figure 17 the value of β58 is 0 since no strength has been put on pixel f58, the value of β275 is 0.99 since the strength of the pen on this pixel is almost the maximum.
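A small sketch (ours) of this encoding, assuming a digit is given as a 28×28 array of grey values in [0, 255] as in the raw MNIST files:

```python
import numpy as np

def to_feature_vector(image_28x28):
    """Map a digitized handwritten number to x = beta_1 f_1 + ... + beta_784 f_784."""
    beta = np.asarray(image_28x28, dtype=float).reshape(784) / 255.0
    return beta   # beta_j in [0, 1] is the pen strength on pixel f_j

digit = np.zeros((28, 28))
digit[10:18, 14] = 250              # a crude vertical stroke
x = to_feature_vector(digit)
print(x.shape, round(x.max(), 2))   # (784,) 0.98
```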

4.1. XSVMC on a Very Small Collection

An SVM classification process can be effective in cases where the dimension of the feature space is greater than the number of samples [5,6]. To illustrate how XSVMC works in such cases, we use the collections X0@usr1 (see Figure 14) and X0@usr2 (see Figure 15), which include handwritten numbers given by two users, say usr1 and usr2.

We first use X0@usr1 as a training collection to obtain the knowledge models for the classes of handwritten ‘3’s and handwritten ‘8’s given by usr1. Then, to evaluate the level to which the handwritten number x depicted in Figure 16 satisfies the propositions “x BELONGS TO ‘8’” and “x BELONGS TO ‘3’”, we use these models as input for the evaluation procedures described in Section 3.2, namely the procedure based on the DV and the procedure based on the MISV.

The results of those contextualized evaluations are listed in Table 1 and Figure 18. Notice in Table 1 that the levels computed with DV are the same as the levels computed with MISV. This observation makes the equivalence between (21) and (29), as well as the equivalence between (22) and (30), evident. In contrast, the visual representations listed in Figure 18 provide evidence of the difference between the context of the evaluation obtained with DV and the context of the evaluation obtained with MISV. In these representations, while the green parts suggest that a handwritten number is part of a class, the red part, which a member of the class should have, and the gray part, which a member of the class should not have, indicate that the handwritten number is not part of the class. Notice in this case that, even though a difference between those contexts exists, this difference is rather small.

Figure 18

Visual results of the evaluations listed in Table 1.

DV MISV
μ̌‘8’(x) 1.4317 1.4317
ν̌‘8’(x) 1.2104 1.2104
η‘8’(x) 2.6422 2.6422
μ‘8’(x) 0.5419 0.5419
ν‘8’(x) 0.4581 0.4581
ρ‘8’(x) 0.0838 0.0838
μ̌‘3’(x) 1.2104 1.2104
ν̌‘3’(x) 1.4317 1.4317
η‘3’(x) 2.6422 2.6422
μ‘3’(x) 0.4581 0.4581
ν‘3’(x) 0.5419 0.5419
ρ‘3’(x) −0.0838 −0.0838
Table 1

Results of the evaluations of “x BELONGS TO ‘8’” and “x BELONGS TO ‘3’”, where ‘8’ and ‘3’ are two classes learned through X0@usr1 and x is the handwritten number depicted in Figure 16.

A potential explanation for such a small difference is that the DV, which incorporates all the support vectors, and the MISV are substantially similar, since the knowledge models used for the previous evaluations have been obtained from a training collection consisting of numbers written by only one person, namely usr1.

To obtain further insight in that regard, we use X0@usr1 ∪ X0@usr2 (see Figures 14 and 15) as a training collection to obtain the knowledge models for the classes of handwritten ‘3’s and handwritten ‘8’s given by both usr1 and usr2. The resulting models were used as input of DV and MISV for the evaluation of the number depicted in Figure 16.

The results of those evaluations are listed in Table 2 and Figure 19. Notice that the equivalence between (21) and (29), as well as the equivalence between (22) and (30) are also visible in Table 2. Notice also that the difference between the contexts of the evaluations obtained with DV and the contexts of the evaluations obtained with MISV has increased.

DV MISV
μ̌‘8’(x) 1.4425 1.4425
ν̌‘8’(x) 1.2318 1.2318
η‘8’(x) 2.6743 2.6743
μ‘8’(x) 0.5394 0.5394
ν‘8’(x) 0.4606 0.4606
ρ‘8’(x) 0.0788 0.0788
μ̌‘3’(x) 1.2318 1.2318
ν̌‘3’(x) 1.4425 1.4425
η‘3’(x) 2.6743 2.6743
μ‘3’(x) 0.4606 0.4606
ν‘3’(x) 0.5394 0.5394
ρ‘3’(x) −0.0788 −0.0788
Table 2

Results of the evaluations of “x BELONGS TO ‘8’” and “x BELONGS TO ‘3’”, where ‘8’ and ‘3’ are two classes learned through X0@usr1 ∪ X0@usr2 and x is the handwritten number depicted in Figure 16.

Figure 19

Visual results of the evaluations listed in Table 2.

An explanation for that increment is that in this case the DV incorporates features of the handwritten numbers given by usr2 whose influence differs from the influence of the features included in the MISV, which includes features of one of the handwritten numbers given by usr1. For instance, the influence of the features of the handwritten ‘8’s given by usr2 is reflected in the red part of the visual representation of the evaluation of “x BELONGS TO ‘8’” performed with DV (see Figure 19). In contrast, the visual representation of the evaluation of “x BELONGS TO ‘8’” performed with MISV only reflects the influence of the MISV, which is related to one of the ‘8’s given by usr1 – recall that x represents the handwritten number depicted in Figure 16, which is given by usr1.

Regarding the prediction of the class of the handwritten number x, the contextualized evaluations of the propositions “x BELONGS TO ‘3’” and “x BELONGS TO ‘8’” have been sorted in descending order according to the computed buoyancy. Since only two classes have been considered in this case, the 2 best evaluated classes have been presented as the 2 most optimistic predictions.

It is worth mentioning that, even though the computed buoyancy is used for sorting the evaluations, it might be considered optional while offering an explanation with a contextualized evaluation. A reason for this is that, compared to the context, the computed buoyancy might have a limited significance for an explanation since the buoyancy could be a very small number – notice in Tables 1 and 2 the effect of scaling the membership levels μ̌‘8’(x) and μ̌‘3’(x) and the nonmembership levels ν̌‘8’(x) and ν̌‘3’(x) in order to satisfy the consistency conditions 0 ≤ μ‘8’(x) + ν‘8’(x) ≤ 1 and 0 ≤ μ‘3’(x) + ν‘3’(x) ≤ 1 respectively. For this reason, offering the k most optimistic predictions instead of a unique prediction can help a user to make a better decision – cf. the work of Alonso and Bugarín [19], where additional classes are highlighted in case of ambiguity.
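Ranking evaluations by buoyancy and offering the k most optimistic predictions can be sketched as follows (ours; the evaluations mapping and its values are illustrative):

```python
def top_k_predictions(evaluations, k=3):
    """Return the k classes with the highest buoyancy rho = mu - nu."""
    ranked = sorted(evaluations.items(),
                    key=lambda item: item[1][0] - item[1][1],
                    reverse=True)
    return ranked[:k]

# Example with the membership/nonmembership grades of Table 1
evals = {"8": (0.5419, 0.4581), "3": (0.4581, 0.5419)}
print(top_k_predictions(evals, k=2))   # [('8', ...), ('3', ...)]
```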

4.2. XSVMC on a Large Collection

To illustrate how XSVMC works in cases where the number of samples is greater than the dimension of the feature space ℱ, in this section we use the MNIST collection. This collection is composed of a training collection of 60000 samples and a test collection of 10000 samples, which have been used for benchmarking several classifiers [18].

In contrast to the binary classification performed in the previous section, in this section we use XSVMC to perform a multi-class classification of handwritten decimal digits. In this regard, we use a ‘one-versus-the-rest’ strategy to build each of the 10 training collections. For instance, to build the training collection for the class of handwritten 8’s, the handwritten numbers with the tag ‘8’ in the MNIST training collection were considered as positive examples, while the remaining numbers were considered as negative examples. These 10 collections were used as input of the XSVMC learning process to obtain knowledge models for the 10 handwritten decimal digits. The resulting models were used as input for the evaluation processes described in Section 3.2, i.e., DV and MISV, to evaluate each of the 10000 handwritten numbers included in the test collection.
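A sketch (ours) of this one-versus-the-rest setup with scikit-learn is given below; the parameters gamma=1 and coef0=0 are our way of reproducing the kernel (xi ⋅ xk)^5 with C = 2, and the loading of the MNIST images and labels is assumed to be done elsewhere.

```python
import numpy as np
from sklearn.svm import SVC

def train_ovr_models(X_train, labels):
    """Train one binary SVM per decimal digit (digit d versus the rest)."""
    models = {}
    for d in range(10):
        y = np.where(labels == d, 1, -1)   # positives: numbers tagged with digit d
        models[d] = SVC(kernel="poly", degree=5, gamma=1.0, coef0=0.0, C=2.0)
        models[d].fit(X_train, y)
    return models
```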

The visual representations of two of those contextualized evaluations are shown in Figure 20. While the first column shows the visual representation of the evaluation of “x BELONGS TO ‘8’” performed with DV, the second column shows the visual representation of the same evaluation performed with MISV. In these representations, while the green parts suggest that the handwritten number is an ‘8’, the red parts, which an ‘8’ should have, and the gray part, which an ‘8’ should not have, indicate that the handwritten number is not an ‘8’. Notice that the representation on the second column shows more plainly what has been relevant during the evaluation process.

Figure 20

Difference between the context of an evaluation based on the directional vector (DV) and the context of an evaluation based on the support vector with the greatest positive influence on the evaluation (MISV).

Figure 21

Visual results of the evaluations of the membership and nonmembership of a handwritten ‘4’ in each of the classes of handwritten decimal digits.

An explanation for the difference between the above representations is that in this case the DV incorporates features of several shapes of handwritten ‘8’s whose influence differs from the influence of the features included in the MISV, which is related to the shape of a particular handwritten ‘8’ – cf. the visual representations listed in Figure 19.

To predict the class of a handwritten number, contextualized evaluations of the membership (and nonmembership) of this number in each of the 10 classes of handwritten decimal digits have been first performed. Then, these evaluations have been sorted in descending order according to the computed buoyancy. After that, the k best evaluated classes have been presented as the k most optimistic predictions. For the sake of illustration, Table 3 and Figure 21 show the results of the evaluations of the membership and nonmembership of a handwritten ‘4’ in each of the 10 classes of handwritten decimal digits using the polynomial kernel (xi ⋅ xk)^5 with C = 2. In this case, the three best evaluated classes have been presented as the three most optimistic predictions.

A μA(x) νA(x) ρA(x) Rank
‘0’ 0.0181 0.0410 −0.0229 10th
‘1’ 0.0280 0.0489 −0.0209 8th
‘2’ 0.0299 0.0500 −0.0201 7th
‘3’ 0.0434 0.0627 −0.0193 6th
‘4’ 0.1195 0.1083 0.0112 1st
‘5’ 0.0325 0.0518 −0.0193 9th
‘6’ 0.0180 0.0398 −0.0218 7th
‘7’ 0.0809 0.0971 −0.0162 4th
‘8’ 0.0711 0.0790 −0.0079 3rd
‘9’ 0.1487 0.1562 −0.0075 2nd
Table 3

Results of the evaluations of the membership and nonmembership of a handwritten ‘4’, denoted by x, in each of the classes of handwritten decimal digits.

Figure 22

Visual results of the evaluations of the membership and nonmembership of a handwritten ‘7’ in each of the classes of handwritten decimal digits.

The previous results were used as input of an XAI system that incorporates XSVMC to offer the following explanation of the most optimistic prediction: “The green part suggests that your drawing is a ‘4’ with a computed grade of 0.1195; however, the red part, which a ‘4’ should have, and the gray part, which a ‘4’ should not have, indicate that it is not a ‘4’ with a computed grade of 0.1083.” Notice that not only the predicted class, but also the reasons behind that prediction are given.

A potential advantage of XSVMC over a conventional SVM classification process is shown in Table 4 and Figure 22. In this example, a conventional SVM classification process would offer ‘4’ as a prediction since the best evaluated class is ‘4’. In contrast, since XSVMC offers the 3 most optimistic contextualized predictions in this case, users might, based on the provided context, give preference to ‘7’, which seems to be the class with the most credible justification.

A μA(x) νA(x) ρA(x) Rank
‘0’ 0.5056 0.0532 −0.0260 10th
‘1’ 0.0459 0.0681 −0.0222 5th
‘2’ 0.0305 0.0564 −0.0259 9th
‘3’ 0.0445 0.0670 −0.0225 6th
‘4’ 0.1370 0.1360 0.0010 1st
‘5’ 0.0406 0.0630 −0.0224 7th
‘6’ 0.0231 0.0485 −0.0254 8th
‘7’ 0.1203 0.1227 −0.0024 2nd
‘8’ 0.0699 0.0852 −0.0153 4th
‘9’ 0.1750 0.1872 −0.0122 3rd
Table 4

Results of the evaluations of the membership and nonmembership of a handwritten ‘7’ in each of the classes of handwritten decimal digits.

To further illustrate the potential advantage of XSVMC over a conventional SVM classification process, we measure the number of right predictions included in the k best optimistic contextualized predictions. Table 5 shows the number of right predictions made by XSVMC with the polynomial kernel (xi ⋅ xk)^5 and C = 2. Notice that 9985 out of 10000 predictions are included in the top 3 of the optimistic contextualized predictions, which represents an error rate of 0.15% – cf. the error rates reported for the MNIST collection [18].

Rank  Right Predictions (Frequency)  Right Predictions (Accumulated)
1st 9790 9790
2nd 169 9959
3rd 26 9985
4th 8 9993
5th 4 9997
6th 2 9999
7th 0 9999
8th 1 10000
9th 0 10000
10th 0 10000
Table 5

Number of right predictions according to the ranking of the k best optimistic contextualized predictions (kernel: (xi ⋅ xk)^5, C = 2).
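The counts in Table 5 can be reproduced by tallying the rank at which the right class appears among the ordered predictions; the following sketch (ours) illustrates the idea with made-up rankings.

```python
from collections import Counter

def rank_of_true_class(ranked_classes, true_labels):
    """Count, for each rank, how often the right class appears at that rank."""
    counts = Counter()
    for ranking, truth in zip(ranked_classes, true_labels):
        counts[ranking.index(truth) + 1] += 1   # 1-based rank of the right class
    return counts

rankings = [["4", "9", "8"], ["7", "4", "9"], ["1", "7", "2"]]
truth = ["4", "4", "7"]
print(rank_of_true_class(rankings, truth))      # Counter({2: 2, 1: 1})
```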

It is worth mentioning that in situations where only the best evaluated class is presented as the most optimistic prediction, a conventional SVM classification process and an XSVMC process have the same performance. To verify this, the test collection of 10000 handwritten numbers has been used as input of both processes with several kernel configurations. The results are listed in Table 6. Notice that the error rate is the same for both classifiers.

Kernel  XSVMC Error Rate  SVM Error Rate
(xi ⋅ xk), C = 2  8.19%  8.19%
(xi ⋅ xk), C = 4  8.07%  8.07%
(xi ⋅ xk)^5, C = 0  2.46%  2.46%
(xi ⋅ xk)^5, C = 2  2.10%  2.10%
(xi ⋅ xk)^5, C = 4  2.11%  2.11%
(xi ⋅ xk)^7, C = 2  2.85%  2.85%
Table 6

XSVMC versus Conventional SVM classification.

4.3. XSVMC versus Alternative Approaches

To illustrate potential advantages of XSVMC over alternative approaches, we use LIME [20] and ABELE [21] to perform the evaluations of the handwritten numbers considered in Figures 21 and 22.

Figure 23

Visual results of the evaluation of the handwritten number ‘4’ depicted in Figure 21 using LIME.

Figure 24

Visual results of the evaluation of the handwritten number ‘7’ depicted in Figure 22 using LIME.

LIME is a technique that tries to explain a prediction made by a classifier through an interpretable local model that is built around the prediction without knowing the details of the classifier. To produce the visual representations depicted in Figures 23 and 24, we use the source code of LIME (available at https://github.com/marcotcr/lime) with the same configuration of the SVM classifier used in Figures 21 and 22 (i.e., the polynomial kernel (xi ⋅ xk)^5 with C = 2). In both cases, the local model was built using 1000 synthetic samples. Notice that, in comparison to the visual representations produced with XSVMC, the visual representations produced with LIME show less plainly what has been relevant for the classifier during the evaluation process. In addition, while XSVMC needs only one evaluation to determine what has been relevant, LIME needs to evaluate all the generated synthetic samples.

Regarding ABELE, it is an extension of LORE [22] that, in a similar way to LIME, tries to explain a prediction by building an interpretable local classifier with a synthetic neighborhood of the handwritten number under evaluation; in addition, it takes into account existing relationships between the pixels of the handwritten number for building the synthetic neighborhood. To produce the visual representations depicted in Figure 25, we use the source code of ABELE (available at https://github.com/riccotti/ABELE) with the implemented Random Forest [23] classifier. Notice that, in comparison to LIME, the visual representations produced with ABELE show more clearly the approximations of what has been relevant to the classifier. However, ABELE needs more computational resources than LIME while evaluating all the generated synthetic samples.

Figure 25

Visual results of the evaluation of the handwritten numbers ‘4’ and ‘7’ depicted in Figures 21 and 22 respectively using ABELE.

5. RELATED WORK

An extensive survey of methods proposed for explaining computer predictions is presented in the work of Guidotti et al. [24]. In that survey, two main strategies have been identified: one is about the design of “transparent algorithms” that produce interpretable predictions, and the other is concerned with the interpretation of predictions without knowing the details of the algorithms that yield such predictions.

One of the methods aiming to interpret (and understand) predictions without knowing the internal details is a method proposed for decomposing a nonlinear image classification decision [25]. That method produces a heat map that highlights the relevant pixels, i.e., the pixels that have a significant influence on the classification decision. Another example is an explanation technique which tries to explain the predictions made by unknown classifiers by building interpretable local models that mimic the behavior of such classifiers [20]. In a similar way, the method proposed in the work of Baehrens et al. tries to extract a local model consisting of “explanation vectors,” which contain features that are relevant to a given prediction [26]. The visualization method proposed by Zeiler and Fergus also tries to interpret the influence of the features (pixels) and the behavior of a specific knowledge model [27].

A particularity about the aforementioned methods is that they try to identify what has been relevant to the classification decision after a prediction is made. In contrast, XSVMC identifies what has been relevant before the prediction. This aspect is deemed to be a key advantage since the influence of the features can be taken into account to guide the classification decision. In this regard, XSVMC can be considered to be part of the “transparent algorithms” identified by the above-mentioned survey. The method proposed by Loor and De Tré to contextualize naive Bayes predictions [28] is another example of such transparent algorithms.

A classification process that generates visual explanations has been proposed by Hendricks et al. [29]. In that process, images with annotated features are used as input to train an explanation model that combines classification and sentence generation in natural language. The process yields sentences including discriminative features that justify why an object belongs to the predicted class. However, such sentences do not include features justifying why the object does not belong to the class as XSVMC does.

The contributions made by the fuzzy logic community to the development of the explainable AI research field have been analyzed by Alonso et al. [30]. The results of that work suggest that the contributions made by the fuzzy logic community seem to be distant from the efforts made by the non-fuzzy community. However, that study suggests that those contributions can be linked to address the challenges arising in that field. In this regard, while potential options to develop XAI systems with fuzzy modeling have been proposed by Mencar and Alonso [31], non-fuzzy options have been proposed by Adadi and Berrada [32].

6. CONCLUSIONS

In this paper, we have proposed a novel variant of an SVM classification process by which the resulting predictions are contextualized in order to improve their interpretability. In the proposed variant, named XSVMC, the membership and nonmembership of an object in a particular class are evaluated in such a way that the context of the evaluation is explicitly recorded. Hence, predictions resulting from such contextualized evaluations can be explained with ease. In this regard, a key component of XSVMC is a novel evaluation method that makes use of the MISV to contextualize the evaluations.

An important aspect of XSVMC is that users can take advantage of such contextualized predictions to give preference to the class(es) with the most credible justification. We have illustrated this aspect through the implementation of a use case where the classes of handwritten numbers are predicted.

Even though the results of the aforementioned implementation suggest that the contextualization of SVM predictions can improve their interpretability, qualitative attributes like coherence, naturalness and clarity that might be perceived by a person in those predictions are still subject to validation. In this regard, studies oriented toward conducting such validations are suggested as future work.

CONFLICT OF INTEREST

None.

AUTHORS' CONTRIBUTION

Marcelo Loor, main author; Guy De Tré, PhD thesis promoter and advisor.

FUNDING STATEMENT

This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

ACKNOWLEDGMENTS

The authors acknowledge the valuable and insightful comments given by the anonymous reviewers. The authors are also very grateful to the persons who have written the numbers contained in the small collection used in Section 4.1.

Footnotes

This paper is an extended version of the work published in Marcelo Loor and Guy De Tré. Explaining Computer Predictions with Augmented Appraisal Degrees. Proceedings of the 11th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2019), pages 158–165. Atlantis Press, 2019/08. https://doi.org/10.2991/eusflat-19.2019.24

1

In this example, one can also say that A@K represents what has been learned about a class of handwritten 3s after following a learning process that yields K as a result.

2

To be consistent with the notation introduced in Figure 2, where the “source” of the knowledge about A is explicitly denoted, we should say KA@X0 = ⟨ûA@X0, tA@X0⟩. For the sake of readability, we use hereafter this simplified form of the notation.

3

To find the values of w, b and all λi ∈ {λ1, …, λn}, the software package SVMLight [17] can be used.

REFERENCES

4. W. Samek, T. Wiegand, and K.-R. Müller, Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models, ITU J. ICT Discov., Vol. 1, 2017, pp. 39-48.
6. V.N. Vapnik and V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, Vol. 1, 1998.
8. D. Gunning, Explainable Artificial Intelligence (XAI), 2017. http://www.darpa.mil/attachments/XAIIndustryDay_Final.pptx
17. T. Joachims, Making large-scale SVM learning practical, B. Schölkopf, C.J.C. Burges, and A. Smola (editors), Advances in Kernel Methods - Support Vector Learning, chap. 11, MIT Press, Cambridge, MA, USA, 1999, pp. 169-184.
18. Y. LeCun, C. Cortes, and C.J.C. Burges, The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
26. D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, How to explain individual classification decisions, J. Mach. Learn. Res., Vol. 11, 2010, pp. 1803-1831.
