Journal of Statistical Theory and Applications

Volume 20, Issue 1, March 2021, Pages 21 - 32

A Robust High-Dimensional Estimation of Multinomial Mixture Models

Authors
Azam Sabbaghi, Farzad Eskandari*, Hamid Reza Navabpoor
Department of Statistics, Faculty of Mathematical Sciences and Computer, Allameh Tabataba'i University, Tehran, Iran
*Corresponding author. Email: askandari@atu.ac.ir
Received 17 June 2020, Accepted 18 January 2021, Available Online 8 February 2021.
DOI
10.2991/jsta.d.210126.001
Keywords
EM algorithm; Data corruption; High-dimensional; Multinomial logistic mixture models; Robustness
Abstract

In this paper, we are concerned with robustifying high-dimensional (RHD) structured estimation in finite mixtures of multinomial models. Such models are used in many applications that often involve outliers and data corruption. We therefore introduce a class of multinomial logistic mixture models for dependent variables with two or more discrete categorical levels. Through optimization with the expectation maximization (EM) algorithm, we study two distinct ways to handle sparsity in the finite mixture of multinomial logistic models: in the parameter space or in the output space. It is shown that the new method is consistent for RHD structured estimation. Finally, we implement the proposed method on real data.

Copyright
© 2021 The Authors. Published by Atlantis Press B.V.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

The Bernoulli mixture model (BMM) is applied to a binary dependent variable, and we show how the model is estimated using regularized maximum likelihood. The development and application of BMMs have gained increasing attention. Grantham [1] focused on BMMs for binary data clustering. Grilli et al. [2] used a binomial finite mixture to model the number of credits. Melkersson and Saarela [3] applied the binomial finite mixture model to nonzero counts. Brooks et al. [4] studied fetal deaths in litters with different types of finite mixture models, including the binomial finite mixture model.

High-dimensional estimation with an additional sparse error vector accounting for corrupted observations has been widely considered in recent studies (Wang et al. [5]; Nguyen and Tran [6]; Chen et al. [7]; Tibshirani and Manning [8]). Yang et al. [9] added an outlier error parameter to model the corrupted response and applied two techniques for outlier modeling in GLMs. The first approach works in the parameter space and leads to a convex optimization problem under stringent conditions. The second, which works in the output space, yields a nonconvex method with milder conditions. In this study, these two outlier-modeling techniques are used in binomial finite mixture modeling. We also use the multinomial logistic mixture models (MLMMs) to examine the problem of data corruption. Finally, the expectation maximization (EM) algorithm is employed for robust estimation.

The rest of this article is organized as follows: Section 2 introduces the Bernoulli finite mixture model (BMM) framework for binary data. Section 3 describes the MLMMs for dependent variables with two or more discrete categorical levels. In Section 4 we study the properties of our approach by applying the proposed method to real data.

2. MODELING OUTLIER ERRORS IN BMMs

In this paper, we examine a classification dataset whose response variable $y^c$ consists of two classes, $-1$ and $1$, which are considered diametrically opposite, such as pass/fail, win/lose, alive/dead. The Bernoulli distribution is an effective tool for studying such grouping variables. Assume that $P(Y^c=1)=1-P(Y^c=-1)=p$. It can then be expressed that $P(Y^c=y^c)=p^{I_1(Y^c)}(1-p)^{1-I_1(Y^c)}$.

A binary logistic model has a dependent variable with two possible values that are expressed by an indicator variable $y=I_1(y^c)$, where the two values are labeled 0 and 1. Eskandari and Meshkani [10] presented the maximum likelihood equations from the probability distribution of the logistic regression and solved them using the Newton-Raphson method for nonlinear systems of equations. Böhning [11] applied the lower bound principle in the Newton-Raphson iteration instead of the Hessian matrix, which led to a monotonically converging sequence of iterates.

In real-world problems, we are interested in studying the logistic regression model in high-dimensional data problems with a small number of nonzero parameters. Due to the presence of a sparse parameter vector and outliers, the desirable theoretical properties of standard methods do not hold exactly. To address this challenge and obtain an estimator that is robust to departures from our model assumptions, we propose modeling outlier errors in the BMM. Details of the two approaches (i.e., modeling outlier errors in the parameter space and in the output space, respectively) for Bernoulli mixture models are explained later. Before that, we discuss the performance of the standard $\ell_1$-penalized BMMs on the uncorrupted version of the dataset. Let $y=(y_1,\ldots,y_n)$ be a random sample of binary vectors. We consider that $y_i$ arises from a finite mixture density $p(y_i\mid\tilde\Psi)=\sum_{k=1}^{K}\pi_k\,p(y_i\mid p_{ik})$ of order $K$, where the mixture component density $p(y_i\mid p_{ik})$ is Bernoulli with success probability $p_{ik}$:

$$p(y_i\mid\tilde\Psi)=\sum_{k=1}^{K}\pi_k\exp\left\{y_i\ln\frac{p_{ik}}{1-p_{ik}}+\ln(1-p_{ik})\right\}\tag{1}$$
where $\tilde\Psi=(\theta_1,\ldots,\theta_K,\pi)$ is the vector of mixture parameters and $\pi=(\pi_1,\ldots,\pi_K)$ are the mixing weights, such that $\pi_k>0$ and $\sum_{k=1}^{K}\pi_k=1$. In logistic regression, we equate the logit function to the linear component of the covariate vector $x_i\in\mathbb{R}^p$ for the $i$th observation and the true regression parameter vector $\theta_k\in\mathbb{R}^p$ as follows:
$$\ln\frac{p_{ik}}{1-p_{ik}}=\langle\theta_k,x_i\rangle=x_i^{t}\theta_k,\qquad i=1,\ldots,n$$

Since the BMMs are considered in high-dimensional data problems, we assume $p$ is significantly larger than $n$. An $\ell_1$-penalized version of the maximum likelihood estimator (MLE) is defined to cope with observations that deviate from the true model. Suppose that $\{(x_i,y_i)\}_{i=1}^{n}$ is an independent sample of observations from (1). The negative log-likelihood function over the $n$ values is given by

$$l_p=-\frac{1}{n}\sum_{i=1}^{n}\log\left\{\sum_{k=1}^{K}\pi_k\exp\left\{y_i\langle\theta_k,x_i\rangle-\ln\left(1+\exp\langle\theta_k,x_i\rangle\right)\right\}\right\}\tag{2}$$

Now we estimate $\hat\Theta=(\hat\theta_1,\ldots,\hat\theta_K)$ by imposing the $\ell_1$-regularized maximum likelihood constraint proposed by Städler et al. [12]:

$$\hat\theta_k\in\underset{\|\theta_k\|_2\le a_0}{\operatorname{arg\,min}}\;l_p+\lambda_{n,\theta}\sum_{k=1}^{K}\pi_k\|\theta_k\|_1\tag{3}$$
where $a_0$ denotes a constant independent of $n$ and $p$. Note that $\|\cdot\|_1$ for a vector is the sum of absolute values and $\|\cdot\|_2$ is the usual Euclidean norm. To compute this estimator, we propose an iterative EM algorithm. At iteration $m$, the algorithm consists of an Expectation step (E-step) and a Maximization step (M-step) and seeks minimization of (2) using the complete negative log-likelihood function:
$$l_c(\tilde\Psi)=-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\nu_{ik}\left\{\log\pi_k+y_i\langle\theta_k,x_i\rangle-\ln\left(1+\exp\langle\theta_k,x_i\rangle\right)\right\}+\lambda_{n,\theta}\sum_{k=1}^{K}\pi_k\|\theta_k\|_1\tag{4}$$

In Algorithm 1 for the BMM below, Step 3 marks the E-step, where $\omega_{ik}^{(m)}$ is updated by $E[\nu_{ik}\mid x,y]$ and the unobserved indicator variables $\nu_{ik}$ denote the component membership of the $i$th observation in the model. The conditional expectation of $l_c(\tilde\Psi)$ with respect to $\nu_{ik}$ is

$$Q(\tilde\Psi,\tilde\Psi^{(m)})=-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\omega_{ik}^{(m)}\left\{y_i\langle\theta_k,x_i\rangle-\ln\left(1+\exp\langle\theta_k,x_i\rangle\right)\right\}-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\omega_{ik}^{(m)}\log\pi_k+\lambda_{n,\theta}\sum_{k=1}^{K}\pi_k\|\theta_k\|_1\tag{5}$$

Steps 4 and 5 form the M-step, where $\tilde\Psi^{(m+1)}$ is obtained by minimizing (5) with respect to $\tilde\Psi$.

To motivate robust high-dimensional estimators, we begin with modeling the outlier errors approach on the parameter space in the next section.

Algorithm 1: EM Algorithm for BMM

step 1: Begin with initial values $\theta_k^{(0)}$ and $\pi_k^{(0)}$ for all $k$.

step 2: Compute $P_k^{(m)}=(p_{1k}^{(m)},\ldots,p_{nk}^{(m)})$ for all $i$'s and $k$'s as follows:

$$p_{ik}^{(m)}=\frac{\exp\{\langle\theta_k^{(m)},x_i\rangle\}}{1+\exp\langle\theta_k^{(m)},x_i\rangle}$$

step 3: Compute $\omega_k^{(m)}=(\omega_{1k}^{(m)},\ldots,\omega_{nk}^{(m)})$ for all $i$'s and $k$'s as follows:

$$\omega_{ik}^{(m)}=\frac{\pi_k\exp\left\{y_i\langle\theta_k^{(m)},x_i\rangle-\ln\left(1+\exp\langle\theta_k^{(m)},x_i\rangle\right)\right\}}{\sum_{l=1}^{K}\pi_l\exp\left\{y_i\langle\theta_l^{(m)},x_i\rangle-\ln\left(1+\exp\langle\theta_l^{(m)},x_i\rangle\right)\right\}}$$

step 4: Compute $\hat\theta_k$ via the EM algorithm and the following equation:

$$\theta_k^{(m+1)}=\theta_k^{(m)}+\left(X^{t}\,\mathrm{diag}\left(\omega_k^{(m)}P_k^{(m)}(1-P_k^{(m)})\right)X\right)^{-1}X^{t}\,\mathrm{diag}\left(\omega_k^{(m)}\right)\left(y-P_k^{(m)}\right)\tag{6}$$
where $X=(x_1,\ldots,x_n)^{t}$ is the design matrix.

step 5: Determine $\pi_k^{(m+1)}$ from the following formula (for all $k$):

$$\pi_k^{(m+1)}=\frac{1}{n}\sum_{i=1}^{n}\omega_{ik}^{(m)}$$

step 6: Set $m\leftarrow m+1$ and iterate steps 2–5 until reaching a predefined convergence criterion.
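For concreteness, the following minimal sketch implements the core of Algorithm 1 in Python/NumPy without the $\ell_1$ penalty; the function name, initialization, and convergence defaults are our own choices rather than the authors' implementation.

```python
import numpy as np

def em_bmm(X, y, K, n_iter=100, tol=1e-6, seed=0):
    """Plain (unpenalized) EM for a mixture of K logistic regressions, following
    the structure of Algorithm 1; names and defaults are our own choices."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = rng.normal(scale=0.01, size=(K, p))    # step 1: initial theta_k
    pi = np.full(K, 1.0 / K)                       # step 1: initial pi_k
    for _ in range(n_iter):
        eta = X @ theta.T                          # (n, K) linear predictors
        P = 1.0 / (1.0 + np.exp(-eta))             # step 2: p_ik
        # step 3 (E-step): responsibilities omega_ik, computed on the log scale
        log_comp = y[:, None] * eta - np.logaddexp(0.0, eta) + np.log(pi)[None, :]
        log_comp -= log_comp.max(axis=1, keepdims=True)
        W = np.exp(log_comp)
        W /= W.sum(axis=1, keepdims=True)
        theta_old = theta.copy()
        for k in range(K):                         # step 4 (M-step): one Newton step per component
            w = W[:, k] * P[:, k] * (1.0 - P[:, k])
            H = X.T @ (w[:, None] * X)             # X' diag(omega_k p(1-p)) X
            g = X.T @ (W[:, k] * (y - P[:, k]))    # X' diag(omega_k) (y - P_k)
            theta[k] += np.linalg.solve(H, g)
        pi = W.mean(axis=0)                        # step 5: update mixing weights
        if np.max(np.abs(theta - theta_old)) < tol:
            break                                  # step 6: convergence reached
    return theta, pi
```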

2.1. Parameter Space

Based on the $i$th response $y_i$ drawn from (1) and a reformulation of the logit function with a corrupted linear predictor $\langle\theta_k,x_i\rangle+\sqrt{n}\,e_{ik}$, we propose robust estimators for general high-dimensional problems by modeling outlier errors in the parameter space. We can then write the negative log-likelihood as

$$l_p=-\frac{1}{n}\sum_{i=1}^{n}\log\left\{\sum_{k=1}^{K}\pi_k\exp\left\{y_i\left(\langle\theta_k,x_i\rangle+\sqrt{n}\,e_{ik}\right)-\ln\left(1+\exp\{\langle\theta_k,x_i\rangle+\sqrt{n}\,e_{ik}\}\right)\right\}\right\}\tag{7}$$

The robust estimation problem can be solved with the following constrained $\ell_1$-regularized maximum likelihood, where $a_0$ and $b_0$ are constants independent of $n$ and $p$:

$$(\hat\theta_k,\hat e_k)\in\underset{\substack{\|\theta_k\|_2\le a_0\\ \|e_k\|_2\le b_0\sqrt{n}}}{\operatorname{arg\,min}}\;l_p+\lambda_{n,\theta}\sum_{k=1}^{K}\pi_k\|\theta_k\|_1+\lambda_{n,e}\sum_{k=1}^{K}\pi_k\|e_k\|_1\tag{8}$$

We now focus on the EM algorithm and provide the complete negative log-likelihood function as follows:

$$l_c(\tilde\Psi)=\lambda_{n,\theta}\sum_{k=1}^{K}\pi_k\|\theta_k\|_1+\lambda_{n,e}\sum_{k=1}^{K}\pi_k\|e_k\|_1-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\nu_{ik}\left\{\log\pi_k+y_i\left(\langle\theta_k,x_i\rangle+\sqrt{n}\,e_{ik}\right)-\ln\left(1+\exp\{\langle\theta_k,x_i\rangle+\sqrt{n}\,e_{ik}\}\right)\right\}\tag{9}$$

In the E-step, the conditional expectation of $l_c(\tilde\Psi)$ with respect to $\nu_{ik}$ given the data $(x_i,y_i)$ is

$$Q(\tilde\Psi,\tilde\Psi^{(m)})=\lambda_{n,\theta}\sum_{k=1}^{K}\pi_k\|\theta_k\|_1+\lambda_{n,e}\sum_{k=1}^{K}\pi_k\|e_k\|_1-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\omega_{ik}^{(m)}\log\pi_k-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\omega_{ik}^{(m)}\left\{y_i\left(\langle\theta_k,x_i\rangle+\sqrt{n}\,e_{ik}\right)-\ln\left(1+\exp\{\langle\theta_k,x_i\rangle+\sqrt{n}\,e_{ik}\}\right)\right\}\tag{10}$$

By adding the estimation of the outlier error parameter, Algorithm 2 is obtained. In the M-step, the estimates of $\tilde\Psi=(e_1,\ldots,e_K,\theta_1,\ldots,\theta_K,\pi)$ are updated in Steps 4–6.

Algorithm 2: EM Algorithm for modeling errors in the parameter space in BMM (PBMM)

step 1: Begin with initial values $\theta_k^{(0)}$, $\pi_k^{(0)}$ and $e_k^{(0)}$ for all $k$'s.

step 2: Compute $P_k^{(m)}=(p_{1k}^{(m)},\ldots,p_{nk}^{(m)})$ at the $m$th iteration for all $i$'s and $k$'s as follows:

$$p_{ik}^{(m)}=\frac{\exp\{\langle\theta_k^{(m)},x_i\rangle+\sqrt{n}\,e_{ik}^{(m)}\}}{1+\exp\{\langle\theta_k^{(m)},x_i\rangle+\sqrt{n}\,e_{ik}^{(m)}\}}$$

step 3: Compute $\omega_k^{(m)}=(\omega_{1k}^{(m)},\ldots,\omega_{nk}^{(m)})$ for all $i$'s and $k$'s:

$$\omega_{ik}^{(m)}=\frac{\pi_k\exp\left\{y_i\left(\langle\theta_k^{(m)},x_i\rangle+\sqrt{n}\,e_{ik}^{(m)}\right)-\ln\left(1+\exp\{\langle\theta_k^{(m)},x_i\rangle+\sqrt{n}\,e_{ik}^{(m)}\}\right)\right\}}{\sum_{l=1}^{K}\pi_l\exp\left\{y_i\left(\langle\theta_l^{(m)},x_i\rangle+\sqrt{n}\,e_{il}^{(m)}\right)-\ln\left(1+\exp\{\langle\theta_l^{(m)},x_i\rangle+\sqrt{n}\,e_{il}^{(m)}\}\right)\right\}}$$

step 4: Obtain $\theta_k^{(m+1)}$ from the EM algorithm and the following equation:

$$\theta_k^{(m+1)}=\theta_k^{(m)}+\left(X^{t}\,\mathrm{diag}\left(\omega_k^{(m)}P_k^{(m)}(1-P_k^{(m)})\right)X\right)^{-1}X^{t}\,\mathrm{diag}\left(\omega_k^{(m)}\right)\left(y-P_k^{(m)}\right)$$

where $X=(x_1,\ldots,x_n)^{t}$.

step 5: Obtain $e_k^{(m+1)}$ via the EM algorithm as follows:

$$e_k^{(m+1)}=e_k^{(m)}+\left(\sqrt{n}\,P_k^{(m)}(1-P_k^{(m)})\right)^{-1}\left(y-P_k^{(m)}\right)$$

step 6: Assign $\pi_k^{(m+1)}\leftarrow\frac{1}{n}\sum_{i=1}^{n}\omega_{ik}^{(m)}$.

step 7: Set $m\leftarrow m+1$ and iterate steps 2–6 until reaching a predefined convergence criterion.

Although the optimization problem is convex, stringent conditions are needed to achieve a consistent estimator. The constraint $\|e_k\|_2\le b_0\sqrt{n}$ ensures that a consistent optimum exists, as discussed in Yang et al. [9]. To handle larger errors, we next introduce modeling errors in the output space.
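To illustrate how Algorithm 2 departs from Algorithm 1, the fragment below sketches the corrupted linear predictor of Step 2 and the correction of $e_k$ in Step 5; the names are ours and the $\sqrt{n}$ scaling follows the reconstruction above, so it should be read as an assumption rather than the authors' code.

```python
import numpy as np

def pbmm_e_update(X, y, theta_k, e_k):
    """One PBMM-style pass over Steps 2 and 5 of Algorithm 2 for component k.
    The sqrt(n) scaling of the corruption term is an assumption of this sketch."""
    n = X.shape[0]
    eta = X @ theta_k + np.sqrt(n) * e_k          # Step 2: corrupted linear predictor
    p_k = 1.0 / (1.0 + np.exp(-eta))              # current success probabilities p_ik
    # Step 5: elementwise Newton-type correction of e_k (one entry per observation)
    e_k_new = e_k + (y - p_k) / (np.sqrt(n) * p_k * (1.0 - p_k))
    return e_k_new, p_k
```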

2.2. Output Space

In this section, we place the statistical error directly in the response space of the BMM (RBMM). Under a certain assumption, the group random variable $y_i^c-\sqrt{n}\,e_i$ takes the two values $1-\sqrt{n}\,e_i$ and $-1-\sqrt{n}\,e_i$. Therefore, based on the indicator variable $y_i=I_{1-\sqrt{n}\,e_i}(y_i^c-\sqrt{n}\,e_i)$ drawn from the conditional distribution in (1) with logit function $\langle\theta_k,x_i\rangle$ for each component, we have

$$p(y_i\mid x_i)=\sum_{k=1}^{K}\pi_k\exp\left\{I_{1-\sqrt{n}\,e_i}(y_i^c-\sqrt{n}\,e_i)\langle\theta_k,x_i\rangle-\ln\left(1+\exp\langle\theta_k,x_i\rangle\right)\right\}$$

Our starting point is the following negative log-likelihood function:

$$l_p=-\frac{1}{n}\sum_{i=1}^{n}\log\left\{\sum_{k=1}^{K}\pi_k\exp\left\{I_{1-\sqrt{n}\,e_i}(y_i^c-\sqrt{n}\,e_i)\langle\theta_k,x_i\rangle-\ln\left(1+\exp\langle\theta_k,x_i\rangle\right)\right\}\right\}\tag{11}$$

The EM iteration alternates between performing E- and M-steps. The E-step creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters:

$$l_c(\tilde\Psi)=-\frac{1}{n}\sum_{i=1}^{L}\sum_{k=1}^{K}\nu_{ik}\left\{\log\pi_k+y_i\langle\theta_k,x_i\rangle-\ln\left(1+\exp\langle\theta_k,x_i\rangle\right)\right\}\tag{12}$$

The M-step computes the parameters maximizing the expected log-likelihood found in the E-step:

$$Q=-\frac{1}{n}\sum_{i=1}^{L}\sum_{k=1}^{K}\omega_{ik}^{(m)}\log\pi_k-\frac{1}{n}\sum_{i=1}^{L}\sum_{k=1}^{K}\omega_{ik}^{(m)}\left\{y_i\langle\theta_k,x_i\rangle-\ln\left(1+\exp\langle\theta_k,x_i\rangle\right)\right\}\tag{13}$$

To this end, in the RBMM approach we iterate Algorithm 1 for each value $L\in\{2,3,\ldots,n\}$ and choose the parameter estimates that have the least empirical prediction error.
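One reading of this search procedure, in which only $L$ observations enter the fit at a time, is sketched below; which $L$ observations are retained, and the scoring function, are assumptions of the sketch (the `fit` argument could be the `em_bmm` routine sketched earlier).

```python
def rbmm_select(X, y, K, fit, prediction_error):
    """RBMM-style search: refit the mixture on L observations for each L in
    {2, ..., n} and keep the fit with the smallest empirical prediction error.
    `fit(X, y, K)` returns (theta, pi), e.g. the em_bmm sketch above;
    `prediction_error(theta, pi, X, y)` is a user-supplied scoring function.
    Which L observations enter the fit is an assumption of this sketch."""
    n = X.shape[0]
    best = None
    for L in range(2, n + 1):
        theta, pi = fit(X[:L], y[:L], K)            # Algorithm 1 on L observations
        err = prediction_error(theta, pi, X, y)     # empirical prediction error
        if best is None or err < best[0]:
            best = (err, L, theta, pi)
    return best                                      # (error, L, theta, pi)
```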

3. MODELING OUTLIER ERRORS IN THE MLMM

In the MLMM, we consider the vector $y_i=(y_{i1},\ldots,y_{iJ})$, $i=1,\ldots,n$, with $y_{ij}=0$ for all $j$ except one $j$ with $y_{ij}=1$ and corresponding probability $p_{ij}$. Let $J$ represent the number of levels of the dependent variable. To simulate sparsity, we arbitrarily corrupt some of the observations $y_i$; as in logistic regression models, a corrupted response is obtained by $y_i^{*}=(1-\bar y_i)$. For $J=3$ we have

$$E(y_i)=\begin{pmatrix}p_{i1}\\ p_{i2}\\ p_{i3}\end{pmatrix},\qquad \mathrm{Cov}(y_i)=\begin{pmatrix}p_{i1}(1-p_{i1}) & -p_{i1}p_{i2} & -p_{i1}p_{i3}\\ -p_{i1}p_{i2} & p_{i2}(1-p_{i2}) & -p_{i2}p_{i3}\\ -p_{i1}p_{i3} & -p_{i2}p_{i3} & p_{i3}(1-p_{i3})\end{pmatrix}$$
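A quick numerical check of these moment formulas for a single multinomial trial with $J=3$ (an illustration of ours, with an arbitrary probability vector):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                 # hypothetical cell probabilities
draws = rng.multinomial(1, p, size=200_000)   # each row is a one-hot y_i
print(draws.mean(axis=0))                     # approximately (p_1, p_2, p_3)
print(np.cov(draws, rowvar=False))            # approx. diag p_j(1-p_j), off-diag -p_j p_l
print(np.diag(p) - np.outer(p, p))            # exact covariance matrix
```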

Recall that $y_i$ follows a finite mixture of multinomial models of order $K$ with conditional density function

$$p(y_i\mid x_i)=\sum_{k=1}^{K}\pi_k\exp\left\{y_{i1}\ln\frac{p_{i1k}}{p_{i3k}}+y_{i2}\ln\frac{p_{i2k}}{p_{i3k}}+\ln(p_{i3k})\right\}\tag{14}$$
where the multinomial logit model is given by
$$\ln\frac{p_{i1k}}{p_{i3k}}=\langle\theta_{1k},x_i\rangle,\qquad \ln\frac{p_{i2k}}{p_{i3k}}=\langle\theta_{2k},x_i\rangle\tag{15}$$

The conditional negative log-likelihood function of $\tilde\Psi$ has the form

$$l_p=-\frac{1}{n}\sum_{i=1}^{n}\log\sum_{k=1}^{K}\pi_k\exp\left\{y_{i1}\langle\theta_{1k},x_i\rangle+y_{i2}\langle\theta_{2k},x_i\rangle-\ln\left(1+\exp\langle\theta_{1k},x_i\rangle+\exp\langle\theta_{2k},x_i\rangle\right)\right\}\tag{16}$$

The $\ell_1$-penalized version of the classical MLE is as follows:

$$(\hat\theta_{1k},\hat\theta_{2k})\in\underset{\substack{\|\theta_{1k}\|_2\le a_{10}\\ \|\theta_{2k}\|_2\le a_{20}}}{\operatorname{arg\,min}}\;l_p+\lambda_{n,\theta_1}\sum_{k=1}^{K}\pi_k\|\theta_{1k}\|_1+\lambda_{n,\theta_2}\sum_{k=1}^{K}\pi_k\|\theta_{2k}\|_1\tag{17}$$

In this situation, the joint estimation of $(\theta_{1k},\theta_{2k})$ is achieved through the complete-data framework required by the EM algorithm:

$$l_c(\tilde\Psi)=-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\nu_{ik}\left\{\log\pi_k+y_{i1}\langle\theta_{1k},x_i\rangle+y_{i2}\langle\theta_{2k},x_i\rangle-\ln\left(1+\exp\langle\theta_{1k},x_i\rangle+\exp\langle\theta_{2k},x_i\rangle\right)\right\}\tag{18}$$
where $\tilde\Psi=(\theta_{11},\ldots,\theta_{1K},\theta_{21},\ldots,\theta_{2K},\pi)$. Using (18), we can write the new conditional expectation of $l_c(\tilde\Psi)$ as
$$Q(\tilde\Psi,\tilde\Psi^{(m)})=-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\omega_{ik}^{(m)}\log\pi_k-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\omega_{ik}^{(m)}\left\{y_{i1}\langle\theta_{1k},x_i\rangle+y_{i2}\langle\theta_{2k},x_i\rangle-\ln\left(1+\exp\langle\theta_{1k},x_i\rangle+\exp\langle\theta_{2k},x_i\rangle\right)\right\}$$

The following Algorithm 3 illustrates the procedure employed for estimating the parameters:

Algorithm 3: EM Algorithm for multinomial logistic mixture models

step 1: Begin with initial values $(\theta_{1k}^{(0)},\theta_{2k}^{(0)})$ and $\pi_k^{(0)}$ for all $k$.

step 2: Compute $(P_{1k}^{(m)},P_{2k}^{(m)},P_{3k}^{(m)})$ for all $i$'s and $k$'s as follows:

$$p_{irk}^{(m)}=\frac{\exp\{\langle\theta_{rk}^{(m)},x_i\rangle\}}{1+\sum_{r'=1}^{2}\exp\langle\theta_{r'k}^{(m)},x_i\rangle},\qquad r=1,2$$
$$p_{i3k}^{(m)}=\frac{1}{1+\sum_{r=1}^{2}\exp\langle\theta_{rk}^{(m)},x_i\rangle}$$

step 3: Compute $\omega_k^{(m)}=(\omega_{1k}^{(m)},\ldots,\omega_{nk}^{(m)})$ for all $i$'s and $k$'s:

$$\omega_{ik}^{(m)}=\frac{\pi_k^{(m)}\exp\left\{\sum_{r=1}^{2}y_{ir}\langle\theta_{rk}^{(m)},x_i\rangle-\ln\left(1+\sum_{r=1}^{2}\exp\langle\theta_{rk}^{(m)},x_i\rangle\right)\right\}}{\sum_{l=1}^{K}\pi_l^{(m)}\exp\left\{\sum_{r=1}^{2}y_{ir}\langle\theta_{rl}^{(m)},x_i\rangle-\ln\left(1+\sum_{r=1}^{2}\exp\langle\theta_{rl}^{(m)},x_i\rangle\right)\right\}}$$

step 4: Compute $(\hat\theta_{1k},\hat\theta_{2k})$ from the EM algorithm and the following equations:

$$\begin{pmatrix}\theta_{1k}^{(m+1)}\\ \theta_{2k}^{(m+1)}\end{pmatrix}=\begin{pmatrix}\theta_{1k}^{(m)}\\ \theta_{2k}^{(m)}\end{pmatrix}+\begin{pmatrix}X^{t}W_1^{(m)}X & -X^{t}V_{1.2}^{(m)}X\\ -X^{t}V_{1.2}^{(m)}X & X^{t}W_2^{(m)}X\end{pmatrix}^{-1}\begin{pmatrix}X^{t}\,\mathrm{diag}(\omega_k^{(m)})(y_{1k}-P_{1k}^{(m)})\\ X^{t}\,\mathrm{diag}(\omega_k^{(m)})(y_{2k}-P_{2k}^{(m)})\end{pmatrix}$$
where $X=(x_1,\ldots,x_n)^{t}$ and
$$W_1^{(m)}=\mathrm{diag}\left(\omega_k^{(m)}P_{1k}^{(m)}(1-P_{1k}^{(m)})\right)$$
$$W_2^{(m)}=\mathrm{diag}\left(\omega_k^{(m)}P_{2k}^{(m)}(1-P_{2k}^{(m)})\right)$$
$$V_{1.2}^{(m)}=\mathrm{diag}\left(\omega_k^{(m)}P_{1k}^{(m)}P_{2k}^{(m)}\right)$$

step 5: Determine $\pi_k^{(m+1)}$ from the following formula (for all $k$):

$$\pi_k^{(m+1)}=\frac{1}{n}\sum_{i=1}^{n}\omega_{ik}^{(m)}$$

step 6: Set $m\leftarrow m+1$ and iterate steps 2–5 until reaching a predefined convergence criterion.
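The fragment below sketches Steps 2 and 3 of Algorithm 3 for the three-level case in Python/NumPy; the function name, array shapes, and the log-scale stabilization are our own conventions, not the authors' implementation.

```python
import numpy as np

def mlmm_e_step(X, Y, theta1, theta2, pi):
    """Steps 2-3 of Algorithm 3: category probabilities and responsibilities.
    X: (n, p) design matrix; Y: (n, 3) one-hot responses;
    theta1, theta2: (K, p) coefficients for categories 1 and 2 versus 3;
    pi: (K,) mixing weights. Shapes and names are our own conventions."""
    eta1 = X @ theta1.T                                    # (n, K) logits, category 1
    eta2 = X @ theta2.T                                    # (n, K) logits, category 2
    denom = 1.0 + np.exp(eta1) + np.exp(eta2)
    P1, P2, P3 = np.exp(eta1) / denom, np.exp(eta2) / denom, 1.0 / denom   # step 2
    # step 3: responsibilities omega_ik, stabilized on the log scale
    log_comp = Y[:, [0]] * eta1 + Y[:, [1]] * eta2 - np.log(denom) + np.log(pi)[None, :]
    log_comp -= log_comp.max(axis=1, keepdims=True)
    W = np.exp(log_comp)
    W /= W.sum(axis=1, keepdims=True)
    return (P1, P2, P3), W
```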

We have studied the standard $\ell_1$-penalized finite mixture of multinomial logistic models over corrupted data. We now introduce the two other methods, modeling outlier errors in the parameter space and in the output space, respectively. Finally, we will compare the performance of these three methods.

3.1. Parameter Space

As in the previous section, we assume that $y_i$ follows the conditional distribution (14) with the new multinomial logit model:

$$\ln\frac{p_{i1k}}{p_{i3k}}=\langle\theta_{1k},x_i\rangle+\sqrt{n}\,e_{ik},\qquad \ln\frac{p_{i2k}}{p_{i3k}}=\langle\theta_{2k},x_i\rangle+\sqrt{n}\,e_{ik}$$

We can rewrite the log-likelihood function as follows:

$$l_p=-\frac{1}{n}\sum_{i=1}^{n}\log\sum_{k=1}^{K}\pi_k\exp\left\{\sum_{r=1}^{2}y_{ir}\left(\langle\theta_{rk},x_i\rangle+\sqrt{n}\,e_{ik}\right)-\ln\left(1+\sum_{r=1}^{2}\exp\{\langle\theta_{rk},x_i\rangle+\sqrt{n}\,e_{ik}\}\right)\right\}\tag{19}$$

A penalized log-likelihood function is defined as

$$(\hat\theta_{1k},\hat\theta_{2k},\hat e_k)\in\underset{\substack{\|\theta_{1k}\|_2\le a_{10}\\ \|\theta_{2k}\|_2\le a_{20}\\ \|e_k\|_2\le b_0\sqrt{n}}}{\operatorname{arg\,min}}\;l_p+\lambda_{n,\theta_1}\sum_{k=1}^{K}\pi_k\|\theta_{1k}\|_1+\lambda_{n,\theta_2}\sum_{k=1}^{K}\pi_k\|\theta_{2k}\|_1+\lambda_{n,e}\sum_{k=1}^{K}\pi_k\|e_k\|_1\tag{20}$$
where the $\ell_1$-norm penalty function is
$$p_n(\tilde\Psi)=\lambda_{n,\theta_1}\sum_{k=1}^{K}\pi_k\|\theta_{1k}\|_1+\lambda_{n,\theta_2}\sum_{k=1}^{K}\pi_k\|\theta_{2k}\|_1+\lambda_{n,e}\sum_{k=1}^{K}\pi_k\|e_k\|_1$$

The complete log-likelihood function, after substituting $\nu_{ik}$, is

$$l_c(\tilde\Psi)=p_n(\tilde\Psi)-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\nu_{ik}\left\{\log\pi_k+\sum_{r=1}^{2}y_{ir}\left(\langle\theta_{rk},x_i\rangle+\sqrt{n}\,e_{ik}\right)-\ln\left(1+\sum_{r=1}^{2}\exp\{\langle\theta_{rk},x_i\rangle+\sqrt{n}\,e_{ik}\}\right)\right\}\tag{21}$$
where $\tilde\Psi=(e_1,\ldots,e_K,\theta_{11},\ldots,\theta_{1K},\theta_{21},\ldots,\theta_{2K},\pi)$. Note that after taking the conditional expectation of (21), we have
$$Q(\tilde\Psi,\tilde\Psi^{(m)})=p_n(\tilde\Psi)-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\omega_{ik}^{(m)}\log\pi_k-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\omega_{ik}^{(m)}\left\{\sum_{r=1}^{2}y_{ir}\left(\langle\theta_{rk},x_i\rangle+\sqrt{n}\,e_{ik}\right)-\ln\left(1+\sum_{r=1}^{2}\exp\{\langle\theta_{rk},x_i\rangle+\sqrt{n}\,e_{ik}\}\right)\right\}\tag{22}$$

To obtain Algorithm 4, we use the Newton-Raphson method, which involves calculating the first and second derivatives of (22).
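For concreteness, the following sketch shows the block Newton-Raphson update that appears as Step 4 of Algorithms 3 and 4; the names are ours, the responsibilities and category probabilities are taken as inputs, and the negative off-diagonal blocks follow the reconstruction of the update above, so they should be read as an assumption.

```python
import numpy as np

def block_newton_step(X, Y, P1, P2, W_k, theta1_k, theta2_k):
    """One block Newton-Raphson step for component k of the multinomial logit
    mixture (Step 4). Y is the (n, 3) one-hot response; P1, P2 are the current
    category-1 and category-2 probabilities for component k; W_k holds the
    responsibilities omega_ik. Names and conventions are ours."""
    w = W_k
    W1 = w * P1 * (1.0 - P1)
    W2 = w * P2 * (1.0 - P2)
    V12 = w * P1 * P2
    A = X.T @ (W1[:, None] * X)          # X' W1 X
    B = X.T @ (W2[:, None] * X)          # X' W2 X
    C = -X.T @ (V12[:, None] * X)        # -X' V_{1.2} X (assumed sign)
    H = np.block([[A, C], [C, B]])       # block information matrix
    g1 = X.T @ (w * (Y[:, 0] - P1))      # X' diag(omega_k) (y_1k - P_1k)
    g2 = X.T @ (w * (Y[:, 1] - P2))      # X' diag(omega_k) (y_2k - P_2k)
    step = np.linalg.solve(H, np.concatenate([g1, g2]))
    p = X.shape[1]
    return theta1_k + step[:p], theta2_k + step[p:]
```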

3.2. Output Space

In the response-space corrupted data for the MLMM (RMLMM), the dependent variable $y_i^c-\sqrt{n}\,e_i$ has three levels. Therefore, we have the conditional distribution as follows:

$$p(y_i\mid x_i)=\sum_{k=1}^{K}\pi_k\exp\left\{I_{1-\sqrt{n}\,e_i}(y_i^c-\sqrt{n}\,e_i)\ln\left(\frac{p_{i1k}}{p_{i3k}}\right)+I_{-1-\sqrt{n}\,e_i}(y_i^c-\sqrt{n}\,e_i)\ln\left(\frac{p_{i2k}}{p_{i3k}}\right)+\ln(p_{i3k})\right\}\tag{23}$$

Algorithm 4: EM Algorithm for modeling errors in the parameter space on multinomial logistic mixture models (PMLMM)

step 1: Begin with initial values $(\theta_{1k}^{(0)},\theta_{2k}^{(0)},e_k^{(0)})$ and $\pi_k^{(0)}$ for all $k$'s.

step 2: Compute $(P_{1k}^{(m)},P_{2k}^{(m)},P_{3k}^{(m)})$ for all $i$'s and $k$'s as follows:

$$p_{irk}^{(m)}=\frac{\exp\{\langle\theta_{rk}^{(m)},x_i\rangle+\sqrt{n}\,e_{ik}\}}{1+\sum_{r'=1}^{2}\exp\{\langle\theta_{r'k}^{(m)},x_i\rangle+\sqrt{n}\,e_{ik}\}},\qquad r=1,2$$
$$p_{i3k}^{(m)}=\frac{1}{1+\sum_{r=1}^{2}\exp\{\langle\theta_{rk}^{(m)},x_i\rangle+\sqrt{n}\,e_{ik}\}}$$

step 3: Compute $\omega_k^{(m)}=(\omega_{1k}^{(m)},\ldots,\omega_{nk}^{(m)})$ for all $i$'s and $k$'s as follows:

$$\omega_{ik}^{(m)}=\frac{\pi_k^{(m)}\exp\left\{\sum_{r=1}^{2}y_{ir}\langle\theta_{rk}^{(m)},x_i\rangle-\ln\left(1+\sum_{r=1}^{2}\exp\langle\theta_{rk}^{(m)},x_i\rangle\right)\right\}}{\sum_{l=1}^{K}\pi_l^{(m)}\exp\left\{\sum_{r=1}^{2}y_{ir}\langle\theta_{rl}^{(m)},x_i\rangle-\ln\left(1+\sum_{r=1}^{2}\exp\langle\theta_{rl}^{(m)},x_i\rangle\right)\right\}}$$

step 4: Compute $(\hat\theta_{1k},\hat\theta_{2k})$ via the EM algorithm and the following equations:

$$\begin{pmatrix}\theta_{1k}^{(m+1)}\\ \theta_{2k}^{(m+1)}\end{pmatrix}=\begin{pmatrix}\theta_{1k}^{(m)}\\ \theta_{2k}^{(m)}\end{pmatrix}+\begin{pmatrix}X^{t}W_1^{(m)}X & -X^{t}V_{1.2}^{(m)}X\\ -X^{t}V_{1.2}^{(m)}X & X^{t}W_2^{(m)}X\end{pmatrix}^{-1}\begin{pmatrix}X^{t}\,\mathrm{diag}(\omega_k^{(m)})(y_{1k}-P_{1k}^{(m)})\\ X^{t}\,\mathrm{diag}(\omega_k^{(m)})(y_{2k}-P_{2k}^{(m)})\end{pmatrix}$$
where $X=(x_1,\ldots,x_n)^{t}$ and
$$W_1^{(m)}=\mathrm{diag}\left(\omega_k^{(m)}P_{1k}^{(m)}(1-P_{1k}^{(m)})\right)$$
$$W_2^{(m)}=\mathrm{diag}\left(\omega_k^{(m)}P_{2k}^{(m)}(1-P_{2k}^{(m)})\right)$$
$$V_{1.2}^{(m)}=\mathrm{diag}\left(\omega_k^{(m)}P_{1k}^{(m)}P_{2k}^{(m)}\right)$$

step 5: Obtain $e_k^{(m+1)}$ from the EM algorithm as follows:

$$e_k^{(m+1)}=e_k^{(m)}+\left(\sqrt{n}\,P_{3k}^{(m)}(1-P_{3k}^{(m)})\right)^{-1}\left(y_{3k}-P_{3k}^{(m)}\right)$$

step 6: Determine $\pi_k^{(m+1)}$ for all $k$ from the following formula:

$$\pi_k^{(m+1)}=\frac{1}{n}\sum_{i=1}^{n}\omega_{ik}^{(m)}$$

step 7: Set $m\leftarrow m+1$ and iterate steps 2–6 until reaching a predefined convergence criterion.

We consider the multinomial logit model (15) and $y_i=(y_{i1},y_{i2},y_{i3})$, where

$$y_{i1}=I_{1-\sqrt{n}\,e_i}(y_i^c-\sqrt{n}\,e_i),\qquad y_{i2}=I_{-1-\sqrt{n}\,e_i}(y_i^c-\sqrt{n}\,e_i)$$

Let the complete log-likelihood function be

$$l_c(\tilde\Psi)=-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\nu_{ik}\left\{\log\pi_k+\sum_{r=1}^{2}y_{ir}\langle\theta_{rk},x_i\rangle-\ln\left(1+\sum_{r=1}^{2}\exp\langle\theta_{rk},x_i\rangle\right)\right\}\tag{24}$$

From this log-likelihood, it is clear that in the RMLMM the errors affect only the number of observations entering the sums:

$$Q(\tilde\Psi,\tilde\Psi^{(m)})=-\frac{1}{n}\sum_{i=1}^{L}\sum_{k=1}^{K}\omega_{ik}^{(m)}\left\{\sum_{r=1}^{2}y_{ir}\langle\theta_{rk},x_i\rangle-\ln\left(1+\sum_{r=1}^{2}\exp\langle\theta_{rk},x_i\rangle\right)\right\}-\frac{1}{n}\sum_{i=1}^{L}\sum_{k=1}^{K}\omega_{ik}^{(m)}\log\pi_k$$

Therefore, it is sufficient to repeat Algorithm 3 with different numbers of observations $L\in\{2,3,\ldots,n\}$.

4. APPLICATION TO THE ANALYSIS OF REAL DATA

We consider two different finite mixtures of logistic models, M1 and M2. The M1 model has two components, whereas the M2 model has three components. We employ the real binary classification dataset from Yang et al. [9]: the Australian dataset obtained from LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The dataset consists of $n=690$ units, and the dimension of the true parameter is $p=14$; therefore we can set $\lambda_{n,\theta}=0$, that is, there is no need to add further sparsity-inducing $\ell_1$ regularization to the parameters since $p\ll n$. We divide our dataset into two groups, 40% of the data as the training set and 60% as the test set. By scaling the number of corrupted samples $r$ with the number of training examples $m$, we generate various corrupted datasets.

We compare the performance of the standard $\ell_1$-penalized mixture models with that of modeling outlier errors in the parameter and output spaces.

To evaluate the performance of our proposed method, we use the Ratio of Generalized Mean Square Error (RGMSE) index, defined as the ratio of the generalized mean square error (GMSE) of M1 to that of M2, i.e.,

$$\widehat{\mathrm{GMSE}}(\hat\theta_k)=(\hat\theta_k-\theta_k)^{t}\,\hat E(XX^{t})\,(\hat\theta_k-\theta_k)$$
where $\hat E(XX^{t})=\frac{1}{n}\sum_{i=1}^{n}x_i x_i^{t}$. We then have
$$\widehat{\mathrm{RGMSE}}_j=\frac{\mathrm{GMSE}(M_{j1})}{\mathrm{GMSE}(M_{j2})},\qquad j=1,\ldots,2000$$
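A minimal sketch of how these quantities could be computed for one run (helper names are ours; `theta_hat` and `theta_true` denote the estimated and true coefficient vectors of a component):

```python
import numpy as np

def gmse(theta_hat, theta_true, X):
    """Generalized MSE: (theta_hat - theta)' E_hat(X X') (theta_hat - theta)."""
    d = theta_hat - theta_true
    second_moment = (X.T @ X) / X.shape[0]   # E_hat(X X') = (1/n) sum x_i x_i'
    return float(d @ second_moment @ d)

def rgmse(theta_hat_m1, theta_hat_m2, theta_true, X):
    """Ratio of GMSE of model M1 to model M2 for one simulation run."""
    return gmse(theta_hat_m1, theta_true, X) / gmse(theta_hat_m2, theta_true, X)
```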

This quantity was computed in each run, and the mean of the $\widehat{\mathrm{RGMSE}}$s over 2,000 runs is reported in Table 1. A lower $\widehat{\mathrm{RGMSE}}$ shows that fitting the mixture of two populations (M1) is better than fitting it with three components (M2).

r        25%                           50%                           100%
         BMM      PBMM     RBMM       BMM      PBMM     RBMM       BMM      PBMM     RBMM
w/o      0.6852   0.4112   0.3109     0.6135   0.4688   0.2969     0.0061   0.4629   0.4335
Log      0.0079   0.1498   0.2638     0.2970   0.3970   0.3467     0.2911   0.2920   0.1946
Sqrt     0.7193   0.0549   0.1168     0.2634   0.1067   0.2821     0.0673   0.2280   0.1181
Linear   0.1115   0.1503   0.1080     0.1055   0.1545   0.2747     0.0399   0.3961   0.1153

BMM, Bernoulli mixture model; PBMM, parameter space in BMM; RBMM, response space of BMM.

Table 1

Comparisons of the mean of RGMSEs under different models.

Figure 1 plots the RGMSE of the parameter estimates against the number of samples n. We compare three methods:

  1. the standard $\ell_1$-penalized finite mixture of generalized linear models (FMGLM) over the corrupted data (BMM, $\ell_1$-reg),

  2. our first M-estimator that models errors in the parameter space (error in parameters, PBMM),

  3. our second M-estimator, which models error in the output space (error in output, RBMM).

Figure 1

Comparison of RGMSE for different types of outliers; the share of the samples used in the training dataset: 25% (left column), 50% (center column), and 100% (right column).

Each row shows a different type of outlier contamination of the dataset: (w/o) the original dataset without added outliers, and (Log, Sqrt, Linear), where the number of outliers $r$ is scaled in three different ways as $r=\log(m)$, $\sqrt{m}$, and $0.1m$.

Each column shows a different fraction of the training dataset: 25% (left column), 50% (center column), and 100% (right column).

Table 2 (model M1) and Table 3 (model M2) present the estimates and their standard deviations.

k    $\hat\theta$ (BMM)    sd (BMM)    $\hat\theta$ (PBMM)    sd (PBMM)    $\hat\theta$ (OBMM)    sd (OBMM)
−1.1318 0.3898 −2.4252 0.4315 −1.1469 0.3962
−0.0339 0.0223 −0.0819 0.0153 −0.0343 0.0224
−0.0616 0.0385 0.0121 0.0324 −0.0614 0.0386
0.0186 0.4617 1.2399 0.3242 0.0285 0.4632
0.1380 0.0491 0.2159 0.0457 0.1384 0.0490
−0.1279 0.0927 0.0399 0.0808 −0.1278 0.0928
1 0.2291 0.0818 0.3160 0.0704 0.2301 0.0817
2.0894 0.4519 4.6841 0.4094 2.1055 0.4628
−0.7119 0.6548 −2.3642 0.5815 −0.7503 0.6907
0.1330 0.1288 0.3511 0.0894 0.1357 0.1307
−1.0179 0.3795 −1.8398 0.3286 −1.0262 0.3790
0.2971 0.4797 −1.0341 0.4926 0.2990 0.4802
−0.0016 0.0010 −0.0034 0.0010 −0.0016 0.0010
0.0002 0.0009 0.0006 0.0002 0.0002 0.0009
0.0113 0.3189 −0.1207 0.3782 0.0113 0.3189
−0.0043 0.0148 −0.0183 0.0177 −0.0043 0.0148
−0.0023 0.0338 0.0392 0.0384 −0.0023 0.0338
0.0325 0.3526 1.0210 0.3784 0.0325 0.3526
0.1339 0.0439 0.2152 0.0573 0.1339 0.0439
0.1282 0.0820 0.4095 0.1049 0.1282 0.0820
2 0.0108 0.0560 −0.0293 0.0704 0.0108 0.0559
4.0875 0.4269 8.6061 0.8816 4.0875 0.4269
1.2307 0.4384 1.0877 0.5550 1.2307 0.4384
0.1042 0.0661 0.2849 0.1026 0.1043 0.0661
0.16908 0.3102 0.1089 0.3531 0.1691 0.3102
−2.6615 0.4601 −6.0474 0.7829 −2.6615 0.4601
−0.0025 0.0012 −0.0061 0.0013 −0.0025 0.0012
0.0009 0.0001 0.0019 0.0004 0.0009 0.0001

BMM, Bernoulli mixture model; PBMM, parameter space in BMM; OBMM, output (response) space of BMM.

Table 2

Estimates and their standard deviations for model M1, based on 2,000 runs on the original dataset without adding artificial outliers.

k    $\hat\theta$ (BMM)    sd (BMM)    $\hat\theta$ (PBMM)    sd (PBMM)    $\hat\theta$ (OBMM)    sd (OBMM)
−1.2010 0.3741 −1.4134 0.5099 −1.2166 0.3770
−0.0370 0.0223 −0.0447 0.0239 −0.0376 0.0212
−0.0422 0.0362 0.0279 0.0386 −0.0426 0.0349
0.2765 0.4392 0.4722 0.4717 0.3072 0.4206
0.1517 0.0462 0.1629 0.0511 0.1494 0.0466
−0.0574 0.0874 0.0323 0.0946 −0.0527 0.0864
1 0.1541 0.0756 0.1795 0.0852 0.1551 0.0730
2.2690 0.4286 2.6910 0.9504 2.2529 0.4337
−1.2131 0.5075 −1.4257 0.4287 −1.2208 0.4859
0.2220 0.0651 0.2452 0.0737 0.2192 0.0658
−1.1937 0.3391 −1.3085 0.3758 −1.2083 0.3281
0.0563 0.4798 −0.1952 0.6393 0.0511 0.4619
−0.0028 0.0009 −0.0029 0.0009 −0.0028 0.0009
0.0003 0.0001 0.0003 0.0002 0.0003 0.0001
0.1157 0.3198 −0.0833 0.3290 0.1096 0.3275
−0.0104 0.0148 −0.0125 0.0160 −0.0103 0.0144
−0.0090 0.0351 0.0135 0.0365 −0.0062 0.0329
0.2715 0.3605 1.4226 0.4718 0.2767 0.3599
0.1379 0.0446 0.1502 0.0555 0.1362 0.0447
0.2050 0.0837 0.2431 0.1198 0.2042 0.0822
2 0.0534 0.0617 −0.0488 0.0615 −0.0521 0.0603
4.4087 0.4109 5.1751 1.8911 4.3724 0.3989
0.9351 0.4350 0.9548 0.4725 0.9199 0.3903
0.1620 0.0558 0.1816 0.0829 0.1592 0.0572
0.1720 0.3284 0.1467 0.3258 0.1577 0.3027
−2.9528 0.4649 −3.5070 1.4356 −2.9171 0.4669
−0.0042 0.0011 −0.0045 0.0013 −0.0042 0.0010
0.0010 0.0002 0.0012 0.0004 0.0010 0.0002
−0.2239 0.3374 −0.2846 0.3565 −0.2295 0.3403
−0.0021 0.0146 −0.0033 0.0155 −0.0019 0.0146
−0.0027 0.0347 −0.0209 0.0365 −0.0290 0.0335
0.4231 0.3952 0.2910 0.4358 0.4132 0.3933
0.1379 0.0491 0.1505 0.0570 0.1362 0.0492
−0.0482 0.0862 −0.03477 0.0934 −0.0488 0.0855
3 0.2206 0.0773 0.2411 0.08556 0.2204 0.0755
4.0891 0.4534 4.7372 1.5019 4.0570 0.4535
1.5783 0.4913 1.6599 0.5687 1.5755 0.4643
0.0348 0.1026 0.0484 0.1032 0.0326 0.1033
0.0196 0.3324 0.1495 0.3233 0.1865 0.3130
−2.1498 0.5279 −2.6216 1.1565 −2.1277 0.5287
−0.0001 0.0009 −0.0001 0.0009 −0.0001 0.0009
0.0006 0.0002 0.0007 0.0003 0.0007 0.0002

BMM, Bernoulli mixture model; PBMM, parameter space in BMM; OBMM, output (response) space of BMM.

Table 3

Estimates and their standard deviations for model M2, based on 2,000 runs on the original dataset without adding artificial outliers.

5. DISCUSSION

In this paper, to model the sparsity of the outlier response vector in the BMMs, we randomly selected a small number $r$ of samples from the $n$ observations and corrupted them arbitrarily. We assessed the performance of the proposed method on the Australian real binary classification dataset obtained from LIBSVM. We obtained two distinct ways to handle sparsity in the finite mixture of generalized linear models (FMGLM): in the parameter space of the GLM and in the output space. Using the EM algorithm, which is a convenient approach for the optimization of finite mixture models, we showed that the performance is improved. Comparing the results and figures in the paper, we saw that the proposed robust methods perform better than the standard finite mixture of logistic regressions with multiple components.

CONFLICTS OF INTEREST

The authors declare no conflicts of interest.

AUTHORS' CONTRIBUTIONS

Prof. Eskandari designed the model, the idea, and the computational framework, and Mrs. Sabbaghi analyzed the data. All authors discussed the results and contributed to the final manuscript.

ACKNOWLEDGMENTS

We would like to thank the editor and the referees for their valuable comments on our paper. This work is part of the first author's Ph.D. thesis at Allameh Tabataba'i University.
