
Parsimonious Gaussian Mixture Models

Paul David McNicholas¹ and Thomas Brendan Murphy²

¹ Department of Mathematics and Statistics, University of Guelph, Guelph, Ontario, Canada, N1G 2W1. E-mail: pmcnicho@uoguelph.ca.
² School of Mathematical Sciences, University College Dublin, Belfield, Dublin 4, Ireland. E-mail: brendan.murphy@ucd.ie.

This research was funded by the SFI Basic Research Grant (04/BR/M0057). Part of this work was completed during a visit to the Center for Statistics in the Social Sciences, which was supported by NIH grant (8 R01 EB002137-02). The authors would like to thank Prof. Adrian Raftery and the members of the Working Group on Model-Based Clustering at the University of Washington for useful suggestions and inputs into this work. The authors would also like to thank an anonymous referee for helpful suggestions that have added to the overall quality of this work. This work was carried out while the authors were at the Department of Statistics, School of Computer Science and Statistics, Trinity College Dublin, Dublin 2, Ireland.


Abstract

Parsimonious Gaussian mixture models are developed using a latent Gaussian model which is closely related to the factor analysis model. These models provide a unified modeling framework which includes the mixtures of probabilistic principal component analyzers and mixtures of factor analyzers models as special cases. In particular, a class of eight parsimonious Gaussian mixture models which are based on the mixtures of factor analyzers model is introduced, and the maximum likelihood estimates for the parameters in these models are found using an AECM algorithm. The class of models includes parsimonious models that have not previously been developed. These models are applied to the analysis of chemical and physical properties of Italian wines and the chemical properties of coffee; the models are shown to give excellent clustering performance.

Keywords: Mixture models, factor analysis, probabilistic principal components analysis, cluster analysis, model-based clustering.


1 Introduction

Mixture models assume that data are collected from a finite collection of populations and that the data within each population can be modelled using a standard statistical model (e.g. Gaussian, Poisson or binomial). As a result, mixture models are particularly useful when modelling data collected from multiple sources. Mixture models are the basis of many modern clustering algorithms, and methods based on mixtures offer the advantage of allowing uncertainties to be quantified using probabilities. The Gaussian mixture model has received particular attention in the statistical literature; this model assumes a Gaussian structure for each population. In particular, the model density is of the form

$$f(x) = \sum_{g=1}^{G} \pi_g\, \phi(x \mid \mu_g, \Sigma_g), \qquad (1)$$

where $\phi(x \mid \mu_g, \Sigma_g)$ is the density of a multivariate Gaussian with mean $\mu_g$ and covariance $\Sigma_g$. Extensive reviews of mixtures are contained in Titterington et al. (1985), McLachlan and Basford (1988) and McLachlan and Peel (2000); these books examine many different types of mixture models, but emphasis is given to Gaussian mixtures. Additionally, Fraley and Raftery (2002) provide an excellent review of Gaussian mixture models with an emphasis on applications to cluster analysis, discriminant analysis and density estimation.

The general Gaussian mixture model (1) has a total of $(G-1) + Gp + Gp(p+1)/2$ parameters, of which $Gp(p+1)/2$ are used in the group covariance matrices $\Sigma_g$. A simpler form of the mixture constrains the covariances to be equal across groups, which reduces the total to $(G-1) + Gp + p(p+1)/2$ parameters, of which $p(p+1)/2$ are used for the common group covariance matrix $\Sigma_g = \Sigma$. Banfield and Raftery (1993), Celeux and Govaert (1995) and Fraley and Raftery (1998, 2002) exploit an eigenvalue decomposition of the group covariance matrices to give a wide range of covariance structures that use between one and $Gp(p+1)/2$


parameters. The eigenvalue decomposition of the covariance matrix is of the form $\Sigma_g = \lambda_g D_g A_g D_g'$, where $\lambda_g$ is a constant, $D_g$ is a matrix of eigenvectors of $\Sigma_g$ and $A_g$ is a diagonal matrix with entries proportional to the eigenvalues of $\Sigma_g$. The model-based clustering methods that they have developed allow for constraining the components of the eigenvalue decomposition of $\Sigma_g$ across components of the mixture model; the components of the eigenvalue decomposition are easily interpreted. Fraley and Raftery (2002) demonstrate that the parsimonious mixture models derived in this model-based clustering framework give excellent results in a variety of applications.

We develop a new class of Gaussian mixture models with parsimonious covariance structures. These models are based on assuming a latent Gaussian model structure for each population; the latent Gaussian model is closely related to the mixture of factor analyzers model (Ghahramani and Hinton 1997). The mixture of factor analyzers model assumes a covariance structure of the form $\Sigma_g = \Lambda_g\Lambda_g' + \Psi_g$, where the loading matrix $\Lambda_g$ is a $p \times q$ matrix of parameters, typically with $q \ll p$, and the noise matrix $\Psi_g$ is a diagonal matrix; a detailed development of this covariance structure is given in Section 2. The loading and noise terms of the covariance matrix can be constrained to be equal or unequal across groups, and the noise term can be further restricted to give a collection of eight parsimonious covariance structures. These covariance structures can have as few as $pq - q(q-1)/2 + 1$ parameters or as many as $G[pq - q(q-1)/2 + p]$ parameters, with $q = 0, 1, 2, \dots$. The full development of the collection of parsimonious Gaussian mixture models (PGMMs) and methods for fitting these models are given in Section 3. Some computational aspects, including the issue of model selection for the class of parsimonious Gaussian mixture models, are addressed in Section 4; we propose using the BIC as a model selection criterion. In Section 5, we apply the models to the analysis of data recording chemical and physical properties of Italian wines collected from three areas of the Piedmont region. We find that the mixture models can be used to cluster the observations into the three


areas with only one error. The results of the PGMM analysis are compared with those of other model-based methods of analysis. In Section 6, we apply the models to the analysis of data recording the chemical properties of coffee from a wide range of locations. The mixture models are shown to capture the difference between the Arabica and Robusta species of coffee plant. The results are compared with other model-based methods and are shown to give superior performance. We conclude, in Section 7, with a discussion of the methods and results of this work.
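The mixture density (1) is straightforward to evaluate directly. The following minimal numpy sketch is ours, purely for illustration (the function names are not from any package):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """phi(x | mu, Sigma): multivariate Gaussian density at x."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)' Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return np.exp(-0.5 * (p * np.log(2.0 * np.pi) + logdet + quad))

def mixture_density(x, pis, mus, Sigmas):
    """Model density (1): f(x) = sum_g pi_g phi(x | mu_g, Sigma_g)."""
    return sum(pi * gaussian_density(x, mu, Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))
```

With $G = 1$ and $\Sigma = I_p$ this reduces to the standard multivariate normal density.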

2 Latent Gaussian Models

In this section, we briefly review the latent Gaussian model underlying factor analysis and probabilistic principal components analysis; methods for fitting these models are also discussed.

Factor analysis (Spearman 1904) is a data reduction technique that aims to find unobserved factors that explain the variability in the data. The model (see Bartholomew and Knott 1999, Chapter 3) assumes that a $p$-dimensional random vector $x$ is modelled using a $q$-dimensional vector of unobserved latent factors $u$, where $q \ll p$. The model is

$$x = \mu + \Lambda u + \epsilon,$$

where $\Lambda$ is a $p \times q$ matrix of factor loadings, the factors $u \sim N(0, I_q)$ and $\epsilon \sim N(0, \Psi)$, where $\Psi = \mathrm{diag}(\psi_1, \psi_2, \dots, \psi_p)$. It follows from this model that the marginal distribution of $x$ is multivariate Gaussian with mean $\mu$ and covariance $\Lambda\Lambda' + \Psi$. The probabilistic principal components analysis (PPCA) model (Tipping and Bishop 1999b) is a special case of the factor analysis model in which the distribution of the errors is assumed to be isotropic, so that $\Psi = \psi I_p$.

Suppose that $x = (x_1, x_2, \dots, x_n)$ are iid from a factor analysis model; then the


log-likelihood is

$$\sum_{i=1}^{n} \log f(x_i) = -\frac{np}{2}\log 2\pi - \frac{n}{2}\log|\Lambda\Lambda' + \Psi| - \frac{n}{2}\mathrm{tr}\left\{ S(\Lambda\Lambda' + \Psi)^{-1} \right\},$$

where $S = (1/n)\sum_{i=1}^{n}(x_i - \hat\mu)(x_i - \hat\mu)'$ is the sample covariance matrix; in fact, the data only appear in the model through $S$. Clearly, the maximum likelihood estimate of $\mu$ is $\hat\mu = \bar{x}$.

The EM algorithm (Dempster et al. 1977) is used to find maximum likelihood estimates for $\Lambda$ and $\Psi$. We take $x$ as the observed data and let $u$ be the missing data, the values of $u = (u_1, u_2, \dots, u_n)$ being the unobserved latent values. The complete-data consist of $(x, u)$, the observed and missing data. Therefore, the complete-data log-likelihood is

$$l_c(x, u) = C - \frac{n}{2}\log|\Psi| - \frac{1}{2}\mathrm{tr}\left\{ \Psi^{-1} \sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)' \right\} + \sum_{i=1}^{n}(x_i - \mu)'\Psi^{-1}\Lambda u_i - \frac{1}{2}\mathrm{tr}\left\{ \Lambda'\Psi^{-1}\Lambda \sum_{i=1}^{n} u_i u_i' \right\},$$

where $C$ is constant as a function of $\mu$, $\Lambda$ and $\Psi$. The expected value of $u_i$, conditional on $x_i$ and the current model parameters, is

$$E[u_i \mid x_i, \mu, \Lambda, \Psi] = \Lambda'(\Lambda\Lambda' + \Psi)^{-1}(x_i - \mu) = \beta(x_i - \mu),$$

where $\beta = \Lambda'(\Lambda\Lambda' + \Psi)^{-1}$, and the expected value of $u_i u_i'$, conditional on $x_i$ and the current model parameters, is

$$E[u_i u_i' \mid x_i, \mu, \Lambda, \Psi] = I_q - \beta\Lambda + \beta(x_i - \mu)(x_i - \mu)'\beta'.$$

Hence, the expected complete-data log-likelihood $Q$, evaluated with $\mu = \hat\mu$, is

$$Q(\Lambda, \Psi) = C + \frac{n}{2}\log|\Psi^{-1}| - \frac{n}{2}\mathrm{tr}\{\Psi^{-1}S\} + n\,\mathrm{tr}\{\Psi^{-1}\Lambda\hat\beta S\} - \frac{n}{2}\mathrm{tr}\{\Lambda'\Psi^{-1}\Lambda\Theta\},$$

where $\Theta = I_q - \hat\beta\Lambda + \hat\beta S \hat\beta'$ is a symmetric $q \times q$ matrix.

Differentiating $Q$ with respect to $\Lambda$ and $\Psi^{-1}$ respectively, utilizing results from Lütkepohl (1996), and solving the resulting equations gives

$$\hat\Lambda = S\hat\beta'\Theta^{-1} \qquad \text{and} \qquad \hat\Psi = \mathrm{diag}\left\{ S - \hat\Lambda\hat\beta S \right\}. \qquad (2)$$


Hence, the maximum likelihood estimates for $\Lambda$ and $\Psi$ can be obtained by iteratively applying (2) and updating the values of $\hat\beta$ and $\Theta$ as required. For the PPCA model, we find that

$$\hat\psi = \frac{1}{p}\,\mathrm{tr}\left\{ S - \hat\Lambda\hat\beta S \right\}.$$
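The iteration in (2) translates directly into code. The sketch below is our own illustrative implementation (not the authors' software): it simulates data from a factor analysis model, forms $S$, and applies the updates (2). A useful sanity check is that when $S$ has exact factor-analysis structure $\Lambda\Lambda' + \Psi$, the true parameters are a fixed point of the update.

```python
import numpy as np

def fa_update(S, Lambda, Psi):
    """One cycle of the updates (2) for the factor analysis model."""
    q = Lambda.shape[1]
    # beta = Lambda' (Lambda Lambda' + Psi)^{-1}
    beta = np.linalg.solve(Lambda @ Lambda.T + Psi, Lambda).T
    Theta = np.eye(q) - beta @ Lambda + beta @ S @ beta.T
    Lambda_new = S @ beta.T @ np.linalg.inv(Theta)
    Psi_new = np.diag(np.diag(S - Lambda_new @ beta @ S))
    return Lambda_new, Psi_new

# Simulate x = Lambda u + e (take mu = 0) and fit by iterating (2).
rng = np.random.default_rng(0)
p, q, n = 5, 2, 5000
Lambda_true = rng.normal(size=(p, q))
Psi_true = np.diag(rng.uniform(0.5, 1.5, size=p))
u = rng.normal(size=(n, q))
e = rng.normal(size=(n, p)) @ np.sqrt(Psi_true)   # diagonal noise
x = u @ Lambda_true.T + e
S = (x - x.mean(0)).T @ (x - x.mean(0)) / n       # sample covariance

Lambda, Psi = rng.normal(scale=0.1, size=(p, q)), np.eye(p)
for _ in range(500):
    Lambda, Psi = fa_update(S, Lambda, Psi)
```

After the iterations, $\hat\Lambda\hat\Lambda' + \hat\Psi$ closely approximates $S$, since with $q = 2$ and $p = 5$ the model is nearly saturated.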

3 Parsimonious Gaussian Mixture Models

Ghahramani and Hinton (1997) extended the factor analysis model (Section 2) by developing the mixtures of factor analyzers model, which assumes a mixture of Gaussian distributions with factor analysis covariance structures; this work was further developed by McLachlan and Peel (2000). Additionally, Tipping and Bishop (1999a) developed a mixtures of probabilistic principal components analyzers model.

Under the general mixtures of factor analyzers model, the density of an observation in group $g$ is multivariate Gaussian with mean $\mu_g$ and covariance $\Lambda_g\Lambda_g' + \Psi_g$, and the probability of membership of group $g$ is $\pi_g$. Therefore, the mixtures of factor analyzers model has density

$$f(x_i) = \sum_{g=1}^{G} \frac{\pi_g}{(2\pi)^{p/2}\,|\Lambda_g\Lambda_g' + \Psi_g|^{1/2}} \exp\left\{ -\frac{1}{2}(x_i - \mu_g)'(\Lambda_g\Lambda_g' + \Psi_g)^{-1}(x_i - \mu_g) \right\}.$$

We extend the mixture of factor analyzers model by allowing constraints across groups on the $\Lambda_g$ and $\Psi_g$ matrices and on whether or not $\Psi_g = \psi_g I_p$. The full range of possible constraints provides a class of eight different parsimonious Gaussian mixture models (PGMMs), which are given in Table 1.

Three of the models (UCU, UUC and UUU) given in Table 1 have been developed previously. Ghahramani and Hinton (1997) assume the equal noise model (UCU), whereas McLachlan and Peel (2000) and McLachlan et al. (2003) assume unequal noise (UUU); however, they comment that assuming equal noise (UCU) can give more stable results. In the context of the mixtures of probabilistic principal components analyzers model, Tipping and Bishop (1999a) assume unequal, but isotropic, noise (UUC), i.e.


Table 1: Parsimonious covariance structures derived from the mixtures of factor analyzers model.

  Model ID   Loading matrix   Error variance   Isotropic        Covariance parameters
  CCC        Constrained      Constrained      Constrained      [pq - q(q-1)/2] + 1
  CCU        Constrained      Constrained      Unconstrained    [pq - q(q-1)/2] + p
  CUC        Constrained      Unconstrained    Constrained      [pq - q(q-1)/2] + G
  CUU        Constrained      Unconstrained    Unconstrained    [pq - q(q-1)/2] + Gp
  UCC        Unconstrained    Constrained      Constrained      G[pq - q(q-1)/2] + 1
  UCU        Unconstrained    Constrained      Unconstrained    G[pq - q(q-1)/2] + p
  UUC        Unconstrained    Unconstrained    Constrained      G[pq - q(q-1)/2] + G
  UUU        Unconstrained    Unconstrained    Unconstrained    G[pq - q(q-1)/2] + Gp
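The covariance parameter counts in Table 1 follow mechanically from the three-letter model ID. A small helper (ours, purely illustrative) makes the bookkeeping explicit:

```python
def covariance_params(model_id, p, q, G):
    """Number of covariance parameters for a PGMM, following Table 1.

    model_id: three letters, e.g. 'CUU' -- loading matrix, error
    variance, isotropic -- each 'C'onstrained or 'U'nconstrained.
    """
    loading, error, isotropic = model_id
    load = p * q - q * (q - 1) // 2       # one loading matrix
    if loading == 'U':
        load *= G                          # a loading matrix per group
    noise = 1 if isotropic == 'C' else p   # psi * I_p versus diagonal Psi
    if error == 'U':
        noise *= G                         # a noise term per group
    return load + noise
```

For example, with $p = 7$, $q = 2$ and $G = 3$, CCC uses 14 covariance parameters while UUU uses 60.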

$\Psi_g = \psi_g I_p$. The other five covariance structures (CCC, CCU, CUC, CUU and UCC) are new, and we note that incorporating constraints on the loading matrices dramatically reduces the number of covariance parameters and yields parsimonious models.

The alternating expectation-conditional maximization (AECM) algorithm (Meng and van Dyk 1997) is used for fitting these models. This algorithm is an extension of the EM algorithm that uses different specifications of the missing data at each stage. For the PGMM, when estimating $\pi_g$ and $\mu_g$ the missing data are the unobserved group labels $z$, and when estimating $\Lambda_g$ and $\Psi_g$ the missing data are the group labels $z$ and the unobserved latent factors $u$.

At the first stage of the algorithm, when estimating $\pi_g$ and $\mu_g$, we let $z = (z_1, z_2, \dots, z_n)$ be the unobserved group labels, where $z_{ig} = 1$ if observation $i$ belongs to group $g$ and $z_{ig} = 0$ otherwise. Hence, the complete-data likelihood for the mixture model is

$$L_1(\pi_g, \mu_g, \Lambda_g, \Psi_g) = \prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \pi_g f(x_i \mid \mu_g, \Lambda_g, \Psi_g) \right]^{z_{ig}}.$$


Hence, we find that the expected complete-data log-likelihood is of the form

$$Q_1(\mu_g, \pi_g) = \sum_{g=1}^{G} n_g \log \pi_g - \frac{np}{2}\log 2\pi - \sum_{g=1}^{G} \frac{n_g}{2} \log|\Lambda_g\Lambda_g' + \Psi_g| - \sum_{g=1}^{G} \frac{n_g}{2} \mathrm{tr}\left\{ S_g (\Lambda_g\Lambda_g' + \Psi_g)^{-1} \right\},$$

where

$$\hat z_{ig} = \frac{\hat\pi_g\, \phi(x_i \mid \hat\mu_g, \hat\Lambda_g, \hat\Psi_g)}{\sum_{g'=1}^{G} \hat\pi_{g'}\, \phi(x_i \mid \hat\mu_{g'}, \hat\Lambda_{g'}, \hat\Psi_{g'})}, \qquad (3)$$

$n_g = \sum_{i=1}^{n} \hat z_{ig}$ and $S_g = (1/n_g)\sum_{i=1}^{n} \hat z_{ig}(x_i - \hat\mu_g)(x_i - \hat\mu_g)'$. Maximizing the expected complete-data log-likelihood with respect to $\mu_g$ and $\pi_g$ yields

$$\hat\mu_g = \frac{\sum_{i=1}^{n} \hat z_{ig} x_i}{\sum_{i=1}^{n} \hat z_{ig}} \qquad \text{and} \qquad \hat\pi_g = \frac{n_g}{n}.$$
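The first-stage quantities $\hat z_{ig}$, $\hat\mu_g$ and $\hat\pi_g$ translate directly into code. In this sketch (our own; names are illustrative), the responsibilities (3) are computed on the log scale for numerical stability, with each group's covariance passed in as $\Sigma_g = \Lambda_g\Lambda_g' + \Psi_g$:

```python
import numpy as np

def e_step(X, pis, mus, Sigmas):
    """Responsibilities z_ig as in (3), computed on the log scale."""
    n, p = X.shape
    logd = np.empty((n, len(pis)))
    for g, (pi, mu, Sigma) in enumerate(zip(pis, mus, Sigmas)):
        diff = X - mu
        quad = np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
        _, logdet = np.linalg.slogdet(Sigma)
        logd[:, g] = np.log(pi) - 0.5 * (p * np.log(2 * np.pi)
                                         + logdet + quad)
    logd -= logd.max(axis=1, keepdims=True)   # guard against underflow
    z = np.exp(logd)
    return z / z.sum(axis=1, keepdims=True)

def m_step_stage1(X, z):
    """Stage-1 CM-step: update pi_g and mu_g given responsibilities."""
    ng = z.sum(axis=0)
    pis = ng / len(X)
    mus = (z.T @ X) / ng[:, None]
    return pis, mus
```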

At the second stage of the AECM algorithm, when estimating $\Lambda_g$ and $\Psi_g$, we take the group labels $z$ and the latent factors $u$ to be the missing data. Therefore, the complete-data log-likelihood is

$$l(\pi_g, \mu_g, \Lambda_g, \Psi_g) = C + \sum_{g=1}^{G} \left[ -\frac{n_g}{2}\log|\Psi_g| - \frac{n_g}{2}\mathrm{tr}\{\Psi_g^{-1} S_g\} + \sum_{i=1}^{n} z_{ig}(x_i - \mu_g)'\Psi_g^{-1}\Lambda_g u_i - \frac{1}{2}\mathrm{tr}\Big\{ \Lambda_g'\Psi_g^{-1}\Lambda_g \sum_{i=1}^{n} z_{ig} u_i u_i' \Big\} \right],$$

where $C$ is constant with respect to $\mu_g$, $\Lambda_g$ and $\Psi_g$.

It follows that the expected complete-data log-likelihood, evaluated with $\mu_g = \hat\mu_g$ and $\pi_g = \hat\pi_g$, is of the form

$$Q(\Lambda_g, \Psi_g) = C + \sum_{g=1}^{G} n_g \left[ \frac{1}{2}\log|\Psi_g^{-1}| - \frac{1}{2}\mathrm{tr}\{\Psi_g^{-1} S_g\} + \mathrm{tr}\{\Psi_g^{-1}\Lambda_g\hat\beta_g S_g\} - \frac{1}{2}\mathrm{tr}\{\Lambda_g'\Psi_g^{-1}\Lambda_g\Theta_g\} \right],$$

where $\Theta_g = I_q - \hat\beta_g\Lambda_g + \hat\beta_g S_g \hat\beta_g'$ and the $\hat z_{ig}$ are computed as in (3) with the estimates of $\hat\mu_g$ and $\hat\pi_g$ as calculated in the first stage of the AECM algorithm. The resulting estimates when we impose the constraints of Table 1 on the $\Lambda_g$ and $\Psi_g$ matrices can be easily derived from the expression for $Q(\Lambda_g, \Psi_g)$. We illustrate how the maxima are


computed for the CCC (Appendix A.1) and CUU (Appendix A.2) cases; the CUU case is included because it is different from the other cases. Outline calculations required to compute the estimates for all of the models are given in Appendix B. The AECM algorithm iteratively updates the parameters until convergence to the maximum likelihood estimates. The resulting $\hat z_{ig}$ values at convergence are estimates of the a posteriori probabilities of group membership for each observation and can be used to cluster the observations into groups.
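To make the two-stage scheme concrete, the following compact, self-contained sketch implements an AECM cycle for the fully unconstrained (UUU) model. It is our own illustrative code, not the authors' implementation; the constrained models would instead pool $\Lambda_g$ or $\Psi_g$ across groups at the second stage.

```python
import numpy as np

def aecm_uuu(X, G, q, z, n_iter=50):
    """AECM sketch for the UUU PGMM. z: (n, G) initial responsibilities."""
    n, p = X.shape
    Lambdas = [np.full((p, q), 0.1) for _ in range(G)]
    Psis = [np.eye(p) for _ in range(G)]
    for _ in range(n_iter):
        # Stage 1 (missing data: labels z): update pi_g and mu_g.
        ng = z.sum(axis=0)
        pis = ng / n
        mus = (z.T @ X) / ng[:, None]
        # Stage 2 (missing data: labels z and factors u): update Lambda_g, Psi_g.
        for g in range(G):
            diff = X - mus[g]
            Sg = (z[:, g, None] * diff).T @ diff / ng[g]
            beta = np.linalg.solve(Lambdas[g] @ Lambdas[g].T + Psis[g],
                                   Lambdas[g]).T
            Theta = np.eye(q) - beta @ Lambdas[g] + beta @ Sg @ beta.T
            Lambdas[g] = Sg @ beta.T @ np.linalg.inv(Theta)
            Psis[g] = np.diag(np.diag(Sg - Lambdas[g] @ beta @ Sg))
        # E-step: recompute the responsibilities z_ig as in (3).
        logd = np.empty((n, G))
        for g in range(G):
            Sigma = Lambdas[g] @ Lambdas[g].T + Psis[g]
            diff = X - mus[g]
            quad = np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
            _, logdet = np.linalg.slogdet(Sigma)
            logd[:, g] = np.log(pis[g]) - 0.5 * (p * np.log(2 * np.pi)
                                                 + logdet + quad)
        logd -= logd.max(axis=1, keepdims=True)
        z = np.exp(logd)
        z /= z.sum(axis=1, keepdims=True)
    return z
```

Clustering then proceeds via the MAP rule `z.argmax(axis=1)` on the converged responsibilities.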

4 Computational Aspects

4.1 Initialization

The AECM algorithm was run multiple times for each dataset with different random starting values of the $z_{ig}$, to prevent the attainment of a local, rather than global, maximum of the log-likelihood. The details of how the algorithm was initialized are outlined below.

To initialize the $z_{ig}$, random values $z_{ig} \in \{0, 1\}$ were chosen so that $\sum_g z_{ig} = 1$ for all $i$. The CCC model was then run from these starting values, and the resulting $\hat z_{ig}$ were taken as the starting group membership labels for every other model.

The initial values for the elements of $\Lambda_g$ and $\Psi_g$ were generated from the eigen-decomposition of $S_g$ as follows. The $S_g$ were computed based on the initial values of the $z_{ig}$. The eigen-decomposition of each $S_g$ was computed using Householder reduction and the QL method (details are given by Press et al. 1992). The initial values of the elements of $\Lambda_g$ were then set as $\lambda_{ij} = \sqrt{d_j}\,\rho_{ij}$, where $d_j$ is the $j$th largest eigenvalue of $S_g$, $\rho_{ij}$ is the $i$th element of the corresponding eigenvector of $S_g$, $i \in \{1, 2, \dots, p\}$ and $j \in \{1, 2, \dots, q\}$. The $\Psi_g$ were then initialized


as $\Psi_g = \mathrm{diag}\{S_g - \Lambda_g\Lambda_g'\}$.
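In code, this eigen-initialization might look as follows; numpy's `eigh` plays the role of the Householder/QL routine, and the function name is ours:

```python
import numpy as np

def init_lambda_psi(Sg, q):
    """Initialize Lambda_g and Psi_g from the eigen-decomposition of Sg."""
    vals, vecs = np.linalg.eigh(Sg)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:q]       # indices of the q largest
    d, rho = vals[idx], vecs[:, idx]
    Lambda = rho * np.sqrt(d)              # lambda_ij = sqrt(d_j) * rho_ij
    Psi = np.diag(np.diag(Sg - Lambda @ Lambda.T))
    return Lambda, Psi
```

Because $\Lambda\Lambda'$ is then the best rank-$q$ approximation of $S_g$, the diagonal of $S_g - \Lambda\Lambda'$ is non-negative, so the initial $\Psi_g$ is a valid noise matrix.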

4.2 Aitken's Acceleration

Aitken's acceleration, described in McLachlan and Krishnan (1997), was used to estimate the asymptotic maximum of the log-likelihood; this allowed a decision to be made on whether or not the EM algorithm had converged. The Aitken acceleration at iteration $k$ is given by

$$a^{(k)} = \frac{l^{(k+1)} - l^{(k)}}{l^{(k)} - l^{(k-1)}},$$

where $l^{(k+1)}$, $l^{(k)}$ and $l^{(k-1)}$ are the log-likelihood values from iterations $k+1$, $k$ and $k-1$, respectively. The asymptotic estimate of the log-likelihood at iteration $k+1$ is then given by

$$l_\infty^{(k+1)} = l^{(k)} + \frac{1}{1 - a^{(k)}}\left(l^{(k+1)} - l^{(k)}\right).$$

Böhning et al. (1994) contend that the algorithm can be considered to have converged when $|l_\infty^{(k+1)} - l_\infty^{(k)}| < \epsilon$, where $\epsilon$ is a 'small' number. Lindsay (1995) gives an alternative stopping criterion: the algorithm can be stopped when $|l_\infty^{(k)} - l^{(k)}| < \epsilon$. The latter criterion is used for the analyses herein, with $\epsilon = 10^{-2}$.
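A minimal sketch of this stopping rule (ours; it compares the asymptotic estimate with the current log-likelihood, in the spirit of Lindsay's criterion):

```python
def aitken_stop(l_prev2, l_prev, l_curr, eps=1e-2):
    """Aitken-acceleration stopping rule from three successive
    log-likelihoods l(k-1), l(k), l(k+1).

    Returns (converged, asymptotic log-likelihood estimate).
    """
    a = (l_curr - l_prev) / (l_prev - l_prev2)       # a(k)
    l_inf = l_prev + (l_curr - l_prev) / (1.0 - a)   # l_inf(k+1)
    return abs(l_inf - l_curr) < eps, l_inf
```

For a log-likelihood sequence whose increments decay geometrically, the estimate $l_\infty$ recovers the limit exactly, so the rule stops as soon as the remaining improvement is negligible.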

4.3 Model Selection & Performance

The Bayesian information criterion (Schwarz 1978) is proposed for selecting an appropriate PGMM. For a model with parameters $\theta$, the Bayesian information criterion (BIC) is given by

$$\mathrm{BIC} = 2\,l(x, \hat\theta) - m \log n,$$


where $l(x, \hat\theta)$ is the maximized log-likelihood, $\hat\theta$ is the maximum likelihood estimate of $\theta$, $m$ is the number of free parameters in the model and $n$ is the number of observations. The use of the BIC can be motivated through an asymptotic approximation of the log posterior probability of the models (Kass and Raftery 1995). The usual regularity conditions for this asymptotic approximation are not generally satisfied by mixture models. However, Keribin (1998, 2000) shows that the BIC gives consistent estimates of the number of components of a mixture model. In addition, Fraley and Raftery (1998, 2002) provide practical evidence that the BIC performs well as a model selection criterion for mixture models.

We propose using the BIC to choose the model type (CCC, CCU, ..., UUU), the number of latent factors $q$ and the number of components $G$ for the parsimonious Gaussian mixture models. In addition, the BIC can be used to choose between the PGMMs and model-based clustering as implemented by mclust. Note that there are alternatives to the BIC for mixture model selection; for example, the integrated completed likelihood (ICL) (Biernacki et al. 2000) penalises the BIC by subtracting the estimated entropy of group membership for each observation. Although not described in the applications herein, the use of the ICL as an alternative to the BIC often gave comparable clustering performance.

The performance of the PGMMs in revealing group structure in data is demonstrated using a cross tabulation of the maximum a posteriori (MAP) classification of the observations with the true group membership. From the cross tabulations, it is clear that the PGMMs have excellent clustering performance in the applications considered (Sections 5 and 6). In cases where the results are less clear, the clustering performance of the various methods can be quantified using the Rand index (Rand 1971) or the adjusted Rand index (Hubert and Arabie 1985).
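Both quantities are straightforward to compute; the sketch below (our own, illustrative) gives the BIC as defined above and the adjusted Rand index from the contingency table of two partitions:

```python
import numpy as np
from collections import Counter

def bic(loglik, m, n):
    """BIC = 2 l(x, theta_hat) - m log n; larger is better here."""
    return 2.0 * loglik - m * np.log(n)

def adjusted_rand(a, b):
    """Adjusted Rand index (Hubert and Arabie 1985) of two labelings."""
    comb2 = lambda x: x * (x - 1) / 2.0
    n = len(a)
    cells = Counter(zip(a, b))   # contingency table of the two partitions
    rows, cols = Counter(a), Counter(b)
    sum_ij = sum(comb2(v) for v in cells.values())
    sum_i = sum(comb2(v) for v in rows.values())
    sum_j = sum(comb2(v) for v in cols.values())
    expected = sum_i * sum_j / comb2(n)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)
```

The adjusted index equals 1 for identical partitions (up to relabelling) and has expected value 0 under random labelling, which is why it is preferred when the number of clusters differs between methods.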


5 Application: Italian Wines

5.1 The Data

Forina et al. (1986) reported twenty-eight chemical and physical properties of three types of wine (Barolo, Grignolino, Barbera) from the Piedmont region of Italy. We report results on the analysis of twenty-seven of the variables from the original study (Table 2); the sulphur measurements from the original data were not available for this analysis. A subset of thirteen variables (Table 2) is commonly available in the UCI Machine Learning data repository and as part of the gclus library (Hurley 2004) for R (R Development Core Team 2004). For completeness, we also analyze the more common thirteen-variable subset of the data.

Table 2: Twenty-seven of the twenty-eight chemical and physical properties of the Italian wines reported by Forina et al. (1986). The variables marked with an asterisk are contained in the UCI Machine Learning data repository version of the data.

  Alcohol*                     Malic acid*             Fixed acidity
  Tartaric acid                Ash*                    Uronic acids
  pH                           Calcium                 Alcalinity of ash*
  Potassium                    Chloride                Magnesium*
  Phosphate                    Nonflavonoid phenols*   Total phenols*
  Flavonoids*                  Hue*                    Proanthocyanins*
  Color intensity*             Glycerol                OD280/OD315 of diluted wines*
  OD280/OD315 of flavonoids    Proline*                2-3-butanediol
  Total nitrogen               Sugar-free extract      Methanol


Forina et al. (1986), who originally analyzed these data, describe the data collection process in some detail. It is worth observing that the wines in this study were collected over the ten-year period 1970–1979 (Table 3), and that the Barbera wines are predominantly from a later period than the Barolo or Grignolino wines.

Table 3: The year of production of the wine samples in the data.

                   Year of production
               70  71  72  73  74  75  76  77  78  79
  Barolo           19      20  20
  Grignolino    9   9   7   9  16   9  12
  Barbera                       9       5      29   5

The parsimonious Gaussian mixture models are demonstrated on the analysis of these Italian wines (Sections 5.2.1 and 5.3.1). Alternative model-based methods of analysis are discussed in Sections 5.2.2, 5.2.3, 5.3.2 and 5.3.3, and the methods are compared in Sections 5.2.4 and 5.3.4.

5.2 Twenty-Seven Variables

5.2.1 Parsimonious Gaussian Mixture Models

All eight PGMMs were fitted to the data for G = 1, 2, ..., 5 and q = 1, 2, ..., 5. The BIC value for each model was computed, and the model with the highest BIC value (−11454.32) was selected; this was the CUU model with G = 3 and q = 4. Figure 1 shows the maximum BIC for the eight parsimonious models for each pair (G, q). While q = 0 is possible, the PGMMs with q = 0 correspond to the mclust model-based clustering models (EII, EEI, VII and VVI), so the case q = 0 is not considered at this point; it is considered in Section 5.2.2.


Figure 1: A heat map representation of the maximum BIC value for each pair (G, q), where the maximum is taken over all eight models.

A cross tabulation of the MAP classifications from the best PGMM versus the true wine type is given in Table 4; this model correctly classifies all but one of the wines.

5.2.2 Model-Based Clustering (mclust)

Model-based clustering was also performed on the wine data using the mclust software (Fraley and Raftery 1999, 2003) for R (R Development Core Team 2004). The model with the highest BIC value (−12119.3) was a three-component mixture with the VVI covariance structure; detailed explanations of the covariance structures are given in Fraley and Raftery (2002). The VVI covariance structure is a diagonal covariance structure which implies elliptical group structures, with variable shape and differently sized ellipses that are aligned with the axes. The MAP classifications were calculated for the three-component VVI model; a


Table 4: A classification table for the PGMM with the highest BIC value on the wine data.

                  Cluster
                1    2    3
  Barolo       59
  Grignolino        70    1
  Barbera                48

cross tabulation of the classifications and the wine types is shown in Table 5.

Table 5: The classification table for the best model found using mclust on the wine data.

                  Cluster
                1    2    3
  Barolo       58    1
  Grignolino    4   66    1
  Barbera                48

While model-based clustering found three groups, it is clear that it did less well than the PGMMs in separating the wines by type.

5.2.3 Variable Selection

Variable selection was carried out using clustvarsel (Dean and Raftery 2006), with the maximum number of groups preset at eight. Nineteen variables were selected; they are given in Table 6. A VVI model with four groups was chosen, and the classifications for this model are given in Table 7.


Table 6: Variables selected for the wine dataset.

  Chloride, Malic acid, Flavonoids, Proline, Colour intensity, Uronic acids, Tartaric acid, Hue, Alcohol, Total nitrogen, OD280/OD315 of diluted wines, Sugar-free extract, Magnesium, 2-3-butanediol, Calcium, Methanol, Phosphate, Alcalinity of ash, Nonflavonoid phenols.

Table 7: Classification table for variable selection on the wine dataset.

                  Cluster
                1    2    3    4
  Barolo       52    7
  Grignolino        17   54
  Barbera            1        47

5.2.4 Model Comparison

A comparison of the clustering results on the wine data indicates that all of the methods are good at discovering the group structure in the wine data. The PGMM and mclust methods both find the correct number of groups, but the PGMM method is more accurate. The Rand and adjusted Rand indices (Table 8) indicate that the PGMMs are better at finding the wine types than mclust or variable selection. Notably, mclust outperformed the variable selection method on these data.

Table 8: Rand and adjusted Rand indices for the models that were applied to the wine data.

  Model               Rand index   Adjusted Rand index
  PGMM                0.99         0.98
  mclust              0.95         0.90
  Variable selection  0.91         0.78


The BIC values of the PGMM and mclust results can be compared; the BIC is higher for the PGMM method. Interestingly, the model with the higher BIC also had better clustering performance; while this is not guaranteed to happen, this phenomenon has been observed previously by Fraley and Raftery (2002).

5.3 Thirteen Variables

5.3.1 Parsimonious Gaussian Mixture Models

All eight PGMMs were fitted to the subset of the wine data for G = 1, 2, ..., 5 and q = 1, 2, ..., 5. The model with the highest BIC (−5294.68) was selected; this was the CUU model with G = 4 and q = 2. Figure 2 shows the maximum BIC for the eight PGMMs for each pair (G, q).

Figure 2: A heat map representation of the maximum BIC value for each pair (G, q), where the maximum is taken over the eight models.


Again, the cases where q = 0 are considered in Section 5.3.2. A cross tabulation of the MAP classifications from the best PGMM versus the true wine type is given in Table 9. The Grignolino wines are spread mostly across clusters 2 and 3.

Table 9: A classification table for the PGMM with the highest BIC value for the subset of the wine data.

                  Cluster
                1    2    3    4
  Barolo       59
  Grignolino        38   31    2
  Barbera                     48

A cross tabulation of the MAP classification results for the PGMM versus wine type and year is given in Table 10. The table shows that Cluster 2 consists primarily of Grignolino wines from 1970–1974 and Cluster 3 consists primarily of Grignolino wines from 1974–1976. Therefore, the mixture model appears to be dividing the Grignolino wines into two groups according to year of production. Incidentally, due to the data collection process, it is not possible to completely determine whether the correspondence between the Barbera wines and Cluster 4 is due to wine type or year of production: the Barbera wines were primarily from 1978 and 1979, whereas the other wine types were from 1970–1976.

5.3.2 Model-Based Clustering (mclust)

Analysis of the subset of the wine data using the mclust software yielded eight groups with a VEI covariance structure and a BIC value of −5469.95. The VEI covariance structure corresponds to equal-shaped clusters which are aligned with the axes but


Table 10: A table of the MAP classifications from the PGMM fitted to the subset of the wine data versus the wine type and year.

                  Barolo            Grignolino               Barbera
                71  73  74    70  71  72  73  74  75  76   74  76  78  79
  Cluster 1     19  20  20
  Cluster 2                    7   8   4   8   8   2   1
  Cluster 3                    2   1   2   1   8   7  10
  Cluster 4                            1               1    9   5  29   5

with variable volume. Classifications for this model are given in Table 11.

Table 11: The classification table for the best model found using mclust on the subset of the wine data.

                  Cluster
                1    2    3    4    5    6    7    8
  Barolo        1                            40   18
  Grignolino   21   22   27    1
  Barbera            4             17   27

5.3.3 Variable Selection

The clustvarsel package, with the maximum number of groups preset at eight, was used to analyze the subset of the wine data. The variables selected were Malic acid, Proline, Flavonoids, Colour intensity and OD280/OD315 of diluted wines. The selected model was a VEV model with three groups, and the classifications for this model are given in Table 12.


Table 12: Classification table for variable selection on the subset of the wine data.

                  Cluster
                1    2    3
  Barolo       51    8
  Grignolino    3   67    1
  Barbera            1   47

5.3.4 Model Comparison

A comparison of the clustering results on the subset of the wine data indicates that all of the methods are very good at discovering the group structure in these data. The PGMM and variable selection methods both find the correct number of groups, but the PGMM is more accurate. The mclust software finds eight groups. The Rand and adjusted Rand indices for all three techniques are given in Table 13.

Table 13: Rand and adjusted Rand indices for the models that were applied to the subset of the wine data.

  Model               Rand index   Adjusted Rand index
  PGMM                0.91         0.79
  MCLUST              0.80         0.48
  Variable selection  0.90         0.78

Comparing the BIC values of the PGMM and mclust results, the BIC is higher for the PGMM. Again, the model with the higher BIC also had better clustering performance.


6 Application: Coffee Data

6.1 The Data

Streuli (1973) reports on the chemical composition of coffee samples collected from around the world. A total of 43 samples were collected from 29 countries, including beans from both the Arabica and Robusta species. The thirteen chemical constituents reported in the study are listed in Table 14; only the first twelve are used in our analysis, because the Total Chlorogenic Acid measurement is the sum of the Chlorogenic, Neochlorogenic and Isochlorogenic Acid values.

Table 14: Twelve of the thirteen chemical constituents of coffee reported by Streuli (1973).

  Water, Bean weight, Extract yield, pH value, Free acid, Mineral content, Fat, Caffeine, Trigonelline, Chlorogenic acid, Neochlorogenic acid, Isochlorogenic acid.

6.2 Parsimonious Gaussian Mixture Models

All eight PGMMs were fitted to the data for G = 1, 2, ..., 5 and q = 1, 2, ..., 5. The model with the highest BIC was the CCU model with G = 2 and q = 4. Figure 3 shows the maximum BIC for the eight parsimonious models for each pair (G, q). A cross tabulation of the classifications from the best PGMM versus the coffee species is given in Table 15; all of the coffee samples are clustered into the two coffee species.


Figure 3: A heat map representation of the maximum BIC value for each pair (G, q), where the maximum is taken over the eight models (CCC, CCU, ..., UUU).

Table 15: A classification table for the PGMM with the highest BIC value for the coffee data.

               Cluster
              1    2
  Arabica    36
  Robusta          7

6.3 Model-Based Clustering (mclust)

The mclust software chose three groups with a VEI covariance structure; this corresponds to equal-shaped clusters which are aligned with the axes but with variable volume. The model divides the Arabica coffees into two groups (Table 16); these groups have no discernible geographical basis.


Table 16: A classification table for the best mclust results for the coffee data.

               Cluster
              1    2    3
  Arabica    22   14
  Robusta               7

6.4 Variable Selection

The clustvarsel package, with the maximum number of groups preset at eight, was used to analyze the coffee data. Nine variables were selected; the only variables not selected were extract yield, free acid and chlorogenic acid. The selected model was an EEV model with five groups, and the classifications for this model are given in Table 17.

Table 17: Classification table for variable selection on the coffee data.

               Cluster
              1    2    3    4    5
  Arabica    10    9    9    8
  Robusta                         7

6.5 Model Comparison

A comparison of the clustering results on the coffee data indicates that the PGMM is the best method at discovering the group structure in these data. The PGMM was the only method to find two groups, corresponding to the two species of coffee. The mclust software found three groups and variable selection found five. The Rand and adjusted Rand indices for all three methods that were applied to the coffee data are given in Table 18.


Table 18: Rand and adjusted Rand indices for the models that were applied to the coffee data.

Model                 Rand Index   Adjusted Rand Index
PGMM                     1.00              1.00
mclust                   0.66              0.38
Variable Selection       0.46              0.16
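The agreement measures used throughout this comparison can be reproduced with scikit-learn; the label vectors below are illustrative encodings of the 36 Arabica and 7 Robusta samples, not the actual fitted classifications.

```python
# The Rand index counts pairwise agreements between two partitions of the
# same objects; the adjusted Rand index corrects that count for chance.
from sklearn.metrics import rand_score, adjusted_rand_score

species = [0] * 36 + [1] * 7   # 36 Arabica, 7 Robusta
perfect = [0] * 36 + [1] * 7   # a clustering that recovers the species exactly

print(rand_score(species, perfect))           # 1.0
print(adjusted_rand_score(species, perfect))  # 1.0

# A partition that splits one species into two scores lower on both indices.
split = [0] * 22 + [2] * 14 + [1] * 7
print(adjusted_rand_score(species, split))
```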

Comparing the BIC values of the PGMM and mclust results, the BIC is higher for the PGMM. Once again, the model with the higher BIC also had better clustering performance.

7 Discussion

A new family of parsimonious Gaussian mixture models has been proposed. These models are closely related to the mixtures of factor analyzers and mixtures of probabilistic principal component analyzers models and include them as special cases. The number of covariance parameters in the PGMM family grows linearly with the data dimension, which is especially important in high-dimensional situations. In model-based clustering as implemented in mclust, the number of parameters in any of the diagonal covariance structures is linear or constant in the data dimension; however, the number of parameters in the non-diagonal covariance structures is quadratic in the data dimension. This means that PGMMs may offer a more flexible modelling structure for high-dimensional data than mclust. The PGMMs work particularly well in high-dimensional settings because of the parsimonious nature of the covariance structure. Additionally, the PGMMs appear to be particularly good at modelling situations where some of the variables are highly correlated. The PGMMs allow for model comparison with other model-based methods such as mclust, so the methods can be used in conjunction with one another.
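The linear-versus-quadratic growth claimed above can be made concrete. The counts below use the standard factor-analysis tally for a component covariance $\Lambda\Lambda' + \Psi$ (the PGMM constraints only reduce it further); they are an illustration, not a reproduction of the paper's own parameter table.

```python
# Covariance-parameter counts illustrating linear-vs-quadratic growth in p.
# A factor-analytic component Sigma = Lambda Lambda' + Psi has p*q - q*(q-1)/2
# free loading parameters (after the rotation constraint) plus p for a
# diagonal Psi, versus p*(p+1)/2 for an unconstrained covariance matrix.

def factor_cov_params(p, q):
    return p * q - q * (q - 1) // 2 + p

def full_cov_params(p):
    return p * (p + 1) // 2

for p in (10, 50, 100):
    print(p, factor_cov_params(p, q=3), full_cov_params(p))
```

For fixed q the factor-analytic count increases by q + 1 per extra dimension, while the unconstrained count increases by p + 1, which is the source of the high-dimensional advantage discussed above.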


Chang (1983) shows that the principal components corresponding to the largest eigenvalues do not necessarily contain the most group information. Directly modelling the data using PGMMs avoids the problems of reducing the data using principal components analysis before clustering. The application of the PGMMs to the wine and coffee datasets indicates that the models give excellent clustering performance. The clusters found using the models showed greater ability to capture the group structure of the data than the other techniques considered.
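A toy version of Chang's point can be constructed directly (a sketch with synthetic data; the variance scales are chosen purely for illustration): when the groups differ only along a low-variance coordinate, the first principal component tracks the dominant noise direction and carries almost no group information.

```python
# Two groups that differ only in a low-variance coordinate, while a shared
# high-variance coordinate dominates the first principal component.
import numpy as np

rng = np.random.default_rng(1)
n = 200
labels = np.repeat([0, 1], n)
noise = rng.normal(0.0, 10.0, size=2 * n)                  # no group info
signal = rng.normal(0.0, 0.5, size=2 * n) + 2.0 * labels   # group info
X = np.column_stack([noise, signal])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are the PCs
pc1, pc2 = Xc @ Vt[0], Xc @ Vt[1]

# Correlation of each component's scores with the group labels:
print(abs(np.corrcoef(pc1, labels)[0, 1]))  # near 0: PC1 misses the groups
print(abs(np.corrcoef(pc2, labels)[0, 1]))  # large: group info sits in PC2
```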

A Calculations for AECM Algorithm

We adopt the following notation: $\hat{S} = \sum_{g=1}^{G} \hat{\pi}_g S_g$ and $\hat{\Theta} = I_q - \hat{\beta}\hat{\Lambda} + \hat{\beta}\hat{S}\hat{\beta}'$.

A.1 Model CCC

For Model CCC we have $\Lambda_g = \Lambda$ and $\Psi_g = \Psi = \psi I_p$. Therefore, $\hat{\beta} = \hat{\Lambda}'(\hat{\Lambda}\hat{\Lambda}' + \hat{\psi} I_p)^{-1}$ and it follows that $\hat{S}$ takes the place of $S_g$ and so $\Theta_g$ is replaced by $\hat{\Theta}$. Therefore, $Q$ can be written as
$$Q(\Lambda, \psi) = C + \frac{n}{2}\left[ p \log\psi^{-1} - \psi^{-1}\operatorname{tr}\{\hat{S}\} + 2\psi^{-1}\operatorname{tr}\{\Lambda\hat{\beta}\hat{S}\} - \psi^{-1}\operatorname{tr}\{\Lambda'\Lambda\hat{\Theta}\} \right].$$
Differentiating $Q$ with respect to $\Lambda$ and $\psi^{-1}$, respectively, gives the score functions
$$S_1(\Lambda, \psi) = \frac{\partial Q(\Lambda, \psi)}{\partial \Lambda} = \psi^{-1} n \left( \hat{S}\hat{\beta}' - \Lambda\hat{\Theta} \right),$$
$$S_2(\Lambda, \psi) = \frac{\partial Q(\Lambda, \psi)}{\partial \psi^{-1}} = \frac{n}{2}\left[ p\psi - \operatorname{tr}\{\hat{S}\} + 2\operatorname{tr}\{\Lambda\hat{\beta}\hat{S}\} - \operatorname{tr}\{\Lambda'\Lambda\hat{\Theta}\} \right].$$
Solving $S_1(\hat{\Lambda}^{\mathrm{new}}, \hat{\psi}) = 0$ for $\hat{\Lambda}^{\mathrm{new}}$ gives
$$\hat{\Lambda}^{\mathrm{new}} = \hat{S}\hat{\beta}'\hat{\Theta}^{-1},$$
and solving $S_2(\hat{\Lambda}^{\mathrm{new}}, \hat{\psi}^{\mathrm{new}}) = 0$ for $\hat{\psi}^{\mathrm{new}}$ we obtain
$$\hat{\psi}^{\mathrm{new}} = \frac{1}{p}\operatorname{tr}\left\{ \hat{S} - \hat{\Lambda}^{\mathrm{new}}\hat{\beta}\hat{S} \right\}.$$
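As a numerical sanity check of these Model CCC updates (a NumPy sketch: the symmetric matrix `S` below is randomly generated and merely plays the role of $\hat{S}$, and `Lam`, `psi` play the role of the current estimates), the closed-form update should zero the score $S_1$:

```python
# Verify that Lambda_new = S beta' Theta^{-1} zeroes the score
# S1 = psi^{-1} n (S beta' - Lambda_new Theta) for Model CCC.
import numpy as np

rng = np.random.default_rng(2)
p, q, n, psi = 5, 2, 100, 0.7

A = rng.normal(size=(p, p))
S = A @ A.T / p                   # positive-definite stand-in for S-hat
Lam = rng.normal(size=(p, q))     # current loading matrix

beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + psi * np.eye(p))
Theta = np.eye(q) - beta @ Lam + beta @ S @ beta.T

Lam_new = S @ beta.T @ np.linalg.inv(Theta)
psi_new = np.trace(S - Lam_new @ beta @ S) / p

score = (n / psi) * (S @ beta.T - Lam_new @ Theta)
print(np.allclose(score, 0.0), psi_new > 0)
```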

A.2 Model CUU

For Model CUU we have $\Lambda_g = \Lambda$. Therefore, $\hat{\beta}_g = \hat{\Lambda}'(\hat{\Lambda}\hat{\Lambda}' + \hat{\Psi}_g)^{-1}$ and $\hat{\Theta}_g = I_q - \hat{\beta}_g\hat{\Lambda} + \hat{\beta}_g S_g \hat{\beta}_g'$, so that $Q$ can be written as
$$Q(\Lambda, \Psi_g) = C + \sum_{g=1}^{G} n_g \left[ \frac{1}{2}\log|\Psi_g^{-1}| - \frac{1}{2}\operatorname{tr}\{\Psi_g^{-1} S_g\} + \operatorname{tr}\{\Psi_g^{-1}\Lambda\hat{\beta}_g S_g\} - \frac{1}{2}\operatorname{tr}\{\Lambda'\Psi_g^{-1}\Lambda\hat{\Theta}_g\} \right].$$
Differentiating $Q$ with respect to $\Lambda$ and $\Psi_g^{-1}$, respectively, gives the score functions
$$S_1(\Lambda, \Psi_g) = \frac{\partial Q(\Lambda, \Psi_g)}{\partial \Lambda} = \sum_{g=1}^{G} n_g \left( \Psi_g^{-1} S_g \hat{\beta}_g' - \Psi_g^{-1}\Lambda\hat{\Theta}_g \right),$$
$$S_2(\Lambda, \Psi_g) = \frac{\partial Q(\Lambda, \Psi_g)}{\partial \Psi_g^{-1}} = n_g\left[ \frac{1}{2}\Psi_g - \frac{1}{2}S_g + \Lambda\hat{\beta}_g S_g - \frac{1}{2}\Lambda\hat{\Theta}_g\Lambda' \right].$$
Now, solving $S_1(\hat{\Lambda}^{\mathrm{new}}, \hat{\Psi}_g) = 0$ gives
$$\sum_{g=1}^{G} n_g \hat{\Psi}_g^{-1} \hat{\Lambda}^{\mathrm{new}} \hat{\Theta}_g = \sum_{g=1}^{G} n_g \hat{\Psi}_g^{-1} S_g \hat{\beta}_g', \qquad (4)$$
which must be solved for $\hat{\Lambda}^{\mathrm{new}}$ in a row-by-row manner. Write
$$\lambda_i = \left( \lambda_{i1}, \lambda_{i2}, \ldots, \lambda_{iq} \right)$$
to represent the $i$th row of the matrix $\Lambda$ and let $r_i$ represent the $i$th row of the matrix on the right-hand side of (4). Now, row $i$ of (4) can be written
$$\hat{\lambda}_i^{\mathrm{new}} \sum_{g=1}^{G} \frac{n_g}{\hat{\psi}_{g(i)}} \hat{\Theta}_g = r_i,$$
where $\hat{\psi}_{g(i)}$ is the $i$th entry along the diagonal of $\hat{\Psi}_g$. Therefore,
$$\hat{\lambda}_i^{\mathrm{new}} = r_i \left( \sum_{g=1}^{G} \frac{n_g}{\hat{\psi}_{g(i)}} \hat{\Theta}_g \right)^{-1},$$
for $i = 1, 2, \ldots, p$. Unfortunately, the row-by-row nature of this solution slows the fitting of the CUU model to a great degree.
Solving $\operatorname{diag}\{S_2(\hat{\Lambda}^{\mathrm{new}}, \hat{\Psi}_g^{\mathrm{new}})\} = 0$ for $\hat{\Psi}_g^{\mathrm{new}}$ gives
$$\hat{\Psi}_g^{\mathrm{new}} = \operatorname{diag}\left\{ S_g - 2\hat{\Lambda}^{\mathrm{new}}\hat{\beta}_g S_g + \hat{\Lambda}^{\mathrm{new}}\hat{\Theta}_g(\hat{\Lambda}^{\mathrm{new}})' \right\}.$$
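The row-by-row solution of (4) can be checked numerically (a NumPy sketch; the group matrices below are randomly generated stand-ins of the right shapes, with a common loading matrix used to build plausible $\hat{\beta}_g$ and $\hat{\Theta}_g$): assembling $\hat{\Lambda}^{\mathrm{new}}$ one row at a time should reproduce the right-hand side of (4) exactly.

```python
# Build both sides of equation (4) for random inputs, solve each row as in
# the text, and verify that the assembled Lambda_new satisfies (4).
import numpy as np

rng = np.random.default_rng(3)
p, q, G = 6, 2, 3
ng = np.array([40.0, 35.0, 25.0])
Lam = rng.normal(size=(p, q))                # common loading matrix (CUU)

S, beta, Theta, Psi = [], [], [], []
for g in range(G):
    A = rng.normal(size=(p, p))
    Sg = A @ A.T / p                         # group sample covariance (p.d.)
    psig = rng.uniform(0.5, 2.0, size=p)     # diagonal of Psi_g
    bg = Lam.T @ np.linalg.inv(Lam @ Lam.T + np.diag(psig))
    S.append(Sg); Psi.append(psig); beta.append(bg)
    Theta.append(np.eye(q) - bg @ Lam + bg @ Sg @ bg.T)

# Right-hand side of (4): sum_g n_g Psi_g^{-1} S_g beta_g'
R = sum(ng[g] * (S[g] @ beta[g].T) / Psi[g][:, None] for g in range(G))

# Solve row i: lambda_i = r_i [ sum_g (n_g / psi_g(i)) Theta_g ]^{-1}
Lam_new = np.empty((p, q))
for i in range(p):
    M = sum(ng[g] / Psi[g][i] * Theta[g] for g in range(G))
    Lam_new[i] = R[i] @ np.linalg.inv(M)

# Left-hand side of (4) with Lambda_new plugged back in:
L = sum(ng[g] * (Lam_new @ Theta[g]) / Psi[g][:, None] for g in range(G))
print(np.allclose(L, R))
```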

B Estimation Procedure for Eight Covariance Structures

The resulting equations for $\hat{\beta}$, $\hat{\Lambda}$ and $\hat{\Psi}$ for each of the eight parsimonious models are given in the following sections.

B.1 Model CCC

$$\hat{\beta} = \hat{\Lambda}'(\hat{\Lambda}\hat{\Lambda}' + \hat{\psi} I_p)^{-1}, \qquad \hat{\Lambda}^{\mathrm{new}} = \hat{S}\hat{\beta}'\hat{\Theta}^{-1}, \qquad (\hat{\psi})^{\mathrm{new}} = \frac{1}{p}\operatorname{tr}\left\{ \hat{S} - \hat{\Lambda}^{\mathrm{new}}\hat{\beta}\hat{S} \right\}.$$

B.2 Model CCU

$$\hat{\beta} = \hat{\Lambda}'(\hat{\Lambda}\hat{\Lambda}' + \hat{\Psi})^{-1}, \qquad \hat{\Lambda}^{\mathrm{new}} = \hat{S}\hat{\beta}'\hat{\Theta}^{-1}, \qquad \hat{\Psi}^{\mathrm{new}} = \operatorname{diag}\left\{ \hat{S} - \hat{\Lambda}^{\mathrm{new}}\hat{\beta}\hat{S} \right\}.$$

B.3 Model CUC

$$\hat{\beta}_g = \hat{\Lambda}'(\hat{\Lambda}\hat{\Lambda}' + \hat{\psi}_g I_p)^{-1}, \qquad \hat{\Lambda}^{\mathrm{new}} = \left( \sum_{g=1}^{G} \frac{n_g}{\hat{\psi}_g} S_g \hat{\beta}_g' \right)\left( \sum_{g=1}^{G} \frac{n_g}{\hat{\psi}_g} \hat{\Theta}_g \right)^{-1},$$
$$(\hat{\psi}_g)^{\mathrm{new}} = \frac{1}{p}\operatorname{tr}\left\{ S_g - 2\hat{\Lambda}^{\mathrm{new}}\hat{\beta}_g S_g + \hat{\Lambda}^{\mathrm{new}}\hat{\Theta}_g(\hat{\Lambda}^{\mathrm{new}})' \right\}.$$


B.4 Model CUU

In this case the loading matrix must be solved for in a row-by-row manner, which slows the fitting of this model to a great degree.
$$\hat{\beta}_g = \hat{\Lambda}'(\hat{\Lambda}\hat{\Lambda}' + \hat{\Psi}_g)^{-1}, \qquad \hat{\lambda}_i^{\mathrm{new}} = r_i\left( \sum_{g=1}^{G} \frac{n_g}{\hat{\psi}_{g(i)}} \hat{\Theta}_g \right)^{-1},$$
$$\hat{\Psi}_g^{\mathrm{new}} = \operatorname{diag}\left\{ S_g - 2\hat{\Lambda}^{\mathrm{new}}\hat{\beta}_g S_g + \hat{\Lambda}^{\mathrm{new}}\hat{\Theta}_g(\hat{\Lambda}^{\mathrm{new}})' \right\},$$
where $\hat{\lambda}_i^{\mathrm{new}}$ is the $i$th row of the matrix $\hat{\Lambda}^{\mathrm{new}}$, $\hat{\psi}_{g(i)}$ denotes the $i$th element along the diagonal of $\hat{\Psi}_g$, $r_i$ represents the $i$th row of the matrix $\sum_{g=1}^{G} n_g \hat{\Psi}_g^{-1} S_g \hat{\beta}_g'$, and $i = 1, 2, \ldots, p$.

B.5 Model UCC

$$\hat{\beta}_g = \hat{\Lambda}_g'(\hat{\Lambda}_g\hat{\Lambda}_g' + \hat{\psi} I_p)^{-1}, \qquad \hat{\Lambda}_g^{\mathrm{new}} = S_g\hat{\beta}_g'\hat{\Theta}_g^{-1}, \qquad (\hat{\psi})^{\mathrm{new}} = \frac{1}{p}\sum_{g=1}^{G} \hat{\pi}_g\operatorname{tr}\left\{ S_g - \hat{\Lambda}_g^{\mathrm{new}}\hat{\beta}_g S_g \right\}.$$

B.6 Model UCU

$$\hat{\beta}_g = \hat{\Lambda}_g'(\hat{\Lambda}_g\hat{\Lambda}_g' + \hat{\Psi})^{-1}, \qquad \hat{\Lambda}_g^{\mathrm{new}} = S_g\hat{\beta}_g'\hat{\Theta}_g^{-1}, \qquad \hat{\Psi}^{\mathrm{new}} = \sum_{g=1}^{G} \hat{\pi}_g\operatorname{diag}\left\{ S_g - \hat{\Lambda}_g^{\mathrm{new}}\hat{\beta}_g S_g \right\}.$$

B.7 Model UUC

$$\hat{\beta}_g = \hat{\Lambda}_g'(\hat{\Lambda}_g\hat{\Lambda}_g' + \hat{\psi}_g I_p)^{-1}, \qquad \hat{\Lambda}_g^{\mathrm{new}} = S_g\hat{\beta}_g'\hat{\Theta}_g^{-1}, \qquad (\hat{\psi}_g)^{\mathrm{new}} = \frac{1}{p}\operatorname{tr}\left\{ S_g - \hat{\Lambda}_g^{\mathrm{new}}\hat{\beta}_g S_g \right\}.$$

B.8 Model UUU

$$\hat{\beta}_g = \hat{\Lambda}_g'(\hat{\Lambda}_g\hat{\Lambda}_g' + \hat{\Psi}_g)^{-1}, \qquad \hat{\Lambda}_g^{\mathrm{new}} = S_g\hat{\beta}_g'\hat{\Theta}_g^{-1}, \qquad \hat{\Psi}_g^{\mathrm{new}} = \operatorname{diag}\left\{ S_g - \hat{\Lambda}_g^{\mathrm{new}}\hat{\beta}_g S_g \right\}.$$
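Each $\hat{\beta}_g$ expression in this appendix inverts a $p \times p$ matrix. A standard matrix identity not stated in the paper (the Woodbury identity, covered in references such as Lütkepohl 1996) gives the same result from a $q \times q$ inverse, which is useful when $p$ is large; the sketch below verifies the equivalence numerically on random inputs.

```python
# Lambda'(Lambda Lambda' + Psi)^{-1}
#   = (I_q + Lambda' Psi^{-1} Lambda)^{-1} Lambda' Psi^{-1}   (Woodbury form)
import numpy as np

rng = np.random.default_rng(4)
p, q = 50, 3
Lam = rng.normal(size=(p, q))
psi = rng.uniform(0.5, 2.0, size=p)       # diagonal of Psi_g

direct = Lam.T @ np.linalg.inv(Lam @ Lam.T + np.diag(psi))   # p x p inverse
LtPi = Lam.T / psi                                           # Lambda' Psi^{-1}
woodbury = np.linalg.inv(np.eye(q) + LtPi @ Lam) @ LtPi      # q x q inverse

print(np.allclose(direct, woodbury))
```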

References

Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.


Bartholomew, D. J. and M. Knott (1999). Latent Variable Models and Factor Analysis (2nd ed.). London: Edward Arnold.

Biernacki, C., G. Celeux, and G. Govaert (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis & Machine Intelligence 22(7), 719–725.

Böhning, D., E. Dietz, R. Schaub, P. Schlattmann, and B. Lindsay (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics 46, 373–388.

Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.

Chang, W.-C. (1983). On using principal components before separating a mixture of two multivariate normal distributions. J. Roy. Statist. Soc. Ser. C 32(3), 267–275.

Dean, N. and A. E. Raftery (2006). The clustvarsel Package. R package version 0.2-4.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39(1), 1–38. With discussion.

Forina, M., C. Armanino, M. Castino, and M. Ubigli (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.

Fraley, C. and A. E. Raftery (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41(8), 578–588.

Fraley, C. and A. E. Raftery (1999). Mclust: Software for model-based clustering. Journal of Classification 16, 297–306.

Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97(458), 611–631.


Fraley, C. and A. E. Raftery (2003). Enhanced model-based clustering, density estimation and discriminant analysis software: MCLUST. Journal of Classification 20(2), 263–296.

Ghahramani, Z. and G. E. Hinton (1997). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto.

Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2, 193–218.

Hurley, C. (2004). Clustering visualizations of multivariate data. Journal of Computational and Graphical Statistics 13(4).

Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.

Keribin, C. (1998). Estimation consistante de l'ordre de modèles de mélange. C. R. Acad. Sci. Paris Sér. I Math. 326(2), 243–248.

Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā Ser. A 62(1), 49–66.

Lindsay, B. (1995). Mixture Models: Theory, Geometry and Applications. Hayward, CA: Institute of Mathematical Statistics.

Lütkepohl, H. (1996). Handbook of Matrices. Chichester: John Wiley & Sons.

McLachlan, G. J. and K. E. Basford (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker Inc.

McLachlan, G. J. and T. Krishnan (1997). The EM Algorithm and Extensions. New York: John Wiley & Sons Inc.

McLachlan, G. J. and D. Peel (2000). Finite Mixture Models. New York: John Wiley & Sons.


McLachlan, G. J., D. Peel, and R. W. Bean (2003). Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics & Data Analysis 41(3–4), 379–388.

Meng, X.-L. and D. van Dyk (1997). The EM algorithm — an old folk song sung to the fast tune (with discussion). Journal of the Royal Statistical Society Series B 59, 511–567.

Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992). Numerical Recipes in C — The Art of Scientific Computing (2nd ed.). Cambridge University Press.

R Development Core Team (2004). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Stat. 6, 461–464.

Spearman, C. (1904). The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101.

Streuli, H. (1973). Der heutige Stand der Kaffeechemie. In Association Scientifique Internationale du Café, 6th International Colloquium on Coffee Chemistry, Bogotá, Colombia, pp. 61–72.

Tipping, M. E. and C. M. Bishop (1999a). Mixtures of probabilistic principal component analysers. Neural Computation 11(2), 443–482.

Tipping, M. E. and C. M. Bishop (1999b). Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B 61(3), 611–622.


Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985). Statistical Analysis of Finite Mixture Distributions. Chichester: John Wiley & Sons.


更多相关标签：