
Randomized Algorithms for Fast Bayesian Hierarchical Clustering

Katherine A. Heller and Zoubin Ghahramani
Gatsby Computational Neuroscience Unit, University College London, London, WC1N 3AR, UK
{heller,zoubin}@gatsby.ucl.ac.uk
October 18, 2005
Abstract

We present two new algorithms for fast Bayesian Hierarchical Clustering on large data sets. Bayesian Hierarchical Clustering (BHC) [1] is a method for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. BHC has several advantages over traditional distance-based agglomerative clustering algorithms. It defines a probabilistic model of the data and uses Bayesian hypothesis testing to decide which merges are advantageous and to output the recommended depth of the tree. Moreover, the algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). While the original BHC algorithm has O(n²) computational complexity, the two new randomized algorithms are O(n log n) and O(n).



Introduction

Hierarchical structures are ubiquitous in the natural world. For example, the evolutionary tree of living organisms (and consequently features of these organisms, such as the sequences of homologous genes) is a natural hierarchy. Hierarchical structures are also a natural representation for data which was not generated by evolutionary processes. For example, internet newsgroups, emails, or documents from a newswire can be organized in increasingly broad topic domains.

Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, the output is a binary tree (dendrogram) whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data. The traditional method for hierarchically clustering data, as given in [2], is a bottom-up agglomerative algorithm. It starts with each data point assigned to its own cluster and iteratively merges the two closest clusters together until all the data belongs to a single cluster. The nearest pair of clusters is chosen based on a given distance measure (e.g. Euclidean distance between cluster means, or distance between nearest points).

There are several limitations to the traditional hierarchical clustering algorithm. The algorithm provides no guide to choosing the "correct" number of clusters or the level at which to prune the tree. It is often difficult to know which distance metric to choose, especially for structured data such as images or sequences. The traditional algorithm does not define a probabilistic model of the data, so it is hard to ask how "good" a clustering is, to compare it to other models, or to make predictions and cluster new data into an existing hierarchy. We use statistical inference to overcome these limitations.
Previous work which uses probabilistic methods to perform hierarchical clustering is discussed in detail in our technical report [1]. Our Bayesian hierarchical clustering algorithm uses marginal likelihoods to decide which clusters to merge and to avoid overfitting. Basically, it asks what the probability is that all the data in a potential merge were generated from the same mixture component, and compares this to exponentially many hypotheses at lower levels of the tree (section 2). The BHC algorithm has O(n²) complexity, where n is the number of data points. In practice, we would like to run the Bayesian hierarchical clustering algorithm on very large data sets.











Figure 1: (a) Schematic of a portion of a tree where T_i and T_j are merged into T_k, and the associated data sets D_i and D_j are merged into D_k. (b) An example tree with 4 data points. The clusterings (1 2 3)(4) and (1 2)(3)(4) are tree-consistent partitions of this data. The clustering (1)(2 3)(4) is not a tree-consistent partition.

Two fast versions of BHC using randomized algorithms, with O(n log n) and O(n) complexity, are presented in section 3.


The BHC Algorithm

Our Bayesian hierarchical clustering algorithm is similar to traditional agglomerative clustering in that it is a one-pass, bottom-up method which initializes each data point in its own cluster and iteratively merges pairs of clusters. As we will see, the main difference is that our algorithm uses a statistical hypothesis test to choose which clusters to merge.

Let D = {x^(1), ..., x^(n)} denote the entire data set, and D_i ⊆ D the set of data points at the leaves of the subtree T_i. The algorithm is initialized with n trivial trees, {T_i : i = 1...n}, each containing a single data point D_i = {x^(i)}. At each stage the algorithm considers merging all pairs of existing trees. For example, if T_i and T_j are merged into some new tree T_k then the associated set of data is D_k = D_i ∪ D_j (see figure 1(a)).

In considering each merge, two hypotheses are compared. The first hypothesis, which we will denote H_1^k, is that all the data in D_k were in fact generated independently and identically from the same probabilistic model, p(x|θ), with unknown parameters θ. Let us imagine that this probabilistic model is a multivariate Gaussian, with parameters θ = (μ, Σ), although it is crucial to emphasize that for different types of data, different probabilistic models may be appropriate. To evaluate the probability of the data under this hypothesis we need to specify some prior over the parameters of the model, p(θ|β), with hyperparameters β. We now have the ingredients to compute the probability of the data D_k under H_1^k:
    p(D_k | H_1^k) = ∫ p(D_k|θ) p(θ|β) dθ = ∫ [ ∏_{x^(i) ∈ D_k} p(x^(i)|θ) ] p(θ|β) dθ        (1)


This calculates the probability that all the data in D_k were generated from the same parameter values, assuming a model of the form p(x|θ). This is a natural model-based criterion for measuring how well the data fit into one cluster. If we choose models with conjugate priors (e.g. Normal-Inverse-Wishart priors for Normal continuous data or Dirichlet priors for Multinomial discrete data) this integral is tractable. Throughout this paper we use such conjugate priors, so the integrals are simple functions of sufficient statistics of D_k. For example, in the case of Gaussians, (1) is a function of the sample mean and covariance of the data in D_k.

The alternative hypothesis to H_1^k would be that the data in D_k has two or more clusters in it. Summing over the exponentially many possible ways of dividing D_k into two or more clusters is intractable. However, if we restrict ourselves to clusterings that partition the data in a manner that is consistent with the subtrees T_i and T_j, we can compute the sum efficiently using recursion. (We elaborate on the notion of tree-consistent partitions in figure 1(b).) The probability of the data under this restricted alternative hypothesis, H_2^k, is simply a product over the subtrees, p(D_k|H_2^k) = p(D_i|T_i) p(D_j|T_j), where the probability of a data set under a tree (e.g. p(D_i|T_i)) is defined below.
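For concreteness, the integral in (1) has a closed form for binary data under independent Bernoulli features with conjugate Beta priors (the kind of model appropriate for the binarized newsgroup data discussed later). The sketch below is ours, not the paper's code; the function name and the shared Beta(a, b) prior across features are assumptions.

```python
import math

def log_marginal(data, a=1.0, b=1.0):
    """log p(D | H1): closed form of the integral in (1) for binary data
    with independent Bernoulli features and conjugate Beta(a, b) priors.
    Each feature contributes a ratio of Beta functions,
    B(a + k, b + n - k) / B(a, b), where k counts the 1s in that feature."""
    n = len(data)
    logp = 0.0
    for d in range(len(data[0])):
        k = sum(row[d] for row in data)
        logp += (math.lgamma(a + k) + math.lgamma(b + n - k)
                 - math.lgamma(a + b + n)
                 - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))
    return logp
```

With a uniform Beta(1, 1) prior and a single feature this reduces to k!(n−k)!/(n+1)!, so, for example, three one-bit points with values 1, 1, 0 have marginal probability 1/12.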


Combining the probability of the data under hypotheses H_1^k and H_2^k, weighted by the prior that all points in D_k belong to one cluster, π_k ≝ p(H_1^k), we obtain the marginal probability of the data in tree T_k:

    p(D_k | T_k) = π_k p(D_k | H_1^k) + (1 − π_k) p(D_i | T_i) p(D_j | T_j)        (2)


This equation is defined recursively, where the first term considers the hypothesis that there is a single cluster in D_k and the second term efficiently sums over all other clusterings of the data in D_k which are consistent with the tree structure (see figure 1(a)). In our tech report [1] we show that equation (2) can be used to derive an approximation to the marginal likelihood of a Dirichlet process mixture model, and in fact provides a new lower bound on this marginal likelihood. We also show that the prior for the merged hypothesis, π_k, can be computed bottom-up in a DPM. The posterior probability of the merged hypothesis, r_k ≝ p(H_1^k | D_k), is obtained using Bayes rule:

    r_k = π_k p(D_k | H_1^k) / [ π_k p(D_k | H_1^k) + (1 − π_k) p(D_i | T_i) p(D_j | T_j) ]        (3)


This quantity is used to decide greedily which two trees to merge, and is also used to determine which merges in the final hierarchy structure were justified. The general algorithm is very simple (see figure 2).

input: data D = {x^(1) ... x^(n)}, model p(x|θ), prior p(θ|β)
initialize: number of clusters c = n, and D_i = {x^(i)} for i = 1...n
while c > 1 do
    Find the pair D_i and D_j with the highest probability of the merged hypothesis, r_k (eq. 3)
    Merge D_k ← D_i ∪ D_j, T_k ← (T_i, T_j)
    Delete D_i and D_j, c ← c − 1
end while
output: Bayesian mixture model where each tree node is a mixture component.
        The tree can be cut at points where r_k < 0.5.

Figure 2: Bayesian Hierarchical Clustering (BHC) Algorithm

Our Bayesian hierarchical clustering algorithm has many desirable properties which are absent in traditional hierarchical clustering. For example, it allows us to define predictive distributions for new data points, it decides which merges are advantageous and suggests natural places to cut the tree using a statistical model comparison criterion (via r_k), and it can be customized to different kinds of data by choosing appropriate models for the mixture components.

For any tree, the probability of a new test point given the data can be computed by recursing through the tree starting at the root node. Each node k represents a cluster, with an associated predictive distribution p(x|D_k) = ∫ p(x|θ) p(θ|D_k, β) dθ. The overall predictive distribution sums over all nodes weighted by their posterior probabilities:

    p(x|D) = Σ_{k∈N} ω_k p(x|D_k)        (4)

where N is the set of all nodes in the tree, ω_k = r_k ∏_{i∈N_k} (1 − r_i) is the weight on cluster k, and N_k is the set of nodes on the path from the root node to the parent of node k. For Gaussian components with conjugate priors, for example, this results in a predictive distribution which is a mixture of multivariate t distributions. We give a more detailed explanation and show some examples of this in our tech report [1]. In summary, p(D|T) sums the probabilities for all tree-consistent partitions, weighted by the prior mass assigned to each partition by the DPM. The computational complexity of constructing the tree is O(n²), the complexity of computing the marginal likelihood is O(n), and the complexity of computing the predictive distribution is O(n).
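The greedy loop of figure 2 is short enough to sketch directly. The following is our illustrative Python rendering, not the authors' implementation: it takes any log-marginal-likelihood function for a cluster, and, as a simplification, treats the merge prior π_k as a fixed constant rather than computing it bottom-up from the DPM as the paper does. Equations (2) and (3) are evaluated in log space for numerical stability.

```python
import math
from itertools import combinations

def bhc(data, log_marginal, pi=0.5):
    """Greedy BHC loop of figure 2. `log_marginal(points)` returns
    log p(D|H1) for a list of points. `pi` is a fixed stand-in for the
    merge prior pi_k (the paper computes pi_k bottom-up from the DPM).
    Each tree is stored as (points, log p(D|T), nested-tuple structure)."""
    trees = {i: ([x], log_marginal([x]), i) for i, x in enumerate(data)}
    next_id = len(data)
    while len(trees) > 1:
        best = None
        for i, j in combinations(trees, 2):
            di, lti, si = trees[i]
            dj, ltj, sj = trees[j]
            dk = di + dj
            lh1 = math.log(pi) + log_marginal(dk)      # merged hypothesis H1
            lh2 = math.log(1 - pi) + lti + ltj         # tree-consistent alternatives H2
            # log of eq (2): log(exp(lh1) + exp(lh2)), computed stably
            lpk = max(lh1, lh2) + math.log1p(math.exp(-abs(lh1 - lh2)))
            rk = math.exp(lh1 - lpk)                   # posterior merge probability, eq (3)
            if best is None or rk > best[0]:
                best = (rk, i, j, dk, lpk, (si, sj))
        rk, i, j, dk, lpk, struct = best
        del trees[i], trees[j]                         # merge the winning pair
        trees[next_id] = (dk, lpk, struct)
        next_id += 1
    return trees[next_id - 1]
```

Run with a uniform Beta-Bernoulli marginal on the toy data [[1], [1], [0], [0]], the greedy merges pair the identical points first, yielding the tree structure ((0, 1), (2, 3)).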




Randomized BHC Algorithms

Our goal is to run Bayesian Hierarchical Clustering on very large datasets, but to accomplish this we need a BHC algorithm which has very small computational complexity. The BHC algorithm as given in section 2 is O(n²), and computation is dominated by pairwise comparisons of data points at the lowest levels. This seems wasteful and limits its use for large data sets. We aim to capitalize on this inefficiency, combined with the powerful resource of randomized algorithms [3], to create a faster BHC algorithm. We propose the following randomized algorithm for fast Bayesian Hierarchical Clustering (RBHC):

input: data D = {x^(1) ... x^(n)}
pick m ≪ n points randomly from D, giving D^[m] ⊂ D
run BHC on D^[m], obtaining a tree T
filter D\D^[m] through the top level of tree T, obtaining D_L and D_R where D = D_L ∪ D_R
recurse: run RBHC(D_L) and RBHC(D_R)
output: Bayesian mixture model where each tree node is a mixture component

Figure 3: Randomized Bayesian Hierarchical Clustering (RBHC) Algorithm

The algorithm takes in a data set D and randomly selects a subset D^[m] of m data points from the data set. The original BHC algorithm is run on that subset of m data points, obtaining a tree T. The remaining (n − m) data points (D\D^[m]) are then filtered through the top level (last merge) of tree T. The filter algorithm (figure 4) takes in the top level partitioning of tree T (D_L^[m] and D_R^[m]), along with the priors (π_L and π_R) computed in the BHC algorithm, and all remaining data points. It then takes each remaining data point x^(i) and computes the probabilities that x^(i) belongs to the left subtree and to the right subtree (i.e. to cluster D_L^[m] or D_R^[m]). The data point is then added to the highest probability cluster (subtree). The assignments of all n data points to the left and right subtrees are returned to the RBHC algorithm, which then runs itself separately on each subtree. The constant m may be reduced as the data set becomes smaller.
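In code, this filter pass is a single loop over the held-out points. The sketch below is ours: the predictive term p(x|D) is passed in as a function (`log_pred`), since its form depends on the chosen probabilistic model, and the names are assumptions.

```python
import math

def filter_points(d_left, d_right, pi_left, pi_right, rest, log_pred):
    """Route each point in `rest` to whichever top-level cluster gives it
    higher prior-weighted predictive probability (the filter algorithm).
    `log_pred(x, cluster)` returns log p(x | D_cluster). Scores are always
    computed against the original subsample clusters, so the pass is
    order-independent."""
    left, right = list(d_left), list(d_right)
    for x in rest:
        if (math.log(pi_left) + log_pred(x, d_left)
                > math.log(pi_right) + log_pred(x, d_right)):
            left.append(x)
        else:
            right.append(x)
    return left, right
```

With a Beta(1, 1)-Bernoulli predictive, for instance, points matching the left subsample cluster are routed left and the rest right.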

input: D_L^[m], D_R^[m], π_L, π_R, D\D^[m]
initialize: D_L = D_L^[m], D_R = D_R^[m]
foreach x^(i) ∈ D\D^[m]
    compute p(x^(i)|D_L^[m]) and p(x^(i)|D_R^[m])
    if π_L p(x^(i)|D_L^[m]) > π_R p(x^(i)|D_R^[m])
        then D_L ← D_L ∪ {x^(i)}
        else D_R ← D_R ∪ {x^(i)}
output: D_L, D_R

Figure 4: Filter Algorithm

The RBHC algorithm rests on two assumptions. The first assumption is that the top level clustering built from a subset of m data points (D^[m]) will be a good approximation to the top level clustering of the full data set D. This means that the assignments of data points into D_L and D_R will be similar for the subsample-and-filter based RBHC as compared to running the full BHC algorithm. The second assumption is that the BHC algorithm tends to produce roughly balanced trees with (αn, (1 − α)n) points in each of the top level clusters (i.e. O(n) points per branch rather than O(1) points). This is necessary for RBHC to maintain its smaller running time.

Proposition 1 The RBHC algorithm is O(n log n)
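Before the formal argument, the recurrence behind this proposition can be checked numerically. The toy operation counter below is ours, with an assumed subsample size m = 32 and a fallback to plain O(n²) BHC once a branch holds at most m points.

```python
def rbhc_ops(n, m=32):
    """Operation count for RBHC under balanced splits (alpha = 1/2):
    m^2 for BHC on the subsample, n for the filter pass, then two
    half-size recursive calls; small branches run plain BHC (n^2)."""
    if n <= m:
        return n * n
    return m * m + n + 2 * rbhc_ops(n // 2, m)
```

Doubling n then roughly doubles the count (the ratio for n = 2048 → 4096 is about 2.04), in contrast to the fourfold growth of the quadratic full-BHC cost.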


The number of operations required to run RBHC can be expressed recursively as:

    Ops(RBHC(n)) = m² + n + Ops(RBHC(αn)) + Ops(RBHC((1 − α)n))

Here the m² term comes from running BHC on m data points and the n term comes from the Filter algorithm. Expanding this expression out to L levels of recursion and letting α = 1/2 we get:

    Ops(RBHC(n)) = m² + n + 2(m² + n/2) + 4(m² + n/4) + ... + 2^L (m² + n/2^L)

This gives us log n terms of O(n). We can generalize so that α takes on other values, and this yields the same result, merely adjusting the base of the log. In practice, when a level where m is comparable to n/2^L is reached, the algorithm can simply call BHC on the n/2^L points. Since we are generally interested in the top few levels of the tree, it may make sense to truncate the algorithm after the first several, say eight or so, levels, and avoid running down to the levels where individual points are in their own clusters. This truncated algorithm is O(nL).

We also propose the following alternative randomized algorithm based on EM. This algorithm takes advantage of the fact that BHC can be run on clusters of points rather than individual points (i.e. it can cluster clusters):

input: data D = {x^(1) ... x^(n)}
subsample m points randomly from D, giving D^[m] ⊂ D
foreach point x^(i) ∈ D^[m] create a cluster D_i = {x^(i)}
filter D\D^[m] into these m clusters
refine the clusters by running k steps of hard EM:
    for each point in D, reassign to the most probable cluster
run BHC on the m clusters output by EM
output: Bayesian mixture model where each tree node is a mixture component

Figure 5: A randomized algorithm using EM (EMBHC)

Proposition 2 The EMBHC algorithm is O(n)

The number of operations for this alternate algorithm is nm − m² + knm + m² = O(knm), where the nm − m² term comes from the filtering step, which filters n − m points into m clusters, the knm term comes from running EM, and the m² term from the BHC step. So for small k and m this algorithm is linear in n.
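A rough rendering of the EMBHC front end (everything in figure 5 before the final BHC call on the m clusters) might look as follows. This is our sketch, not the authors' code: it hard-assigns points with a Beta(1, 1)-Bernoulli predictive per binary feature, and the function names, default m and k, and tie-breaking are all assumptions.

```python
import math
import random

def embhc_clusters(data, m=4, k=3, seed=0):
    """EMBHC front end (figure 5): seed m singleton clusters from a
    random subsample, filter the remaining points in, then refine with
    k rounds of hard EM. The paper then runs BHC on these m clusters."""
    def log_pred(x, cluster):
        # log p(x | cluster) under independent Beta(1,1)-Bernoulli features
        n = len(cluster)
        lp = 0.0
        for d, xd in enumerate(x):
            q = (sum(c[d] for c in cluster) + 1) / (n + 2)
            lp += math.log(q if xd else 1 - q)
        return lp

    rng = random.Random(seed)
    idx = set(rng.sample(range(len(data)), m))
    clusters = [[data[i]] for i in idx]
    # filter step: n - m assignments to the most probable cluster
    for i, x in enumerate(data):
        if i not in idx:
            clusters[max(range(m), key=lambda c: log_pred(x, clusters[c]))].append(x)
    # hard EM: k rounds of reassigning every point, then rebuilding clusters
    for _ in range(k):
        assign = [max(range(m), key=lambda c: log_pred(x, clusters[c])) for x in data]
        clusters = [[x for x, a in zip(data, assign) if a == c] for c in range(m)]
    return clusters
```

On well-separated binary data the hard-EM rounds quickly settle into pure clusters, which the final BHC pass then arranges into a tree.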



Results

We have compared Bayesian Hierarchical Clustering to traditional hierarchical clustering methods. Extensive results are available in our tech report, showing significant improvements in tree structure, clustering, and classification ability [1]. We reproduce two small examples comparing BHC and average linkage hierarchical clustering (ALHC) below.

Using the CEDAR Buffalo digits data set (8x8 images of digits, 64 attributes), we ran the algorithms on 20 examples each of 3 digits (0, 2, 4) (figure 6). If we compare the structure of the trees generated by average linkage hierarchical clustering versus Bayesian hierarchical clustering, we can see that the BHC tree structure is much more desirable, particularly at the highest levels of the tree, where it immediately finds 3 distinct clusters (one for each of the three digits), unlike ALHC.

We also compared the BHC and ALHC algorithms on 200 examples each of 4 newsgroups (rec.sport.baseball, rec.sport.hockey, rec.autos, and sci.space) from the CMU 20newsgroups dataset. Our dataset was constructed using Rainbow [4], where a stop list was used and words appearing fewer than 5 times were ignored. The dataset was then binarized based on word presence/absence in a document.


[Figure 6 dendrogram plots omitted: panel (a) is "Average Linkage Hierarchical Clustering", with inter-cluster distances on the y-axis; panel (b) is the BHC tree, with negative posterior log probabilities labeling some merges.]
Figure 6: (a) The tree given by performing average linkage hierarchical clustering on 120 examples of 3 digits (0,2, and 4). Numbers on the x-axis correspond to the true class label of each example, while the y-axis corresponds to distance between merged clusters. (b) The tree given by performing Bayesian hierarchical clustering on the same dataset. Here higher values on the y-axis also correspond to higher levels of the hierarchy (later merges), though the exact numbers are irrelevant. Red dashed lines correspond to merges with negative posterior log probabilities and are labeled with these values.
[Figure 7 tree diagrams omitted. Node labels included: All Data; 446 Car Space NASA; 284 Car Dealer Drive; 162 Space NASA Orbit; 354 Game Team Play; 149 NHL Hockey Round; 205 Baseball Pitch Hit; Quebec Jet Boston; Pitcher Boston Ball; Car Baseball Engine; Car Player Space; Vehicle Dealer Driver; Team Game Hockey.]
Figure 7: Top level structure of BHC (left) vs. Average Linkage HC, for the newsgroup dataset. The 3 words shown at each node have the highest mutual information between the cluster of documents at that node versus its sibling, and occur with higher frequency in that cluster. The number of documents at each cluster is also given.

Figure 7 compares the top three levels (last three merges) of the newsgroups hierarchy (using all 800 examples and the 50 word attributes with highest information gain) from BHC and ALHC. Continuing to look at lower levels does not improve ALHC. Other traditional hierarchical clustering algorithms perform similarly to ALHC. Full dendrograms for this dataset, plus many more results, are available in our tech report [1]. We are currently working on implementations of RBHC and EMBHC so as to apply the algorithms to much larger datasets.

References

[1] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. Technical Report 2005-002, Gatsby Computational Neuroscience Unit, 2005.
[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
[3] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[4] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.

