9512.net
甜梦文库
当前位置:首页 >> >>

Agglomerative multivariate information bottleneck


Multivariate Information Bottleneck

Nir Friedman Ori Mosenzon Noam Slonim Naftali Tishby School of Computer Science & Engineering, Hebrew University, Jerusalem 91904, Israel nir, mosenzon, noamm, tishby @cs.huji.ac.il

Abstract
The Information bottleneck method is an unsupervised non-parametric data organization technique. Given a joint distribution , this method constructs a new variable that extracts partitions, or clusters, over the values of that are informative about . The information bottleneck has already been applied to document classi?cation, gene expression, neural code, and spectral analysis. In this paper, we introduce a general principled framework for multivariate extensions of the information bottleneck method. This allows us to consider multiple systems of data partitions that are inter-related. Our approach utilizes Bayesian networks for specifying the systems of clusters and what information each captures. We show that this construction provides insight about bottleneck variations and enables us to characterize solutions of these variations. We also present a general framework for iterative algorithms for constructing solutions, and apply it to several examples.

as,

1 Introduction
Clustering, or data partitioning, is a common data analysis paradigm. A central question is understanding general underlying principles for clustering. One information theoretic approach to clustering is to require that clusters should capture only the “relevant” information in the data, where the relevance is explicitly determined by various components of the data itself. A common data type which calls for such a principle is co-occurrence data, such as verbs and direct objects in sentences [7], words and documents [1, 4, 11], tissues and gene expression patterns [14], galaxies and spectral components [10], etc. In most such cases the objects are discrete or categoric and no obvious “correct” measure of similarity exists between them. Thus, we would like to rely purely on the joint statistics of the co-occurrences and organize the data such that the “relevant information” among the variables is captured in the best possible way. Formally, we can quantify the relevance of one variable, , with respect to another one, , in terms of the mutual information, . This well known quantity, de?ned

is symmetric, non-negative, and equals to zero if and only if the variables are independent. It measures how many bits are needed on the average to convey the information has about (or vice versa). The aim of information theoretic clustering is to ?nd (soft) partitions of ’s values that are informative about . This requires balancing two goals: we want to lose irrelevant distinctions made by , and at the same time maintain relevant ones. A possible principle for extracting such partitions is the information bottleneck (IB) method [13]. Clustering is posed as a construction of a new variable that represents partitions of . The principle is described by a variational tradeoff between the , and the one we information we try to minimize, try to maximize . We brie?y review this principle and its consequences in the next section. The main contribution of this paper is a general formulation of a multivariate extension of the information bottleneck principle. This extension allows us to consider cases where the clustering is relevant with respect to several variables, or where we construct several systems of clusters at the same time. To give concrete motivation, we brie?y mention two examples that we treat in detail in later sections. In symmetric clustering (also called two-sided or double clustering) we want to ?nd two systems of clusters: one of and one of that are informative about each other. A possible application is relating documents to words, where we seek clustering of documents according to word usage, and a corresponding clustering of words. This procedure aims to ?nd document clusters that correspond to different topics and at the same time identify cluster of words that characterize these topics [11]. Clearly, the two systems of clusters are in interaction, and we want a unifying principle that shows how to construct them simultaneously. In parallel clustering we attempt to build several systems of clusters of the values of . Our aim here is to capture independent aspects of the information conveys about . A biological example is the analysis of gene expression data, where multiple independent distinctions about tissues (healthy vs. tumor, epithelial vs. muscle, etc.) are relevant for the expression of genes. We present such tasks, and others, in our framework by

?

?

?

 ? ? ? 20?¤@9)¤? 0§ ? 7 3 0§ ? ? % "  ? ? 2?)¤? 865421)¤(' &$ #!? ? ?   ? B¨??A ? ?  ? ? ¨C ?  ? ? ?

?

?   ?§ ? ¨??¤?

?

?

?

 ? ? ¨?

?

where is the familiar KL divergence [3]. This equation must be satis?ed self consistently. The practical solution of these equations can be done by repeated iterations of the self-consistent equations, for every given value of , similar to clustering by deterministic annealing [8]. The convergence of these iterations to a (generally local) optimum was proven in [13] as well.

3 Bayesian Networks and Multi-Information
A Bayesian network structure over a set of random variables is a DAG in which vertices are annotated by names of random variables. For each variable , we denote by the (potentially empty) set of parents of in . We say that a distribution is consistent with , if can be factored in the form:

2 The Information Bottleneck
We start with some notation. We use capital letters, such , for random variable names and lowercase as letters to denote speci?c values taken by those variables. Sets of variables are denoted by boldface cap, and assignments of values to the variital letters ables in these sets are denoted by boldface lowercase let. The statement is used as a shorthand for ters . Tishby et al. [13] considered two variables, and , with their (assumed given) joint distribution . Here is the variable we try to compress, with respect to the “relevant” variable . Namely, we seek a (soft) partition of through an auxiliary variable and the probabilistic mapping , such that the the mutual information is minimized (maximum compression) while the relevant information is maximized. The dependency relations between the 3 variables can be described by the relations: independent of given ; and on the other hand we want to predict from . By introducing a positive Lagrange multiplier , Tishby et al. formulate this tradeoff by minimizing the following Lagrangian,

and use the notation to denote that. One of the main issues that we will deal with is the amount of information that variables contain about each other. A quantity that captures this is the multiinformation given by

The multi-information captures how close is the distribution to the factored distribution of the marginals. This is a natural generalization of the pairwise concept of mutual information. If this quantity is small, we do not lose much by approximating by the product distribution. Like mutual information, it measures the average number of bits that can be gained by a joint compression of the variables vs. independent compression. When has additional known independence relations, we can rewrite the multi-information in terms of the dependencies among the variables: Proposition 3.1 : Let be a Bayesian network structure over , and let be a distribution over such that . Then,

b Y h d c ? ?@Beg V b 8 "

e t

where we take . By taking the variation (i.e derivative in the ?nite case) of w.r.t. , under the proper normalization con-

That is, the multi-information is the sum of local mutual information terms between each variable and its parents. We denote the sum of these informations with respect to a network structure as:

? Y ` u? 2? Y v Y v Y v § X X ? u¤ 7 U y1   ` a¤DDPPDwu¤? ? 8653 xS §  ? ?vvv ? "  YYY ?  ` 8¤@PDw X 8¤? " P ` ¨§ PDP§ X 8¤( ??? G

b ?

U S  1 UW 7 35 VT ¨R " ??? G &     " ? "  " ? G & 3 C A  @@@QI?¤? " P9) I?¤??? H9¨? F(E DB @8A¤??)?§ ? @ 9) 98¤?  " ? b h c Y? Beg dr b u?A "#  a¨2PDD§ ?8? t ` §YYY X ? " s ?? ? ?a¨§ DDP&? ? `  YYY§X ?

Y Y ` § DYPD§ X  ? " s 7? b  h eg drq" b 8?¤? ip  `a¨§ YDDP§ ?8¤? c YY X ? ? ? ? ? b  h4egfVc d ` §YYY X a2DPD§ ?

?

?

 ` 2DPD§ X 8¤? §YYY ?

&

specifying a pair of Bayesian networks. One network, , represents which variables are compressed versions of the observed variables (each new variable compresses its parents in the network). The second network, , represents which relations should be maintained or predicted (each variable is predicted by its parents in the network). We formulate the general principle as a tradeoff between the information each network carries. We want to minimize the information maintained by and to maximize the information maintained by . We further give another interpretation to this principle, as a tradeoff between compression of the source (given by ) and ?tness to a target model, where the model is described by . Using this interpretation we can think of our new principle as a generalized compression distortion tradeoff (as in rate-distortion theory [3]). This interpretation may allow us to investigate the principle in a general parametric setup. In addition, we show that, as with the original IB, the new principle provides us with self-consistent equations in the unknown probabilistic partition(s) which can be iteratively solved and shown to converge. We show how to combine this in a deterministic annealing procedure which enables us to explore the information tradeoff in an hierarchical manner. There are many possible applications for our new principle and algorithm. To mention just a few, we consider semantic clustering of words based on multiple parts of speech, complex geneexpression data analysis, and neural code analysis.

straints, Tishby et al. show that the optimal partition satis?es,

 ` 2DPD§ X u? t §YYY

?





?

¤??? ?

?

&

 ?§ ? ¨??¤? ?

 47¤? ?" ?  ? 6C¤?  ?¤? B@§ ¨??¤? " ? ?§ ?   ?§ ?  ? ? & 3   ? 1 ? " ? ¨C 54B@? 2B80¤?

?§? ¨??

?



?

¤?? ?



 8#)¤? 0" ?

?§? ¨?

?

 ? ? ¨C

 " ? B? %C¤?

 20 ? $) ?¤? " ? !§ 8 ¤ §   0§ § ¨§ 1A)   ?§ § @§ ¨?



?

?§? ¨?

( )'

  ? B¨?

'

¤?? ?

?

(2) and the variation is done subject to the normalization constraints on the partition distributions. It leads to tractable self-consistent equations, as we henceforth show. It is easy to see that the form of this Lagrangian is a direct generalization of the original IB principle. Again, we try to balance between the information loses about in and the information it preserves with respect to . Example 4.1 : As a simple example, consider application and of Figure 1. of the variational principle with speci?es that compresses and speci?es that we want to predict . For this choice of DAGs, and . The resulting Lagrangian is

4 Multi-Information Bottleneck Principle
The multi-information allows us to introduce a simple “liftup” of the original IB variational principle to the multivariate case, using the semantics of Bayesian networks of the previous section. Given a set of observed variables, , instead of one partition variable , we now consider a set , which correspond to different partitions of various subsets of the observed variables. More speci?cally, we want to “construct” new variables, where the relations between the observed variables and these new compression variables are speci?ed using a DAG over . Since we assume that the new variables in are functions of the original variables, we restrict attentions to DAGs where the variables in are leafs. Thus, each is a stochastic function of a
It will be convenient to think of cases where restricted to forms a complete graph. However, this is not crucial in the following development. To simplify the technical details, we do assume that is consistent with .

Since, is constant, we can ignore it, and we end up with a Lagrangian equivalent to that of the original IB method.

5 Analogous Variational Principle
We now describe a closely related variational principle. This one is based on approximating distributions by a class de?ned by the Bayesian network , rather than on preservation of multi-information. We face the problem of choosing the conditional distributions . Thus, we also need to specify what is exactly our target in constructing these variables. As with the original IB method, we are going to assume that there are two goals. On the one hand, we want to compress, or partition, the original variables. As before the natural multivariate

§

¤?? ?

'CBA9 e t V3 2 1e t & 0

2 01e t

?§? ¨?? 

 ? ? ¨??A ¨C 53  ¨??A B¨C ? ? & ? ? R ? ? TFX D'  ? ? B¨C ?CBA9 e t ?#SB¨C ? ?  R  ? ? ?   ¤?? ?

F? ¨? ? § F?¨§$ ? $ D ? D

?§? ¨?

1  2 0 Qe3 Vq" ¤6§ YY D2 2 0 Ie3 rq" HC¤? GX P d c  ? ? Y§ H d c X ? F



?

?? ?

 2 13e VQ8C¤? 0 U4 d c "  ?

(

D E'

Thus, we see that can be expressed as a sum of conditional information terms, where each term corresponds to a Markov independence assumption with respect to the order . Recall that the Markov independence assumptions (with respect to a given order) are necessary and suf?cient to require the factored form of distributions consistent with [6]. We see that is measured in terms of the extent these independencies are violated, since if and only if is independent of given its parents. Thus, if and only if is consistent with . An alternative representation of this distance measure is given in terms of multi-informations, since we can think of as the amount of information between the variables that cannot be captured by the dependencies of the structure .

'CBA @e 9 b ? t

?§? ¨?

Proposition 3.2 : Let be a Bayesian network structure over , and let be a distribution over . Assume that the order is consistent with the DAG, then



This measure has two immediate interpretations in terms of the graph .

Analogously to the original IB formulation, the information that we would like to minimize is now given by . Minimizing this quantity attempts to make variables as independent of each other as possible. (Note that since we only modify conditional distributions of variables in , we cannot modify the dependencies among the original variables.) The “relevant” information that we want to preserve is speci?ed by another DAG, . This graph speci?es, for each which variables it predicts. These are simply its children in . Conversely, we want to predict each (or ) by its parents in . Thus, we think of as a measure of how much information the variables in maintain about their target variables. This suggest that we wish to maximize is . The generalized Lagrangian can be written as

20

e t

 2 0 e rq" 8¤? i   ¤? d c  ? ?



?§? ¨?



'CBA@e t 9

  ? §  ¤?
?§? ¨?

When is not consistent with the DAG , we often want to know how close is to a distribution that is consistent with . That is, what is the distance (or distortion) of from its projection onto the sub-space of distributions consistent with . We naturally de?ne this distortion as

set of variables . Once these are set, we have a joint distribution over the combined set of variables: (1)

6 d 72 1540 3e rc



 8

e 3 `  YYY X t 4 ?§ PDD§ ?8? t b  h eg rq" h eg r5x? X ?¨?2DPD§ X  ? a8 " d c d c 3 b §YYY b ?

? ?   ?¨? " ??? G h Beg dVc 3 ? X ?¨a¨§ PDP§ X  b  YYY b a   h eg Vc " h eg r5x?? X ?¨b ¨2§PYDYPY§ ? ?  b uA d d c 3 X ?  ¨? " ??? G ?

?



" !

? @§2PYDD§ X  ?  Y Y

Y  " ?R " ?? G

?

YY ` § YPDD§ X  ? ? `  YYY§X a¨§ DDP&? ? ?

" 



e

 

? ?W  " ¤?? §?? ¨? " ???  
 " ¨? " ?? G
  ¤?? ?

 YYY ` ¨§ DDP§ X 

?

? ` ¨§ DDP§ X  ?  YYY

( % )'#&$

G

X

?

?

?

?

 " ¨? " ??? G

 " ¨? " ??? G

?





#

  

Figure 1: The source and target networks for the original IB. form of this is to minimize the multi-information of . Since is consistent with , we can denote this multi. information as While in the previous section the second goal was to preserve the multi-information to other (target, relevant) variables, here we think of a target class of model distributions, speci?ed by a target Bayesian network. In that interpretation the compressed variables should help us in describing the joint distribution in a different desired structure. We specify this structure by another DAG that represents which independencies we would like to impose. To make this more concrete consider the simple twovariable case shown in Figure 1. In this example, we are given the distribution of two variables and . The DAG speci?es that is a compressed version of . On the other hand, we would like this to make and as independent as possible. A way of formally specifying this desire, is to specify the DAG of Figure 1. In this DAG, separates between and . The question now is how to force the construction of such that it will lead to the independencies that are speci?ed in the target DAG. Notice that these two DAGs are, in general, incompatible: Except for trivial cases, we cannot achieve both sets of independencies simultaneously. Instead, we aim to come as close as possible to achieving this by a tradeoff between the two. We formalize this by requiring that can be closely approximated by a distribution consistent with . As previously discussed, a natural information theoretic measure for this approximation is , the minimal KL divergence from to distributions consistent with . As before, we introduce a Lagrange multiplier that controls the tradeoff between these objectives. To distinguish it from the parameter we use above, we denote this parameter by . The Lagrangian we want to minimize in this formulation is thus: (3)

where the parameters that we can change during the minimization are again over the conditional distributions . The range of is between , in which case we have a trivial solution in which the ’s are independent of their parents, and , in which we strive to make as close as possible to .

2 1e t 0

Example 5.1 : with and

Consider again the example of Figure 1 . In this case, we have that

which is equivalent to the Lagrangian presented in the previous section, under the transformation . Where the range corresponds to the range . (Note that when , we have that , which is the extreme case of .) Thus, from a mathematical perspective, is a special case of with the restriction . This transformation raises the question of the relation between the two variational principles. As we have seen in Examples 4.1 and 5.1, we need different versions of in the two Lagrangians to reconstruct the original IB. To better understand the differences between the two, we consider the range of solutions for extreme values of and . When and , both Lagrangians minimize the term . That is, the emphasis is on loosing information in the transformation from to . In the other extreme case, the two Lagrangians differ. , minimizing is equivalent to maximizWhen ing . That is, the emphasis is on preserving information about variables that have parents in . For example, in the application of in Example 4.1 with , this extreme case results in maximization of . On the other hand, if we apply with , then we maximize . In this case, when approaches information about will be preserved even if it is irrelevant to . When , minimizing is equivalent to minimizing . By Proposition 3.2 this is equivalent to minimizing the violations of conditional independencies implied by . Thus, for , this minimizes . Using the structure of , we can write (as implied by Proposition 3.2), and so this is equivalent to maximizing . If instead we use , we minimize the information . Thus, we minimize while maximizing . Unlike, the application of to , we cannot ignore the term .

?§ ¨? ?

F? ¨? §

F? D'   ¨X

?

$D

?

 ? ? ¨C

  ? B¨? X F?¨§$ ? ? TFB?¨C  ? ¨C?   ?  ¨A 3  ¨A R D ¨??A D ' B@§ ? ? ? ? ?  ? ?   ? ? F? ¨? ? §  ? ? ¨C  ?¨A x$ D? ¤B " ¨??A  ? 3 ? ?  ? ? ?   ? ? B" ¨??A F? ¨? ? § ? §¨? ? 'D ?§ "  ¨? ? " ??? G ? !?  F ? D? ' ? ?  ¨A 1 ¨A ? ? R ? ?

&

?  §& TFX F  ¨a? " ?? D % ?§? " G' D? '? & ? &§  '& GFX D  §  ? ?  ( & ( TFX ?CAB9 D e ' ?3 2 1e t  §S¨?? ? 0 ? R  C'BA9 e t ? 0 F? t 3 2 0 e t ? §R 2 1e t D'

?§? ¨??

&

F?§¨' ? ? D



GFX

GFX

D

D' GFX D'

'

 ?

2 1e 0  ¤& t 

'CBA9 e ?  §& t

"  ¤? ?

?

?

?

?

? F¨§' ?

?

D
?
?

?

 ?¨?a? " "??? Gw??R 2 01e t §

?§? ¨??

?





? ? F? §¨? ? 'D 

F?¨§$ ?

¤?? ?

D
?

T

A

B

A

B

Since, is constant, we can ignore it, and we end up with a Lagrangian equivalent to that of the original IB method (setting ). Thus, we can think of the IB as ?nding a compression of that results in a joint distribution that is as close as possible to the DAG where and are independent given . Going back to the general case, we can apply Proposition 3.2 to rewrite the Lagrangian in terms of multiinformations:

'CBA9

A

B

T

T

and Proposition 3.2, we have that Putting, these together, we get the Lagrangian

e 3 e ?§ ? t 2 0? ¨Ct  R  ¨¨? " C? ? G 'CBA9 e   ? B?    ? t

 ? ? ?  & ?  ? ? ¨??A ¨??A@ ??RS?? R¨C ?¤34B?¨C ? ? ? ? ?  ? F? D'  ? ?  ? ? BC R ¨?

. Using .

?§ ¨? ?

?§ ¨? ?

?

F?¨§' ? ? D ?§? ¨?

20 e t



F?

?

D !'

?

¤?? ?

 ?" C¤? ? 

?

¤?? ?

 " ¨? " ??? G

 2 0 U3e Vc 4 d

?

¤?? ?

6 Bottleneck Variations
We now consider two examples of these principles applied to different situations. Example 6.1 : We ?rst consider a simple extension of the original IB. Suppose, we introduce two variables and . As speci?ed in of Figure 2(a), both of these variables are stochastic functions of . In addition, similarly to the original IB, we want and to extract information about from . We call this example the parallel bottleneck, as and compress in “parallel”. The DAG speci?es that and should predict . Based on these two choices, and . After dropping the constant term , the Lagrangian can be written as

(5)

Proof:

.

That is, we attempt to maximize the information maintains about and about , and at the same time try to minimize the information between and .

I  G  ? I 4 ? G   ? ? 3  ? ?  G C BB I C 3  I @ G A8P?? ?  ?

? ) $ $ ¤ ? 0D &0$) &(  $ ? $ ? % B $ (   % A 0& 4" ( ? & %¨"  ?¤% FE%" &¨ %" ?¤5&% 4C" 3%" ¨ ?!   ?¤ ) ' 2 $" ) 6 )4 $ $ 0& @2 5&0& " 4" 2 $ H & %" $ 2@%" 9 2 2 8H 31 6 2 ) 2 ' 2 & $ 1 @" H !  3 71  7 ? 2 % 2 3('1' H 0& (H 3%#?  ( 3?1¨2 ?8H ¤ § ?¤31 ? ?

Thus, minimizing minimizing

is equivalent for . In other words, an-

R ? 

? I C

?

R ? 1B? G C

?

I 

F ? D' '

G 

?

Thus, we attempt to minimize the information between and and while maximizing the information they preserve about . Since and are independent given , we can also rewrite

?

I 

G 

(4)

Thus, on one hand we attempt to compress, and on the other hand we attempt to make and as informative about each other as possible. (Note that if is informative about , then it is also informative about .) Alternatively, we might argue that and should each compress different “aspects” of the connection between and . This intuition is speci?ed by the target network of Figure 2(b). In this network and are independent of each other, and both are needed to make and conditionally independent. In this sense, our aim is to ?nd and that capture independent attributes of the connection between and . Indeed, following arithmetic similar to that of Example 6.1, we can write the Lagrangian as:

I 

?

?

?

 I @ G C ¤B  I A R B G A  ? ? 3 ? ?  ? ?

?

?

?

I 

G H

?

G 

G  I 

?

?§ ¨? ? ?§ ¨? F ? $ D

I 

?

?

G 

I 4

?

¤?? ?

I 

G 

G 

?

F ?$ D !'

I 

I 

G 4

X

?

To summarize, we might loosely say that focuses on the edges that are present in , while focuses (or more preon the edges that are not present in cisely, the conditional independencies they imply). This explains the somewhat different intuitions that apply to understanding the solutions found by the variational principles. Thus, although both variants can be applied to any choice of , some choices might make more sense for than for , and vice versa.

Thus, again, we attempt to minimize the information beand while maximizing the information they tween together contain about . Example 6.2: We now consider the symmetric bottleneck. In this case, we want to compress into and into so that extracts the information contains about , and at the same time extracts the information contains about . The DAG of Figure 2(b) captures the form of the compression. The choice of is less obvious. in Figure 2(b), attempts One alternative, shown as to make each of and suf?cient to separate from . As we can see, in this network is independent of and given . Similarly, separates from the other variables. The structure of the network states that and are dependent of each other. Developing the Lagrangian de?ned by this network, we get:

Y  ?  ? 3  ? 4@ ? @§ X C Q ? @ X A?? ? R   ? C5B X C ? ? R ? ?

F?

?§ ¨? ? ? F'D

F?

D

Figure 2: The source and target networks for the parallel and symmetric Bottleneck examples.

?CBA@e' t 9

(b) Symmetric Bottleneck

?

X

D

'

¨??A ? ? ? ? @§ X AR   ? § HC  ? ?  X ? ?  ?X  F§¨' ? ? ? ? D

G 

I 4

?

?

'

GFX  ? § C ? R B? ?   ? ? X ? A R D? ¨' ?  X 2 10 e t ?

?

X

 @ HC 1B4?§ HA  X ? R  ? ?  X ?   ? ? A R   X A ? ? ? ? Y ? ? @ XHC 34 ? ? A   HA B ?@§ HC ? ? R ? X ?  ? ?  X ? ? ?  X  ? ? X  ? ?

Y? ? @§ X C? &53BB ? C R B X C  ?  ? ? ?  ? ? TFX D'  ? 'C?¨9 AB ??e$At B? X C  ? ?§ ¨? ?

F

TF?X D ' D

? F¨§' ?

'

D

?

?§ ¨? ?

? ? X

?§? ¨?

?

F?¨§$ ?

D

TA

TB

B

A

A

B

?§? ¨?

A

B

TA

TB

TA

TB

F

F? §¨? ? 'D

?

(a) Parallel Bottleneck

?

?

?D

'

? F¨§' ?

D
?

F?¨§$ ?

D
? ?

T1

T2

A

B

A

B

?

?

X

A

B

T1

T2

T1

T2

other interpretation for the above optimization is that we aim to ?nd and that together try to compress , preserve the information about and remain independent of each other as possible. In this sense, we can say that we are trying to decompose the information contains about into two “orthogonal” components. Recall, that using we aim at minimizing violation of independencies in . This suggests that the DAG of Figure 2(a) captures our intuitions above. In this DAG, and are independent given and . Moreover, here again speci?es an additional independence requirement over and . To see that we examine the Lagrangian de?ned by the principle. In this case, . Using Eq.(5) and dropping the constant term , the Lagrangian can be written as

XF ?  ? H$  D ?

¤?? ?

F ? ¨§a? D?'?

¤?? ? ¤?? ?

GFX

D

?

'

7 Characterization of the Solution
In the previous sections we stated a variational principle. In this section we consider the form of the solutions of the , , and principle. More precisely, we assume that (or ) are given. We now want to describe the properties of the distributions . We present this characterization for the Lagrangians of the form of . However, we can easily recover the corresponding characterization for Lagrangians of the form (using the transformation ). In the presentation of this characterization, we need some additional notational shorthands. We denote by , , and . We also de§ ¨ ? ? ? ? ?

and similarly for note plify the presentation, we also assume that In addition, we use the notation
          ? ? ¤  ? ? ? ?

. To sim.

See Appendix A for a proof outline. The essence of this theorem is that it de?nes in . This distortion measures terms of the distortion how close are the conditional distribution in which is involved into these where we replace with . In other words, we can understand this as measuring how well performs as a “representative” of the particular assignment . The conditional distribution depends on the differences in the distortion for different values of . The theorem also allows us to understand the role of . When is small, the conditional distribution is diffused, since reduces the differences between the distortions for different values of . On the other hand, when is large, the exponential term acts as a “softmax” gate, and most of the conditional probability mass will be assigned to the value with the smallest distortion. This behavior matches the intuition that when is small, most of the emphasis is into and when on compressing the input variables is large, most of the emphasis is on predicting the “outputs” variables of , as speci?ed by . Example 7.2 : To see a concrete example, we reconsider the parallel bottleneck of Example 6.1. Applying the theorem to of Eq. (4), we get that the distortion term for is
? 3

and are variables (or sets of variables) and is the joint distribution over all variables given the speci?c value of . Note that this terms implies averaging over all values of and using the conditional distribution. In particular, if or intersects with , then only the values consistent with have positive weights in this averaging. Also note that if is empty, then this is term reduces to the standard KL divergence between and . The main result of this section is as follows.
? $

Thus, attempts to make predictions as similar to these of about , and similarly attempts to make predictions as similar to these of about .

where all probabilities in this term are derived from the de?nition of the model in Eq. (1).
E F

8 Iterative Optimization Algorithm

This can be done by standard variable elimination procedures.

We now consider algorithms for constructing solutions of the variational principle.

S

U

1

1

U V

S

U V

S

T %

T V

&

&

S %

U

? D ?

 

?

? "

3

@

 %

 

 !

? "

? ? D % @

? %



?





?

?

 

where given by

is a normalization term, and

is

Q R

P IG D"H

(6)

The ?rst term is a simple KL divergence, and last term can be simpli?ed to . By simple arithmetic operations, and using , we get the set of self-consistent equations
' ( '

?

?

3

Theorem 7.1 : Assume that , given. The conditional distributions stationary point of
 %

,

are are a if an only if


, and

 

 "

?

1 2

) ' 0( &









 





?



? C8 A 4 @ B 6

9 6 4 ?8 7 5



where

This term corresponds to the information of and . We see that increases when the predictions of given are similar to those given (when averaging over ). The distortion for is de?ned analogously. Example 7.3 : Consider now the symmetric bottleneck case of in Example 6.2. Applying the theorem, we get that the distortion term for is



G ? 1  ? ? S ? Y 2@  " 34 ¤? " D  " 34 ¤?? G F 4 ? U )R I  I  G  ( D 34 1  2@& § ? ? 34 ¨3 " C¤? " P"& § 34 ¨3 " C¤??? G ( F 4 ? D U S h " b R   ¨?§ ?0? 3  " I ? 20 su¤? ?FF ? @ 34 ? 3 D U ? ? F ' ? 3 D U D ¨  ¨? AI?)u? ¤? 12@& §   " b ?  " b ? § 34 h ¨g ?a8¤? " D"& § h34 ¨g ?au¤?? G ( F 4 ? D U S "  ? ?FF ? U F U  ¨  Gu¤?3 @ 9) " G u¤? ? D3 ? ? $ ? 3 D D ?  8 ? U3 @ &§ 4 ¨ X ? & §  uA ) ? 7  §  ? & R 4?)?¤? 8653 D3 G u? 7 35 3  G $)¤9 3 99" Gu¤? 8653  7 " ? ? " ? F 4 % 4 ¨   8¤? 34 @ &  8¤? ? 'CD BA9 e & 3 2 0 e F 12@ G " ???¤? " P486?¤??? G F U S   ")" ? t TX ? 1  "  ¤? ? t ? D' R 21@ G " I C¤? " D"9) " I C¤??? G ( F $ ?? D U S 9)§ G 8 ? ?  ? & ?§? ¨a? ?? ?  u¤? ? ( $ D G    q" ¤?  ? F?¨§$ ? ?  &1 " ¤? ? D  ? ? ) X  ? 9) " X 8¤? ?   ? § X  ?  " ? & 9v ¤? 12@ ? § X " ?¤? " P ? @§) " ?¤?? G F ? U S    ?  ? ) ? 9§ X uA ( $ D X § ? 1   ( " ¤?   § " ¤? 7 35 F 4 ? D U S  ? GFX ( D'    ¨§ "I ¤? " D& § " ¤?? ¨& s ¤? " ? " ? G " ? ?§ ¨??  8 1  2@& ¨§ " ¤? " P& ( I ¤?? G F 4 ? U S ? " § " ?   & ( D & 34   54B¨g 'CBA9Beg r7 h g ?  'C BA?9 U4e3 3 V3 c 34 U43 ¨3 2 1U40 e3 Vc 3h h d c d d

&

   ? & "  u¤? 
&

 


 "  ? &%C¤?
?



  !§  uA3  ?

 

&

&

 


&



 

?

GFX D' ?§? ¨a? ¤?? ?
 

  ! 

    ! 

 

F?

   "

D'  2 0 Ue3 rQ"  C¤? 4 dc ?
?     ?

 

?

  ¨X
 "

 #

?

&

Symmetric compression, Information curves, synthetic example 1
P(A;B) not sorted P(A;B) sorted by the clustering solution

0.9
x 10 2
?3

x 10 2

?3

0.8

Preserved information

10 20 30

1.8

10 20 30

1.8

0.7 0.6 0.5 0.4 0.3 0.2 0.1 I(TA;A)/H(A) vs. I(TA;B)/I I(TB;B)/H(B) vs. I(TB;A)/I 0.2 0.4 0.6 Compression factor 0.8 1

1.6

1.6

1.4

1.4

1.2

1.2

A

1

A

40 50 60

40 50 60

1

0.8

0.8

0.6

0.6

0.4

0.4

70 80

0.2

70 80

0.2

5

10 B

15

20

5

10 B

15

20

0 0

(a)

(b)

(c)

Figure 3: Application of the symmetric bottleneck on a simple synthetic example. (a) input joint distribution, (b) the same joint distribution where rows and columns were permuted to match the clustering found; dotted lines show cluster boundaries. (c) information curves showing the progression along the information tradeoff graph for increasing . The -axis is the fraction of the information about the original variable that is maintained by the compressed variable, and the -axis is the fraction of the information between and that is captured by or . Circles denote bifurcation events. We start with the case where is ?xed. In this case, following standard strategy in variational methods, we simply apply the self-consistent equations. More precisely, we use an iterative algorithm, that at the ’th iteration maintains the conditional distributions . At the ’th iteration, the algorithm applies an update step A key question is how to initialize this procedure, as different initialization points can lead to different solutions. We now describe a deterministic annealing like procedure [8, 13]. This procedure works by iteratively increasing the parameter and then adapting the solution for the previous value of to the new one. This allows the algorithm to “track” the changes in the solution as the system shifts its preferences from compression to prediction. Recall that when , the optimization problem tends to make independent of its parents. At this point the solution consists of essentially only one cluster for each which is not predictive about any other variable. As we increase , we suddenly reach a point where the values of diverge and show two different behaviors. This phenomena is a phase-transition of the system. Successive increases of will reach additional phase transitions in which additional splits of values of emerge. The general idea of this annealing procedure is to identify these bifurcations of clusters. At each step of the procedure, we maintain a set of values for each . Initially, when , each has a single value. We then progressively increase and try to detect bifurcations. At the end of the procedure we record for each a bifurcating tree that traces the sequence of solutions at different values of (see for example Figure 4(a)). The main technical problem is how to detect such bifurcations. We adopt the methods of Tishby et al. [13] to multiple variables. At each step, we take the solution from the previous step (i.e., for the previous value of we considered) and construct an initial problem in which we duplicate each value of each . To de?ne such an initial solution we need to specify the conditional probabilities of these “doubled” values given each value . SupIn deterministic annealing terminology, is the “temperature” of the system, and thus increasing corresponds to “cooling” the system.



?

&

'

%

See Appendix A for a proof for the case . The convergence proof for the general case is more involved and will appear in the full version of this paper. At the current stage we do not have a proof of convergence for the synchronous case, although in all our experiments, synchronous updates converge as well.

&



Theorem 8.1 : Asynchronous iterations of the selfconsistent equations converge to a stationary point of the optimization problem.





&



&



&



&



&

where and are computed with respect to the conditional probabilities . There are two main variants of this algorithm. In the synchronous variant, we apply the update step for all the conditional distributions in each iteration. That is, each conditional probability is updated by computing the distortion based on the conditional probabilities of the previous iterations. In the asynchronous variant, we choose one variable , and perform the update only for this variable. For all , we set . The main difference between the two variants is that the update of in the asynchronous update incorporates the implications of the updates of the other variables.



 8

(7)

&

$

 ? ? B I C

  G C ? ?

 ?& 

&

&

?

"

?& "p ? F  ? ? ? & 8 u? D  § F ?  &8 ? § 34 D  F 4  ¨  GFX  ? @ F? u ? ? D D

@

C? F

?? ? ?¤  R"

?

?

? ?

D



% 4 21D

? "& #

 @ 0" ?

 C? F

) ' 0( &

?? ? ?

D

@

C? TX  F

 

 ?  ? & $" C¤?

&

 !

? ?

D

?

3

R ?

F  ? ? & 8¤? D ?  "  8? TX  ? F

 ? "!



 8

? §YYY§ PDP8? ?

? §§YYY§ ¨2PDP8?

D

?

@

?

?



1

Symmetric compression, Information curves, 20NG example 1 0.9

2

3

0.8

Preserved information

0.7 0.6 0.5 0.4 0.3 0.2 0.1 I(Tw;W)/H(W) vs. I(Tw;C)/I I(Tc;C)/H(C) vs. I(Tc;W)/I 0.2 0.4 0.6 Compression factor 0.8 1

4

5

6

7

8

9

sci.med politics.guns politics.misc

autos motorcycles sci.space

10

comp.windows.x

ibm.pc.hardware mac.hardware misc.forsale electronics

sci.crypt

atheism religion.christian religion.misc

baseball hockey

politics.mideast

comp.graphics

comp.os.ms-windows.misc

0 0

(a)

(b)

Figure 4: Application of the symmetric bottleneck to the 20 newsgroup data set with 300 informative words. (a) The learned cluster hierarchy of categories. (b) information curves showing the progression along the information tradeoff graph. Note that with 16 word clusters we preserve most of the information present in the data. pose that and Then we set and
  "

9 Examples
To illustrate the ideas described above, we now examine few applications of symmetric and parallel versions of the bottleneck. As a simple synthetic example we produced a joint probability matrix (see Figure 3(a)) where and . Using the symmetric compression ( of Example 6.2) we ?nd natural clusters for and natural clusters for . Sorting the joint probability matrix

 &¨$#¤? %§ ?

?

?

 ? I ?  4A

 ? ?  G A I 



G 

"

I  I 4

I 

?



I 

I 

 ? G ? B A

?

G 

G 

 ? ? B I C I 

!

G 

?

?

   § I? 

I 4



?

G 

G 

where is a noise term and is a scale parameter. Thus, each copy and is a perturbed version of . If is high enough, this random perturbation suf?ces to allow the two copies of to diverge. If is too small to support such bifurcation, both perturbed versions will collapse to the same solution. After constructing this initial point, we iteratively perform the update equations of (7) until convergence. If at the convergence point the behavior of and is identical, then we declare that the value has not split. On the other hand, if the distribution is suf?ciently different from for some values of , then we declare that the value has split, and incorporate and into the bifurcation we construct for . Finally, we increase and repeat the whole process. We stress that this annealing procedure is heuristic in nature. We do not have formal guarantees that it will ?nd the global optima. Nonetheless, it has the distinct advantage in that charts the behavior of the system at different values of . An alternative and simpler approach that proved useful for the original IB formulation employed agglomerative clustering techniques to ?nd a bifurcating tree in a bottomup fashion [9]. Such an approach can also applied to the multivariate case and will be presented elsewhere.

&  &    %$   1 § 3 & ? ? ?  ? ?#  %'     ? ?   ?  ? ? %' ? ? §   ? ?3 ? & "  u¤? & X X" (  88??  ? ?  X? ?  ?? ¨§   ? ?¤R X? ? & "  8¤? & " %'  %$   88?? %$  
are the two copies of the value
 "  " ?   

.

by this solution illustrates this structure (Figure 3(b)). It is also interesting to see the fraction of information preserved by our clusters. One way of presenting these results is by considering the fraction of information preserved by about , and analogously, the fraction of information preserved in about . This amount of preserved information should be plotted with respect to the compression factor, i.e., how compact is the new clusters representation. This is given of course by and respectively. In Figure 3(c) we present these two information curves. In both curves we see that splitting the current clusters set increase the amount of information preserved about the relevant variable, and simultaneously reduces the compression (since more clusters induces less compression). In the ?rst split we ?nd clusters in and in . The second split results with clusters in and in . The next split ?nds clusters in and leaves with clusters. This is indeed the ”real” structure of this data. Interestingly, due to this last split, the information preserve about is increased, though there was no split in . The reason, of course, is that predicts through , thus the increases . On the other hand, since split in there was no split (at this step) in , remains unchanged. The next splits are practically over?tting effects and accordingly there is no real information gain due to these splits. As a more realistic example we used the standard 20 newsgroups corpus. This natural language corpus contains about articles evenly distributed among USENET discussion groups [5] and has been employed for evaluating text classi?cation techniques (e.g., [12]). Many of these groups have similar topics. Five groups discuss different issues concerning computers, three groups discuss religion issues, etc. Thus, there is an inherent hierarchy among these groups. To model this domain in our setting, we introduce two random variables. We let denote words, and denote a category (i.e., a newsgroup). The joint probability

 ?F D$ " ' 

   %$    "  8¤? ? %$  


?"

%'



?



%$

   & "


 ?§ ? ?¤?

%'

 8¤? ?
?

 I?

&

A" ? "

%'


&

Parallel compression, Information curves, 20NG example 0.9 0.8

Preserved information

0.6 0.5 0.4 0.3 0.2 0.1 0 ?0.1 0 0.2 I(T1 ;W)/H(W) vs. I(T1 ;C)/I W W I(T2W;W)/H(W) vs. I(T2W;C)/I I(T1W,T2W;W)/H(W) vs. I(T1W,T2W;C)/I 0.4 0.6 Compression factor 0.8 1

Figure 5: Information curves for parallel compression of the 20 newsgroup data set showing the progression along the information tradeoff graph.

10 Discussion
is the probability that a random word-position (e.g., word 218 in document 1255) in this collection is equal to and at the same time the category of the document is . To obtain such a joint distribution we performed several preprocessing steps: We removed ?le headers, transformed all words to lower case, and removed stop words and words containing digits or non-alphanumeric characters. We then sorted all words by the contribution to the mutual information about the category variable. More formally, we sorted all words by , and most “informative” words. used the subset of the top After re-normalization, we had a joint probability matrix with and . We ?rst used the symmetric bottleneck algorithm to cluster both dimensions of this matrix into two sets of clus, and clusters of categories, . ters: clusters of words, The hierarchy found in , shown in Figure 4(a) is in high agreement with the natural hierarchy one would construct. Additionally, each of the word clusters is in high correlation with one of these category clusters. For example, for the second word cluster, the most probable words (i.e. the words that maximize ), were ’islamic’, ’religious’, ’homosexual’, ’peace’ and ’religion’. Accordingly was maximized for the “religion” cluster in (left cluster in Figure 4(a)). As already explained, the general mapping scheme we use is a “soft” one. That is, each object could be assigned to each cluster with some normalized probability. The clustering of into was typically “hard” (for every , was approximately for one cluster and for the others). However, the clustering of into utilized the “soft” aspect of the clustering to deal with words that are relevant to several category clusters. Thus, some of the words were assigned to more than one cluster. For example, the word ’Clinton’ was assigned to two different word clusters dealing with politics. The word ’sexual’ was assigned to the same two clusters, and also (with lower probability) to a cluster of words dealing with religious issues. For the same data we used also the parallel compression ( of Example 6.1). In this case we have two compresWe presented a novel general framework for data analysis. This new framework provides a natural generalization of the information bottleneck method. Moreover, as we have shown, it immediately suggests new bottleneck like constructions, and provides generic tools to implement them. Many connections with other data analysis methods should be explored. The general structure of the iterative procedure is reminiscent of EM and k-means procedures. Other connections are, for example, to dimensionality reduction techniques, such as ICA [2]. The parallel bottleneck construction provides an ICA-like decomposition with an important distinction. In contrast to ICA, it is aimed at preserving information about speci?c aspects of the data, de?ned by the user. The suggested framework allows us to extract structure from data in numerous ways. In our examples, we explored only few relatively simple cases, but clearly this is the tip of the iceberg. We are currently working on several additional applications under this framework. These include analysis of gene expression data, neural coding and DNA sequence analysis, document clustering, and computational linguistic applications. Acknowledgements
This work was supported in part by the Israel Science Foundation (ISF), the Israeli Ministry of Science, and by the US-Israel Bi-national Science Foundation (BSF). N. Slonim was also supported by an Eshkol fellowship. N. Friedman was also supported by an Alon fellowship and the Harry & Abe Sherman Senior Lectureship in Computer Science. Experiments reported here were run on equipment funded by an ISF Basic Equipment Grant.

References
[1] L. D. Baker and A. K. McCallum. Distributional clustering of words for text classi?cation. In ACM SIGIR 98. 1998. [2] A.J. Bell & T.J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neur. Comp. 7, 1129-1159, 1995. [3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991. [4] T. Hofmann. Probabilistic latent semantic indexing. In ACM SIGIR 99, pages 50–57. 1999.

?  ?

" § ? " %

  "  !    ? ?  & § ? § § X C "  ? % % "  !  ? ?   "  § ? C  & § HC ? " X ? % % " § X " %

"

0.7

sion variables, , that try simultaneously (and independently), to cluster the set of words in a way that will preserve the information about the category variable , as high as possible. In Figure 5, we present the information curves for these two cluster sets. Clearly, using the combination of the compression variables is much more informative than using each one of them independently. For example, after the second split, , and and preserve and of the original information , respectively. On the other hand, at the same stage, preserves almost of . Thus, only word-clusters are enough to preserve most of the information about the category variable.

!

§ % ? @§ § % X 

?

?

" ? %

 §

F § ?? D ? D F

%

#

U U 8654B# " % ¤? 7 3 ?
!

 ! " ) § 8¤? ?

? ?  ? ?6 ¤? B$#¤? ?
?

? 
¨

§? 

" a" "



?

 ? B# A



 ) § " C¤?  ?

8" ! "

 " C¤? % ? " ?

?

TFX

D

'

¨

A

Proofs

We now sketch the proofs of the two main theorems. We start with Theorem 7.1. Proof: The basic idea is to ?nd stationary points of subject to the normalization constraints. Thus, we add Lagrange multipliers and get the Lagrangian . To differentiate we use the following lemma. Lemma A.1 :

We now turn to the proof of Theorem 8.1. Proof: To prove convergence it suf?ces to prove that unless we are at a stationary point, each application of the assignment . ReEq. (7) reduces the Lagrangian call, that we require that . Also, recall that when , minimizing this Lagrangian is equivalent to minimizing the La. grangian To show convergence, we will introduce an auxiliary Lagrangian: subject to the constraints that , , and , where is the DAG without edges. It is easy to see that and coincide when and . That is, when and are the KLprojections of onto the space of distributions consistent with and . Using properties of KL-projections, we get. Lemma A.2 : For any choice of , and that are consistent , and , respectively, , with with equality if and only if and are the projections of onto and , respectively. Assume that and are ?xed, and suppose we want to modify to minimize . Taking derivatives of with respect to and equating to , we get the following self consistent equations:

We now can differentiate each mutual information term that appears in . Note that we can ignore terms that do not depend on the value of , since these will be absorbed by the normalization constant. Thus, a term of the form where can be ignored. Collecting terms that do refer to , equating to , and dividing by we get the following equation.

where is a term that does not depend on . To get the desired form of the self-consistent equations, we apply several manipulations. First, we can write since all the variables in

&h 6 (  h& 4R&$ ?  2 4 ? 7 )%" 4  BS  6 ( h % ! $ aU 4R&@?  2 4 § ? ' 4  74 )4bS % $ ! ? $ aU " y ? sqp} $ ri  F S U ` F u w  ( § % §X ( c ?¤&$ % § §  § ( % YB X ( &$ % $§   § $ n$ 6 ( & § &$ ?  2 4 ? ' 4  7 )4" E' B 4 % ! ? u w |{6 $  $ y 6 s? v? $  t ?6 s? s? o$  t h6 o$  $ ? ? u wn ? n S Q § 2? 2? P QH 4" & G W ' B $ y  sqpi  $ r ?¤&$ !   2 4 )%" % 6( & ?' 4 7  S W u w w z v ( W c W ?¤&$ u % % PH & )%" h QI"4 4GF ' B $  ( c ?¤&$ t7 4R&$ !   % w 6(h 2 4 ?' 4 7   6 xv? $  t ?6 s? s§w ? n $ ¨t § ? u U S VT F u? w ( § &$ ?!   ( ! § &$ !   % % §  ( d?y§ &$ % t § ( c ?¤&$ % u§ w  ( § &$ % E  w x v (6 W x? s? W &$ t( W W % n $ %§n $  t q  n $ § ¤ C D3 7  A @ 79 B25 8§ ?u ?( ? ' % ? r( ? % o$ % $ ! ! $ 6 ( 43&@?  2 4 ? ' 4  7 )4"  q n $ sx fg k i h mh § 5 6 ¤ ? § p §  a ljx fg ed 4 ? Q4 ? ??0? 2§ 4  ? !??? & ( § % u § ? ( &$ % §  ! § $ ( 2D ?26 2 ( 7 %$%%"$$ %$ " ?!  2 4 ? ' 4  7 0)4"  (% §  &$  2 2 4 (? '4&7 %$ $ %#$ " " " 1 % ? ?( ? ' % ? ( &$  % $ jx fg k i h mh ? ? § § a ljx fg ed 4 ? I4 ? ??0? !??? & § (  § % u 2§ 4  ( ! § &$ 4 ? 4 ? 4 ? ¨? D 'CBA9 ? ??' D 2 0 ?? ? %   ? § ¤? ¤? §  E ( & % § (   § ?¤&$ t t %$ $? 2  u w § 6 § y r sqpi  $ u ? vqp " E yw ri 6 xv? $  t ?6 $  2 ? $ ? ? u? w w u $ " E srqp i ( ?¤&$ d( $ 4R&$ F ? u % ? % ? ( & S u ?¤&$ w ?( U S 4R&$ § `§  w F % ? % ? y t 4 $ ??  § § h ?? F F y2 ? ?  u sqp  w " E  $ ri  ( '$% B E(  '%   w ? u $  t $ w  g & & a  u   g 6 x?v? ( srqp   '$% h& &a  B 2 0 ¤ ? 6 $  2 ? $ ? i ? g 1f2 e ?C' BA9 ¤ d' D 2 0 ¤ ¨ 6 $  2  $ " !  $ ? ? ?

.

It is easy to verify that the right hand side of this equation does not depend on . Thus, the assignment

and for results in a distribution such that is a stationary point with re. Moreover, it is easy to verspect to changes in ify that the second derivative of with respect to is positive, and thus this point is a local minima. We conclude that , with equality if and only if minimizes (with respect to ?xed choice of , , and for ). We now put all these together. Suppose that and are the projections of on and , respectively. Then,

(8)

Moreover, we have equality only if . This shows that an update step reduces the value of the Lagrangian. The only situation where the value remains the same is when the self consistent equation for is satis?ed. The only remaining issue is to show that this iteration is equivalent to the asynchronous iteration of Eq. (7) when and are the projections of on and , respectively. This is can be easily veri?ed by comparing to Eq. (8).

are independent of

(

?¤&$ %

[5] K. Lang. Learning to ?lter netnews. In 12th Int. Conf. on Machine Learning, pages 331–339. 1995. [6] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kauffman, San Francisco, 1988. [7] F.C. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, pages 183–190. 1993. [8] K. Rose. Deterministic annealing for clustering, compression, classi?cation, regression, and related optimization problems. Proceedings of the IEEE, 86:2210–2239, 1998. [9] N. Slonim & N. Tishby. Agglomerative Information Bottleneck. in Advances in Neural Information Processing Systems (NIPS) 12, pp. 617-623, 1999. [10] N. Slonim, R. Somerville, N. Tishby, and O. Lahav. Objective spectral classi?cation of galaxies using the information bottleneck method. in ”Monthly Notices of the Royal Astronomical Society”, MNRAS, 323, 270, 2001. [11] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In ACM SIGIR 2000, pages 208–215. 2000. [12] N. Slonim and N. Tishby. The power of word clusters for text classi?cation. In 23rd European Colloquium on Information Retrieval Research. 2001. [13] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. 37th Allerton Conference on Communication and Computation. 1999. [14] N. Tishby and N. Slonim. Data clustering by Markovian relaxation and the information bottleneck method. In Neural Information Processing Systems (NIPS-00). 2000.

given . Second, we can transform the latter term into a KL divergence by extracting the term from terms that deal with . Similar transformation applies to the . Third, we use Bayes rule to rewrite . Since

does not involve we can ignore it. To get a KL divergence term, we subtract the term . Finally, we apply the normalization constraints for each distributo get the desired equations. tion

§  ! § )4" $ & ! ?   6 & % § d 4 S 2 4 ? ' 4  7  $ 4" $ %" ! ( 4 $ S &$ $ 6 2 4 B P %" ?  2 4 ? ' 4  7 )4"  § 6 2 4  %" ?  2 4 ? ' 4  7 )4"  $ $ ! B 2 4 #P 7 4 2 B 4 7 4 #P ( W ¤ % X 6 ( h § § §  ? 4 & bS aU 4R&$ ?  2 4 0)4" % ! $ ?7  D  F c ¤ § §

?

? B

9 #



更多相关文章:
基于凝聚式信息瓶颈的加权层次聚类算法.pdf
关黼:层次聚类;架构恢复;面向对象软件;聚类特征;信息瓶颈 WeightedBasedon HierarchicalClusteringAlgorithm AgglomerativeInformationBottleneckHe2,WANGYu.xinl,LIUPin91,...
基于凝聚信息瓶颈的音频事件聚类方法.pdf
2112.2017.05.006 AudioEventsClusteringBasedonAgglomerativeInformationBottleneckLIYan ? xiong,WANGQin,ZHANGXue,ZOULing(SchoolofElectronicandInformationEngineering,...
Agglomerative information bottleneck.pdf
Agglomerative information bottleneck_专业资料。noamm,tishby¢ We introduce a ... The information bottle... 16页 免费 Multivariate informati... 10页 免费...
The information bottleneck EM algorithm.pdf
Multivariate information... 10页 免费 Maximum likelihood and t... 暂无评价 8页 免费 Information bottleneck f... 暂无评价 8页 免费 Agglomerative multi...
Medical Image Segmentation Based on Mutual Information ....pdf
Our approach has similar characteristics to the agglomerative information bottleneck method [6] applied to document clustering. An important application of the ...
in_图文.pdf
resulting of the application of the Information Bottleneck method [6] (...on Distributional clustering adopted an agglomerative clustering algorithm schema....
...information bottleneck method. The annual symposium.pdf
3 The Agglomerative Information Bottleneck The second layer of the image grouping process is based on information theoretic principle, the information bottleneck...
一种新的基于多示例学习的场景分类方法.pdf
I(X;Y) , N.Tishby 等提出一个自底向上合并的算法 AIB (Agglomerative Information Bottleneck) [13]。AIB 算法从一个平凡的聚类(每个样本一个聚类)开始,迭代...
关键点轨迹特征-国防科技大学学报.doc
Key words: human behavior recognition; vocabulary compression;Agglomerative Information Bottleneck;generative model 人体行为识别在视频监控、 人机交互、 运动 与娱乐...
...classification using sequential information maxi....pdf
The results of [4, 14] show that agglomerative clustering methods that are motivated by the Information Bottleneck (IB) method [17] perform well in ...
Maximum likelihood and the information bottleneck.pdf
Tishby. Multivariate Information Bottleneck. In Proc. of UAI-17, 2001. The KL with respect to ? is de?ned as the minimum over all the members in ...
Unsupervised Clustering of Images using their Joint ....pdf
The mixtures and thus images corresponding to them are then clustered using agglomerative Information Bottleneck. Comparing to this work our suggestion to use...
Distributed Document Clustering Using Word-clusters....pdf
This paper presents a distributed document clustering approach called Distributed Information Bottleneck (DIB). DIB adopts a two stage agglomerative Information ...
基于信息论的潜在概念获取与文本聚类.pdf
文献[12]根据信息瓶颈的思想,提出了多元 信息瓶颈(multivariate information bottleneck).与对称 IB 模型不同,本文从概念获取和主题聚类角度出发,模型 的目标函数的...
更多相关标签:

All rights reserved Powered by 甜梦文库 9512.net

copyright ©right 2010-2021。
甜梦文库内容来自网络,如有侵犯请联系客服。zhit325@126.com|网站地图