9512.net

甜梦文库

甜梦文库

当前位置：首页 >> >> # Bayesian Field Theory Nonparametric Approaches to Density Estimation, Regression, Classific

arXiv:physics/9912005v3 [physics.data-an] 7 Mar 2000

Bayesian Field Theory

Nonparametric Approaches to Density Estimation, Regression, Classi?cation, and Inverse Quantum Problems

J¨ org C. Lemm? Institut f¨ ur Theoretische Physik I Universit¨ at M¨ unster Wilhelm–Klemm–Str.9 D–48149 M¨ unster, Germany

Abstract Bayesian ?eld theory denotes a nonparametric Bayesian approach for learning functions from observational data. Based on the principles of Bayesian statistics, a particular Bayesian ?eld theory is de?ned by combining two models: a likelihood model, providing a probabilistic description of the measurement process, and a prior model, providing the information necessary to generalize from training to non–training data. The particular likelihood models discussed in the paper are those of general density estimation, Gaussian regression, clustering, classi?cation, and models speci?c for inverse quantum problems. Besides problem typical hard constraints, like normalization and non– negativity for probabilities, prior models have to implement all the speci?c, and often vague, a priori knowledge available for a speci?c task. Nonparametric prior models discussed in the paper are Gaussian processes, mixtures of Gaussian processes, and non–quadratic potentials. Prior models are made ?exible by including hyperparameters.

?

Email: lemm@uni-muenster.de, WWW: http://pauli.uni-muenster.de/? lemm/

1

In particular, the adaption of mean functions and covariance operators of Gaussian process components is discussed in detail. Even if constructed using Gaussian process building blocks, Bayesian ?eld theories are typically non–Gaussian and have thus to be solved numerically. According to increasing computational resources the class of non–Gaussian Bayesian ?eld theories of practical interest which are numerically feasible is steadily growing. Models which turn out to be computationally too demanding can serve as starting point to construct easier to solve parametric approaches, using for example variational techniques.

Contents

1 Introduction 2 Bayesian framework 2.1 Basic model and notations . . . . . . . . . . . . . . . . . 2.1.1 Independent, dependent, and hidden variables . . 2.1.2 Energies, free energies, and errors . . . . . . . . . 2.1.3 Posterior and likelihood . . . . . . . . . . . . . . 2.1.4 Predictive density . . . . . . . . . . . . . . . . . . 2.1.5 Mutual information and learning . . . . . . . . . 2.2 Bayesian decision theory . . . . . . . . . . . . . . . . . . 2.2.1 Loss and risk . . . . . . . . . . . . . . . . . . . . 2.2.2 Loss functions for approximation . . . . . . . . . 2.2.3 General loss functions and unsupervised learning 2.3 Maximum A Posteriori Approximation . . . . . . . . . . 2.4 Normalization, non–negativity, and speci?c priors . . . . 2.5 Empirical risk minimization . . . . . . . . . . . . . . . . 2.6 Interpretations of Occam’s razor . . . . . . . . . . . . . . 2.7 A priori information and a posteriori control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 9 9 9 11 14 16 16 20 20 22 24 25 29 32 33 34 39 39 39 44 46 47

3 Gaussian prior factors 3.1 Gaussian prior factor for log–probabilities . . . . . . . . . . . 3.1.1 Lagrange multipliers: Error functional EL . . . . . . . 3.1.2 Normalization by parameterization: Error functional Eg 3.1.3 The Hessians HL , Hg . . . . . . . . . . . . . . . . . . . 3.2 Gaussian prior factor for probabilities . . . . . . . . . . . . . . 2

3.3

3.4

3.5 3.6 3.7

3.8 3.9

3.2.1 Lagrange multipliers: Error functional EP . . . . . . . 3.2.2 Normalization by parameterization: Error functional Ez 3.2.3 The Hessians HP , Hz . . . . . . . . . . . . . . . . . . . General Gaussian prior factors . . . . . . . . . . . . . . . . . . 3.3.1 The general case . . . . . . . . . . . . . . . . . . . . . 3.3.2 Example: Square root of P . . . . . . . . . . . . . . . . 3.3.3 Example: Distribution functions . . . . . . . . . . . . . Covariances and invariances . . . . . . . . . . . . . . . . . . . 3.4.1 Approximate invariance . . . . . . . . . . . . . . . . . 3.4.2 Approximate symmetries . . . . . . . . . . . . . . . . . 3.4.3 Example: In?nitesimal translations . . . . . . . . . . . 3.4.4 Example: Approximate periodicity . . . . . . . . . . . Non–zero means . . . . . . . . . . . . . . . . . . . . . . . . . . Quadratic density estimation and empirical risk minimization Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7.1 Gaussian regression . . . . . . . . . . . . . . . . . . . . 3.7.2 Exact predictive density . . . . . . . . . . . . . . . . . 3.7.3 Gaussian mixture regression (cluster regression) . . . . 3.7.4 Support vector machines and regression . . . . . . . . . Classi?cation . . . . . . . . . . . . . . . . . . . . . . . . . . . Inverse quantum mechanics . . . . . . . . . . . . . . . . . . . methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47 49 50 52 52 53 55 56 56 57 58 58 59 62 64 64 71 74 75 76 77 81 81 83 85 86 87 89 90 90 91 95 95 98 98 99

4 Parameterizing likelihoods: Variational 4.1 General parameterizations . . . . . . . 4.2 Gaussian priors for parameters . . . . . 4.3 Linear trial spaces . . . . . . . . . . . . 4.4 Mixture models . . . . . . . . . . . . . 4.5 Additive models . . . . . . . . . . . . . 4.6 Product ansatz . . . . . . . . . . . . . 4.7 Decision trees . . . . . . . . . . . . . . 4.8 Projection pursuit . . . . . . . . . . . . 4.9 Neural networks . . . . . . . . . . . . .

5 Parameterizing priors: Hyperparameters 5.1 Prior normalization . . . . . . . . . . . . 5.2 Adapting prior means . . . . . . . . . . . 5.2.1 General considerations . . . . . . 5.2.2 Density estimation . . . . . . . . 3

5.3

5.4 5.5 5.6

5.2.3 Unrestricted variation . . . . . 5.2.4 Regression . . . . . . . . . . . . Adapting prior covariances . . . . . . . 5.3.1 General case . . . . . . . . . . . 5.3.2 Automatic relevance detection . 5.3.3 Local smoothness adaption . . . 5.3.4 Local masses and gauge theories 5.3.5 Invariant determinants . . . . . 5.3.6 Regularization parameters . . . Exact posterior for hyperparameters . Integer hyperparameters . . . . . . . . Local hyper?elds . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . .

100 101 103 103 104 104 105 106 109 111 115 116

6 Non–Gaussian prior factors 6.1 Mixtures of Gaussian prior factors . . . . . . 6.2 Prior mixtures for density estimation . . . . 6.3 Prior mixtures for regression . . . . . . . . . 6.3.1 High and low temperature limits . . 6.3.2 Equal covariances . . . . . . . . . . . 6.3.3 Analytical solution of mixture models 6.4 Local mixtures . . . . . . . . . . . . . . . . 6.5 Non–quadratic potentials . . . . . . . . . . .

123 . 123 . 125 . 126 . 127 . 129 . 131 . 135 . 137 . . . . . . . . . . . . . . 140 140 143 143 144 145 149 150 151 153 153 155 156 158 158

7 Iteration procedures: Learning 7.1 Numerical solution of stationarity equations . . . . . . 7.2 Learning matrices . . . . . . . . . . . . . . . . . . . . . 7.2.1 Learning algorithms for density estimation . . . 7.2.2 Linearization and Newton algorithm . . . . . . 7.2.3 Massive relaxation . . . . . . . . . . . . . . . . 7.2.4 Gaussian relaxation . . . . . . . . . . . . . . . . 7.2.5 Inverting in subspaces . . . . . . . . . . . . . . 7.2.6 Boundary conditions . . . . . . . . . . . . . . . 7.3 Initial con?gurations and kernel methods . . . . . . . . 7.3.1 Truncated equations . . . . . . . . . . . . . . . 7.3.2 Kernels for L . . . . . . . . . . . . . . . . . . . 7.3.3 Kernels for P . . . . . . . . . . . . . . . . . . . 7.4 Numerical examples . . . . . . . . . . . . . . . . . . . . 7.4.1 Density estimation with Gaussian speci?c prior 4

7.4.2

Density estimation with Gaussian mixture prior . . . . 163

1

Introduction

The last decade has seen a rapidly growing interest in learning from observational data. Increasing computational resources enabled successful applications of empirical learning algorithms in various areas including, for example, time series prediction, image reconstruction, speech recognition, computer tomography, and inverse scattering and inverse spectral problems for quantum mechanical systems. Empirical learning, i.e., the problem of ?nding underlying general laws from observations, represents a typical inverse problem and is usually ill–posed in the sense of Hadamard [215, 216, 219, 145, 115, 221]. It is well known that a successful solution of such problems requires additional a priori information. It is a priori information which controls the generalization ability of a learning system by providing the link between available empirical “training” data and unknown outcome in future “test” situations. We will focus mainly on nonparametric approaches, formulated directly in terms of the function values of interest. Parametric methods, on the other hand, impose typically implicit restrictions which are often extremely di?cult to relate to available a priori knowledge. Combined with a Bayesian framework [12, 16, 33, 144, 197, 171, 18, 69, 207, 35, 230, 42, 105, 104], a nonparametric approach allows a very ?exible and interpretable implementation of a priori information in form of stochastic processes. Nonparametric Bayesian methods can easily be adapted to di?erent learning situations and have therefore been applied to a variety of empirical learning problems, including regression, classi?cation, density estimation and inverse quantum problems [167, 232, 142, 141, 137, 217]. Technically, they are related to kernel and regularization methods which often appear in the form of a roughness penalty approach [216, 219, 187, 206, 150, 223, 90, 83, 116, 221]. Computationally, working with stochastic processes, or discretized versions thereof, is more demanding than, for example, ?tting a small number of parameters. This holds especially for such applications where one cannot take full advantage of the convenient analytical features of Gaussian processes. Nevertheless, it seems to be the right time to study nonparametric Bayesian approaches also for non–Gaussian problems as they become computationally feasible now at least for low dimensional systems and, even if not directly solvable, they provide a well de?ned basis for further approximations. 5

In this paper we will in particular study general density estimation problems. Those include, as special cases, regression, classi?cation, and certain types of clustering. In density estimation the functions of interest are the probability densities p(y |x, h), of producing output (“data”) y under condition x and unknown state of Nature h. Considered as function of h, for ?xed y , x, the function p(y |x, h) is also known as likelihood function and a Bayesian approach to density estimation is based on a probabilistic model for likelihoods p(y |x, h). We will concentrate on situations where y and x are real variables, possibly multi–dimensional. In a nonparametric approach, the variable h represents the whole likelihood function p(y |x, h). That means, h may be seen as the collection of the numbers 0 ≤ p(y |x, h) ≤ 1 for all x and all y . The dimension of h is thus in?nite, if the number of values which the variables x and/or y can take is in?nite. This is the case for real x and/or y . A learning problem with discrete y variable is also called a classi?cation problem. Restricting to Gaussian probabilities p(y |x, h) with ?xed variance leads to (Gaussian) regression problems. For regression problems the aim is to ?nd an optimal regression function h(x). Similarly, adapting a mixture of Gaussians allows soft clustering of data points. Furthermore, extracting relevant features from the predictive density p(y |x, data) is the Bayesian analogue of unsupervised learning. Other special density estimation problems are, for example, inverse problems in quantum mechanics where h represents a unknown potential to be determined from observational data [142, 141, 137, 217]. Special emphasis will be put on the explicit and ?exible implementation of a priori information using, for example, mixtures of Gaussian prior processes with adaptive, non–zero mean functions for the mixture components. Let us now shortly explain what is meant by the term “Bayesian Field Theory”: From a physicists point of view functions, like h(x, y ) = p(y |x, h), depending on continuous variables x and/or y , are often called a ‘?eld’.1 Most times in this paper we will, as common in ?eld theories in physics, not parameterize these ?elds and formulate the relevant probability densities or stochastic processes, like the prior p(h) or the posterior p(h|f ), directly in terms of the ?eld values h(x, y ), e.g., p(h|f ) = p(h(x, y ), x ∈ X, y ∈ Y |f ). (In the parametric case, discussed in Chapter 4, we obtain a probability density

We may also remark that for example statistical ?eld theories, which encompass quantum mechanics and quantum ?eld theory in their Euclidean formulation, are technically similar to a nonparametric Bayesian approach [244, 103, 126].

1

6

p(h|f ) = p(ξ |f ) for ?elds h(x, y, ξ ) parameterized by ξ .) The possibility to solve Gaussian integrals analytically makes Gaussian processes, or (generalized) free ?elds in the language of physicists, very attractive for nonparametric learning. Unfortunately, only the case of Gaussian regression is completely Gaussian. For general density estimation problems the likelihood terms are non–Gaussian, and even for Gaussian priors additional non–Gaussian restrictions have to be included to ensure non–negativity and normalization of densities. Hence, in the general case, density estimation corresponds to a non–Gaussian, i.e., interacting ?eld theory. As it is well known from physics, a continuum limit for non-Gaussian theories, based on the de?nition of a renormalization procedure, can be highly nontrivial to construct. (See [20, 5] for an renormalization approach to density estimation.) We will in the following not discuss such renormalization procedures but focus more on practical, numerical learning algorithms, obtained by discretizing the problem (typically, but not necessarily in coordinate space). This is similar, for example, to what is done in lattice ?eld theories. Gaussian problems live e?ectively in a space with dimension not larger than the number of training data. This is not the case for non–Gaussian problems. Hence, numerical implementations of learning algorithms for non– Gaussian problems require to discretize the functions of interest. This can be computationally challenging. For low dimensional problems, however, many non–Gaussian models are nowadays solvable on a standard PC. Examples include predictions of one– dimensional time series or the reconstruction of two–dimensional images. Higher dimensional problems require additional approximations, like projections into lower dimensional subspaces or other variational approaches. Indeed, it seems that a most solvable high dimensional problems live e?ectively in some low dimensional subspace. There are special situations in classi?cation where non–negativity and normalization constraints are ful?lled automatically. In that case, the calculations can still be performed in a space of dimension not larger than the number of training data. Contrasting Gaussian models, however the equations to be solved are then typically nonlinear. Summarizing, we will call a nonparametric Bayesian model to learn a function one or more continuous variables a Bayesian ?eld theory, having especially in mind non–Gaussian models. A large variety of Bayesian ?eld theories can be constructed by combining a speci?c likelihood models with 7

speci?c functional priors (see Tab. 1). The resulting ?exibility of nonparametric Bayesian approaches is probably their main advantage. likelihood model describes measurement process (Chap. 2) generalization behavior (Chap. 2) prior model

is determined by parameters (Chap. 3, 4) hyperparameters (Chap. 5)

Examples include density estimation(Sects. 3.1–3.6, 6.2) regression (Sects. 3.7, 6.3) classi?cation (Sect. 3.8) inverse quantum theory (Sect. 3.9) hard constraints (Chap. 2) Gaussian prior factors (Chap. 3) mixtures of Gauss. (Sects. 6.1–6.4) non–quadratic potentials(Sect. 6.5)

Learning algorithms are treated in Chapter 7. Table 1: A Bayesian approach is based on the combination of two models, a likelihood model, describing the measurement process used to obtain the training data, and a prior model, enabling generalization to non–training data. Parameters of the prior model are commonly called hyperparameters. In “nonparametric” approaches the collection of all values of the likelihood function itself are considered as the parameters. A nonparametric Bayesian approach for likelihoods depending on one or more real variables is in this paper called a Bayesian ?eld theory.

8

The paper is organized as follows: Chapter 2 summarizes the Bayesian framework as needed for the subsequent chapters. Basic notations are de?ned, an introduction to Bayesian decision theory is given, and the role of a priori information is discussed together with the basics of a Maximum A Posteriori Approximation (MAP), and the speci?c constraints for density estimation problems. Gaussian prior processes, being the most commonly used prior processes in nonparametric statistics, are treated in Chapter 3. In combination with Gaussian prior models, this section also introduces the likelihood models of density estimation, (Sections 3.1, 3.2, 3.3) Gaussian regression and clustering (Section 3.7), classi?cation (Section 3.8), and inverse quantum problems (Section 3.9). Notice, however, that all these likelihood models can also be combined with the more elaborated prior models discussed in the following sections of the paper. Parametric approaches, useful if a numerical solution of a full nonparametric approach is not feasible, are the topic of Chapter 4. Hyperparameters, parameterizing prior processes and making them more ?exible, are considered in Section 5. Two possibilities to go beyond Gaussian processes, mixture models and non–quadratic potentials, are presented in Section 6. Chapter 7 focuses on learning algorithms, i.e., on methods to solve the stationarity equations resulting from a Maximum A Posteriori Approximation. In this section one can also ?nd numerical solutions of Bayesian ?eld theoretical models for general density estimation.

2

2.1

2.1.1

Bayesian framework

Basic model and notations

Independent, dependent, and hidden variables

Constructing theories means introducing concepts which are not directly observable. They should, however, explain empirical ?ndings and thus have to be related to observations. Hence, it is useful and common to distinguish observable (visible) from non–observable (hidden) variables. Furthermore, it is often convenient to separate visible variables into dependent variables, representing results of such measurements the theory is aiming to explain, and independent variables, specifying the kind of measurements performed and not being subject of the theory. Hence, we will consider the following three groups of variables 9

1. observable (visible) independent variables x, 2. observable (visible) dependent variables y , 3. not directly observable (hidden, latent) variables h. This characterization of variables translates to the following factorization property, de?ning the model we will study, p(x, y, h) = p(y |x, h) p(x) p(h). (1)

In particular, we will be interested in scenarios where x = (x1 , x2 , · · ·) and analogously y = (y1 , y2 , · · ·) are decomposed into independent components, meaning that p(y |x, h) = i p(yi |xi , h) and p(x) = i p(xi ) factorize. Then, p(x, y, h) =

i

p(yi |xi , h) p(xi ) p(h).

(2)

Fig.1 shows a graphical representation of the factorization model (2) as a directed acyclic graph [182, 125, 107, 196]. The xi and/or yi itself can also be vectors. The interpretation will be as follows: Variables h ∈ H represent possible states of (the model of ) Nature, being the invisible conditions for dependent variables y . The set H de?nes the space of all possible states of Nature for the model under study. We assume that states h are not directly observable and all information about p(h) comes from observed variables (data) y , x. A given set of observed data results in a state of knowledge f numerically represented by the posterior density p(h|f ) over states of Nature. Independent variables x ∈ X describe the visible conditions (measurement situation, measurement device) under which dependent variables (measurement results) y have been observed (measured). According to Eq. (1) they are independent of h, i.e., p(x|h) = p(x). The conditional density p(y |x, h) of the dependent variables y is also known as likelihood of h (under y given x). Vector–valued y can be treated as a collection of one–dimensional y with the vector index being part of the x variable, i.e., yα (x) = y (x, α) = y (? x) with x ? = (x, α). In the setting of empirical learning available knowledge is usually separated into a ?nite number of training data D = {(xi , yi)|1 ≤ i ≤ n} ={(xD , yD ) and, to make the problem well de?ned, additional a priori information D0 . For data D ∪ D0 we write p(h|f ) = p(h|D, D0 ). Hypotheses h 10

x1

?

x2

?

···

xn

?

y1

I

y2

M

···

yn

h

Figure 1: Directed acyclic graph for the factorization model (1). represent in this setting functions h(x, y ) = p(y |x, h) of two (possibly multidimensional) variables y , x. In density estimation y is a continuous variable (the variable x may be constant and thus be skipped), while in classi?cation problems y takes only discrete values. In regression problems on assumes p(y |x, h) to be Gaussian with ?xed variance, so the function of interest becomes the regression function h(x) = dy yp(y |x, h). 2.1.2 Energies, free energies, and errors

Often it is more convenient to work with log–probabilities L = ln p than with probabilities. Firstly, this ensures non–negativity of probabilities p = eL ≥ 0 for arbitrary L. (For p = 0 the log–probability becomes L = ?∞.) Thus, when working with log–probabilities one can skip the non–negativity constraint which would be necessary when working with probabilities. Secondly, the multiplication of probabilities for independent events, yielding their joint probability, becomes a sum when written in terms of L. Indeed, from p(A, B ) = p(A and B ) = p(A)p(B ) it follows for L(A, B ) = ln P (A, B ) that L(A, B ) =L(A and B ) = L(A)L(B ). Especially in the limit where an in?nite number of events is combined by and, this would result in an in?nite product for p but yields an integral for L, which is typically easier to treat. Besides the requirement of being non–negative, probabilities have to be normalized, e.g., dx p(x) = 1. When dealing with a large set of elementary events normalization is numerically a nontrivial task. It is then convenient to work as far as possible with unnormalized probabilities Z (x) from which normalized probabilities are obtained as p(x) = Z (x)/Z with partition sum Z = x Z (x). Like for probabilities, it is also often advantageous to work with 11

the logarithm of unnormalized probabilities, or to get positive numbers (for p(x) < 1) with the negative logarithm E (x) = ?(1/β ) ln Z (x), in physics also known as energy. (For the role of β see below.) Similarly, F = ?(1/β ) ln Z is known as free energy. De?ning the energy we have introduced a parameter β . Varying the parameter β generates an exponential family of densities which is frequently used in practice by (simulated or deterministic) annealing techniques for minimizing free energies [114, 153, 195, 43, 1, 199, 238, 68, 239, 240]. In physics β is known as inverse temperature and plays the role of a Lagrange multiplier in the maximum entropy approach to statistical physics. Inverse temperature β can also be seen as an external ?eld coupling to the energy. Indeed, the free energy F is a generating function for the cumulants of the energy, meaning that cumulants of E can be obtained by taking derivatives of F with respect to β [65, 9, 13, 160]. For a detailled discussion of the relations between probability, log–probability, energy, free energy, partition sums, generating functions, and also bit numbers and information see [132]. The posterior p(h|f ), for example, can so be written as p(h|f ) = eL(h|f ) = Z (h|f ) e?βE (h|f ) = Z (H | f ) Z (H | f ) ?β (E (h|f )?F (H |f )) ?βE (h|f )+c(H |f ) = e =e ,

(3)

with (posterior) log–probability L(h|f ) = ln p(h|f ), unnormalized (posterior) probabilities or partition sums Z (h|f ), (posterior) energy Z (H | f ) = dh Z (h|f ), (5) (4)

1 E (h|f ) = ? ln Z (h|f ) β 1 F (H |f ) = ? ln Z (H |f ) β 1 = ? ln dh e?βE (h|f ) , β 12

(6)

and (posterior) free energy (7) (8)

yielding Z (h|f ) = e?βE (h|f ) , Z (H | f ) = (9) (10)

dh e?βE (h|f ),

where dh represent a (functional) integral, for example over variables (functions) h(x, y ) = p(y |x, h), and c(H |f ) = ? ln Z (H |f ) = βF (H |f ). (11)

Note that we did not include the β –dependency of the functions Z , F , c in the notation. For the sake of clarity, we have chosen to use the common notation for conditional probabilities also for energies and the other quantities derived from them. The same conventions will also be used for other probabilities, so we will write for example for likelihoods p(y |x, h) = e?β (E (y|x,h)?F (Y |x,h)) ,

′

(12)

for y ∈ Y . Inverse temperatures may be di?erent for prior and likelihood. Thus, we may choose β ′ = β in Eq. (12) and Eq. (3). In Section 2.3 we will discuss the maximum a posteriori approximation where an optimal h is found by maximizing the posterior p(h|f ). Since maximizing the posterior means minimizing the posterior energy E (h|f ) the latter plays the role of an error functional for h to be minimized. This is technically similar to the minimization of an regularized error functional as it appears in regularization theory or empirical risk minimization, and which is discussed in Section 2.5. Let us have a closer look to the integral over model states h. The variables h represent the parameters describing the data generating probabilities or likelihoods p(y |x, h). In this paper we will mainly be interested in “nonparametric” approaches where the (x, y, h)–dependent numbers p(y |x, h) itself are considered to be the primary degrees of freedom which “parameterize” the model states h. Then, the integral over h is an integral over a set of real variables indexed by x, y , under additional non–negativity and normalization condition. dh →

x,y

dp(y |x, h) .

(13)

13

Mathematical di?culties arise for the case of continuous x, y where p(h|f ) represents a stochastic process. and the integral over h becomes a functional integral over (non–negative and normalized) functions p(y |x, h). For Gaussian processes such a continuum limit can be de?ned [51, 77, 223, 143, 149] while the construction of continuum limits for non–Gaussian processes is highly non–trivial (See for instance [48, 37, 103, 244, 184, 228, 229, 34, 192] for perturbative approaches or [77] for a non–perturbative φ4 –theory.) In this paper we will take the numerical point of view where all functions are considered to be ?nally discretized, so the h–integral is well–de?ned (“lattice regularization” [41, 200, 160]). 2.1.3 Posterior and likelihood

Bayesian approaches require the calculation of posterior densities. Model states h are commonly speci?ed by giving the data generating probabilities or likelihoods p(y |x, h). Posteriors are linked to likelihoods by Bayes’ theorem p(A|B ) = p(B |A)p(A) , p (B ) (14)

which follows at once from the de?nition of conditional probabilities, i.e., p(A, B ) = p(A|B )p(B ) = p(B |A)p(A). Thus, one ?nds p(h|f ) = p(h|D, D0 ) = =

i

p(D |h) p(h|D0) p(yD |xD , h) p(h|D0) = p(D |D0) p(yD |xD , D0 )

(15) (16)

using p(yD |xD , D0 , h) = p(yD |xD , h) for the training data likelihood of h and p(h|D0 , xi ) = p(h|D0 ). The terms of Eq. (15) are in a Bayesian context often referred to as likelihood × prior posterior = . (17) evidence Eqs.(16) show that the posterior can be expressed equivalently by the joint likelihoods p(yi , xi |h) or conditional likelihoods p(yi|xi , h). When working with joint likelihoods, a distinction between y and x variables is not necessary. In that case x can be included in y and skipped from the notation. If, however, p(x) is already known or is not of interest working with conditional likelihoods is preferable. Eqs.(15,16) can be interpreted as updating 14

p(xi , yi |h)p(h|D0 ) i p(yi |xi , h)p(h|D0 ) = , dh i p(xi , yi|h)p(h|D0 ) dh i p(yi |xi , h)p(h|D0 )

(or learning) formula used to obtain a new posterior from a given prior probability if new data D arrive. In terms of energies Eq. (16) reads, p(h|f ) = e?β i E (yi |xi,h)?βE (h|D0 ) Z (YD |xD , h) Z (H |D0) dh Z (YD |xD , h) Z (H |D0)

i

e?β

E (yi |xi ,h)?βE (h|D0 )

,

(18)

where the same temperature 1/β has been chosen for both energies and the normalization constants are Z (YD |xD , h) = Z (H |D0 ) = dyi e?βE (yi |xi,h) ,

i

(19) (20)

dh e?βE (h|D0) .

The predictive density we are interested in can be written as the ratio of two correlation functions under p0 (h), p(y |x, f ) = < p(y |x, h) >H |f < p(y |x, h) i p(yi |xi , h) >H |D0 = , < i p(yi|xi , h) >H |D0 dh p(y |x, h) e?βEcomb = dh e?βEcomb (21) (22) (23)

where < · · · >H |D0 denotes the expectation under the prior density p0 (h) = p(h|D0 ) and the combined likelihood and prior energy Ecomb collects the h–dependent energy and free energy terms Ecomb =

i

E (yi|xi , h) + E (h|D0 ) ? F (YD |xD , h),

(24)

with

1 F (YD |xD , h) = ? ln Z (YD |xD , h). β

(25)

Going from Eq. (22) to Eq. (23) the normalization factor Z (H |D0 ) appearing in numerator and denominator has been canceled. We remark that for continuous x and/or y the likelihood energy term E (yi |xi , h) describes an ideal, non–realistic measurement because realistic measurements cannot be arbitrarily sharp. Considering the function p(·|·, h) as element of a Hilbert space its values may be written as scalar product 15

p(x|y, h) = (vxy , p(·|·, h) ) with a function vxy being also an element in that Hilbert space. For continuous x and/or y this notation is only formal as vxy becomes unnormalizable. In practice a measurement of p(·|·, h) corresponds to a normalizable vx ?y ? = dy dx ?(x, y )vxy where the kernel ?(x, y ) has to ensure normalizability. (Choosing normalizable vx ?y ? as coordinates the Hilbert space of p(·|·, h) is also called a reproducing kernel Hilbert space [180, 112, 113, 223, 143].) The data terms then become p(? y i |x ?i , h) = dy dx ?i (x, y )p(y, x|h) . dy ?i (x, y )p(y, x|h) (26)

The notation p(yi |xi , h) is understood as limit ?(x, y ) → δ (x ? xi )δ (y ? yi ) and means in practice that ?(x, y ) is very sharply centered. We will assume that the discretization, ?nally necessary to do numerical calculations, will implement such an averaging. 2.1.4 Predictive density

Within a Bayesian approach predictions about (e.g., future) events are based on the predictive probability density, being the expectation of probability for y for given (test) situation x, training data D and prior data D0 p(y |x, f ) = p(y |x, D, D0) = dh p(h|f ) p(y |x, h) = < p(y |x, h) >H |f . (27)

Here < · · · >H |f denotes the expectation under the posterior p(h|f ) = p(h|D, D0), the state of knowledge f depending on prior and training data. Successful applications of Bayesian approaches rely strongly on an adequate choice of the model space H and model likelihoods p(y |x, h). Note that p(y |x, f ) is in the convex cone spanned by the possible states of Nature h ∈ H , and typically not equal to one of these p(y |x, h). The situation is illustrated in Fig. 2. During learning the predictive density p(y |x, f ) tends to approach the true p(y |x, h). Because the training data are random variables, this approach is stochastic. (There exists an extensive literature analyzing the stochastic properties of learning and generalization from a statistical mechanics perspective [62, 63, 64, 226, 234, 175]). 2.1.5 Mutual information and learning

The aim of learning is to generalize the information obtained from training data to non–training situations. For such a generalization to be possible, 16

= ? p(y |x, hi), hi ∈ H

p(y |x, htrue )

F p(y |x, f )

Figure 2: The predictive density p(y |x, f ) for a state of knowledge f = f (D, D0) is in the convex hull spanned by the possible states of Nature hi characterized by the likelihoods p(y |x, hi). During learning the actual predictive density p(y |x, f ) tends to move stochastically towards the extremal point p(y |x, htrue ) representing the “true” state of Nature. there must exist a, at least partially known, relation between the likelihoods p(yi|xi , h) for training and for non–training data. This relation is typically provided by a priori knowledge. One possibility to quantify the relation between two random variables y1 and y2 , representing for example training and non–training data, is to calculate their mutual information, de?ned as M (Y 1 , Y 2 ) =

y1 ∈Y1 ,y2 ∈Y2

p(y1 , y2) ln

p (y 1 , y 2 ) . p (y 1 )p (y 2 )

(28)

It is also instructive to express the mutual information in terms of (average) information content or entropy, which, for a probability function p(y ), is de?ned as H (Y ) = ? ln p(y ) ln p(y ). (29) We ?nd

y ∈Y

meaning that the mutual information is the sum of the two individual entropies diminished by the entropy common to both variables. 17

M (Y 1 , Y 2 ) = H (Y 1 ) + H (Y 2 ) ? H (Y 1 , Y 2 ),

(30)

To have a compact notation for a family of predictive densities p(yi |xi , f ) we choose a vector x = (x1 , x2 , · · ·) consisting of all possible values xi and corresponding vector y = (y1 , y2 , · · ·), so we can write p(y |x, f ) = p(y1 , y2 , · · · |x1 , x2 , · · · , f ). (31)

We now would like to characterize a state of knowledge f corresponding to predictive density p(y |x, f ) by its mutual information. Thus, we generalize the de?nition (28) from two random variables to a random vector y with components yi, given vector x with components xi and obtain the conditional mutual information M (Y |x, f ) = or M (Y |x, f ) = dyi H (Yi |x, f ) ? H (Y |x, f ) , (33) in terms of conditional entropies H (Y |x, f ) = ? dy p(y |x, f ) ln p(y |x, f ). (34) dyi p(y |x, f ) ln p(y |x, f ) , j p(yj |xj , f ) (32)

i

In case not a ?xed vector x is given, like for example x = (x1 , x2 , · · ·), but a density p(x), it is useful to average the conditional mutual information and conditional entropy by including the integral dx p(x) in the above formulae. It is clear from Eq. (32) that predictive densities which factorize p(y |x, f ) =

i

p(yi|xi , f ),

(35)

have a mutual information of zero. Hence, such factorial states do not allow any generalization from training to non–training data. A special example are the possible states of Nature or pure states h, which factorize according to the de?nition of our model p(y |x, h) =

i

p(yi|xi , h).

(36)

Thus, pure states do not allow any further generalization. This is consistent with the fact that pure states represent the natural endpoints of any learning process. 18

It is interesting to see, however, that there are also other states for which the predictive density factorizes. Indeed, from Eq. (36) it follows that any (prior or posterior) probability p(h) which factorizes leads to a factorial state, p(h) =

i

p(h(xi )) ? p(y |x, f ) =

i

p(yi |xi , f ).

(37)

This means generalization, i.e., (non–local) learning, is impossible when starting from a factorial prior. A factorial prior provides a very clear reference for analyzing the role of a–priori information in learning. In particular, with respect to a prior factorial in local variables xi , learning may be decomposed into two steps, one increasing, the other lowering mutual information: 1. Starting from a factorial prior, new non–local data D0 (typically called a priori information) produce a new non–factorial state with non–zero mutual information. 2. Local data D (typically called training data) stochastically reduce the mutual information. Hence, learning with local data corresponds to a stochastic decay of mutual information. Pure states, i.e., the extremal points in the space of possible predictive densities, do not have to be deterministic. Improving measurement devices, stochastic pure states may be further decomposed into ?ner components g , so that p(yi|xi , h) = dg p(g ) p(yi|xi , g ). (38) Imposing a non–factorial prior p(g ) on the new, ?ner hypotheses g enables again non–local learning with local data, leading asymptotically to one of the new pure states p(yi |xi , g ). Let us exemplify the stochastic decay of mutual information by a simple numerical example. Because the mutual information requires the integration over all yi variables we choose a problem with only two of them, ya and yb corresponding to two x values xa and xb . We consider a model with four of Nature hl , 1 ≤ l ≤ 4, with Gaussian likelihood p(y |x, h) = √ states ?1 ( 2πσ ) exp (?(y ? hi (x))2 /(2σ 2 )) and local means hl (xj ) = ±1. Selecting a “true” state of Nature h, we sample 50 data points Di = (xi , yi ) from the corresponding Gaussian likelihood using p(xa ) = p(xb ) =

19

0.5. Then, starting from a given, factorial or non–factorial, prior p(h|D0 ) we sequentially update the predictive density,

4

p(y |x, f (Di+1, · · · , D0 )) = by calculating the posterior p(hl |Di+1 , · · · , D0 ) =

l=1

p(y |x, hl ) p(hl |Di+1 , · · · , D0 ),

(39)

It is easily seen from Eq. (40) that factorial states remain factorial under local data. Fig. 3 shows that indeed the mutual information decays rapidly. Depending on the training data, still the wrong hypothesis hl may survive the decay of mutual information. Having arrived at a factorial state, further learning has to be local. That means, data points for xi can then only in?uence the predictive density for the corresponding yi and do not allow generalization to the other yj with j = i. For a factorial prior p(hl ) = p(hl (xa ))p(hl (xb )) learning is thus local from the very beginning. Only very small numerical random ?uctuations of the mutual information occur, quickly eliminated by learning. Thus, the predictive density moves through a sequence of factorial states.

p(yi+1 |xi+1 , hl ) p(hj |Di · · · , D0 ) . p(yi+1 |xi+1 , Di · · · , D0 )

(40)

2.2

2.2.1

Bayesian decision theory

Loss and risk

In Bayesian decision theory a set A of possible actions a is considered, together with a function l(x, y, a) describing the loss l su?ered in situation x if y appears and action a is selected [16, 127, 182, 197]. The loss averaged over test data x, y , and possible states of Nature h is known as expected risk, r (a, f ) = dx dy p(x) p(y |x, f ) l(x, y, a). (41) (42) (43)

= < l(x, y, a) >X,Y |f = < r (a, h) >H |f

where < · · · >X,Y |f denotes the expectation under the joint predictive density p(x, y |f ) = p(x)p(y |x, f ) and r (a, h) = dx dy p(x) p(y |x, h) l(x, y, a). 20 (44)

posterior

1 0.8 0.6 0.004 0.4 0.2 0.001 10 1 0.8 0.6 0.4 0.2 0.001 10 1 20 30 40 50 10 20 30 40 50 0.007 0.006 0.005 0.004 0.003 0.002 10 0.003 0.002

mutual information

0.007 0.006 0.005

(a)

(b)

20

30

40

50

(c)

(d)

20

30

40

50

(e)

0.8 0.6 0.4 0.2 2. 10

-16

(f)

-16 1.5 10 -16 1. 10 -17 5. 10 10 -5. 10 -17 -16 -1. 10 -16 50 -1.5 10 20 30 40 50

10

20

30

40

Figure 3: The decay of mutual information during learning: Model with 4 possible states hl representing Gaussian likelihoods p(yi|xi , hl ) with means ±1 for two di?erent xi values. Shown are posterior probabilities p(hl |f ) (a, c, e, on the left hand side, the posterior of the true hl is shown by a thick line) and mutual information M (y ) (b, d, f , on the right hand side) during learning 50 training data. (a, b): The mutual information decays during learning and becomes quickly practically zero. (c, d): For “unlucky” training data the wrong hypothesis hi can dominate at the beginning. Nevertheless, the mutual information decays and the correct hypothesis has ?nally to be found through “local” learning. (e, f ): Starting with a factorial prior the mutual information is and remains zero, up to arti?cial numerical ?uctuations. For (e, f ) the same random data have been used as for (c, d). 21

The aim is to ?nd an optimal action a? a? = argmina∈A r (a, f ). 2.2.2 Loss functions for approximation (45)

Log–loss: A typical loss function for density estimation problems is the log– loss l(x, y, a) = ?b1 (x) ln p(y |x, a) + b2 (x, y ) (46) with some a–independent b1 (x) > 0, b2 (x, y ) and actions a describing probability densities dy p(y |x, a) = 1, ?x ∈ X, ?a ∈ A. (47) Choosing b2 (x, y ) = p(y |x, f ) and b1 (x) = 1 gives r (a, f ) = dx dy p(x)p(y |x, f ) ln p(y |x, f ) p(y |x, a) (48) (49) (50)

p(y |x, f ) >X,Y |f p(y |x, a) = < KL( p(y |x, f ), p(y |x, a) ) >X , = < ln

which shows that minimizing log–loss is equivalent to minimizing the (x– averaged) Kullback–Leibler entropy KL( p(y |x, f ), p(y |x, a) )[122, 123, 13, 46, 53]. While the paper will concentrate on log–loss we will also give a short summary of loss functions for regression problems. (See for example [16, 197] for details.) Regression problems are special density estimation problems where the considered possible actions are restricted to y –independent functions a(x). Squared–error loss: The most common loss function for regression problems (see Sections 3.7, 3.7.2) is the squared–error loss. It reads for one– dimensional y l(x, y, a) = b1 (x) (y ? a(x))2 + b2 (x, y ), (51)

with arbitrary b1 (x) > 0 and b2 (x, y ). In that case the optimal function a(x) is the regression function of the posterior which is the mean of the predictive density a? (x) = dy y p(y |x, f ) = < y >Y |x,f . (52) 22

This can be easily seen by writing (y ? a(x))2 = = y ? < y >Y |x,f + < y >Y |x,f ? a(x) y ? < y >Y |x,f

2 2 2 2

(53)

+ a(x) ? < y >Y |x,f a(x) ? < y >Y |x,f

?2 y ? < y >Y |x,f

,

(54)

where the ?rst term in (54) is independent of a and the last term vanishes after integration over y according to the de?nition of < y >Y |x,f . Hence, r (a, f ) = dx b1 (x)p(x) a(x) ? < y >Y |x,f

2

+ const.

(55)

This is minimized by a(x) =< y >Y |x,f . Notice that for Gaussian p(y |x, a) with ?xed variance log–loss and squared-error loss are equivalent. For multi– dimensional y one–dimensional loss functions like Eq. (51) can be used when the component index of y is considered part of the x–variables. Alternatively, loss functions depending explicitly on multidimensional y can be de?ned. For instance, a general quadratic loss function would be l(x, y, a) =

k,k ′

(yk ? ak )K(k, k ′)(yk′ ? ak′ (x)).

(56)

with symmetric, positive de?nite kernel K(k, k ′ ). Absolute loss: For absolute loss l(x, y, a) = b1 (x)|y ? a(x)| + b2 (x, y ), with arbitrary b1 (x) > 0 and b2 (x, y ). The risk becomes r (a, f ) = dx b1 (x)p(x)

a(x) ?∞

(57)

+ dx b1 (x)p(x) = 2 dx b1 (x)p(x)

∞

dy (a(x) ? y ) p(y |x, f ) dy (y ? a(x)) p(y |x, f ) + const. (58)

a(x)

a(x) m(x)

dy (a(x) ? y ) p(y |x, f ) + const.′ , (59)

a(x) ?∞

where the integrals have been rewritten as

m(x) a(x)

=

+

∞ m(x)

introducing a median function m(x) which satis?es

m(x) ?∞

m(x) ?∞

+

a(x) m(x)

and

∞ a(x)

=

1 dy p(y |x, f ) = , ?x ∈ X, 2 23

(60)

so that a(x)

m(x) ?∞

dy p(y |x, f ) ?

∞ m(x)

dy p(y |x, f ) = 0, ?x ∈ X.

(61)

Thus the risk is minimized by any median function m(x). δ –loss and 0–1 loss : Another possible loss function, typical for classi?cation tasks (see Section 3.8), like for example image segmentation [150], is the δ –loss for continuous y or 0–1–loss for discrete y l(x, y, a) = ?b1 (x)δ (y ? a(x)) + b2 (x, y ), (62)

with arbitrary b1 (x) > 0 and b2 (x, y ). Here δ denotes the Dirac δ –functional for continuous y and the Kronecker δ for discrete y . Then, r (a, f ) = dx b1 (x)p(x) p( y = a(x) |x, f ) + const., (63)

so the optimal a corresponds to any mode function of the predictive density. For Gaussians mode and median are unique, and coincide with the mean. 2.2.3 General loss functions and unsupervised learning

Choosing actions a in speci?c situations often requires the use of speci?c loss functions. Such loss functions may for example contain additional terms measuring costs of choosing action a not related to approximation of the predictive density. Such costs can quantify aspects like the simplicity, implementability, production costs, sparsity, or understandability of action a. Furthermore, instead of approximating a whole density it often su?ces to extract some of its features. like identifying clusters of similar y –values, ?nding independent components for multidimensional y , or mapping to an approximating density with lower dimensional x. This kind of exploratory data analysis is the Bayesian analogue to unsupervised learning methods. Such methods are on one hand often utilized as a preprocessing step but are, on the other hand, also important to choose actions for situations where speci?c loss functions can be de?ned. From a Bayesian point of view general loss functions require in general an explicit two–step procedure [131]: 1. Calculate (an approximation of) the predictive density, and 2. Minimize the expectation of the loss function under that (approximated) predictive density. (Empirical risk minimization, on the 24

with action a(x, y ) being a mapping of y for given x to a ?nite number of cluster centers (prototypes). Another example of a clustering method based on the predictive density is Fukunaga’s valley seeking procedure [61]. For multidimensional x a space of actions a(Px x, y ) can be chosen depending only on a (possibly adaptable) lower dimensional projection of x. For multidimensional y with components yi it is often useful to identify independent components. One may look, say, for a linear mapping y ? = My minimizing the correlations between di?erent components of the ‘source’ variables y ? by minimizing the loss function l(y, y ′, M) =

i=j ′ y ?i y ?j ,

other hand, minimizes the empirical average of the (possibly regularized) loss function, see Section 2.5.) (For a related example see for instance [138].) For a Bayesian version of cluster analysis, for example, partitioning a predictive density obtained from empirical data into several clusters, a possible loss function is l(x, y, a) = (y ? a(x, y ))2 , (64)

(65)

with respect to M under the joint predictive density for y and y ′ given x, x′ , D, D0. This includes a Bayesian version of blind source separation (e.g. applied to the so called cocktail party problem [14, 7]), analogous to the treatment of Molgedey and Schuster [159]. Interesting projections of multidimensional y can for example be found by projection pursuit techniques [59, 102, 108, 206].

2.3

Maximum A Posteriori Approximation

In most applications the (usually very high or even formally in?nite dimensional) h–integral over model states in Eq. (23) cannot be performed exactly. The two most common methods used to calculate the h integral approximately are Monte Carlo integration [151, 91, 95, 194, 16, 70, 195, 21, 214, 233, 69, 167, 177, 198, 168] and saddle point approximation [16, 45, 30, 169, 17, 244, 197, 69, 76, 131]. The latter approach will be studied in the following. For that purpose, we expand Ecomb of Eq. (24) with respect to h around some h? Ecomb (h) = e(?h,?) E (h)

h =h ?

(66)

1 = Ecomb (h? ) + (?h, ?(h? )) + (?h, H(h? )?h) + · · · 2 25

with ?h = (h ? h? ), gradient ? (not acting on ?h), Hessian H, and round brackets (· · · , · · ·) denoting scalar products. In case p(y |x, h) is parameterized independently for every x, y the states h represent a parameter set indexed by x and y , hence ?(h? ) = H(h? ) = δEcomb (h) δh(x, y ) =

h =h ?

δEcomb (p(y ′|x′ , h)) δp(y |x, h)

,

h =h ?

(67) , (68)

δ 2 Ecomb (h) δh(x, y )δh(x′ , y ′)

=

h =h ?

are functional derivatives [97, 106, 29, 36] (or partial derivatives for discrete x, y ) and for example (?h, ?(h? )) = dx dy (h(x, y ) ? h? (x, y )) ?(h? )(x, y ). (69)

δ 2 Ecomb (p(y ′′|x′′ , h)) δp(y |x, h)δp(y ′|x′ , h)

h =h ?

Choosing h? to be the location of a local minimum of Ecomp (h) the linear term in (66) vanishes. The second order term includes the Hessian and corresponds to a Gaussian integral over h which could be solved analytically dh e?β (?h,H?h) = π 2 β ? 2 (det H)? 2 ,

d d 1

(70)

for a d–dimensional h–integral. However, using the same approximation for the h–integrals in numerator and denominator of Eq. (23), expanding then also p(y |x, h) around h? , and restricting to the ?rst (h–independent) term p(y |x, h?) of that expansion, the factor (70) cancels, even for in?nite d. (The result is the zero order term of an expansion of the predictive density in powers of 1/β . Higher order contributions can be calculated by using Wick’s theorem [45, 30, 169, 244, 109, 160, 131].) The ?nal approximative result for the predictive density (27) is very simple and intuitive p(y |x, f ) ≈ p(y |x, h? ), with h? = argminh∈H Ecomb = argmaxh∈H p(h|f ) = argmaxh∈H p(yD |xD , h)p(h|D0 ). (72) The saddle point (or Laplace) approximation is therefore also called Maximum A Posteriori Approximation (MAP). Notice that the same h? also maximizes the integrand of the evidence of the data yD p(yD |xD , D0 ) = dh p(yD |xD , h)p(h|D0 ). 26 (73) (71)

This is due to the assumption that p(y |x, h) is slowly varying at the stationary point and has not to be included in the saddle point approximation for the predictive density. For (functional) di?erentiable Ecomb Eq. (72) yields the stationarity equation, δEcomb (h) = 0. (74) δh(x, y ) The functional Ecomb including training and prior data (regularization, stabilizer) terms is also known as (regularized) error functional for h. In practice a saddle point approximation may be expected useful if the posterior is peaked enough around a single maximum, or more general, if the posterior is well approximated by a Gaussian centered at the maximum. For asymptotical results one would have to require ? 1 β L(yi |xi , h), (75)

i

to become β –independent for β → ∞ with some β being the same for the 1 prior and data term. (See [40, 237]). If for example n i L(yi |xi , h) converges for large number n of training data the low temperature limit 1/β → 0 can be interpreted as large data limit n → ∞, nEcomb = n ? 1 n

i

L(yi |xi , h) +

1 E (h|D0 ) . n

(76)

Notice, however, the factor 1/n in front of the prior energy. For Gaussian p(y |x, h) temperature 1/β corresponds to variance σ 2 1 1 Ecomb = 2 2 σ σ 1 2

i

(yi ? h(xi ))2 + σ 2 E (h|D0 ) .

(77)

For Gaussian prior this would require simultaneous scaling of data and prior variance. We should also remark that for continuous x,y the stationary solution h? needs not to be a typical representative of the process p(h|f ). A common example is a Gaussian stochastic process p(h|f ) with prior energy E (h|D0 ) related to some smoothness measure of h expressed by derivatives of p(y |x, h). Then, even if the stationary h? is smooth, this needs not to be the case for a typical h sampled according to p(h|f ). For Brownian motion, for instance, a typical sample path is not even di?erentiable (but continuous) while the 27

stationary path is smooth. Thus, for continuous variables only expressions like dh e?βE (h) can be given an exact meaning as a Gaussian measure, de?ned by a given covariance with existing normalization factor, but not the expressions dh and E (h) alone [51, 65, 223, 110, 83, 143]. Interestingly, the stationary h? yielding maximal posterior p(h|f ) is not only useful to obtain an approximation for the predictive density p(y |x, f ) but is also the optimal solution a? for a Bayesian decision problem with log–loss and a ∈ A = H . Indeed, for a Bayesian decision problem with log–loss (46) argmina∈H r (a, h) = h, and analogously, argmina∈F r (a, f ) = f. This is proved as follows: Jensen’s inequality states that dy p(y )g (q (y )) ≥ g ( dy p(y )q (y )), (80) (79) (78)

for any convex function g and probability p(y ) ≥ 0 with dy p(y ) = 1. Thus, because the logarithm is concave ? dy p(y |x, h) ln p(y |x, a) ≥ ? ln p(y |x, h) dy p(y |x, h) p(y |x, a) =0 p(y |x, h) (81) (82)

? ? dy p(y |x, h) ln p(y |x, a) ≥ ? dy p(y |x, h) ln p(y |x, h), with equality for a = h. Hence

r (a, h) = ? dx dy p(x)p(y |x, h) (b1 (x) ln p(y |x, a) + b2 (x, y )) (83) = ? dx p(x)b1 (x) dy p(y |x, h) ln p(y |x, a) + const. ≥ ? dx p(x)b1 (x) dy p(y |x, h) ln p(y |x, h) + const. = r (h, h), (84) (85) (86)

with equality for a = h. For a ∈ F replace h ∈ H by f ∈ F . This proves Eqs. (78) and (79). 28

2.4

Normalization, non–negativity, and speci?c priors

Density estimation problems are characterized by their normalization and non–negativity condition for p(y |x, h). Thus, the prior density p(h|D0 ) can only be non–zero for such h for which p(y |x, h) is positive and normalized over y for all x. (Similarly, when solving for a distribution function, i.e., the integral of a density, the non–negativity constraint is replaced by monotonicity and the normalization constraint by requiring the distribution function to be 1 on the right boundary.) While the non–negativity constraint is local with respect to x and y , the normalization constraint is nonlocal with respect to y . Thus, implementing a normalization constraint leads to nonlocal and in general non–Gaussian priors. For classi?cation problems, having discrete y values (classes), the normalization constraint requires simply to sum over the di?erent classes and a Gaussian prior structure with respect to the x–dependency is not altered [231]. For general density estimation problems, however, i.e., for continuous y , the loss of the Gaussian structure with respect to y is more severe, because non–Gaussian functional integrals can in general not be performed analytically. On the other hand, solving the learning problem numerically by discretizing the y and x variables, the normalization term is typically not a severe complication. To be speci?c, consider a Maximum A Posteriori Approximation, minimizing L(yi |xi , h) + βE (h|D0 ), (87) βEcomb = ?

i

where the likelihood free energy F (YD |xD , h) is included, but not the prior free energy F (H |D0) which, being h–independent, is irrelevant for minimization with respect to h. The prior energy βE (h|D0 ) has to implement the non–negativity and normalization conditions ZX (x, h) = dyi p(yi |xi , h) = 1, p(yi|xi , h) ≥ 0, ?xi ∈ Xi , ?h ∈ H ?yi ∈ Yi , ?xi ∈ Xi , ?h ∈ H. (88) (89)

It is useful to isolate the normalization condition and non–negativity constraint de?ning the class of density estimation problems from the rest of the ? 0 so that problem speci?c priors. Introducing the speci?c prior information D

29

? 0 , normalized, positive}, we have D0 = { D ? ? 0 , norm., pos.) = p(norm., pos.|h)p(h|D0 ) , p(h|D ? 0) p(norm., pos.|D ? 0 –independent with deterministic, D ? 0) p(norm., pos.|h) = p(norm., pos.|h, D = p(norm.|h)p(pos.|h) = δ (ZX ? 1)

xy

(90)

(91) (92)

Θ p(y |x, h) ,

and step function Θ. ( The density p(norm.|h) is normalized over all possible normalizations of p(y |x, h), i.e., over all possible values of ZX , and p(pos.|h) over all possible sign combinations.) The h–independent denomi? 0 ) can be skipped for error minimization with respect nator p(norm., pos.|D to h. We de?ne the speci?c prior as

? 0) ? 0 ) ∝ e?E (h|D p(h|D .

(93)

In Eq. (93) the speci?c prior appears as posterior of a h–generating pro? 0 . We will call therefore Eq. (93) the cess determined by the parameters D posterior form of the speci?c prior. Alternatively, a speci?c prior can also be in likelihood form ? 0 , h|norm., pos.) = p(D ? 0 |h) p(h|norm., pos.). p (D (94)

? 0 |h) is conditioned on h this means that the normalAs the likelihood p(D ? 0 |h) ? E ? 0 e (D ization Z = dD remains in general h–dependent and must be included when minimizing with respect to h. However, Gaussian speci?c priors with h–independent covariances have the special property that according to Eq. (70) likelihood and posterior interpretation coincide. Indeed, represent? 0 by a mean function t ? and covariance ing Gaussian speci?c prior data D D0 ?1 K (analogous to standard training data in the case of Gaussian regression, see also Section 3.5) one ?nds due to the fact that the normalization of a Gaussian is independent of the mean (for uniform (meta) prior p(h)) ? 0) = p(h|D = p(tD ? 0 |h, K) =

? ,K(h?tD ? )) 0 0 e? 2 (h?tD ? ,K(h?tD ? )) 0 0 dh e? 2 (h?tD ? ,K(h?tD ? )) 0 0 e? 2 (h?tD 1 1 1

(95) (96)

dt e? 2 (h?t,K(h?t)) 30

1

.

Thus, for Gaussian p(tD ? 0 |h, K) with h–independent normalization the speci?c prior energy in likelihood form becomes analogous to Eq. (93)

?E (tD ? |h,K) 0 p(tD , ? 0 |h, K) ∝ e

(97)

and speci?c prior energies can be interpreted both ways. Similarly, the complete likelihood factorizes ? 0 , norm., pos.|h) = p(norm., pos.|h) p(D ? 0 |h). p (D (98)

According to Eq. (92) non–negativity and normalization conditions are implemented by step and δ –functions. The non–negativity constraint is only active when there are locations with p(y |x, h) = 0. In all other cases the gradient has no component pointing into forbidden regions. Due to the combined e?ect of data, where p(y |x, h) has to be larger than zero by de?nition, and smoothness terms the non–negativity condition for p(y |x, h) is usually (but not always) ful?lled. Hence, if strict positivity is checked for the ?nal solution, then it is not necessary to include extra non–negativity terms in the error (see Section 3.2.1). For the sake of simplicity we will therefore not include non–negativity terms explicitly in the following. In case a non–negativity constraint has to be included this can be done using Lagrange multipliers, or alternatively, by writing the step functions in p(pos.|h) ∝ x,y Θ(p(y |x, h)) Θ(x ? a) =

∞ a

dξ

∞ ?∞

dηeiη(ξ?x) ,

(99)

and solving the ξ –integral in saddle point approximation (See for example [62, 63, 64]). Including the normalization condition in the prior p0 (h|D0 ) in form of a δ –functional results in a posterior probability p(h|f ) = e

i

? 0 )+? ? 0) Li (yi |xi ,h)?E (h|D c(H |D x∈X

δ

dy eL(y|x,h) ? 1

(100)

? 0) = ? ln Z ? (h|D ? 0 ) related to the normalization of the with constant c ?(H |D ? speci?c prior e?E (h|D0) . Writing the δ –functional in its Fourier representation δ (x) = 1 2π

∞ ?∞

dk eikx =

1 2πi

i∞ ?i∞

dk e?kx ,

(101)

31

i.e., δ ( dy eL(y|x,h) ? 1) = 1 2πi

i∞ ?i∞

dΛX (x) eΛX (x)(1?

dy eL(y |x,h) )

,

(102)

and performing a saddle point approximation with respect to ΛX (x) (which is exact in this case) yields P (h|f ) = e

i

? 0 )+? ? 0 )+ dx ΛX (x)(1? dyeL(y |x,h) ) Li (yi |xi ,h)?E (h|D c(H |D

.

(103)

This is equivalent to the Lagrange multiplier approach. Here the stationary ΛX (x) is the Lagrange multiplier vector (or function) to be determined by the normalization conditions for p(y |x, h) = eL(y|x,h) . Besides the Lagrange multiplier terms it is numerically sometimes useful to add additional terms to the log–posterior which vanish for normalized p(y |x, h).

2.5

Empirical risk minimization

In the previous sections the error functionals we will try to minimize in the following have been given a Bayesian interpretation in terms of the log– posterior density. There is, however, an alternative justi?cation of error functionals using the Frequentist approach of empirical risk minimization [219, 220, 221]. Common to both approaches is the aim to minimize the expected risk for action a r (a, f ) = dx dy p(x, y |f (D, D 0)) l(x, y, a). (104) The expected risk, however, cannot be calculated without knowledge of the true p(x, y |f ). In contrast to the Bayesian approach of modeling p(x, y |f ) the Frequentist approach approximates the expected risk by the empirical risk l(xi , yi, a), (105) E (a) = r ?(a, f ) =

i

i.e., by replacing the unknown true probability by an observable empirical probability. Here it is essential for obtaining asymptotic convergence results to assume that training data are sampled according to the true p(x, y |f ) [219, 52, 189, 127, 221]. Notice that in contrast in a Bayesian approach the density p(xi ) for training data D does according to Eq. (16) not enter the formalism because D enters as conditional variable. For a detailed discussion 32

of the relation between quadratic error functionals and Gaussian processes see for example [178, 180, 181, 112, 113, 150, 223, 143]. From that Frequentist point of view one is not restricted to logarithmic data terms as they arise from the posterior–related Bayesian interpretation. However, like in the Bayesian approach, training data terms are not enough to make the minimization problem well de?ned. Indeed this is a typical inverse problem [219, 115, 221] which can, according to the classical regularization approach [215, 216, 162], be treated by including additional regularization (stabilizer) terms in the loss function l. Those regularization terms, which correspond to the prior terms in a Bayesian approach, are thus from the point of view of empirical risk minimization a technical tool to make the minimization problem well de?ned. The empirical generalization error for a test or validation data set independent from the training data D , on the other hand, is measured using only the data terms of the error functional without regularization terms. In empirical risk minimization this empirical generalization error is used, for example, to determine adaptive (hyper–)parameters of regularization terms. A typical example is a factor multiplying the regularization terms controlling the trade–o? between data and regularization terms. Common techniques using the empirical generalization error to determine such parameters are cross– validation or bootstrap like techniques [163, 6, 225, 211, 212, 81, 39, 223, 54]. From a strict Bayesian point of view those parameters would have to be integrated out after de?ning an appropriate prior [16, 146, 148, 24].

2.6

Interpretations of Occam’s razor

The principle to prefer simple models over complex models and to ?nd an optimal trade–o? between ?tting data and model complexity is often referred to as Occam’s razor (William of Occam, 1285–1349). Regularization terms, penalizing for example non–smooth (“complex”) functions, can be seen as an implementation of Occam’s razor. The related phenomena appearing in practical learning is called over– ?tting [219, 96, 24]. Indeed, when studying the generalization behavior of trained models on a test set di?erent from the training set, it is often found that there is an optimal model complexity. Complex models can due to their higher ?exibility achieve better performance on the training data than simpler models. On a test set independent from the training set, however, they can perform poorer than simpler models. 33

Notice, however, that the Bayesian interpretation of regularization terms as (a priori) information about Nature and the Frequentist interpretation as additional cost terms in the loss function are not equivalent. Complexity priors re?ects the case where Nature is known to be simple while complexity costs express the wish for simple models without the assumption of a simple Nature. Thus, while the practical procedure of minimizing an error functional with regularization terms appears to be identical for empirical risk minimization and a Bayesian Maximum A Posteriori Approximation, the underlying interpretation for this procedure is di?erent. In particular, because the Theorem in Section 2.3 holds only for log–loss, the case of loss functions di?ering from log–loss requires from a Bayesian point of view to distinguish explicitly between model states h and actions a. Even in saddle point approximation, this would result in a two step procedure, where in a ?rst step the hypothesis h? , with maximal posterior probability is determined, while the second step minimizes the risk for action a ∈ A under that hypothesis h? [131].

2.7

A priori information and a posteriori control

Learning is based on data, which includes training data as well as a priori data. It is prior knowledge which, besides specifying the space of local hypothesis, enables generalization by providing the necessary link between measured training data and not yet measured or non–training data. The strength of this connection may be quanti?ed by the mutual information of training and non–training data, as we did in Section 2.1.5. Often, the role of a priori information seems to be underestimated. There are theorems, for example, proving that asymptotically learning results become independent of a priori information if the number of training data goes to in?nity. This, however,is correct only if the space of hypotheses h is already su?ciently restricted and if a priori information means knowledge in addition to that restriction. In particular, let us assume that the number of potential test situations x, is larger than the number of training data one is able to collect. As the number of actual training data has to be ?nite, this is always the case if x can take an in?nite number of values, for example if x is a continuous variable. The following arguments, however, are not restricted to situations were one considers an in?nite number of test situation, we just assume that their number is too large to be completely included in the training data. 34

If there are x values for which no training data are available, then learning for such x must refer to the mutual information of such test data and the available training data. Otherwise, training would be useless for these test situations. This also means, that the generalization to non–training situations can be arbitrarily modi?ed by varying a priori information. To make this point very clear, consider the rather trivial situation of learning a deterministic function h(x) for a x variable which can take only two values x1 and x2 , from which only one can be measured. Thus, having measured for example h(x1 ) = 5, then “learning” h(x2 ) is not possible without linking it to h(x1 ). Such prior knowledge may have the form of a “smoothness” constraint, say |h(x1 )?h(x2 )| ≤ 2 which would allow a learning algorithm to “generalize” from the training data and obtain 3 ≤ h(x2 ) ≤ 7. Obviously, arbitrary results can be obtained for h(x2 ) by changing the prior knowledge. This exempli?es that generalization can be considered as a mere reformulation of available information, i.e., of training data and prior knowledge. Except for such a rearrangement of knowledge, a learning algorithm does not add any new information to the problem. (For a discussion of the related “no–free-lunch” theorems see [235, 236].) Being extremely simple, this example nevertheless shows a severe problem. If the result of learning can be arbitrary modi?ed by a priori information, then it is critical which prior knowledge is implemented in the learning algorithm. This means, that prior knowledge needs an empirical foundation, just like standard training data have to be measured empirically. Otherwise, the result of learning cannot expected to be of any use. Indeed, the problem of appropriate a priori information is just the old induction problem, i.e., the problem of learning general laws from a ?nite number of observations, as already been discussed by the ancient Greek philosophers. Clearly, this is not a purely academic problem, but is extremely important for every system which depends on a successful control of its environment. Modern applications of learning algorithms, like speech recognition or image understanding, rely essentially on correct a priori information. This holds especially for situations where only few training data are available, for example, because sampling is very costly. Empirical measurement of a priori information, however, seems to be impossible. The reason is that we must link every possible test situation to the training data. We are not able to do this in practice if, as we assumed, the number of potential test situations is larger than the number of measurements one is able to perform. 35

Take as example again a deterministic learning problem like the one discussed above. Then measuring a priori information might for example be done by measuring (e.g., bounds on) all di?erences h(x1 ) ? h(xi ). Thus, even if we take the deterministic structure of the problem for granted, the number of such di?erences is equal to the number of potential non–training situations xi we included in our model. Thus, measuring a priori information does not require fewer measurements than measuring directly all potential non–training data. We are interested in situations where this is impossible. Going to a probabilistic setting the problem remains the same. For example, even if we assume Gaussian hypotheses with ?xed variance, measuring a complete mean function h(x), say for continuous x, is clearly impossible in practice. The same holds thus for a Gaussian process prior on h. Even this very speci?c prior requires the determination of a covariance and a mean function (see Chapter 3). As in general empirical measurement of a priori information seems to be impossible, one might thus just try to guess some prior. One may think, for example, of some “natural” priors. Indeed, the term “a priori” goes back to Kant [111] who assumed certain knowledge to be necessarily be given “a priori” without reference to empirical veri?cation. This means that we are either only able to produce correct prior assumptions, for example because incorrect prior assumptions are “unthinkable”, or that one must typically be lucky to implement the right a priori information. But looking at the huge number of di?erent prior assumptions which are usually possible (or “thinkable”), there seems no reason why one should be lucky. The question thus remains, how can prior assumptions get empirically veri?ed. Also, one can ask whether there are “natural” priors in practical learning tasks. In Gaussian regression one might maybe consider a “natural” prior to be a Gaussian process with constant mean function and smoothness– related covariance. This may leave a single regularization parameter to be determined for example by cross–validation. Formally, one can always even use a zero mean function for the prior process by subtracting a base line or reference function. Thus does, however, not solve the problem of ?nding a correct prior, as now that reference function has to be known to relate the results of learning to empirical measurements. In principle any function could be chosen as reference function. Such a reference function would for example enter a smoothness prior. Hence, there is no “natural” constant function and from an abstract point of view no prior is more “natural” than any other. 36

Formulating a general law refers implicitly (and sometimes explicitly) to a “ceteris paribus” condition, i.e., the constraint that all relevant variables, not explicitly mentioned in the law, are held constant. But again, verifying a “ceteris paribus” condition is part of an empirical measurement of a priori information and by no means trivial. Trying to be cautious and use only weak or “uninformative” priors does also not solve the principal problem. One may hope that such priors (which may be for example an improper constant prior for a one–dimensional real variable) do not introduce a completely wrong bias, so that the result of learning is essentially determined by the training data. But, besides the problem to de?ne what exactly an uninformative prior has to be, such priors are in practice only useful if the set of possible hypothesis is already su?ciently restricted, so “the data can speak for themselves” [69]. Hence, the problem remains to ?nd that priors which impose the necessary restrictions, so that uninformative priors can be used. Hence, as measuring a priori information seems impossible and ?nding correct a priori information by pure luck seems very unlikely, it looks like also successful learning is impossible. It is a simple fact, however, that learning can be successful. That means there must be a way to control a priori information empirically. Indeed, the problem of measuring a priori information may be arti?cial, arising from the introduction of a large number of potential test situations and correspondingly a large number of hidden variables h (representing what we call “Nature”) which are not all observable. In practice, the number of actual test situations is also always ?nite, just like the number of training data has to be. This means, that not all potential test data but only the actual test data must be linked to the training data. Thus, in practice it is only a ?nite number of relations which must be under control to allow successful generalization. (See also Vapnik’s distinction between induction and transduction problems. [221]: In induction problems one tries to infer a whole function, in transduction problems one is only interested in predictions for a few speci?c test situations.) This, however, opens a possibility to control a priori information empirically. Because we do not know which test situation will occur, such an empirical control cannot take place at the time of training. This means a priori information has to be implemented at the time of measuring the test data. In other words, a priori information has to be implemented by the measurement process [131, 134]. 37

1 1 0.8 0.6 0.4 0.2 0.8

0.6

0.4

0.2

20

40

60

80

100

20

40

60

80

100

Figure 4: The l.h.s. shows a bounded random function which does not allow generalization from training to non–training data. Using a measurement device with input averaging (r.h.s.) or input noise the function becomes learnable. Again, a simple example may clarify this point. Consider the prior information, that a function h is bounded, i.e., a ≤ h(x) ≤ b, ?x. A direct measurement of this prior assumption is practically not possible, as it would require to check every value h(x). An implementation within the measurement process is however trivial. One just has to use a measurement device which is only able to to produce output in the range between a and b. This is a very realistic assumption and valid for all real measurement devices. Values smaller than a and larger than b have to be ?ltered out or actively projected into that range. In case we nevertheless ?nd a value out of that range we either have to adjust the bounds or we exchange the “malfunctioning” measurement device with a proper one. Note, that this range ?lter is only needed at the ?nite number of actual measurements. That means, a priori information can be implemented by a posteriori control at the time of testing. A realistic measurement device does not only produce bounded output but shows also always input noise or input averaging. A device with input noise has noise in the x variable. That means if one intends to measure at xi the device measures instead at xi + ? with ? being a random variable. A typical example is translational noise, with ? being a, possibly multidimensional, Gaussian random variable with mean zero. Similarly, a device with input averaging returns a weighted average of results for di?erent x values instead of a sharp result. Bounded devices with translational input noise, for example, will always measure smooth functions [128, 23, 131]. (See Fig. 4.) This may be an explanation for the success of smoothness priors. 38

The last example shows, that to obtain adequate a priori information it can be helpful in practice to analyze the measurement process for which learning is intended. The term “measurement process” does here not only refer to a speci?c device, e.g., a box on the table, but to the collection of all processes which lead to a measurement result. We may remark that measuring a measurement process is as di?cult or impossible as a direct measurement of a priori information. What has to be ensured is the validity of the necessary restrictions during a ?nite number of actual measurements. This is nothing else than the implementation of a probabilistic rule producing y given the test situation and the training data. In other words, what has to be implemented is the predictive density p(y |x, D ). This predictive density indeed only depends on the actual test situation and the ?nite number of training data. (Still, the probability density for a real y cannot strictly be empirically veri?ed or controlled. We may take it here, for example, as an approximate statement about frequencies.) This shows the tautological character of learning, where measuring a priori information means controlling directly the corresponding predictive density. The a posteriori interpretation of a priori information can be related to a constructivistic point of view. The main idea of constructivism can be characterized by a sentence of Vico (1710): Verum ipsum factum — the truth is the same as the made [222]. (For an introduction to constructivism see [227] and references therein, for constructive mathematics see [25].)

3

3.1

3.1.1

Gaussian prior factors

Gaussian prior factor for log–probabilities

Lagrange multipliers: Error functional EL

In this chapter we look at density estimation problems with Gaussian prior factors. We begin with a discussion of functional priors which are Gaussian in probabilities or in log–probabilities, and continue with general Gaussian prior factors. Two section are devoted to the discussion of covariances and means of Gaussian prior factors, as their adequate choice is essential for practical applications. After exploring some relations of Bayesian ?eld theory and empirical risk minimization, the last three sections introduce the speci?c likelihood models of regression, classi?cation, inverse quantum theory.

39

We begin a discussion of Gaussian prior factors in L. As Gaussian prior factors correspond to quadratic error (or energy) terms, consider an error functional with a quadratic regularizer in L (L, KL) = ||L||2 K = 1 2 dx dy dx′ dy ′L(x, y )K(x, y ; x′, y ′)L(x′ , y ′), (106)

writing for the sake of simplicity from now on L(x, y ) for the log–probability L(y |x, h) = ln p(y |x, h). The operator K is assumed symmetric and positive semi–de?nite and positive de?nite on some subspace. (We will understand positive semi–de?nite to include symmetry in the following.) For positive (semi) de?nite K the scalar product de?nes a (semi) norm by ||L||K = (L, KL), (107)

and a corresponding distance by ||L ? L′ ||K . The quadratic error term (106) corresponds to a Gaussian factor of the prior density which have been called ? 0 ) = p(L|D ? 0 ) for L. In particular, we will consider the speci?c prior p(h|D here the posterior density , (108) where prefactors like β are understood to be included in K. The constant c ? referring to the speci?c prior is determined by the determinant of K according to Eq. (70). Notice however that not only the likelihood i Li but also the complete prior is usually not Gaussian due to the presence of the normalization conditions. (An exception is Gaussian regression, see Section 3.7.) The posterior (108) corresponds to an error functional

i

p(h|f ) = e

1 Li (xi ,yi )?2

c, dxdydx′ dy ′ L(x,y )K(x,y ;x′ ,y ′ )L(x′ ,y ′ )+ dx ΛX (x)(1? dy eL(x,y ) )+?

1 EL = βEcomb = ?(L, N ) + (L, KL) + (eL ? δ (y ), ΛX ), 2 with likelihood vector (or function) L(x, y ) = L(y |x, h), data vector (function)

n

(109)

(110)

N (x, y ) =

i

δ (x ? xi )δ (y ? yi ), 40

(111)

Lagrange multiplier vector (function) ΛX (x, y ) = ΛX (x), probability vector (function) eL (x, y ) = eL(x,y) = P (x, y ) = p(y |x, h), and δ (y )(x, y ) = δ (y ). (114) According to Eq. (111) N/n = Pemp is an empirical density function for the joint probability p(x, y |h). We end this subsection by de?ning some notations. Functions of vectors (functions) and matrices (operators), di?erent from multiplication, will be understood element-wise like for example (eL )(x, y ) = eL(x,y) . Only multiplication of matrices (operators) will be interpreted as matrix product. Element-wise multiplication has then to be written with the help of diagonal matrices. For that purpose we introduce diagonal matrices made from vectors (functions) and denoted by the corresponding bold letters. For instance, I(x, y ; x′, y ′) = δ (x ? x′ )δ (y ? y ′), L(x, y ; x′ , y ′) = δ (x ? x′ )δ (y ? y ′)L(x, y ), P(x, y ; x′, y ′) = eL (x, y ; x′ , y ′) = δ (x ? x′ )δ (y ? y ′)P (x, y ), N(x, y ; x′, y ′) = δ (x ? x′ )δ (y ? y ′) N (x, y ), ΛX (x, y ; x′ , y ′) = δ (x ? x′ )δ (y ? y ′)ΛX (x), or L = LI, where I (x, y ) = 1. (122) Being diagonal all these matrices commute with each other. Element-wise multiplication can now be expressed as (KL)(x′ , y ′, x, y ) = = dx′′ dy ′′K(x′ , y ′, x′′ , y ′′)L(x′′ , y ′′ , x, y ) dx′′ dy ′′K(x′ , y ′, x′′ , y ′′)L(x, y )δ (x ? x′′ )δ (y ? y ′′ ) (123) P = PI, eL = eL I, N = NI, ΛX = ΛX I, (121) (115) (116) (117) (118) (119) (120) (113) (112)

= K(x′ , y ′ , x, y )L(x, y ). 41

In general this is not equal to L(x′ , y ′)K(x′ , y ′, x, y ). In contrast, the matrix product KL with vector L (KL)(x′ , y ′) = dx dy K(x′ , y ′, x, y )L(x, y ), (124)

does not depend on x, y anymore, while the tensor product or outer product, (K ? L)(x′′ , y ′′, x, y, x′, y ′) = K(x′′ , y ′′, x′ , y ′)L(x, y ), (125)

depends on additional x′′ , y ′′. Taking the variational derivative of (108) with respect to L(x, y ) using δL(x′ , y ′) = δ (x ? x′ )δ (y ? y ′ ) δL(x, y ) and setting the gradient equal to zero yields the stationarity equation 0 = N ? KL ? eL ΛX . (127) (126)

Alternatively, we can write eL ΛX = ΛX eL = PΛX . The Lagrange multiplier function ΛX is determined by the normalization condition ZX (x) = dy eL(x,y) = 1, ?x ∈ X, (128) which can also be written ZX = IX P = IX eL = I in terms of normalization vector, ZX (x, y ) = ZX (x), normalization matrix, ZX (x, y ; x′, y ′) = δ (x ? x′ )δ (y ? y ′ ) ZX (x), and identity on X , IX (x, y ; x′ , y ′) = δ (x ? x′ ). (132) Multiplication of a vector with IX corresponds to y –integration. Being a non–diagonal matrix IX does in general not commute with diagonal matrices 42 (131) (130) or ZX = I, (129)

like L or P. Note also that despite IX eL = IX eL I = II = I in general IX P = IX eL = I = ZX . According to the fact that IX and ΛX commute, i.e., IX ΛX = ΛX IX ? [ΛX , IX ] = ΛX IX ? IX ΛX = 0, (133)

(introducing the commutator [A, B ] = AB ? BA), and that the same holds for the diagonal matrices [ΛX , eL ] = [ΛX , P] = 0, it follows from the normalization condition IX P = I that IX PΛX = IX ΛX P = ΛX IX P = ΛX I = ΛX , i.e., 0 = (I ? IX eL )ΛX = (I ? IX P)ΛX . (136) For ΛX (x) = 0 Eqs.(135,136) are equivalent to the normalization (128). If there exist directions at the stationary point L? in which the normalization of P changes, i.e., the normalization constraint is active, a ΛX (x) = 0 restricts the gradient to the normalized subspace (Kuhn–Tucker conditions [57, 19, 99, 188]). This will clearly be the case for the unrestricted variations of p(y, x) which we are considering here. Combining ΛX = IX PΛX for ΛX (x) = 0 with the stationarity equation (127) the Lagrange multiplier function is obtained ΛX = IX (N ? KL) = NX ? (IX KL). Here we introduced the vector NX = IX N, with components NX (x, y ) = NX (x) =

i

(134)

(135)

(137)

(138)

δ (x ? xi ) = nx ,

(139)

giving the number of data available for x. Thus, Eq. (137) reads in components ΛX (x) =

i

δ (x ? xi ) ?

dy ′′ dx′ dy ′ K(x, y ′′ ; x′ , y ′)L(x′ , y ′).

(140)

43

Inserting now this equation for ΛX into the stationarity equation (127) yields 0 = N ? KL ? eL (NX ? IX KL) = I ? eL IX (N ? KL) . (141)

Eq. (141) possesses, besides normalized solutions we are looking for, also possibly unnormalized solutions ful?lling N = KL for which Eq. (137) yields ΛX = 0. That happens because we used Eq. (135) which is also ful?lled for ΛX (x) = 0. Such a ΛX (x) = 0 does not play the role of a Lagrange multiplier. For parameterizations of L where the normalization constraint is not necessarily active at a stationary point ΛX (x) = 0 can be possible for a normalized solution L? . In that case normalization has to be checked. It is instructive to de?ne TL = N ? ΛX eL , so the stationarity equation (127) acquires the form KL = TL , which reads in components dx′ dy ′ K(x, y ; x′ , y ′)L(x′ , y ′) =

i

(142)

(143)

δ (x ? xi )δ (y ? yi) ? ΛX (x) eL(x,y) , (144)

which is in general a non–linear equation because TL depends on L. For existing (and not too ill–conditioned) K?1 the form (143) suggest however an iterative solution of the stationarity equation according to Li+1 = K?1 TL (Li ), (145)

for discretized L, starting from an initial guess L0 . Here the Lagrange multiplier ΛX has to be adapted so it ful?lls condition (137) at the end of iteration. Iteration procedures will be discussed in detail in Section 7. 3.1.2 Normalization by parameterization: Error functional Eg

Referring to the discussion in Section 2.3 we show that Eq. (141) can alternatively be obtained by ensuring normalization, instead of using Lagrange multipliers, explicitly by the parameterization L(x, y ) = g (x, y ) ? ln dy ′ eg(x,y ) , 44

′

L = g ? ln ZX ,

(146)

and considering the functional Eg = ? N, g ? ln ZX + 1 g ? ln ZX , K (g ? ln ZX ) . 2 (147)

The stationary equation for g (x, y ) obtained by setting the functional derivative δEg /δg to zero yields again Eq. (141). We check this, using δ ln ZX (x′ ) = δ (x ? x′ )eL(x,y) , δg (x, y ) and δL(x′ , y ′) = δ (x ? x′ )δ (y ? y ′) ? δ (x ? x′ )eL(x,y) , δg (x, y ) where δL denotes a matrix, and the superscript δg We also note that despite IX = IT X

T

δ ln ZX = IX eL = eL IX δg

T

,

(148)

δL = I ? IX eL , (149) δg

the transpose of a matrix.

IX eL = eL IX = (IX eL )T ,

(150)

is not symmetric because eL depends on y and does not commute with the non–diagonal IX . Hence, we obtain the stationarity equation of functional Eg written in terms of L(g ) again Eq. (141) δL 0=? δg

T

δEg = GL ? eL ΛX = I ? eL IX (N ? KL) . δL

(151)

Here GL = N ? KL = ?δEg /δL is the L–gradient of ?Eg . Referring to the discussion following Eq. (141) we note, however, that solving for g instead for L no unnormalized solutions ful?lling N = KL are possible. In case ln ZX is in the zero space of K the functional Eg corresponds to a Gaussian prior in g alone. Alternatively, we may also directly consider a Gaussian prior in g ?g = ? N, g ? ln ZX + 1 g , K g , E 2 with stationarity equation 0 = N ? Kg ? eL NX . 45 (153) (152)

Notice, that expressing the density estimation problem in terms of g , nonlocal normalization terms have not disappeared but are part of the likelihood term. As it is typical for density estimation problems, the solution g can be calculated in X –data space, i.e., in the space de?ned by the xi of the training data. This still allows to use a Gaussian prior structure with respect to the x–dependency which is especially useful for classi?cation problems [231]. 3.1.3 The Hessians HL, Hg

The Hessian HL of ?EL is de?ned as the matrix or operator of second derivatives δ 2 (?EL ) HL (L)(x, y ; x′ y ′ ) = . (154) δL(x, y )δL(x′ , y ′) L For functional (109) and ?xed ΛX we ?nd the Hessian by taking the derivative of the gradient in (127) with respect to L again. This gives HL (L)(x, y ; x′ y ′ ) = ?K(x, y ; x′ y ′) ? δ (x ? x′ )δ (y ? y ′ )ΛX (x)eL(x,y) or The addition of the diagonal matrix ΛX eL = eL ΛX can result in a negative de?nite H even if K has zero modes. like in the case where K is a di?erential operator with periodic boundary conditions. Note, however, that ΛX eL is diagonal and therefore symmetric, but not necessarily positive de?nite, because ΛX (x) can be negative for some x. Depending on the sign of ΛX (x) the normalization condition ZX (x) = 1 for that x can be replaced by the inequality ZX (x) ≤ 1 or ZX (x) ≥ 1. Including the L–dependence of ΛX and with δeL(x ,y ) ′ ′ = δ (x ? x′ )δ (y ? y ′ )eL(x,y) ? δ (x ? x′ )eL(x,y) eL(x ,y ) , δg (x, y ) δeL = I ? eL IX eL = eL ? eL IX eL , δg we ?nd, written in terms of L, Hg (L)(x, y ; x′ , y ′) = δ 2 (?Eg ) δg (x, y )δg (x′, y ′) i.e.,

′ ′

(155) (156)

HL = ?K ? ΛX eL .

(157)

(158)

L

46

= dx′′ dy ′′

δ 2 (?Eg ) δ (?Eg ) δL(x′′ , y ′′) δ 2 L(x′′ , y ′′ ) + δL(x, y )δL(x′′ , y ′′ ) δg (x′ , y ′) δL(x′′ , y ′′) δg (x, y )δg (x′, y ′ )

′ ′

L

= ?K(x, y ; x′ , y ′) ? eL(x ,y ) eL(x,y) +eL(x ,y )

′ ′

dy ′′dy ′′′ K(x′ , y ′′; x, y ′′′ ) dy ′′K(x′ , y ′; x, y ′′) dy ′′(KL)(x, y ′′ ) dy ′′(KL)(x, y ′′ ) . (159)

dy ′′ K(x′ , y ′′; x, y ) + eL(x,y)

?δ (x ? x′ )δ (y ? y ′ )eL(x,y) NX (x) ? +δ (x ? x′ )eL(x,y) eL(x ,y ) NX (x) ?

′ ′

The last term, diagonal in X , has dyadic structure in Y , and therefore for ?xed x at most one non–zero eigenvalue. In matrix notation the Hessian becomes Hg = ? I ? eL IX K I ? IX eL ? I ? eL IX ΛX eL = ? (I ? PIX ) [K (I ? IX P) + ΛX P] ,

(160)

the second line written in terms of the probability matrix. The expression is symmetric under x ? x′ ,y ? y ′ , as it must be for a Hessian and as can be veri?ed using the symmetry of K = KT and the fact that ΛX and IX commute, i.e., [ΛX , IX ] = 0. Because functional Eg is invariant under a shift transformation, g (x, y ) → g ′ (x, y ) + c(x), the Hessian has a space of zero modes with the dimension of X . Indeed, any y –independent function (which can have ?nite L1 –norm only in ?nite Y –spaces) is a left eigenvector of I ? eL IX with eigenvalue zero. The zero mode can be removed by projecting out the zero modes and using where necessary instead of the inverse a pseudo inverse of H, for example obtained by singular value decomposition, or by including additional conditions on g like for example boundary conditions.

3.2

3.2.1

Gaussian prior factor for probabilities

Lagrange multipliers: Error functional EP

We write P (x, y ) = p(y |x, h) for the probability of y conditioned on x and h. We consider now a regularizing term which is quadratic in P instead of

47

L. This corresponds to a factor within the posterior probability (the speci?c prior) which is Gaussian with respect to P . p(h|f ) =e

i 1 dxdydx′ dy ′ P (x,y )K(x,y ;x′ ,y ′ )P (x′ ,y ′ )+ dx ΛX (x)(1? dy P (x,y ))+? c, ln Pi (xi ,yi )?2

(161)

Li (xi ,yi )?1 dxdydx′ dy ′ eL(x,y ) K(x,y ;x′ ,y ′ )eL(x 2

′ ,y ′ )

or written in terms of L = ln P for comparison, p(h|f ) =e

i

c. + dx ΛX (x)(1? dy eL(x,y ) )+?

(162) (163)

Hence, the error functional is 1 EP = βEcomb = ?(ln P, N ) + (P, K P ) + ( P ? δ (y ) , ΛX ). 2 I, i.e., In particular, the choice K = λ 2 λ λ (P, P ) = ||P ||2, (164) 2 2 can be interpreted as a smoothness prior with respect to the distribution function of P (see Section 3.3). In functional (163) we have only implemented the normalization condition for P by a Lagrange multiplier and not the non–negativity constraint. This is su?cient if P (x, y ) > 0 (i.e., P (x, y ) not equal zero) at the stationary point because then P (x, y ) > 0 holds also in some neighborhood and there are no components of the gradient pointing into regions with negative probabilities. In that case the non–negativity constraint is not active at the stationarity point. A typical smoothness constraint, for example, together with positive probability at data points result in positive probabilities everywhere where not set to zero explicitly by boundary conditions. If, however, the stationary point has locations with P (x, y ) = 0 at non–boundary points, then the component of the gradient pointing in the region with negative probabilities has to be projected out by introducing Lagrange parameters for each P (x, y ). This may happen, for example, if the regularizer rewards oscillatory behavior. The stationarity equation for EP is 0 = P ?1 N ? K P ? Λ X , (165)

with the diagonal matrix P(x′ , y ′; x, y ) = δ (x ? x′ )δ (y ? y ′)P (x, y ), or multiplied by P 0 = N ? PKP ? PΛX . (166) 48

Probabilities P (x, y ) are unequal zero at observed data points (xi , yi ) so P?1 N is well de?ned. Combining the normalization condition Eq. (135) for ΛX (x) = 0 with Eq. (165) or (166) the Lagrange multiplier function ΛX is found as ΛX = IX (N ? PKP ) = NX ? IX PKP, where IX PKP (x, y ) = dy ′dx′′ dy ′′ P (x, y ′)K(x, y ′; x′′ , y ′′)P (x′′ , y ′′). (167)

Eliminating ΛX in Eq. (165) by using Eq. (167) gives ?nally 0 = (I ? IX P)(P?1N ? KP ), or for Eq. (166) 0 = (I ? PIX )(N ? PKP ). (169) For similar reasons as has been discussed for Eq. (141) unnormalized solutions ful?lling N ? PKP are possible. De?ning TP = P?1 N ? ΛX = P?1 N ? NX ? IX PKP, the stationarity equation can be written analogously to Eq. (143) as KP = TP , with TP = TP (P ), suggesting for existing K?1 an iteration P i+1 = K?1 TP (P i), starting from some initial guess P 0 . 3.2.2 Normalization by parameterization: Error functional Ez (172) (171) (170) (168)

Again, normalization can also be ensured by parameterization of P and solving for unnormalized probabilities z , i.e., P (x, y ) = z (x, y ) , dy z (x, y ) P = z . ZX (173)

49

The corresponding functional reads Ez = ? N, ln We have δ ln ZX 1 = Z? X IX , δz (175) with diagonal matrix z built analogous to P and ZX , and δz = I, δz δ (z/ZX ) δP 1 = = Z? X (I ? PIX ) , δz δz δ ln P 1 = Z? P ?1 ? I X , X δz (176) δZX = IX , δz δ ln z = z?1 = (ZX P)?1 , δz z ZX + 1 z z . ,K 2 ZX ZX (174)

?1 δP ?1 δZX 2 1 = ?Z? I , = ?P?2 Z? (177) X X X (I ? PIX ) . δz δz The diagonal matrices [ZX , P] = 0 commute, as well as [ZX , IX ] = 0, but [P, IX ] = 0. Setting the gradient to zero and using

(I ? PIX )T = (I ? IX P) , we ?nd δP 0=? δz

1 = Z? X T

(178)

δEz δP

P ? 1 ? I X N ? (I ? I X P ) K P (179)

1 ?1 = Z? X (I ? I X P ) P N ? K P 1 ?1 ?1 = Z? X (I ? IX P) GP = ZX (GP ? ΛX ) = (GP ? ΛX ) ZX ,

with P –gradient GP = P?1 N ? KP = ?δEz /δP of ?Ez and GP the corresponding diagonal matrix. Multiplied by ZX this gives the stationarity equation (171). 3.2.3 The Hessians HP , Hz

We now calculate the Hessian of the functional ?EP . For ?xed ΛX one ?nds the Hessian by di?erentiating again the gradient (165) of ?EP HP (P )(x, y ; x′y ′) = ?K(x′ y ′; x, y ) ? δ (x ? x′)δ (y ? y ′ )

i

δ (x ? xi )δ (y ? yi ) , P 2 (x, y ) (180)

50

i.e., Here the diagonal matrix P?2 N is non–zero only at data points. Including the dependence of ΛX on P one obtains for the Hessian of ?Ez in (174) by calculating the derivative of the gradient in (179) Hz (x, y ; x′ , y ′) = ? 1 K(x, y ; x′ , y ′) ZX (x) HP = ?K ? P?2 N. (181)

? dy ′′ p(x, y ′′ )K(x, y ′′; x′ , y ′) + K(x, y ; x′ , y ′′)p(x′ , y ′′) + dy ′′ dy ′′′p(x, y ′′ )K(x, y ′′; x′ , y ′′′ )p(x′ , y ′′′ ) +δ (x ? x′ )δ (y ? y ′) ?δ (x ? x′ ) + 2 δ (x ? x′ ) i.e.,

1 ?2 ?1 Hz = Z? X (I ? IX P) ?K ? P N (I ? PIX ) ZX

i

δ (x ? xi )δ (y ? yi ) ? δ (x ? x′ ) p2 (x, y )

i

δ (x ? xi )

dx′′ dy ′′ K(x, y ; x′′ , y ′′)p(x′′ , y ′′ ) + p(x′′ , y ′′)K(x′′ , y ′′ ; x′ , y ′) dy ′′dx′′′ dy ′′′p(x, y ′′ )K(x, y ′′; x′′′ , y ′′′)p(x′′′ , y ′′′) 1 , (182) ZX (x′ )

=

1 ?1 ?Z? X (IX (GP ? ΛX ) + (GP ? ΛX ) IX ) ZX , 1 ?Z? X

(183)

?IX P?1 N ? NP?1 IX + IX NIX 1 +IX GP + GP IX ? 2 IX ΛX Z? X .

(I ? IX P) K (I ? PIX ) + P N

?2

(184)

Here we used [ΛX , IX ] = 0. It follows from the normalization dy p(x, y ) = 1 that any y –independent function is right eigenvector of (I ? IX P) with zero eigenvalue. Because ΛX = IX PGP this factor or its transpose is also contained in the second line of Eq. (183), which means that Hz has a zero mode. Indeed, functional Ez is invariant under multiplication of z with a y –independent factor. The zero modes can be projected out or removed by including additional conditions, e.g. by ?xing one value of z for every x.

51

3.3

3.3.1

General Gaussian prior factors

The general case

In the previous sections we studied priors consisting of a factor (the speci?c prior) which was Gaussian with respect to P or L = ln P and additional normalization (and non–negativity) conditions. In this section we consider the situation where the probability p(y |x, h) is expressed in terms of a function φ(x, y ). That means, we assume a, possibly non–linear, operator P = P (φ) which maps the function φ to a probability. We can then formulate a learning problem in terms of the function φ, meaning that φ now represents the hidden variables or unknown state of Nature h.2 Consider the case of a speci?c prior which is Gaussian in φ, i.e., which has a log–probability quadratic in φ 1 ? ( φ , K φ ). (185) 2 This means we are lead to error functionals of the form 1 Eφ = ?( ln P (φ) , N ) + ( φ , K φ ) + ( P (φ) , ΛX ), 2 (186)

where we have skipped the φ–independent part of the ΛX –terms. In general cases also the non–negativity constraint has to be implemented. To express the functional derivative of functional (186) with respect to φ we de?ne besides the diagonal matrix P = P(φ) the Jacobian, i.e., the matrix of derivatives P′ = P′ (φ) with matrix elements P′ (x, y ; x′, y ′; φ) = δP (x′ , y ′; φ) . δφ(x, y ) (187)

The matrix P′ is diagonal for point–wise transformations, i.e., for P (x, y ; φ) = P ( φ(x, y ) ). In such cases we use P ′ to denote the vector of diagonal elements of P′. An example is the previously discussed transformation L = ln P for which P′ = P. The stationarity equation for functional (186) becomes 0 = P′ (φ)P?1(φ)N ? Kφ ? P′ (φ)ΛX , 0 = N ? PP′ K φ ? PΛX .

?1

(188)

and for existing PP′?1 =(P′ P?1 )?1 (for nonexisting inverse see Section 4.1), (189)

Besides φ also the hyperparameters discussed in Chapter 5 belong to the hidden variables h.

2

52

φ P (x, y ) z (x, y ) L(x, y ) = ln P g (x, y ) Φ=

y

P (φ ) P =P P = z/ z dy P = eL P = eg / eg dy P = dΦ/dy norm — norm —

constraints non–negativity non–negativity — — monotony

dy ′ P

boundary

Table 2: Constraints for speci?c choices of φ From Eq. (189) the Lagrange multiplier function can be found by integration, using the normalization condition IX P = I , in the form IX PΛX = ΛX for ΛX (x) = 0. Thus, multiplying Eq. (189) by IX yields ΛX = IX N ? PP′ K φ = NX ? IX PP′ K φ. ΛX is now eliminated by inserting Eq. (190) into Eq. (189) 0 = (I ? PIX ) N ? PP′ K φ .

?1 ?1 ?1

(190)

(191)

A simple iteration procedure, provided K?1 exists, is suggested by writing Eq. (188) in the form Kφ = Tφ , with Table 2 lists constraints to be implemented explicitly for some choices of φ. 3.3.2 Example: Square root of P Tφ (φ) = P′ P?1 N ? P′ ΛX . (193) φi+1 = K?1 Tφ (φi ), (192)

We already discussed the cases φ = ln P with P ′ = √ P = eL , P/P ′ = 1 and ′ ′ φ = P with P = 1, P/P = P . The choice φ = P yields the common L2 –normalization condition over y 1= dy φ2 (x, y ), 53 ?x ∈ X, (194)

which is quadratic in φ, and P = φ2 , P ′ = 2φ, P/P ′ = φ/2. For real φ the non–negativity √ condition P ≥ 0 is automatically satis?ed [82, 206]. For φ = P and a negative Laplacian inverse covariance K = ??, one can relate the corresponding Gaussian prior to the Fisher information [38, 206, 202]. Consider, for example, a problem with ?xed x (so x can be skipped from the notation and one can write P (y )) and a dy –dimensional y . Then one has, assuming the necessary di?erentiability conditions and vanishing boundary terms,

dy

( φ , K φ ) = ?( φ , ? φ ) =

dy

dy

k

?φ ?yk

2

(195)

=

k

dy 4 P (y )

?P (y ) ?yk

2

1 = 4

dy F Ik (0), k

(196)

F where Ik (0) is the Fisher information, de?ned as ?P (y ?y 0 ) 2 ?y 0 P (y ? y 0 )

F Ik (y 0 )

=

dy

=

? ln P (y ? y 0 ) 0 dy P (y ? y k ), 0 ?yk

2

(197)

for the family P (· ? y 0) with location parameter vector y 0. A connection to quantum mechanics can be found considering the training data free case 1 Eφ = ( φ, K φ ) + (ΛX , φ), (198) 2 has the homogeneous stationarity equation K φ = ?2ΦΛX . (199)

For x–independent ΛX this is an eigenvalue equation. Examples include the quantum mechanical Schr¨ odinger equation where K corresponds to the system Hamiltonian and ? 2ΛX = (φ, K φ) , (φ, φ) (200)

to its ground state energy. In quantum mechanics Eq. (200) is the basis for variational methods (see Section 4) to obtain approximate solutions for ground state energies [55, 193, 27]. 54

Similarly, one can take φ = Lmax with the normalization 1= and P = e?φ 3.3.3

2 +L max

?(L ? Lmax ) for L bounded from above by

2 (x,y )+L max

dy e?φ

,

?x ∈ X,

(201)

, P ′ = ?2φP , P/P ′ = ?1/(2φ).

Example: Distribution functions

Instead in terms of the probability density function, one can formulate the prior in terms of its integral, the distribution function. The density P is then recovered from the distribution function φ by di?erentiation,

dy

P (φ ) =

k

?φ = ?yk

dy k

dy

?y k φ =

1 ?1 R? k φ. = R φ, k

(202)

resulting in a non–diagonal P′ . The inverse of the derivative operator R?1 d is the integration operator R = ky Rk P with matrix elements R(x, y ; x′ , y ′) = δ (x ? x′ )θ(y ? y ′), i.e., Rk (x, y ; x′ , y ′) = δ (x ? x′ )

l =k ′ ). δ (yl ? yl′)θ(yk ? yk

(203) (204)

Thus, (202) corresponds to the transformation of (x–conditioned) density functions P in (x–conditioned) distribution functions φ = RP , i.e., φ(x, y ) = y ′ ′ T ?∞ P (x, y )dy . Because R KR is (semi)–positive de?nite if K is, a speci?c prior which is Gaussian in the distribution function φ is also Gaussian in the density P . P′ becomes P (x, y ; x , y ) =

′ ′ ′

δ

dy k

δφ(x, y )

?yk′ φ(x′ , y ′

dy

= δ (x ? x′ )

k

′ δ ′ (y k ? y k ).

(205)

Here the derivative of the δ –function is de?ned by formal partial integration

∞ ?∞ ′ dy ′ f (y ′)δ ′ (y ? y ′) = f (y ′ )δ (y ′ ? y )|∞ ?∞ ? f (y ).

(206)

Fixing φ(x, ?∞) = 0 the variational derivative δ/(δφ(x, ?∞)) is not needed. The normalization condition for P becomes for the distribution function φ = RP the boundary condition φ(x, ∞) = 1, ?x ∈ X . The non–negativity condition for P corresponds to the monotonicity condition φ(x, y ) ≥ φ(x, y ′ ), ?y ≥ y ′ , ?x ∈ X and to φ(x, ?∞) ≥ 0, ?x ∈ X . 55

3.4

3.4.1

Covariances and invariances

Approximate invariance

Prior terms can often be related to the assumption of approximate invariances or approximate symmetries. A Laplacian smoothness functional, for example, measures the deviation from translational symmetry under in?nitesimal translations. Consider for example a linear mapping ? we de?ne a (semi–)distance given by the operator S. To compare φ with φ de?ned by choosing a positive (semi–)de?nite KS , and use as error measure 1 1 (φ ? Sφ), KS (φ ? Sφ) = φ, Kφ . 2 2 Here is positive semi–de?nite if KS is. Conversely, every positive semi–de?nite K can be written K = WT W and is thus of form (209) with S = I ? W and KS = I. Including terms of the form of (209) in the error functional forces φ ?. to be similar to φ A special case are mappings leaving the norm invariant (φ, φ) = (Sφ, Sφ) = (φ, ST Sφ). (210) ? i.e., (Sφ) = (Sφ)? , this requires ST = S?1 and S? = S. For real φ and φ Thus, in that case S has to be an orthogonal matrix ∈ O (N ) and can be written k ∞ 1 A θ A i i S(θ) = e = e i = θi Ai , (211) i k =0 k ! with antisymmetric A = ?AT and real parameters θi . Selecting a set of (generators) Ai the matrices obtained be varying the parameters θi form a Lie group. Up to ?rst order the expansion of the exponential function reads S ≈ 1 + i θi Ai . Thus, we can de?ne an error measure with respect to an in?nitesimal transformation by 1 2 φ ? (1 + θi Ai )φ φ ? (1 + θi Ai)φ , KS θi θi 1 = (φ, 2 AT i K S A i φ ).

i

? = Sφ, φ→φ

(207)

(208) (209)

K = (I ? S)T KS (I ? S)

i

(212)

56

3.4.2

Approximate symmetries

Next we come to the special case of symmetries, i.e., invariance under under coordinate transformations. Symmetry transformations S change the arguments of a function φ. For example for the translation of a function ?(x) = Sφ(x) = φ(x ? c). Therefore it is useful to see how S acts φ(x) → φ on the arguments of a function. Denoting the (possibly improper) eigenvectors of the coordinate operator x with eigenvalue x by (·, x) = |x), i.e., x|x) = x|x), function values can be expressed as scalar products, e.g. φ(x) = (x, φ) for a function in x, or, in two variables, φ(x, y ) = (x ? y, φ). (Note that in this ‘eigenvalue’ notation, frequently used by physicists, for example 2|x) = |2x).) Thus, we see that the action of S on some function h(x) is equivalent to the action of ST ( = S?1 if orthogonal) on |x) Sφ(x) = (x, Sφ) = (ST x, φ), or for φ(x, y ) Assuming S = Sx Sy we may also split the action of S, Sφ(x, y ) = (ST x x) ? y, Sy φ . (215) Sφ(x, y ) = ST (x ? y ), φ . (214) (213)

This is convenient for example for vector ?elds in physics where x and φ(·, y ) form three dimensional vectors with y representing a linear combination of component labels of φ. Notice that, for a general operator S, the transformed argument S|x) does not have to be an eigenvector of the coordinate operator x again. In the general case S can map a speci?c |x) to arbitrary vectors being linear combinations of all |x′ ), i.e., S|x) = dx′ S (x, x′ )|x′ ). A general orthogonal S maps an orthonormal basis to another orthonormal basis. Coordinate transformations, however, are represented by operators S, which map coordinate eigenvectors |x) to other coordinate eigenvectors |σ (x)). Hence, such coordinate transformations S just changes the argument x of a function φ into σ (x), i.e., Sφ(x) = φ(σ (x)), (216) with σ (x) a permutation or a one–to–one coordinate transformation. Thus, even for an arbitrary nonlinear coordinate transformation σ the corresponding operator S in the space of φ is linear. (This is one of the reasons why linear functional analysis is so useful.) 57

A special case are linear coordinate transformations for which we can ?(x) = Sφ(x) = φ(Sx), with S (in contrast to S) acting in the write φ(x) → φ space of x. An example of such S are coordinate rotations which preserve the norm in x–space, analogously to Eq. (210) for φ, and form a Lie group S (θ) = e i θi Ai acting on coordinates, analogously to Eq. (211). 3.4.3 Example: In?nitesimal translations

A Laplacian smoothness prior, for example, can be related to an approximate symmetry under in?nitesimal translations. Consider the group of d– dimensional translations which is generated by the gradient operator ?. This can be veri?ed by recalling the multidimensional Taylor formula for expansion of φ at x S(θ)φ(x) = e

θ? i i i

φ(x) =

∞ k =0

(

i θi ?i )

k

k!

φ(x) = φ(x + θ).

(217)

Up to ?rst order S ≈ 1 + i θi ?i . Hence, for in?nitesimal translations, the error measure of Eq. (212) becomes 1 ?T i ?i φ) = ? (φ, ?φ). 2 i i (218) assuming vanishing boundary terms and choosing KS = I. This is the classical Laplacian smoothness term. 1 2 1 φ ? (1 + θi ?i )φ φ ? (1 + θi ?i )φ = (φ, , θi θi 2 3.4.4 Example: Approximate periodicity

As another example, lets us discuss the implementation of approximate periodicity. To measure the deviation from exact periodicity let us de?ne the di?erence operators ?R θ φ(x) = φ(x) ? φ(x + θ ), ?L θ φ(x) = φ(x ? θ ) ? φ(x). (219) (220)

T R L T For periodic boundary conditions (?L θ ) = ??θ , where (?θ ) denotes the L transpose of ?θ . Hence, the operator, R R T R ?θ = ?L θ ?θ = ?(?θ ) ?θ ,

(221)

58

de?ned similarly to the Laplacian, is positive de?nite, and a possible error term, enforcing approximate periodicity with period θ, is 1 1 1 (?R (θ)φ, ?R (θ)φ) = ? (φ, ?θ φ) = 2 2 2 dx |φ(x) ? φ(x + θ)|2 . (222)

As every periodic function with φ(x) = φ(x + θ) is in the null space of ?θ typically another error term has to be added to get a unique solution of the stationarity equation. Choosing, for example, a Laplacian smoothness term, yields 1 ? (φ, (? + λ?θ ) φ). (223) 2 In case θ is not known, it can be treated as hyperparameter as discussed in Section 5. Alternatively to an implementation by choosing a semi–positive de?nite operator K with symmetric functions in its null space, approximate symmetries can be implemented by giving explicitly a symmetric reference function 1 t(x). For example, 2 (φ ? t, K(φ ? t) ) with t(x) = t(x + θ). This possibility will be discussed in the next section.

3.5

Non–zero means

A prior energy term (1/2)(φ, K φ) measures the squared K–distance of φ to the zero function t ≡ 0. Choosing a zero mean function for the prior process is calculationally convenient for Gaussian priors, but by no means mandatory. In particular, a function φ is in practice often measured relative to some non– trivial base line. Without further a priori information that base line can in principle be an arbitrary function. Choosing a zero mean function that base line does not enter the formulae and remains hidden in the realization of the measurement process. On the the other hand, including explicitly a non– zero mean function t, playing the role of a function template (or reference, target, prototype, base line) and being technically relatively straightforward, can be a very powerful tool. It allows, for example, to parameterize t(θ) by introducing hyperparameters (see Section 5) and to specify explicitly di?erent maxima of multimodal functional priors (see Section 6. [131, 132, 133, 134, 135]). All this cannot be done by referring to a single baseline. Hence, in this section we consider error terms of the form 1 φ ? t, K (φ ? t) . 2 59 (224)

Mean or template functions t allow an easy and straightforward implementation of prior information in form of examples for φ. They are the continuous analogue of standard training data. The fact that template functions t are most times chosen equal to zero, and thus do not appear explicitly in the error functional, should not obscure the fact that they are of key importance for any generalization. There are many situations where it can be very valuable to include non–zero prior means explicitly. Template functions for φ can for example result from learning done in the past for the same or for similar ?(x) to be the output of an empiritasks. In particular, consider for example φ cal learning system (neural net, decision tree, nearest neighbor methods, . . .) ?(x) would be being the result of learning the same or a similar task. Such a φ a natural candidate for a template function t(x). Thus, we see that template functions could be used for example to allow transfer of knowledge between similar tasks or to include the results of earlier learning on the same task in case the original data are lost but the output of another learning system is still available. Including non–zero template functions generalizes functional Eφ of Eq. (186) to Eφ = ?(ln P (φ), N ) + 1 φ ? t, K (φ ? t) + (P (φ), ΛX ) (225) 2 1 = ?(ln P (φ), N ) + (φ, K φ) ? (J, φ)+(P (φ), ΛX )+const.(226) 2

In the language of physics J = Kt represents an external ?eld coupling to φ(x, y ), similar, for example, to a magnetic ?eld. A non–zero ?eld leads to a non–zero expectation of φ in the no–data case. The φ–independent constant 1 stands for the term 1 (t, K t), or 2 (J, K?1 J ) for invertible K, and can be 2 skipped from the error/energy functional Eφ . The stationarity equation for an Eφ with non–zero template t contains an inhomogeneous term Kt = J 0 = P′ (φ)P?1(φ)N ? P′ (φ)ΛX ? K (φ ? t) , with, for invertible PP′?1 and ΛX = 0, ΛX = IX N ? PP′ K (φ ? t) .

?1

(227)

(228)

Notice that functional (225) can be rewritten as a functional with zero template t ≡ 0 in terms of φ = φ ? t. That is the reason why we have not included 60

non–zero templates in the previous sections. For general non–additive combinations of squared distances of the form (224) non–zero templates cannot be removed from the functional as we will see in Section 6. Additive combinations of squared error terms, on the other hand, can again be written as one squared error term, using a generalized ‘bias–variance’–decomposition 1 N 1 φ ? tj , Kj (φ ? tj ) = φ ? t, K (φ ? t) + Emin 2 j =1 2 with template average

N

(229)

t = K ?1

j =1

Kj tj ,

(230)

assuming the existence of the inverse of the operator

N

K=

j =1

Kj .

(231)

and minimal energy/error Emin = N 1 N V (t1 , · · · tN ) = (tj , Kj tj ) ? (t, K t), 2 2 j =1 (232)

which up to a factor N/2 represents a generalized template variance V . We end with the remark that adding error terms corresponds in its probabilistic Bayesian interpretation to ANDing independent events. For example, if we wish to implement that φ is likely to be smooth AND mirror symmetric, we may add two squared error terms, one related to smoothness and another to mirror symmetry. According to (229) the result will be a single squared error term of form (224). Summarizing, we have seen that there are many potentially useful applications of non–zero template functions. Technically, however, non–zero template functions can be removed from the formalism by a simple substitution φ′ = φ ? t if the error functional consists of an additive combination of quadratic prior terms. As most regularized error functionals used in practice have additive prior terms this is probably the reason that they are formulated for t ≡ 0, meaning that non–zero templates functions (base lines) have to be treated by including a preprocessing step switching from φ to φ′ . We will see in Section 6 that for general error functionals templates cannot be removed by a simple substitution and do enter the error functionals explicitly. 61

3.6

Quadratic density estimation and empirical risk minimization

Interpreting an energy or error functional E probabilistically, i.e., assuming ?βE + c to be the logarithm of a posterior probability under study, the form of the training data term has to be ? i ln Pi . Technically, however, it would be easier to replace that data term by one which is quadratic in the probability P of interest. Indeed, we have mentioned in Section 2.5 that such functionals can be justi?ed within the framework of empirical risk minimization. From that Frequentist point of view an error functional E (P ), is not derived from a log–posterior, but represents an empirical risk r ?(P, f ) = i l(xi , yi, P ), approximating an expected risk r (P, f ) for action a = P . This is possible under the assumption that training data are sampled according to the true p(x, y |f ). In that interpretation one is therefore not restricted to a log–loss for training data but may as well choose for training data a quadratic loss like 1 P ? Pemp , KD (P ? Pemp ) , (233) 2 choosing a reference density P emp and a real symmetric positive (semi–)/de?nite KD . Approximating a joint probability p(x, y |h) the reference density Pemp would have to be the joint empirical density

joint Pemp (x, y ) =

1 n

n i

δ (x ? xi )δ (y ? yi ),

(234)

joint i.e., Pemp = N/n, as obtained from the training data. Approximating conditional probabilities p(y |x, h) the reference Pemp has to be chosen as conditional empirical density,

Pemp (x, y ) =

i

or, de?ning the diagonal matrix NX (x, x′ , y, y ′) = δ (x ? x′ )δ (y ? y ′ )NX (x) = δ (x ? x′ )δ (y ? y ′ ) i δ (x ? xi )

1 Pemp = N? X N.

N (x, y ) δ (x ? xi )δ (y ? yi ) = , nx i δ (x ? xi )

(235)

(236)

This, however, is only a valid expression if NX (x) = 0, meaning that for all x at least one measured value has to be available. For x variables with a 62

large number of possible values, this cannot be assumed. For continuous x variables it is even impossible. Hence, approximating conditional empirical densities either non–data x– values must be excluded from the integration in (233) by using an operator KD containing the projector x′ ∈xD δ (x ? x′ ), or Pemp must be de?ned also for such non–data x–values. For existing VX = IX 1 = dy 1, a possible extension ?emp of Pemp would be to assume a uniform density for non–data x values, P yielding ?emp (x, y ) = P

? ? ? ? ?

i

δ(x?xi )δ(y ?yi ) δ(x?xi ) i 1 dy 1

for for

i i

δ (x ? xi ) = 0, δ (x ? xi ) = 0.

(237)

This introduces a bias towards uniform probabilities, but has the advantage to give a empirical density for all x and to ful?ll the conditional normalization requirements. Instead of a quadratic term in P , one might consider a quadratic term in the log–probability L. The log–probability, however, is minus in?nity at all non–data points (x, y ) ∈ D . To work with a ?nite expression, one can choose small ?(y ) and approximate Pemp by

? Pemp (x, y ) =

?(y ) + i δ (x ? xi )δ (y ? yi ) , dy ?(y ) + i δ (x ? xi )

(238)

? provided dy ?(y ) exists. For ?(y ) = 0 also Pemp (x, y ) = 0, ?x and L? emp = ? ln Pemp > ?∞ exists. A quadratic data term in P results in an error functional

?P = 1 P ? Pemp , KD (P ? Pemp ) + 1 (P, K P ) + (P, ΛX ), E 2 2

(239)

skipping the constant part of the ΛX –terms. In (239) the empirical density ?emp of (237). Pemp may be replaced by P Positive (semi–)de?nite operators KD have a square root and can be written in the form RT R. One possibility, skipping for the sake of simplicity x in the following, is to choose as square root R the integration operator, i.e., R = k Rk and R(y, y ′) = θ(y ? y ′ ). Thus, φ = RP transforms the density function P in the distribution function φ, and we have P = P (φ) = R?1 φ. Here the inverse R?1 is the di?erentiation operator k ?yk (with appropriate 63

boundary condition) and RT R?1 = ? k ?k is the product of one– 2 dimensional Laplacians ?k = ? 2 /?yk . Adding for example a regularizing term (164) λ ( P, P ) gives 2 ?P = 1 P ? Pemp , RT R (P ? Pemp ) + λ (P, P ) E 2 2 1 φ ? φemp , φ ? φemp ? λ φ, ?k φ = 2 k 1 1 φ, (? ?k + m2 I)φ ? (φ , φemp) + (φemp , φemp ). = 2 2m 2 k (240) (241) (242)

?1

with m2 = λ?1 . Here the empirical distribution function φemp = RPemp is 1 given by φemp (y ) = n i θ (y ? yi ) (or, including the x variable, φemp (x, y ) = 1 ′ δ ( x ? x ) θ ( y ? yi ) for NX (x) = 0 which could be extended to a ′ x ∈xD NX (x) ? = RP ?emp for NX (x) = 0). The stationarity equation yields linear φ φ=m

2

?

?k + m I

k

2

?1

φemp .

?1

(243)

For dy = 1 (or φ = k φ) the operator becomes (?? + m2 I) which has the structure of a free massive propagator for a scalar ?eld with mass m2 and is calculated below. As already mentioned the normalization and non– negativity condition for P appear for φ as boundary and monotonicity conditions. For non–constant P the monotonicity condition has not to be implemented by Lagrange multipliers as the gradient at the stationary point has no components pointing into the forbidden area. (But the conditions nevertheless have to be checked.) Kernel methods of density estimation, like the use of Parzen windows, can be founded on such quadratic regularization functionals [219]. Indeed, the one–dimensional Eq. (243) is equivalent to the use of Parzens kernel in density estimation [179, 166].

3.7

3.7.1

Regression

Gaussian regression

An important special case of density estimation leading to quadratic data terms is regression for independent training data with Gaussian likelihoods p(yi |xi , h) = √

(yi ?h(xi ))2 1 e? 2σ2 , 2πσ

(244)

64

with ?xed, but possibly xi –dependent, variance σ 2 . In that case P (x, y ) = p(yi|xi , h) is speci?ed by φ = h and the logarithmic term i ln Pi becomes quadratic in the regression function h(xi ), i.e., of the form (224). In an interpretation as empirical risk minimization quadratic error terms corresponds to the choice of a squared error loss function l(x, y, a) = (y ? a(x))2 for action a(x). Similarly, the technical analogon of Bayesian priors are additional (regularizing) cost terms. We have remarked in Section 2.3 that for continuous x measurement of h(x) has to be understood as measurement of a h(? x) = dx ?(x)h(x) for sharply peaked ?(x). We assume here that the discretization of h used in numerical calculations takes care of that averaging. Divergent quantities like δ –functionals, used here for convenience, will then not be present. We now combine Gaussian data terms and a Gaussian (speci?c) prior with prior operator K0 (x, x′ ) and de?ne for training data xi , yi the operator Ki (x, x′ ) = δ (x ? xi )δ (x ? x′ ), (245) and training data templates t = yi . We also allow a general prior template t0 but remark that it is often chosen identically zero. According to (229) the resulting functional can be written in the following forms, useful for di?erent purposes, Eh = 1 n 1 (h(xi ) ? yi )2 + ( h ? t0 , K0 (h ? t0 ) )X 2 i=1 2 (246)

1 n 1 = ( h ? ti , Ki (h ? ti ) )X + ( h ? t0 , K0 (h ? t0 ) )X (247) 2 i=1 2 1 1 D = (h ? tD , KD (h ? tD ))X + (h ? t0 , K0 (h ? t0 ))X + Emin (248) 2 2 1 ( h ? t, K (h ? t) )X + Emin , (249) = 2 with KD =

i=1 n n n

Ki , Ki ,

i=0

1 tD = K? D

Ki ti , Ki ti ,

(250) (251)

K=

t = K ?1

i=1 n i=0

and h–independent minimal errors,

D Emin =

1 2

n

(ti , Ki ti )X + (tD , KD tD )X ,

i=1

(252)

65

Emin =

1 2

n

(ti , Ki ti )X + (t, Kt)X ,

i=0

(253)

D being proportional to the “generalized variances” VD = 2Emin /n and V = 2Emin /(n + 1). The scalar product (·, ·)X stands for x–integration only, for the sake of simplicity however, we will skip the subscript X in the following. The data operator KD n

KD (x, x′ ) =

i=1

δ (x ? xi )δ (x ? x′ ) = nx δ (x ? x′ ),

(254)

contains for discrete x on its diagonal the number of measurements at x,

n

nx = NX (x) =

i=1

δ (x ? xi ),

(255)

which is zero for x not in the training data. As already mentioned for con1 tinuous x a integration around a neighborhood of xi is required. K? D is a short hand notation for the inverse within the space of training data

?1 1 K? = δ (x ? x′ )/nx , D = (ID KD ID )

(256)

ID denoting the projector into the space of training data

n ?

ID = δ (x ? x′ )

i=1

δ (x ? xi ).

(257)

Notice that the sum is not over all n training points xi but only over the n ? ≤ n di?erent xi . (Again for continuous x an integration around xi is required to ensure I2 D = ID ). Hence, the data template tD becomes the mean of y –values measured at x tD (x) = 1 nx

nx

y (xj ),

j =1 xj = x

(258)

and tD (x) = 0 for nx = 0. Normalization of P (x, y ) is not in?uenced by a change in h(x) so the Lagrange multiplier terms have been skipped. The stationarity equation is most easily obtained from (249), 0 = K(h ? t). 66 (259)

It is linear and has on a space where K?1 exists the unique solution h = t. (260)

We remark that K can be invertible (and usually is so the learning problem is well de?ned) even if K0 is not invertible. The inverse K?1 , necessary to calculate t, is training data dependent and represents the covariance operator/matrix of a Gaussian posterior process. In many practical cases, however, 1 the prior covariance K? 0 (or in case of a null space a pseudo inverse of K0 ) is directly given or can be calculated. Then an inversion of a ?nite dimensional matrix in data space is su?cient to ?nd the minimum of the energy Eh [223, 76]. Invertible K0 : Let us assume ?rst deal with the case of an invertible K0 . It is the best to begin the stationarity equation as obtained from (247) or (248)

n

0 =

i=1

Ki (h ? ti ) + K0 (h ? t0 )

(261) (262) (263) (264) (265)

= KD (h ? tD ) + K0 (h ? t0 ).

1 For existing K? 0 1 h = t0 + K? 0 KD (tD ? h),

one can introduce a = KD (tD ? h), to obtain

1 h = t0 + K? 0 a.

Inserting Eq. (265) into Eq. (264) one ?nds an equation for a

1 a = KD (tD ? t0 ). I + KD K? 0

(266)

Multiplying Eq. (266) from the left by the projector ID and using KD ID = ID KD , a = ID a, tD = ID tD , (267)

one obtains an equation in data space

1 ID + KD K? 0,DD a = KD (tD ? t0,D ),

(268)

67

where

1 ?1 ?1 ?1 K? 0,DD = (K0 )DD = ID K0 ID = (K0,DD ) ,

t0,D = ID t0 .

(269) (270)

Thus, a = CDD b, where

1 CDD = ID + KD K? 0,DD ?1

,

(271) (272)

and b = KD (tD ? t0 ). In components Eq. (270) reads,

1 δkl + nxk K? 0 (xk , xl ) a(xl ) = nxk (tD (xk ) ? t0 (xk )) .

(273)

l

Having calculated a the solution h is given by Eq. (265)

?1 1 ?1 1 K? h = t0 + K? D + K0,DD 0 CDD b = t0 + K0 ?1

(tD ? t0 ).

(274)

Eq. (274) can also be obtained directly from Eq. (260) and the de?nitions (251), without introducing the auxiliary variable a, using the decomposition K0 t0 = ?KD t0 + (K0 + KD )t0 and

1 1 K ?1 K D = K ? I + KD K? 0 0 ∞ m=0 ?1 1 KD = K? 0 ∞ m=0 1 ?KD K? 0 m

KD

(275)

=

1 K? 0

1 ?KD ID K? 0 ID

m

1 1 ID + KD K? KD = K? 0,DD 0

?1

KD . (276)

1 K? 0 CDD is also known as equivalent kernel due to its relation to kernel smoothing techniques [205, 94, 90, 76]. Interestingly, Eq. (265) still holds for non–quadratic data terms of the form gD (h) with any di?erentiable function ful?lling g (h) = g (hD ), where hD = ID h is the restriction of h to data space. Hence, also the function of functional derivatives with respect to h(x) is restricted to data space, i.e., g ′ (hD ) ′ ′ = gD (hD ) with gD = ID g ′ and g ′(h, x) = δg (h)/δh(x). For example, g (h) = n i=1 V (h(xi ) ? yi ) with V a di?erentiable function. The ?nite dimensional vector a is then found by solving a nonlinear equation instead of a linear one [73, 75].

68

Furthermore, one can study vector ?elds, i.e., the case where, besides possibly x, also y , and thus h(x), is a vector for given x. (Considering the variable indicating the vector components of y as part of the x–variable, this is a situation where a ?xed number of one–dimensional y , corresponding to a subspace of X with ?xed dimension, is always measured simultaneously.) In that case the diagonal Ki of Eq. (245) can be replaced by a version with non– zero o?–diagonal elements Kα,α′ between the vector components α of y . This corresponds to a multi–dimensional Gaussian data generating probability p(yi |xi , h) = det Ki 2 (2π )

k 2 1

e? 2

1

(y ?hα (xi )) Ki,α,α′ (xi )(yi,α′ ?hα′ (xi )) α,α′ i,α

,

(277)

for k –dimensional vector yi with components yi,α . Non-invertible K0 : For non–invertible K0 one can solve for h using the Moore–Penrose inverse K# 0 . Let us ?rst recall some basic facts [58, 161, 15, 120]. A pseudo inverse of (a possibly non–square) A is de?ned by the conditions A# AA# = A, AA# A = A# , (278) and becomes for real A the unique Moore–Penrose inverse A# if (AA# )T = AA# , A linear equation Ax = b is solvable if AA# b = b. In that case the solution is x = A# b + x0 = A# b + y ? A# Ay, (282) (281) (280) (A # A )T = A # A . (279)

where x0 = y ? A# Ay is solution of the homogeneous equation Ax0 = 0 and vector y is arbitrary. Hence, x0 can be expanded in an orthonormalized basis ψl of the null space of A x0 = cl ψl . (283)

l

For an A which can be diagonalized, i.e., A = M?1 DM with diagonal D, the Moore–Penrose inverse is A# = M?1 D# M. Therefore AA# = A# A = I1 = I ? I0 . 69 (284)

where I0 = l ψl ψlT is the projector into the zero space of A and I1 = I ? I0 = M?1 DD# M. Thus, the solvability condition Eq. (281) becomes I 0 b = 0, or in terms of ψl ( ψl , b) = 0, ?l, (286) meaning that the inhomogeneity b must have no components within the zero space of A. Now we apply this to Eq. (262) where K0 is diagonalizable because positive semi de?nite. (In this case M is an orthogonal matrix and the entries of D are real and larger or equal to zero.) Hence, one obtains under the condition I0 (K0 t0 + KD (tD ? h)) = 0, (287) for Eq. (282)

0 h = K# 0 (K0 t0 + KD (tD ? h)) + h ,

(285)

(288)

where K0 h0 = 0 so that h0 = l cl ψl can be expanded in an orthonormalized basis ψl of the null space of K0 , assumed here to be of ?nite dimension. To ?nd an equation in data space de?ne the vector a = KD (tD ? h), to get from Eqs.(287) and (288) 0 = ( ψl , K0 t0 ) + ( ψl , a), ?l h = K# cl ψl . 0 (K0 t0 + a) +

l

(289)

(290) (291)

These equations have to be solved for a and the coe?cients cl . Inserting Eq. (291) into the de?nition (289) gives (I + K D K # 0 )a = KD tD ? KD I1 t0 ? KD cl ψl ,

l

(292)

using K# 0 K0 = I1 according to Eq. (284). Using a = ID a the solvability condition (287) becomes

n ? i=1

ψl (xi )a = ?( ψl , K0 t0 ), ?l, 70

(293)

the sum going over di?erent xi only. Eq. (292) for a and cl reads in data space, similar to Eq. (268), ?? a=C b, (294) ? ? ?1 = I + K D K # where C 0 has been assumed invertible and b is given by the right hand side of Eq. (292). Inserting into Eq. (291) the solution ?nally can be written ?? h = I1 t0 + K# cl ψl . (295) 0 Cb +

l

Again, general non–quadratic data terms g (hD ) can be allowed. In that case δg (hD )/δh(x) = g ′ (hD , x) = (ID g ′ )(hD , x) and Eq. (289) becomes the nonlinear equation

0 a = g ′(hD ) = g ′ ID K# 0 (K0 t0 + KD (tD ? h)) + h

.

(296)

The solution(s) a of that equation have then to be inserted in Eq. (291). 3.7.2 Exact predictive density

For Gaussian regression the predictive density under training data D and prior D0 can be found analytically without resorting to a saddle point approximation. The predictive density is de?ned as the h-integral p(y |x, D, D0) = = dh p(y |x, h)p(h|D, D0)

dh p(y |x, h)p(yD |xD , h)p(h|D0 ) dh p(yD |xD , h)p(h|D0 ) p(y, yD |x, xD , D0 ) . = p(yD |xD , D0 )

(297)

Denoting training data values yi by ti sampled with covariance Ki concentrated on xi and analogously test data values y = yn+1 by tn+1 sampled with (co–)variance Kn+1 , we have for 1 ≤ i ≤ n + 1 p(yi |xi , h) = det(Ki /2π ) 2 e and p(h|D0 ) = det(K0 /2π ) 2 e 71

1 1

?1 h?ti , Ki (h?ti ) 2

,

(298)

?1 h?t0 , K0 (h?t0 ) 2

,

(299)

hence p(y |x, D, D0) = dh e

1 ?2 1 ?2 n+1 i=0 n i=0

h?ti , Ki (h?ti ) + 1 2 h?ti , Ki (h?ti ) + 1 2

n+1 i=0 n i=0

ln deti (Ki /2π )

.

ln deti (Ki /2π )

(300)

dh e

Here we have this time written explicitly deti (Ki /2π ) for a determinant calculated in that space where Ki is invertible. This is useful because for example in general deti Ki det K0 = deti Ki K0 . Using the generalized ‘bias–variance’– decomposition (229) yields p(y |x, D, D0) = with

n n

dh e

1 ?2 V +1 h?t+ , K+ (h?t+ ) + n 2 + 2 1 1 ?2 h?t, K(h?t) + n V +2 2

n+1 i=0 n i=0

ln deti (Ki /2π )

,

ln deti (Ki /2π )

(301)

dh e

t = K ?1

i=0 n+1 1 t+ = K? + i=0

Ki ti , Ki ti ,

K= K+ =

Ki ,

i=0 n+1

(302) (303) (304) (305)

Ki ,

i=0

V

=

1 n K ti , Ki ti ? t, t , n i=0 n K+ 1 n+1 ti , Ki ti ? t+ , t+ . n i=0 n

V+ =

Now the h–integration can be performed p(y |x, D, D0) = e? 2 V+ + 2

n n 1 1 n+1 i=0 n i=0

ln deti (Ki /2π )? 1 ln det(K+ /2π ) 2 ln deti (Ki /2π )? 1 ln det(K/2π ) 2

e? 2 V + 2

(306)

Canceling common factors, writing again y for tn+1 , Kx for Kn+1 , detx for detn+1 , and using K+ t+ = Kt + Kx y , this becomes p(y |x, D, D0) = e? 2 (y,Ky y)+(y,Ky t)+ 2 (t,(KK+

1 1 ?1 1 1 K?K) t)+ 2 ln detx (Kx K? + K/2π )

. (307) (308)

?1 Here we introduced Ky = KT y = Kx ? Kx K+ Kx and used that

det K?1 K+ = det(I ? K?1 Kx ) = detx K?1 K+ 72

can be calculated in the space of test data x. This follows from K = K+ ? Kx and the equality 1?A 0 det = det(1 ? A) (309) B 1 with A = Ix K?1 Kx , B = (I ? Ix )K?1 Kx , and Ix denoting the projector into the space of test data x. Finally

1 ?1 ?1 Ky = Kx ? Kx K? + Kx = Kx K+ K = (K ? KK+ K),

(310)

yields the correct normalization of the predictive density p(y |x, D, D0) = e with mean and covariance

n

1 1 ?2 ln detx (Ky /2π ) y ?y ?, Ky (y ?y ?) + 2

,

(311)

y ? = t = K ?1

i=0 1 K? = y

Ki ti ,

?1 1 ?1 = K? x + Ix K Ix .

(312) (313)

1 Kx ? Kx K? + Kx

It is useful to express the posterior covariance K?1 by the prior covariance 1 K? 0 . According to 1+A B 0 1

?1

=

(1 + A)?1 ?(1 + A)?1 B 0 1

,

(314)

1 ?1 ?1 ?1 ?1 with A = KD K? ? , and K0,DD = ID K0 ID , K0,D D ? = 0,DD , B = KD K0,D D ?1 ID K0 ID ? = I ? ID we ?nd ? , ID 1 1 K ?1 = K ? I + KD K? 0 0 1 = K? 0 ?1 ?1 1 ? ID + KD K? 0,DD ?1

(315)

1 KD K? ? . ? + ID 0,D D

1 ID + KD K? 0,DD

1 1 ?1 ?1 Notice that while K? in general K? D = (ID KD ID ) 0,DD = ID K0 ID = 1 (ID K0 ID )?1 . This means for example that K? has to be known to ?nd 0 ?1 1 ?1 K0,DD and it is not enough to invert ID K0 ID = K0,DD = (K? 0,DD ) . In 1 data space ID + KD K? 0,DD be manipulated to give ?1 1 ?1 = K? D + K0,DD ?1 1 K? D , so Eq. (315) can

1 1 ?1 K ?1 = K ? I ? ID K? 0 D + K0,DD

?1

1 ID K? . 0

(316)

73

This allows now to express the predictive mean (312) and covariance (313) by the prior covariance

1 1 ?1 y ? = t0 + K? K? 0 D + K0,DD ?1

(tD ? t0 ),

(317)

1 K? 0,Dx .

1 1 ?1 ?1 ?1 K? = Kx + K? y 0,xx ? K0,xD KD + K0,DD

?1

(318)

1 1 Thus, for given prior covariance K? ? and K? 0 both, y y , can be calculated by 1 ?1 inverting the n ?×n ? matrix K = K? . 0,DD + KD Comparison of Eqs.(317,318) with the maximum posterior solution h? of Eq. (274) now shows that for Gaussian regression the exact predictive density p(y |x, D, D0) and its maximum posterior approximation p(y |x, h?) have the same mean t = dy y p(y |x, D, D0) = dy y p(y |x, h? ). (319) ?1

The variances, however, di?er by the term Ix K?1 Ix . According to the results of Section 2.2.2 the mean of the predictive density is the optimal choice under squared–error loss (51). For Gaussian regression, therefore the optimal regression function a? (x) is the same for squared–error loss in exact and in maximum posterior treatment and thus also for log–loss (for Gaussian p(y |x, a) with ?xed variance)

? ? ? ? a? MPA,log = aexact,log = aMPA,sq. = aexact,sq. = h = t.

(320)

In case the space of possible p(y |x, a) is not restricted to Gaussian densities with ?xed variance, the variance of the optimal density under log–loss ?1 p(y |x, a? exact,log ) = p(y |x, D, D0 ) di?ers by Ix K Ix from its maximum poste? ? rior approximation p(y |x, aMPA,log ) = p(y |x, h ). 3.7.3 Gaussian mixture regression (cluster regression)

Generalizing Gaussian regression the likelihoods may be modeled by a mixture of m Gaussians p(y |x, h) =

m k

p(k ) e? 2 (y?hk (x))

m k

β

β

2 2

dy

p(k ) e? 2 (y?hk (x))

,

m

(321)

2 . Hence, h is here where the normalization factor is found as k p(k ) 2β π speci?ed by mixing coe?cients p(k ) and a vector of regression functions hk (x)

74

specifying the x–dependent location of the k th cluster centroid of the mixture model. A simple prior for hk (x) is a smoothness prior diagonal in the cluster components. As any density p(y |x, h) can be approximated arbitrarily well by a mixture with large enough m such cluster regression models allows to interpolate between Gaussian regression and more ?exible density estimation. The posterior density becomes for independent data p(h|D0 ) p(h|D, D0 ) = p(yD |xD , D0 )

n i m k

p(k ) e? 2 (yi ?hk (xi ))

m k

β

2

p (k )

β 2π

m 2

.

(322)

Maximizing that posterior is — for ?xed x, uniform p(k ) and p(h|D0 ) — equivalent to the clustering approach of Rose, Gurewitz, and Fox for squared distance costs [199]. 3.7.4 Support vector machines and regression

Expanding the regression function h(x) in a basis of eigenfunctions Ψk of K0 K0 =

k

λ k Ψk ΨT k,

h(x) =

k

nk Ψk (x)

(323)

yields for functional (246)

2

Eh =

i k

nk Ψk (xi ) ? yi

+

k

λk |nk |2 .

(324)

Under the assumption of output noise for training data the data terms may for example be replaced by the logarithm of a mixture of Gaussians. Such mixture functions with varying mean can develop ?at regions where the error is insensitive (robust) to changes of h. Analogously, Gaussians with varying mean can be added to obtain errors which are ?at compared to Gaussians for large absolute errors. Similarly to such Gaussian mixtures the mean– square error data term (yi ? h(xi ))2 may be replaced by an ?–insensitive error |yi ? h(xi )|? , which is zero for absolute errors smaller ? and linear for larger absolute errors (see Fig.5). This results in a quadratic programming problem and is equivalent to Vapnik’s support vector machine [220, 74, 221, 209, 210, 49]. For a more detailed discussion of the relation between support vector machines and Gaussian processes see [224, 203].

75

Figure 5: Three robust error functions which are insensitive to small errors. Left: Logarithm of mixture with two Gaussians with equal variance and di?erent means. Middle: Logarithm of mixture with 11 Gaussians with equal variance and di?erent means. Right: ?–insensitive error.

3.8

Classi?cation

In classi?cation (or pattern recognition) tasks the independent visible variable y takes discrete values (group, cluster or pattern labels) [16, 61, 24, 47]. We write y = k and p(y |x, h) = Pk (x, h), i.e., k Pk (x, h) = 1. Having received classi?cation data D = {(xi , ki )|1 ≤ i ≤ n} the density estimation error functional for a prior on function φ (with components φk and P = P (φ)) reads

n

Ecl. =

i

ln Pki (xi ; φ) +

1 φ ? t, K (φ ? t) + (P (φ), ΛX ). 2

(325)

In classi?cation the scalar product corresponds to an integral over x and a summation over k , e.g., φ ? t, K (φ ? t) = dx dx′ (φk (x) ? tk (x))Kk,k′ (x, x′ )(φk′ (x′ ) ? tk′ (x′ )), (326) and (P, ΛX ) = dx ΛX (x) k Pk (x). For zero–one loss l(x, k, a) = δk,a(x) — a typical loss function for classi?cation problems — the optimal decision (or Bayes classi?er) is given by the mode of the predictive density (see Section 2.2.2), i.e., a(x) = argmaxk p(k |x, D, D0 ). (327)

k,k ′

In saddle point approximation p(k |x, D, D0 ) ≈ p(k |x, φ? ) where φ? minimizing Ecl. (φ) can be found by solving the stationarity equation (227). For the choice φk = Pk non–negativity and normalization must be ensured. For φ = L with P = eL non–negativity is automatically ful?lled but the Lagrange multiplier must be included to ensure normalization. 76

likelihood p(y |x, h) of general form discrete y Gaussian with ?xed variance mixture of Gaussians

problem type density estimation classi?cation regression clustering

quantum mechanical likelihood inverse quantum mechanics Table 3: Special cases of density estimation Normalization is guaranteed by using unnormalized probabilities φk = zk , P = zk / l zl (for which non–negativity has to be checked) or shifted log– likelihoods φk = gk with gk = Lk +ln l eLl , i.e., Pk = egk / l egl . In that case the nonlocal normalization terms are part of the likelihood and no Lagrange multiplier has to be used [231]. The resulting equation can be solved in the space de?ned by the X –data (see Eq. (153)). The restriction of φk = gk to linear functions φk (x) = wk x + bk yields log–linear models [152]. Recently a mean ?eld theory for Gaussian Process classi?cation has been developed [174, 176]. Table 3 lists some special cases of density estimation. The last line of the table, referring to inverse quantum mechanics, will be discussed in the next section.

3.9

Inverse quantum mechanics

Up to now we have formulated the learning problem in terms of a function φ having a simple, e.g., pointwise, relation to P . Nonlocalities in the relation between φ and P was only due to the normalization condition, or, working with the distribution function, due to an integration. Inverse problems for quantum mechanical systems provide examples of more complicated, nonlocal relations between likelihoods p(y |x, h) = p(y |x, φ) and the hidden variables φ the theory is formulated in. To show the ?exibility of Bayesian Field Theory we will give in the following a short introduction to its application to inverse 77

quantum mechanics. A more detailed discussion of inverse quantum problems including numerical applications can be found in [132, 142, 141, 137, 217]. The state of a quantum mechanical systems can be completely described by giving its density operator ρ. The density operator of a speci?c system depends on its preparation and its Hamiltonian, governing the time evolution of the system. The inverse problem of quantum mechanics consists in the reconstruction of ρ from observational data. Typically, one studies systems with identical preparation but di?ering Hamiltonians. Consider for example Hamiltonians of the form H = T + V, consisting of a kinetic energy part T and a potential V. Assuming the kinetic energy to be ?xed, the inverse problem is that of reconstructing the potential V from measurements. A local potential V(y, y ′) = V (y )δ (y ? y ′) is speci?ed by a function V (y ). Thus, for reconstructing a local potential it is the function V (y ) which determines the likelihood p(y |x, h) = p(y |X, ρ) = p(y |X, V ) = P (φ) and it is natural to formulate the prior in terms of the function φ = V . The possibilities of implementing prior information for V are similar to those we discuss in this paper for general density estimation problems. It is the likelihood model where inverse quantum mechanics di?ers from general density estimation. Measuring quantum systems the variable x corresponds to a hermitian operator X. The possible outcomes y of measurements are given by the eigenvalues of X, i.e., X|y >= y |y >, (328) where |y >, with dual < y |, denotes the eigenfunction with eigenvalue y . (For the sake of simplicity we assume nondegenerate eigenvalues, the generalization to the degenerate case being straightforward.) De?ning the projector ΠX,y = |y >< y | the likelihood model of quantum mechanics is given by p(y |x, ρ) = Tr(ΠX,y ρ). (330) (329)

In the simplest case, where the system is in a pure state, say the ground state ?0 of H ful?lling H|?0 >= E0 |?0 >, (331) the density operator is ρ = ρ2 = |?0 >< ?0 |, 78 (332)

ρ general pure state stationary pure state ground state time–dependent pure state scattering general mixture state stationary mixture state canonical ensemble

i

|ψ >< ψ | |?i (H) >< ?i (H)| |?0 (H)| >< ?0 (H)| |U(t, t0 )ψ (t0 ) >< U(t, t0 )ψ (t0 )| lim

t→∞ t0 →?∞

|U(t, t0 )ψ (t0 ) >< U(t, t0 )ψ (t0 )|

k

p(k ) |ψk >< ψk |

p(i|H) |?i (H) >< ?i (H)| (Tr e?β H )?1 e?β H

Table 4: The most common examples of density operators for quantum systems. In this table ψ denotes an arbitrary pure state, ?i represents an eigenstate of Hamiltonian H . The unitary time evolution operator for a time–independent Hamiltonian H is given by U = e?i(t?t0 )H . In scattering one imposes typically additional speci?c boundary conditions on the initial and ?nal states. and the likelihood (330) becomes p(y |x, h) = p(y |X, ρ) = Tr(|?0 >< ?0 |y >< y |) = |?0 (y )|2 . (333)

Other common choices for ρ are shown in Table 4. In contrast to ideal measurements on classical systems, quantum measurements change the state of the system. Thus, in case one is interested in repeated measurements for the same ρ, that density operator has to be prepared before each measurement. For a stationary state at ?nite temperature, for example, this can be achieved by waiting until the system is again in thermal equilibrium. For a Maximum A Posteriori Approximation the functional derivative of the likelihood is needed. Thus, for reconstructing a local potential we have 79

to calculate δV (y) p(yi|X, V ). (334) To be speci?c, let us assume we measure particle coordinates, meaning we have chosen X to be the coordinate operator. For a system prepared in the ground state of its Hamiltonian H, we then have to ?nd, δV (y) |?0 (yi )|2 . (335)

For that purpose, we take the functional derivative of Eq. (331), which yields (H ? E0 )|δV (y) ?0 >= (δV (y) H ? δV (y) E0 )|?0 > . (336)

Projecting from the left by < ?0 |, using again Eq. (331) and the fact that for a local potential δV (y) H(y ′, y ′′) = δ (y ? y ′ )δ (y ′ ? y ′′), shows that δV (y) E0 =< ?0 |δV (y) H|?0 >= |?0 (y )|2. (337)

Choosing < ?0 |δV (y) ?0 > = 0 and inserting a complete basis of eigenfunctions |?j > of H, we end up with δV (y) ?0 (yi) = 1 ? j (y i )? ? j (y )? 0 (y ). E ? E 0 j j =0 (338)

From this the functional derivative of the quantum mechanical log–likelihood (335) corresponding to data point yi can be obtained easily, δV (y) ln p(yi |X, V ) = 2Re ?0 (yi)?1 δV (y) ?0 (yi ) . (339)

The MAP equations for inverse quantum mechanics are obtained by including the functional derivatives of the prior term for V . In particular, for a Gaussian prior with mean V0 and inverse covariance KV , acting in the space of potential functions V (y ), its negative logarithm, i.e., its prior error functional, reads 1 V ? V0 , KV (V ? V0 ) + ln ZV , (340) 2 with ZV being the V –independent constant normalizing the prior over V . Collecting likelihood and prior terms, the stationarity equation ?nally becomes 0= δV (y) ln p(yi |X, V ) ? KV (V ? V0 ). (341)

i

80

The Bayesian approach to inverse quantum problems is quite ?exible and can be used for many di?erent learning scenarios and quantum systems. By adapting Eq. (339), it can deal with measurements of di?erent observables, for example, coordinates, momenta, energies, and with other density operators, describing, for example, time–dependent states or systems at ?nite temperature [142]. The treatment of bound state or scattering problems for quantum many– body systems requires additional approximations. Common are, for example, mean ?eld methods, for bound state problems [55, 193, 27] as well as for scattering theory [78, 27, 139, 140, 129, 130, 218] Referring to such mean ?eld methods inverse quantum problems can also be treated for many–body systems [141].

4

4.1

Parameterizing likelihoods: Variational methods

General parameterizations

Approximate solutions of the error minimization problem are obtained by restricting the search (trial) space for h(x, y ) = φ(x, y ) (or h(x) in regression). Functions φ which are in the considered search space are called trial functions. Solving a minimization problem in some restricted trial space is also called a variational approach [97, 106, 29, 36, 27]. Clearly, minimal values obtained by minimization within a trial space can only be larger or equal than the true minimal value, and from two variational approximations that with smaller error is the better one. Alternatively, using parameterized functions φ can also implement the prior where φ is known to have that speci?c parameterized form. (In cases where φ is only known to be approximately of a speci?c parameterized form, this should ideally be implemented using a prior with a parameterized template and the parameters be treated as hyperparameters as in Section 5.) The following discussion holds for both interpretations. Any parameterization φ = φ({ξl }) together with a range of allowed values for the parameter vector ξ de?nes a possible trial space. Hence we consider the error functional 1 Eφ(ξ) = ?( ln P (ξ ), N ) + ( φ(ξ ), K φ(ξ ) ) + ( P (ξ ), ΛX ), 2 81 (342)

for φ depending on parameters ξ and p(ξ ) = p( φ(ξ ) ). In the special case of Gaussian regression this reads 1 1 Eh(ξ) = ( h(ξ ) ? tD , KD h(ξ ) ? tD ) + ( h(ξ ), K h(ξ ) ). 2 2 De?ning the matrix ?φ(x, y ) ?ξl the stationarity equation for the functional (342) becomes Φ′ (l; x, y ) = 0 = Φ′ P′ P?1N ? Φ′ Kφ ? Φ′ P′ ΛX . (344) (343)

(345)

Similarly, a parameterized functional Eφ with non–zero template t as in (225) would give 0 = Φ′ P′ P?1 N ? Φ′ K (φ ? t) ? Φ′ P′ ΛX . (346) To have a convenient notation when solving for ΛX we introduce P′ξ = Φ′ (ξ )P′ (φ), i.e., P′ξ (l; x, y ) = and to obtain for Eq. (345) Gφ(ξ) = P′ξ P?1N ? Φ′ Kφ, P′ξ ΛX = Gφ(ξ) . (349) (350) ?P (x, y ) = ?ξl dx′ dy ′ ?φ(x′ , y ′) δP (x, y ) , ?ξl δφ(x′ , y ′) (347)

(348)

For a parameterization ξ restricting the space of possible P the matrix P′ξ is not square and cannot be inverted. Thus, let (P′ξ )# be the Moore–Penrose inverse of P′ξ , i.e., (P′ξ )# P′ξ (P′ξ )# = P′ξ , P′ξ (P′ξ )# P′ξ = (P′ξ )# , (351)

and symmetric (P′ξ )# P′ξ and P′ξ (P′ξ )# . A solution for ΛX exists if P′ξ (P′ξ )# Gφ(ξ) = Gφ(ξ) . In that case the solution can be written ΛX = (P′ξ )# Gφ(ξ) + VΛ ? (P′ξ )# P′ξ VΛ , 82 (353) (352)

with arbitrary vector VΛ and

′ # ′ Λ0 X = V Λ ? (P ξ ) P ξ V Λ

(354)

from the right null space of P′ξ , representing a solution of P′ξ Λ0 X = 0. (355)

Inserting for ΛX (x) = 0 Eq. (353) into the normalization condition ΛX = IX PΛX gives ΛX = IX P (P′ξ )# Gφ(ξ) + VΛ ? (P′ξ )# P′ξ VΛ . (356)

Substituting back in Eq. (345) ΛX is eliminated yielding as stationarity equation 0 = I ? P′ξ IX P(P′ξ )# Gφ(ξ) ? P′ξ IX P VΛ ? (P′ξ )# P′ξ VΛ , (357)

where Gφ(ξ) has to ful?ll Eq. (352). Eq. (357) may be written in a form similar to Eq. (192) Kφ(ξ) (ξ ) = Tφ(ξ) (358) with Tφ(ξ) (ξ ) = P′ξ P?1 N ? P′ξ ΛX , but with Kφ(ξ) (ξ ) = Φ′ KΦ(ξ ), being in general a nonlinear operator. (360) (359)

4.2

Gaussian priors for parameters

Up to now we assumed the prior to be given for a function φ(ξ )(x, y ) depending on x and y . Instead of a prior in a function φ(ξ )(x, y ) also a prior in another not (x, y )–dependent function of the parameters ψ (ξ ) can be given. A Gaussian prior in ψ (ξ ) = Wψ ξ being a linear function of ξ , results in a prior which is also Gaussian in the parameters ξ , giving a regularization term 1 1 T ( ξ, Wψ Kψ Wψ ξ ) = ( ξ, Kξ ξ ), 2 2 (361)

83

T where Kξ = Wψ Kψ Wψ is not an operator in a space of functions φ(x, y ) but a matrix in the space of parameters ξ . The results of Section 4.1 apply to this case provided the following replacement is made

Φ′ Kφ → Kξ ξ. Similarly, a nonlinear ψ requires the replacement Φ′ Kφ → Ψ′ Kψ ψ, where Ψ′ (k, l) =

(362)

(363)

?ψl (ξ ) . (364) ?ξk Thus, in the general case where a Gaussian (speci?c) prior in φ(ξ ) and ψ (ξ ) is given, Eφ(ξ),ψ(ξ) = ?( ln P (ξ ), N ) + ( P (ξ ), ΛX ) 1 1 + ( φ ( ξ ) , K φ ( ξ ) ) + ( ψ ( ξ ) , K ψ ψ ( ξ ) ), 2 2

(365)

or, including also non–zero template functions (means) t, tψ for φ and ψ as discussed in Section 3.5, Eφ(ξ),ψ(ξ) = ?( ln P (ξ ), N ) + ( P (ξ ), ΛX ) 1 + ( φ(ξ ) ? t, K (φ(ξ ) ? t) ) 2 1 + ( ψ (ξ ) ? tψ , Kψ (ψ (ξ ) ? tψ ) ). (366) 2 The φ and ψ –terms of the energy can be interpreted as corresponding to a probability p(ξ |t, K, tψ , Kψ ), (= p(ξ |t, K) p(ξ |tψ , Kψ )), or, for example, to p(tψ |ξ, Kψ ) p(ξ |t, K) with one of the two terms term corresponding to a Gaussian likelihood with ξ –independent normalization. The stationarity equation becomes 0 = P′ξ P?1 N ? Φ′ K(φ ? t) ? Ψ′ Kψ (ψ ? tψ ) ? P′ξ ΛX = Gφ,ψ ? P′ξ ΛX , which de?nes Gφ,ψ , and for ΛX = 0 ΛX = IX P (P′ξ )# Gφ,ψ + Λ0 X , for P′ξ Λ0 X = 0. 84 (369) (367) (368)

Variable L(x, y ) P (x, y ) √ φ= P φ(x, y ) ξ ξ

Error EL EP E√P Eφ Eφ(ξ) Eφ(ξ)ψ(ξ)

Stationarity equation KL = N ? eL ΛX K P = P ?1 N ? Λ X Kφ = 2Φ?1 N ? 2ΦΛX K φ = P ′ P ?1 N ? P ′ Λ X Φ′ Kφ = P′ξ P?1 N ? P′ξ ΛX Φ′ K(φ ? t) + Ψ′ Kψ (ψ ? tψ ) = P′ξ P?1N ? P′ξ ΛX

ΛX IX (N ? KL) IX (N ? PKP ) ΦKφ) I X (N ? 1 2 IX N ? PP′?1 K φ IX P (P′ξ )# Gφ(ξ) + Λ0 X IX P (P′ξ )# Gφ,ψ + Λ0 X

Table 5: Summary of stationarity equations. For notations, conditions and comments see Sections 3.1.1, 3.2.1, 3.3.2, 3.3.1, 4.1 and 4.2.

4.3

Linear trial spaces

Solving a density estimation problem numerically, the function φ has to be discretized. This is done by expanding φ in a basis Bl (not necessarily orthonormal) and, choosing some lmax , truncating the sum to terms with l ≤ lmax , φ=

∞ lmax l=1 l=1

cl Bl → φ =

cl Bl .

(370)

This, also called Ritz’s method, corresponds to a ?nite linear trial space and is equivalent to solving a projected stationarity equation. Using a discretization (370) the functional (186) becomes ERitz = ?( ln P (φ), N ) + 1 2 ck cl ( Bk , K Bl ) + ( P (φ), ΛX ).

kl

(371)

Solving for the coe?cients cl , l ≤ lmax to minimize the error results according to Eq.[345) and Φ′ (l; x, y ) = Bl (x, y ), (372) in 0 = ( Bl , P′ P?1 N ) ?

k

ck ( Bl , K Bk ) ? ( Bl , P′ ΛX ), ?l ≤ lmax , 85

(373)

corresponding to the lmax –dimensional equation KB c = NB (c) ? ΛB (c), with c(l ) KB (l, k ) NB (c)(l) ΛB (c)(l) = = = = cl , ( Bl , K Bk ), ( Bl , P′ (φ(c)) P?1(φ(c)) N ), ( Bl , P′ (φ(c)) ΛX ). (375) (376) (377) (378) (374)

Thus, for an orthonormal basis Bl Eq. (374) corresponds to Eq. (188) projected into the trial space by l BlT Bl . The so called linear models are obtained by the (very restrictive) choice

1

φ (z ) =

l=0

cl Bl = c0 +

l

cl zl

(379)

with z = (x, y ) and B0 = 1 and Bl = zl . Interactions, i.e., terms proportional to products of z –components like cmn zm zn can be included. Including all possible interaction would correspond to a multidimensional Taylor expansion of the function φ(z). If the functions Bl (z ) are also parameterized this leads to mixture models for φ. (See Section 4.4.)

4.4

Mixture models

The function φ(z ) can be approximated by a mixture model, i.e., by a linear combination of components functions φ (z ) = cl Bl (ξl , z ), (380)

with parameter vectors ξl and constants cl (which could also be included into the vector ξl ) to be adapted. The functions Bl (ξl , z ) are often chosen to depend on one–dimensional combinations of the vectors ξl and z . For example they may depend on some distance ||ξl ? z || (‘local or distance approaches’) or the projection of z in ξl –direction, i.e., k ξl,k zk (‘projection approaches’). (For projection approaches see also Sections 4.5, 4.8 and 4.9). 86

A typical example are Radial Basis Functions (RBF) using Gaussian Bl (ξl , z ) for which centers (and possibly covariances and also number of components) can be adjusted. Other local methods include k –nearest neighbors methods (k NN) and learning vector quantizations (LVQ) and its variants. (For a comparison see [155].)

4.5

Additive models

Trial functions φ may be chosen as sum of simpler functions φl each depending only on part of the x and y variables. More precisely, we consider functions (z ) φl depending on projections zl = Il z of the vector z = (x, y ) of all x and (z ) y components. Il denotes an projector in the vector space of z (and not in the space of functions Φ(x, y )). Hence, φ becomes of the form φ (z ) =

l

φl (zl ),

(381)

so only one–dimensional functions φl have to be determined. Restricting the functions φl to a parameterized function space yields a “parameterized additive model” φ (z ) = φl (ξ, zl ), (382)

l

which has to be solved for the parameters ξ . The model can also be generalized to a model “additive in parameters ξl ” φ (z ) =

l

φl (ξl , x, y ),

(383)

where the functions φl (ξl , x, y ) are not restricted to one–dimensional functions depending only on projections zl on the coordinate axes. If the parameters ξl determine the component functions φl completely, this yields just the mixture models of Section 4.4. Another example is projection pursuit, discussed in Section 4.8), where a parameter vector ξl corresponds to a projections ξl · z . In that case even for given ξl still a one–dimensional function φl (ξl · z ) has to be determined. An ansatz like (381) is made more ?exible by including also interactions φ(x, y ) =

l

φl (zl ) +

kl

φkl (zk , zl ) +

klm

φklm (zk , zl , zm ) + · · · .

(384)

87

The functions φkl···(zk , zl , · · ·) can be chosen to depend on product terms like zl,i zk,j , or zl,izk,j zm,n , where zl,i denotes one–dimensional sub-variables of zl . In additive models in the narrower sense [213, 92, 93, 94] zl is a subset of x, y components, i.e., zl ? {xi |1 ≤ i ≤ dx } ∪ {yj |1 ≤ j ≤ dy }, dx denoting the dimension of x, dy the dimension of y . In regression, for example, one takes usually the one–element subsets zl = {xl } for 1 ≤ l ≤ dx . In more general schemes the projections of z do not have to be restricted to projections on the coordinates axes. In particular, the projections can be (z ) optimized too. For example, one–dimensional projections Il z = w · z with z, w ∈ X × Y (where · denotes a scalar product in the space of z variables) are used by ridge approximation schemes. They include for regression problems one–layer (and similarly multilayer) feedforward neural networks (see Section 4.9) projection pursuit regression (see Section 4.8) and hinge functions [31]. For a detailed discussion of the regression case see [76]. The stationarity equation for Eφ becomes for the ansatz (381) 0 = P′l P?1 N ? Kφ ? P′l ΛX , with P′l (zl , z ′ ) = (385)

δP (z ′ ) . (386) δφl (zl ) Considering a density P being also decomposed into components Pl determined by the components φl P (z ) =

l

Pl (φl (zl )),

(387)

the derivative (386) becomes

′ P′l (zl , zk )=

δPl (zl′ ) , δφl (zl )

(388)

so that specifying an additive prior 1 2 ( φk ? tk , Kkl (φl ? tl ) ), (389)

kl

the stationary conditions are coupled equations for the component functions φl which, because P is diagonal, only contain integrations over zl –variables 0= δPl ?1 P N? δφl Klk (φk ? tk ) ? 88 δPl ΛX . δφl (390)

k

For the parameterized approach (382) one ?nds 0 = Φ′l P′l P?1 N ? Φ′l Kφ ? Φ′l P′l ΛX , with Φ′l (k, zl ) = ?φl (zl ) . ?ξk (391)

(392)

For the ansatz (383) Φ′l (k, z ) would be restricted to a subset of ξk .

4.6

Product ansatz

A product ansatz has the form φ (z ) =

l (z )

φl (zl ),

(393)

where zl = Il z represents projections of the vector z consisting of all x and y components. The ansatz can be made more ?exible by using sum of products φ (z ) = φk,l(zl ). (394)

k l

The restriction of the trial space to product functions corresponds to the Hartree approximation in physics. (In a Hartree–Fock approximation the product functions are antisymmetrized under coordinate exchange.) For additive K = l Kl with Kl acting only on φl , i.e., Kl = Kl ? l′ =l Il′ , with Il the projector into the space of functions φl = Il φl , the quadratic regularization term becomes, assuming Il Il′ = δl,l′ , ( φ, K φ ) =

l

( φl , Kl φl )

l ′ =l

( φ l ′ , φ l ′ ).

(395)

For K =

l

Kl with a product structure with respect to φl ( φ, K φ ) =

l

( φ l , K l φ l ).

(396)

In both cases the prior term factorizes into lower dimensional contributions.

89

4.7

Decision trees

Decision trees [32] implement functions which are piecewise constant on rectangular areas parallel to the coordinate axes zl . Such an approach can be written in tree structure with nodes only performing comparisons of the form x < a or x > a which allows a very e?ective hardware implementation. Such a piecewise constant approach can be written in the form φ (z ) =

l

cl

k

Θ(zν (l,k) ? alk )

(397)

with step function Θ and zν (l,k) indicating the component of z which is compared with the reference value alk . While there are e?ective constructive methods to build trees the use of gradient–based minimization or maximization methods would require, for example, to replace the step function by a sigmoid. In particular, decision trees correspond to neural networks at zero temperature, where sigmoids become step functions, and which are restricted to weights vectors in coordinate directions (see Section 4.9). An overview over di?erent variants of decision trees together with a comparison with rule–based systems, neural networks (see Section 4.9) techniques from applied statistics like linear discriminants, projection pursuit (see Section 4.8) and local methods like for example k -nearest neighbors methods (k NN), Radial Basis Functions (RBF), or learning vector quantization (LVQ) is given in [155].

4.8

Projection pursuit

Projection pursuit models [60, 102, 50] are a generalization of additive models (381) (and a special case of models (383) additive in parameters) where the projections of z = (x, y ) are also adapted φ (z ) = ξ 0 +

l

φl (ξ0,l + ξl · z ).

(398)

For such a model one has to determine one–dimensional ‘ridge’ functions φl together with projections de?ned by vectors ξl and constants ξ0 , ξ0,l . Adaptive projections may also be used for product approaches φ (z ) =

l

φl (ξ0,l + ξl · z ). 90

(399)

Similarly, φ may be decomposed into functions depending on distances to adapted reference points (centers). That gives models of the form φ (z ) =

l

φl (||ξl ? z ||),

(400)

which require to adapt parameter vectors (centers) ξl and distance functions φl . For high dimensional spaces the number of centers necessary to cover a high dimensional space with ?xed density grows exponentially. Furthermore, as the volume of a high dimensional sphere tends to be concentrated near its surface, the tails become more important in higher dimensions. Thus, typically, projection methods are better suited for high dimensional spaces than distance methods [206].

4.9

Neural networks

While in projection pursuit–like techniques the one–dimensional ‘ridge’ functions φl are adapted optimally, neural networks use ridge functions of a ?xed sigmoidal form. The resulting lower ?exibility following from ?xing the ridge function is then compensated by iterating this parameterization. This leads to multilayer neural networks. Multilayer neural networks have been become a popular tool for regression and classi?cation problems [201, 124, 156, 96, 164, 226, 24, 196, 10]. One-layer neural networks, also known as perceptrons, correspond to the parameterization φ (z ) = σ

l

wl zl ? b = σ (v ),

(401)

with a sigmoidal function σ , parameters ξ = w , projection v = l wl zl ? b and zl single components of the variables x, y , i.e., zl = xl for 1 ≤ l ≤ dx and zl = yl for dx + 1 ≤ l ≤ dx + dy . (For neural networks with Lorentzians instead of sigmoids see [72].) Typical choices for the sigmoid are σ (v ) = tanh(βv ) or σ (v ) = 1/(1 + e?2βv ). The parameter β , often called inverse temperature, controls the sharpness of the step of the sigmoid. In particular, the sigmoid functions become a sharp step in the limit β → ∞, i.e., at zero temperature. In principle the sigmoidal function σ may depend on further parameters which then — similar to projection pursuit discussed in Section 4.8 — would also have to be included in the optimization process. The threshold or bias b can be 91

treated as weight if an additional input component is included clamped to the value 1. A linear combination of perceptrons φ(x, y ) = b +

l

Wl σ

k

wlk zk ? bk ,

(402)

has the form of a projection pursuit approach (398) but with ?xed φl (v ) = Wl σ (v ). In multi–layer networks the parameterization (401) is cascaded,

mi?1

zk,i = σ

l=1

wkl,i zl,i?1 ? bk,i ) = σ (vk,i),

(403)

with zk,i representing the output of the k th node (neuron) in layer i and

mi?1

vk,i =

l=1

wkl,i zl,i?1 ? bk,i ,

(404)

being the input for that node. This yields, skipping the bias terms for simplicity φ(z, w ) = σ ?

?

mn?1 ln?1

(405) beginning with an input layer with m0 = dx + dy nodes (plus possibly nodes to implement the bias) zl,0 = zl and going over intermediate layers with mi nodes zl,i , 0 < i < n, 1 ≤ l ≤ mi to a single node output layer zn = φ(x, y ). Commonly neural nets are used in regression and classi?cation to parameterize a function φ(x, y ) = h(x) in functionals E=

i

wln?1 ,n σ ?

?

mn?2 ln?2

wln?1 ln?2 ,n?1 · · · σ ?

?

m0 l0

wl1 l0 ,1 zl0 ,0 ? · · ·?? ,

?

??

(yi ? h(xi , w ))2,

(406)

quadratic in h and without further regularization terms. In that case, regularization has to be assured by using either 1. a neural network architecture which is restrictive enough, 2. by using early stopping like training procedures so the full ?exibility of the network structure cannot completely develop and destroy generalization, where in both cases the optimal architecture or algorithm can be determined for example by cross–validation or bootstrap 92

techniques [163, 6, 225, 211, 212, 81, 39, 223, 54], or 3. by averaging over ensembles of networks [167]. In all these cases regularization is implicit in the parameterization of the network. Alternatively, explicit regularization or prior terms can be added to the functional. For regression or classi?cation this is for example done in learning by hints [2, 3, 4] or curvature–driven smoothing with feedforward networks [22]. One may also remark that from a Frequentist point of view the quadratic functional is not interpreted as posterior but as squared–error loss i (yi ? a(xi , w ))2 for actions a(x) = a(x, w ). According to Section 2.2.2 minimization of error functional (406) for data {(xi , yi)|1 ≤ i ≤ n} sampled under the true density p(x, y |f ) yields therefore an empirical estimate for the regression function dy y p(y |x, f ). We consider here neural nets as parameterizations for density estimation with prior (and normalization) terms explicitly included in the functional Eφ . In particular, the stationarity equation for functional (342) becomes 0 = Φ′w P′ P?1 N ? Φ′w Kφ ? Φ′w P′ ΛX , with matrix of derivatives Φ′w (k, l, i; x, y ) = ?φ(x, y, w ) ?wkl,i

ln?1

(407)

(408)

ln?2

= σ ′ (vn ) ···

wln?1 ,n σ ′ (vln?1 ,n?1)

wln?1 ln?2 ,n?1

li+1

wli+2 li+1 ,i+2 σ ′ (vli+1 ,i+1 )wli+1 k,i+1σ ′ (vli ,i )zl,i?1 ,

and σ ′ (v ) = dσ (v )/dv . While φ(x, y, w ) is calculated by forward propagating z = (x, y ) through the net de?ned by weight vector w according to Eq. (405) the derivatives Φ′ can e?ciently be calculated by back–propagation according to Eq. (408). Notice that even for diagonal P′ the derivatives are not needed only at data points but the prior and normalization term require derivatives at all x, y . Thus, in practice terms like Φ′ Kφ have to be calculated in a relatively poor discretization. Notice, however, that regularization is here not only due to the prior term but follows also from the restrictions implicit in a chosen neural network architecture. In many practical cases a relatively poor discretization of the prior term may thus be su?cient. Table 6 summarizes the discussed approaches. 93

Ansatz linear ansatz linear model with interaction mixture model additive model with interaction product ansatz decision trees projection pursuit neural net (2 lay.)

Functional form φ (z ) =

l ξl Bl (z )

to be optimized ξl ξ0 , ξl ξmn , · · · ξ0,l , ξl φl (zl ) φmn (zm zn ), · · · φl (zl )

φ(z ) = ξ0 + l ξl zl + mn ξmn zm zn + · · · φ (z ) = φ (z ) = + φ (z ) = φ (z ) =

l

ξ0,l Bl (ξl , z )

l

φl (zl ) mn φmn (zm zn ) + · · · φl (zl )

k l l ξl

l ξl

Θ(zξlk ? ξ0,lk ) φl (ξ0,l +

l ξl zl )

ξl , ξ0,lk , ξlk φl , ξ0 , ξ0,l , ξl ξl , ξlk

φ (z ) = ξ 0 + φ (z ) = σ (

σ(

k ξlk zk ))

Table 6: Some possible parameterizations.

94

5

5.1

Parameterizing priors: Hyperparameters

Prior normalization

In Chapter 4. parameterization of φ have been studied. This section now discusses parameterizations of the prior density p(φ|D0). For Gaussian prior densities that means parameterization of mean and/or covariance. The parameters of the prior functional, which we will denote by θ, are in a Bayesian context also known as hyperparameters. Hyperparameters θ can be considered as part of the hidden variables. In a full Bayesian approach the h–integral therefore has to be completed by an integral over the additional hidden variables θ. Analogously, the prior densities can be supplemented by priors for θ, also be called hyperpriors, with corresponding energies Eθ . In saddle point approximation thus an additional stationarity equation will appear, resulting from the derivative with respect to θ. The saddle point approximation of the θ–integration (in the case of uniform hyperprior p(θ) and with the h–integral being calculated exactly or by approximation) is also known as ML–II prior [16] or evidence framework [85, 86, 208, 146, 147, 148, 24]. There are some cases where it is convenient to let the likelihood p(y |x, h) depend, besides on a function φ, on a few additional parameters. In regression such a parameter can be the variance of the likelihood. Another example is the inverse temperature β introduced in Section 6.3, which, like φ also appears in the prior. Such parameters may formally be added to the “direct” ?. As those “additional likelihood pahidden variables φ yielding an enlarged φ rameters” are like other hyperparameters typically just real numbers, and not functions like φ, they can often be treated analogously to hyperparameters. For example, they may also be determined by cross–validation (see below) or by a low dimensional integration. In contrast to pure prior parameters, however, the functional derivatives with respect to such “additional likelihood parameters” contain terms arising from the derivative of the likelihood. Within the Frequentist interpretation of error minimization as empirical risk minimization hyperparameters θ can be determined by minimizing the empirical generalization error on a new set of test or validation data DT being independent from the training data D . Here the empirical generalization error is meant to be the pure data term ED (θ) = ED (φ? (θ)) of the error functional for φ? being the optimal φ for the full regularized Eφ (θ) at θ and 95

for given training data D . Elaborated techniques include cross–validation and bootstrap methods which have been mentioned in Sections 2.5 and 4.9. Within the Bayesian interpretation of error minimization as posterior maximization the introduction of hyperparameters leads to a new di?culty. The problem arises from the fact that it is usually desirable to interpret the error term Eθ as prior energy for θ, meaning that p (θ ) = with normalization Zθ = dθ e?Eθ , (410) represents the prior density for θ. Because the joint prior factor for φ and θ is given by the product p(φ, θ) = p(φ|θ)p(θ), (411) one ?nds e?E (φ|θ) p (φ | θ ) = . Z φ (θ ) (412) e?Eθ , Zθ (409)

Hence, the φ–dependent part of the energy represents a conditional prior energy denoted here E (φ|θ). As this conditional normalization Z φ (θ ) = dφ e?E (φ|θ), (413)

is in general θ–dependent a normalization term EN (θ) = ln Zφ (θ) (414)

must therefore be included in the error functional when minimizing with respect to θ. It is interesting to look what happens if p(φ, θ) of Eq. (409) is expressed in terms of joint energy E (φ, θ) as follows p(φ, θ) = Then the joint normalization Zφ,θ = dφ dθ e?E (φ,θ) , 96 (416) e?E (φ,θ) . Zφ,θ (415)

is independent of φ and θ and could be skipped from the functional. However, in that case the term Eθ cannot easily be related to the prior p(θ). Notice especially, that this discussion also applies to the case where Eθ is assumed to be uniform so it does not have to appear explicitly in the error functional. The two ways of expressing p(φ, θ) by a joint or conditional energy, respectively, are equivalent if the joint density factorizes. In that case, however, θ and φ are independent, so θ cannot be used to parameterize the density of φ. Numerically the need to calculate Zφ (θ) can be disastrous because normalization factors Zφ (θ) represent often an extremely high dimensional (functional) integral and are, in contrast to the normalization of P over y , very di?cult to calculate. There are, however, situations for which Zφ (θ) remains θ–independent. ? 0 ) (with Let p(φ, θ) stand for example for a Gaussian speci?c prior p(φ, θ|D the normalization condition factored out as in Eq. (90)). Then, because the normalization of a Gaussian is independent of its mean, parameterizing the mean t = t(θ) results in a θ–independent Zφ (θ). Besides their mean, Gaussian processes are characterized by their covariance operators K?1 . Because the normalization only depends on det K a second possibility yielding θ–dependent Zφ (θ) are parameterized transformations of the form K → OKO?1 with orthogonal O = O(θ). Indeed, such transformations do not change the determinant det K. They are only non–trivial for multi–dimensional Gaussians. For general parameterizations of density estimation problems, however, the normalization term ln Zφ (θ) must be included. The only way to get rid of that normalization term would be to assume a compensating hyperprior p (θ ) ∝ Z φ (θ ), resulting in an error term E (θ) = ? ln Zφ (θ) compensating EN (θ). Thus, in the general case we have to consider the functional Eθ,φ = ?(ln P (φ), N ) + (P (φ), ΛX ) + Eφ (θ) + Eθ + ln Zφ (θ). (418) (417)

writing E (φ|θ) = Eφ and E (θ) = Eθ . The stationarity conditions have the form δEφ = P′ (φ)P?1(φ)N ? P′ (φ)ΛX , (419) δφ ?Eφ ?1 ′ = ?Z′ Zφ (θ) ? Eθ , (420) ?θ 97

with Z′ (l, k ) = δ (l ? k )

?Zφ (θ) , dθl

′ Eθ (l ) =

?Eθ . ?θl

(421)

For compensating hyperprior Eθ = ? ln Zφ (θ) the right hand side of Eq. (420) vanishes. Finally, we want to remark that in case function evaluation of p(φ, θ) is much cheaper than calculating the gradient (420), minimization methods not using the gradient should be considered, like for example the downhill simplex method [191].

5.2

5.2.1

Adapting prior means

General considerations

A prior mean or template function t represents a prototype, reference function or base line for φ. It may be a typical expected pattern in time series prediction or a reference image in image reconstruction. Consider, for example, the task of completing an image φ given some pixel values (training data) [136]. Expecting the image to be that of a face the template function t may be chosen to be some prototypical image of a face. We have seen in Section 3.5 that a single template t could be eliminated for Gaussian (speci?c) priors by solving for φ ? t instead for φ. Restricting, however, to only a single template may be a very bad choice. Indeed, faces for example appear on images in many variations, like in di?erent scales, translated, rotated, various illuminations, and other kinds of deformations. We may now describe such variations by a family of templates t(θ), the parameter θ describing scaling, translations, rotations, and more general deformations. Thus, we expect a function to be similar to only one of the templates t(θ) and want to implement a (soft, probabilistic) OR, approximating t(θ1 ) OR t(θ2 ) OR · · · (See also [132, 133, 134, 135]). A (soft, probabilistic) AND of approximation conditions, on the other hand, is implemented by adding error terms. For example, classical error functionals where data and prior terms are added correspond to an approximation of training data AND a priori data. Similar considerations apply for model selection. We could for example expect φ to be well approximated by a neural network or a decision tree. In that case t(θ) spans, for example, a space of neural networks or decision trees. Finally, let us emphasize again that the great advantage and practi98

cal feasibility of adaptive templates for regression problems comes from the fact that no additional normalization terms have to be added to the error functional. 5.2.2 Density estimation

The general case with adaptive means for Gaussian prior factors and hyperparameter energy Eθ yields an error functional Eθ,φ = ?(ln P (φ), N ) + De?ning t′(l; x, y ) = ?t(x, y ; θ) , ?θl (423) 1 φ ? t(θ), K (φ ? t(θ)) + (P (φ), ΛX ) + Eθ . (422) 2

the stationarity equations of (422) obtained from the functional derivatives with respect to φ and hyperparameters θ become K(φ ? t) = P′ (φ)P?1(φ)N ? P′ (φ)ΛX , ′ t′ K(φ ? t) = ?Eθ . Inserting Eq. (424) in Eq. (425) gives

′ t′ P′(φ)P?1 (φ)N = t′ P′ (φ)ΛX ? Eθ .

(424) (425)

(426)

Eq.(426) becomes equivalent to the parametric stationarity equation (346) with vanishing prior term in the deterministic limit of vanishing prior co′ variances K?1 , i.e., under the assumption φ = t(θ), and for vanishing Eθ . Furthermore, a non–vanishing prior term in (346) can be identi?ed with the term Eθ . This shows, that parametric methods can be considered as deterministic limits of (prior mean) hyperparameter approaches. In particular, a parametric solution can thus serve as reference template t, to be used within a speci?c prior factor. Similarly, such a parametric solution is a natural initial guess for a nonparametric φ when solving a stationarity equation by iteration. If working with parameterized φ(ξ ) extra prior terms Gaussian in some function ψ (ξ ) can be included as discussed in Section 4.2. Then, analogously to templates t for φ, also parameter templates tψ can be made adaptive with hyperparameters θψ . Furthermore, prior terms Eθ and Eθψ for the 99

hyperparameters θ, θψ can be added. Including such additional error terms yields Eθ,θψ ,φ(ξ),ψ(ξ) = ?(ln P ( φ(ξ ) ), N ) + (P ( φ(ξ ) ), ΛX ) 1 + φ(ξ ) ? t(θ), K (φ(ξ ) ? t(θ)) 2 1 + ψ (ξ ) ? tψ (θψ ), Kψ (ψ (ξ ) ? tψ (θψ )) 2 +Eθ + Eθψ , and Eqs.(424) and (424) change to Φ′ K(φ ? t) + Ψ′ Kψ (ψ ? tψ ) = P′ξ P?1 N ? P′ξ ΛX , ′ t′ K(φ ? t) = ?Eθ , ′ ′ tψ Kψ (ψ ? tψ ) = ?Eθψ , (428) (429) (430)

(427)

′ ′ where t′ψ , Eθ , Eθ , denote derivatives with respect to the parameters θψ ψ or θ, respectively. Parameterizing Eθ and Eθψ the process of introducing hyperparameters can be iterated.

5.2.3

Unrestricted variation

To get a ?rst understanding of the approach (422) let us consider the extreme example of completely unrestricted t–variations. In that case the template function t(x, y ) itself represents the hyperparameter. (Such function hyperparameters or hyper?elds are also discussed in Sect. 5.6.) Then, t′ = I and Eq. (425) gives K(φ ? t) = 0 (which for invertible K is solved uniquely by t = φ), resulting according to Eq. (228) in ΛX = NX . (431)

The case of a completely free prior mean t is therefore equivalent to a situation without prior. Indeed, for invertible P′ , projection of Eq. (426) into the x– data space by ID of Eq. (257) yields

1 PD = Λ? X,D N,

(432)

where ΛX,D = ID ΛX ID is invertible and PD = ID P . Thus for xi for which yi are available N (xi , yi ) P (xi , yi) = (433) NX (xi ) 100

is concentrated on the data points. Comparing this with solutions of Eq. (191) for ?xed t we see that adaptive means tend to lower the in?uence of prior terms. 5.2.4 Regression

Consider now the case of regression according to functional (246) with an adaptive template t0 (θ). The system of stationarity equations for the regression function h(x) (corresponding to φ(x, y )) and θ becomes K 0 (h ′ t 0 K 0 (h ? t0 ) = KD (tD ? h), ? t0 ) = 0. (434) (435)

It will also be useful to insert Eq. (434) in Eq. (435), yielding 0 = t′0 KD (h ? tD ). For ?xed t Eq. (434) is solved by the template average t h = t = (K0 + KD )?1 (K0 t0 + KD tD ) , so that Eq. (435) or Eq. (436), respectively, become 0 = t′0 K0 (t ? t0 ), 0 = t′0 KD (t ? tD ). (438) (439) (437) (436)

It is now interesting to note that if we replace in Eq. (439) the full template average t by t0 we get 0 = t′0 KD (t0 ? tD ), (440) which is equivalent to the stationarity equation 0 = H′ KD (h ? tD ), 1 ED,h(ξ) = ( h(ξ ) ? tD , KD (h(ξ ) ? tD ) ) 2 (441)

(the derivative matrix H′ being the analogue to Φ′ for h) of an error functional (442)

without prior terms but with parameterized h(ξ ), e.g., a neural network. The approximation h = t = t0 can, for example, be interpreted as limit λ → ∞,

λ→∞

lim h = lim t = t0 ,

λ→∞

(443)

101

after replacing K0 by λK0 in Eq. (437). The setting h = t0 can then be 1 used as initial guess h0 for an iterative solution for h. For existing K? 0 h = t0 is also obtained after one iteration step of the iteration scheme hi = 1 i?1 t0 + K? ) starting with initial guess h0 = tD . 0 KD (tD ? h For comparison with Eqs.(439,440,441) we give the stationarity equations for parameters ξ for a parameterized regression functional including an additional prior term with hyperparameters 1 1 Eθ,h(ξ) = ( h(ξ ) ? tD , KD (h(ξ ) ? tD ) )+ ( h(ξ ) ? t0 (θ), K0 (θ)(h(ξ ) ? t0 (θ)) ), 2 2 (444) which are 0 = H′ KD (h ? tD ) + h′ K0 (h ? t0 ). (445) Let us now compare the various regression functionals we have met up to now. The non–parameterized and regularized regression functional Eh (246) implements prior information explicitly by a regularization term. A parameterized and regularized functional Eh(ξ) of the form (343) corresponds to a functional of the form (444) for θ ?xed. It imposes restrictions on the regression function h in two ways, by choosing a speci?c parameterization and by including an explicit prior term. If the number of data is large enough, compared to the ?exibility of the parameterization, the data term of Eh(ξ) alone can have a unique minimum. Then, at least technically, no additional prior term would be required. This corresponds to the classical error minimization methods used typically for parametric approaches. Nevertheless, also in such situations the explicit prior term can be useful if it implements useful prior knowledge over h. The regularized functional with prior– or hyperparameters Eθ,h (422) implements, compared to Eh , e?ectively weaker prior restrictions. The prior term corresponds to a soft restriction of h to the space spanned by the parameterized t(θ). In the limit where the parameterization of t(θ) is rich enough to allow t(θ? ) = h? at the stationary point the prior term vanishes completely. The parameterized and regularized functional Eθ,h(ξ) (444), including prior parameters θ, implements prior information explicitly by a regularization term and implicitly by the parameterization of h(ξ ). The explicit prior term vanishes if t(θ? ) = h(ξ ? ) at the stationary point. The functional combines a hard restriction of h with respect to the space spanned by the parameterization h(ξ ) and a soft restriction of h with respect to the 102

space spanned by the parameterized t(θ). Finally, the parameterized and non–regularized functional ED,h(ξ) (442) implements prior information only implicitly by parameterizing h(ξ ). In contrast to the functionals Eθ,h and Eθ,h(ξ) it implements only a hard restriction for h. The following table summarizes the discussion: Functional Eq. prior implemented Eh (246) explicitly Eh(ξ) (343) explicitly and implicitly Eθ,h (422) explicitly no prior for t(θ? ) = h? Eθ,h(ξ) (444) explicitly and implicitly no expl. prior for t(θ? ) = h(ξ ? ) ED,h(ξ) (442) implicitly

5.3

5.3.1

Adapting prior covariances

General case

Parameterizing covariances K?1 is often desirable in practice. It includes for example adapting the trade–o? between data and prior terms (i.e., the determination of the regularization factor), the selection between di?erent symmetries, smoothness measures, or in the multidimensional situation the determination of directions with low variance. As far as the normalization depends on K(θ) one has to consider the error functional Eθ,φ = ?(ln P (φ), N ) + with 1 φ ? t, K(θ) (φ ? t) + (P (φ), ΛX ) + ln Zφ (θ) + Eθ , 2 (446) Zφ (θ) = (2π ) 2 (det K(θ))? 2 , for a d–dimensional Gaussian speci?c prior, and stationarity equations K(φ ? t) = P′ (φ)P?1(φ)N ? P′ (φ)ΛX , 1 ? K (θ ) ? K (θ ) ′ φ ? t, ? Eθ . (φ ? t) = Tr K?1 (θ) 2 ?θ ?θ Here we used ? ?K ? . ln det K = Tr ln K = Tr K?1 ?θ ?θ ?θ 103 (450) (448) (449)

d 1

(447)

In case of an unrestricted variation of the matrix elements of K the hyperparameters become θl = θ(x, y ; x′ , y ′) = K(x, y ; x′, y ′ ). Then, using ? K(x, y ; x′ , y ′ ) = δ (x ? x′′ )δ (y ? y ′′ )δ (x′ ? x′′′ )δ (y ′ ? y ′′′ ), ′′ ′′ ′′′ ′′′ ?θ(x , y ; x , y ) Eqs.(449) becomes the inhomogeneous equation 1 ? K (θ ) ′ ? Eθ . (φ ? t) (φ ? t)T = Tr K?1 (θ) 2 ?θ (452) (451)

We will in the sequel consider the two special cases where the determinant of the covariance is θ–independent so that the trace term vanishes, and where θ is just a multiplicative factor for the speci?c prior energy, i.e., a so called regularization parameter. 5.3.2 Automatic relevance detection

A useful application of hyperparameters is the identi?cation of sensible directions within the space of x and y variables. Consider the general case of a covariance, decomposed into components K0 = i θi Ki . Treating the coe?cient vector θ (with components θi ) as hyperparameter with hyperprior p(θ) results in a prior energy (error) functional 1 (φ ? t, (? 2 θi Ki )(φ ? t) ) ? ln p(θ) + ln Zφ (θ). (453)

i

The θ–dependent normalization ln Zφ (θ) has to be included to obtain the correct stationarity condition for θ. The components Ki can be the compo2 2 nents of a negative Laplacian, for example, Ki = ??x or Ki = ??y . In that i i case adapting the hyperparameters means searching for sensible directions in the space of x or y variables. This technique has been called Automatic Relevance Determination by MacKay and Neal [167]. The positivity constraint 2 Ki or for a can be implemented explicitly,

赞助商链接

- Nonparametric Bayesian Classification
- Nonparametric Bayesian kernel models
- Using Bayesian networks for visualizing highdimensional data. Intelligent Data Analysis
- 数学期望的计算方法及其应用
- 1 NONPARAMETRIC FUNCTIONAL DATA ANALYSIS THROUGH BAYESIAN DENSITY ESTIMATION
- Nonparametric Bayesian kernel models
- Nonparametric Bayesian Classification
- 1 Simulation Based Bayesian Nonparametric Regression Methods
- Dynamic Bayesian Network and Nonparametric Regression for Nonlinear Modeling of Gene Networ
- local limit theory and spurious nonparametric regression
- 贝叶斯非参数性模型的matlab代码(Matlab codes for Bayesian nonparametric model)_算法理论_科研数据集
- Nonparametric bayesian methods in short run process
- 贝叶斯非参数性模型的matlab代码(Matlab codes for Bayesian nonparametric model)
- A Bayesian method for fitting parametric and nonparametric models to noisy data
- A Bayesian analysis of some nonparametric problems

更多相关文章：
更多相关标签：