9512.net
甜梦文库
当前位置:首页 >> >>

Posterior Convergence and Model Estimation in Bayesian Change-point Problems


arXiv:0808.2700v1 [math.ST] 20 Aug 2008

Posterior Convergence and Model Estimation in Bayesian Change-point Problems
Heng Lian
Division of Mathematical Sciences, SPMS, Nanyang Technological University, Singapore, 637371 Summary. We study the posterior distribution of the Bayesian multiple change-point regression problem when the number and the locations of the change-points are unknown. While it is √ relatively easy to apply the general theory to obtain the O(1/ n) rate up to some logarithmic factor, showing the exact parametric rate of convergence of the posterior distribution requires additional work and assumptions. Additionally, we demonstrate the asymptotic normality of the segment levels under these assumptions. For inferences on the number of change-points, we show that the Bayesian approach can produce a consistent posterior estimate. Finally, we argue that the point-wise posterior convergence property as demonstrated might have bad ?nite sample performance in that consistent posterior for model selection necessarily implies the √ maximal squared risk will be asymptotically larger than the optimal O(1/ n) rate. This is the Bayesian version of the same phenomenon that has been noted and studied by other authors. Keywords: change-point problems, posterior distribution, rate of convergence

1. Introduction We consider the regression problem of estimating a piece-wise constant function when the number of segments as well as the locations of its change-points is unknown. This is an old problem that has attracted much attention recently (Goldenshluger et al., 2006; Ben Hariz et al., 2007; Fearnhead, 2008). Applications of multiple change-point models surged after e?cient computations using reversible jump MCMC was discovered (Green, 1995). (Green, 1995) applied piece-wise constant function in the study of the coal mining disaster data in the context of Poisson process. A more recent trend of analysis that dispenses with the usage of MCMC for the change-points problem starts with the paper Liu and Lawrence (1999) where a dynamic programming approach is utilized to marginalize over segment levels and change-point locations. Their original motivation comes from the problem of partitioning DNA sequences into homogeneous segments. This dynamic programming approach is later extended by Fearnhead (2006); Lian (2008). Unlike the above studies, in this paper we are only concerned with the asymptotic properties of Bayesian multiple change-point problems and investigate from the frequentist view the posterior contraction characteristics of a simpli?ed model. Although a piece-wise constant function involves only a ?nite number of parameters, as we will only consider the case where an upper bound on the number of change-points is available a priori, it is nevertheless best studied from a in?nite-dimensional viewpoint and put the estimation problem in the context of function spaces. Until recently, little is known about the behavior of the posterior distribution of in?nite-dimensional models. For consistency issues, Schwartz (1965) shows that the posterior is consistent when certain tests can be established for the true

2

Lian, H.

distribution versus the complement of its neighborhood. Barron et al. (1999) further developed the theory by sieve construction and metric entropy bounds. Convergence rates are studied in two independent and to some extent overlapping but complementary works (Ghosal et al. (2000) and Shen and Wasserman (2001)). In particular, Ghosal et al. (2000) extends the idea of constructing suitable tests in order to bound the convergence rates for both nonparametric and parametric problems, and Ghosal and Van Der Vaart (2007) further extends the approach to non-i.i.d. observations. The existence of tests for many speci?c problems can be found in the existing literature although sometimes new tests need to be carefully designed. In nonparametric Bayesian analysis, we have an i.i.d. sample Z1 , . . . , Zn from the distribution P0 with density p0 with respect to some measure on the sample space (Z, B). The model space is denoted by P which is known to contain the true distribution P0 . Given some prior distribution Π on P, the posterior is a random measure given by Πn (A|Z1 , . . . , Zn ) =
A n i=1 p(Zi )dΠ(P ) n i=1 p(Zi )dΠ(P )

.

For ease of notation, we will omit the explicit conditioning and write Πn (A) for the posterior distribution. We say that the posterior is consistent if
n Πn (P ∈ P : d(P, P0 ) > ?) → 0 in P0 probability

for any ? > 0, where d is some suitable distance function between probability measures. To study rates of convergence, let ?n be a sequence decreasing to zero, we say the rate is at least ?n if for su?ciently large constant M
n Πn (P : d(P, P0 ) > M ?n ) → 0 in P0 probability.

We also need a slightly weaker de?nition of rates of convergence by replacing M with a sequence Mn and requiring that the above posterior mass converge to zero for any sequence Mn that diverges to in?nity. This de?nition is usually required in parametric problems to get rid of the extra log n factor in the convergence rates. In our regression problem, we observe an i.i.d. sample Z = (Z1 , . . . , Zn ) with the distribution of Zi = (Xi , Yi ) de?ned structurally by Yi = θ0 (Xi ) + ?i for i.i.d Gaussian noise ?i ? N (0, 1). and θ0 is a piece-wise constant function on [0, 1) with k0 unknown locations of change-points. We can write θ0 (t) = j=1 aj I(tj?1 ≤ t < tj ), t0 = 0 < t1 < t2 < . . . < tk0 = 1 using the indicator function and thus θ0 is parameterized by (a, t), a = (a1 , . . . , ak0 ), t = (t0 , . . . , tk0 ). For simplicity, we assume the marginal distributions for {Xi } are i.i.d uniform on [0, 1), and note that it is straightforward to extend all the following results to any distribution of X with density bounded away from zero and in?nity. Note P0 is fully determined by θ0 under these assumptions, and thus we also use the space of piece-wise constant functions as our model space which is equivalent to using P. The measure induced by θ is denoted by Pθ , and thus Pθ0 is the same as P0 , the true distribution. Lian (2007a) studied the consistency issue for the above model with the exception that there Xi ’s are deterministically chosen on a grid. In that paper, consistency is proved for the

Bayesian change-point problem

3

case that the true regression function is in the Lipschitz class as well. Another related work is Scricciolo (2007) where the Bayesian density estimation problem is studied with density approximated by piece-wise constant functions. Besides the fact that they are interested in density estimation instead of regression, the focus of that paper is very di?erent from the current one. They are mostly concerned with the case of approximating a smooth density using step functions and aim to achieve the optimal rates up to a logarithmic factor. For density functions that are piece-wise constant, they prove parametric rate of convergence also with an extra logarithmic factor. A diverging number of grid points is used and thus this approach cannot be used to estimate the number of segments when the density is truly piece-wise constant. One simpli?cation of our model compared with those works mentioned at the beginning of this section that focus on the computational issues is that the variance of the noise is assumed to be known here (and actually 1 without loss of generality). Investigations of Bayesian regression with unknown noise levels can run into additional technical di?culties especially in the design of appropriate tests. Consistency of a regression problem with unknown noise level is addressed in Choi and Schervish (2007). We hope to be able to address this problem in our context in a future paper. In this paper, we focus on the case θ0 is piece-wise constant and aim to achieve the √ exact parametric O(1/ n) rates of convergence and also study the posterior consistency in the estimation of the number of change-points, which we refer to as the model selection problem. The proofs for the estimation rates involve direct application of general theorems in Ghosal et al. (2000) but the calculation of the covering number is nontrivial in this case. In order to achieve the exact parametric rate, an additional assumption needs to be made to exclude functions with segment lengths that are too short. 2. Main results Consider the case where we have some a priori bounds for the number of change-points as well as for the segment levels {aj }. The model space is de?ned as
k

Θ = {θ : θ(t) =

j=1

aj I(tj?1 ≤ t ≤ tj ), t0 = 0 < t1 < . . . < tk = 1, k ≤ kmax , |aj | ≤ K} .

By convention, we say θ with tk = 1 has k change-points, which is the same as the number of segments. Another equivalent representation of Θ is Θ = {(a, t) ∈ [?K, K]k × Tk : 1 ≤ k ≤ kmax , where Tk is the set of (k + 1)-tuples (t0 , . . . , tk ) with tj < tj+1 . We will not distinguish between these two di?erent representations and θ can denote either a function or the tuple (a, t). This ambiguity can always be resolved by the context. For rates of convergence, the distance d we use is the L2 norm of the function ||θ|| = θ2 (x)dx . Since we only consider uniformly bounded functions, the L2 norm is equivalent to the Hellinger distance (e.g.Ghosal and Van Der Vaart (2007), section 7.7). We now specify a prior on Θ using a hierarchical approach. Let Θk be the subspace of Θ that consists of functions with k change-points, and the prior Π is speci?ed as a mixture
kmax 1 0 1/2

Π=
k=1

p(k)Πk , p(k) > 0,
k

p(k) = 1

4

Lian, H.

with Πk the prior measure on Θk . We assume that Πk has a density πk (θ) which can be further decomposed as a t πk (a, t) = πk (a|t)πk (t) . The assumption we make on the prior is that a x (A) The density πk (a|t) and πk (t) are bounded away from zero and in?nity on [?K, K]k and Tk respectively. This assumption is satis?ed, for example, when t1 , t2 , . . . , tk?1 are distributed as the order statistic of k?1 points uniform distributed on [0, 1) while segment levels are independent and uniformly distributed on [?K, K]. The ?rst simple result shows that the posterior rate of convergence is n?1/2 up to a logarithmic factor. Theorem 1. Under assumption (A), the posterior rate of convergence is at least ?n = n log1/2 n/n1/2 , i.e. Πn (θ : ||θ ? θ0 || > M ?n ) → 0 in P0 probability for su?ciently large M . Theorem 1 considers the convergence rates of the estimation problem. A di?erent problem is the convergence of the posterior for the number of change-points. Under no additional assumptions, we can show that the posterior probability will concentrate on the true number of change-points with probability converging to 1.
n Theorem 2. Under the same assumption (A), we have Πn (k = k0 ) → 1 in P0 probability, where k0 is the number of change-points for the true function θ0 .

Nonparametric Bayesian model selection has been investigated in Ghosal et al. (2008). The focus of that paper is on conditions under which the adaptive rates are achieved when simultaneously considering models with di?erent rates of contraction. Thus it seems the results presented there cannot be directly applied in our case. To get rid of the extra logarithmic factor in Theorem 1, one would use the more re?ned Theorem 2.4 in Ghosal et al. (2000), using local covering number instead of the global one. Nevertheless, as shown by Lemma 4 in the appendix, the local covering number for Θ is not √ bounded as would be required if we set ?n = O(1/ n). Instead, we consider the smaller model space Θδ = {θ ∈ Θ : min |tj ? tj?1 | ≥ δ} . We can de?ne Θδ in a similar way and assumption (A) can be modi?ed accordingly. Speck t i?cation of a prior on Θδ is easy. Conceptually, we can just restrict πk (t) to be supported δ Θ and renormalize the density. Reversible jump algorithms can be easily modi?ed to take into account the constraint. Dynamic programming can also incorporate the pre-speci?ed shortest possible segment length (Lian (2007b). Theorem 1 and Theorem 2 is still true on Θδ with few modi?cations on the proofs. Green (1995) also noticed the practical advantage of avoiding short steps. They proposed using even-numbered order statistics from 2k ? 1 uniformly distributed points so that short segment lengths are better penalized. As shown in the appendix, putting some lower bound on the segment lengths makes the local covering number bounded by a constant. This requires a very detailed argument to construct the covering. Using this more re?ned bound on the covering number, we can achieve the exact parametric rate.

Bayesian change-point problem

5

Theorem 3. For any δ √ 0, under assumption (A), the posterior rate of convergence > on Θδ is at least ?n = O(1/ n). That is, for every Mn → ∞, we have that Πn (θ ∈ Θδ : n ||θ ? θ0 || > Mn ?n ) → 0 in P0 probability. Combination of Theorem 2 and 3 immediately gives us the rates of convergence for the change-point locations: Corollary 1. Under the same assumptions as above, the posterior convergence rate for the change-point locations is at least ?2 = O(1/n), that is , for any sequence Mn → ∞, n n Πn (max1≤j≤k0 |tj ? t0 | > Mn ?2 ) → 0 in P0 probability. This rate of course agrees with n j many frequentist approaches, say using the cumulative sum. It is well-known that the posterior distribution in regular parametric models conditionally converges to a Gaussian distribution under weak conditions. Since the previous results show that the number and locations of the change-points can be consistently estimated, one would naturally conjecture that the posterior distribution for segment levels will converge to a multivariate Gaussian distribution. This is indeed the case as stated in the following theorem: Theorem 4. Suppose the true segment lengths are lj = t0 ? t0 , j = 1, . . . , k0 . Denotj j?1 ing the posterior distribution of a = (a1 , . . . , ak ) restricted on the event k = k0 (which has a posterior probability converging to 1) by Πn and the covariance matrix I0 = diag(lj · n), a|Z then we have a EZ|θ0 ||Πn ? N (?(t0 ), I0 )||T V → 0 , a|Z where ||P ?Q||T V is the total variation distance between probability measures P and Q, a(t0 ) ? is the maximum likelihood estimator for a, assuming the true locations of the change-points are known. The above theorems show that the Bayesian procedure possesses very good properties. On the one hand, the exact parametric rate is achieved for the estimation problem in the function space. On the other hand, the number of change-points can be consistently estimated. This is reminiscent of the recent literature on the oracle property of penalized estimators. As shown in Fan and Li (2001), the SCAD estimator, when the smoothing parameter is chosen appropriately, can estimate the zero coe?cients in a linear regression model as exactly zero with probability converging to one as sample size increases. At the same time, the estimator is still consistent for nonzero coe?cients and the asymptotic distribution is the same whether or not the correct zero positions are known. This is called the oracle property by Fan and Li (2001). More recently, Leeb and Potscher (2008) showed that the oracle property might be misleading in terms of the estimator’s ?nite sample performance and it is impossible to adapt to the unknown zero restrictions without paying a price. The caveat lies in the point-wise nature of the asymptotic theory laid out in Fan and Li (2001). The authors of Leeb and Potscher (2008) show that an unbounded (normalized) risk results for any estimator possessing the sparsity property. Another related work is Yang (2005) where the author shows that AIC has a minimax property which cannot be shared with any model selection consistent estimators in a regression problem. In our context, similar conclusion can be drawn for the Bayesian multiple change-point problem. Theorem 2 and Theorem exactrate apply to a ?xed true piece-wise constant function and thus the convergence as stated√ point-wise in nature. It is not di?cult to is see from the proof of Theorem 3 that the 1/ n rate is not actually uniform over the class

6

Lian, H.

Θδ . The reason is that to obtain the bound for the local covering number (Lemma 5 in the appendix), the constants involved does depend on θ0 . In particular, the derivation of the lemma requires a lower bound on the size of the jumps of the neighboring segments and thus the convergence is not uniform over Θδ . Intuitively, small jumps makes the estimation more di?cult and heavier penalization by the prior must be entertained (possibly by using a prior that depends on the sample size) to achieve model selection consistency at the cost of losing estimation accuracy. As seen√ the proof of Theorem 5, the di?culty occurs when in the size of the jump is of order O(1/ n), in which case it becomes di?cult to detect the change-point. Nevertheless, as discussed above, the convergence is uniform if we further restrict our attention on the sub-class: Θδ1 ,δ2 = {θ ∈ Θδ1 , min |aj ? aj?1 | ≥ δ2 }. We state the uniform convergence as a proposition without proof: Proposition 1. For any ?xed δ1 , δ2 > 0, the rate of convergence is uniformly at least √ ? ?n = O(1/ n). That is, for any Mn → ∞, supθ∈Θδ1 ,δ2 EZ|θ Πn (θ ∈ Θδ1 ,δ2 : ||θ ? θ|| > ? ? Mn ?n ) → 0. The property of model selection consistency is still satis?ed in this case. On the other hand, the following result con?rms that we cannot expect the posterior to converge uniformly over the class Θδ if the method can adapt to the number of changepoints. Note that the theorem applies for any Bayesian posterior distribution for the changepoint problem, not just the speci?c prior we constructed. Theorem 5. Suppose the posterior distribution satis?es the model consistency condin tion: Πn (k = k0 ) → 1 in P0 probability, then the maximal L2 convergence of θ is necessarily √ slower than the parametric rate ?n = O(1/ n). That is, for some Mn → ∞,
? θ∈Θδ

? sup EZ|θ Πn (θ ∈ Θδ : ||θ ? θ|| > Mn ?n ) → 1 . ?

. The above theorem demonstrated the trade-o? between function estimation and model selection for our Bayesian multiple change-point problems. 3. Discussion In this paper, we investigated in detail some asymptotic properties of Bayesian multiple change-point problems when the noise level is assumed known. We proved estimation rate of convergence as well as model selection consistency of the posterior distribution. The main contribution of the paper is to show that the exact parametric rate is achieved for a restricted class of piece-wise constant functions and that this optimal rate cannot be achieved uniformly over the class. Our theory still leaves some gaps in between. For example, it is still unknown whether it is absolutely necessary to restrict the functions to have not too short segment lengths in order to achieve the optimal rate. The additional restriction makes the local covering number bounded in order to apply the Theorem in Ghosal et al. (2000). Besides, the situation with unknown error level is of signi?cant practical importance in which case one should also put a prior on the noise level. The convergence property in this case is still an open problem.

Bayesian change-point problem

7

Appendix Some Lemmas In preparation for the proofs of the main results, we ?rst collect some lemmas here. Lemma 4 below shows that the local covering number is unbounded as remarked in the main text and is not used further in any other proofs. The constant C is used to denote a generic constant which might not be the same at di?erent places. Note that since we are only considering uniformly bounded class of functions, the Hellinger distance, the KullbackLeibler divergence, as well as the second moment of the likelihood ratio are all equivalent to the L2 norm of the regression function. In the following, we set δ0 = min{minj |t0 ? j t0 |, minj |a0 ? a0 |} > 0, which bounds the segment lengths as well as the jump size j?1 j j?1 from below. Lemma 1. Under condition (A), we have the lower bound for the prior concentration when ?n → 0, 3k Π(θ : ||θ ? θ0 || ≤ ?n ) ≥ Cp(k0 )?n 0 ?2 . Proof. When θ =
?2 n 8K 2 kmax k0 j=1

|tj ? t0 | < , 1 ≤ j ≤ k0 ? 1, it is easy to show that ||θ ? θ0 ||2 < ?2 . Since the prior n j density for (a, t) is bounded away from zero, we get
3k Π(θ : ||θ ? θ0 || ≤ ?n ) ≥ p(k0 )Πk0 (θ : ||θ ? θ0 || ≤ ?n ) ≥ Cp(k0 )?n 0 ?2 .
0 Lemma 2. Let δ ′ = 4kmax (δ0 is de?ned immediately before Lemma 1). When ? < δ ′ , we have that Πk (θ ∈ Θk : ||θ ? θ0 || ≤ ?) ≤ C?3k0 ?2 , k = 1, . . . , kmax , where C is a constant that depends on K, kmax and δ0 . For k > k0 , the bound can be re?ned to C?3k0 ?1 .

aj I(tj?1 ≤ t < tj ) ∈ Θk0 with |aj ? a0 | < ?n /2, 1 ≤ j ≤ k0 and j

δ3

Proof. First we consider the case k < k0 and θ ∈ Θk . By the de?nition of δ0 , the k0 ? 1 intervals (t0 ? δ0 /2, t0 + δ0 /2), j = 1, . . . k0 ? 1 are nonoverlapping. Thus there is at least j j one segment of θ that includes one of these k0 ? 1 intervals. Thus the distance between θ and θ0 is at least δ(δ/2)2 ≥ δ ′ , and thus Πk (θ ∈ Θδ : ||θ ? θ0 || ≤ ?) = 0. k When k ≥ k0 , θ ∈ Θk and ||θ ? θ0 || < ?, for any j, let s(j) be the index of the interval [ts(j)?1 , ts(j) ) which has the largest overlap with [t0 , t0 ). Obviously the length of the j?1 j
δ0 distance between θ and θ0 is at least (as(j) ? a0 )2 kmax > ? ). Similarly, let t(j) be the j 2 index of the change-point of θ that is closest to t0 , we have |tt(j) ? t0 | ≤ 4?2 (otherwise the j j δ0 2 0 squared distance will be bigger than 4?2 ( δ2 )2 = ?2 ). The above considerations give us k0 δ0

overlap is at least δ0 /kmax . This implies |as(j) ? a0 | ≤ ? j

kmax δ0 2

(otherwise the squared L2

constraints on the segments levels of θ as well as k0 ? 1 constraints on the change-point locations. Thus under assumption (A), the prior probability Πk (θ ∈ Θk : ||θ ? θ0 || ≤ ?) is bounded by C?3k0 ?2 . For re?ned bound, we consider k = k0 + 1 only for simplicity. Without loss of generality, we assume t(j) = j and thus ||θ ? θ0 || ≤ ? implies an additional restriction (a00 ? ak0 +1 )2 (1 ? tk0 ) ≤ ?2 . This gives us an additional factor of ? in the bound. k Lemma 3. log D(?, Θ) ≤ b log(1/?) + c, for some constants b, c > 0 that depends on K and kmax , where D(?, Θ) is the ??covering number of Θ.

8

Lian, H.

Proof. Choose a grid on the domain [0, 1) and another grid on [?K, K] ?t = ?2 · i, i ∈ N ∩ [0, 1], ?y = {? · i, i ∈ Z} ∩ [?K, K] . 8K 2 kmax

? Let Θ = {θ ∈ Θ, θ jumps only at points in ?t and takes segment levels in ?y }. It is then ? easy to show that Θ is an ??covering of Θ with covering number bounded by ? 8K ?kmax ? + 1 2K 2 + 1)kmax . ( ? kmax The next lemma considers the local covering/packing number. In particular, Lemma 4 illustrates why we cannot apply Theorem 2.4 from Ghosal et al. (2000) to obtain the exact parametric rate on Θ. Lemma 4. log D(?/2, {θ ∈ Θ, ? ≤ ||θ ? θ0 || ≤ 2?) ≥ C/?2 for some constant C. Proof. Without loss of generality, we consider θ0 = 0. We construct a lower bound for the packing number. For simplicity, we assume 1/(4?2) is an integer. Using the partition of 1/4?2 the interval [0, 1) = ∪i=1 [(i ? 1)4?2, i · 4?2 ) and construct the piece-wise constant functions θi (t) = I((i ? 1)4?2 ≤ t < i · 4?2 ). Obviously, for this set of functions, we have ||θi || = 2? √ and ||θi ? θj || = 2 2?. The lower bound for the covering number is obtained by the simple relationship between covering number and packing number. Lemma 5. For 2? < δ ′ =
3 δ0 4kmax , 2

log D(?/2, {θ ∈ Θδ , ? ≤ ||θ ? θ0 || ≤ 2?) ≤ C for some constant C that depends on δ, δ0 , K and kmax but does not depend on ?. Proof. Suppose that ||θ ? θ0 || ≤ 2?. From the proof of Lemma 2, we know that each change2 point of θ0 has a corresponding change-point of θ that satis?es |tt(j) ? t0 | ≤ 16?2 /δ0 . For j 0 any segment level aj of θ0 , denote the corresponding index of the segment of θ that has an √ √ overlap of at least δ/2 by r(j), by similar argument as Lemma 2, |aj ? a0 | ≤ 2 2?/ δ. r(j) To construct a covering, we partition [0, 1) into nonoverlapping intervals. In the following, M, B, N are su?ciently large integers to be chosen later. First, each interval 2 2 [t0 ? 16?2 /δ0 , t0 + 16?2 /δ0 ] is partitioned into M subintervals with equal lengths. For the j j rest of [0, 1) we partition it into segments of lengths between δ/2B and δ/B. Obviously the total number of subintervals does not depend on ?. These subintervals falls into two types: (i) the subinterval that contains some change-point of θ0 ; (ii) the subinterval that is entirely contained in some segment of θ0 . The function class F that forms a covering is de?ned as the set of functions which is piece-wise constant with respect to√ partition, takes a value of 0 the 2? on type (i) subintervals and takes values of the form a0 + 2 √δ i, i = ?N, ?(N ? 1), . . . , N, j N on type (ii) subintervals if the subinterval is contained in segment j of θ0 . The size of F is a constant independent of ? and we show next that it is indeed a ?/2-covering.

Bayesian change-point problem

9

On subintervals of type (i), the squared L2 distance between F and θ restricted on these 16?2 intervals are at most Mδ2 kmax K 2 . Type (ii) subintervals can further be divided into three types: (iii) it contains a change-point of θ which is closest to some change-point of θ0 ; (iv) it contains a change-point of θ other than those closest to some change-point of θ0 ; (v) it is entirely contained in some segment of θ. On subintervals of type (iii) the squared 16?2 distance is at most Mδ2 kmax K 2 . On subintervals of type (iv) the squared distance is at √ √ 2? 2? δ most B kmax ( 2√δ )2 . On subintervals of type (v) the squared distance is at most ( 2 √δ )2 . N Thus when M, B, N is large enough, we have a ?/2-covering. Proofs of the main results Proof of Theorem 1. We apply Theorem 2.1 in Ghosal et al. (2000) with ?n = C log n/n. Condition (2.2) for that theorem is veri?ed in Lemma 3, condition (2.3) is trivially satis?ed and condition (2.4) is veri?ed in Lemmas 1. Proof of Theorem 2. Theorem 1 immediately implies that the under-estimation proban blity Πn (k < k0 ) → 0 in P0 probability. For over-estimation, it is su?cient to show that pn (Z) pn (Z) n n θ P0 ( Θk pn (Z) dπk0 (θ) < Cn?(3k0 ?2+2ξ)/2 ) → 0 for some 0 < ξ < 1/2, and P0 ( Θk pθ (Z) dπk (θ) > n
0 0 0

(log n)?1 n?(3k0 ?2+2ξ)/2 ) → 0, when k > k0 . step 1. Let Un = {t ∈ Tk0 : t = t0 + u, u ∈ Rk0 +1 , u0 = uk0 = 0, |ui | < c/n} with t Πk0 (Un ) ≥ c′ n?k0 +1 , where Πt 0 is the prior measure on the locations of change-points. For k any ?xed t ∈ Un , with probability converging to 1, by considering a small neighborhood of the maximum likelihood estimator a(t) for the given t as in Laplace approximation, we ? have n n pn (Z) C p(?(t),t) (Z) C p(a0 ,t) (Z) a θ dπk0 (a|t) ≥ k /2 ≥ k /2 . pn (Z) pn (Z) pn (Z) n 0 n 0 0 0 0
pn ,t) (Z) (a0 is normally distributed with mean pn (Z) 0 k0 ?1 0 1 2 2 2 ? 2 f (t) and variance f (t) , where f (t) = j=1 (aj+1 ? a0 )2 · nj , and nj is the number of j Xi that falls into the subinterval [t0 , tj ) or [tj , t0 ) (depending on the sign of uj ). Since nj j j pn (Z) ≥ is Binomial distributed with mean less than c, f (t)2 = Op (ξ log n) and thus log (a0 ,t) p0 (Z)

For any t ∈ Un , and conditional on {Xi }, log

?ξ log n with probability converging to 1. Thus, with probability converging to 1, we have pn (Z) θ dπk0 (θ) ≥ Πt 0 (Un ) nkC/2 n?ξ = Cn?(3k0 ?2+2ξ)/2 . k 0 Θk0 pn (Z) 0 √ 1 step 2. Letting δn = 2 log nn3k0 ?2(1?ξ) , and ?n = C log n/ n, we have that pn (Z) θ dπk (θ) > (log n)?1 n?(3k0 ?2+2ξ)/2 ) n Θk p0 (Z) pn (Z) pn (Z) n n θ θ dπk (θ) > δn ). P0 ( n (Z) dπk (θ) > δn ) + P0 ( n {||θ?θ0 ||>?n }∩Θk p0 (Z) {||θ?θ0||≤?n }∩Θk p0
n P0 (



By the Markov inequality and Fubini’s theorem, the ?rst term above is bounded by δ1 πk (||θ? n 3k θ0 || ≤ ?n ) ≤ δ1 ?n 0 ?1 → 0, where we have made use of Lemma 2. n For the second term, we apply Theorem 1 of Shen and Wasserman (2001) with ? in that theorem replaced by ?n de?ned above. Using
√ 2? ?2 /28

b log(1/?) + c ≤

b log

√ 28 + c · 2? , 2 ?

10

Lian, H.

the entropy condition in that theorem can be veri?ed for ? = ?n . Thus when C is large enough, the second term also converges to 0. √ Proof of Theorem 3. We apply Theorem 2.4 in Ghosal et al. (2000) using ?n = A/ n with A su?ciently large. Condition (2.7) for that theorem is veri?ed in Lemma 5, and (2.8) 0 ||≤2j? is trivially satis?ed. Now we verify (2.9), for which we need to bound Π(j?n ≤||θ?θ||≤?n ) n ) . Π(||θ?θ0 √ ′ ′ When j < δ n/2A, 2j?n < δ and Lemma 2 can be directly applied to obtain that 0 ||≤2j? Πk (θ ∈ Θδ : ||θ ? θ0 || ≤ 2j?n ) ≤ C(j?n )3k0 ?2 , and we get Π(j?n ≤||θ?θ||≤?n ) n ) ≤ Cj 3k0 ?2 ≤ k Π(||θ?θ0 √ 0 ||≤2j? Cexp(A2 j 2 /2). For j ≥ δ ′ n/2A, we bound the probability by 1, and Π(j?n ≤||θ?θ||≤?n) n ) ≤ Π(||θ?θ0 C(1/?n )3k0 ?2 ≤ Cexp(A2 j 2 /2) for this range of j. Proof of Corollary 1. By Theorem 2, we can assume the number of change-points of θ || is also k0 . Then max1≤i≤k0 |ti ? t0 | > Mn ?2 implies that ||θ ? θ0√2 > (δ0 /2)2 Mn ?2 . Thus n n i Πn (max1≤i≤k0 |ti ? t0 | > Mn ?2 ) ≤ Πn (θ ∈ Θδ : ||θ ? θ0 || > (δ0 /2) Mn ?n ) → 0. n i C Proof of Theorem 4. Fixing one t ∈ Tk0 = {t ∈ Tk0 : maxj |tj ? t0 | ≤ C/n}, denote j n the maximum likelihood estimator for a by a(t). Let Πn ? a|t,Z and Πt|Z be the posterior measure for a conditioning on t and the posterior measure for t respectively. The classical Bernstein-von Mises Theorem implies that E0 ||Πn a a|t,Z ? N (?(t), I0 )||T V → 0. We have that E0 ||Πn ? N (?(t0 ), I0 )||T V a a|Z ≤ ||
C Tk 0

Πn dΠn ? N (?(t0 ), I0 )||T V + || a a|t,Z t|Z

C (Tk )c 0

Πn dΠn ? N (?(t0 ), I0 )||T V a a|t,Z t|Z

= (I) + (II). (I) can be bounded by E0 || ≤ E0 a Πn dΠn ? N (?(t0 ), I0 )||T V t|Z a|t,Z
n ||Πn a a|t,Z ? N (?(t), I0 )||T V dΠt|Z + E0

C Tk 0

C Tk

0

C Tk

0

||N (?(t), I0 ) ? N (?(t0 ), I0 )||T V dΠn . a a t|Z

The ?rst term converges to zero by the boundedness of the TV norm and the Fubini’s √ theorem. The second term converges to zero since ||?(t) ? a(t0 )|| = op (1/ n). Letting n a ? goes to in?nity and then C goes to in?nity, we see that E0 ||Πn ? N (?(t0 ), I0 )||T V → 0. a a|Z Proof of Theorem 5. Fix any number M > 0 and γ > 2M . De?ne θ0 = 0 and θn = γ γ √ I( 1 ≤ t < 1), a function with a single change-point and jump size √ . We trivially have 2 n n

γ M ||θ ? θn || ≥ 2√n > √n for all θ ∈ Θ1 (i.e. θ is a constant function). Under θ0 , the posterior probability on Θ1 converges to 1 by Theorem 2. This gives us √ √ EZ|θ0 Πn (θ : ||θ?θn || > M/ n) ≥ EZ|θ0 Πn (θ : ||θ?θn || > M/ n, θ ∈ Θ1 ) ≥ EZ|θ0 Πn (Θ1 ) → 1.

n n Since the measure P0 induces by θ0 and the measure Pθn induced by θn are mutually contiguous (this is a straightforward extension of Theorem 7.2 in Vaart (1998)), we have

? θ∈Θδ

? sup EZ|θ Πn (θ ∈ Θδ : ||θ ? θ|| > M ?n ) ≥ EZ|θn Πn (θ ∈ Θδ : ||θ ? θn || > M ?n ) → 1. ?

Since this is true for any M , it is also true for some slowly diverging sequence Mn as in the statement of the theorem.

Bayesian change-point problem

11

References Barron, A., M. J. Schervish, and L. Wasserman (1999). The consistency of posterior distributions in nonparametric problems. Annals of Statistics 27 (2), 536–561. Ben Hariz, S., J. J. Wylie, and Q. Zhang (2007). Optimal rate of convergence for nonparametric change-point estimators for nonstationary sequences. Annals of Statistics 35 (4), 1802–1826. Choi, T. and M. J. Schervish (2007). On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis 98 (10), 1969–1987. Fan, J. Q. and R. Z. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (456), 1348–1360. Fearnhead, P. (2006). Exact and e?cient bayesian inference for multiple changepoint problems. Statistics and Computing 16 (2), 203–213. Fearnhead, P. (2008). Computational methods for complex stochastic systems: a review of some alternatives to mcmc. Statistics and Computing 18 (2), 151–171. Ghosal, S., J. K. Ghosh, and A. W. Van der Vaart (2000). Convergence rates of posterior distributions. Annals of Statistics 28 (2), 500–531. Ghosal, S., J. Lember, and A. Van Der Vaart (2008). Nonparametric bayesian model selection and averaging. Electronic Journal of Statistics 2, 63–89. Ghosal, S. and A. Van Der Vaart (2007). Convergence rates of posterior distributions for noniid observations. Annals of Statistics 35 (1), 192–223. Goldenshluger, A., A. Tsybakov, and A. Zeevi (2006). Optimal change-point estimation from indirect observations. Annals of Statistics 34 (1), 350–372. Green, P. J. (1995). Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika 82 (4), 711–732. Leeb, H. and B. M. Potscher (2008). Sparse estimators and the oracle property, or the return of hodges’ estimator. Journal of Econometrics 142 (1), 201–211. Lian, H. (2007a). On the consistency of bayesian function approximation using step functions. Neural Computation 19 (11), 2871–2880. Lian, H. (2007b). Some topics on statistical theory and applications. Ph. D. thesis, Brown University. Lian, H. (2008). Bayes and empirical bayes inference in change-point problems. Communications in Statistics - Theory and Methods To appear. Liu, J. S. and C. E. Lawrence (1999). Bayesian inference on biopolymer models. Bioinformatics 15 (1), 38–52. Schwartz, L. (1965). On bayes procedures. Z. Wahrsch. Verw. Gabiete 4, 10–26.

12

Lian, H.

Scricciolo, C. (2007). On rates of convergence for bayesian density estimation. Scandinavian Journal of Statistics 34 (3), 626–642. Shen, X. T. and L. Wasserman (2001). Rates of convergence of posterior distributions. Annals of Statistics 29 (3), 687–714. Vaart, A. W. v. d. (1998). Asymptotic statistics. Cambridge, UK ; New York: Cambridge University Press. Yang, Y. H. (2005). Can the strengths of aic and bic be shared? a con?ict between model indenti?cation and regression estimation. Biometrika 92 (4), 937–950.


赞助商链接

更多相关文章:
更多相关标签:

All rights reserved Powered by 甜梦文库 9512.net

copyright ©right 2010-2021。
甜梦文库内容来自网络,如有侵犯请联系客服。zhit325@126.com|网站地图