Nonparametric Methods for Doubly Truncated Data Bradley Efron and Vah? e Petrosian
Abstract Truncated data plays an important role in the statistical analysis of astronomical
arXiv:astroph/9808334v1 28 Aug 1998
observations as well as in survival analysis. The motivating example for this paper concerns a set of measurements on quasars in which there is double truncation. That is, the quasars are only observed if their luminosity occurs within a certain ?nite interval, bounded at both ends, with the interval varying for di?erent observations. Nonparametric methods for the testing and estimation of doubly truncated data are developed. These methods extend some known techniques for data that is only truncated on one side, in particular LyndenBell’s estimator and the truncated version of Kendall’s tau statistic. However the kind of hazard function arguments that underlie the onesided methods fail for twosided truncation. Bootstrap and Markov Chain Monte Carlo techniques are used here in their place. Finally, we apply these techniques to the quasar data, answering a question about their longterm luminosity evolution.
Key Words: tau test, LyndenBell estimator, bootstrap, hypothesis test, Markov chain Monte Carlo, quasars, luminosity evolution, selfconsistency. Acknowledgement We are grateful to Susan Holmes, Persi Diaconis, and Duncan Mur
dach for guidance on the MCMC methodology.
1
1
Introduction
Figure 1 shows an example of doubly truncated astronomical data. The plotted points are redshifts zi and log luminosities yi for n = 210 quasars, as explained in Section 6. Due to experimental constraints the distribution of each yi is truncated to a known interval Ri depending on zi . Truncation means that we would not have learned of yi’s existence if it fell outside of region Ri . Many experimental situations lead to truncated data, see for example McLauren et al. (1991). The double truncation seen in Figure 1, where yi goes undetected if it is either too small or too large, is less common than onesided truncation. The quasar data has still another kind of truncation, of the redshift values zi , but that will not a?ect our discussion. We will describe nonparametric methods that answer to two related questions concerning truncated data: Question 1 Question 2 the yi ’s? The answers apply to all forms of truncation, including the double truncation of Figure 1. In fact some of our methods apply to all forms of data censoring and truncation, as mentioned at the end of Section 5. Question 1 turns out to be crucial for the scienti?c question posed by the quasars. A positive dependence of luminosity on redshift would mean that earlier quasars were intrinsically brighter, or in other words that quasars have been evolving toward a dimmer state as the universe ages. Figure 1 certainly seems to show a strong positive dependence between redshift and luminosity, but that appearance is forced on us by the truncation limits, which increase sharply from left to right. The tests described in Sections 2, 3, 5, and 6 will show that a statistically signi?cant positive relationship remains after accounting for truncation, How can we test whether or not the yi’s are independent of the zi ’s? Assuming independence, how can we estimate the marginal distribution of
2
type= 1 , k= 0
      ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ?? ? ? ? ?? ?   ? ? ? ? ?   ? ? ? ?  ??? ? ? ?? ? ? ?   ? ??? ?? ? ? ? ? ? ? ? ? ??   ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ??? ? ?  ? ?? ? ? ?  ?? ? ? ?  ??? ? ? ? ? ? ?  ? ? ?? ? ??? ? ?? ? ?  ? ? ? ? ?? ? ? ? ? ?   ? ??   ?? ? ? ?   ?? ? ? ??? ? ? ???? ?? ? ? ?? ? ?? ? ?? ? ?? ? ??? ?? 0.5 1.0 1.5 2.0 2.5 ? ? ?  ? ??  
log luminosity
2
1
0
1
2
3
4
3.0
redshift z
Figure 1: Doubly truncated data; points represent redshifts and luminosities for 210 quasars, as described in Section 6; luminosity subject to lower and upper truncation as indicated by dashes (lower truncation limits shifted down .25 for clarity.) Is luminosity correlated with redshift?
3
though of much smaller magnitude than the ?gure suggests. Turnbull’s important 1976 paper answers Question 2 for arbitrary patterns of data truncation and censorship. Turnbull uses a selfconsistency algorithm to calculate the nonparametric maximum likelihood estimate (MLE) of the y distribution. His equations take on interesting forms for doubly truncated data, as discussed in Section 4. Both Question 1 and Question 2 have simple closedform solutions in the onesided case, where the truncation regions Ri extend upwards to in?nity. Tsai (1990) and also Efron and Petrosian (1992, 1994) answer Question 1 using a version of Kendall’s tau statistic appropriate to onesided truncation. LyndenBell’s (1971) answer to Question 2 in the onesided case is closely related to the KaplanMeier estimate for censored data, see Efron and Petrosian (1994). Sections 25 extend these ideas to twosided truncation (the extensions applying with little change to any pattern of multiple truncation.) The twosided case is less tractable for both questions. The testing problem of Question 1 is particularly challenging computationally, and raises issues concerning the relationship of bootstrap methods to Markov Chain Monte Carlo (MCMC). LyndenBell’s method is completely nonparametric, as is the tau test, but it can be quite ine?cient in some circumstances. Section 4 also discusses more e?cient parametric estimates based on special exponential families, as in Efron (1996) and Efron and Tibshirani (1996). We return to the quasar data in Section 6. The testing and estimation methods developed in Sections 25 are used there to answer Questions 1 and 2.
2
Permutation Tests for Independence
Doubly truncated data of the kind shown in ?gure 1 can be described as follows: the
4
observed data consists of n pairs (zi , yi), yi a realvalued response and zi in covariate, with observation yi restricted to a known region Ri = [ui , vi ],
data = {(zi , yi ) with yi ∈ Ri = [ui , vi ] for i = 1, 2, . . . , n}.
(2.1)
The n quadruplets (zi , yi, ui, vi ) are observed independently of one another. The regions Ri can depend on zi , as in ?gure 1, and the zi values themselves can be subject to observational truncation. Question 1 concerns testing the independence hypothesis Ho : that if we could observe the yi ’s without truncation they would be independent and identically distributed (i.i.d.) according to some common density function f (y ). Because of truncation, Ho says that yi has the conditional density of f (y ) restricted to Ri ,
Ho :
f (yi Ri ) = f (yi)/Fi
for yi ∈ Ri (2.2)
= 0 for yi ∈  Ri independently for i = 1, 2, . . . , n. Here
vi ui
Fi =
f (y )dy.
(2.3)
Question 2 asks us to estimate f (y ) assuming that Ho is true. This section discusses permutation tests of the independence hypothesis Ho . Table 1 and Figure 2 show an arti?cial example of truncated data involving n = 7 points that will be helpful in carrying out the discussion.
? ? ? A permutation of y = (y1 , y2 , . . . , yn ), say y? = (y1 , y2 , . . . , yn ) is observable if the per
muted values all fall within their truncation regions, that is if
? yi ∈ Ri
for i = 1, 2, . . . , n
(2.4)
5
zi 1 2 3 4 5 6 7
yi Ri = [ui , vi ] 0.75 [0.4, 2.0] 1.25 [0.8, 1.8] 1.50 [0.0, 2.3] 1.05 [0.3, 1.4] 2.40 [1.1, 3.0] 2.50 [2.3, 3.4] 2.25 [1.3, 2.6]
Table 2.3: Table 1: An arti?cial example of truncated data involving n = 7 data points.
In ?gure 1, and Table 2, we see that (y3 , y2 , y1, y4 , y5 , y6 , y7) is observable, but not (y2 , y1, y3 , y4 , y5 , y6, y7 ). We de?ne
Y = set of observable permutations. It turns out that Y has 78 members in the sevenpoint example.
(2.5)
Permutation tests of the independence hypothesis Ho are based on the conditional distribution of y? in Y given the ordered values y(1) , y(2) , . . . , y(n) of the observed response vector y, (assuming for convenience no ties), say
y( ) = (y(1) , y(2) , . . . , y(n) ) and also given the zi and Ri values,
(2.6)
z = (z1 , z2 , . . . , zn ) and R = (R1 , R2 , . . . , Rn ).
(2.7)
Independence Lemma Suppose that Ho is true. Then the conditional distribution of y? given y( ) , z, and R is uniform over Y .
? ? ? Proof According to (2.2), (2.3), an observable vector y? = (y1 , y2 , . . . , yn ) has Ho density
6
fig2. 7point example, showing mle from qself3
3
?
? ?
0.18 0.18 0.22
y>
2
? ? ? ?
0.1 0.08 0.09 0.14
0
1
1
2
3
4
5
6
7
z> numbers at right are the selfconsistent dens est
Figure 2: The sevenpoint example from Table 1; dots indicate points (zi , yi ); vertical bars show truncation regions Ri = [ui , vi ]; dashed horizontal lines indicate the ordered y values. Numbers at right give the nonparametric MLE for the y distribution assuming that the independence hypothesis Ho is true, as explained in Section 4.
7
n
? [f (yi )/Fi ] =
n i=1
? f (y i )/
n
Fi .
i=1
(2.8)
i=1
If y? ∈ Y then (2.8) equals which proves the lemma.
n i=1
f (y(i) )/
n i=1 Fi .
This has the same value for all y? ∈ Y ,
Permutation tests of Ho are carried out in the usual way: we choose a test statistic S (y), with larger values of S indicating stronger disagreement with Ho , and compare the observed value s = S (y) with the set of permutation values
S = {S (y?), y? ∈ Y} The pvalue of the test is the proportion of S exceeding s.
(2.9)
The tau test. A particular choice of the test statistic S (y) was advocated in Tsai (1990) and Efron and Petrosian (1992, 1994), in the context of onesided truncation. A pair of indices (i, j ) is said to be comparable if yi ∈ Rj and yj ∈ Ri . De?ne C = set of comparable pairs and (2.10)
τ=
(i,j )∈C
sign[(yi ? yj )(zi ? zj )] / #C .
(2.11)
For untruncated data #C = ? ? and τ is Kendall’s tau statistic. If we decide that the 2 independence hypothesis Ho is false then τ provides a convenient nonparametric measure of correlation between y and z , see Section 3 of Efron and Petrosian (1994); τ = .429 for the sevenpoint example. Here we will use just the numerator of (2.11),
?
n
?
8
τ=
(i,j )∈C
sign[(yi ? yj )(zi ? zj )]
(2.12)
as the statistic S (y) for testing Ho , calling this the tau test. For the sevenpoint example τ = 3 and the 78 members of Y have this distribution of τ ? values, τ ?: #: <3 63 =3 8 >3 7 (2.13)
The onesided pvalue for testing Ho is (7+8/2)/78 (splitting the probability atom at τ ? = τ .) Permutation tests based on τ or τ are identical for onesided truncation, since #C is the same for all observable permutations, but they can di?er under twosided truncation. The two test statistics give almost the same results for the quasar data. Section 6 also discusses other choices of S (y) that have greater testing power.
3
Approximate PValues
The pvalue for the sevenpoint example was found by complete enumeration of the tau statistic over the set of observable permutations Y . This becomes impossible for data sets much larger than n = 7, and we need convenient approximations in order to carry out the test. It seems like a good approximation should be easy to ?nd. The tau statistic (2.12) has permutation expectation zero,
Eperm {τ ? } = 0,
(3.1)
9
since interchanging any comparable pair of indices reverses the sign of the corresponding component of τ ((3.1) is not true for τ , (2.11)). Moreover the components of τ are shorttailed so that the normal approximation
2 τ ? N (0, σperm )
(3.2)
is likely to have good accuracy, see Theorem 2 of Tsai (1990). However, there seems to be no
2 convenient formula for σperm , the permutation variance, at least not in the doubly truncated
case. A formula does exist when the truncation is only onesided, as discussed below.
2 We can use Markov Chain Monte Carlo (MCMC) methods to approximate σperm . That
2 permutation vectors y? , and then estimate σperm by the empirical variance of τ (y? ). A
is we can generate a random walk on Y that eventually produces uniformly distributed particularly simple MCMC scheme starts at y? = y and proceeds as follows: pair of indices i, j at random; and
(1) choose a
(2) interchange the ith and j th components of y? if the
? ∈ Ri ) otherwise keeping y? the same. and yj
? ∈ Rj resulting vector is in Y (that is if yi
Elementary Markov Chain Theory says that y? has its stationary distribution uniform on Y , see Gelman and Rubin (1992). This assumes that Y is connected by coordinate interchanges, a fact proved recently by Persi Diaconis and Ronald Graham (personal communication). A su?cient number of iterations of steps (1) and (2) make y? nearly uniformly distributed on Y . Repeating this whole process some large number B of times, perhaps over Y . Then we can estimate the permutation variance by
2 σperm = B ? b=1 [τ (y (b))
B = 800, gives independent vectors y? (1), y? (2), . . . , y? (B ), distributed nearly uniformly
B?1
? τ (·)]2
B
[τ (·) =
τ (y? (b))/B ],
(3.3)
b=1
and approximate the pvalue of the tau test by
p = 1 ? Φ(T ),
T = τ (y)/σperm
(3.4)
10
where Φ is the standard normal cumulative distribution function (CDF). A more direct approach, but one that tends to require larger values of B1 is to estimate p as at (2.9), by
p = #{τ ? > τ }/B .
(3.5)
An MCMC analysis of the quasar data of Figure 1 was carried out by Duncan Murdoch, personal communication. The number of iterations required to compute T (0) = τ /σperm with the same acuracy achieved in Figure 5 was, very roughly, 4,000,000. In any given situation it is hard to know how many iterations of steps (1) and (2) are required to make y? su?ciently uniform on Y . Generating independent replicates y? (b) as in (3.3), which is very convenient for error analyses, may be quite ine?cient in the MCMC context. An information reference on these di?culties and their possible remedies is Gilks et al. (1993) and the ensuing discussion.
2 Section 5 discusses a bootstrap approximation for σperm that sacri?ces the theoretical
exactness of the MCMC algorithm for more e?cient and more de?nite numerical results. Bootstrap estimates are used in section 6’s analysis of the quasar data. The bootstrap approach has the advantage of applying to any kind of truncated and/or censored data, as mentioned at the end of Section 5. Neither MCMC nor the bootstrap are needed in the case of onesided truncation where the observational regions in (2.11) are all of the form Ri = [ui , ∞). In this case there is a simple way to generate vectors y? uniform on Y . De?ne the riskset numbers Nj = #{i : ui ≤ y(j ) and yi ≥ y(j )}
(3.6)
Nj is the size of the j th risk set in the following sense: Looking at ?gure 2, but with all the regions Ri now extending up to in?nity, we begin with y(1) , the smallest y value, and work upwards. At each y(j ) there are Nj observable choices of “zi ” to go with y(j ) , all of which are equally likely under Ho . These are the values of z “at risk” for pairing with y(j ) .
11
For example, there are N2 = 3 possible choices of z to go with y(2) , namely 2, 3, or 4, the actual choice in ?gure 2 being z = 4. Altogether there are N = N = 324 (= 3 · 3 · 3 · 3 · 2 · 2 · 1) for the sevenpoint example. A uniform choice of y? in Y is accomplished by choosing zi uniformly from the Ni possible choices at each y(j ) . this makes it easy to simulate the permutation distribution for any test statistic S (y). The permutation variance of the tau statistic (2.12) turns out to be
n 2 σperm =4 i=1 n i=1
Ni members of Y ,
Vi
where
Vi =
Ni2 ? 1 12
(3.7)
as shown for example in Section 3 of Efron and Petrosian (1994). We will use Formula (3.7)
2 in Section 5 to validate the accuracy of the bootstrap estimate of σperm .
4
Estimating The Response Distribution
Question 2 of the introduction asks us to estimate the distribution of the response variable y assuming that the independence hypothesis Ho is true. More precisely, we want to estimate the response density f (y ) in (2.2), (2.3). This section discusses nonparametric and parametric estimates of f (y ) when the data is doubly truncated. The nonparametric MLE is a discrete distribution putting all of its probability on the observed responses y1 , y2 , . . . , yn , Turnbull (1976). Let f = (f1 , f2 , . . . , fn ) be a distribution putting probability fi on yi , and let F = (F1 , F2 , . . . , Fn ) be the vector of observational probabilities Fi = probf {y ∈ Ri }, so F = Jf where J is the n × n matrix describing which y values are included in the regions Ri , (4.1)
12
Jij =
1 if yj ∈ Ri 0 if yj ∈  Ri
(4.2)
According to (2.2), the log likelihood of the observed sample is
n
? = log
(fi /Fi )
i=1
(4.3)
Di?erentiating (4.3) with respect to fi , and using (4.1), gives
n ?? 1 1 = ? Jji . ?fi fi j =1 Fj
(4.4)
The maximum likelihood equations ??/?fi = 0 can be expressed as 1 1 = J′ , f F where
n i=1 1 f
(4.5)
) and = ( f11 , f12 , . . . , f1 n
1 F
1 = (F , 1 , . . . , F1n ). Notice that ? in (4.3) stays the same 1 F2
if f , and hence F, is multiplied by any positive constant, allowing us to ignore the constraint fi = 1 in the derivation of (4.5).
Equation (4.5) is the same as Turnbull’s selfconsistency criterion. We can solve for the MLE f by beginning with any initial estimate and then iterating between (4.1) and (4.5) (remembering to rescale after each application of (4.5) so that the estimate of f sums to 1). The nonparametric MLE f for the sevenpoint example is shown at the right edge of Figure 2. The substantial di?erences between f and the untruncated MLE (.14, .14, ..., .14) are not intuitively obvious from the truncation pattern. The method just described is an EM algorithm, and often converges quite slowly to the MLE. An alternative algorithm is based on LyndenBell’s 1971 method for computing the MLE in the case of onesided truncation. For notational convenience suppose that the y values are indexed in increasing order, so yi = y(i) , where we continue to assume that
13
there are no ties. Corresponding to density function f = (f1 , f2 , . . . , fn ), the survival curve G = (G1 , G2 , . . . , Gn ) and the hazard function h = (h1 , h2 , . . . , hn ) are de?ned by
Gj =
i≥j
fi
and
hj = fj /Gj
(4.6)
We can recover G and f from h via the relationships
Gj = exp{
i<j
log(1 ? hi )}
and
fj = Gj ? Gj +1
(4.7)
Here G1 = G(y(1) ) = 1 by de?nition. The following theorem is veri?ed in the Appendix. Hazard Rate Theorem The nonparametric MLE f has hazard function h satisfying
n 1 Jij Qi = Nj + hj i=1
(4.8)
where Nj are the riskset numbers (3.5), {Jij } are the inclusion indicators (4.2), and Qi = Gvi + / Fi Here F = Jf as in (4.1), and Gvi + = {fk : yk > vi }. (4.9)
The numerator of Qi is the MLE probability of exceeding vi , the upper observational limit for yi . In the onesided truncation case Qi = 0 since vi = ∞, so equation (4.8) takes the form 1 = Nj , hj
(4.10)
which is LyndenBell’s (1971) estimate. In this case (4.7) gives the MLE f directly, without iteration. In the case of twosided truncation we can begin with (4.10) and iterate (4.7),
14
(4.8) to obtain the MLE. This converges quickly if the upper truncation is not severe, as turns out to be the situation for the quasar data. The solid curve in Figure 3 shows log G(y ), the nonparametric MLE of log(
yi ≥y
fi ),
for the 210 quasars. The log luminosity y is not the same quantity plotted in ?gure 1, but is an adjusted version as explained in Section 6. Also shown is the LyndenBell estimate (4.10), dashed line, which ignores upper truncation. The two estimates are almost the same so in this case upper truncation has little e?ect. The MLE algorithm based on (4.7)(4.8) converges very quickly here. It is not an accident that the LyndenBell estimate of survival is everywhere less than the MLE. Corollary The LyndenBell estimated survival curve, which ignores upper truncation, is less than or equal to the nonparametric MLE. The proof of the corollary is immediate from a comparison of (4.8) with (4.10), which shows that the estimated hazard rate can only be greater in the LyndenBell case. The curve labeled “SEF” in Figure 3 is derived from a special exponential family in the terminology of Efron and Tibshirani (1996). In this case the SEF density estimate f (y ) is the MLE among densities of the form
log f (y ) = β0 + β1 y + β2 y 2 + β3 y 3 , for y in the range of the observed yi values. That is, it maximizes
n i=1 (f (yi )/Fi )
(4.11) in (2.2),
(2.3) among all choices of (β1 , β2 , β3 ) (with β0 then determined by the requirement that f (y ) integrate to 1 over the interval [y(1) , y(n) ]). The appendix describes the calculation of f (y ). LyndenBell’s estimate and the parametric MLE can behave erratically near the extremes of the y range, where the risk set numbers Nj may be small. At the arrowed point in ?gure 3 for example N1 = 2, giving LyndenBell estimates h1 = .50 and G(y(2) ) = .50 according to (4.10) and (4.7). The nonparametric MLE has G(y(2) ) = .51. The SEF estimate smooths
15
fig3
... .... .... .... .... .... .... ..... .... .... .... .... .... ... ... ... ... ... .. .. .. .. .. .. .. ..SEF .. .. .. .. . . . Nonpar MLE . . . . LyndenBell . . 2 1 0 1 2 3
log Survival
10
8
6
4
2
0
adjusted log luminosity
Figure 3: Estimated log survival curve log G(y ) as a function of the adjusted log luminosity evolution yj for the quasar data, adjusted for luminosity evolution: θ = 2, as explained in Section 6; solid curve is nonparametric MLE; dashed curve is LyndenBell estimate (4.10), ignoring upper truncation; smooth dotted curve is cubic special exponential family.
16
out the bumps in the nonparametric MLE, here giving G(y(2) ) = .72. SEF estimates are less variable than the nonparametric MLE, but can be biased if based upon an incorrect model. The cubic model in Figure 3 was chosen by successive signi?cance tests, as explained in the appendix. A small simulation study showed that percentile points of the cubic SEF estimates had roughly half the standard deviation of their nonparametric MLE counterparts.
5
Bootstrap Tests for Independence
We now return to the question of testing the independence hypothesis Ho , (2.2). this section discusses bootstrap approximations to the permutation pvalues of Section 2. Most of the discussion is in terms of the tau test, but the method applies to any test statistic as mentioned at section’s end, and can be extended to arbitrarily complicated patterns of data censoring and truncation. Let f (y ) be an estimate of the density for the response variable y calculated as in Section 4, assuming that Ho is true. The estimate f might be the nonparametric MLE or an SEF estimate as in Figure 3. We can use f to draw a bootstrap sample y? by following the recipe
? ? ? in (2.2), (2.3): y? = (y1 , y2 , . . . , yn ) has independent components, with the ith component’s
density being
? ? f (y i )/Fi for yi ∈ Ri ? 0 for yi ∈  Ri
independently i = 1, 2, . . . , n
(5.1)
where Fi =
Ri
f (y )dy .
An approximate version of the tau test can be carried out by generating B independent bootstrap samples y? and then proceeding as in (3.3), (3.4) to get the approximate pvalue p = 1 ? Φ(T ). How big need B be? Letting
17
T = τ (y)/σperm
and T = τ (y)/σperm ,
(5.2)
standard normaltheory calculations show that √ . sd? {T /T } = 1/ 2B, where sd? indicates the simulation standard deviation. This gives B: sd? 50 100 200 400 800 0.10 .07 .05 .035 .025 (5.4)
(5.3)
so we need B = 800 bootstrap replications to determine T within about 2.5% of its ideal value T . These same calculations apply to the MCMC approach in (3.3), (3.4). B = 800 is large enough to permit a check in the normality assumption leading to the estimate p = 1 ? Φ(T ), and a retreat to the nonparametric estimate (3.5) if normality looks dubious. As a check on the bootstrap test, σperm was estimated using B = 800 replications for the data in Figure 1, but ignoring the upper bounds of the truncation regions (i.e. taking all vi = ∞). The estimate was σperm = .0441 ± .0011, “±” indicating the bootstrap simulation error, agreeing nicely with the exact permutation standard deviation σperm = .0445 obtained from (3.7). The bootstrap approximation for σperm looks much di?erent than the MCMC approach of Section 3. In particular it requires a preliminary estimate of f (y ). The brief discussion in the appendix shows that the bootstrap and permutation calculations are more alike than they appear, and that similar approximations are used in more familiar statistical contexts. There is nothing special about the tau statistic as far as bootstrap or permutation methods are concerned. Section 6 mentions a more powerful version of tau that puts greater weight on those terms in (2.11) having bigger values of zi ? zj . We could in fact employ 18
any test statistic S (y), and then use the bootstrap or MCMC to approximate the comparison set S and pvalue in (2.9). The main advantage of taulike test statistics is that they have bootstrap or permutation expectations zero under Ho , and nearly normal distributions, which eases the task of ?nding pvalues by simulation. Tsai (1990, Section 4) shows that the tau test of independence can be applied to data that is truncated below and censored above. In principle, bootstrap tests can be applied to any form of truncated and/or censored data. The ?rst step is to estimate the response density f (y ) using Turnbull’s (1976) nonparametric selfconsistency algorithm. Then bootstrap samples y? are drawn as in (5.1), taking into account each yi’s pattern of truncation and censoring, and the pvalue approximated by comparing the test statistic S (y) with S = {S (y? (1)), S (y?(2)), . . . , S (y? (B ))}. Romano (1988) gives a general discussion, and validation, of this kind of “null hypothesis bootstrap” test procedure.
6
The Quasar Data
Our estimation and testing theory will now be applied to the quasar data of Figure 1. First the situation will be described more carefully. The original dataset consisted of independently collected quadruplets
(zi , mi , ai , bi ) i = 1, 2, . . . , n
(6.1)
where zi is the redshift of the ith quasar and mi is its apparent magnitude. The numbers ai and bi are lower and upper truncation limits on mi . Quasars with apparent magnitude above bi were too dim to yield dependable redshifts (remembering that bigger values of m correspond to dimmer objects.) The lower limit ai was used to avoid confusion with nonquasar steller objects. In this study ai = 16.08 for all i, while bi varied between 18.494 and
19
18.934. The full dataset comprised n = 1052 quasars. Here we are considering a randomly selected subset of n = 210 quadruplets (6.1). Farther quasars appear dimmer of course, that is they tend to have bigger values of mi . Hubble’s law, which says that distance is proportional to redshift, allows us to transform apparent magnitudes into a luminosity measurement that should be independent of distance. This transformation depends on the cosmological model assumed. The log luminosity values yi in Figure 1 were obtained from a formula yi = t(zi , mi ) that takes into account relativistic e?ects of the distance, mi 1 1/2 + 2 log(Zi ? Zi ) ? log(Zi ), 2.5 2
yi = 19.894 ? 2.303
(6.2)
where Zi = 1 + zi . Formula (6.2) is derived from the EinsteindeSitter cosmological model, Weinberg (1972). The last term makes the socalled Kcorrection, taking into account the shifting of the spectrum due to redshift. Larger values of yi correspond to intrinsically brighter quasars in Figure 1. The truncation limits Ri = [ui, vi ] were obtained by applying transformation (6.2) to the observational limits for mi ,
ui = t(zi , bi ) and vi = t(zi , ai )
(6.3)
This makes ui and vi , the lower and upper dashes in ?gure 1, strongly increasing functions of zi , even though ai and bi are not. One of the principal goals of the quasar investigations is to study luminosity evolution: quasars may have been intrinsically brighter in the early universe and evolved toward a dimmer state as time went on. This would tend to make the points on the right side of Figure 1 higher since larger redshifts correspond to quasars seen longer ago. However in the absence of luminosity evolution we should have yi independent of zi except for truncation e?ects. This brings us back to Question 1 of the introduction. Testing the independence
20
hypothesis Ho amounts to testing for the absence of luminosity evolution. A convenient oneparameter model for luminosity evolution says that the expected log luminosity increases linearly as θ · log(1 + z ), with θ = 0 corresponding to no evolution. If θ is a hypothesized value of the evolution parameter then instead of yi being independent of zi we should test the null hypothesis “Hθ ”,
Hθ :
yi (θ) = yi ? θ · log(1 + zi )
independent of
zi
(6.4)
Figure 4 shows plots of the data for θ equal 0, 2, and 4, all for our same set of 210 quasars. Notice that the truncation regions Ri = (ui , vi ) also change with θ,
ui (θ) = ui ? θ · log(1 + zi ) and vi (θ) = vi ? θ · log(1 + zi )
(6.5)
We can apply the tau test to each of the null hypotheses Hθ . This was done for values of θ between 0 and 4 with the results shown in ?gure 5. The solid curve is the standardized test statistic T = τ /σperm , (3.4), with σperm determined by bootstrap sampling. B = 800 bootstrap replications (5.1) were drawn for θ = 0, .5, 1, . . . , 4. Almost exactly the same curve was obtained using τ ,, (2.11), in place of τ . We see that T (0) = 2.20, giving an approximate onesided pvalue 1 ? Φ (2.13) = .015. The tau test rejects the null hypothesis of independence Ho in favor of a positive value of the luminosity parameter θ. At θ = 2.38 we have T (θ) = 0. The T (θ) curve crosses ±1.645 at [1.00, 3.20], which provides an approximate 90% central con?dence interval for θ. As a point of comparison, using all 1052 quasars gave θ = 2.11 and 90% interval [1.38, 2.63]. If we are willing to ignore the upper truncation limits (setting all the vi = ∞) we can employ the more exact onesided tau test (3.5)(3.6). These results, which did not involve bootstrap sampling, were only slightly more signi?cant than those for the twosided test, as shown by the dots in Figure 5. The choice θ = 2 makes the adjusted log luminosity yi(2) = yi ?2 log(1+zi ) approximately 21
theta= 0
   ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ??  ?? ? ??    ? ?? ? ? ??? ?? ? ? ?  ?  ? ? ? ? ? ?? ? ? ? ?  ?? ? ?  ? ? ?? ? ? ?? ? ? ? ? ? ? ? ?  ? ? ?? ? ?  ? ? ? ? ? ??  ? ??? ?  ?  ?  ??? ? ? ? ? ? ? ? ? ? ? ? ? ? ?  ? ?? ? ?  ? ? ? ? ? ? ? ??? ?   ? ?? ? ?? ? ?  ? ? ???? ? ?  ?? ? ?? ???  ? ?? ? ? ?? ? ?  ? ? ?? ?? ? ?? 0.5 1.0 1.5 2.0 redshift z     ?  
theta= 2
    ?  ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?? ?? ? ? ? ? ?? ?? ? ? ? ?? ? ?? ? ? ?? ? ? ? ??? ?? ? ? ? ? ???   ? ? ?? ?  ?? ? ? ? ?? ? ? ? ?  ? ?? ?     ? ?? ? ? ? ?? ? ? ? ?  ? ? ? ?? ?    ??     ? ? ? ?? ? ?   ?  ??  ? ?  ?? ? ? ? ?  ?? ? ??  ? ??    ? ?  ? ? ? ? ? ? ? ? ? ?  ? ?  ? ?? ? ?? ? ??  ? ??? ?0.5 1.0 1.5 2.0 redshift z      
theta= 4
   ? ? ? ? ?? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ?? ? ? ? ?? ? ? ? ? ? ? ? ? ??? ??? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ?? ? ? ? ? ?? ? ?? ? ? ??? ?? ? ?? ? ? ? ? ? ?  ? ? ? ? ? ? ? ? ? ?? ? ? ? ?  ?    ? ? ? ?  ? ?    ? ? ?         ? ?   ??  ?         ?       ?        ? ?  ? ??2.5 3.0 0.5 1.0 1.5 2.0 2.5        
2
4
? ? ? ? ?
?
? ? 
?
?
? 
?
 
1
 
? ? ? ? 
?
?
? ? ?
0.5
3
?
0.0
?
1.0
?  
2

log luminosity
log luminosity
log luminosity

? ? ? ? ? ?? ? ?? ? ? ? ? ? ? ? ? ? ?
 
1
0
1.5
2.5
? ? ? ?? ? ? ? ? ?? ? ? ? ?   ?          
0
1
2.0
? 
1
 
2
2.5
3.0
2
3.0
redshift z
Figure 4: Plots of the quasar data of Figure 1 for three choices of the luminosity evolution parameter θ; θ = 0 corresponds to Figure 1; other values of θ plot (zi , yi (θ)), (6.4), with limits ui(θ) and vi (θ), (6.5).
22
fig5
2
< T(0)=2.13 1.645
1
0
1
1.645
3
2
T(4)=3.28 > 0 1 2 3 4
Figure 5: Tau test of the null hypotheses Hθ , (6.4), for the 210 quasars; θ = 0, 0.5, 1, . . . , 4; solid curve T = τ /σperm , (3.4), crosses zero at θ = 2.38; σperm found by bootstrap sampling (5.1); 90% central con?dence interval θ ∈ [1.00, 3.20]. Dots indicate T values assuming only onesided truncation.
23
independent of zi so we can estimate its density f (y ) as in Section 4. Figure 3 shows the estimated log survival curve log G. It curves sharply downward for large values of y , which is a fortunate thing for the testing results. If log G were linear then f (y ) would correspond to a onesided exponential density. Because of the exponential’s memoryless property it is impossible to test for independence in the exponential case (unless the lower endpoint of the exponential density exceeds some of the lower truncation limits ui ). On the other hand, the MLE f (y ) consistently estimates an exponential tail even if yi and zi are not independent. It is a good idea to estimate f (y ) or G(y ) even if Question
1 is of primary interest. Should the MLE turn out to be exponential then any testing procedure will be futile. Model (6.4) gives us reason to question the e?cacy of the tau statistic in this situation. Let θo be the true value of the evolution parameter. Then the di?erence of two of the luminosity measurements in Figure 1, where θ = 0, can be expressed as
yi ? yj = yi (θo ) ? yj (θo ) + θo (wi ? wj )
[wi = log(1 + zi )]
(6.6)
The di?erences yi (θo ) ? yj (θo ) are symmetrically distributed about zero, except for truncation e?ects so (6.6) suggests that (i, j ) pairs with bigger values of wi ?wj  will be more informative in testing for deviations from Ho . An analysis of the 4907 comparable pairs for the data in Figure 1 veri?ed that those with large values of wi ? wj  were contributing more consistently to the tau statistics: wi ? wj  : prob{sign(yi ? yj )(zi ? zj ) = 1} 0 .25 .5 .75 .51 .56 .59 .62 1 .63
A modi?ed version of the tau statistic (2.11) was tried on the quasar data,
τ ?′ =
C
 log(wi /wj )  sign [(yi ? yj )(zi ? zj )] /
C
 log(wi /wj ),
(6.7)
24
but it showed only modest improvements over either τ or τ . The gains were more substantial ? ′ = 3.21 compared to when all 1052 quasars were included, giving standardized statistic T T = 2.77 for testing Ho .
25
Appendix
A
Proof of the hazard rate theorem (4.8)
Because the nonparametric MLE puts all of its probability on the observed responses, and because it is invariant under monotonic transformation of the y scale, we can assume that yi = i for i = 1, 2, . . ., and that Ri = [ui , vi ] where ui and vi ∈ {1, 2, . . . , n} (again assuming no tied y values.) In addition to Nj = #{i : ui ≤ j and i ≥ j }, (3.5), we de?ne Mj = #{i : ui = j } It is easy to see that (A.1)
M1 = N1 We will also assume that
and Mj = Nj ? Nj ?1 + 1 for j = 2, 3, . . . , n.
(A.2)
Mn = 0,
(A.3)
because if this were not true then the largest observation yi = n would have been truncated to the degenerate interval Rn = [n, n], and we could reduce the sample size to n ? 1. (A3) implies Nn?1 = Nn + 1 = 2 according to (A2), since Nn = 1. Following the notation of section 4, a density f = (f1 , f2 , . . . fn ) has likelihood L =
n i=1
fi /Fi for the observed sample. The hazard function h = (h1 , h2 , . . . , hn ) has hn = 1,
so we need only consider its ?rst n ? 1 coordinates. The following lemma expresses the likelihood in terms of the hazard rate. Lemma The likelihood corresponding to (f1 , f2 , . . . , fn ) or (h1 , h2 , . . . , hn?1 ) is
26
L=
n n?1 fi hj (1 ? hj )Nj ?1 ] / [ {1 ? =[ i=1 j =1 i=1 Fi n
vi ui
(1 ? hk )}]
(A.4)
Proof In the onesided case of no upper truncation, where all the vi = n, the denominator in (A4) equals 1 (since hn = 1), so that (A4) becomes
n?1 fi hj (1 ? hj )Nj ?1 = F i=1 i j =1 n
(A.5)
This is easy to verify, starting from hj = fj /Gj :
n?1 j =1
hj (1 ? hj )Nj ?1 = =
n?1 j =1 n
n?1 fj Gj +1 Nj ?1 GNn?1 ?1 fj ] [ N1 N2 +1 n Nn?1 ,?Nn?2 +1(A.6) ( ) =[ ] Gj Gj G1 , G2 . . . Gn ? 1 j =1 n?1
fj /
j =1 j =1
Gj j ,
M
(A.7)
where we have used Gn = fn ,
n?1 j =1
Nn?1 = 2, (A2), and (A3). But in the onesided case
Gj
Mj
=
n i=1
Fi , since each Fi must equal some Gj , so (A6) proves (A5).
In the twosided case
n n fi Gu i fi =[ ][ ]. i=1 Gui i=1 Fi i=1 Fi n
(A.8) for the onesided case,
But the ?rst bracketed factor on the right is the same as and formula (4.7) says that Gu i Gu i = = [1 ? Fi Gui ? Gvi +
vi ui
n i=1 (fi /Fi )
(1 ? hk )]?1
(A.9)
which together give the lemma (A4). Finally, di?erentiating the log likelihood
n?1
n
vi
log L =
j =1
[ log hj + (Nj ? 1) log (1 ? hj )] ?
i=1
log [1 ?
ui
(1 ? hk )].
(A.10)
27
gives Gvi+ ? log L 1 1 = ? [(Nj ? 1) + ], ?hj hj 1 ? hj Fi i:j ∈Ri which is equivalent to the hazard rate theorem (4.8), (4.9). The likelihood equality (A.5) for the onesided case can be obtained directly from familiar survival analysis arguments involving successive conditionally independent binomial likelihoods. See section 2 of Efron (1988). In the twosided case successive conditioning gives the messier expression
n n
(A.11)
L=[
j =1 i∈Nj
(1 ? hij ) ] [
j =1
hjj ], 1 ? hjj
(A.12)
when hij is the hazard function for yi ,
hij = fj /
j ≤k ≤vi
fk ,
(A.13)
and Nj = {i : ui ≤ j (A4).
and i ≥ j }, the “j th risk set”. The lemma says that this reduces to
Special Exponential Families Classic exponential families such as the normal, Poisson, and binomial have mathematical properties that greatly simplify their use. Modern computational equipment allows us to design special exponential families (SEF) for particular applications without worrying about mathematical tractability. The SEF appearing in Figure 3 has a density of the form
fη (y ) = eη t(y)?φ(η) where t(y ) = (y, y 2, y 3 ), make
Y
′
for y e Y
(A.14)
η = (η1 , η2 , η3 ), R = [ min (yi), max(yi ) ] and φ(η ) is chosen to
fη (y )dy = 1
28
The cubic SEF shown in Figure 3 used the MLE η of the parameter vector η . In order to calculate η we de?ne
Tη (g ) =
R
g (y )eη t(y) dy,
′
(A.15)
where g (y ) may be a vector or matrix function of y in which case the integrals in (A.13) are ˙η (y) = carried out componentwise. Then the score function ? data structure (2.2), (2.3) is
n i=1 ? ?η
log fη (y) for the truncated
˙η (y) = ?
[ t(yi ) ? Tη (Ii · t) / Tη (Ii )],
(A.16)
where Ii (y ) is the indicator function for Ri , as in Efron and Tibshirani (1996). The second derivative matrix is given by
n i=1
¨η (y) = ??
{Tη (Ii · t2 ) / Tη (Ii ) ? [ Tη (Ii · t) / Tη (Ii )) ]2 },
(A.17)
with t2 indicating the n × n outer product matrix (ti tj ). ˙ (y ) = 0, is found by NewtonRaphson iteration The MLE η , satisfying ? η
¨ (y) ]?1 ? ˙ (y ) , η (k + 1) ? η (k ) = [ ?? η (k ) η(k )
(A.18)
These calculations go quickly because they involve only onedimensional integrals (A.13), though many of them. Successive models t(y ) = yj t(y ) = (y, y 2), . . . were tried on the data for Figure 2. Standard hypothesis tests led to the choice of a cubic model. These tests used ¨ (y) ]?1 from (A.15) to assess the signi?cance of the η the estimated covariance matrix [ ?? η coe?cients, e.g. whether or not η3 is signi?cantly nonzero for the cubic model. Note The nonparametric MLE is itself an SEF estimate, where the components of t(y ) in (A.12) are delta functions on the observed yi values, t(y ) = (δ (y ?y1 ), δ (y ?y2), . . . , δ (y ?yn )).
29
Bootstrap and Permutation Calculations Section 5 discusses bootstrap approximations to permutation tests for the null hypothesis of independence Ho . The situation there, involving doubly truncated data, looks complicated but in fact it is perfectly analogous to much simpler and more familiar statistical problems. Consider the problem of testing the equality of two binomial probabilities. We observe independent Bernoulli variates
y11 , y12 , . . . y1n1 ? Bi(1, π1 ) y21 , y22 , . . . y2n2 ? Bi(1, π ) and wish to test the null hypothesis Ho that π1 = π2 = π (say). Letting x1 =
n2 i=1 n1 i=1
(A.19) (A.20) y1i and x2 =
y2i we can arrange the data in a 2 × 2 table, ?rst row (x1 , n1 ? y2 ) and second row
(x2 , n2 ? x2 ), in which case Ho is the usual hypothesis of independence for the table. If Ho is true than x+ = x1 + x2 is a su?cient statistic and x+ ? Bi(n+ , π ) with n+ = n1 + n2 . Moreover all n+ ! permutations of the combined data vector y = (y11 , y12 , . . . , y2n2 ) are equally likely given x+ , allowing us to construct permutation tests for Ho without worrying about the nuisance parameter π . The permutation test based on the statistic
S = p1 ? p2
[ p1 = x1 /n1
p2 = x2 /n2 ]
(A.21)
is Fisher’s exact test for independence. S has permutation expectation zero and variance n+ 1 1 π (1 ? π )( + ), n+ ? 1 n1 n2
2 σperm =
[ π = x+ /n+ ]
(A.22)
2 Suppose we did not know formula (A20) and wished to approximate σperm by bootstrap
simulation. The null hypothesis bootstrap replaces (A18) with
? ? ? y11 , y12 , . . . , y2 n2 ? Bi(1, π ),
ind
(A.23)
30
giving
n1 ? S ? = p? 1 ? p2 = i=1 ? y1 i /n1 ? n2 ? y2 i /n2 . i=1
(A.24)
S ? has bootstrap expectation zero and variance 1 1 + ). n1 n2
2 σperm = π (1 ? π ) (
(A.25)
2 The usual χ2 statistic for testing independence is exactly S 2 /σperm .
The bootstrap sampling is unconditional in the sense that the samples do not necessarily have x? + = x+ . Nevertheless we get an approximation that is accurate to second order,
2 2 σperm /σperm = 1 + O (1/n+ ).
(A.26)
It can be shown that this is no accident, and that secondorder accuracy follows for any situation of the following sort: the observed data can be written as (A, B ) where under
Ho , A has an exponential family of densities and the conditional density f (B A) does not depend on the parameters of the exponential family. In the binomial situation A = x+ and B is a permutation of a vector of x+ 1’s and n+ ? x+ 0’s. Under Ho , A ? Bi(n+ , π ) and B A is uniform. It is easiest to see the connection with our truncated data problem through the SEF formulation. Formula (A.13) leads to an exponential family of densities for y under the truncated data model (2.2), with su?cient statistic A =
n i=1
t(yi ). As the exponential family (A.13) grows larger, say by including
more polynomial terms in t(y ) = (y, y 2, y 3, y 4 , . . .), A itself gets bigger, but we can always take A = y( ) , the order statistic of y. But A = y(
)
is equivalent to taking A = Y (since
the regions Ri are known ancillaries), as we did for the permutation tests of section 2.
31
References Efron, B. and Tibshirani R. (1996). “Using specially designed exponential families for density estimation.” Annals Stat 24, p. 24312461. Efron, B. (1996). “Empirical Bayes methods for combining likelihoods” (with discussion) Journal of American Statistical Association 91, p. 538565. Efron, B. and Petrosian V. (1994), “Survival analysis of the gammaray burst data” Journal of American Statistical Association 89, p. 452462. Efron, B. and Petrosian V. (1992) “A simple test of independence for truncated data with applications to redshift surveys” The Astrophysical Journal 399, p. 345352. Efron, B. (1988) “Logistic regression survival analysis, and the KaplanMeier curve” Journal of American Statistical Association 83, p. 414425. Gelman, A. and Rubin D. (1992) “Iterative simulation using single and multiple sequences” (with discussion) Statistical Science 7, p. 457511. Gilks, W., Clayton D. et al. (1993) “Modeling complexity: applications of Gibbs sampling in medicine” (with discussion) Journal of the Royal Statistical Society B 55, p. 39102. LyndenBell, D. (1971) “A method of allowing for known observational selection in small samples applied to 3CR quasars” Mon. Nat. R. Ast. Soc. 155, p. 95118. McLaren C., Wagsta? M, Brittegram G., Jacobs A. (1991) “Detection of twocomponent mixtures of lognormal distributions in grouped, doubly truncated data analysis of red blood cell volume distributions” Biometrics 47, p. 607622. Romano, J. (1988) “A bootstrap revival of some nonparametric distance tests” Journal of American Statistical Association 83, p. 698708. Tsai, W. (1990) “Testing the independence of truncation time and failure time” Biometrika 77, p. 16977.
32
Turnbull, B. (1996) “The empirical distribution function with arbitrarily grouped censored, and truncated data” Journal of the Royal Statistical Society B38, p. 290295. Weinberg, S. (1972) Gravitation and Cosmology, Wiley, New York.
33