Coarse structural nested mean models are tools for estimating treatment effects from longitudinal observational data with time-dependent confounding. There is, however, no guidance on how to specify the treatment effect model, and model misspecification can lead to bias. We derive a goodness-of-fit test based on modified over-identification restrictions tests for evaluating a treatment effect model, and show that our test is doubly robust in the sense that, with a correct treatment effect model, the test has the correct Type I error if either the treatment initiation model or a nuisance regression outcome model is correctly specified. In a simulation study, we show that the test has correct Type I error and can detect model misspecification. We use the test to study how the timing of antiretroviral treatment initiation after HIV infection predicts the effect of one year of treatment in HIV-positive patients with acute and early infection.
Keywords: Causal inference, Estimating equation, HIV/AIDs, Over-identification restrictions test
In observational studies, there is often time-dependent confounding: some covariates are predictors of both the treatment and the outcome. These covariates may also be affected by the treatment history. In such cases, standard regression methods adjusting for the covariate history can lead to bias (Robins et al., 1992; Robins, 2000; Robins et al., 2000). Coarse structural nested mean models (Robins, 1998) are useful for handling time-varying confounding, but they depend on correct specification of the treatment effect model.
In this paper we propose a goodness-of-fit test for correct specification of the treatment effect model. The key insight is that a correctly specified treatment effect model leads to a larger number of unbiased estimating equations than parameters, which results in over-identification of the latter. Over-identification restrictions tests, also called Sargan tests or -tests (Sargan, 1958; Hansen, 1982), are widely used in econometrics. The standard over-identification restrictions test, given by the minimized value of the generalized method of moments (Newey & McFadden, 1994) criterion function, has a limiting distribution, with degrees of freedom equal to the number of over-identification restrictions. In most situations, the minimum of the generalized method of moments criterion is obtained via a continuous iterative procedure that updates the parameter estimates until convergence (Hansen et al., 1996). Arellano & Bond (1991) showed that the test statistic based on one-step estimates other than the optimal generalized method of moments estimates is not robust and tends to over-reject even in large samples. In the statistics literature, generalized method of moments inference has been used as part of the quadratic inference function approach developed by Qu et al. (2000) and Lindsay & Qu (2003).
Coarse structural nested mean models result in an infinite number of estimating functions, indexed by a set of arbitrary functions . The precision of the estimator depends on the estimating functions (Lok & DeGruttola, 2012). Generalized method of moments approaches provide optimal combinations of the parameter-identification estimating functions and the goodness-of-fit estimating functions. However, it is not clear which estimating functions should be used. Semiparametric efficiency theory allows us to derive an optimal set of estimating equations whose corresponding -estimator achieves the semiparametric efficiency bound (Robins, 1994). Combining optimal estimating equations with additional goodness-of-fit estimating equations allows simultaneous estimation and testing, as in the traditional over-identification approach, but it can unnecessarily increase estimation variability (Lindsay & Qu, 2003). The purpose of this paper is to introduce a different strategy that separates estimation and testing, so that the estimator attains optimality under the null model and the test has high power. To achieve this, we obtain parameter estimates by solving the optimal estimating equations with the number of equations equal to the number of parameters, rather than by minimizing an objective function. The over-identified restrictions, used only for testing, can be developed from some parametric specification of alternative models. Simulation studies show that our test statistic has correct size for large samples and high power in all the scenarios considered. Another advantage of the over-identification restrictions test is that no bootstrap is needed to compute the test statistic, which is valuable when working with large samples.
2. Coarse structural nested mean model analysis
We assume that subjects are monitored at discrete times . Let be the outcome at time , and let be a vector of covariates at time . Let be the treatment indicator, which equals 1 if the subject was on treatment at time and 0 otherwise. We use overbars to denote a variable's history; for example, is the treatment information at times . We assume that once treatment is started, it is never discontinued, so each treatment regime corresponds to a treatment starting time and vice versa. Let be the actual treatment starting time, with if the subject never started the treatment during follow-up. We assume that the subjects constitute an independent sample from a larger population (Rubin, 1978), and for notational simplicity we drop the subscript for subjects. Up to §4 we shall assume that all subjects are followed until time . Let be the outcome at time , possibly counterfactual, had the subject started the treatment at time . We assume that the subject's observed outcome is equal to the potential outcome if is the actual treatment starting time ; that is, . We also assume that there is no unmeasured confounding (Robins et al., 1992); that is, for , is conditionally independent of given and . This assumption holds if contains all prognostic factors for that affect the treatment decision at time . Finally, denotes the full information on a subject. Let be the probability measure induced by and the empirical measure induced by , and define .
Following Lok & DeGruttola (2012), we model the treatment effect as
where with fixed. Model (1) compares the average potential outcomes under treatment starting at time and treatment never started, among the subgroup of subjects with covariate history and . As such, it constitutes a conditional causal effect. In observational studies, the treatment assignment mechanism is unknown. We model the probability of treatment initiation at time conditional on the past history as , where the dimension of is finite and fixed. Let denote the estimating function for . Define . As proved in Robins et al. (1992),
and therefore, by the assumption of no unmeasured confounding, . For any measurable, bounded function , let
Then (Lok & DeGruttola, 2012). Therefore, are stacked unbiased estimating equations for both the parameter and the nuisance parameter . For simplicity, we will suppress the dependence of the estimating functions on ; for example, is shorthand for . Sometimes we will also drop the dependence on the parameters.
To derive the optimal estimating equation, and hence the optimal estimator, we assume that for and with , does not depend on . This is a working assumption only, which allows us to derive a closed-form solution for the optimal (Robins, 1994):
Remark 1. —
The optimal vector depends on the unknown and the true distribution through conditional expectations. Following Lok & DeGruttola (2012), we use a preliminary consistent estimate to replace in and . Also, we replace the unknown conditional expectations by estimators under working models, and write and to reflect their dependence on nuisance parameters , and . Denote the estimating functions for , and by , and . By construction, the dimension of is , so the estimating function (2) with (3) has the same dimension as . Under certain modelling assumptions and regularity conditions for the estimating functions (see van der Vaart, 2000, §§5.2 and 5.3), the resulting -estimator is consistent and asymptotically normal. Technical details are given in the Supplementary Material. Specifically, must be correctly specified. In contrast, the estimator remains consistent for if either or is correctly specified. Thus, the estimator is doubly robust (Robins & Rotnitzky, 2001; van der Laan & Robins, 2003).
3. Goodness-of-fit test
Misspecification of the treatment effect model causes bias in treatment effect estimation. Here we develop tests for specification of the treatment effect model based on over-identification restrictions tests. For a correctly specified model, a new set of unbiased estimating functions which differ from the optimal ones used for estimation should be close to zero when evaluated at the optimal estimator. This is formalized in the following theorem.
Theorem 1 (Goodness-of-fit test). —
Let the treatment effect model beand let. Choose a set of functions, witha finite and fixed number, which are different from the optimal choice. Let
The null hypothesisis thatis correctly specified and eitheroris correctly specified. Underand all the required regularity conditions for estimating functions in van der Vaart (2000 §§and), the goodness-of-fit test statistictends toin distribution as, whereis the covariance matrix of, with
whereis the estimating function (2) with (3),,,andare as defined in Remark 1, andis the estimated variance of, withobtained by replacinginwith.
We state here the key steps of the proof; the details can be found in the Supplementary Material. We first establish the asymptotic distribution of . A key step is to linearize as for some function , and apply the central limit theorem. To do so, we assume that the functions form a Donsker class. Using Lemma 19.24 of van der Vaart (2000), we have . We can then express the asymptotic linear representation of as , which is a linear combination of , , , , and , all evaluated at the truth.
The goodness-of-fit test statistic is doubly robust in the sense that for the limiting distribution to hold, we require only that either or be correctly specified. This property adds protection against possible misspecification of the nuisance models.
The goodness-of-fit test with an arbitrary may have low power. We propose the following procedure for choosing . We specify two models: the null model and an alternative model . We can derive and as in (3) with and , respectively. We use in the parameter-identification estimating function (2) and in the goodness-of-fit estimating function (4). Our simulation study shows that the goodness-of-fit test with has high power in the scenarios considered.
4. Extension of goodness-of-fit test in the presence of censoring
We use inverse probability of censoring weighting (Robins et al., 1995; Hernán et al., 2005) to accommodate subjects lost to follow-up. Let indicate that a subject remains in the study at time . Following Lok & DeGruttola (2012), we assume that censoring is missing at random; that is, is independent of given . Then . Define the inverse probability of censoring weighted estimating functions and using weights ; see Lok & DeGruttola (2012). We assume that the censoring model is correctly specified and identified with estimating functions . Similarly, we denote the inverse probability of censoring weighted estimating function for the preliminary estimator by . For the nuisance regression outcome models, the regression was also weighted. Define the goodness-of-fit test statistic in the presence of censoring by where is the estimated variance of , with
As proved in the Supplementary Material, subject to regularity conditions, has an asymptotic distribution, with degrees of freedom equal to the dimension of .
Our simulation study is based on the HIV data described in §6. Following the approach described in a technical report available from the second author, we used an autoregressive model for the time course of the CD4 count under no treatment, which may be more realistic in months than at earlier times given the different behaviour of CD4 counts during the first six months after infection. Therefore, we simulated data in months 6 to 30. First, in each sample, two groups were simulated: 10% injection drug users and 90% non-injection drug users. The outcome was simulated as log-normal: for injection drug users, and for non-injection drug users. For , where , with for and for . Second, was generated by , where injdrug is an indicator of injection drug use. Finally, . We considered different models for . The censoring process was generated by . Under this model, the average proportion of patients being censored before month 30 is about 42%.
The performance of the test statistics was assessed in terms of Type I error and power. We are interested in testing versus Under , we specified for which the test should have optimal power. Four scenarios were considered. In scenarios (a) and (b), is correctly specified; in scenarios (c) and (d), is misspecified. In scenario (c), is nested in ; and in scenario (d), is not nested in .
- True: , , and .
- True: , , and .
- True: , , and .
- True: , , and .
We estimated the size and power by the frequency of rejecting in 1000 simulated datasets. We considered the following choices of : (i) , a naive choice; (ii) , which is obtained from formula (3) after replacing by and the covariance matrix by a working identity matrix; and (iii) , which is obtained from formula (3) upon replacing by . The nuisance models are specified in the Supplementary Material. In addition to the goodness-of-fit test statistic, we considered an elaborated model fitting and testing approach, which combines the null model with , and tests whether the parameters corresponding to are equal to zero. As can be seen from Table 1, the goodness-of-fit test procedure does not control Type I error well for scenario (b) with ; however, in scenarios (a) and (b), it controls Type I error with all choices of for and . This suggests that the distribution provides an accurate approximation to the finite-sample behaviour of the goodness-of-fit test statistic for moderate sample sizes. From scenarios (c) and (d), it can be seen that the goodness-of-fit test procedure with has the highest power, and as the sample size increases, the power increases, confirming the theoretical results. The goodness-of-fit test procedure with is comparable to the elaborated model fitting and testing approach when testing nested models as in scenario (c); however, it shows more power in detecting nonnested models as in scenario (d), probably because the elaborated model fitting and testing approach fits a larger model and hence loses power.
We used the proposed test to study how the timing of antiretroviral treatment initiation after HIV infection predicts the effect of one year of treatment in HIV-positive patients. We analysed data from the Acute Infection and Early Disease Research Program, which is a multicentre, observational cohort study of HIV-positive patients diagnosed during acute and early infection (Hecht et al., 2006). Dates of infection were estimated based on a stepwise algorithm that uses clinical and laboratory data (Smith et al., 2006). We included patients with CD4 and viral load measured within 12 months of the estimated date of infection, which resulted in 1696 patients. Let be the patient's CD4 count at month after the estimated date of infection, and let be a vector of covariates including age, gender, race, injection drug use, CD4 count and viral load. We let range from 0 to 23 and from to , to avoid making extra modelling assumptions beyond those necessary to estimate the one-year treatment effect . We assumed that treatment can only be initiated at visit times. If was missing at a visit time, we carried the last observation forward. For intermediate missing outcomes, we imputed by interpolation; of the outcomes were missing just prior to onset of treatment, and in such cases we carried the last observation forward. The percentage of patients censored before month 24 is about .
We started with a simple null model for the treatment effect, , and conducted three directed alternative-model tests directed at gender, age and injection drug use, as suggested in the clinical literature. For the test directed at a certain variable , we calculated the goodness-of-fit test statistic with having the optimal form derived from . The nuisance models are specified in the Supplementary Material. Table 2 shows the results. The -values are all greater than , which suggests that there is no significant evidence for rejection of the null model. The results indicate a benefit of antiretroviral treatment; for example, starting treatment at the estimated date of infection would lead to an expected added improvement in CD4 counts of cells/mm after a year of therapy. Delaying treatment initiation may diminish the CD4 count gain associated with one year of treatment, since ; however, this result is not statistically significant.
Application of our proposed test to the HIV data: the optimal estimator fitting the null treatment effect model, showing point estimates (with confidence intervals in parentheses), goodness-of-fit statistics, associated degrees of freedom, and -values...
We thank James Robins, Victor DeGruttola, Susan Little and Davey Smith for insightful and fruitful discussions. This work was sponsored by the U.S. National Institutes of Health.
- Arellano M. & Bond S. (1991). Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Rev. Econ. Studies58, 277–97.
- Hansen L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica50, 1029–54.
- Hansen L. P., Heaton J. & Yaron A. (1996). Finite-sample properties of some alternative GMM estimators. J. Bus. Econ. Statist.14, 262–80.
- Hecht F. M., Wang L., Collier A., Little S., Markowitz M., Margolick J., Kilby J. M., Daar E. & Conway B. (2006). A multicenter observational study of the potential benefits of initiating combination antiretroviral therapy during acute HIV infection. J. Infect. Dis.194, 725–33. [PubMed]
- Hernán M. A., Cole S. R., Margolick J., Cohen M. & Robins J. M. (2005). Structural accelerated failure time models for survival analysis in studies with time-varying treatments. Pharmacoepidemiol. Drug Safety14, 477–91. [PubMed]
- Lindsay B. G. & Qu A. (2003). Inference functions and quadratic score tests. Statist. Sci.18, 394–410.
- Lok J. J. & DeGruttola V. (2012). Impact of time to start treatment following infection with application to initiating HAART in HIV-positive patients. Biometrics68, 745–54. [PMC free article][PubMed]
- Newey W. K. & McFadden D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics4, 2111–245.
- Qu A., Lindsay B. G. & Li B. (2000). Improving generalised estimating equations using quadratic inference functions. Biometrika87, 823–36.
- Robins J. M. (1994). Correcting for non-compliance in randomized trials using structural nested mean models. Commun. Statist. A23, 2379–412.
- Robins J. M. (1998). Correction for non-compliance in equivalence trials. Statist. Med.17, 269–302. [PubMed]
- Robins J. M. (2000). Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology, the Environment, and Clinical Trials. New York: Springer, pp. 95–133.
- Robins J. M., Blevins D., Ritter G. & Wulfsohn M. (1992). G-estimation of the effect of prophylaxis therapy for Pneumocystis carinii pneumonia on the survival of AIDS patients. Epidemiology3, 319–36. [PubMed]
- Robins J. M., Hernan M. A. & Brumback B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology11, 550–60. [PubMed]
- Robins J. M. & Rotnitzky A. (2001). Comment on the Bickel and Kwon article ‘Inference for semiparametric models: Some questions and an answer’. Statist. Sinica11, 863–960.
- Robins J. M., Rotnitzky A. & Zhao L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Am. Statist. Assoc.90, 106–21.
- Rubin D. B. (1978). Bayesian inference for causal effects: The role of randomization. Ann. Statist.6, 34–58.
- Sargan J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica26, 393–415.
- Smith D. M., Strain M. C., Frost S. D., Pillai S. K., Wong J. K., Wrin T., Liu Y., Petropolous C. J., Daar E. S. & Little S. J. et al (2006). Lack of neutralizing antibody response to HIV-1 predisposes to superinfection. Virology355, 1–5. [PubMed]
- van der Laan M. J. & Robins J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. Berlin: Springer.
- van der Vaart A. W. (2000). Asymptotic Statistics. Cambridge: Cambridge University Press.
An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.
Common examples of F-tests
Common examples of the use of F-tests include the study of the following cases:
In addition, some statistical procedures, such as Scheffé's method for multiple comparisons adjustment in linear models, also use F-tests.
F-test of the equality of two variances
Main article: F-test of equality of variances
The F-test is sensitive to non-normality. In the analysis of variance (ANOVA), alternative tests include Levene's test, Bartlett's test, and the Brown–Forsythe test. However, when any of these tests are conducted to test the underlying assumption of homoscedasticity (i.e. homogeneity of variance), as a preliminary step to testing for mean effects, there is an increase in the experiment-wise Type I error rate.
Formula and calculation
Most F-tests arise by considering a decomposition of the variability in a collection of data in terms of sums of squares. The test statistic in an F-test is the ratio of two scaled sums of squares reflecting different sources of variability. These sums of squares are constructed so that the statistic tends to be greater when the null hypothesis is not true. In order for the statistic to follow the F-distribution under the null hypothesis, the sums of squares should be statistically independent, and each should follow a scaled χ²-distribution. The latter condition is guaranteed if the data values are independent and normally distributed with a common variance.
Multiple-comparison ANOVA problems
The F-test in one-way analysis of variance is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. For example, suppose that a medical trial compares four treatments. The ANOVA F-test can be used to assess whether any of the treatments is on average superior, or inferior, to the others versus the null hypothesis that all four treatments yield the same mean response. This is an example of an "omnibus" test, meaning that a single test is performed to detect any of several possible differences. Alternatively, we could carry out pairwise tests among the treatments (for instance, in the medical trial example with four treatments we could carry out six tests among pairs of treatments). The advantage of the ANOVA F-test is that we do not need to pre-specify which treatments are to be compared, and we do not need to adjust for making multiple comparisons. The disadvantage of the ANOVA F-test is that if we reject the null hypothesis, we do not know which treatments can be said to be significantly different from the others, nor, if the F-test is performed at level α, can we state that the treatment pair with the greatest mean difference is significantly different at level α.
The formula for the one-way ANOVAF-test statistic is
The "explained variance", or "between-group variability" is
where denotes the sample mean in the i-th group, is the number of observations in the i-th group, denotes the overall mean of the data, and denotes the number of groups.
The "unexplained variance", or "within-group variability" is
where is the jth observation in the ith out of K groups and N is the overall sample size. This F-statistic follows the F-distribution with degrees of freedom and under the null hypothesis. The statistic will be large if the between-group variability is large relative to the within-group variability, which is unlikely to happen if the population means of the groups all have the same value.
Note that when there are only two groups for the one-way ANOVA F-test, where t is the Student's statistic.
Further information: Stepwise regression
Consider two models, 1 and 2, where model 1 is 'nested' within model 2. Model 1 is the restricted model, and Model 2 is the unrestricted one. That is, model 1 has p1 parameters, and model 2 has p2 parameters, where p2 > p1, and for any choice of parameters in model 1, the same regression curve can be achieved by some choice of the parameters of model 2. The model with more parameters will always be able to fit the data at least as well as the model with fewer parameters. Thus typically model 2 will give a better (i.e. lower error) fit to the data than model 1. But one often wants to determine whether model 2 gives a significantly better fit to the data. One approach to this problem is to use an F-test.
If there are n data points to estimate parameters of both models from, then one can calculate the F statistic, given by
where RSSi is the residual sum of squares of model i. If the regression model has been calculated with weights, then replace RSSi with χ2, the weighted sum of squared residuals. Under the null hypothesis that model 2 does not provide a significantly better fit than model 1, F will have an F distribution, with (p2−p1, n−p2) degrees of freedom. The null hypothesis is rejected if the F calculated from the data is greater than the critical value of the F-distribution for some desired false-rejection probability (e.g. 0.05). The F-test is a Wald test.