Consistent bootstrap procedures for post-model-selection inference
Statistical inference is often carried out under a model pre-selected by a data-driven procedure, as if this model were the “correct” model. The prevalence of this practice in applied work necessitates a careful evaluation of the effects of model selection on conclusions drawn from standard statistical inference. Recent studies on the distributions of post-model-selection estimators and predictors have afforded better insight into the issue: these distributions typically take a complicated form that departs from the “classical” Gaussian paradigm, pointing to the possible invalidity of standard post-model-selection inference. The problem is particularly acute in regression studies, where inference is often conducted on a “final” model derived from a data-driven variable selection procedure without taking into account the uncertainty induced by the selection process. This calls into question the validity of standard regression inference tools such as t-tests and ANOVA, from which misleading conclusions may be drawn. It is therefore important to develop accurate estimates of the “complicated” distributions of post-model-selection estimators, a task for which the conventional asymptotic approach has been found to be particularly ill-equipped. Standard resampling procedures such as the paired and residual bootstraps do not provide a satisfactory solution either, as they often turn out to be inconsistent. This project aims to develop a more reliable bootstrap-based procedure for estimating the distributions of post-model-selection estimators, one that suffers less from the aforementioned difficulties. The proposed procedure is novel and can be viewed as an adaptive modification of existing bootstrap methods. Its performance will be investigated theoretically and empirically in settings where estimation of the distribution is known to be difficult.
The project, on successful completion, will deliver a more reliable inference scheme for post-model-selection regression analysis, which will find useful applications in a wide spectrum of quantitative studies.