Toward building a transparent statistical model for improving crop yield prediction: Modeling rainfed corn in the U.S.

Yan Li, Kaiyu Guan, Albert Yu, Bin Peng, Lei Zhao, Bo Li, Jian Peng

Research output: Contribution to journal › Article › peer-review


Statistical crop models have been a major tool in identifying critical drivers of crop yield, forecasting short-term crop yield, and assessing long-term climate change impacts on agricultural productivity. However, few studies focus specifically on fundamental issues encountered in developing a high-performance statistical crop model for yield prediction. Such issues include: how to select predictors and fitting functions, how to effectively address the spatiotemporal scale issue, whether it is beneficial to include satellite data as explanatory variables, and how to reconcile different model evaluation procedures. In this study, we present our statistical modeling practices for predicting rainfed corn yield in the Midwest U.S. and address the aforementioned issues through comprehensive diagnostic analysis. Our results show that vapor pressure deficit and precipitation at a monthly scale, in spline form with customized knots, define the “Best Climate-only” model among alternative climate variables (e.g., air temperature) and fitting functions (e.g., linear or polynomial), with an out-of-sample (leave-one-year-out) median R² of 0.79 and RMSE of 1.04 t/ha (16.6 bu/acre) from 2003 to 2016. Satellite variables, such as MODIS land surface temperature and Enhanced Vegetation Index (EVI), when used as predictors alone, reduce the model's RMSE to 0.93 t/ha (14.8 bu/acre). Adding satellite variables (i.e., EVI in polynomial form) to the “Best Climate-only” model gives the “Best Climate + EVI” model, which has the highest prediction performance of this study, with a median R² of 0.85 and RMSE of 0.90 t/ha (14.3 bu/acre). Such a model trained using all data (the so-called “global model”) in most cases leads to better predictions than the state-specific trained models. However, the global model's prediction performance exhibits considerable regional and interannual variations.
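The paired RMSE values above (t/ha and bu/acre) can be cross-checked with a short conversion. This is only a sketch: the abstract does not state its conversion factors, so the constants below (a 56 lb corn bushel, 0.45359237 kg per pound, 2.4710538 acres per hectare) and the function name are assumptions introduced here for illustration.

```python
# Assumed conversion constants (not given in the abstract):
# one US bushel of corn = 56 lb, 1 lb = 0.45359237 kg, 1 ha = 2.4710538 acres.
KG_PER_BUSHEL_CORN = 56 * 0.45359237
ACRES_PER_HECTARE = 2.4710538


def t_per_ha_to_bu_per_acre(t_per_ha: float) -> float:
    """Convert a corn yield (or yield error) from t/ha to bu/acre."""
    bushels_per_ha = t_per_ha * 1000.0 / KG_PER_BUSHEL_CORN
    return bushels_per_ha / ACRES_PER_HECTARE


# RMSE values quoted in the abstract, in t/ha.
for rmse in (1.04, 0.93, 0.90):
    print(f"{rmse:.2f} t/ha = {t_per_ha_to_bu_per_acre(rmse):.1f} bu/acre")
```

Under these assumed constants, the three values round to the bu/acre figures quoted in the abstract (16.6, 14.8, and 14.3).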
The regionally varying performance is related to states’ spatiotemporal variability in yield: states with larger spatial yield variability show higher R², and states with smaller temporal yield variability show lower RMSE. Interannual variations in prediction performance are linked to yield variability and degree of wetness, with higher R² in years with larger yield variability but increasingly larger RMSE toward wetter years and extreme dry years. These spatial and temporal variations in model performance, together with inconsistent evaluation practices, undermine the comparability between statistical modeling studies. Alleviating such comparability issues requires more transparency and open data practices. The statistical model presented in this study provides a benchmark for further development and can be applied to future research related to yield prediction or assessment of climate change impact.
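The leave-one-year-out evaluation named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. It uses synthetic data, a single predictor loosely labeled as vapor pressure deficit, a linear-spline (hinge) basis with hypothetical knots, and R² computed as 1 - SSE/SST, which may differ from the definition used in the paper. All function and variable names are invented here.

```python
import numpy as np


def hinge_basis(x, knots):
    """Linear-spline design matrix: intercept, x, and max(x - k, 0) per knot."""
    cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)


def leave_one_year_out(x, y, years, knots):
    """Hold out each year in turn; return {year: (R^2, RMSE)} on the held-out year."""
    scores = {}
    for held_out in np.unique(years):
        test = years == held_out
        train = ~test
        # Ordinary least squares on all other years.
        coef, *_ = np.linalg.lstsq(hinge_basis(x[train], knots), y[train], rcond=None)
        resid = y[test] - hinge_basis(x[test], knots) @ coef
        rmse = float(np.sqrt(np.mean(resid ** 2)))
        sst = np.sum((y[test] - y[test].mean()) ** 2)
        r2 = float(1.0 - np.sum(resid ** 2) / sst)
        scores[int(held_out)] = (r2, rmse)
    return scores


# Synthetic data only (hypothetical numbers, not the study's data).
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2003, 2017), 50)       # 14 years, 50 samples each
vpd = rng.uniform(5.0, 30.0, size=years.size)      # hypothetical predictor values
yield_t_ha = 11.0 - 0.25 * np.maximum(vpd - 15.0, 0.0) + rng.normal(0.0, 0.5, years.size)

scores = leave_one_year_out(vpd, yield_t_ha, years, knots=[10.0, 15.0, 20.0])
median_r2 = float(np.median([s[0] for s in scores.values()]))
median_rmse = float(np.median([s[1] for s in scores.values()]))
print(f"median R2 = {median_r2:.2f}, median RMSE = {median_rmse:.2f} t/ha")
```

Reporting the median across held-out years, as the abstract does, keeps a single unusual year from dominating the summary. The per-year scores in `scores` are what would reveal the interannual variation the abstract describes.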

Original language: English (US)
Pages (from-to): 55-65
Number of pages: 11
Journal: Field Crops Research
State: Published - Mar 15 2019


  • Agriculture
  • Corn
  • Statistical model
  • Yield forecast
  • Yield prediction

ASJC Scopus subject areas

  • Agronomy and Crop Science
  • Soil Science

