Welcome

Welcome to Predictive Modeling! This book covers the statistical foundations and computational practice of regression-based predictive modeling. The unifying thread is the regression model \(Y = m(X_1, \ldots, X_p) + \varepsilon\), which relates a response \(Y\) to predictors \(X_1, \ldots, X_p\). How do we estimate \(m\)? How to quantify the uncertainty in its estimation? How to evaluate predictive performance in practice?

The book builds from linear models into their modern extensions, adaptations, and generalizations. Another title could have been “linear models and friends”, understanding “friends” generously as methods that extend linear models or use them as building blocks. This progression moves from the rigid structure of vanilla linear models to the flexibility of nonparametric smoothing, passing through high-dimensional, big data, and non-normal settings.

The book strikes a balance between statistical theory and hands-on modeling practice. Each chapter develops the key ideas alongside fully reproducible R code on simulated and real data. Three main case studies anchor the core chapters: predicting Bordeaux wine quality from meteorological variables, modeling Boston housing values from structural and neighborhood characteristics, and assessing O-ring failure probability in the Challenger disaster. The exposition is complemented by circa 20 Shiny applications, allowing the reader to visualize key concepts and develop intuition for how methods behave in practice.

A collection of more than 60 exercises is included to help the reader test their understanding and apply the methods. These exercises vary in difficulty and type, mainly focusing on data analysis tasks and programming exercises. The reader should close each chapter with a clear understanding of why and how the methods work, together with the practical ability to deploy them.

Scope of the book

The central focus of the book is regression modeling for prediction through the regression model \(Y = m(X_1, \ldots, X_p) + \varepsilon\). The main arc runs from linear models and their practical and modern extensions (model selection, dimension reduction, shrinkage, and big data), to generalized linear models for different response types, and finally to nonparametric regression via local polynomial fitting and local likelihood. Recurring themes are bias–variance tradeoffs, regularization and tuning, model assessment, and the practical consequences of modeling assumptions.

The book does not intend to be comprehensive on the predictive modeling landscape. Important methods, such as tree-based models, neural networks, and support vector machines, lie beyond the scope of this text. Nevertheless, the concepts emphasized throughout, such as regularization through shrinkage, model selection, and diagnostic checks, provide a solid foundation for engaging with the wider predictive modeling landscape.

The book begins with Chapter 1, Introduction, which frames predictive modeling through the regression model \(Y = m(X_1, \ldots, X_p) + \varepsilon\) and clarifies what it means to learn a regression function from data. It introduces the key tradeoffs that recur throughout, such as prediction accuracy versus interpretability and bias versus variance, and illustrates overfitting. The chapter also establishes the notation and basic probabilistic background used in the rest of the book.

Chapter 2, Linear models I, develops the multiple linear model \(Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon\) as the foundational parametric approach to regression. Beginning with least squares and its geometric interpretation, the chapter treats model assumptions and their necessity, sampling distributions and inference for coefficients, confidence and prediction intervals, ANOVA decomposition, and goodness-of-fit measures. The prediction of Bordeaux wine quality from weather data provides a running case study through the chapter.

Chapter 3, Linear models II, focuses on linear modeling in more realistic scenarios, where many predictors are available, hence many competing model specifications are possible, transformations of predictors are often needed, and assumptions need to be checked. It covers model selection through information criteria and cross-validation, notes on model selection consistency, use of qualitative predictors and interactions, nonlinear extensions including polynomials, and the usual toolkit for model diagnostics. The chapter closes with dimension reduction techniques within linear models: principal components regression, including a review of principal component analysis, and partial least squares. The pricing of housing in Boston is used as the recurrent case study.

Chapter 4, Linear models III, extends linear regression to non-classical settings in which ordinary least squares is infeasible, unstable, or impractical. Ridge regression and the lasso are introduced as shrinkage methods for high-dimensional problems in which \(p\gg n\), with particular attention to tuning-parameter selection, analytical form of ridge regression, and variable selection through lasso. The chapter also treats linear models where the parameter vector \(\boldsymbol\beta\) is subject to linear constraints. A different extension is the multivariate linear model where the response is a vector \(\mathbf{Y} = (Y_1, \ldots, Y_q)\): estimation, inference, and shrinkage through ridge and lasso are developed. Mathematical and computational considerations for big data close the chapter.

Chapter 5, Generalized linear models, broadens the regression framework to non-normal responses through the exponential family and link functions. Logistic regression for binary data is developed in depth: estimation by maximum likelihood, interpretation via odds ratios, inference, deviance, model selection, diagnostics, shrinkage through ridge and lasso, and big data considerations. Formulation and estimation are extended to generic generalized linear models, with specifics provided for Poisson regression and binomial regression for count data. The Challenger disaster provides the central case study for the chapter.

Finally, Chapter 6, Nonparametric regression, moves beyond parametric structure by estimating \(m\) with minimal assumptions, letting the data determine the shape of the regression function. The chapter develops the core tools of smoothing starting from nonparametric density estimation (histograms and kernel density estimators) and then moving to kernel regression, including the Nadaraya–Watson estimator and local polynomial regression, with careful attention to bandwidth selection and its bias–variance implications. Extensions to mixed multivariate predictors, prediction and confidence intervals, and local likelihood methods for binary responses complete the chapter.

Software & code

The book contains a substantial amount of code, with snippets that are intended to be self-contained within the chapter in which they appear. This helps illustrate how the methods and theory translate to practice. The software employed throughout this book is the statistical language R and the RStudio IDE (Integrated Development Environment). Basic knowledge of both is assumed.¹

The Shiny interactive apps in the book can be downloaded and run locally, which also allows inspection of their source code. Check out this GitHub repository for the sources.

Several packages that are not included within R by default are used throughout the book. All of the packages can be installed with the following commands:

# Installation of required packages
packages <- c("MASS", "car", "readxl", "rgl", "rmarkdown", "nortest",
              "latex2exp", "pca3d", "ISLR", "pls", "corrplot", "glmnet",
              "mvtnorm", "biglm", "leaps", "lme4", "viridis", "ffbase",
              "ks", "KernSmooth", "nor1mix", "np", "locfit",
              "manipulate", "mice", "VIM", "nnet")
install.packages(packages)

The book makes explicit mention of the package to which a function belongs by using the operator ::, except when the use of the functions of a package is very repetitive and that package is loaded. You can load all the packages by running:

# Load packages
lapply(packages, library, character.only = TRUE)

The book’s R code snippets are collected in the following scripts:

Chapter 1: 01-intro.R.
Chapter 2: 02-lm-i.R.
Chapter 3: 03-lm-ii.R.
Chapter 4: 04-lm-iii.R.
Chapter 5: 05-glm.R.
Chapter 6: 06-npreg.R.
Appendix: 07-appendix.R.

To download them, use your browser’s Save link as… option.

Datasets

The following is a handy list of all the relevant datasets used in the course together with brief descriptions. The list is sorted according to the order of appearance of the datasets in the notes.

wine.csv. The dataset is formed by the auction Price of \(27\) red Bordeaux vintages, five vintage descriptors (WinterRain, AGST, HarvestRain, Age, Year), and the population of France in the year of the vintage (FrancePop).
least-squares.RData. Contains a single data.frame, named least_squares, with 50 observations of the variables x, y_lin, y_qua, and y_exp. These are generated as \(X\sim\mathcal{N}(0,1),\) \(Y_\mathrm{lin}=-0.5+1.5X+\varepsilon,\) \(Y_\mathrm{qua}=-0.5+1.5X^2+\varepsilon,\) and \(Y_\mathrm{exp}=-0.5+1.5\cdot2^X+\varepsilon,\) with \(\varepsilon\sim\mathcal{N}(0,0.5^2).\) The purpose of the dataset is to illustrate the least squares fitting.
least-squares-3D.RData. Contains a single data.frame, named leastSquares3D, with \(50\) observations of the variables x1, x2, x3, y_lin, y_qua, and y_exp. These are generated as \(X_1,X_2\sim\mathcal{N}(0,1),\) \(X_3=X_1+\mathcal{N}(0,0.05^2),\) \(Y_\mathrm{lin}=-0.5 + 0.5 X_1 + 0.5 X_2 +\varepsilon,\) \(Y_\mathrm{qua}=-0.5 + X_1^2 + 0.5 X_2+\varepsilon,\) and \(Y_\mathrm{exp}=-0.5 + 0.5 e^{X_2} + X_3+\varepsilon,\) with \(\varepsilon\sim\mathcal{N}(0,1).\) The purpose of the dataset is to illustrate the least squares fitting with several predictors.
assumptions.RData. Contains the data frame assumptions with \(200\) observations of the variables x1, …, x9 and y1, …, y9. The purpose of the dataset is to identify which regression y1 ~ x1, …, y9 ~ x9 fulfills the assumptions of the linear model. The moreAssumptions.RData dataset has the same structure.
assumptions3D.RData. Contains the data frame assumptions3D with \(200\) observations of the variables x1.1, …, x1.8, x2.1, …, x2.8 and y.1, …, y.8. The purpose of the dataset is to identify which regression y.1 ~ x1.1 + x2.1, …, y.8 ~ x1.8 + x2.8 fulfills the assumptions of the linear model.
Boston.xlsx. The dataset contains \(14\) variables describing \(506\) suburbs in Boston. Among those variables, medv is the median house value, rm is the average number of rooms per house, and crim is the per capita crime rate. The full description is available in ?MASS::Boston.
cpus.txt and gpus.txt. The datasets contain \(102\) and \(35\) rows, respectively, of commercial CPUs and GPUs, from the first models to the present. The variables in the datasets are Processor, Transistor count, Date of introduction, Manufacturer, Process, and Area.
la-liga-2015-2016.xlsx. Contains 19 performance metrics for the 20 football teams in La Liga 2015/2016.
challenger.txt. Contains data for \(23\) space-shuttle launches. There are \(8\) variables. Among them: temp (the temperature in Celsius degrees at the time of launch), and fail.field and fail.nozzle (indicators of whether there were incidents in the O-rings of the field joints and nozzles of the solid rocket boosters).
species.txt. Contains data for \(90\) country parcels in which the Biomass, pH of the terrain (categorical variable), and number of Species were measured.
heart.txt. Contains data for \(226\) patients suspected of having a future heart attack. The variables are CK (level of creatinine kinase), and ha and ok (number of patients that suffered a heart attack and did not suffer it, respectively).
Chile.txt. Contains data for \(2700\) respondents on a survey for the voting intentions in the 1988 Chilean national plebiscite. There are \(8\) variables: region, population, sex, age, education, income, statusquo (scale of support for the status quo), and vote. vote is a factor with levels A (abstention), N (against Pinochet), U (undecided), and Y (for Pinochet). Retrieved from data(Chile, package = "carData").

To download them, use your browser’s Save link as… option.

Citation

You may use the following \(\text{B}{\scriptstyle\text{IB}}\mkern-2mu \text{T}\mkern-3mu\lower.5ex\hbox{E}\mkern-2mu\text{X}\) entry when citing this book:

@book{PredictiveModeling2026,
    title        = {Predictive Modeling},
    author       = {Garc\'ia-Portugu\'es, E.},
    year         = {2026},
    note         = {Version 6.1.0. ISBN 978-84-09-29679-8},
    url          = {https://egarpor.github.io/PM-UC3M/}
}

You may also want to use the following template:

García-Portugués, E. (2026). Predictive Modeling. Version 6.1.0. ISBN 978-84-09-29679-8. Available at https://egarpor.github.io/PM-UC3M/.

A previous version of the book at the now-discontinued bookdown.org hosting service was known as Notes for Predictive Modeling.

License

All the material in this book is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License (CC BY-NC-ND 4.0). You may not use this material except in compliance with the aforementioned license. The human-readable summary of the license states that:

You are free to:
- Share – Copy and redistribute the material in any medium or format.
Under the following terms:
- Attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial – You may not use the material for commercial purposes.
- NoDerivatives – If you remix, transform, or build upon the material, you may not distribute the modified material.

Contributions

Contributions, reporting of typos, and feedback on the book are very welcome. Send an email to edgarcia@est-econ.uc3m.es and I will gladly add your name to the list of contributors.

Credits

Several excellent references were used in preparing this book. The following list presents the books that have been consulted:

Chacón and Duong (2018) (Section 6.1.4)
DasGupta (2008) (Section 3.5.2)
Durbán (2017) (Section 5.2.2)
Fan and Gijbels (1996) (Sections 6.2, 6.2.3, and 6.2.4)
Hastie et al. (2009) (Section 4.1)
James et al. (2013) (Sections 2.2 – 2.7, 3.1, 3.5, and 3.6.3, 4.1)
Kuhn and Johnson (2013) (Section 1.1)
Li and Racine (2007) (Section 6.3)
Loader (1999) (Section 6.5)
McCullagh and Nelder (1983) (Sections 5.2 – 5.6)
Peña (2002) (Sections 2.2 – 2.7, 3.5, and 5.2.1)
Seber and Lee (2003) (Section 4.2)
Seber (1984) (Section 4.3)
Wand and Jones (1995) (Sections 6.1.2, 6.1.3, and 6.2.4)
Wasserman (2004) (Sections 6.5)
Wasserman (2006) (Sections 6.2.4)
Wood (2006) (Sections 5.2.2 and 5.7)

This book was made possible thanks to the excellent open-source software Xie (2016), Xie (2020), Allaire et al. (2020), Xie and Allaire (2020), and R Core Team (2020). In addition, some layout improvements are built on the outstanding work of Úcar (2018). The icons used in the book were designed by madebyoliver, freepik, and roundicons from Flaticon.

Last but not least, the book has benefited from contributions from the following people (in alphabetical order):

Apezteguía García, Ainara (fixed a typo)
Botz, Katherine (performed a thorough proofreading of the course materials, fixing a large number of typos)
Caballero Cárdenas, Jorge (fixed five typos)
Carrera Maestro, Antonio (fixed two bugs and three typos)
Castillo Estévez, Marcos José (fixed two typos)
Cerdán Pedraza, Luis (performed an outstanding proofreading of the course materials, fixing more than fifty typos and style issues)
Chettouh, Frederik (fixed a typo and two bugs)
Demir, Gulnur (fixed two typos)
Escalante Ariza, Andrés (fixed a typo)
Fernández, José Ángel (fixed several typos)
Fernández de Marcos Giménez de los Galanes, Alberto (fixed three typos)
García Ramírez, Celia (fixed two bugs)
González Berzal, Trinidad (fixed a typo)
González González, David (fixed two typos)
Marín Abril, Antonio (fixed two bugs)
Modet Álamo, Andrés (performed an excellent review of the course materials detecting and fixing more than thirty typos and four bugs)
Palmero Muñoz, Santiago (fixed a typo and a bug)
Petraccaro, Federico (fixed three typos)
Ramírez Díaz, Enrique (fixed a typo)
Razgovorov, Pavel (fixed a typo)
Rodríguez Beltrán, Cristina (fixed a typo and two bugs)
Rodríguez Ramírez, Manuel (fixed two typos)
Romero González, Celia (fixed a typo)
Royo Ruiz, Carlota (fixed a bug)
Stincone, Leonardo (fixed a typo and a bug)

References

Allaire, J. J., Y. Xie, J. McPherson, et al. 2020. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.

Chacón, J. E., and T. Duong. 2018. Multivariate Kernel Smoothing and Its Applications. Vol. 160. Monographs on Statistics and Applied Probability. CRC Press. https://doi.org/10.1201/9780429485572.

DasGupta, A. 2008. Asymptotic Theory of Statistics and Probability. Springer Texts in Statistics. Springer. https://doi.org/10.1007/978-0-387-75971-5.

Durbán, M. 2017. Modelización Estadística. Lecture notes.

Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. Vol. 66. Monographs on Statistics and Applied Probability. Chapman & Hall. https://doi.org/10.2307/2670134.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning. Springer Series in Statistics. Springer. https://doi.org/10.1007/978-0-387-84858-7.

James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 103. Springer Texts in Statistics. Springer. https://doi.org/10.1007/978-1-4614-7138-7.

Kuhn, M., and K. Johnson. 2013. Applied Predictive Modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3.

Li, Q., and J. S. Racine. 2007. Nonparametric Econometrics. Princeton University Press. https://press.princeton.edu/books/hardcover/9780691121611/nonparametric-econometrics.

Loader, C. 1999. Local Regression and Likelihood. Statistics and Computing. Springer. https://doi.org/10.2307/1270956.

McCullagh, P., and J. A. Nelder. 1983. Generalized Linear Models. Monographs on Statistics and Applied Probability. Chapman & Hall. https://doi.org/10.1007/978-1-4899-3244-0.

Peña, D. 2002. Regresión y Diseño de Experimentos. Alianza Editorial. https://www.alianzaeditorial.es/libro/manuales/regresion-y-diseno-de-experimentos-daniel-pena-9788420693897/.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna. https://www.R-project.org/.

Seber, G. A. F. 1984. Multivariate Observations. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. John Wiley & Sons. https://doi.org/10.1002/9780470316641.

Seber, G. A. F., and A. J. Lee. 2003. Linear Regression Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience. https://doi.org/10.1002/9780471722199.

Úcar, I. 2018. “Energy Efficiency in Wireless Communications for Mobile User Devices.” PhD thesis, Universidad Carlos III de Madrid. https://enchufa2.github.io/thesis/.

Wand, M. P., and M. C. Jones. 1995. Kernel Smoothing. Vol. 60. Monographs on Statistics and Applied Probability. Chapman & Hall. https://doi.org/10.1007/978-1-4899-4493-1.

Wasserman, L. 2004. All of Statistics. Springer Texts in Statistics. Springer-Verlag. https://doi.org/10.1007/978-0-387-21736-9.

Wasserman, L. 2006. All of Nonparametric Statistics. Springer Texts in Statistics. Springer-Verlag. https://doi.org/10.1007/0-387-30623-4.

Wood, S. N. 2006. Generalized Additive Models. Texts in Statistical Science Series. Chapman & Hall/CRC. https://doi.org/10.1201/9781420010404.

Xie, Y. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. The r Series. CRC Press. https://bookdown.org/yihui/bookdown/.

Xie, Y. 2020. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.

Xie, Y., and J. J. Allaire. 2020. tufte: Tufte’s Styles for R Markdown Documents. https://CRAN.R-project.org/package=tufte.

Among others: basic programming in R, ability to work with objects and data structures, ability to produce graphics, knowledge of the main statistical functions, and ability to run scripts in RStudio.↩︎

1 Introduction