A commented and visual walk-through of advanced regression techniques.
Author
Jorge A. Thomas
Published
December 1, 2023
Introduction
In this classic Kaggle dataset (Anna Montoya 2016) you’ll follow my workflow for developing pipelines in the ETL phase of the DS cycle using the R programming language (R Core Team 2023), as well as a tidy approach to the EDA, substantially supported by Visual Analytics. I’ll be adding some tips on the right margin along the way.
The Ames House dataset was prepared to improve on the older classic Boston Houses dataset and, given its open competition nature, it comprises two files:
One file contains features and labels, or in other words, predictors and responses. Here this is the training dataset. It’s all I have to split the Data Budget.
The other file contains only features (predictors) for making the predictions of the final submission. Here this is the test dataset.
The goal of the competition is to achieve minimum RMSE.
I will keep updating this notebook, making improvements and reporting the rank obtained with my submitted predictions.
PHASE I: ETL / EDA
Extract Data
I checked beforehand that there are no missing values, here NA, in the target variable SalePrice. Therefore, I will write a pipeline to read and concatenate both datasets (bind rows), adding an extra column dataset labelled “train” or “test” for easy splitting later1 (subset or filter).
1 The whole Feature Transformation pipeline must always be the same for all predictors in both datasets.
Code
```{r}
#| label: Load Data
#| warning: false

ames_train_raw <- read_csv("./data/raw/train.csv") # Train, validation and test dataset
print("Dimensions of training dataset")
dim(ames_train_raw)

ames_test_raw <- read_csv("./data/raw/test.csv") # Features for submission dataset
print("Dimensions of test dataset containing only Feats.")
dim(ames_test_raw)

# Add Target column with NA so both DFs can be concatenated:
ames_test_raw <- ames_test_raw |> mutate(SalePrice = NA)

# Binding and adding identifier column "dataset"
ames_all <- bind_rows(list(train = ames_train_raw, test = ames_test_raw), .id = "dataset")

print("Available variables:")
names(ames_all)
```
To organise my Data Budget I count on the train dataset and its 1460 observations. Note that Id, SalePrice and the just-added dataset column are not predictors, which leaves 79 features to play with. Also remember that the number of features for modelling will vary: some will be discarded and some will be created along the analysis.
Impute Missing Values
First things first, read the full data description2. There you’ll find that for several columns a missing value NA actually means “None”. The physical absence of a given feature in a house is a category exposing the lack of a quality, and that can have a significant impact on the response SalePrice. Later, binary features indicating the existence of certain installations will most probably have to be created.
2 Print a copy, make notes and study it. This is the first step to get into domain knowledge, in this case Real Estate, for later Feature Engineering.
Therefore, I will fill the empty NA fields of the indicated columns of both datasets with the string “None”.
One of the early and recurrent steps of the EDA is to check the completeness of the data. Let’s search for missing values after filling the indicated fields3. For this case I wrote the function count_na(), which generates the table displayed on the right margin.
3 Write a function to quantify missing values depending on the language and labelling system used.
The LotFrontage column refers to the linear feet of street connected to the house. This feature is not well documented, and its missing percentage relative to the other variables makes it not only unreliable but also very low in variance. Therefore, I will delete the feature.
Every time a garage doesn’t exist, i.e., GarageType = "None", there is a corresponding missing value in Garage Year Built, GarageYrBlt = NA. I will engineer a new feature called GarageNew with three ordinal categories: None, No, Yes, based on the delta YearBuilt - GarageYrBlt. I expect that, given the house price, the algorithm will learn that “None” is worse than “No”, and so on. Then I will remove the GarageYrBlt predictor.
For the rest of the variables with 1 % or less missing values NA, I’ll fill them with the median (if numerical) or the mode (if string).
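Three custom helpers are used below: count_na(), replace_na_with_median() and find_mode(). Their definitions are not shown in this notebook, so here is a minimal sketch of how they could look; treat the exact implementations as assumptions.

```r
# Sketches of the custom helpers used in this section (assumed implementations,
# not necessarily the originals).
count_na <- function(df) {
  # Tabulate the number and percentage of NAs per column, largest first.
  na_tbl <- df |>
    summarise(across(everything(), ~ sum(is.na(.x)))) |>
    pivot_longer(everything(), names_to = "variable", values_to = "n_na") |>
    filter(n_na > 0) |>
    mutate(pct_na = round(100 * n_na / nrow(df), 2)) |>
    arrange(desc(n_na))
  if (nrow(na_tbl) == 0) return(print("No missing values (NA) found."))
  na_tbl
}

replace_na_with_median <- function(x) {
  # Impute NAs of a numeric vector with its median.
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

find_mode <- function(x) {
  # Most frequent value, keeping the original type.
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}
```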
Here’s my pipeline for the whole dataset:
Code
```{r}
#| label: terminating NAs

# "replace_na_with_median" is a function I created
ames_all <- ames_all |>
  mutate(LotFrontage = NULL) |>                                             # Removing LotFrontage
  mutate(GarageNew = if_else(YearBuilt - GarageYrBlt > 0, "Yes", "No")) |>  # New Feat.
  replace_na(list(GarageNew = "None")) |>
  mutate(GarageNew = factor(GarageNew,
                            levels = c("None", "No", "Yes"),
                            ordered = TRUE)) |>                             # 3 levels
  mutate(GarageYrBlt = NULL) |>                                             # Removing old Feat.
  mutate_if(is.numeric, replace_na_with_median)
```
Let’s get a list with the mode of each remaining column containing missing values NA.
Code
```{r}
#| label: Mode for NAs

# Get the mode for remaining columns with NAs:
# "find_mode()" is a custom function.
# Good, old-fashioned code: apply(ames_all, 2, find_mode)
list_na_mode <- ames_all |>
  select(MasVnrType, MasVnrArea, MSZoning, Electrical, Utilities,
         BsmtFullBath, BsmtHalfBath, Functional, Exterior1st, Exterior2nd,
         BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, KitchenQual,
         GarageCars, GarageArea, SaleType) |>
  map(find_mode) # map returns a list

# Replace with the created named list
ames_all <- ames_all |>
  replace_na(list_na_mode)

# Sanity check of missing values:
print("Full Ames dataset (train and test)")
ames_all |>
  select(-SalePrice) |>
  count_na()
```
[1] "Full Ames dataset (train and test)"
[1] "No missing values (NA) found."
“The data is complete!” Let’s think about predictors.
Are there variables (columns) whose names start with numbers? Yes, so I’ll rename them:
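The renaming chunk itself is not shown; since FirstFlrSF and SecondFlrSF appear later in the Mutual Information table, something along these lines is assumed (ThreeSsnPorch is my guess for the third renamed column):

```r
# Assumed renaming of the columns whose names start with a digit.
ames_all <- ames_all |>
  rename(FirstFlrSF    = `1stFlrSF`,
         SecondFlrSF   = `2ndFlrSF`,
         ThreeSsnPorch = `3SsnPorch`)
```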
Using the Mutual Information criterion (MI), I’ll rank the extent to which knowledge of each original predictor reduces the uncertainty of SalePrice.
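The helper calc_mi_score() is custom and not shown; a plausible sketch with {infotheo} is to discretise everything, compute I(X; Y) in nats and express it as a percentage of the target’s self-information:

```r
# Assumed sketch of calc_mi_score(); the original may bin the numerics differently.
calc_mi_score <- function(df, target) {
  df_disc <- df |>
    mutate(across(where(is.numeric), ~ infotheo::discretize(.x)[[1]]),  # bin numerics
           across(!where(is.numeric), ~ as.integer(as.factor(.x))))     # encode categoricals
  y <- df_disc[[target]]
  mi <- purrr::map_dbl(df_disc, ~ infotheo::mutinformation(.x, y))      # nats
  tibble(Variable = names(mi),
         nats     = mi,
         Percent  = round(100 * mi / mi[[target]], 4)) |>
    arrange(desc(nats))
}
```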
```{r}
#| label: mutual-info
#| column: margin
#| code-fold: false

library(infotheo)

# "calc_mi_score" is a function I wrote.
mi_y_X <- ames_all |>
  filter(dataset == "train") |>
  select(-c(Id, dataset)) |>
  calc_mi_score(target = "SalePrice")

mi_y_X |>
  head(20) |>
  knitr::kable(caption = "Mutual Info")
```
Mutual Info

| Variable     |      nats |  Percent |
|:-------------|----------:|---------:|
| SalePrice    | 2.6515433 | 100.0000 |
| OverallQual  | 0.6086855 |  22.9559 |
| Neighborhood | 0.6084916 |  22.9486 |
| GrLivArea    | 0.5723431 |  21.5853 |
| GarageArea   | 0.5244219 |  19.7780 |
| YearBuilt    | 0.5195234 |  19.5932 |
| YearRemodAdd | 0.4564483 |  17.2144 |
| TotalBsmtSF  | 0.4473484 |  16.8712 |
| FirstFlrSF   | 0.4141527 |  15.6193 |
| GarageCars   | 0.3805914 |  14.3536 |
| SecondFlrSF  | 0.3735544 |  14.0882 |
| BsmtUnfSF    | 0.3544288 |  13.3669 |
| MSSubClass   | 0.3458352 |  13.0428 |
| BsmtQual     | 0.3425788 |  12.9200 |
| ExterQual    | 0.3394960 |  12.8037 |
| KitchenQual  | 0.3283328 |  12.3827 |
| OpenPorchSF  | 0.3051952 |  11.5101 |
| FullBath     | 0.2863069 |  10.7977 |
| GarageFinish | 0.2784786 |  10.5025 |
| WoodDeckSF   | 0.2706731 |  10.2081 |
Feature Engineering
I have already started creating a new predictor in the previous step, to substitute the problematic variable GarageYrBlt, which had a lot of missing values.
Feature creation and transformation are not sequential, i.e., this is not the only step where Feature Engineering is applied. Think about the next modelling phase, where you may want to reduce features, e.g., where new Principal Components become the aggregators of original features.
In this part I will establish which variables are categorical and numerical; this is not always as obvious as it seems4.
4 Integer variables like calendar Years can be treated as categories for the analysis.
The first easy thing that comes to mind is to add the total area of the property as a new column named TotAreaSF; SF means Square Feet.
After that, I check the data documentation to see which variables can be thought of as factors; this can be tedious if Real Estate is not your domain of expertise.
After 1.5 hours of reading, googling, studying other notebooks and deliberating (yes, it was tedious), I can summarise four ways to engineer new features, depending on the nature of the variables:
Aggregation of continuous-numerical surface areas (Square Feet) into totals, e.g. a new TotalIntArea Feat. summing basement and floors, etc. Same for a new TotalExtArea. Then deleting the addend variables.
Aggregation of counts of available installations of a determined type into totals, e.g. a new TotBaths. Then deleting the addends.
Categorisation with ascending quality levels of the originally numerical variables OverallQual and OverallCond.
Creation of new binary features that state whether a determined installation exists or not.
After my second Kaggle submission (Top 25%) I completed the Feature Engineering course, which has a lot of tips! I included some new features and tricks from there.
Here’s the code that implements the four previous steps:
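The actual chunk is folded; a sketch of the four moves, with assumed addends and names (the real code surely differs in detail), looks like this:

```r
# Illustrative sketch of the four feature-engineering steps listed above.
ames_all <- ames_all |>
  # 1. Aggregate continuous surface areas (the addend columns would then be dropped)
  mutate(TotalIntArea = TotalBsmtSF + FirstFlrSF + SecondFlrSF,
         TotalExtArea = WoodDeckSF + OpenPorchSF + EnclosedPorch +
                        ThreeSsnPorch + ScreenPorch,
         TotAreaSF    = GrLivArea + TotalBsmtSF + GarageArea) |>
  # 2. Aggregate counts of installations
  mutate(TotBaths = FullBath + 0.5 * HalfBath + BsmtFullBath + 0.5 * BsmtHalfBath) |>
  # 3. Quality/condition scores as ordered factors
  mutate(across(c(OverallQual, OverallCond),
                ~ factor(.x, levels = 1:10, ordered = TRUE))) |>
  # 4. Binary "exists / doesn't exist" flags
  mutate(HasPool      = if_else(PoolArea > 0, "Yes", "No"),
         HasFireplace = if_else(Fireplaces > 0, "Yes", "No"))
```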
Although nothing is truly linear, it is good to have an idea of the simple correlations between all variables; it gives me an idea of what kind of plots to make. After going through the previous steps, it’s obvious that there are correlations. Think about the number of cars that fit in a garage, GarageCars, and the area of the garage, GarageArea.
Code
```{r}
#| label: fig-num-corr
#| fig-width: 8
#| fig-height: 8
#| fig-cap: The Pearson's correlation of numeric features and response shows high presence of multicollinearity.
#| warning: false
#| message: false

library(corrplot)

ames_all |>
  filter(dataset == "train") |>
  select(-c(dataset, Id)) |>
  select(where(is.numeric)) |>
  cor(method = "pearson") |>
  corrplot(type = "lower", method = "circle", insig = 'blank',
           order = "hclust", diag = TRUE,
           tl.col = "black", tl.cex = 0.8,
           addCoef.col = 'black', number.cex = 0.6)
```
Figure 2: The Pearson’s correlation of numeric features and response shows high presence of multicollinearity.
According to the matrix above, GrLivArea has the strongest linear relationship with the target SalePrice, followed by TotalIntArea (given its collinearity with GrLivArea), and then GarageArea and GarageCars, themselves heavily correlated with each other, as commented before.
I’m leaving the insignificant correlations blank, i.e., without a circle, for better visualisation and focus. Now I’m checking the categorical variables against the response.
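The chunk behind Figure 3 is not shown; one way to get such a plot, assuming the quality-like predictors are ordered factors by now, is to coerce them to their integer codes and reuse corrplot():

```r
# Assumed sketch for Figure 3: correlations of ordinal (quality-like) predictors.
ames_all |>
  filter(dataset == "train") |>
  select(where(is.ordered), SalePrice) |>
  mutate(across(where(is.ordered), as.integer)) |>
  cor(method = "pearson") |>
  corrplot(type = "lower", method = "circle", insig = 'blank',
           order = "hclust", tl.col = "black", tl.cex = 0.8)
```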
Figure 3: Pearson’s correlation of categorical features and response.
The best linear association with SalePrice is given by the OverallQual factor, followed by everything else that has to do with quality ratings.
Overall, one can clearly see the multicollinearity clusters, which suggests the use of advanced Feature Selection techniques.
Visual Exploration
Now, with the previous information in hand, I’m going to visualise separately the two most important quasi-linear relationships: GrLivArea vs. SalePrice and OverallQual vs. SalePrice.
Code
```{r}
#| label: fig-cont_linear
#| message: false
#| warning: false
#| column: body
#| fig-width: 6
#| fig-height: 6
#| fig-cap: 'GrLivArea vs. SalePrice - The most "linear" relationship between continuous variables. One can clearly notice the increase in variance as both Living Area and Price increase, i.e., heteroskedasticity.'
#| cap-location: margin
#| fig-subcap:
#|   - "Scatter Plot"
#|   - "Hexbin Plot"
#| layout-ncol: 2

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = expression("Ground Living Area ["~ft^2~"]"), y = "Sale\nPrice\n['000 $]") +
  geom_point(aes(color = OverallQual), alpha = 0.5) +
  theme(axis.text.y = element_blank()) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),
              color = "black", linewidth = 1.5, alpha = 0.5) +
  theme(legend.position = "bottom")

# install.packages("hexbin")
ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  geom_hex(bins = 35, colour = "seagreen") +
  labs(x = expression("Ground Living Area ["~ft^2~"]")) +
  theme(axis.title.y = element_blank(), legend.position = "bottom")
```
(a) Scatter Plot
(b) Hexbin Plot
Figure 4: GrLivArea vs. SalePrice - The most “linear” relationship between continuous variables. One can clearly notice the increase in variance as both Living Area and Price increase, i.e., heteroskedasticity.
The hexbin plot reveals three cells of price and size where houses are most common in the training dataset. There are more than 50 houses in each of the bins at roughly 130K, 140K and 180K USD.
On the other hand, note how problematic the outliers are for the spline fit! This visual allows me to remove them, i.e., the biggest yet cheap houses (the rightmost data points).
Code
```{r}
#| label: fig-RemoveOutliers
#| fig-cap: Removing two outliers from the training part of the dataset dramatically improves the spline fit!
#| column: margin

ames_all |>
  filter(dataset == "train") |>
  select(c(SalePrice, GrLivArea, OverallQual)) |>
  filter(GrLivArea > 4500) |>
  glimpse()

idx_remove <- which(ames_all$dataset == "train" & ames_all$GrLivArea > 4500)

ames_all <- ames_all |>
  filter(!row_number() %in% idx_remove)

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = expression("Ground Living Area ["~ft^2~"]"), y = "Sale Price ['000 $]") +
  geom_point(aes(color = OverallQual), alpha = 0.5) +
  theme(axis.text.y = element_blank()) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),
              color = "black", linewidth = 1.5, alpha = 0.5) + # linewidth replaces the deprecated size
  theme(legend.position = "none", axis.title.y = element_text(angle = 90))
```
Figure 5: Removing two outliers from the training part of the dataset dramatically improves the spline fit!
The two big houses above were super offers! Or did the housing bubble explode?
In the same fashion, I want to see the effect of the factor OverallQual.
Code
```{r}
#| label: fig-cat_linear
#| message: false
#| warning: false
#| fig-width: 7
#| fig-height: 8
#| fig-cap: 'OverallQual vs. SalePrice - The most "linear" relationship with a categorical predictor. The raincloud plot shows how the variances of the factor levels differ.'

# install.packages("ggrain")
library(ggrain)

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = OverallQual, y = SalePrice/1000,
             colour = OverallQual, fill = OverallQual)) +
  geom_rain(alpha = 0.3) +
  scale_y_continuous(breaks = seq(0, 800, 100)) +
  labs(x = "Overall\nQuality", y = "Sale Price ['000 $]") +
  coord_flip() +
  theme(legend.position = "none")
```
Figure 6: OverallQual vs. SalePrice - The most “linear” relationship with a categorical predictor. The raincloud plot shows how the variances of the factor levels differ.
I’m wondering: in which price range does the most popular quality level sit?
```{r}
#| label: fig-count_overallQual
#| cap-location: margin
#| fig-cap: "Houses with mid-quality levels are most frequent in the train dataset."
#| column: margin
#| code-fold: false

ames_all |>
  filter(dataset == "train") |>
  count(OverallQual, name = "count") |>
  ggplot(aes(y = fct_reorder(OverallQual, count), x = count)) +
  geom_col(aes(fill = OverallQual)) +
  geom_text(aes(label = count), size = 8) +
  labs(y = "Overall Quality") +
  theme(legend.position = "none", axis.title.y = element_text(angle = 90))
```
Figure 7: Houses with mid-quality levels are most frequent in the train dataset.
It seems that the most popular houses, with an overall quality of 5 (mid-quality level), are the ones between 120K and 145K USD. Of course, this was back in 2006 - 2010.
How do sale prices vary with living area depending on house style?
Code
```{r}
#| label: fig-houseStyles
#| fig-column: page-right
#| fig-width: 8
#| fig-height: 8
#| fig-cap: "One and two story houses are the most common within the train dataset."

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = expression("Ground Living Area ["~ft^2~"]"), y = "Sale\nPrice\n['000 $]") +
  geom_point(aes(color = OverallQual), alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),
              color = "black", se = FALSE, linewidth = 0.7, alpha = 0.5) +
  facet_wrap(~HouseStyle, nrow = 2) +
  theme(legend.position = "bottom")
```
Figure 8: One and two story houses are the most common within the train dataset.
How do sale prices vary with living area depending on location?
One often hears that location is everything. Let’s see it here:
Code
```{r}
#| label: fig-houseLocations
#| fig-width: 7
#| fig-height: 10
#| fig-cap: "There's a lot of variation depending on location, as expected. Northridge (NoRidge), College Creek (CollgCr) and Northridge Heights (NridgHt) seem to be the most expensive neighbourhoods."
#| warning: false

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = expression("Ground Living Area ["~ft^2~"]"), y = "Sale\nPrice\n['000 $]") +
  geom_point(aes(color = OverallQual), alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),
              color = "black", se = FALSE, linewidth = 0.7, alpha = 0.5) +
  facet_wrap(~Neighborhood, nrow = 4) +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 8),
        axis.text.x = element_text(size = 8, angle = 90))
```
Figure 9: There’s a lot of variation depending on location, as expected. Northridge (NoRidge), College Creek (CollgCr) and Northridge Heights (NridgHt) seem to be the most expensive neighbourhoods.
Analysis of response SalePrice
How well-behaved is the target?
Code
```{r}
#| label: fig-Y
#| message: false
#| warning: false
#| column: body
#| fig-width: 6
#| fig-height: 6
#| fig-cap: "Analysis of the target variable. All the assumptions of OLS and linear regression are broken, as usual."
#| fig-subcap:
#|   - "Histogram"
#|   - "Q-Q Plot"
#| layout-ncol: 2

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = SalePrice/1000)) +
  geom_histogram(bins = 50, color = "seagreen", fill = "seagreen", alpha = 0.5) +
  scale_x_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = "Sale Price ['000 $]")

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(sample = SalePrice/1000)) +
  stat_qq(color = "seagreen", alpha = 0.5) +
  stat_qq_line(color = "seagreen") +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(y = "Sale\nPrice\n['000 $]", x = "Theoretical Quantiles (Norm. Dist.)")
```
(a) Histogram
(b) Q-Q Plot
Figure 10: Analysis of the target variable. All the assumptions of OLS and linear regression are broken, as usual.
Clearly, the target variable is not well-behaved. It shows a heavily right-skewed, log-normal-like distribution. In order to achieve better predictions, a transformation might be needed.
Transformation of the response
From what I’ve seen, processes that have to do with money, e.g., sale prices, product costs, salaries, are most of the time log-normal. Rarely is anything free, so you don’t get zeros or negative values; think about the top 1% of rich people, or anything that becomes premium the higher the price. These are long and sparse right tails.
The logarithmic transformation works when there are few large values; in this case, there are only a few expensive houses. The logarithmic scale will shrink that right tail5, making the distribution more normal-like.
5 Think of a logarithmic transformation like squeezing an accordion only with your right arm.
Nevertheless, the moment the first data transformation is performed, we start to lose interpretability very quickly. Nowadays, the most accurate predictions need more than one transformation in order to work.
Because we have powerful computers and libraries with ready-made functions to perform variable transformations, I will use the Box-Cox transformation, a family of functions that includes the logarithm depending on the value of a parameter lambda. This parameter is assessed automatically.
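The transformation chunk is folded; a minimal sketch of how SalePrice_bc and boxcox_lambda (both used further down) could be produced, here with MASS::boxcox() and a hand-rolled box_cox() helper, is:

```r
# Assumed sketch: estimate lambda on the training rows, then transform the response.
bc <- MASS::boxcox(SalePrice ~ 1,
                   data   = ames_all |> filter(dataset == "train"),
                   plotit = FALSE)
boxcox_lambda <- bc$x[which.max(bc$y)]  # lambda maximising the profile log-likelihood

box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-6) log(y) else (y^lambda - 1) / lambda
}

ames_all <- ames_all |>
  mutate(SalePrice_bc = box_cox(SalePrice, boxcox_lambda))  # NA for the test rows
```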
Figure 12: The transformed response shows a more bell-like shape.
… and it is, indeed. Well, except for the tails, the transformed values are now more normal-like at the core.
Do not include the transformation of the response as a step within the recipe. It will cause an error with the predict function trying to locate SalePrice in the test dataset.
PHASE 2: MODELLING
“This is where the fun begins.”
One of the motivations for developing this project is that I’m switching to the Tidymodels framework. Therefore, it’s time to load the library and train a simple reference model.
Resampling for Test Error Estimation
This is a critical step for final model selection. Depending on the resampling technique chosen to estimate the Goodness-of-Prediction, or test error, metrics like RMSE can be slightly optimistic or conservative. Take a look at the pristine explanation by (Zablotski 2022).
Simple validation set? 1 train fold and 1 test fold: NO, it’s old and inaccurate.
v-folds Cross-Validation? Analysis (training folds), Assessment (validation folds) and test: it’s small data, I can choose something even more powerful.
The Bootstrap? YES.
By today’s standards this dataset is small, therefore I will choose Bootstrapping, which keeps the resamples the same size as the training set. The probability of an observation being sampled at least once is about 0.63, so the complementary part of the training set (around 37%) becomes the assessment set. Although performance metrics will be slightly conservative or pessimistic, I prefer it.
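As a quick check of that 0.63 figure: each bootstrap sample draws \(n\) observations with replacement from \(n\), so the probability that a given observation appears at least once is

\[
1 - \left(1 - \frac{1}{n}\right)^{n} \approx 1 - e^{-1} \approx 0.632 \quad \text{for large } n,
\]

leaving roughly 37% of the rows out-of-bag for the assessment set.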
I will generate 1000 bootstrap-samples.
I will use 10-fold CV repeated 10 times (100 splits) for the hyperparameter tuning of the more complex algorithms. Once optimised, I’ll assess the test error again with the 1000 bootstraps, for the sake of comparison.
Expending the data budget
Normally v-fold CV spends my data budget, but here resampling with replacement is kind of like printing money… Here’s the code to generate the bootstraps and thereby split the data budget:
Code
```{r}
#| label: DataBudget
#| warning: false
#| fold: false

library(tidymodels)
tidymodels_prefer()
# conflicted::conflicts_prefer(scales::discard)

# Select the pre-processed training dataset
ames_train <- ames_all |> filter(dataset == "train")

# ---- v-fold Cross-Validation ----
set.seed(1982)
ames_folds <- vfold_cv(ames_train, v = 10, repeats = 10, strata = SalePrice_bc) # v = 10 folds is the default

set.seed(1982)
ames_folds_xgb <- vfold_cv(ames_train, v = 10, repeats = 1, strata = SalePrice_bc)

# ---- The Bootstrap ----
set.seed(1982)
ames_boots <- bootstraps(ames_train, times = 1000, strata = SalePrice_bc)
ames_boots
```
For the sake of RMSE comparison I will always scale the numeric predictors and use the Box-Cox-transformed response SalePrice_bc, given the improvement shown in Figure 12.
The following code shows the preprocessing steps enclosed in a pipeline or recipe. The list of available step functions can be found here.
Code
```{r}
#| label: fig-skewed-feats
#| fig-cap: "Sample of skewed numerical predictors that contain zero as value."

ames_train |>
  select(c(GrLivArea, TotBaths, TotalExtArea, TotalBsmtSF, YearBuilt, SecondFlrSF)) |>
  pivot_longer(GrLivArea:SecondFlrSF) |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 50, color = "lightgrey", fill = "lightgrey") +
  labs(x = "") +
  facet_wrap(~name, scales = "free")
```
Figure 13: Sample of skewed numerical predictors that contain zero as value.
Given the sensitivity of linear models to tails and outliers (remember Figure 4 (a)), it will be convenient to try to normalise the distribution of these predictors. However, the Box-Cox transformation only works for positive data. The alternative is the more general Yeo-Johnson transformation, which works for positive, zero and negative values.
Note that SalePrice_bc is the transformed outcome. Click here for more info.
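The recipe itself is folded; a sketch of the preprocessing described here (scaling, Yeo-Johnson on numeric predictors, dummy encoding of factors), with assumed role handling, would be:

```r
# Assumed sketch of the preprocessing recipe; step names are from {recipes}.
ames_rec <- recipe(SalePrice_bc ~ ., data = ames_train) |>
  update_role(Id, dataset, SalePrice, new_role = "id") |>  # keep but don't model
  step_YeoJohnson(all_numeric_predictors()) |>             # un-skew numeric Feats.
  step_normalize(all_numeric_predictors()) |>              # centre and scale
  step_dummy(all_nominal_predictors())                     # dummy-encode factors
```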
OK! We can see the power of adding good predictors in the test RMSE estimate.
My reference is a linear GAM with splines, showing a mean RMSE = 0.0812. The response was transformed in the pipeline, therefore no units are displayed.
I got slightly better results using dummy-encoded factors. Moreover, applying the Yeo-Johnson transformation to the numeric Feats. helped too! They were skewed, see Figure 13.
Elastic-Net Regression
Given the high multicollinearity, it’s reasonable to think of some sort of penalisation or shrinkage method. Hence, I’ll go with a Generalised Linear Model (GLM). In this manner I can get a first look at feature importance.
I’ll choose Elastic-Net regression, combining the best of Lasso and Ridge. I must remind myself that predictors must be scaled! I don’t want to over-penalise non-scaled variables when using shrinkage methods.
A very low penalty of \(\lambda = 0.0008\) with a mixture of \(\alpha = 0.15\) turned out to be the best parameters.
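The tuning chunk is folded; with those values plugged in, the final specification would look roughly like this (ames_rec being the recipe sketched above, and final_glmnet the object name used in the submission chunk below):

```r
# Assumed sketch of the final Elastic-Net workflow with the tuned values.
elnet_spec <- linear_reg(penalty = 0.0008, mixture = 0.15) |>
  set_engine("glmnet")

final_glmnet <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(elnet_spec)
```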
New fit with new feature engineering, mean RMSE = 0.0636 (1000 Bootstraps). Kaggle’s Top 15%.
Elastic-Net Submission
Code
```{r}
#| label: Elastic-Net_submission
#| eval: false

# Get best model and fit with whole training set
fit_glm_elnet <- final_glmnet |> fit(ames_train)

# Predicting on the test set
ames_test <- ames_all |>
  # select(-SalePrice) |>
  filter(dataset == "test")

res_elasticnet <- predict(fit_glm_elnet, ames_test) |>
  mutate(SalePrice_pred = inv_boxcox(.pred, boxcox_lambda)) # original values in USD

# .csv for submission
sub_elasticnet <- data.frame(Id = ames_test$Id, SalePrice = res_elasticnet$SalePrice_pred)

write.csv(sub_elasticnet, "./data/submissions/jthom_sub_5_elasticnet.csv",
          quote = FALSE, row.names = FALSE)
```
XGBoost Regression
Because of the multicollinearity, XGBoost is one of the best candidates for this project.
A total of 2.8 hours for the search over 6 hyperparameters with 60 randomly-generated grid candidates using 100 splits (10-fold CV, repeated 10 times). See the commented code above with the best ones selected (by RMSE).
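That folded chunk is not reproduced here; a sketch of such a search, guessing which six hyperparameters were tuned and reusing the repeated-CV folds created earlier, could be:

```r
# Assumed sketch of the XGBoost tuning set-up (6 tuned hyperparameters, 60 random
# candidates, 100 resamples); the actual folded chunk may differ.
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), min_n = tune(), loss_reduction = tune(),
  sample_size = tune(), mtry = tune(), learn_rate = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")

xgb_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(xgb_spec)

set.seed(1982)
xgb_res <- tune_grid(
  xgb_wflow,
  resamples = ames_folds,        # 10-fold CV repeated 10 times (100 splits)
  grid      = 60,                # 60 randomly generated candidates
  metrics   = metric_set(rmse)
)
```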
XGBoost fit: Test Error estimation mean (of 1000 Bootstraps) RMSE = 0.0666.
XGBoost Submission
Results were not submitted given the slightly worse performance than the Elastic-Net (above). However, they will be used to build an ensemble (stacking).
Code
```{r}
#| label: XGB_submission
#| eval: false

# Get best model and fit with whole training set
fit_xgboost <- final_xgboost |> fit(ames_train)

# Predicting on the test set
ames_test <- ames_all |> filter(dataset == "test")

res_xgboost <- predict(fit_xgboost, ames_test) |>
  mutate(SalePrice_pred = inv_boxcox(.pred, boxcox_lambda)) # original values in USD

# .csv for submission
sub_xgboost <- data.frame(Id = ames_test$Id, SalePrice = res_xgboost$SalePrice_pred)

write.csv(sub_xgboost, "./data/submissions/jthom_sub_4_xgboost.csv",
          quote = FALSE, row.names = FALSE)
```
Models Stacking
Mean of Predictions
The simplest ensemble, often very effective when the natures of the models differ, like in this case! The Elastic-Net regression had a better RMSE than XGBoost, therefore I can give it more weight:
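The stacking chunk isn’t shown; a sketch of a weighted mean of the two prediction sets, with illustrative weights (not the ones actually submitted), would be:

```r
# Assumed sketch of the weighted-mean ensemble; the 0.6 / 0.4 weights are illustrative.
w_elnet <- 0.6
w_xgb   <- 1 - w_elnet

sub_stack <- data.frame(
  Id        = ames_test$Id,
  SalePrice = w_elnet * res_elasticnet$SalePrice_pred +
              w_xgb   * res_xgboost$SalePrice_pred
)

write.csv(sub_stack, "./data/submissions/jthom_sub_stack_mean.csv",  # assumed file name
          quote = FALSE, row.names = FALSE)
```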
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.