A commented and visual walk-through of advanced regression techniques.
Author
Jorge A. Thomas
Published
December 1, 2023
Introduction
In this classic Kaggle dataset (Anna Montoya 2016) you’ll follow my workflow for developing pipelines in the ETL phase of the DS cycle using the R programming language (R Core Team 2023), as well as a tidy approach to the EDA, substantially supported by Visual Analytics. I’ll be adding some tips on the right margin along the way.
The Ames House dataset was prepared to improve on the older classic Boston Houses dataset and, given its open competition nature, it comprises two files:
One file contains features and labels, or in other words, predictors and responses. Here this is the training dataset. It’s all I have to split the Data Budget.
The other file contains only features (predictors) for making the predictions of the final submission. Here this is the test dataset.
The goal of the competition is to achieve minimum RMSE.
I will keep updating this notebook, making improvements and reporting the rank obtained with my submitted predictions.
PHASE I: ETL / EDA
Extract Data
I checked beforehand that there are no missing values, here NA, in the target variable SalePrice. Therefore, I will write a pipeline to read and concatenate both datasets (bind rows), adding an extra column dataset labelled “train” or “test” for easy splitting later1 (subset or filter).
1 The whole Feature Transformation pipeline must always be the same for all predictors in both datasets.
Code
```{r}
#| label: Load Data
#| warning: false

ames_train_raw <- read_csv("./data/raw/train.csv") # Train, validation and test dataset
print("Dimensions of training dataset")
dim(ames_train_raw)

ames_test_raw <- read_csv("./data/raw/test.csv") # Features for submission dataset
print("Dimensions of test dataset containing only Feats.")
dim(ames_test_raw)

# Add Target column with NA so both DFs can be concatenated:
ames_test_raw <- ames_test_raw |> mutate(SalePrice = NA)

# Binding and adding identifier column "dataset"
ames_all <- bind_rows(list(train = ames_train_raw, test = ames_test_raw), .id = "dataset")

print("Available variables:")
names(ames_all)
```
To organise my Data Budget I count on the train dataset and its 1460 observations. Note that Id, SalePrice and the just-added dataset column are not predictors, which leaves 79 features to play with. Also remember that the number of features for modelling will vary: some will be discarded and some will be created along the analysis.
Impute Missing Values
First things first, read the full data description2. There you’ll find that for several columns a missing value NA actually means “None”. The physical absence of a given feature in a house is a category exposing the lack of a quality, and that can have a significant impact on the response SalePrice. Later, binary features indicating the existence of certain installations will most probably have to be created.
2 Print a copy, make notes and study it. This is the first step to get into domain knowledge, in this case Real Estate, for later Feature Engineering.
Therefore, I will fill the empty NA fields of the indicated columns of both datasets with the string “None”.
One of the early and recurrent steps of the EDA is to check the completeness of the data. Let’s search for missing values after filling the indicated fields3. For this case I wrote the function count_na(), which generates the table displayed on the right margin.
3 Write a function to quantify missing values depending on the language and labelling system used.
The LotFrontage column refers to the linear feet of street connected to the house. This feature is not well documented, and its missing percentage relative to the other variables makes it not only unreliable but also very low in variance. Therefore, I will delete the feature.
Every time a garage doesn’t exist, i.e., GarageType = "None", there is a corresponding missing value in Garage Year Built, GarageYrBlt = NA. I will engineer a new feature called GarageNew with three ordinal categories: None, No, Yes, based on the delta YearBuilt - GarageYrBlt. I expect that, given the house price, the algorithm will learn that “None” is worse than “No”, and so on. Then I will remove the GarageYrBlt predictor.
For the rest of the variables with 1 % or less missing values NA, I’ll fill them with the median (if numerical) or the mode (if string).
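Three custom helpers are used below: count_na(), replace_na_with_median() and find_mode(). Their definitions are not shown in this notebook, so here is a minimal sketch of how they could look; treat the exact implementations as assumptions.

```r
# Sketches of the custom helpers used in this section (assumed implementations,
# not necessarily the originals).
count_na <- function(df) {
  # Tabulate the number and percentage of NAs per column, largest first.
  na_tbl <- df |>
    summarise(across(everything(), ~ sum(is.na(.x)))) |>
    pivot_longer(everything(), names_to = "variable", values_to = "n_na") |>
    filter(n_na > 0) |>
    mutate(pct_na = round(100 * n_na / nrow(df), 2)) |>
    arrange(desc(n_na))
  if (nrow(na_tbl) == 0) return(print("No missing values (NA) found."))
  na_tbl
}

replace_na_with_median <- function(x) {
  # Impute NAs of a numeric vector with its median.
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

find_mode <- function(x) {
  # Most frequent value, keeping the original type.
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}
```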
Here’s my pipeline for the whole dataset:
Code
```{r}
#| label: terminating NAs

# "replace_na_with_median" is a function I created
ames_all <- ames_all |>
  mutate(LotFrontage = NULL) |>                                             # Removing LotFrontage
  mutate(GarageNew = if_else(YearBuilt - GarageYrBlt > 0, "Yes", "No")) |>  # New Feat.
  replace_na(list(GarageNew = "None")) |>
  mutate(GarageNew = factor(GarageNew,
                            levels = c("None", "No", "Yes"),
                            ordered = TRUE)) |>                             # 3 levels
  mutate(GarageYrBlt = NULL) |>                                             # Removing old Feat.
  mutate_if(is.numeric, replace_na_with_median)
```
Let’s get a list with the mode of each remaining column containing missing values NA.
Code
```{r}
#| label: Mode for NAs

# Get the mode for remaining columns with NAs:
# "find_mode()" is a custom function.
# Good, old-fashioned code: apply(ames_all, 2, find_mode)
list_na_mode <- ames_all |>
  select(MasVnrType, MasVnrArea, MSZoning, Electrical, Utilities,
         BsmtFullBath, BsmtHalfBath, Functional, Exterior1st, Exterior2nd,
         BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, KitchenQual,
         GarageCars, GarageArea, SaleType) |>
  map(find_mode) # map returns a list

# Replace with the created named list
ames_all <- ames_all |>
  replace_na(list_na_mode)

# Sanity check of missing values:
print("Full Ames dataset (train and test)")
ames_all |>
  select(-SalePrice) |>
  count_na()
```
[1] "Full Ames dataset (train and test)"
[1] "No missing values (NA) found."
“The data is complete!” Let’s think about predictors.
Are there variables (columns) whose names start with numbers? Yes, so I’ll rename them:
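The renaming chunk itself is not shown; since FirstFlrSF and SecondFlrSF appear later in the Mutual Information table, something along these lines is assumed (ThreeSsnPorch is my guess for the third renamed column):

```r
# Assumed renaming of the columns whose names start with a digit.
ames_all <- ames_all |>
  rename(FirstFlrSF    = `1stFlrSF`,
         SecondFlrSF   = `2ndFlrSF`,
         ThreeSsnPorch = `3SsnPorch`)
```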
Using the Mutual Information criterion (MI), I’ll rank the extent to which knowledge of each original predictor reduces the uncertainty of SalePrice.
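The helper calc_mi_score() is custom and not shown; a plausible sketch with {infotheo} is to discretise everything, compute I(X; Y) in nats and express it as a percentage of the target’s self-information:

```r
# Assumed sketch of calc_mi_score(); the original may bin the numerics differently.
calc_mi_score <- function(df, target) {
  df_disc <- df |>
    mutate(across(where(is.numeric), ~ infotheo::discretize(.x)[[1]]),  # bin numerics
           across(!where(is.numeric), ~ as.integer(as.factor(.x))))     # encode categoricals
  y <- df_disc[[target]]
  mi <- purrr::map_dbl(df_disc, ~ infotheo::mutinformation(.x, y))      # nats
  tibble(Variable = names(mi),
         nats     = mi,
         Percent  = round(100 * mi / mi[[target]], 4)) |>
    arrange(desc(nats))
}
```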
```{r}
#| label: mutual-info
#| column: margin
#| code-fold: false

library(infotheo)

# "calc_mi_score" is a function I wrote.
mi_y_X <- ames_all |>
  filter(dataset == "train") |>
  select(-c(Id, dataset)) |>
  calc_mi_score(target = "SalePrice")

mi_y_X |>
  head(20) |>
  knitr::kable(caption = "Mutual Info")
```
Mutual Info

| Variable     |      nats |  Percent |
|:-------------|----------:|---------:|
| SalePrice    | 2.6515433 | 100.0000 |
| OverallQual  | 0.6086855 |  22.9559 |
| Neighborhood | 0.6084916 |  22.9486 |
| GrLivArea    | 0.5723431 |  21.5853 |
| GarageArea   | 0.5244219 |  19.7780 |
| YearBuilt    | 0.5195234 |  19.5932 |
| YearRemodAdd | 0.4564483 |  17.2144 |
| TotalBsmtSF  | 0.4473484 |  16.8712 |
| FirstFlrSF   | 0.4141527 |  15.6193 |
| GarageCars   | 0.3805914 |  14.3536 |
| SecondFlrSF  | 0.3735544 |  14.0882 |
| BsmtUnfSF    | 0.3544288 |  13.3669 |
| MSSubClass   | 0.3458352 |  13.0428 |
| BsmtQual     | 0.3425788 |  12.9200 |
| ExterQual    | 0.3394960 |  12.8037 |
| KitchenQual  | 0.3283328 |  12.3827 |
| OpenPorchSF  | 0.3051952 |  11.5101 |
| FullBath     | 0.2863069 |  10.7977 |
| GarageFinish | 0.2784786 |  10.5025 |
| WoodDeckSF   | 0.2706731 |  10.2081 |
Feature Engineering
I have already started creating a new predictor in the previous step, to substitute the problematic variable GarageYrBlt, which had a lot of missing values.
Feature creation and transformation are not sequential, i.e., this is not the only step where Feature Engineering is applied. Think about the next modelling phase, where you may want to reduce features, e.g., where new Principal Components become the aggregators of original features.
In this part I will establish which variables are categorical and numerical; this is not always as obvious as it seems4.
4 Integer variables like calendar Years can be treated as categories for the analysis.
The first easy thing that comes to mind is to add the total area of the property as a new column named TotAreaSF; SF means Square Feet.
After that, I check the data documentation to see which variables can be thought of as factors; this can be tedious if Real Estate is not your domain of expertise.
After 1.5 hours of reading, googling, studying other notebooks and deliberating (yes, it was tedious), I can summarise four ways to engineer new features, depending on the nature of the variables:
Aggregation of continuous-numerical surface areas (Square Feet) into totals, e.g. a new TotalIntArea Feat. summing basement and floors, etc. Same for a new TotalExtArea. Then deleting the addend variables.
Aggregation of counts of available installations of a determined type into totals, e.g. a new TotBaths. Then deleting the addends.
Categorisation with ascending quality levels of the originally numerical variables OverallQual and OverallCond.
Creation of new binary features that state whether a determined installation exists or not.
After my second Kaggle submission (Top 25%) I completed the Feature Engineering course, which has a lot of tips! I included some new features and tricks from there.
Here’s the code that implements the four previous steps:
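The actual chunk is folded; a sketch of the four moves, with assumed addends and names (the real code surely differs in detail), looks like this:

```r
# Illustrative sketch of the four feature-engineering steps listed above.
ames_all <- ames_all |>
  # 1. Aggregate continuous surface areas (the addend columns would then be dropped)
  mutate(TotalIntArea = TotalBsmtSF + FirstFlrSF + SecondFlrSF,
         TotalExtArea = WoodDeckSF + OpenPorchSF + EnclosedPorch +
                        ThreeSsnPorch + ScreenPorch,
         TotAreaSF    = GrLivArea + TotalBsmtSF + GarageArea) |>
  # 2. Aggregate counts of installations
  mutate(TotBaths = FullBath + 0.5 * HalfBath + BsmtFullBath + 0.5 * BsmtHalfBath) |>
  # 3. Quality/condition scores as ordered factors
  mutate(across(c(OverallQual, OverallCond),
                ~ factor(.x, levels = 1:10, ordered = TRUE))) |>
  # 4. Binary "exists / doesn't exist" flags
  mutate(HasPool      = if_else(PoolArea > 0, "Yes", "No"),
         HasFireplace = if_else(Fireplaces > 0, "Yes", "No"))
```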
Although nothing is truly linear, it is good to have an idea of the simple correlations between all variables; it gives me an idea of what kind of plots to make. After going through the previous steps, it’s obvious that there are correlations. Think about the number of cars that fit in a garage, GarageCars, and the area of the garage, GarageArea.
Code
```{r}
#| label: fig-num-corr
#| fig-width: 8
#| fig-height: 8
#| fig-cap: The Pearson's correlation of numeric features and response shows high presence of multicollinearity.
#| warning: false
#| message: false

library(corrplot)

ames_all |>
  filter(dataset == "train") |>
  select(-c(dataset, Id)) |>
  select(where(is.numeric)) |>
  cor(method = "pearson") |>
  corrplot(type = "lower", method = "circle", insig = 'blank',
           order = "hclust", diag = TRUE,
           tl.col = "black", tl.cex = 0.8,
           addCoef.col = 'black', number.cex = 0.6)
```
Figure 2: The Pearson’s correlation of numeric features and response shows high presence of multicollinearity.
According to the matrix above, GrLivArea has the strongest linear relationship with the target SalePrice, followed by TotalIntArea (given its collinearity with GrLivArea), and then GarageArea and GarageCars, themselves heavily correlated with each other, as commented before.
I’m leaving the insignificant correlations blank, i.e., without a circle, for better visualisation and focus. Now I’m checking the categorical variables against the response.
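The chunk behind Figure 3 is not shown; one way to get such a plot, assuming the quality-like predictors are ordered factors by now, is to coerce them to their integer codes and reuse corrplot():

```r
# Assumed sketch for Figure 3: correlations of ordinal (quality-like) predictors.
ames_all |>
  filter(dataset == "train") |>
  select(where(is.ordered), SalePrice) |>
  mutate(across(where(is.ordered), as.integer)) |>
  cor(method = "pearson") |>
  corrplot(type = "lower", method = "circle", insig = 'blank',
           order = "hclust", tl.col = "black", tl.cex = 0.8)
```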
Figure 3: Pearson’s correlation of categorical features and response.
The best linear association with SalePrice is given by the OverallQual factor, followed by everything else that has to do with quality ratings.
Overall, one can clearly see the multicollinearity clusters, which suggests the use of advanced Feature Selection techniques.
Visual Exploration
Now, with the previous information in hand, I’m going to visualise separately the two most important quasi-linear relationships: GrLivArea vs. SalePrice and OverallQual vs. SalePrice.
Code
```{r}
#| label: fig-cont_linear
#| message: false
#| warning: false
#| column: body
#| fig-width: 6
#| fig-height: 6
#| fig-cap: 'GrLivArea vs. SalePrice - The most "linear" relationship between continuous variables. One can clearly notice the increase in variance as both Living Area and Price increase, i.e., heteroskedasticity.'
#| cap-location: margin
#| fig-subcap:
#|   - "Scatter Plot"
#|   - "Hexbin Plot"
#| layout-ncol: 2

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = expression("Ground Living Area ["~ft^2~"]"), y = "Sale\nPrice\n['000 $]") +
  geom_point(aes(color = OverallQual), alpha = 0.5) +
  theme(axis.text.y = element_blank()) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),
              color = "black", linewidth = 1.5, alpha = 0.5) +
  theme(legend.position = "bottom")

# install.packages("hexbin")
ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  geom_hex(bins = 35, colour = "seagreen") +
  labs(x = expression("Ground Living Area ["~ft^2~"]")) +
  theme(axis.title.y = element_blank(), legend.position = "bottom")
```
(a) Scatter Plot
(b) Hexbin Plot
Figure 4: GrLivArea vs. SalePrice - The most “linear” relationship between continuous variables. One can clearly notice the increase in variance as both Living Area and Price increase, i.e., heteroskedasticity.
The hexbin plot reveals three cells of price and size where houses are most common in the training dataset. There are more than 50 houses in each of the bins at roughly 130K, 140K and 180K USD.
On the other hand, note how problematic the outliers are for the spline fit! This visual allows me to remove them, i.e., the biggest yet cheap houses (the rightmost data points).
Code
```{r}
#| label: fig-RemoveOutliers
#| fig-cap: Removing two outliers from the training part of the dataset dramatically improves the spline fit!
#| column: margin

ames_all |>
  filter(dataset == "train") |>
  select(c(SalePrice, GrLivArea, OverallQual)) |>
  filter(GrLivArea > 4500) |>
  glimpse()

idx_remove <- which(ames_all$dataset == "train" & ames_all$GrLivArea > 4500)

ames_all <- ames_all |>
  filter(!row_number() %in% idx_remove)

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = expression("Ground Living Area ["~ft^2~"]"), y = "Sale Price ['000 $]") +
  geom_point(aes(color = OverallQual), alpha = 0.5) +
  theme(axis.text.y = element_blank()) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),
              color = "black", linewidth = 1.5, alpha = 0.5) + # linewidth replaces the deprecated size
  theme(legend.position = "none", axis.title.y = element_text(angle = 90))
```
Figure 5: Removing two outliers from the training part of the dataset dramatically improves the spline fit!
The two big houses above were super offers! Or did the housing bubble explode?
In the same fashion, I want to see the effect of the factor OverallQual.
Code
```{r}
#| label: fig-cat_linear
#| message: false
#| warning: false
#| fig-width: 7
#| fig-height: 8
#| fig-cap: 'OverallQual vs. SalePrice - The most "linear" relationship with a categorical predictor. The raincloud plot shows how the variances of the factor levels differ.'

# install.packages("ggrain")
library(ggrain)

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = OverallQual, y = SalePrice/1000,
             colour = OverallQual, fill = OverallQual)) +
  geom_rain(alpha = 0.3) +
  scale_y_continuous(breaks = seq(0, 800, 100)) +
  labs(x = "Overall\nQuality", y = "Sale Price ['000 $]") +
  coord_flip() +
  theme(legend.position = "none")
```
Figure 6: OverallQual vs. SalePrice - The most “linear” relationship with a categorical predictor. The raincloud plot shows how the variances of the factor levels differ.
I’m wondering: in which price range does the most popular quality level sit?
```{r}
#| label: fig-count_overallQual
#| cap-location: margin
#| fig-cap: "Houses with mid-quality levels are most frequent in the train dataset."
#| column: margin
#| code-fold: false

ames_all |>
  filter(dataset == "train") |>
  count(OverallQual, name = "count") |>
  ggplot(aes(y = fct_reorder(OverallQual, count), x = count)) +
  geom_col(aes(fill = OverallQual)) +
  geom_text(aes(label = count), size = 8) +
  labs(y = "Overall Quality") +
  theme(legend.position = "none", axis.title.y = element_text(angle = 90))
```
Figure 7: Houses with mid-quality levels are most frequent in the train dataset.
It seems that the most popular houses, with an overall quality of 5 (mid-quality level), are the ones between 120K and 145K USD. Of course, this was back in 2006 - 2010.
How do sale prices vary with living area depending on house style?
Code
```{r}
#| label: fig-houseStyles
#| fig-column: page-right
#| fig-width: 8
#| fig-height: 8
#| fig-cap: "One and two story houses are the most common within the train dataset."

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = expression("Ground Living Area ["~ft^2~"]"), y = "Sale\nPrice\n['000 $]") +
  geom_point(aes(color = OverallQual), alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),
              color = "black", se = FALSE, linewidth = 0.7, alpha = 0.5) +
  facet_wrap(~HouseStyle, nrow = 2) +
  theme(legend.position = "bottom")
```
Figure 8: One and two story houses are the most common within the train dataset.
How do sale prices vary with living area depending on location?
One often hears that location is everything. Let’s see it here:
Code
```{r}
#| label: fig-houseLocations
#| fig-width: 7
#| fig-height: 10
#| fig-cap: "There's a lot of variation depending on location, as expected. Northridge (NoRidge), College Creek (CollgCr) and Northridge Heights (NridgHt) seem to be the most expensive neighbourhoods."
#| warning: false

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = expression("Ground Living Area ["~ft^2~"]"), y = "Sale\nPrice\n['000 $]") +
  geom_point(aes(color = OverallQual), alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),
              color = "black", se = FALSE, linewidth = 0.7, alpha = 0.5) +
  facet_wrap(~Neighborhood, nrow = 4) +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 8),
        axis.text.x = element_text(size = 8, angle = 90))
```
Figure 9: There’s a lot of variation depending on location, as expected. Northridge (NoRidge), College Creek (CollgCr) and Northridge Heights (NridgHt) seem to be the most expensive neighbourhoods.
Analysis of response SalePrice
How well-behaved is the target?
Code
```{r}
#| label: fig-Y
#| message: false
#| warning: false
#| column: body
#| fig-width: 6
#| fig-height: 6
#| fig-cap: "Analysis of the target variable. All the assumptions of OLS and linear regression are broken, as usual."
#| fig-subcap:
#|   - "Histogram"
#|   - "Q-Q Plot"
#| layout-ncol: 2

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(x = SalePrice/1000)) +
  geom_histogram(bins = 50, color = "seagreen", fill = "seagreen", alpha = 0.5) +
  scale_x_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(x = "Sale Price ['000 $]")

ames_all |>
  filter(dataset == "train") |>
  ggplot(aes(sample = SalePrice/1000)) +
  stat_qq(color = "seagreen", alpha = 0.5) +
  stat_qq_line(color = "seagreen") +
  scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
  labs(y = "Sale\nPrice\n['000 $]", x = "Theoretical Quantiles (Norm. Dist.)")
```
(a) Histogram
(b) Q-Q Plot
Figure 10: Analysis of the target variable. All the assumptions of OLS and linear regression are broken, as usual.
Clearly, the target variable is not well-behaved. It shows a heavily right-skewed, log-normal-like distribution. In order to achieve better predictions, a transformation might be needed.
Transformation of the response
From what I’ve seen, processes that have to do with money, e.g., sale prices, product costs, salaries, are most of the time log-normal. Rarely is anything free, so you don’t get zeros or negative values; think about the top 1% of rich people, or anything that becomes premium the higher the price. These are long and sparse right tails.
The logarithmic transformation works when there are few large values; in this case, there are only a few expensive houses. The logarithmic scale will shrink that right tail5, making the distribution more normal-like.
5 Think of a logarithmic transformation like squeezing an accordion only with your right arm.
Nevertheless, the moment the first data transformation is performed, we start to lose interpretability very quickly. Nowadays, the most accurate predictions need more than one transformation in order to work.
Because we have powerful computers and libraries with ready-made functions to perform variable transformations, I will use the Box-Cox transformation, a family of functions that includes the logarithm depending on the value of a parameter lambda. This parameter is assessed automatically.
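The transformation chunk is folded; a minimal sketch of how SalePrice_bc and boxcox_lambda (both used further down) could be produced, here with MASS::boxcox() and a hand-rolled box_cox() helper, is:

```r
# Assumed sketch: estimate lambda on the training rows, then transform the response.
bc <- MASS::boxcox(SalePrice ~ 1,
                   data   = ames_all |> filter(dataset == "train"),
                   plotit = FALSE)
boxcox_lambda <- bc$x[which.max(bc$y)]  # lambda maximising the profile log-likelihood

box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-6) log(y) else (y^lambda - 1) / lambda
}

ames_all <- ames_all |>
  mutate(SalePrice_bc = box_cox(SalePrice, boxcox_lambda))  # NA for the test rows
```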
Figure 12: The transformed response shows a more bell-like shape.
… and it is, indeed. Well, except for the tails, the transformed values are now more normal-like at the core.
Do not include the transformation of the response as a step within the recipe. It will cause an error with the predict function trying to locate SalePrice in the test dataset.
PHASE 2: MODELLING
“This is where the fun begins.”
One of the motivations for developing this project is that I’m switching to the Tidymodels framework. Therefore, it’s time to load the library and train a simple reference model.
Resampling for Test Error Estimation
This is a critical step for final model selection. Depending on the resampling technique chosen to estimate the Goodness-of-Prediction, or test error, metrics like RMSE can be slightly optimistic or conservative. Take a look at the pristine explanation by (Zablotski 2022).
Simple validation set? 1 train fold and 1 test fold: NO, it’s old and inaccurate.
v-folds Cross-Validation? Analysis (training folds), Assessment (validation folds) and test: it’s small data, I can choose something even more powerful.
The Bootstrap? YES.
By today’s standards this dataset is small, therefore I will choose Bootstrapping, which keeps the resamples the same size as the training set. The probability of an observation being sampled at least once is about 0.63, so the complementary part of the training set (around 37%) becomes the assessment set. Although performance metrics will be slightly conservative or pessimistic, I prefer it.
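As a quick check of that 0.63 figure: each bootstrap sample draws \(n\) observations with replacement from \(n\), so the probability that a given observation appears at least once is

\[
1 - \left(1 - \frac{1}{n}\right)^{n} \approx 1 - e^{-1} \approx 0.632 \quad \text{for large } n,
\]

leaving roughly 37% of the rows out-of-bag for the assessment set.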
I will generate 1000 bootstrap-samples.
I will use 10-fold CV repeated 10 times (100 splits) for the hyperparameter tuning of the more complex algorithms. Once optimised, I’ll assess the test error again with the 1000 bootstraps, for the sake of comparison.
Expending the data budget
Normally v-fold CV spends my data budget, but here resampling with replacement is kind of like printing money… Here’s the code to generate the bootstraps and thereby split the data budget:
Code
```{r}
#| label: DataBudget
#| warning: false
#| fold: false

library(tidymodels)
tidymodels_prefer()
# conflicted::conflicts_prefer(scales::discard)

# Select the pre-processed training dataset
ames_train <- ames_all |> filter(dataset == "train")

# ---- v-fold Cross-Validation ----
set.seed(1982)
ames_folds <- vfold_cv(ames_train, v = 10, repeats = 10, strata = SalePrice_bc) # v = 10 folds is the default

set.seed(1982)
ames_folds_xgb <- vfold_cv(ames_train, v = 10, repeats = 1, strata = SalePrice_bc)

# ---- The Bootstrap ----
set.seed(1982)
ames_boots <- bootstraps(ames_train, times = 1000, strata = SalePrice_bc)
ames_boots
```
For the sake of RMSE comparison I will always scale the numeric predictors and use the Box-Cox-transformed response SalePrice_bc, given the improvement shown in Figure 12.
The following code shows the preprocessing steps enclosed in a pipeline or recipe. The list of available step functions can be found here.
Code
```{r}
#| label: fig-skewed-feats
#| fig-cap: "Sample of skewed numerical predictors that contain zero as value."

ames_train |>
  select(c(GrLivArea, TotBaths, TotalExtArea, TotalBsmtSF, YearBuilt, SecondFlrSF)) |>
  pivot_longer(GrLivArea:SecondFlrSF) |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 50, color = "lightgrey", fill = "lightgrey") +
  labs(x = "") +
  facet_wrap(~name, scales = "free")
```
Figure 13: Sample of skewed numerical predictors that contain zero as value.
Given the sensitivity of linear models to tails and outliers (remember Figure 4 (a)), it will be convenient to try to normalise the distribution of these predictors. However, the Box-Cox transformation only works for positive data. The alternative is the more general Yeo-Johnson transformation, which works for positive, zero and negative values.
Note that SalePrice_bc is the transformed outcome. Click here for more info.
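The recipe itself is folded; a sketch of the preprocessing described here (scaling, Yeo-Johnson on numeric predictors, dummy encoding of factors), with assumed role handling, would be:

```r
# Assumed sketch of the preprocessing recipe; step names are from {recipes}.
ames_rec <- recipe(SalePrice_bc ~ ., data = ames_train) |>
  update_role(Id, dataset, SalePrice, new_role = "id") |>  # keep but don't model
  step_YeoJohnson(all_numeric_predictors()) |>             # un-skew numeric Feats.
  step_normalize(all_numeric_predictors()) |>              # centre and scale
  step_dummy(all_nominal_predictors())                     # dummy-encode factors
```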
OK! We can see the power of adding good predictors in the test RMSE estimate.
My reference is a linear GAM with splines, showing a mean RMSE = 0.0812. The response was transformed in the pipeline, therefore no units are displayed.
I got slightly better results using dummy-encoded factors. Moreover, applying the Yeo-Johnson transformation to the numeric Feats. helped too! They were skewed, see Figure 13.
Elastic-Net Regression
Given the high multicollinearity, it’s reasonable to think of some sort of penalisation or shrinkage method. Hence, I’ll go with a Generalised Linear Model (GLM). In this manner I can get a first look at feature importance.
I’ll choose Elastic-Net regression, combining the best of Lasso and Ridge. I must remind myself that predictors must be scaled! I don’t want to over-penalise non-scaled variables when using shrinkage methods.
A very low penalty of \(\lambda = 0.0008\) with a mixture of \(\alpha = 0.15\) turned out to be the best parameters.
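The tuning chunk is folded; with those values plugged in, the final specification would look roughly like this (ames_rec being the recipe sketched above, and final_glmnet the object name used in the submission chunk below):

```r
# Assumed sketch of the final Elastic-Net workflow with the tuned values.
elnet_spec <- linear_reg(penalty = 0.0008, mixture = 0.15) |>
  set_engine("glmnet")

final_glmnet <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(elnet_spec)
```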
New fit with new feature engineering, mean RMSE = 0.0636 (1000 Bootstraps). Kaggle’s Top 15%.
Elastic-Net Submission
Code
```{r}
#| label: Elastic-Net_submission
#| eval: false

# Get best model and fit with whole training set
fit_glm_elnet <- final_glmnet |> fit(ames_train)

# Predicting on the test set
ames_test <- ames_all |>
  # select(-SalePrice) |>
  filter(dataset == "test")

res_elasticnet <- predict(fit_glm_elnet, ames_test) |>
  mutate(SalePrice_pred = inv_boxcox(.pred, boxcox_lambda)) # original values in USD

# .csv for submission
sub_elasticnet <- data.frame(Id = ames_test$Id, SalePrice = res_elasticnet$SalePrice_pred)

write.csv(sub_elasticnet, "./data/submissions/jthom_sub_5_elasticnet.csv",
          quote = FALSE, row.names = FALSE)
```
XGBoost Regression
Because of the multicollinearity, XGBoost is one of the best candidates for this project.
A total of 2.8 hours for the search over 6 hyperparameters with 60 randomly-generated grid candidates using 100 splits (10-fold CV, repeated 10 times). See the commented code above with the best ones selected (by RMSE).
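That folded chunk is not reproduced here; a sketch of such a search, guessing which six hyperparameters were tuned and reusing the repeated-CV folds created earlier, could be:

```r
# Assumed sketch of the XGBoost tuning set-up (6 tuned hyperparameters, 60 random
# candidates, 100 resamples); the actual folded chunk may differ.
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), min_n = tune(), loss_reduction = tune(),
  sample_size = tune(), mtry = tune(), learn_rate = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")

xgb_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(xgb_spec)

set.seed(1982)
xgb_res <- tune_grid(
  xgb_wflow,
  resamples = ames_folds,        # 10-fold CV repeated 10 times (100 splits)
  grid      = 60,                # 60 randomly generated candidates
  metrics   = metric_set(rmse)
)
```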
XGBoost fit: Test Error estimation mean (of 1000 Bootstraps) RMSE = 0.0666.
XGBoost Submission
Results were not submitted given the slightly worse performance than the Elastic-Net (above). However, they will be used to build an ensemble (stacking).
Code
```{r}
#| label: XGB_submission
#| eval: false

# Get best model and fit with whole training set
fit_xgboost <- final_xgboost |> fit(ames_train)

# Predicting on the test set
ames_test <- ames_all |> filter(dataset == "test")

res_xgboost <- predict(fit_xgboost, ames_test) |>
  mutate(SalePrice_pred = inv_boxcox(.pred, boxcox_lambda)) # original values in USD

# .csv for submission
sub_xgboost <- data.frame(Id = ames_test$Id, SalePrice = res_xgboost$SalePrice_pred)

write.csv(sub_xgboost, "./data/submissions/jthom_sub_4_xgboost.csv",
          quote = FALSE, row.names = FALSE)
```
Models Stacking
Mean of Predictions
The simplest ensemble, often very effective when the natures of the models differ, like in this case! The Elastic-Net regression had a better RMSE than XGBoost, therefore I can give it more weight:
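The stacking chunk isn’t shown; a sketch of a weighted mean of the two prediction sets, with illustrative weights (not the ones actually submitted), would be:

```r
# Assumed sketch of the weighted-mean ensemble; the 0.6 / 0.4 weights are illustrative.
w_elnet <- 0.6
w_xgb   <- 1 - w_elnet

sub_stack <- data.frame(
  Id        = ames_test$Id,
  SalePrice = w_elnet * res_elasticnet$SalePrice_pred +
              w_xgb   * res_xgboost$SalePrice_pred
)

write.csv(sub_stack, "./data/submissions/jthom_sub_stack_mean.csv",  # assumed file name
          quote = FALSE, row.names = FALSE)
```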
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.