Prediction of House Prices in Ames

A commented and visual walk-through of advanced regression techniques.

Author

Jorge A. Thomas

Published

December 1, 2023

Figure 1: Ames, Iowa - USA (image tabs: Houses, Location).

Introduction

In this classic Kaggle dataset (Anna Montoya 2016), you’ll follow my workflow for developing pipelines in the ETL phase of the DS cycle using the R programming language (R Core Team 2023), as well as a tidy approach to the EDA, substantially supported by Visual Analytics. I’ll be adding some tips in the right margin along the analysis.

The Ames House dataset was prepared to improve on the older classic Boston Houses dataset and, given its open competition nature, it comprises two files:

  1. One file contains features and labels or, what is the same, predictors and responses. Here, this is the training dataset. It’s all I have to split the Data Budget.

  2. Another file contains only features (predictors) to make predictions for the final submissions. Here this is the test dataset.

The goal of the competition is to achieve the minimum RMSE.
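
For reference, the root mean squared error of \(n\) predictions \(\hat{y}_i\) against the true values \(y_i\) is:

\(\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}\)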

I will keep updating this notebook, making improvements and reporting the rank obtained with my submitted predictions.

PHASE I: ETL / EDA

Extract Data

I checked beforehand that there are no missing values, here NA, in the target variable SalePrice. Therefore, I will write a pipeline to read and concatenate both datasets (bind rows), adding an extra column dataset labelled “train” or “test” for easy splitting later1 (subset or filter).

  • 1 The whole Feature Transformation pipeline must always be the same for all predictors in both datasets.

  • Code
    ```{r}
    #| label: Load Data
    #| warning: false
    
    ames_train_raw <- read_csv("./data/raw/train.csv") # Train, validation and test dataset
    print("Dimensions of training dataset")
    dim(ames_train_raw)
    
    ames_test_raw <- read_csv("./data/raw/test.csv")  # Features for submission dataset
    print("Dimensions of test dataset containing only Feats.")
    dim(ames_test_raw) 
    
    # Add target column with NA so both DFs can be concatenated
    ames_test_raw <- 
      ames_test_raw |> 
        mutate(SalePrice = NA)
    
    # Binding and adding identifier column "dataset" 
    ames_all <- 
      bind_rows(list(train = ames_train_raw, test = ames_test_raw), .id = "dataset")
    
    print("Available variables:")
    names(ames_all)
    ```
    [1] "Dimensions of training dataset"
    [1] 1460   81
    [1] "Dimensions of test dataset containing only Feats."
    [1] 1459   80
    [1] "Available variables:"
     [1] "dataset"       "Id"            "MSSubClass"    "MSZoning"     
     [5] "LotFrontage"   "LotArea"       "Street"        "Alley"        
     [9] "LotShape"      "LandContour"   "Utilities"     "LotConfig"    
    [13] "LandSlope"     "Neighborhood"  "Condition1"    "Condition2"   
    [17] "BldgType"      "HouseStyle"    "OverallQual"   "OverallCond"  
    [21] "YearBuilt"     "YearRemodAdd"  "RoofStyle"     "RoofMatl"     
    [25] "Exterior1st"   "Exterior2nd"   "MasVnrType"    "MasVnrArea"   
    [29] "ExterQual"     "ExterCond"     "Foundation"    "BsmtQual"     
    [33] "BsmtCond"      "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"   
    [37] "BsmtFinType2"  "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"  
    [41] "Heating"       "HeatingQC"     "CentralAir"    "Electrical"   
    [45] "1stFlrSF"      "2ndFlrSF"      "LowQualFinSF"  "GrLivArea"    
    [49] "BsmtFullBath"  "BsmtHalfBath"  "FullBath"      "HalfBath"     
    [53] "BedroomAbvGr"  "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd" 
    [57] "Functional"    "Fireplaces"    "FireplaceQu"   "GarageType"   
    [61] "GarageYrBlt"   "GarageFinish"  "GarageCars"    "GarageArea"   
    [65] "GarageQual"    "GarageCond"    "PavedDrive"    "WoodDeckSF"   
    [69] "OpenPorchSF"   "EnclosedPorch" "3SsnPorch"     "ScreenPorch"  
    [73] "PoolArea"      "PoolQC"        "Fence"         "MiscFeature"  
    [77] "MiscVal"       "MoSold"        "YrSold"        "SaleType"     
    [81] "SaleCondition" "SalePrice"    

    To organise my Data Budget, I count on the train dataset with 1460 observations. Note that Id and the just-added dataset column are not predictors. Hence, there are 79 features to play with. Also remember that the number of features for modelling will vary: some will be discarded and some will be created along the analysis.

    Impute Missing Values

    First things first: read the full data description2. There you’ll find that for several columns a missing value NA actually means “None”. The physical absence of a given feature in a house is a category in itself, exposing the lack of a quality that can have a significant impact on the response SalePrice. Later on, it is very likely that binary features indicating the existence of certain installations should be created.

  • 2 Print a copy, make notes and study it. This is the first step to get into domain knowledge, in this case Real Estate, for later Feature Engineering.

  • Therefore, I will fill the empty NA fields of the indicated columns of both datasets with the string “None”.

    Code
    ```{r}
    #| label: easy replace_na
    
    cols_NA_to_none <- list(
      Alley = "None",
      BsmtQual = "None", BsmtCond = "None", BsmtExposure = "None", BsmtFinType1 = "None", BsmtFinType2 = "None", 
      FireplaceQu = "None", 
      GarageType = "None", GarageFinish = "None", GarageQual = "None", GarageCond = "None", 
      PoolQC = "None",
      Fence = "None", 
      MiscFeature = "None")
    
    ames_all <- ames_all |>
      replace_na(cols_NA_to_none)   
    ```

    One of the early and recurrent steps of the EDA is to check the completeness of the data. Let’s search for missing values after filling the indicated fields3. For this case I wrote the function count_na(), which generates the table displayed in the right margin (a minimal sketch of this helper is shown after the table).

  • 3 Write a function to quantify missing values depending on the language and labelling system used.

  • ```{r}
    #| label: Counting NAs
    #| code-fold: false
    #| column: margin
    
    # Remaining NA count leaving out target SalePrice
    ames_all |> 
      select(-SalePrice) |>
      count_na() |>  
      knitr::kable(caption = 'Ames dataset')
    ```
    Ames dataset

    | Variable     | NA_count | Percent |
    |--------------|---------:|--------:|
    | LotFrontage  |      486 |   16.65 |
    | GarageYrBlt  |      159 |    5.45 |
    | MasVnrType   |       24 |    0.82 |
    | MasVnrArea   |       23 |    0.79 |
    | MSZoning     |        4 |    0.14 |
    | Utilities    |        2 |    0.07 |
    | BsmtFullBath |        2 |    0.07 |
    | BsmtHalfBath |        2 |    0.07 |
    | Functional   |        2 |    0.07 |
    | Exterior1st  |        1 |    0.03 |
    | Exterior2nd  |        1 |    0.03 |
    | BsmtFinSF1   |        1 |    0.03 |
    | BsmtFinSF2   |        1 |    0.03 |
    | BsmtUnfSF    |        1 |    0.03 |
    | TotalBsmtSF  |        1 |    0.03 |
    | Electrical   |        1 |    0.03 |
    | KitchenQual  |        1 |    0.03 |
    | GarageCars   |        1 |    0.03 |
    | GarageArea   |        1 |    0.03 |
    | SaleType     |        1 |    0.03 |
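
    For reference, here is a minimal sketch of what the count_na() helper could look like (my actual implementation may differ slightly):

    ```{r}
    #| label: count_na-sketch
    #| eval: false
    
    # Count NAs per column, keep only the columns with missing values, and
    # report counts and percentages sorted in descending order.
    count_na <- function(df) {
      na_counts <- colSums(is.na(df))
      out <- tibble::tibble(
        Variable = names(na_counts),
        NA_count = as.integer(na_counts),
        Percent  = round(100 * na_counts / nrow(df), 2)
      ) |>
        dplyr::filter(NA_count > 0) |>
        dplyr::arrange(dplyr::desc(NA_count))
      if (nrow(out) == 0) return("No missing values (NA) found.")
      out
    }
    ```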

    I’ll fill the remaining missing values as follows:

    • The LotFrontage column refers to the linear feet of street connected to the house. This feature is not well documented, and its missing percentage relative to the other variables makes it not only unreliable but also very low in variance. Therefore, I will delete the feature.

    • Every time a Garage doesn’t exist, i.e., GarageType = "None", there is a corresponding missing value in Garage Year Built, GarageYrBlt = NA. I will engineer a new feature called GarageNew with three ordinal categories: None, No, Yes. This is based on the delta YearBuilt - GarageYrBlt; I expect that, given the house price, the algorithm will learn that “None” is worse than “No” and so on. Then I will remove the GarageYrBlt predictor.

    • For the rest of the variables, with 1% or fewer missing values NA, I’ll fill the numerical ones with the median and the character ones with the mode.

    Here’s my pipeline for the whole dataset:

    Code
    ```{r}
    #| label: terminating NAs
    
    # "replace_na_with_median" is a function I created
    
    ames_all <- ames_all |>
      mutate(LotFrontage = NULL) |> # Removing LotFrontage
      mutate(GarageNew = if_else(YearBuilt - GarageYrBlt > 0, "Yes", "No")) |> # New Feat.
      replace_na(list(GarageNew = "None")) |>
      mutate(GarageNew = factor(GarageNew, levels = c("None", "No", "Yes"), ordered = TRUE)) |> # 3 levels
      mutate(GarageYrBlt = NULL) |> # Removing old Feat.
      mutate_if(is.numeric, replace_na_with_median)
    ```
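
    For reference, minimal sketches of the two helpers referenced in these pipelines, replace_na_with_median() and find_mode() (my actual implementations may differ slightly):

    ```{r}
    #| label: na-helpers-sketch
    #| eval: false
    
    # Replace the NAs of a numeric vector with its median.
    replace_na_with_median <- function(x) {
      if (is.numeric(x)) x[is.na(x)] <- median(x, na.rm = TRUE)
      x
    }
    
    # Return the most frequent (modal) value of a vector, ignoring NAs.
    find_mode <- function(x) {
      x <- x[!is.na(x)]
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }
    ```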

    Let’s get a list with the mode of each remaining column containing missing values NA.

    Code
    ```{r}
    #| label: Mode for NAs
    
    # Get the mode for remaining columns with NAs:
    # "find_mode()" is a custom function.
    
    # Good, old-fashioned code: apply(ames_all, 2, find_mode)
    
    list_na_mode <- ames_all |> 
      select(MasVnrType, MasVnrArea, MSZoning, Electrical, Utilities, BsmtFullBath,
             BsmtHalfBath, Functional, Exterior1st, Exterior2nd, BsmtFinSF1, BsmtFinSF2,
             BsmtUnfSF, TotalBsmtSF, KitchenQual, GarageCars, GarageArea, SaleType) |>
      map(find_mode)
    
    # map returns a list 
    
    # Replace with the created named-lists
    ames_all <- ames_all |>
      replace_na(list_na_mode) 
    
    # Sanity check of missing values:
    print("Full Ames dataset (train and test)")
    ames_all |>
      select(-SalePrice) |>
      count_na()
    ```
    [1] "Full Ames dataset (train and test)"
    [1] "No missing values (NA) found."

    “The data is complete!” Let’s think about predictors.

    Are there variables (columns) starting with numbers? Yes, so I’ll rename them:

    Code
    ```{r}
    #| label: rename_cols
    
    ames_all <-
      ames_all |>
      rename(FirstFlrSF = "1stFlrSF") |>
      rename(SecondFlrSF = "2ndFlrSF") |>
      rename(threessnporch = "3SsnPorch")
    ```

    Screening of variable importance

    Using the Mutual Information criterion (MI), I’ll rank the extent to which knowledge of each original predictor reduces the uncertainty of SalePrice.

    ```{r}
    #| label: mutual-info
    #| column: margin
    #| code-fold: false
    
    library(infotheo)
    
    # "calc_mi" is a function I wrote.
    
    mi_y_X <- ames_all |>
      filter(dataset == "train") |>
      select(-c(Id, dataset)) |>
      calc_mi_score(target = "SalePrice")
    
    mi_y_X |> 
      head(20) |>
      knitr::kable(caption = "Mutual Info")
    ```
    Mutual Info

    | Variable     |      nats |  Percent |
    |--------------|----------:|---------:|
    | SalePrice    | 2.6515433 | 100.0000 |
    | OverallQual  | 0.6086855 |  22.9559 |
    | Neighborhood | 0.6084916 |  22.9486 |
    | GrLivArea    | 0.5723431 |  21.5853 |
    | GarageArea   | 0.5244219 |  19.7780 |
    | YearBuilt    | 0.5195234 |  19.5932 |
    | YearRemodAdd | 0.4564483 |  17.2144 |
    | TotalBsmtSF  | 0.4473484 |  16.8712 |
    | FirstFlrSF   | 0.4141527 |  15.6193 |
    | GarageCars   | 0.3805914 |  14.3536 |
    | SecondFlrSF  | 0.3735544 |  14.0882 |
    | BsmtUnfSF    | 0.3544288 |  13.3669 |
    | MSSubClass   | 0.3458352 |  13.0428 |
    | BsmtQual     | 0.3425788 |  12.9200 |
    | ExterQual    | 0.3394960 |  12.8037 |
    | KitchenQual  | 0.3283328 |  12.3827 |
    | OpenPorchSF  | 0.3051952 |  11.5101 |
    | FullBath     | 0.2863069 |  10.7977 |
    | GarageFinish | 0.2784786 |  10.5025 |
    | WoodDeckSF   | 0.2706731 |  10.2081 |
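
    For reference, a minimal sketch of what the calc_mi_score() helper could look like, assuming the infotheo package (discretise the numeric columns, compute each variable’s mutual information with the target in nats, and express it as a percentage of the target’s own entropy); my actual implementation may differ slightly:

    ```{r}
    #| label: calc_mi_score-sketch
    #| eval: false
    
    calc_mi_score <- function(df, target) {
      disc <- df |>
        mutate(across(where(is.character), as.factor)) |>
        mutate(across(where(is.numeric), ~ infotheo::discretize(.x)[[1]]))  # equal-frequency bins
      y <- disc[[target]]
      nats <- sapply(disc, function(x) infotheo::mutinformation(x, y))      # MI in nats
      tibble(Variable = names(nats),
             nats = nats,
             Percent = round(100 * nats / nats[target], 4)) |>
        arrange(desc(nats))
    }
    ```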

    Feature Engineering

    I already started the creation of a new predictor in the previous step, to substitute the problematic variable GarageYrBlt, which had a lot of missing values.

    Feature creation and transformations are not sequential, i.e., this is not the only step where Feature Engineering is applied. Think about the next modelling phase, where you may want to reduce features, e.g., where new Principal Components become the aggregators of the original features.

    In this part I will establish which variables are categorical and numerical; this is not always as obvious as it seems4.

  • 4 Integer variables like calendar Years can be treated as categories for the analysis.

  • The first easy thing that comes to mind is to add the total area of the property as a new column named TotAreaSF; SF means square feet.

    After that, I check the data documentation to see which variables can be thought of as factors; this can be tedious if Real Estate is not your domain of expertise.

    After 1.5 hours of reading, googling, going through other notebooks and deliberating (yes, it was tedious), I can summarise four ways to engineer new features depending on the nature of the variables:

    1. Aggregation of continuous numerical surface areas (square feet) into a total, e.g. a new TotalIntArea Feat. summing basement and floor areas, etc. The same for a new TotalExtArea. Then deleting the addend variables.

    2. Aggregation of numerical counts of available installations of a given type into a total, e.g. a new TotBaths. Then deleting the addends.

    3. Categorisation with ascending quality levels of the originally numerical variables OverallQual and OverallCond.

    4. Creation of new binary features that state whether a given installation exists or not.

    After my second Kaggle submission (Top 25%) I completed Kaggle’s Feature Engineering course, which has a lot of tips! I included some new features and tricks from there.

    Here’s the code that implements the four previous steps:

    Code
    ```{r}
    #| label: Feat-Eng
    
    # --- Total Aggregations ----
    ames_all <- ames_all |>
    
      # Continuous numerical ----
      # NEW TotalInteriorArea
      mutate(TotalIntArea = BsmtFinSF1 + BsmtFinSF2 + FirstFlrSF + SecondFlrSF) |>    
    
      # NEW Interior Areas to lengths (sqrt Feat. Transformation)
      # mutate(TotalIntLength = sqrt(TotalIntArea)) |>
      # mutate(BsmtFinF1 = sqrt(BsmtFinSF1)) |>
      # mutate(BsmtFinF2 = sqrt(BsmtFinSF2)) |>
      # mutate(FirstFlrF = sqrt(FirstFlrSF)) |>
      # mutate(SecondFlrF = sqrt(SecondFlrSF)) |> 
    
      select(-c(BsmtFinSF1, BsmtFinSF2)) |> # Bye bye...
    
      # NEW TotalExteriorArea
      mutate(TotalExtArea = WoodDeckSF + OpenPorchSF + EnclosedPorch + threessnporch + ScreenPorch + GarageArea) |>
      select(-c(WoodDeckSF, OpenPorchSF, EnclosedPorch, threessnporch, ScreenPorch)) |> # Bye bye...
     
      # NEW Exterior Areas to lengths (sqrt Feat. Transformation)
      # mutate(TotalExtLength = sqrt(TotalExtArea)) |>
    
      # NEW GarageLength
      # mutate(GarageLength = sqrt(GarageArea)) |>
    
      # Counting numerical ----
      # NEW TotalBaths
      mutate(TotBaths = BsmtFullBath + 0.5 * BsmtHalfBath + FullBath + 0.5 * HalfBath) |> 
      select(-c(BsmtFullBath, BsmtHalfBath, FullBath, HalfBath)) # Bye bye...
    
    
    # ---- Ordered Categorical Features ----
    
    # Check which columns have "None"
    # sapply(ames_all, function(x) sum(x == "None"))
    
    ames_all <- ames_all |>
      # 10 Levels:
      mutate(OverallQual = factor(OverallQual, seq(1, 10), ordered = TRUE)) |>
      mutate(OverallCond = factor(OverallCond, seq(1, 10), ordered = TRUE)) |> 
    
      # 5 Levels + None
      mutate(ExterQual = factor(ExterQual, c("Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(ExterCond = factor(ExterCond, c("Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(BsmtQual = factor(BsmtQual, c("None", "Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(BsmtCond = factor(BsmtCond, c("None", "Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(HeatingQC = factor(HeatingQC, c("Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(KitchenQual = factor(KitchenQual, c("Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(FireplaceQu = factor(FireplaceQu, c("None", "Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(GarageQual = factor(GarageQual, c("None", "Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(GarageCond = factor(GarageCond, c("None", "Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
      mutate(PoolQC = factor(PoolQC, c("None", "Po", "Fa", "TA", "Gd", "Ex"), ordered = TRUE)) |>
    
      #  Others:
      mutate(LotShape = factor(LotShape, c("Reg", "IR1", "IR2", "IR3"), ordered = TRUE)) |>
      mutate(LandSlope = factor(LandSlope, c("Sev", "Mod", "Gtl"), ordered = TRUE)) |>
      mutate(BsmtExposure = factor(BsmtExposure, c("None", "No", "Mn", "Av", "Gd"), ordered = TRUE)) |>
      mutate(BsmtFinType1 = factor(BsmtFinType1, c("None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"), ordered = TRUE)) |>
      mutate(BsmtFinType2 = factor(BsmtFinType2, c("None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"), ordered = TRUE)) |>
      mutate(Functional = factor(Functional, c("Sal", "Sev", "Maj1", "Maj2", "Mod", "Min2", "Min1", "Typ"), ordered = TRUE)) |>
      mutate(GarageFinish = factor(GarageFinish, c("None", "Unf", "RFn", "Fin"), ordered = TRUE)) |>
      mutate(PavedDrive = factor(PavedDrive, c("N", "P", "Y"), ordered = TRUE)) |>
      mutate(Utilities = factor(Utilities, c("NoSeWa", "NoSewr", "AllPub"), ordered = TRUE)) |>  
      mutate(Electrical = factor(Electrical, c("Mix", "FuseP", "FuseF", "FuseA", "SBrkr"), ordered = TRUE)) |>
      mutate(Fence = factor(Fence, c("None", "MnWw", "GdWo", "MnPrv", "GdPrv"), ordered = TRUE))
    
    
    # ---- New Binary Features ----
    
    # Does the installation or Feature exist? 
    ames_all <- ames_all |>
      # CentralAir
      mutate(CentralAir = if_else(CentralAir == "N", "No", "Yes")) |>
      mutate(CentralAir = fct_relevel(CentralAir, "No", "Yes"))|>
    
      # NEW IsRemodelled (it will replace YearRemodAdd)
      # mutate(IsRemodelled = if_else(YearRemodAdd - YearBuilt > 0, "Yes", "No")) |>
      # mutate(IsRemodelled = factor(IsRemodelled, levels = c("No", "Yes"))) |>
      # select(-YearRemodAdd) |> # Bye YearRemodAdd
      # Not important
    
      # NEW ExistPool  
      mutate(ExistPool = if_else(PoolArea == 0, "No", "Yes")) |>
      mutate(ExistPool = fct_relevel(ExistPool, "No", "Yes"))|>
      # a dummy of PoolQC makes this Feat. Redundant, BUT PoolQC has no Information! (Eliminated)
    
      # NEW ExistGarage
      mutate(ExistGarage = if_else(GarageArea == 0, "No", "Yes")) |>
      mutate(ExistGarage = fct_relevel(ExistGarage, "No", "Yes"))|>
      select(-c(GarageArea)) |> # Bye
    
      # NEW ExistBsmt
      mutate(ExistBsmt = if_else(TotalBsmtSF == 0, "No", "Yes")) |>
      mutate(ExistBsmt = fct_relevel(ExistBsmt, "No", "Yes"))|>
    
      # NEW Exist2ndFloor
      mutate(Exist2ndFloor = if_else(`SecondFlrSF` == 0, "No", "Yes")) |>  
      mutate(Exist2ndFloor = fct_relevel(Exist2ndFloor, "No", "Yes")) |>
    
      # NEW ExistFireplace
      mutate(ExistFireplace = if_else(Fireplaces == 0, "No", "Yes")) |>
      mutate(ExistFireplace = fct_relevel(ExistFireplace, "No", "Yes")) |>
    
      # NEW ExistMasVeneer: Masonry veneer area
      mutate(ExistMasVeneer = if_else(MasVnrArea > 0, "Yes", "No")) |>
      mutate(ExistMasVeneer = factor(ExistMasVeneer, levels = c("No", "Yes"))) |>
    
      # From Kaggle's Feature Engineering course ---- 
    
      # New ratios
      mutate(LivLotRatio = GrLivArea/LotArea) |>
      mutate(Spaciousness = (FirstFlrSF + SecondFlrSF) / TotRmsAbvGrd) |>
      # select(-c(FirstFlrSF)) |>
    
      # New interactions (by me!)
      mutate(OverallQuCoInt = as.numeric(OverallQual) * as.numeric(OverallCond))
      
     
    
    # ---- Unordered factors ----
    
    # Nominative (unordered) categorical Feats
    cats_feats <- c("MSSubClass", "MSZoning", "Street", "Alley", "LandContour", "LotConfig", 
                    "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", 
                    "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", 
                    "Foundation", "Heating", "CentralAir", "GarageType", "MiscFeature", 
                    "SaleType", "SaleCondition")
    
    ames_all <- ames_all |>
      #mutate(across(where(is.character), factor)) |>
      mutate(across(all_of(cats_feats), ~ factor(.x, ordered = FALSE)))
      
    
    # ---- Numerical Features ----
    
    # BedroomAbvGr (count) left numerical 
    # KitchenAbvGr (count) left numerical 
    # TotRmsAbvGrd (count) left numerical
    # Fireplaces (count) left numerical
    # GarageCars (count) left numerical
    # GarageArea (continuous) removed above after building TotalExtArea and ExistGarage
    # MiscVal left numerical 
    # MasVnrArea (continuous) left numerical
    
    
    # ---- Remove 'LowQualFinSF' ----
    
    # Low quality finished square feet (all floors)
    # Most of the values are 0 -> almost 0 variance. 
    
    ames_all$LowQualFinSF <- NULL
    ames_all$PoolArea <- NULL
    ```

    Column MSSubClass has too many categories; therefore, I will perform target encoding as a step in the recipe.
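
    Conceptually, target encoding replaces each category with a summary of the outcome observed for it in the training data; step_lencode_glm() from the embed package, used later in the recipes, estimates each level’s effect with a GLM. A hypothetical, un-regularised illustration of the idea:

    ```{r}
    #| label: target-encoding-sketch
    #| eval: false
    
    # Minimal illustration only: replace each MSSubClass level with the mean
    # SalePrice observed for it in the training rows (no pooling/regularisation).
    mssubclass_means <- ames_all |>
      filter(dataset == "train") |>
      group_by(MSSubClass) |>
      summarise(MSSubClass_enc = mean(SalePrice))
    
    # The encoded column could then be joined back to the data by MSSubClass.
    ```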

    Recheck mutual info

    Code
    ```{r}
    #| label: recheck_mi
    #| column: margin
    
    # "calc_mi_score" is a function I wrote.
    
    mi_y_X_2 <- ames_all |>
      filter(dataset == "train") |>
      select(-c(Id, dataset)) |>
      calc_mi_score(target = "SalePrice")
    
    mi_y_X_2 |> 
      head(20) |>
      knitr::kable(caption = "Mutual Information")
    ```
    Mutual Information

    | Variable       |      nats |  Percent |
    |----------------|----------:|---------:|
    | SalePrice      | 2.6515433 | 100.0000 |
    | OverallQual    | 0.6086855 |  22.9559 |
    | Neighborhood   | 0.6084916 |  22.9486 |
    | GrLivArea      | 0.5723431 |  21.5853 |
    | TotalExtArea   | 0.5331534 |  20.1073 |
    | TotalIntArea   | 0.5243177 |  19.7741 |
    | YearBuilt      | 0.5195234 |  19.5932 |
    | OverallQuCoInt | 0.4865985 |  18.3515 |
    | YearRemodAdd   | 0.4564483 |  17.2144 |
    | TotalBsmtSF    | 0.4473484 |  16.8712 |
    | FirstFlrSF     | 0.4141527 |  15.6193 |
    | Spaciousness   | 0.4133356 |  15.5885 |
    | TotBaths       | 0.4108588 |  15.4951 |
    | GarageCars     | 0.3805914 |  14.3536 |
    | SecondFlrSF    | 0.3735544 |  14.0882 |
    | BsmtUnfSF      | 0.3544288 |  13.3669 |
    | MSSubClass     | 0.3458352 |  13.0428 |
    | BsmtQual       | 0.3425788 |  12.9200 |
    | ExterQual      | 0.3394960 |  12.8037 |
    | KitchenQual    | 0.3283328 |  12.3827 |

    Drop no-info columns

    ```{r}
    #| label: noinfo_drop
    #| code-fold: false
    
    # < 0.06444 == < 2.4%
    
    noinfo_cols <- mi_y_X_2$Variable[which(mi_y_X_2$Percent < 2.4)]
    
    ames_all <-
      ames_all |>
      select(-all_of(noinfo_cols)) 
    ```

    Searching for simple correlations

    Although nothing is linear, it is good to have an idea of the simple correlations between all variables; it gives me an idea of what kinds of plots to make. After going through the previous steps, it’s obvious that there are correlations. Think about the number of cars that fit in a garage, GarageCars, and the area of the garage, GarageArea.

    Code
    ```{r}
    #| label: fig-num-corr
    #| fig-width: 8
    #| fig-height: 8
    #| fig-cap: The Pearson's correlation of numeric features and response shows high presence of multicollinearity.
    #| warning: false
    #| message: false
    
    library(corrplot)
    
    ames_all |>
      filter(dataset == "train") |>
      select(-c(dataset, Id)) |>
      select(where(is.numeric)) |>  
      cor(method = "pearson") |>
      corrplot(type = "lower", method = "circle", insig = 'blank', order = "hclust", diag = TRUE,
      tl.col="black", tl.cex = 0.8, addCoef.col = 'black', number.cex = 0.6)
    ```

    Figure 2: The Pearson’s correlation of numeric features and response shows high presence of multicollinearity.

    According to the matrix above, GrLivArea has the strongest linear relationship with the target SalePrice, followed by TotalIntArea (given its collinearity with GrLivArea), and then GarageArea and GarageCars, which are heavily correlated with each other, as commented before.

    I’m leaving the insignificant correlations blank, i.e., without a circle, for better visualisation and focus. Now I’m checking the categorical variables and the response.

    Code
    ```{r}
    #| label: fig-cat-corr
    #| fig-width: 10
    #| fig-height: 10
    #| fig-cap: Pearson's correlation of categorical features and response.
    #| warning: false
    #| message: false
    
    ames_all |>
      filter(dataset == "train") |>
      select(-dataset) |>
      select(where(is.factor) | contains("Price")) |>  
      mutate_if(is.factor, as.integer) |> 
      cor(method = "pearson") |>
      corrplot(type = "lower", order="hclust", diag = TRUE, insig = 'blank',
      tl.col="black", tl.cex = 0.8, number.cex = 0.6)
    ```

    Figure 3: Pearson’s correlation of categorical features and response.

    The best linear association with SalePrice is given by the OverallQual factor, followed by everything else that has to do with quality ratings.

    Overall, one can clearly see the multicollinearity clusters, which suggests the use of advanced Feature Selection techniques.

    Visual Exploration

    Now with the previous information in hand, I’m going to visualise separately the two most important quasi-linear relationships: GrLivArea and OverallQual vs. SalePrice.

    Code
    ```{r}
    #| label: fig-cont_linear
    #| message: false
    #| warning: false
    #| column: body
    #| fig-width: 6
    #| fig-height: 6
    #| fig-cap: 'GrLivArea vs. SalePrice - The most "linear" relationship between continuous variables. One can clearly notice the increase in variance as both Living Area and Price increase, i.e., heteroskedasticity.'
    #| cap-location: margin
    #| fig-subcap:
    #|   - "Scatter Plot"
    #|   - "Hexbin Plot"
    #| layout-ncol: 2
    
    ames_all |>
      filter(dataset == "train") |>
      ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +  
      scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +  
      labs(x = expression("Ground Living Area ["~ft^2~"]"), 
           y = "Sale\nPrice\n['000 $]") +  
      geom_point(aes(color = OverallQual), alpha = 0.5) + theme(axis.text.y = element_blank()) +
      geom_smooth(method = "lm", formula =  y ~ splines::bs(x, 3), color = "black", linewidth = 1.5, alpha = 0.5) +
      theme(legend.position = "bottom")
    
    # install.packages("hexbin")
    ames_all |>
      filter(dataset == "train") |>
      ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +  
      scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +    
      geom_hex(bins = 35, colour = "seagreen") + 
      labs(x = expression("Ground Living Area ["~ft^2~"]")) +    
      theme(axis.title.y = element_blank(), legend.position = "bottom")
    ```

    (a) Scatter Plot

    (b) Hexbin Plot

    Figure 4: GrLivArea vs. SalePrice - The most “linear” relationship between continuous variables. One can clearly notice the increase in variance as both Living Area and Price increase, i.e., heteroskedasticity.

    The hexbin plot reveals three cells of price and size where houses are most common in the training dataset. There are more than 50 houses in each bin at approximately 130K, 140K and 180K USD.

    On the other hand, note how problematic the outliers are for the spline fit! This visual justifies removing the outliers, i.e., the biggest yet cheap houses (rightmost data points).

    Code
    ```{r}
    #| label: fig-RemoveOutliers
    #| fig-cap: After removing two outliers from the training part of the dataset, the spline fit is dramatically improved!
    #| column: margin
    
    ames_all |>
      filter(dataset == "train") |>
      select(c(SalePrice, GrLivArea, OverallQual)) |>
      filter(GrLivArea > 4500) |>
      glimpse()
    
    idx_remove <- which(ames_all$dataset == "train" & ames_all$GrLivArea > 4500)
    
    ames_all <-
      ames_all |> 
        filter(!row_number() %in% idx_remove)
    
    ames_all |>
      filter(dataset == "train") |>
      ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +  
      scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +  
      labs(x = expression("Ground Living Area ["~ft^2~"]"), 
           y = "Sale Price ['000 $]") +  
      geom_point(aes(color = OverallQual), alpha = 0.5) + theme(axis.text.y = element_blank()) +
      geom_smooth(method = "lm", 
                  formula =  y ~ splines::bs(x, 3), 
                  color = "black", 
                  linewidth = 1.5, 
                  alpha = 0.5) +
      theme(legend.position = "none", axis.title.y = element_text(angle = 90))
    ```
    Rows: 2
    Columns: 3
    $ SalePrice   <dbl> 184750, 160000
    $ GrLivArea   <dbl> 4676, 5642
    $ OverallQual <ord> 10, 10

    Figure 5: After removing two outliers from the training part of the dataset, the spline fit is dramatically improved!

    The two big houses above were super offers! Or did the housing bubble explode?

    In the same fashion, I want to see the effect of the factor OverallQual.

    Code
    ```{r}
    #| label: fig-cat_linear
    #| message: false
    #| warning: false
    #| fig-width: 7
    #| fig-height: 8
    #| fig-cap: 'OverallQual vs. SalePrice - The most "linear" relationship with a categorical predictor. The raincloud plot shows how different the variances of the factor levels are.'
    
    # install.packages("ggrain")
    library(ggrain)
    
    ames_all |>
      filter(dataset == "train") |>
      ggplot(aes(x = OverallQual, y = SalePrice/1000, colour = OverallQual, fill = OverallQual)) +     
      geom_rain(alpha = 0.3) +  
      scale_y_continuous(breaks = seq(0, 800, 100)) +  
      labs(x = "Overall\nQuality", y = "Sale Price ['000 $]") +
      coord_flip() + 
      theme(legend.position = "none")
    ```

    Figure 6: OverallQual vs. SalePrice - The most “linear” relationship with a categorical predictor. The raincloud plot shows how different the variances of the factor levels are.

    How do sale prices vary with the living area depending on house style?

    Code
    ```{r}
    #| label: fig-houseStyles
    #| fig-column: page-right
    #| fig-width: 8
    #| fig-height: 8
    #| fig-cap: "One and two story houses are the most common within the train dataset."
    
    ames_all |>
      filter(dataset == "train") |>
      ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +  
      scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +  
      labs(x = expression("Ground Living Area ["~ft^2~"]"), 
           y = "Sale\nPrice\n['000 $]") +  
      geom_point(aes(color = OverallQual), alpha = 0.3) +
      geom_smooth(method = "lm", 
                  formula =  y ~ splines::bs(x, 3), 
                  color = "black",
                  se = FALSE,
                  linewidth = 0.7, 
                  alpha = 0.5) +
      facet_wrap(~HouseStyle, nrow = 2) +
      theme(legend.position = "bottom")
    ```

    Figure 8: One and two story houses are the most common within the train dataset.

    How do sale prices vary with the living area depending on location?

    One often hears that location is everything. Let’s see it here:

    Code
    ```{r}
    #| label: fig-houseLocations
    #| fig-width: 7
    #| fig-height: 10
    #| fig-cap: "There's a lot of variation depending on location, as expected. Northridge (NoRidge), College Creek (CollgCr) and Northridge Heights (NridgHt) seem to be the most expensive neighbourhoods."
    #| warning: false
    
    ames_all |>
      filter(dataset == "train") |>
      ggplot(aes(x = GrLivArea, y = SalePrice/1000)) +  
      scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +  
      labs(x = expression("Ground Living Area ["~ft^2~"]"), 
           y = "Sale\nPrice\n['000 $]") +  
      geom_point(aes(color = OverallQual), alpha = 0.3) +
      geom_smooth(method = "lm", 
                  formula =  y ~ splines::bs(x, 3), 
                  color = "black",
                  se = FALSE,
                  linewidth = 0.7, 
                  alpha = 0.5) +
      facet_wrap(~Neighborhood, nrow = 4) +
      theme(legend.position = "none", 
            axis.text.y  = element_text(size = 8),        
            axis.text.x  = element_text(size = 8, angle = 90))
    ```

    Figure 9: There’s a lot of variation depending on location, as expected. Northridge (NoRidge), College Creek (CollgCr) and Northridge Heights (NridgHt) seem to be the most expensive neighbourhoods.

    Analysis of response SalePrice

    How well-behaved is the target?

    Code
    ```{r}
    #| label: fig-Y
    #| message: false
    #| warning: false
    #| column: body
    #| fig-width: 6
    #| fig-height: 6
    #| fig-cap: "Analysis of the target variable. All the assumptions of OLS and linear regression are broken, as usual."
    #| fig-subcap:
    #|   - "Histogram"
    #|   - "Q-Q Plot"
    #| layout-ncol: 2
    
    ames_all |>
      filter(dataset == "train") |>
      ggplot(aes(x = SalePrice/1000)) +  
      geom_histogram(bins = 50, color = "seagreen", fill = "seagreen", alpha = 0.5) +
      scale_x_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +  
      labs(x = "Sale Price ['000 $]")
    
    ames_all |>
      filter(dataset == "train") |> 
      ggplot(aes(sample = SalePrice/1000)) + 
      stat_qq(color = "seagreen", alpha = 0.5) + 
      stat_qq_line(color = "seagreen") +  
      scale_y_continuous(breaks = seq(0, 800, 50), labels = scales::comma) +
      labs(y = "Sale\nPrice\n['000 $]",
           x = "Theoretical Quantiles (Norm. Dist.)")
    ```

    (a) Histogram

    (b) Q-Q Plot

    Figure 10: Analysis of the target variable. All the assumptions of OLS and linear regression are broken, as usual.

    Clearly, the target variable is not well-behaved. It shows a log-normal distribution, heavily skewed to the right of course. In order to achieve better predictions, a transformation might be needed.

    Transformation of the response

    From what I’ve seen, processes that have to do with money, e.g., sale prices, product costs, salaries, are most of the time log-normal. Rarely is something free, so you don’t have zeros or negative values; think about the top 1% of rich people, or anything that becomes more premium the higher the price. These are long and sparse right tails.

    The logarithmic transformation works well when there are few large values, like in this case, where there are only a few expensive houses. The logarithmic scale will shrink that right tail5, making the distribution more normal-like.

  • 5 Think of a logarithmic transformation like pressing an accordion only with your right arm.

  • Nevertheless, the moment the first data transformation is performed, we start to lose interpretability very quickly. Nowadays, the most accurate predictions need more than one transformation in order to work.

    Because we have powerful computers and libraries with ready-made functions to perform variable transformations, I will use the Box-Cox transformation, which is a family of functions that includes the logarithmic one, depending on the value of a parameter lambda. This parameter is estimated automatically.

    The Box-Cox family of transformations:

    \(y(\lambda) = \begin{cases} \frac{y^{\lambda}-1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log y, & \text{if } \lambda = 0 \end{cases}\)

    ```{r}
    #| label: fig-boxcox
    #| fig-cap: "Best parameter lambda = -0.05050505. It is almost a pure logarithmic transformation."
    #| column: margin
    #| code-fold: false
    #| warning: false
    #| message: false
    
    boxcox_trans <- MASS::boxcox(lm(SalePrice ~ 1, 
                                    data = subset(ames_all, dataset == "train")), 
                                 lambda = seq(-1, 1), 
                                 plotit = TRUE)
    
    boxcox_lambda <- as.numeric(boxcox_trans$x[which.max(boxcox_trans$y)])
    
    # New transformed response variable
    ames_all <- ames_all |>
      mutate(SalePrice_bc = (SalePrice^boxcox_lambda - 1) / boxcox_lambda)
    ```

    Figure 11: Best parameter lambda = -0.05050505. It is almost a pure logarithmic transformation.

    The new mutated SalePrice should be better behaved…

    Code
    ```{r}
    #| label: fig-Y-trans
    #| message: false
    #| warning: false
    #| column: body
    #| fig-width: 6
    #| fig-height: 6
    #| fig-cap: "The transformed response shows a more bell-like shape."
    #| fig-subcap:
    #|   - "Histogram"
    #|   - "Q-Q Plot"
    #| layout-ncol: 2
    
    ames_all |>
      filter(dataset == "train") |>  
      ggplot(aes(x = SalePrice_bc)) +  
      geom_histogram(bins = 50, color = "seagreen", fill = "seagreen", alpha = 0.5) +  
      labs(x = "Sale Price [Box-Cox]")
    
    ames_all |>
      filter(dataset == "train") |>  
      ggplot(aes(sample = SalePrice_bc)) + 
      stat_qq(color = "seagreen", alpha = 0.5) + 
      stat_qq_line(color = "seagreen") +   
      labs(y = "Sale\nPrice\n[Box-Cox]", x = "Theoretical Quantiles (Norm. Dist.)")
    ```

    (a) Histogram

    (b) Q-Q Plot

    Figure 12: The transformed response shows a more bell-like shape.

    … and it is, indeed. Well, except for the tails; the transformed values are now more normal-like at the core.

    Do not include the transformation of the response as a step within the recipe. It will cause an error with the predict function trying to locate SalePrice in the test dataset.

    PHASE II: MODELLING

    “This is where the fun begins.”

    One of the motivations for developing this project is that I’m switching to the Tidymodels framework. Therefore, it’s time to load the library and train a simple reference model.

    Resampling for Test Error Estimation

    This is a critical step for the final model selection. Depending on the resampling technique chosen to estimate the Goodness-of-Prediction or test error, metrics like RMSE can be slightly optimistic or conservative. Take a look at the pristine explanation in (Zablotski 2022).

    • Simple validation set? 1 train fold and 1 test fold: NO, it’s old and inaccurate.
    • v-folds Cross-Validation? Analysis (training folds), Assessment (validation folds) and test: it’s small data, I can choose something even more powerful.
    • The Bootstrap? YES.

    By today’s standards this dataset is small, therefore I will choose bootstrapping, which keeps the resamples the same size as the training set. The probability of an observation being sampled at least once is about 0.63, so the complement (around 37% of the observations) becomes the assessment set. Although the performance metrics will be slightly conservative or pessimistic, I prefer it.
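
    As a quick check on that 0.63: the probability that a given observation appears at least once in a bootstrap sample of size \(n\) is

    \(P(\text{selected at least once}) = 1 - \left(1 - \tfrac{1}{n}\right)^{n} \approx 1 - e^{-1} \approx 0.632\)

    for large \(n\), so on average roughly 37% of the training observations are left out of each bootstrap sample and form its assessment set.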

    I will generate 1000 bootstrap samples.

    I will use 10-fold CV repeated 10 times (100 splits) for the hyperparameter tuning of the more complex algorithms. Once optimised, I’ll assess the test error again with the 1000 bootstraps, for the sake of comparison.

    Expending the data budget

    Normally with v-fold CV I expend my data budget, but here resampling with replacement is kind of like printing money… Here’s the code to generate the bootstraps and thereby split the data budget:

    Code
    ```{r}
    #| label: DataBudget
    #| warning: false
    #| code-fold: false
    
    library(tidymodels)
    tidymodels_prefer()
    # conflicted::conflicts_prefer(scales::discard)
    
    # Select the pre-processed training dataset
    ames_train <- 
      ames_all |> 
      filter(dataset == "train")
     
    
    # ---- v-folds Cross Validation ----
    set.seed(1982)
    ames_folds <- vfold_cv(ames_train, v = 10, repeats = 10, strata = SalePrice_bc) # v = 10 folds are default
    
    set.seed(1982)
    ames_folds_xgb <- vfold_cv(ames_train, v = 10, repeats = 1, strata = SalePrice_bc) 
    
    # ---- The Bootstrap ----
    set.seed(1982)
    ames_boots <- bootstraps(ames_train, times = 1000, strata = SalePrice_bc)
    ames_boots  
    ```
    # A tibble: 1,000 × 2
       splits             id           
       <list>             <chr>        
     1 <split [1458/545]> Bootstrap0001
     2 <split [1458/530]> Bootstrap0002
     3 <split [1458/534]> Bootstrap0003
     4 <split [1458/535]> Bootstrap0004
     5 <split [1458/513]> Bootstrap0005
     6 <split [1458/534]> Bootstrap0006
     7 <split [1458/519]> Bootstrap0007
     8 <split [1458/532]> Bootstrap0008
     9 <split [1458/529]> Bootstrap0009
    10 <split [1458/555]> Bootstrap0010
    # ℹ 990 more rows

    For the sake of RMSE comparison, I will always scale the numeric predictors and use the Box-Cox-transformed response SalePrice_bc, given the improvement shown in Figure 12.

    The following code shows the preprocessing steps enclosed in a pipeline or recipe. The list of available step functions can be found here.

    Code
    ```{r}
    #| label: fig-skewed-feats
    #| fig-cap: "Sample of skewed numerical predictors that contain zero as value."
    
    ames_train |>  
      select(c(GrLivArea, TotBaths, TotalExtArea, TotalBsmtSF, YearBuilt, SecondFlrSF)) |> 
      pivot_longer(GrLivArea:SecondFlrSF) |>
      ggplot(aes(x = value)) +  
      geom_histogram(bins = 50, color = "lightgrey", fill = "lightgrey") +  
      labs(x = "") + 
      facet_wrap(~name, scales = "free")
    ```

    Figure 13: Sample of skewed numerical predictors that contain zero as value.

    Given the sensitivity of linear models to tails and outliers (remember Figure 4 (a)), it will be convenient to try to normalise the distribution of these predictors. However, the Box-Cox transformation works only for positive data. The alternative is the more general Yeo-Johnson transformation, which works for both positive and negative values.
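
    For reference, the Yeo-Johnson family extends Box-Cox to zero and negative values:

    \(\psi(y, \lambda) = \begin{cases} \frac{(y+1)^{\lambda}-1}{\lambda}, & y \ge 0,\ \lambda \neq 0 \\ \log(y+1), & y \ge 0,\ \lambda = 0 \\ -\frac{(1-y)^{2-\lambda}-1}{2-\lambda}, & y < 0,\ \lambda \neq 2 \\ -\log(1-y), & y < 0,\ \lambda = 2 \end{cases}\)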

    Note that SalePrice_bc is the transformed outcome. Click here for more info.

    OK! We can see the power of adding good predictors reflected in the estimated test RMSE.

    My reference is the linear GAM with splines, showing a mean RMSE = 0.0812. The response was transformed in the pipeline, therefore no units are displayed.

    I got slightly better results using dummy-encoded factors. Moreover, applying the Yeo-Johnson transformation to the numeric features helped too! They were skewed, see Figure 13.

    Elastic-Net Regression

    Having high multicollinearity, it’s reasonable to think of some sort of penalisation or shrinkage method. Hence, I’ll go with a Generalised Linear Model, GLM. In this manner I can have a first look at feature importance.

    I’ll choose Elastic-Net regression, combining the best of Lasso and Ridge. I must remind myself that the predictors must be scaled! I don’t want to over-penalise non-scaled variables when using shrinkage methods.
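
    For reference, with standardised predictors the glmnet engine minimises the penalised least-squares objective below, where penalty corresponds to \(\lambda\) and mixture to \(\alpha\) (\(\alpha = 0\) is Ridge, \(\alpha = 1\) is Lasso):

    \(\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]\)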

    Elastic-Net Spec & Tuning

    Code
    ```{r}
    #| label: Elastic-net-setup
    #| eval: false
    
    library(embed)
    
    # Elastic Net
    glmnet_base_rec <-
      recipe(SalePrice_bc ~ ., data = ames_train) |>
      update_role(Id, new_role = "Id variable") |>
      update_role(SalePrice, new_role = "original outcome") |>
      update_role(dataset, new_role = "splitting variable") |>
      step_lencode_glm(MSSubClass, outcome = vars(SalePrice_bc)) |>
      step_dummy(all_nominal_predictors()) |>    
      step_YeoJohnson(all_numeric_predictors()) |>
      step_zv(all_predictors()) |> 
      step_normalize(all_numeric_predictors())
      
    # glmnet_base_rec |> prep() |> bake(new_data = NULL) 
    
    # library(usemodels)  # CheatSheets
    
    # Model: Elastic Net -------------
    glmnet_spec <-
      linear_reg(penalty = tune(), mixture = tune()) |> 
      set_engine("glmnet") |>
      set_mode("regression")
      # translate()
    
    glmnet_wf <-
      workflow() |>
      add_recipe(glmnet_base_rec) |>
      add_model(glmnet_spec)
    
    # penalty = lambda; mixture = alpha (0 - Ridge, 1- Lasso)
    
    # penalty_ <- 10^seq(-4, -1, length.out = 20)
    # mixture_ = c(0.05, 0.2, 0.4, 0.6, 0.8, 1)
    penalty_ <- 10^seq(-3, -2, length.out = 40)
    mixture_ <- c(0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6)
    
    glmnet_grid <- tidyr::crossing(penalty = penalty_, 
                                   mixture = mixture_)
    
    # Tune with Folds
    
    tic("opt-ElasticNet")
    set.seed(1982)
    doParallel::registerDoParallel()
    
    glmnet_tune <- tune_grid(glmnet_wf, 
                             resamples = ames_folds,  # CV is faster!
                             grid = glmnet_grid)
    
    toc()
    
    # Explore Results 
    show_best(glmnet_tune, metric = "rmse")
    
    # Set best parameters
    final_glmnet <- glmnet_wf |> finalize_workflow(select_best(glmnet_tune, metric = "rmse"))
    final_glmnet
    
    write_rds(final_glmnet, file = "./models/final_models/final_glmnet.rds")
    ```

    Elastic-Net Test Error Assessment

    Code
    ```{r}
    #| label: final_glmnet_assessment
    #| eval: false
    
    tic("elastic-net fit bootstaps")
    glmnet_boot_fit <- 
      final_glmnet |> 
      fit_resamples(resamples = ames_boots)
      
    glmnet_boot_fit |>
      collect_metrics(summarize = TRUE)
    toc()
    ```
    Code
    ```{r}
    #| label: load_elnet_model
    
    final_glmnet <- read_rds(file = "./models/final_models/final_glmnet.rds")
    final_glmnet
    ```
    ══ Workflow ════════════════════════════════════════════════════════════════════
    Preprocessor: Recipe
    Model: linear_reg()
    
    ── Preprocessor ────────────────────────────────────────────────────────────────
    5 Recipe Steps
    
    • step_lencode_glm()
    • step_dummy()
    • step_YeoJohnson()
    • step_zv()
    • step_normalize()
    
    ── Model ───────────────────────────────────────────────────────────────────────
    Linear Regression Model Specification (regression)
    
    Main Arguments:
      penalty = 0.00888623816274341
      mixture = 0.15
    
    Computational engine: glmnet 

    A very low penalty of \(\lambda \approx 0.0089\) with a mixture of \(\alpha = 0.15\) were the best parameters.

    A new fit with the new feature engineering gives a mean RMSE = 0.0636 (1000 bootstraps). Kaggle’s Top 15%.

    Elastic-Net Submission

    Code
    ```{r}
    #| label: Elastic-Net_submission
    #| eval: false
    
    # Get best model and fit with whole training set
    fit_glm_elnet <- 
      final_glmnet |>
      fit(ames_train)
    
    # Predicting on the test set
    ames_test <- 
      ames_all |>
      # select(-SalePrice) |>
      filter(dataset == "test")
    
    res_elasticnet <- 
      predict(fit_glm_elnet, ames_test) |>
      mutate(SalePrice_pred = inv_boxcox(.pred, boxcox_lambda))  # original values in USD
    
    # .csv for submission
    sub_elasticnet <- data.frame(Id = ames_test$Id, SalePrice = res_elasticnet$SalePrice_pred)
    write.csv(sub_elasticnet, "./data/submissions/jthom_sub_5_elasticnet.csv", quote = FALSE, row.names = FALSE)
    ```
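
    The predictions are mapped back to USD by inverting the Box-Cox transformation. In case an inv_boxcox() helper is not available in your session, a minimal equivalent consistent with the transformation applied above is:

    ```{r}
    #| label: inv-boxcox-sketch
    #| eval: false
    
    # Inverse of y_bc = (y^lambda - 1) / lambda:
    # y = (lambda * y_bc + 1)^(1 / lambda) for lambda != 0, and exp(y_bc) for lambda == 0.
    inv_boxcox <- function(y_bc, lambda) {
      if (abs(lambda) < .Machine$double.eps) exp(y_bc) else (lambda * y_bc + 1)^(1 / lambda)
    }
    ```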

    XGBoost Regression

    Because of the multicollinearity, XGBoost is one of the best candidates for this project.

    XGBoost Spec & Tuning

    Code
    ```{r}
    #| label: XGB-set-tune
    #| eval: false
    
    library(xgboost)
    
    xgboost_rec <-
      recipe(formula = SalePrice_bc ~ ., data = ames_train) |>
      update_role(Id, new_role = "Id variable") |>
      update_role(SalePrice, new_role = "original outcome") |>
      update_role(dataset, new_role = "splitting variable") |>
      step_lencode_glm(MSSubClass, outcome = vars(SalePrice_bc)) |>
      step_dummy(all_nominal_predictors()) |>    
      step_YeoJohnson(all_numeric_predictors()) |>
      step_zv(all_predictors()) |> 
      step_normalize(all_numeric_predictors())  
    
    xgboost_spec <-
      boost_tree(trees = tune(), 
                 min_n = tune(), 
                 tree_depth = tune(), 
                 learn_rate = tune(),
                 loss_reduction = tune(), 
                 sample_size = tune()) |>
      set_mode("regression") |>
      set_engine("xgboost", validation = 0.2) # 20% from train split of each resample
    
    xgboost_wf <-
      workflow() |>
      add_recipe(xgboost_rec) |>
      add_model(xgboost_spec)
    
    # tic("opt_xgboost")
    
    # set.seed(1982)
    # doParallel::registerDoParallel()
    
    # xgboost_tune <-
    #   tune_grid(xgboost_wf, 
    #             resamples = ames_folds, 
    #             grid = 60)
    
    # toc()
    
    # Explore Results 
    # show_best(xgboost_tune, metric = "rmse")
    
    # Best Selection
    #   trees = 1300
    #   min_n = 3
    #   tree_depth = 13
    #   learn_rate = 0.0087033422152885
    #   loss_reduction = 3.4747461708994e-08
    #   sample_size = 0.175630365774268
    
    # Set best parameters
    
    # final_xgboost <- xgboost_wf |>
    #   finalize_workflow(select_best(xgboost_tune, metric = "rmse"))
    
    final_xgboost <- read_rds(file = "./models/final_models/final_xgboost.rds")
    final_xgboost
    
    # write_rds(final_xgboost, file = "./models/final_models/final_xgboost.rds")
    
    # Assessment of test error
    ```

    A total of 2.8 hours for the search over 6 hyperparameters with a randomly generated grid of 60 candidates, using 100 splits (10-fold CV, repeated 10 times). See the commented code above for the selected best values (by RMSE).

    Code
    ```{r}
    #| label: final_xgboost_assessment
    #| eval: false
    
    # tic("test_xgboost")
    
    xgboost_boot_fit <- 
      final_xgboost |> 
      fit_resamples(resamples = ames_boots)
      
    xgboost_boot_fit |>
      collect_metrics(summarize = TRUE)
    
    toc()  
    ```

    XGBoost fit: test error estimate, mean RMSE = 0.0666 (1000 bootstraps).

    XGBoost Submission

    Results were not submitted given the slightly worse performance than the Elastic-Net (above). However, they will be used to build an ensemble (stacking).

    Code
    ```{r}
    #| label: XGB_submission
    #| eval: false
    
    # Get best model and fit with whole training set
    fit_xgboost <- 
      final_xgboost |>
      fit(ames_train)
    
    # Predicting on the test set
    ames_test <- 
      ames_all |>  
      filter(dataset == "test")
    
    res_xgboost <- 
      predict(fit_xgboost, ames_test) |>
      mutate(SalePrice_pred = inv_boxcox(.pred, boxcox_lambda))  # original values in USD
    
    # .csv for submission
    sub_xgboost <- data.frame(Id = ames_test$Id, SalePrice = res_xgboost$SalePrice_pred)
    write.csv(sub_xgboost, "./data/submissions/jthom_sub_4_xgboost.csv", quote = FALSE, row.names = FALSE)
    ```

    Models Stacking

    Mean of Predictions

    The simplest ensemble, often very effective when the natures of the models are different, like in this case! The Elastic-Net regression had a better RMSE than XGBoost, therefore I give it more weight:

    ```{r}
    #| label: stack_1
    #| eval: false
    #| code-fold: false
    
    # (elastic_net_pred + elastic_net_pred + xgboost_pred)/3
    
    avg_elnet_xgb <- ((2*res_elasticnet$SalePrice_pred) + res_xgboost$SalePrice_pred)/3
    
    sub_stack_elnet_xgb <- data.frame(Id = ames_test$Id, SalePrice = avg_elnet_xgb)
    
    write.csv(sub_stack_elnet_xgb, "./data/submissions/jthom_sub_5_stack_elnet_xgb.csv", 
              quote = FALSE, 
              row.names = FALSE)
    ```

    This was a top 11% Kaggle submission! Best so far…

    …to be continued…

    J. Thomas

    References

    Anna Montoya, DataCanary. 2016. “House Prices - Advanced Regression Techniques.” Kaggle. https://kaggle.com/competitions/house-prices-advanced-regression-techniques.
    R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
    Zablotski, Yury. 2022. Effective Resampling for Machine Learning in Tidymodels: rsample R Package Reviews. https://www.youtube.com/watch?v=GtAX7Y2F4aY.