Michal Kapusta

Investment Research | Finance | R programming

Predicting residential property prices in Bratislava using recipes - Linear regression (part I)

The goal of the post

In this blog post, I will predict residential property prices in the city of Bratislava (the Slovak capital). To obtain real-world data, I will scrape a property listing website. After the scraping and data cleaning steps, I will build a linear model that can predict apartment prices.

Acquiring the real world data

The property website www.topreality.sk is a good resource for apartment price listings. In addition, the website's link generator is straightforward. After filtering based on:

  • the desired flat location (in this case all districts of Bratislava)
  • number of bedrooms (1-4)
  • purchase instead of rent offers

the following link is generated:

https://www.topreality.sk/vyhladavanie-nehnutelnosti-1.html?type%5B0%5D=108&type%5B1%5D=102&type%5B2%5D=103&type%5B3%5D=104&type%5B4%5D=105&form=1&obec=2%2C4%2C5%2C6%2C8%2C9%2C10%2C12%2C13%2C14%2C15%2C16%2C17%2C19%2C20%2C21%2C22&n_search=search&gpsPolygon=&searchType=string

The first part of the URL (https://www.topreality.sk/vyhladavanie-nehnutelnosti-1.html) is of particular interest, since everything after it is just filters. Replacing the number 1 with 2 before the .html part loads the second page of the search results:

https://www.topreality.sk/vyhladavanie-nehnutelnosti-2.html?type%5B0%5D=108&type%5B1%5D=102&type%5B2%5D=103&type%5B3%5D=104&type%5B4%5D=105&form=1&obec=2%2C4%2C5%2C6%2C8%2C9%2C10%2C12%2C13%2C14%2C15%2C16%2C17%2C19%2C20%2C21%2C22&n_search=search&gpsPolygon=&searchType=string

This is essential, since the page number is the only variable that changes in the URL string. The icon that jumps to the last page (the >> symbol) also contains the number of the last page. In this case it reveals that 255 web pages are needed to list all the ads. The following URL generator creates the vector of URLs needed to scrape all the listings:

url <- "https://www.topreality.sk/vyhladavanie-nehnutelnosti-1.html?type%5B0%5D=108&type%5B1%5D=102&type%5B2%5D=103&type%5B3%5D=104&type%5B4%5D=105&form=1&obec=2%2C4%2C5%2C6%2C8%2C9%2C10%2C12%2C13%2C14%2C15%2C16%2C17%2C19%2C20%2C21%2C22&n_search=search&gpsPolygon=&searchType=string"

source_urls <- function(url) {
        library(rvest)
        library(stringr)
        library(rebus)   # provides one_or_more(), char_class(), ASCII_ALNUM and %R%
        
        # the ">>" icon links to the last results page; grab its href
        url_length <- url %>%
                read_html() %>%
                html_nodes("a.last") %>%
                html_attr("href") 
        
        # match the "vyhladavanie-nehnutelnosti-<last page number>" part of the href
        ptrn <- one_or_more(char_class(ASCII_ALNUM %R% "-"))
        
        url_length_num <- url_length %>%
                str_extract(pattern = ptrn) %>%
                str_replace(pattern = "vyhladavanie-nehnutelnosti-",
                            replacement = "") %>% 
                as.numeric()
        
        url_length_num <- url_length_num[1]
        
        url_length_seq <- 1:url_length_num
        
        # rebuild the search url for every page from 1 to the last one
        url_seq <- str_c("https://www.topreality.sk/vyhladavanie-nehnutelnosti-", url_length_seq, ".html?type%5B0%5D=108&type%5B1%5D=102&type%5B2%5D=103&type%5B3%5D=104&type%5B4%5D=105&form=1&obec=2%2C4%2C5%2C6%2C8%2C9%2C10%2C12%2C13%2C14%2C15%2C16%2C17%2C19%2C20%2C21%2C22&n_search=search&gpsPolygon=&searchType=string")
        
        print(paste0(length(url_length_seq), " urls generated"))
        
        return(url_seq)
}

urls        <- source_urls(url = url)

The result of the function source_urls is a vector of 255 URLs that need to be scraped in order to gather information about all the flats in the city of Bratislava. Here are the last three URLs:

https://www.topreality.sk/vyhladavanie-nehnutelnosti-253.html?type%5B0%5D=108&type%5B1%5D=102&type%5B2%5D=103&type%5B3%5D=104&type%5B4%5D=105&form=1&obec=2%2C4%2C5%2C6%2C8%2C9%2C10%2C12%2C13%2C14%2C15%2C16%2C17%2C19%2C20%2C21%2C22&n_search=search&gpsPolygon=&searchType=string
https://www.topreality.sk/vyhladavanie-nehnutelnosti-254.html?type%5B0%5D=108&type%5B1%5D=102&type%5B2%5D=103&type%5B3%5D=104&type%5B4%5D=105&form=1&obec=2%2C4%2C5%2C6%2C8%2C9%2C10%2C12%2C13%2C14%2C15%2C16%2C17%2C19%2C20%2C21%2C22&n_search=search&gpsPolygon=&searchType=string
https://www.topreality.sk/vyhladavanie-nehnutelnosti-255.html?type%5B0%5D=108&type%5B1%5D=102&type%5B2%5D=103&type%5B3%5D=104&type%5B4%5D=105&form=1&obec=2%2C4%2C5%2C6%2C8%2C9%2C10%2C12%2C13%2C14%2C15%2C16%2C17%2C19%2C20%2C21%2C22&n_search=search&gpsPolygon=&searchType=string

Using the package rvest we can access the information displayed on each page. The following data points are of interest:

  • price of the flat
  • size of the flat in square meters
  • number of bedrooms
  • street name
  • district name
  • id of the listing (part of the url leading to the listing)
  • date of the listing

To scrape all the information we leverage the rvest, stringr, rebus and purrr packages.

library(rvest)
library(stringr)
library(rebus)
library(purrr)
library(dplyr)

get_ads_data      <- function(url) {
        Sys.sleep(2)            # be polite to the server between requests
        print(url)
        page <- read_html(url)  # download each results page only once and reuse it
        
        price           <- page %>% html_nodes(".price strong") %>% html_text() 
        area            <- page %>% html_nodes(".areas strong:nth-child(1)") %>% html_text()
        listed          <- page %>% html_nodes(".date") %>% html_text()
        price_per_sqm   <- page %>% html_nodes(".priceArea") %>% html_text()
        street          <- page %>% html_nodes(".locality") %>% html_text()
        no_of_bedrooms  <- page %>% html_nodes("li:nth-child(1) .noclick") %>% html_text()
        url_saved       <- page %>% html_nodes("h2") %>% html_nodes("a") %>% html_attr("href") 
        
        # the listing id is the number just before ".html" in the ad url
        ext_pattern <- one_or_more(DIGIT) %R% ".html"
        id <- str_extract(url_saved, pattern = ext_pattern) %>% 
                str_replace(pattern = ".html", replacement = "")
        
        final <- cbind(id, price, area, listed, price_per_sqm,
                       street, no_of_bedrooms, url_saved) %>% as_tibble()
        
        return(final)
}
tidy_scraped_data <- function(data) {
        # clean the raw character columns: drop units, labels and thousands separators,
        # then coerce to numeric
        vct_price        <- data$price %>% str_split(pattern = ",", simplify = TRUE) %>% as_tibble() %>% select(V1) %>% mutate_at(1, str_replace_all, pattern = " ", replacement = "") %>% mutate_at(1, as.numeric) %>% pull(V1)
        vct_price_persqm <- data$price_per_sqm %>% str_split(pattern = ",", simplify = TRUE) %>% as_tibble() %>% select(V1) %>% mutate_at(1, str_replace_all, pattern = " ", replacement = "") %>% mutate_at(1, as.numeric) %>% pull(V1)
        vct_no_bed       <- data$no_of_bedrooms %>% str_split(pattern = " izb", simplify = TRUE) %>% as_tibble() %>% select(V1) %>% mutate_at(1, str_replace, pattern = " ", replacement = "") %>% mutate_at(1, as.numeric) %>% pull(V1)
        # the district name sits in parentheses inside the scraped locality string
        vct_district     <- data$street %>% str_split(pattern = "\\(", simplify = TRUE) %>% as_tibble() %>% select(V2) %>% mutate_at(1, str_replace, pattern = "\\)", replacement = "") %>% pull(V2)
        vct_street       <- data$street %>% str_split(pattern = ",", simplify = TRUE) 
        vct_street       <- vct_street[, 1]
        vct_street       <- vct_street %>% str_remove(pattern = rebus::one_or_more(rebus::DIGIT)) %>% str_trim()
        
        # replace the raw columns with the cleaned vectors
        data$price          <- vct_price
        data$price_per_sqm  <- vct_price_persqm
        data$street         <- vct_street
        data$district       <- vct_district
        data$no_of_bedrooms <- vct_no_bed
        
        # strip the " m2" suffix from area (column 3) and convert to numeric
        df_final <- data %>% mutate_at(3, str_replace, pattern = " m2", replacement = "") %>% mutate_at(3, as.numeric)
        return(df_final)
}

# scrape every results page, skipping pages that fail, and bind the rows
df_ads_raw <- urls %>%
                map(safely(get_ads_data)) %>%
                map_df("result")

df_ads <- df_ads_raw %>%
                tidy_scraped_data()

The scraper generates the following data table:

id price area listed price_per_sqm street no_of_bedrooms url_saved district
6560969 134500 57 8.8.2018 2364 Haburská 2 https://www.topreality.sk/predaj-slnecny-2-izbovy-byt-bratislava-ii-ruzinov-strkovec-loggia-r6560969.html Bratislava II
6568976 147000 81 8.8.2018 1814 Martákovej 3 https://www.topreality.sk/slnecny-3-izb-byt-81-m2-balkon-velka-murovana-pivnica-samostatne-izby-147-000-eur-3-posch-3-na-martakovej-ul-v-ruzinove-na-posni-r6568976.html Bratislava II
6488721 149990 69 8.8.2018 2173 Kadnárova 3 https://www.topreality.sk/najlacnejsibyt-sk-baiii-krasnany-kadnarova-ul-3-i-69-m2-kompletna-rekonstrukcia-2018-r6488721.html Bratislava III
6031939 183982 67 8.8.2018 2746 Šancová 3 https://www.topreality.sk/vila-ladova-oaza-v-centre-mesta-poslednych-5-volnych-3-izbovych-bytov-bratislava-ulica-ladova-sancova-novostavba-stare-mesto-r6031939.html Bratislava I
5653750 90600 35 8.8.2018 2588 Údernícka 2 https://www.topreality.sk/vystavba-zahajena-siahnite-na-vrchol-petrzalky-novostavba-za-bezkonkurencnu-cenu-r5653750.html Bratislava V
5653749 134524 71 8.8.2018 1894 Údernícka 3 https://www.topreality.sk/vystavba-zahajena-siahnite-na-vrchol-petrzalky-novostavba-3i-v-cenach-od-134-000-r5653749.html Bratislava V

The table contains information about ca 2379 listings, stored in a consistent way. Perfect!

Exploratory Data Analysis

Once the data is obtained, we can proceed to the exploratory data analysis. First, let's look at the summary statistics.

library(skimr)
df_ads %>% skimr::skim_to_list()
## $character
## # A tibble: 4 x 8
##   variable  missing complete n     min   max   empty n_unique
## * <chr>     <chr>   <chr>    <chr> <chr> <chr> <chr> <chr>   
## 1 district  0       2121     2121  12    14    0     5       
## 2 listed    0       2121     2121  8     10    0     184     
## 3 street    0       2121     2121  1     35    0     566     
## 4 url_saved 0       2121     2121  51    237   0     2121    
## 
## $integer
## # A tibble: 3 x 12
##   variable missing complete n     mean  sd    p0    p25   p50   p75   p100 
## * <chr>    <chr>   <chr>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 area     2       2119     2121  "   ~ "   ~ 11    52    69    85    420  
## 2 id       0       2121     2121  6400~ 3493~ 1887~ 6397~ 6514~ 6549~ 6570~
## 3 no_of_b~ 33      2088     2121  "   ~ "   ~ 1     2     3     3     4    
## # ... with 1 more variable: hist <chr>
## 
## $numeric
## # A tibble: 2 x 12
##   variable missing complete n     mean  sd    p0    p25   p50   p75   p100 
## * <chr>    <chr>   <chr>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 price    6       2115     2121  1758~ 8848~ 40000 1249~ 1499~ 2e+05 1e+06
## 2 price_p~ 58      2063     2121  "  2~ " 11~ 2     2065  2375  2763  33150
## # ... with 1 more variable: hist <chr>

The table shows we have successfully scraped ca 2100 records. Approximately 100 records are incomplete and will be removed from the analysis. A few other observations (a quick check of the quoted averages is sketched after the list):

  • The average price of a flat in Bratislava is ca 173.000 EUR
  • The average flat size is 73 square meters
  • The average price per square meter is 2478 EUR
  • The district variable needs to be converted to a factor
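
A minimal sketch of that check, assuming the tidied df_ads from above; the exact figures can differ slightly depending on which incomplete rows are dropped.

df_ads %>%
        drop_na() %>%
        summarise(avg_price         = mean(price),
                  avg_area          = mean(area),
                  avg_price_per_sqm = mean(price_per_sqm))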

Now let's look at the data! The following charts show the basic distribution of price and log(price).

df_ads  %>% drop_na() %>% 
        mutate(log_price = log10(price)) %>% 
        select(price, log_price) %>% 
        gather(Ratio, Value,1:2) %>% 
        ggplot(aes(Value)) +
        geom_histogram() +
        theme_minimal(base_family = "Verdana",base_size = 12)   + 
        facet_wrap(~Ratio, scales = "free") +
        theme(plot.title = element_text(face = "bold")) + 
        labs(title = "DISTRIBUTION OF APARTMENT PRICES IN BRATISLAVA")

The charts show that the majority of ads are priced in the range of 125.000 EUR - 200.000 EUR. The distribution is right-skewed and needs to be transformed using a logarithm. The log10(price) transformation normalises the distribution (more suitable for forecasting and for human comprehension). For the backward conversion, calculate for example: 10^5.5 ≈ 316227 EUR.
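
The back-conversion is a one-liner in R:

# convert a log10 price back to EUR
10^5.5
## [1] 316227.8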

Now, we would expect prices to vary across the five districts of Bratislava. Let's find out!
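
The original chart is not reproduced here; the following is a minimal sketch of a district-level comparison, assuming the tidied df_ads and the ggplot2 setup used in the charts above.

df_ads %>%
        drop_na() %>%
        mutate(log_price = log10(price)) %>%
        ggplot(aes(district, log_price)) +
        geom_boxplot() +
        theme_minimal(base_size = 12) +
        theme(plot.title = element_text(face = "bold")) +
        labs(title = "LOG PRICE BY DISTRICT",
             x = NULL, y = "log10(price)")

The district-level view shows: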

  • Prices are highest in Bratislava I, the old town district
  • The BA I median price is around 5.45 on the log scale, or roughly 281.000 €
  • The most competitively priced district is BA V (small range between the 25th and 75th percentiles)

Let's dig deeper into the dataset. Next, let's look at the price distributions within each district, broken down by the number of bedrooms.
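
Again, only a sketch of the chart is shown here; the facet layout is an assumption about how the original figure was built.

df_ads %>%
        drop_na() %>%
        mutate(log_price = log10(price),
               no_of_bedrooms = as.factor(no_of_bedrooms)) %>%
        ggplot(aes(no_of_bedrooms, log_price)) +
        geom_boxplot() +
        facet_wrap(~district) +
        theme_minimal(base_size = 12) +
        theme(plot.title = element_text(face = "bold")) +
        labs(title = "LOG PRICE BY NUMBER OF BEDROOMS AND DISTRICT",
             x = "number of bedrooms", y = "log10(price)")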

The breakdown reveals:

  • Price increases with the number of bedrooms (exception: BA V prices for 2 and 3 bedrooms)
  • Some of the ranges are very wide; big differences occur within districts as well
  • The most costly category is a 4-bedroom flat in BA I, costing ~400.000 EUR (10^5.6)

Next, let's look at a quantitative variable: area compared to price. A sketch of the chart follows.
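
This is a minimal sketch, assuming the tidied df_ads; the per-district linear fit stands in for the model line discussed below.

df_ads %>%
        drop_na() %>%
        ggplot(aes(area, log10(price), color = district)) +
        geom_point(alpha = 0.4) +
        geom_smooth(method = "lm", se = FALSE) +
        theme_minimal(base_size = 12) +
        theme(plot.title = element_text(face = "bold"),
              legend.position = "bottom") +
        labs(title = "PRICE vs AREA BY DISTRICT",
             x = "area (m2)", y = "log10(price)")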

The chart above shows that flat prices have a roughly linear relationship with flat area. This is to be expected. Bratislava IV also looks tightly clustered around the model line, suggesting a stronger relationship in that part of the market. Is this a mere coincidence or a hidden insight?

Next, let's fit a linear model using the data we have assembled so far. We will predict apartment prices using the traditional set of variables data scientists usually have at hand. The following chunk of code is a direct copy of Max Kuhn's R code presented at the useR! 2018 workshop: https://github.com/topepo/user2018.

set.seed(4595)

df_to_model <- df_ads  %>%
        drop_na() %>%
        select(-url_saved,-listed, -price_per_sqm) %>% 
        mutate(no_of_bedrooms = no_of_bedrooms %>% as.factor(),
               district = district %>% as.factor(),
               street = street %>% as.factor())

# split 

library(rsample)
library(caret)

set.seed(4595)
data_split <- initial_split(df_to_model, strata = "price")

df_train <- training(data_split)
df_test  <- testing(data_split)

cv_splits <- vfold_cv(df_train, v = 10, strata = "price")


# recipes
library(recipes)

model_recipe <- recipe(price ~ . - id, data = df_train ) %>% 
                step_other(street, threshold = .0125) %>% 
                step_dummy(street, district, no_of_bedrooms)  %>%      
                step_log(price, base = 10) %>%                         
                step_BoxCox(area) 

# create processed databases
preped_model <- model_recipe %>% prep()

x_train_processed_tbl <- bake(preped_model, df_train) 
x_test_processed_tbl  <- bake(preped_model, df_test)


# apply recipe on CV splits

cv_splits <- cv_splits %>%
        mutate(ads_rec = map(splits, prepper, recipe = model_recipe, retain = T))  # prep the recipe within each CV split

# apply lm function to CV splits

lm_fit_rec <- function(rec_obj, ...) {
        lm(..., data = juice(rec_obj))
        
}
 
cv_splits <- cv_splits %>% mutate(fits = map(ads_rec, lm_fit_rec, price ~.)) 


# generate predictions based on the splits
assess_predictions <- function(split_obj, rec_obj, mod_obj) {
  raw_data <- assessment(split_obj)
  proc_x <- bake(rec_obj, newdata = raw_data, all_predictors())
  bake(rec_obj, newdata = raw_data, everything()) %>%
    mutate(.fitted = predict(mod_obj, newdata = proc_x),
      .resid = price - .fitted,  
      .row = as.integer(split_obj, data = "assessment"))
}

cv_splits <- cv_splits %>%
  mutate(pred = pmap(lst(split_obj = cv_splits$splits,
                         rec_obj   = cv_splits$ads_rec,
                         mod_obj   = cv_splits$fits),
                     assess_predictions))
        

# measure performance
library(yardstick)

# Compute the summary statistics
map_df(cv_splits$pred, metrics, truth = price, estimate = .fitted) %>% 
  colMeans %>% 
  knitr::kable()
rmse 0.1095630
rsq 0.6380458

The code above broadly performs the following steps:

  • splitting the datasets into 10 equal-sized subsets (cross-validation)
  • recipe step 1: collapse infrequent street names into “other”
  • recipe step 2: create a binary column (0 or 1) for each level of the factor variables no_of_bedrooms, district and street
  • recipe step 3: log transformation of price
  • recipe step 4: Box-Cox transformation of area variable

After splitting the dataset and fitting the model on each fold, the results show that the regression R-squared is around 0.63. This means around 63% of the price variance can be explained using the variables at hand.

The detailed breakdown of the statistical model can be seen here:

        cv_splits$fits[[1]] %>% summary()
## 
## Call:
## lm(formula = ..1, data = juice(rec_obj))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83469 -0.05880 -0.00526  0.04528  0.78919 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.068e+00  9.211e-02  44.165  < 2e-16 ***
## id                        -3.180e-08  1.013e-08  -3.140 0.001726 ** 
## area                       4.013e-01  1.736e-02  23.124  < 2e-16 ***
## street_Bystrická          -9.793e-02  3.458e-02  -2.832 0.004690 ** 
## street_Pod.Krásnou.hôrkou -4.032e-02  2.695e-02  -1.496 0.134869    
## street_Račianska          -7.518e-02  3.078e-02  -2.442 0.014722 *  
## street_Šancová            -1.062e-01  3.144e-02  -3.377 0.000753 ***
## street_other              -4.197e-02  2.150e-02  -1.952 0.051086 .  
## district_Bratislava.II    -1.310e-01  9.353e-03 -14.002  < 2e-16 ***
## district_Bratislava.III   -7.823e-02  1.061e-02  -7.371 2.94e-13 ***
## district_Bratislava.IV    -1.299e-01  1.022e-02 -12.711  < 2e-16 ***
## district_Bratislava.V     -1.340e-01  1.184e-02 -11.321  < 2e-16 ***
## no_of_bedrooms_X2          1.568e-02  1.118e-02   1.403 0.160956    
## no_of_bedrooms_X3          1.133e-02  1.396e-02   0.812 0.417095    
## no_of_bedrooms_X4          1.432e-02  1.768e-02   0.810 0.417916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1102 on 1350 degrees of freedom
## Multiple R-squared:  0.636,  Adjusted R-squared:  0.6322 
## F-statistic: 168.5 on 14 and 1350 DF,  p-value: < 2.2e-16

Now let's visualise the model fit:

cv_splits %>%
        unnest(pred) %>% 
        mutate(pred_accuracy = ntile(.resid, 5)) %>% 
        ggplot(aes(price, .fitted, col = as.factor(pred_accuracy))) +
        geom_point() + 
        geom_abline() + 
        scale_color_brewer(palette = "RdYlGn") +
        theme_minimal() +
        labs(title = "MODEL ACCURACY: Fitted vs Real Prices",
                subtitle = "Price and prediction in a relationship") + 
        theme(legend.position = "bottom") + 
        theme(plot.title = element_text(face = "bold")) +
        geom_smooth(se = F, color = "red")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The points around the line (pred_accuracy 3) are the model's most accurate estimates, where the residual values are small. The other colours show listings that appear overvalued (pred_accuracy 5) or undervalued (pred_accuracy 1) relative to the model. The red line indicates that low prices are overpredicted and high prices are underpredicted by the linear model.

An even better understanding is offered by an interactive map of Bratislava with the original prices and the linear model results:

Finally, here are the 10 most undervalued streets in the Slovak capital (a sketch of how such a ranking can be derived from the cross-validated predictions follows):
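
The original table is not reproduced here. The sketch below is one way to build such a ranking under a few assumptions that are not part of the original code: the predictions are joined back to df_train via the stored .row index to recover the street names, streets with fewer than 5 listings are dropped, and "undervalued" is taken to mean a negative mean residual (asking price below the model's prediction).

# a sketch, not the original code: rank streets by mean residual
undervalued_streets <- cv_splits %>%
        unnest(pred) %>%
        select(.row, .resid) %>%
        left_join(df_train %>% mutate(.row = row_number()),  # recover street names
                  by = ".row") %>%
        group_by(street) %>%
        summarise(mean_resid = mean(.resid),
                  listings   = n()) %>%
        filter(listings >= 5) %>%     # assumed cut-off to avoid one-off streets
        arrange(mean_resid) %>%       # most undervalued (most negative residual) first
        slice(1:10)

undervalued_streets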

Further work and final considerations

In summary, we have successfully scraped and analysed the residential property market in Bratislava. With ca 2.000 ads we found a linear model that can explain ca 63% of the price variance. For real-world applications this result is too weak. The problem needs more advanced statistical models and possibly more data (distance to the city centre, number of public transport options, etc.) to drive the accuracy higher.