Michal Kapusta

Investment Research | Finance | R programming

What can hotel landlords learn from guest reviews? A use case from the booking.com dataset

Background Information

In this article, I will analyze textual data from hotel reviews. The analysis is aimed at real estate investors and landlords, since guest reviews offer insight into hotel performance. Social media and the smartphone revolution have made it easy for guests to write reviews, comments and travel blogs. Nowadays, every single guest is kindly asked to write a review of their stay. These reviews are stored on websites. Several websites (tripadvisor.com, booking.com) have reached a critical size and contain extensive data on first-hand guest experiences. This textual data is valuable but hard to extract. A human would struggle to read thousands of reviews and then write an unbiased analysis. A computer can.

This article is broken into three chapters:

  • Chapter #1 What can we learn from guest score distribution?
  • Chapter #2 Can we spot what’s wrong with a group of hotels using one chart?
  • Chapter #3 Recommendation for property owner: Strand Palace Hotel reviews analysis.

What can we learn from guest score distribution?

The data used in this analysis is available on www.kaggle.com. The dataset of interest comes from www.booking.com and contains nearly 500,000 reviews of approx. 1,500 hotels (Paris, Barcelona, Vienna, London and Milan).

Now, let’s look at all the hotels displayed on a map.

Glancing over the map, we can see that the dataset contains hotels located mostly in city centers. Next, let’s look at the distribution of hotel scores given by business and leisure guests. Bear in mind that 10 points is the highest score a guest can give and 1 is the lowest.

The chart shows that business guests are more critical, with an average score of 8.19, while leisure guests scored their trips 8.53 on average. Additionally, the distribution of business guest scores is slightly “wider”. This suggests that the hotel experience of business guests may be less consistent. As a hotel operator, you always prefer consistent (high) results, so a narrow distribution is desired.
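To make "average" and "width" concrete, here is a minimal sketch of how the two numbers per guest type could be computed. It is written in Python rather than R, and the scores are made-up illustration values, not rows from the booking.com dataset; the names `reviews` and `summarize` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (guest_type, score) pairs, NOT real dataset values.
reviews = [
    ("business", 9.2), ("business", 6.5), ("business", 8.8), ("business", 7.1),
    ("leisure", 8.8), ("leisure", 8.3), ("leisure", 9.0), ("leisure", 8.5),
]

def summarize(rows):
    """Return {guest_type: (mean score, sample standard deviation)}."""
    groups = defaultdict(list)
    for guest_type, score in rows:
        groups[guest_type].append(score)
    return {g: (mean(s), stdev(s)) for g, s in groups.items()}

summary = summarize(reviews)
for guest_type, (avg, spread) in summary.items():
    # A larger standard deviation corresponds to a "wider" distribution.
    print(f"{guest_type}: mean={avg:.2f}, sd={spread:.2f}")
```

The standard deviation is only one possible way to quantify how wide a distribution is; the chart itself shows the full shape.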

Now, let’s find out whether the distribution differs across cities.

From the chart above we can see that the business guest review score varies between 7.9 (Milan) and 8.35 (Vienna). Additionally, the leisure guests consistently score their hotel stay higher than business travelers. Interesting!

Now, since we have the reviewer nationality data, is there a country-specific pattern?

As expected, each group behaves slightly differently. Business travelers from the MENA region are more critical (their average score is just 7.8 points) than guests from the USA or the APAC region.

Now, let’s look at the hotels in Europe that received the highest scores from business guests.

Surprisingly, four of the top ten hotels are in Barcelona. The Serras Hotel received the best average score. Only two London hotels (Soho Hotel and Rosewood) made it into the top ten. Interestingly, three hotels in Vienna made it as well.

Now, let’s look at the hotels with the highest business guest scores by the guests’ country of origin. As we learned, country groups behave differently. This could also mean they prefer to stay in different hotels. Let’s look only at hotels in the London market.

Interestingly, MENA travelers give their highest scores to Park Plaza, Wembley Holiday Inn and Adria Hotels.

Can we spot what’s wrong with the hotel itself?

Now, let’s move away from the guest score data and focus on the actual text. The words used in a review are a great source of unstructured data. Open-source programming languages (like R) offer specialized tools (packages/libraries) that make working with text easy. One suitable tool for the job is the tidytext package, developed by Julia Silge and David Robinson (among other authors).

The process of tidying the unstructured text data includes the following steps:

  • selecting the ten most frequently reviewed hotels and focusing on the negative reviews
  • cutting the reviews into individual words
  • removing the stopwords
  • applying inverse document frequency weighting (tf-idf) to find the words that are distinctive for a particular hotel

This process creates a table containing the words most strongly associated with each hotel within the peer group.
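The steps above can be sketched in a few lines. This is a Python stand-in, not the tidytext code behind the chart, and it assumes the usual definition: term frequency (share of a hotel's words) multiplied by the natural log of the number of hotels divided by the number of hotels that use the word. The reviews, the stopword list and the names `tokenize` and `tf_idf` are all made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a full one.
STOPWORDS = {"the", "a", "was", "were", "and", "all"}

# Hypothetical negative reviews keyed by hotel, NOT taken from the dataset.
negative_reviews = {
    "Hotel A": ["The dlr train was noisy", "Train noise all night"],
    "Hotel B": ["The floorboards were noisy", "Blinds broken and the room was small"],
    "Hotel C": ["Breakfast was expensive", "The room was small and noisy"],
}

def tokenize(text):
    """Lowercase, split into words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def tf_idf(docs):
    """Treat each hotel as one document and score every word in it."""
    counts = {hotel: Counter(w for review in reviews for w in tokenize(review))
              for hotel, reviews in docs.items()}
    n_docs = len(counts)
    # Number of hotels in which each word occurs at least once.
    doc_freq = Counter(w for c in counts.values() for w in c)
    scores = {}
    for hotel, c in counts.items():
        total = sum(c.values())
        scores[hotel] = {w: (n / total) * math.log(n_docs / doc_freq[w])
                         for w, n in c.items()}
    return scores

scores = tf_idf(negative_reviews)
for hotel, s in scores.items():
    print(hotel, sorted(s, key=s.get, reverse=True)[:3])
```

A word that every hotel's guests use (here "noisy") gets a score of zero, so only the words that set one hotel apart from its peers rise to the top.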

The chart above shows that guests of the Britannia International Hotel Canary Wharf frequently used words like “dlr” or “train” in their reviews. After a quick Google search, I understand that the hotel is located next to train tracks. This location is a critical issue. As the real estate world says: you can fix the building, but not the location. Similar to the Britannia, the Copthorne hotel is adjacent to train tracks. Guests of the Grand Royale in Hyde Park had problems with loud floorboards, the basement, boilers and blinds. Such faults can be fixed by a decent refurbishment.

Recommendation for property owner: Strand Palace Hotel reviews analysis.

After the group review, I will focus on one particular hotel: the Strand Palace Hotel in London. I will try to extract as much information as possible from the guest reviews.

Firstly, how is the average score changing over time? Are the scores consistent?

The chart above shows the reviewer score over time, split between business and leisure guests. Both groups show a drop in the score in summer 2017. Business guests are more critical overall, but the fall in the score is concerning.
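A time series like this can be built by averaging the scores per month and guest type. The sketch below is in Python and uses invented rows; the dates, scores and the helper name `monthly_average` are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical (review date, guest type, score) rows, NOT real dataset values.
reviews = [
    (date(2017, 5, 3), "business", 8.3),
    (date(2017, 5, 20), "business", 7.9),
    (date(2017, 7, 11), "business", 6.7),
    (date(2017, 7, 25), "business", 7.1),
    (date(2017, 5, 9), "leisure", 8.8),
    (date(2017, 7, 14), "leisure", 7.9),
]

def monthly_average(rows):
    """Average score per (year-month, guest type), sorted chronologically."""
    buckets = defaultdict(list)
    for day, guest_type, score in rows:
        buckets[(day.strftime("%Y-%m"), guest_type)].append(score)
    return {key: round(mean(scores), 2) for key, scores in sorted(buckets.items())}

monthly = monthly_average(reviews)
for (month, guest_type), avg in monthly.items():
    print(month, guest_type, avg)
```

Plotting one line per guest type over these monthly averages gives the kind of chart described above.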

Now let’s look at the reviews themselves.

The sample of 4,120 reviews is a challenge to tidy for further analysis. Here we use the following tidytext steps to extract insights:

  • step 1: cut the text into single words
  • step 2: remove stopwords
  • step 3: apply a sentiment lexicon (in this case the Bing lexicon)
  • step 4: apply pairwise correlation calculations to find highly correlated words

The first two steps yield a filtered selection of words, arranged by frequency.

text        n
negative    828
small       658
breakfast   498
hotel       469
t           456
rooms       421
bed         405
staff       277
bathroom    256
nothing     256

In this case, the words negative, small and breakfast appear at the top. Since all the analyzed reviews are negative, the word “negative” carries no information and will be removed from the analysis.

Let’s visualize the words using a basic chart.

The chart displays the most frequent words by sentiment. It seems the words noisy, expensive and cold are often used to describe the hotel stay. This is important for property operators, since it is direct feedback from customers. Such feedback can lead to a precise refurbishment project aimed at removing the most critical issues.
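Applying a sentiment lexicon boils down to a lookup: every word that appears in the lexicon is tagged with its sentiment and then counted. Here is a minimal Python sketch of that idea. The five-word `LEXICON` is a tiny hand-made stand-in, not the real Bing lexicon, and the reviews are invented.

```python
import re
from collections import Counter

# Tiny hand-made stand-in for a sentiment lexicon (NOT the real Bing list).
LEXICON = {
    "noisy": "negative", "expensive": "negative", "cold": "negative",
    "clean": "positive", "friendly": "positive",
}

# Hypothetical review texts.
reviews = [
    "Room was noisy and cold",
    "Expensive breakfast, noisy street",
    "Staff friendly but room cold and noisy",
]

def sentiment_counts(texts):
    """Count (sentiment, word) pairs for every word found in the lexicon."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in LEXICON:
                counts[(LEXICON[word], word)] += 1
    return counts

counts = sentiment_counts(reviews)
for (sentiment, word), n in counts.most_common():
    print(sentiment, word, n)
```

Words that are not in the lexicon (room, staff, breakfast) are simply skipped, which is why the chart contains only sentiment-bearing words.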

Finally, let’s look at pairs of words that tend to appear together in the same reviews, i.e. words that are highly correlated.

The chart reveals the issues that hotel guests comment on most often. The first group of issues revolves around words like door, walls, thin or hear. This set of words suggests that the walls between the rooms are thin and that noise from neighboring rooms disturbs the guests. The next group of words is connected to the words air and window. It appears that the air conditioning and the view outside are issues.
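The article does not state which correlation measure sits behind the chart. A common choice for "do these two words show up in the same reviews?" is the phi coefficient, and the Python sketch below assumes that measure. The reviews and the function name `phi` are made up for illustration.

```python
import math
import re

# Hypothetical review texts, NOT taken from the dataset.
reviews = [
    "thin walls could hear the door slam",
    "walls are thin and you hear everything",
    "air conditioning broken and window stuck",
    "no air in the room and no view from the window",
    "breakfast was expensive",
]

def phi(word_x, word_y, texts):
    """Phi coefficient of two words based on presence/absence per review."""
    docs = [set(re.findall(r"[a-z]+", t.lower())) for t in texts]
    n11 = sum(1 for d in docs if word_x in d and word_y in d)      # both words
    n10 = sum(1 for d in docs if word_x in d and word_y not in d)  # only x
    n01 = sum(1 for d in docs if word_x not in d and word_y in d)  # only y
    n00 = sum(1 for d in docs if word_x not in d and word_y not in d)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If a word is in every review or in none, the coefficient is undefined;
    # this sketch returns 0.0 in that case.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

print(phi("thin", "walls", reviews))
print(phi("thin", "air", reviews))
```

Values close to 1 mean the two words almost always occur together, which is what links door, walls, thin and hear into one cluster; values below zero mean the words tend to avoid each other.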


In this article, I have analyzed approx. 500,000 hotel reviews written by hotel guests. Using tidytext principles and distribution analysis, I have found that:

  • Business guests are harder to satisfy (in general, across the cities)
  • Business guest scores are highest in Paris and Vienna
  • The nationality of the reviewer plays an important role (overall rating, hotel choices)
  • The Strand Palace Hotel analysis shows that the overall rating dropped in summer 2017. Air conditioning, thin walls, the view outside, long check-in and double-booked rooms are often mentioned