library(tidyverse)
library(skimr)
Esercizio di data wrangling
Si legga in R il file di dati hotels.csv
Exercizi
Exercizio 1.
Dai uno sguardo ai dati con una nuova funzione, skim()
.
Nota: Una definizione delle variabili è alla fine, nella sezione Data dictionary (anche se solo alcune variabili sono usate negli esercizi che seguono).
skim(hotels)
Name | hotels |
Number of rows | 119390 |
Number of columns | 32 |
_______________________ | |
Column type frequency: | |
character | 13 |
Date | 1 |
numeric | 18 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
hotel | 0 | 1 | 10 | 12 | 0 | 2 | 0 |
arrival_date_month | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
meal | 0 | 1 | 2 | 9 | 0 | 5 | 0 |
country | 0 | 1 | 2 | 4 | 0 | 178 | 0 |
market_segment | 0 | 1 | 6 | 13 | 0 | 8 | 0 |
distribution_channel | 0 | 1 | 3 | 9 | 0 | 5 | 0 |
reserved_room_type | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
assigned_room_type | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
deposit_type | 0 | 1 | 10 | 10 | 0 | 3 | 0 |
agent | 0 | 1 | 1 | 4 | 0 | 334 | 0 |
company | 0 | 1 | 1 | 4 | 0 | 353 | 0 |
customer_type | 0 | 1 | 5 | 15 | 0 | 4 | 0 |
reservation_status | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
reservation_status_date | 0 | 1 | 2014-10-17 | 2017-09-14 | 2016-08-07 | 926 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
is_canceled | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1 | 1 | ▇▁▁▁▅ |
lead_time | 0 | 1 | 104.01 | 106.86 | 0.00 | 18.00 | 69.00 | 160 | 737 | ▇▂▁▁▁ |
arrival_date_year | 0 | 1 | 2016.16 | 0.71 | 2015.00 | 2016.00 | 2016.00 | 2017 | 2017 | ▃▁▇▁▆ |
arrival_date_week_number | 0 | 1 | 27.17 | 13.61 | 1.00 | 16.00 | 28.00 | 38 | 53 | ▅▇▇▇▅ |
arrival_date_day_of_month | 0 | 1 | 15.80 | 8.78 | 1.00 | 8.00 | 16.00 | 23 | 31 | ▇▇▇▇▆ |
stays_in_weekend_nights | 0 | 1 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 2 | 19 | ▇▁▁▁▁ |
stays_in_week_nights | 0 | 1 | 2.50 | 1.91 | 0.00 | 1.00 | 2.00 | 3 | 50 | ▇▁▁▁▁ |
adults | 0 | 1 | 1.86 | 0.58 | 0.00 | 2.00 | 2.00 | 2 | 55 | ▇▁▁▁▁ |
children | 4 | 1 | 0.10 | 0.40 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
babies | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
is_repeated_guest | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0 | 1 | ▇▁▁▁▁ |
previous_cancellations | 0 | 1 | 0.09 | 0.84 | 0.00 | 0.00 | 0.00 | 0 | 26 | ▇▁▁▁▁ |
previous_bookings_not_canceled | 0 | 1 | 0.14 | 1.50 | 0.00 | 0.00 | 0.00 | 0 | 72 | ▇▁▁▁▁ |
booking_changes | 0 | 1 | 0.22 | 0.65 | 0.00 | 0.00 | 0.00 | 0 | 21 | ▇▁▁▁▁ |
days_in_waiting_list | 0 | 1 | 2.32 | 17.59 | 0.00 | 0.00 | 0.00 | 0 | 391 | ▇▁▁▁▁ |
adr | 0 | 1 | 101.83 | 50.54 | -6.38 | 69.29 | 94.58 | 126 | 5400 | ▇▁▁▁▁ |
required_car_parking_spaces | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 8 | ▇▁▁▁▁ |
total_of_special_requests | 0 | 1 | 0.57 | 0.79 | 0.00 | 0.00 | 0.00 | 1 | 5 | ▇▁▁▁▁ |
Exercizio 2.
La gente viaggia per capriccio, per un colpo di testa? Vediamo …
Estrai le prenotazioni che non sono state fatte dagli USA (country
code "USA"
) e il cui lead_time
è meno di 1 giorno.
%>%
hotels filter(
"USA",
country ____
lead_time ____ ____ )
Exercizio 3.
Quante prenotazioni coinvolgono almeno 1 child or baby?
Nel codice che segue, sostituisci con l’operatore logico corretto:
[AT LEAST]
[OR]
%>%
hotels filter(
1 [OR] babies [AT LEAST] 1
children [AT LEAST] )
Exercizio 4.
Pensi sia più probabile trovare prenotazioni con children o babies in city hotels o in resort hotels? Testa il tuo intuito. Usa filter()
per determinare il numero di prenotazioni in resort hotels che hanno almeno 1 child or baby nella stanza. Quindi, fai lo stesso per city hotels, e confronta il numero di righe nei dataframe così filtrati.
# add code here
# pay attention to correctness and code style
# add code here
# pay attention to correctness and code style
Exercizio 5.
Crea una tabella di frequenza del numero di adults
in una prenotazione. Mostra i risultati in ordine discendente così che l’osservazione più frequente sia in cima. Qual è il numero di adulti più comune nelle prenotazioni in questo dataset? Ti sorprendono i risultati?
# add code here
# pay attention to correctness and code style
Exercizio 6.
Ripeti l’ Esercizio 5, una volta per le prenotazioni cancellate (is_canceled
coded as 1) e una volta per le prenotazioni non cancellate (is_canceled
coded as 0). Cosa questo ti rivela sui risultati ottenuti al punto precedente?
# add code here
# pay attention to correctness and code style
Exercizio 7.
Si calcoli il minimo, la media, la mediana, e il massimo di average daily rate (adr
) aggreggato per tipo di hotel
così da avere le statistiche separate per resort e city hotels. In quale tipo di hotel adr
è più elevato, in media?
# add code here
# pay attention to correctness and code style
Exercizio 8.
Si osservano due valori insoliti nelle statistiche riassuntive di cui sopra: un minimo negativo e un massimo molto alto. Che tipologie di hotel sono queste? Individua queste osservazioni nel dataset e scopri la data di arrivo (anno e mese) e quante persone (adulti, bambini e neonati) hanno soggiornato nella stanza. Puoi esaminare i dati nel viewer per individuare questi valori, ma preferibilmente dovresti identificarli in modo riproducibile con un codice.
Hint: Ad esempio, puoi filter
per i dati ammontari di adr
e select
le colonne d’interesse.
# add code here
# pay attention to correctness and code style
Data dictionary
variable | class | description |
---|---|---|
hotel | character | Hotel (H1 = Resort Hotel or H2 = City Hotel) |
is_canceled | double | Value indicating if the booking was canceled (1) or not (0) |
lead_time | double | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
arrival_date_year | double | Year of arrival date |
arrival_date_month | character | Month of arrival date |
arrival_date_week_number | double | Week number of year for arrival date |
arrival_date_day_of_month | double | Day of arrival date |
stays_in_weekend_nights | double | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
stays_in_week_nights | double | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
adults | double | Number of adults |
children | double | Number of children |
babies | double | Number of babies |
meal | character | Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner) |
country | character | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
market_segment | character | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
distribution_channel | character | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
is_repeated_guest | double | Value indicating if the booking name was from a repeated guest (1) or not (0) |
previous_cancellations | double | Number of previous bookings that were cancelled by the customer prior to the current booking |
previous_bookings_not_canceled | double | Number of previous bookings not cancelled by the customer prior to the current booking |
reserved_room_type | character | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
assigned_room_type | character | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
booking_changes | double | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
deposit_type | character | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay. |
agent | character | ID of the travel agency that made the booking |
company | character | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
days_in_waiting_list | double | Number of days the booking was in the waiting list before it was confirmed to the customer |
customer_type | character | Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking |
adr | double | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
required_car_parking_spaces | double | Number of car parking spaces required by the customer |
total_of_special_requests | double | Number of special requests made by the customer (e.g. twin bed or high floor) |
reservation_status | character | Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why |
reservation_status_date | double | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |