Prepare your final presentation with RMarkdown. Any output format (pdf, html, word, slides, etc.) is acceptable.
Email your presentation to the professors the day before the exam.
Your project discussion should not last longer than 25 minutes! Staying within the time limit will be evaluated.
A list of possible datasets for the final exam is provided below. They are ordered roughly by increasing difficulty. You are free to choose any of them; however, one project cannot be selected by more than four students in a year. You can book your project choice here. The link is also available on the Moodle page.
Choose one project from the list. Alternatively, you may propose your own dataset by contacting the professors of the course via email. You must bring a printed copy of your final work to the oral examination.
The R package Ecdat contains the Males dataset about wages and several other variables for a sample of 545 individuals observed over a longitudinal study of 8 years (1980-1987), for a total sample size of 4360 observations. The data have been used by several scholars, notably
Vella, F. and M. Verbeek (1998). Whose wages do unions raise? A dynamic model of unionism and wage. Journal of Applied Econometrics, 13, 163-183.
The help file description of the 12 variables in the dataset is as follows:
nr: subject identifier
year: year (1980-1987)
school: years of schooling
exper: years of experience (= age - 6 - school)
union: factor; whether wage was set by collective bargaining
ethn: factor with 3 levels
maried: factor; marital status
health: factor; presence of health problems
wage: log of hourly wage
industry: factor with 12 levels
occupation: factor with 9 levels
residence: factor with 4 levels
After performing some exploratory analyses:
Build a model for wage with the rstan package, taking into account the hierarchical structure of the data. Pay particular attention to the relation between unionism and wage.
The R package SemiPar contains the milan.mort dataset on the short-term effect of air pollution on mortality. The data comprise 3652 observations on 9 variables, whose description can be found in the help file. The data are also analysed in the book by Ruppert, Wand and Carroll (2003). The original reference is
Vigotti, M.A., Rossi, G., Bisanti, L., Zanobetti, A. and Schwartz, J. (1996). Short term effect of urban air pollution on respiratory health in Milan, Italy, 1980-1989. Journal of Epidemiology and Community Health, 50, 71-75.
After performing some exploratory analyses:
Using total.mort (or a suitable transformation of it) as a normally distributed response variable, build a model for the average number of deaths, checking whether some of the covariates may have a nonlinear effect (do not consider the resp.mort variable). Follow a Bayesian approach for the task and check the model fit via posterior predictive checks.
[...] total.mort, comparing the fitted response values with those obtained previously.
The R package mlmRev contains the ScotsSec dataset on scores attained by Scottish secondary school students on a standardized test taken at age 16. The data include 3435 observations on 6 variables. The help file description is as follows:
verbal: the verbal reasoning score on a test taken by the students on entry to secondary school
attain: the score attained on the standardized test taken at age 16
primary: a factor indicating the primary school that the student attended
sex: a factor with levels M and F
social: the student's social class on a numeric scale from low to high social class
second: a factor indicating the secondary school that the student attended
After performing some exploratory analyses:
Define the binary variable attain01, which takes value 1 if attain is greater than 5 and 0 otherwise. Build a model for studying the effects of the covariates on attain01 with rstan, taking into account the hierarchical structure of the data.
[...] attain (stan fit is not required).
The R package engsoccerdata contains many datasets with results for National Leagues, European Cup and Champions League matches (including qualifiers) from 1871 to 2016. The dataset italy consists of 25404 match-observations on 8 variables. The help file description is as follows:
Date: date of match
Season: season of match (refers to starting year)
home: home team
visitor: visiting team
FT: full-time result at 90 minutes
hgoal: home goals at full time (90 minutes)
vgoal: visitor goals at full time (90 minutes)
tier: tier of football pyramid: 1
The reference
Baio, G., Blangiardo, M. (2010). Bayesian hierarchical model for the prediction of football results. Journal of Applied Statistics, 37, 253–264
explains how to model two independent Poisson distributions for the number of the goals scored by two teams in a Bayesian framework.
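As a reminder of the structure involved, the basic specification of that paper can be sketched as below. The notation is a paraphrase and should be checked against the original: g indexes matches, h(g) and a(g) denote the home and away team of match g, and T is the number of teams. The paper works in a BUGS-style parameterization where the second argument of the normal distribution is a precision, whereas Stan uses a standard deviation.

```latex
\begin{align*}
y_{g1} \mid \theta_{g1} &\sim \mathrm{Poisson}(\theta_{g1}), &
y_{g2} \mid \theta_{g2} &\sim \mathrm{Poisson}(\theta_{g2}), \\
\log \theta_{g1} &= \mathit{home} + \mathit{att}_{h(g)} + \mathit{def}_{a(g)}, &
\log \theta_{g2} &= \mathit{att}_{a(g)} + \mathit{def}_{h(g)}, \\
\mathit{att}_t &\sim \mathrm{N}(\mu_{att}, \tau_{att}), &
\mathit{def}_t &\sim \mathrm{N}(\mu_{def}, \tau_{def}), \\
\sum_{t=1}^{T} \mathit{att}_t &= 0, &
\sum_{t=1}^{T} \mathit{def}_t &= 0.
\end{align*}
```

Here y_{g1} and y_{g2} correspond to hgoal and vgoal in the italy dataset, and the two counts are conditionally independent given the team-specific attack and defence effects.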
[...] rstan package, trying to replicate the model of the paper.
Download the data on the Covid-19 outbreak from the official repository of Protezione Civile, by using the following command:
read.csv("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")
The dataset contains the following variables:
data: date of notification
stato: country of reference
codice_regione: code of the Region (ISTAT 2019)
denominazione_regione: name of the Region
lat: latitude
long: longitude
ricoverati_con_sintomi: hospitalised patients with symptoms
terapia_intensiva: intensive care
totale_ospedalizzati: total hospitalised patients
isolamento_domiciliare: home confinement
totale_positivi: total number of current positive cases (hospitalised patients + home confinement)
variazione_totale_positivi: change in the number of current positive cases (totale_positivi current day - totale_positivi previous day)
nuovi_positivi: new positive cases (totale_casi current day - totale_casi previous day)
dimessi_guariti: recovered
deceduti: deaths
totale_casi: total number of positive cases
tamponi: tests performed
casi_testati: total number of people tested
Consider your dataset until 1 April 2020. After performing some exploratory analyses:
Build a model for nuovi_positivi with the rstan package. A Poisson distribution is acceptable, but you may explore other distributions.
Evaluate the inclusion of the following covariates:
Study the temporal trend of your selected response variable.
Check the fit of your final model by using posterior predictive checking tools and comment.
[optional] Provide predictions 3 to 4 days ahead.
[optional] Compare alternative models in terms of predictive information criteria and comment.
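A minimal sketch of the data preparation for the Covid-19 project, restricting the data to 1 April 2020 and adding a day counter that could serve as a time covariate. The helper name prepare_covid and the new columns date and day are hypothetical, not part of the assignment, and the code assumes that the data column holds timestamps whose first 10 characters are the calendar date (for example "2020-02-24T18:00:00"). The toy data frame uses made-up values so that the example runs without a network connection.

```r
# Hypothetical helper (not part of the assignment): keep observations up to a
# cutoff date and add a day counter usable as a time covariate.
prepare_covid <- function(df, cutoff = as.Date("2020-04-01")) {
  # Assumption: `data` holds timestamps such as "2020-02-24T18:00:00",
  # so the first 10 characters give the calendar date.
  df$date <- as.Date(substr(as.character(df$data), 1, 10))
  # Keep only rows on or before the cutoff date.
  df <- df[df$date <= cutoff, , drop = FALSE]
  # Days elapsed since the first date remaining in the filtered data.
  df$day <- as.integer(df$date - min(df$date))
  df
}

# Toy data frame mimicking two columns of the real file (values are made up).
toy <- data.frame(
  data = c("2020-03-30T17:00:00", "2020-03-31T17:00:00",
           "2020-04-01T17:00:00", "2020-04-02T17:00:00"),
  nuovi_positivi = c(10L, 12L, 9L, 11L)
)
toy_prepared <- prepare_covid(toy)
```

With the real file, the same helper would be applied to the data frame returned by the read.csv command given above. Since the file has one row per region and day, the day counter is shared by all regions observed on the same date.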