--- title: "Clustering automobile models" output: pdf_document: default html_document: df_print: paged --- The file ``cars.csv`` contains the characteristics (type of engine, vehicle size, and performance) of different models of petrol cars (the sources are the websites of the car manufacturers and, in some cases, specialized publications): ```{r} D<-read.csv("cars2.csv", header = TRUE, row.names=1) dim(D) head(D) ``` We first briefly examine the data. We notice that the variables have vastly different means ```{r} apply(D , 2, mean) ``` and variances ```{r} apply(D , 2, var) ``` We therefore proceed to standardize the variables ```{r} D.sd<-scale(D) head(D.sd) ``` Next, we perform hierarchical clustering using \texttt{hclust()} and Euclidean distance as the dissimilarity measure: ```{r} hc.sing<-hclust(dist(D.sd), method="single") plot(hc.sing, cex=0.5) hc.ave<-hclust(dist(D.sd), method="average") plot(hc.ave, cex=0.5) hc.com<-hclust(dist(D.sd), method="complete") plot(hc.com, cex=0.5) ``` ```{r} library(cluster) hc.ward<-agnes(D.sd, method="ward") pltree(hc.ward, cex=0.5, hang = -1) # rect.hclust(hc.ward, k = 3, border = 2:4) ``` Using the results from Ward's method, we obtain the partitions with 3 and 5 clusters ```{r} cl3<-cutree(hc.ward, 3) cl5<-cutree(hc.ward, 5) ``` ```{r} # find out members of 3 clusters table(cl3) labels<-row.names(D) labels[cl3==1] labels[cl3==2] labels[cl3==3] ``` The clusters are characterized as follows: - a large group of city cars, utility cars and minivans - a medium size group of sportive cars, S.U.V.s and off-road vehicles - a small third group of luxury cars Refining the partition and considering 5 groups, we see that the two more heterogeneous groups are split up giving rise to a clear separation between small cars and minivans in the first cluster, and in the second cluster between sportive cars on the one hand, S.U.V.s and off-road vehicles on the other. Finally, we use the ```pam()``` function to perform Partitioning around medoids method, which provides a visualization similar to that of k-Means but returns the cluster medoids that are representative observations of the group they belong to, helping understanding the differences between groups. ```{r} # with 3 groups pam.out3<-pam(D.sd,3) pam.out3$silinfo$avg.width pam.out3$medoids #D[pam.out3$id.med, ] # with 5 groups pam.out5<-pam(D.sd,5) pam.out5$silinfo$avg.width #D[pam.out5$id.med, ] ``` The function ```plot()``` returns two types of visualization of results, according to the argument ```which```. When ```which=1```, the plot shows the 2 largest principal components and the data points (by default, labels are omitted) (green symbols) in terms of the value of the first and second principal component. It also shows the percentage of variability in the data that can be explained by the first 2 principal components; when ```which=2```, the silhouette plot is returned: ```{r} par(mfrow=c(1,2)) plot(pam.out3, which=1, main=" ") plot(pam.out3, which=2, main=" ") ``` ```{r} par(mfrow=c(1,2)) plot(pam.out5, which=1, main=" ") plot(pam.out5, which=2, main=" ") ``` Consider the partition in 3 groups. Let's add cluster labels to the original data ```{r} D$Group <- as.factor(pam.out3$clustering) ``` Finally, box plots of some variables for each group will be drawn ```{r} par(mfrow=c(2,2), mar=c(2,2,2,2)) plot(D$Group, D$pw, col=2:4, main="Power", xlab="cluster") plot(D$Group, D$speed, col=2:4, main="Max. Speed", xlab="cluster") plot(D$Group, D$acc, col=2:4, main="Acceleration", xlab="cluster") plot(D$Group, D$cons, col=2:4, main="Fuel Consumption", xlab="cluster") ```