--- title: "Clustering automobile models" output: pdf_document: default html_document: df_print: paged --- The file ``cars.csv`` contains the characteristics (type of engine, vehicle size, and performance) of different models of petrol cars (the sources are the websites of the car manufacturers and, in some cases, specialized publications): ```{r} D<-read.csv("cars2.csv", header = TRUE, row.names=1) dim(D) head(D) ``` We first briefly examine the data. We notice that the variables have vastly different means ```{r} apply(D , 2, mean) ``` and variances ```{r} apply(D , 2, var) ``` We therefore proceed to standardize the variables ```{r} D.sd<-scale(D) head(D.sd) ``` Next, we perform hierarchical clustering using \texttt{hclust()} and Euclidean distance as the dissimilarity measure: ```{r} hc.sing<-hclust(dist(D.sd), method="single") plot(hc.sing, cex=0.5) hc.ave<-hclust(dist(D.sd), method="average") plot(hc.ave, cex=0.5) hc.com<-hclust(dist(D.sd), method="complete") plot(hc.com, cex=0.5) ``` ```{r} library(cluster) hc.ward<-agnes(D.sd, method="ward") pltree(hc.ward, cex=0.5, hang = -1) # rect.hclust(hc.ward, k = 3, border = 2:4) ``` Using the results from Ward's method, we obtain the partitions with 3 and 5 clusters ```{r} cl3<-cutree(hc.ward, 3) cl5<-cutree(hc.ward, 5) ``` ```{r} # find out members of 3 clusters table(cl3) labels<-row.names(D) labels[cl3==1] labels[cl3==2] labels[cl3==3] ``` The clusters are characterized as follows: - a large group of city cars, utility cars and minivans - a medium size group of sportive cars, S.U.V.s and off-road vehicles - a small third group of luxury cars Refining the partition and considering 5 groups, we see that the two more heterogeneous groups are split up giving rise to a clear separation between small cars and minivans in the first cluster, and in the second cluster between sportive cars on the one hand, S.U.V.s and off-road vehicles on the other. Finally, we use the ```pam()``` function to perform Partitioning around medoids method, which provides a visualization similar to that of k-Means but returns the cluster medoids that are representative observations of the group they belong to, helping understanding the differences between groups. ```{r} # with 3 groups pam.out3<-pam(D.sd,3) pam.out3$silinfo$avg.width pam.out3$medoids #D[pam.out3$id.med, ] # with 5 groups pam.out5<-pam(D.sd,5) pam.out5$silinfo$avg.width #D[pam.out5$id.med, ] ``` The function ```plot()``` returns two types of visualization of results, according to the argument ```which```. When ```which=1```, the plot shows the 2 largest principal components and the data points (by default, labels are omitted) (green symbols) in terms of the value of the first and second principal component. It also shows the percentage of variability in the data that can be explained by the first 2 principal components; when ```which=2```, the silhouette plot is returned: <!-- The ellipses are the shape of the regions covered by the clusters in terms of the principal components. The "clusplot" is a visualization of how separable or not your clusters are (lots of overlap between the ellipses means they are not very well separated, so the clustering did not work very well) and how much redundant information is in your data set. --> ```{r} par(mfrow=c(1,2)) plot(pam.out3, which=1, main=" ") plot(pam.out3, which=2, main=" ") ``` ```{r} par(mfrow=c(1,2)) plot(pam.out5, which=1, main=" ") plot(pam.out5, which=2, main=" ") ``` Consider the partition in 3 groups. 
Let's add the cluster labels to the original data:

```{r}
D$Group <- as.factor(pam.out3$clustering)
```

Finally, we draw box plots of some of the variables for each group:

```{r}
par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
plot(D$Group, D$pw, col = 2:4, main = "Power", xlab = "cluster")
plot(D$Group, D$speed, col = 2:4, main = "Max. Speed", xlab = "cluster")
plot(D$Group, D$acc, col = 2:4, main = "Acceleration", xlab = "cluster")
plot(D$Group, D$cons, col = 2:4, main = "Fuel Consumption", xlab = "cluster")
```
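To complement the box plots, one could also summarize each cluster numerically. A short sketch (assuming the data frame `D` with the `Group` column created above) that computes the mean of every variable within each cluster:

```{r}
# mean of each variable within each PAM cluster
aggregate(. ~ Group, data = D, FUN = mean)
```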