Summary

The diamonds dataset consists of 10 columns and 53,940 rows. The columns are summarized below:

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 
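
A summary of this shape can be produced with a couple of base R calls. The sketch below assumes the data come from the `ggplot2` package, which ships a `diamonds` data set with the same dimensions; the report itself does not state how the data were loaded.

```r
# Assumption: the ggplot2 package is installed; it provides the diamonds data.
data(diamonds, package = "ggplot2")

dim(diamonds)       # number of rows and columns
str(diamonds)       # column types (cut, color, clarity are ordered factors)
summary(diamonds)   # per-column summary statistics
```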

Plots

We want to predict the price of a diamond from its other attributes. First, we examine how each column relates to price in the plots that follow.

Plots for ordinal columns:

In the plots below, the ‘Cut’, ‘Clarity’ and ‘Color’ columns are examined to see whether ‘Price’ differs between factor levels.

Plots for numeric columns:

In the plots below, the ‘Carat’, ‘Depth’, ‘X’, ‘Y’ and ‘Z’ columns are examined to see whether they show a relation or trend with ‘Price’.
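
The plotting code is not shown in this report, so the following is only a minimal sketch of how such plots could be drawn with base graphics. It assumes the `ggplot2` copy of `diamonds`; the object names `d` and `mean_by_cut` are illustrative.

```r
# Assumption: the ggplot2 package is installed; it provides the diamonds data.
data(diamonds, package = "ggplot2")
d <- as.data.frame(diamonds)

# Ordinal columns: price distribution within each factor level.
boxplot(price ~ cut,     data = d, main = "Price by cut")
boxplot(price ~ clarity, data = d, main = "Price by clarity")
boxplot(price ~ color,   data = d, main = "Price by color")

# Average price per level, to compare levels numerically.
mean_by_cut <- tapply(d$price, d$cut, mean)
print(round(mean_by_cut))

# Numeric columns: scatter plots against price.
for (col in c("carat", "depth", "x", "y", "z")) {
  plot(d[[col]], d$price, pch = ".", xlab = col, ylab = "price",
       main = paste("Price vs", col))
}
```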

Key Findings

  • Cut, clarity and color levels differ in how prices are distributed within them, though the average price does not seem to vary much between levels.
  • To some extent, the X, Y and Z attributes show a rising trend with price.
  • Carat attribute has a rising trend with price.
  • Depth doesn’t seem to have an effect on price.

PCA Analysis

To understand which columns correlate most strongly with price, and whether the number of factors can be reduced, I ran a PCA on the data set. Because cut, color and clarity are not numeric, I first converted them to numeric values and then applied PCA.

##               color_n       price          x          y          z      carat
## color_n    1.00000000  0.17251093  0.2702867  0.2635844  0.2682269  0.2914368
## price      0.17251093  1.00000000  0.8844352  0.8654209  0.8612494  0.9215913
## x          0.27028669  0.88443516  1.0000000  0.9747015  0.9707718  0.9750942
## y          0.26358440  0.86542090  0.9747015  1.0000000  0.9520057  0.9517222
## z          0.26822688  0.86124944  0.9707718  0.9520057  1.0000000  0.9533874
## carat      0.29143675  0.92159130  0.9750942  0.9517222  0.9533874  1.0000000
## cut_n     -0.02051852 -0.05349066 -0.1255652 -0.1214619 -0.1493225 -0.1349670
## clarity_n  0.02563128 -0.14680007 -0.3719985 -0.3584196 -0.3669520 -0.3528406
##                 cut_n   clarity_n
## color_n   -0.02051852  0.02563128
## price     -0.05349066 -0.14680007
## x         -0.12556524 -0.37199853
## y         -0.12146187 -0.35841962
## z         -0.14932254 -0.36695200
## carat     -0.13496702 -0.35284057
## cut_n      1.00000000  0.18917474
## clarity_n  0.18917474  1.00000000
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.2270983 1.0643090 0.9664831 0.86948598 0.35757577
## Proportion of Variance 0.6199958 0.1415942 0.1167612 0.09450073 0.01598255
## Cumulative Proportion  0.6199958 0.7615900 0.8783512 0.97285197 0.98883452
##                            Comp.6      Comp.7      Comp.8
## Standard deviation     0.21850601 0.166216557 0.118114435
## Proportion of Variance 0.00596811 0.003453493 0.001743877
## Cumulative Proportion  0.99480263 0.998256123 1.000000000

From the correlation matrix, the ‘X’, ‘Y’, ‘Z’ and ‘Carat’ columns have the highest correlation with price, while clarity has the strongest negative correlation with price (about -0.147). From the PCA, the first 5 components explain 98.88% of the variance in the data.
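
The conversion and PCA steps can be sketched as follows. This is a sketch under assumptions, not the code behind the output above: the output format suggests `princomp` on a correlation matrix, and the ordinal columns are coded here by their level rank with `as.integer()`; the report does not state which coding was used.

```r
# Assumption: the ggplot2 package is installed; it provides the diamonds data.
data(diamonds, package = "ggplot2")

# cut, color and clarity are ordered factors; as.integer() maps each level
# to its rank (1 = first level). This coding is an assumption.
num <- data.frame(
  color_n   = as.integer(diamonds$color),
  price     = diamonds$price,
  x         = diamonds$x,
  y         = diamonds$y,
  z         = diamonds$z,
  carat     = diamonds$carat,
  cut_n     = as.integer(diamonds$cut),
  clarity_n = as.integer(diamonds$clarity)
)

# Pairwise correlations between the eight columns.
print(round(cor(num), 3))

# PCA on the correlation matrix, so every column is standardized first.
pca <- princomp(num, cor = TRUE)
summary(pca)

# Cumulative share of variance explained by the first k components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(round(cum_var, 4))
```

Because the PCA runs on the correlation matrix, the component variances sum to the number of columns (8), which is why each "Proportion of Variance" equals the squared standard deviation divided by 8.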

CART Model

Model with 4 variables

According to the PCA, 5 components explain 98.88% of the data, and according to the correlation matrix, carat, x, y and z correlate most strongly with price. So I chose carat, x, y and z as predictors of price for the first model.
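
A regression tree on these four predictors could be fit as sketched below. The `rpart` package, the 80/20 split and the seed are all assumptions made for illustration; the report names neither the CART implementation nor the split proportions.

```r
library(rpart)  # recommended package distributed with R

# Assumption: the ggplot2 package is installed; it provides the diamonds data.
data(diamonds, package = "ggplot2")
d <- as.data.frame(diamonds)

set.seed(1)  # arbitrary seed, only for reproducibility of this sketch

# Hypothetical 80/20 split into a training sample and a test set.
idx   <- sample(nrow(d), size = floor(0.8 * nrow(d)))
train <- d[idx, ]
test  <- d[-idx, ]

# Regression tree (method = "anova") for price on the four predictors.
fit <- rpart(price ~ carat + x + y + z, data = train, method = "anova")
print(fit)

# Each prediction is the mean training price of the leaf a row falls into,
# so only a handful of distinct predicted values occur.
train_pred <- predict(fit, newdata = train)
test_pred  <- predict(fit, newdata = test)
print(sort(unique(round(test_pred))))
```

Replacing the formula with `price ~ .` would fit the tree on every column instead of the four chosen here.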

Sample Check

In the price_difference column I calculated the percentage difference between the predicted and actual price, shown as a fraction in the table. The summary table shows a mean price difference of about 15.6%. Between the 1st and 3rd quartiles, predictions fall from about 15.7% below to about 38.5% above the actual price.

##  price_predict    price_actual   price_difference  
##  Min.   : 1058   Min.   :  326   Min.   :-0.83980  
##  1st Qu.: 1058   1st Qu.:  948   1st Qu.:-0.15732  
##  Median : 3076   Median : 2405   Median : 0.09682  
##  Mean   : 3941   Mean   : 3941   Mean   : 0.15568  
##  3rd Qu.: 6140   3rd Qu.: 5352   3rd Qu.: 0.38538  
##  Max.   :14890   Max.   :18823   Max.   : 5.57235
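
The price_difference column can be reproduced in form with a few lines. The sketch assumes the difference is signed and divided by the actual price, which is consistent with the table but not stated in the report. The four price pairs are hypothetical toy values, not rows from the data.

```r
# Hypothetical toy values, not taken from the report's data.
price_actual  <- c(1000, 2000, 5000, 800)
price_predict <- c(1058, 3076, 3076, 1058)

# Assumed definition: signed relative error, positive when the model
# predicts a price above the actual one.
price_difference <- (price_predict - price_actual) / price_actual

check <- data.frame(price_predict, price_actual, price_difference)
summary(check)
```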

Line plots of actual, predicted and difference columns of sample data

Test Check

In the price_difference column I calculated the percentage difference between the predicted and actual price. As in the sample summary table, the mean of the price_difference column is about 15.5%.

##  price_predict    price_actual   price_difference  
##  Min.   : 1058   Min.   :  345   Min.   :-0.83838  
##  1st Qu.: 1058   1st Qu.:  956   1st Qu.:-0.15801  
##  Median : 3076   Median : 2386   Median : 0.09267  
##  Mean   : 3900   Mean   : 3900   Mean   : 0.15538  
##  3rd Qu.: 6140   3rd Qu.: 5248   3rd Qu.: 0.37826  
##  Max.   :14890   Max.   :18797   Max.   : 2.44178

Line plots of actual, predicted and difference columns of test data

Differences

To sum up, the mean difference between predicted and actual price is 1525.35 in the sample data and 1526.98 in the test data. In the sample data, 38.55% of predicted prices fell within 20% above or below the actual price; in the test data, 38.76% did.
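
Both summary figures can be computed as sketched below. The sketch assumes that the "mean price difference in value" is a mean absolute difference and that the percentage counts predictions within 20% of the actual price; the report does not give the formulas. The price vectors are hypothetical toy values.

```r
# Hypothetical toy values, not taken from the report's data.
price_actual  <- c(1000, 2000, 5000, 800)
price_predict <- c(1058, 3076, 3076, 1058)

# Assumed metric 1: mean absolute difference in price units.
mean_abs_diff <- mean(abs(price_predict - price_actual))

# Assumed metric 2: percentage of predictions within +/-20% of the actual price.
rel_diff        <- (price_predict - price_actual) / price_actual
share_within_20 <- mean(abs(rel_diff) <= 0.20) * 100

print(mean_abs_diff)
print(share_within_20)
```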

Model with all columns

As we saw, a correlation matrix built from ordinal values converted to numeric values is not a solid basis for PCA. So I also fit a CART model using all of the columns.

As the table shows, when all of the columns are available, the clarity column makes its way into the decision criteria.

Sample Check

The sample and test data were checked in the same way as for the first CART model.

##  price_predict    price_actual   price_difference  
##  Min.   : 1050   Min.   :  326   Min.   :-0.78635  
##  1st Qu.: 1050   1st Qu.:  950   1st Qu.:-0.15230  
##  Median : 3060   Median : 2403   Median : 0.08015  
##  Mean   : 3939   Mean   : 3939   Mean   : 0.14412  
##  3rd Qu.: 5401   3rd Qu.: 5352   3rd Qu.: 0.35285  
##  Max.   :14918   Max.   :18818   Max.   : 3.27984

Price Difference Plots for Sample Data

Test Check

##  price_predict    price_actual   price_difference  
##  Min.   : 1050   Min.   :  326   Min.   :-0.75535  
##  1st Qu.: 1050   1st Qu.:  950   1st Qu.:-0.15349  
##  Median : 3060   Median : 2394   Median : 0.08056  
##  Mean   : 3920   Mean   : 3909   Mean   : 0.14549  
##  3rd Qu.: 5401   3rd Qu.: 5252   3rd Qu.: 0.35225  
##  Max.   :14918   Max.   :18823   Max.   : 2.52362

Price Difference Plots for Test Data

The mean difference between predicted and actual price is 1382.53 in the sample data and 1379.15 in the test data. In the sample data, 42.28% of predicted prices fell within 20% above or below the actual price; in the test data, 42.10% did.

Compared with the first CART, the second has a lower mean difference in value and a higher share of predictions within -20% to +20% of the actual price. So I chose to go with all the columns in the data table.