The Diamonds dataset consists of 10 columns and 53,940 rows. The columns are as follows:
cut
: ordinal value of the cut quality of the diamond. Values: Fair < Good < Very Good < Premium < Ideal
carat
: numeric value of the diamond’s weight in carats
clarity
: ordinal value of the diamond’s clarity. Values: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
color
: ordinal value of the diamond’s color. Values: D < E < F < G < H < I < J
depth
: numeric value of the diamond’s total depth percentage
table
: numeric value of the width of the diamond’s flat top surface relative to its widest point
price
: price of the diamond
x
: numeric value (length in mm)
y
: numeric value (width in mm)
z
: numeric value (depth in mm)
##      carat               cut        color        clarity          depth
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
##                                     J: 2808   (Other): 2531
##      table           price             x                y
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
##
##        z
##  Min.   : 0.000
##  1st Qu.: 2.910
##  Median : 3.530
##  Mean   : 3.539
##  3rd Qu.: 4.040
##  Max.   :31.800
##
We want to predict the price of a diamond from its other attributes. First, we examine how each column relates to price in the plots that follow.
Below, the ‘Cut’, ‘Clarity’ and ‘Color’ columns are examined to see whether ‘Price’ differs between the levels of each factor.
Below, the ‘Carat’, ‘Depth’, ‘X’, ‘Y’ and ‘Z’ columns are examined to see whether they show a relation or trend with ‘Price’.
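As a rough illustration of these two kinds of checks, the sketch below uses a small made-up data frame (the `toy` values are invented, not taken from the Diamonds data). It compares the median price between factor levels and measures the linear trend between a numeric column and price. The original plots may have been produced differently.

```r
# Hypothetical toy data standing in for the diamonds table (values invented)
toy <- data.frame(
  cut   = factor(c("Fair", "Good", "Ideal", "Ideal", "Good", "Fair"),
                 levels = c("Fair", "Good", "Ideal"), ordered = TRUE),
  carat = c(0.30, 0.50, 0.70, 1.00, 1.20, 1.50),
  price = c(400, 900, 2500, 5000, 6000, 8000)
)

# Factor column: compare the median price between the levels of cut
median_by_cut <- aggregate(price ~ cut, data = toy, FUN = median)
print(median_by_cut)

# Numeric column: strength of the linear trend between carat and price
carat_price_cor <- cor(toy$carat, toy$price)
print(carat_price_cor)
```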
To understand which columns have a higher correlation with price, and whether factor reduction is possible, I ran a PCA on the data set. Because cut, color and clarity are not numerical, I first transformed them into numerical values and then applied the PCA.
##               color_n       price          x          y          z      carat
## color_n    1.00000000  0.17251093  0.2702867  0.2635844  0.2682269  0.2914368
## price      0.17251093  1.00000000  0.8844352  0.8654209  0.8612494  0.9215913
## x          0.27028669  0.88443516  1.0000000  0.9747015  0.9707718  0.9750942
## y          0.26358440  0.86542090  0.9747015  1.0000000  0.9520057  0.9517222
## z          0.26822688  0.86124944  0.9707718  0.9520057  1.0000000  0.9533874
## carat      0.29143675  0.92159130  0.9750942  0.9517222  0.9533874  1.0000000
## cut_n     -0.02051852 -0.05349066 -0.1255652 -0.1214619 -0.1493225 -0.1349670
## clarity_n  0.02563128 -0.14680007 -0.3719985 -0.3584196 -0.3669520 -0.3528406
##                 cut_n   clarity_n
## color_n   -0.02051852  0.02563128
## price     -0.05349066 -0.14680007
## x         -0.12556524 -0.37199853
## y         -0.12146187 -0.35841962
## z         -0.14932254 -0.36695200
## carat     -0.13496702 -0.35284057
## cut_n      1.00000000  0.18917474
## clarity_n  0.18917474  1.00000000
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.2270983 1.0643090 0.9664831 0.86948598 0.35757577
## Proportion of Variance 0.6199958 0.1415942 0.1167612 0.09450073 0.01598255
## Cumulative Proportion  0.6199958 0.7615900 0.8783512 0.97285197 0.98883452
##                            Comp.6      Comp.7      Comp.8
## Standard deviation     0.21850601 0.166216557 0.118114435
## Proportion of Variance 0.00596811 0.003453493 0.001743877
## Cumulative Proportion  0.99480263 0.998256123 1.000000000
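Output of this shape can be produced along the following lines. This is only a sketch on simulated data: the data frame `toy` and its values are invented, and the use of `as.integer()` on an ordered factor and `princomp(..., cor = TRUE)` are my assumptions about one reasonable way to do the conversion and the PCA, not necessarily the exact calls behind the output above.

```r
set.seed(1)
n <- 200
carat <- runif(n, 0.2, 2)

# Hypothetical stand-in for the diamonds data (all values simulated)
toy <- data.frame(
  carat = carat,
  x     = 6.5 * carat^(1 / 3) + rnorm(n, sd = 0.05),
  price = 4000 * carat^1.5 + rnorm(n, sd = 200),
  cut   = factor(
    sample(c("Fair", "Good", "Very Good", "Premium", "Ideal"), n, replace = TRUE),
    levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
    ordered = TRUE
  )
)

# Ordinal to numeric: the integer code follows the level order (Fair = 1, ..., Ideal = 5)
toy$cut_n <- as.integer(toy$cut)

num <- toy[, c("carat", "x", "price", "cut_n")]

# Correlation matrix of the numeric columns
cor_mat <- cor(num)
print(round(cor_mat, 3))

# PCA on the correlation matrix, so every column is standardised first
pca <- princomp(num, cor = TRUE)
summary(pca)
```

Note that coding ordered levels as 1, 2, 3, ... treats the steps between levels as equally spaced, which is a strong assumption for ordinal data.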
From the correlation matrix, we can say that the ‘X’, ‘Y’, ‘Z’ and ‘Carat’ columns have the highest correlation with price (0.86 to 0.92). Meanwhile, clarity has the strongest negative correlation with price (-0.15). From the PCA, we can say that 5 components explain 98.88% of the variance in the data.
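The cumulative proportion can be checked by hand from the standard deviations in the PCA output: each component’s share is its squared standard deviation divided by the sum of all squared standard deviations. A short check using the printed values:

```r
# Standard deviations of Comp.1 to Comp.8, copied from the PCA output
sdev <- c(2.2270983, 1.0643090, 0.9664831, 0.86948598,
          0.35757577, 0.21850601, 0.166216557, 0.118114435)

# Share of variance per component and the running total
prop     <- sdev^2 / sum(sdev^2)
cum_prop <- cumsum(prop)

print(round(prop, 4))
print(round(cum_prop, 4))
```

The squared standard deviations sum to roughly 8, the number of variables, which is what one expects when the PCA is run on a correlation matrix.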
Based on the PCA and the correlation matrix (5 components explain 98.88% of the variance), I chose the carat, x, y, z and price columns for the first model.
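A model of this kind could be fitted roughly as follows. This is a sketch under several assumptions: it uses the `rpart` package for CART, simulated data in place of the Diamonds table, and an 80/20 split into sample and test data. None of these details are stated in this report.

```r
library(rpart)

set.seed(503)
n <- 500
carat <- runif(n, 0.2, 2.5)

# Hypothetical stand-in for the selected columns (all values simulated)
toy <- data.frame(
  carat = carat,
  x     = 6.5 * carat^(1 / 3) + rnorm(n, sd = 0.05),
  y     = 6.5 * carat^(1 / 3) + rnorm(n, sd = 0.05),
  z     = 4.0 * carat^(1 / 3) + rnorm(n, sd = 0.05),
  price = pmax(300, 4000 * carat^1.5 + rnorm(n, sd = 300))
)

# Assumed 80/20 split into sample (training) and test data
idx   <- sample(n, size = 400)
train <- toy[idx, ]
test  <- toy[-idx, ]

# Regression tree on the four size-related columns
fit <- rpart(price ~ carat + x + y + z, data = train)
print(fit)

# Predicted price for the held-out rows
test$price_predict <- predict(fit, newdata = test)
n_leaves <- sum(fit$frame$var == "<leaf>")
```

A regression tree predicts one value per leaf, so `price_predict` can only take as many distinct values as the tree has leaves.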
In the price_difference column I calculated the relative difference between the predicted price and the actual price. In the summary table, we can see a mean price difference of about 15.6%, and between the 1st and 3rd quartiles the predicted price was within a -15.7% to +38.5% range of the actual price.
##  price_predict    price_actual   price_difference
##  Min.   : 1058   Min.   :  326   Min.   :-0.83980
##  1st Qu.: 1058   1st Qu.:  948   1st Qu.:-0.15732
##  Median : 3076   Median : 2405   Median : 0.09682
##  Mean   : 3941   Mean   : 3941   Mean   : 0.15568
##  3rd Qu.: 6140   3rd Qu.: 5352   3rd Qu.: 0.38538
##  Max.   :14890   Max.   :18823   Max.   : 5.57235
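A column like price_difference could be computed as in the sketch below. The formula (predicted minus actual, divided by actual) is my reading of the summary values rather than something stated explicitly here, and the five rows in `result` are made up for illustration.

```r
# Hypothetical predictions and actual prices (values invented for illustration)
result <- data.frame(
  price_predict = c(1058, 1058, 3076, 6140, 14890),
  price_actual  = c(326, 1200, 2800, 5000, 18823)
)

# Assumed definition: difference relative to the actual price
# (positive = predicted too high, negative = predicted too low)
result$price_difference <-
  (result$price_predict - result$price_actual) / result$price_actual

summary(result)
```

Under this definition a cheap diamond that lands in an expensive leaf gives a large positive value, which would explain why the maximum is much further from zero than the minimum.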
Line plots of actual, predicted and difference columns of sample data
For the test data, the price_difference column again holds the relative difference between the predicted price and the actual price. As in the sample summary table, the mean of the price_difference column is about 15.5%.
##  price_predict    price_actual   price_difference
##  Min.   : 1058   Min.   :  345   Min.   :-0.83838
##  1st Qu.: 1058   1st Qu.:  956   1st Qu.:-0.15801
##  Median : 3076   Median : 2386   Median : 0.09267
##  Mean   : 3900   Mean   : 3900   Mean   : 0.15538
##  3rd Qu.: 6140   3rd Qu.: 5248   3rd Qu.: 0.37826
##  Max.   :14890   Max.   :18797   Max.   : 2.44178
Line plots of actual, predicted and difference columns of test data
To sum up, the mean price difference in value is 1525.35 in the sample data and 1526.98 in the test data. 38.55% of our predicted prices fell within 20% above or below the actual prices in the sample data, and 38.76% did so in the test data.
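Two summary figures of this kind can be computed with a small helper such as the one below. The function name `prediction_metrics` is hypothetical, the input values are invented, and I am assuming that the mean difference in value is a mean absolute difference.

```r
# Hypothetical helper: summarise how close predictions are to actual prices
prediction_metrics <- function(predicted, actual, tolerance = 0.20) {
  abs_diff <- abs(predicted - actual)          # difference in price units
  rel_diff <- (predicted - actual) / actual    # difference relative to actual
  list(
    mean_abs_diff    = mean(abs_diff),
    pct_within_range = 100 * mean(abs(rel_diff) <= tolerance)
  )
}

# Made-up example values
price_predict <- c(1058, 1058, 3076, 6140, 14890)
price_actual  <- c(326, 1200, 2800, 5000, 18823)

metrics <- prediction_metrics(price_predict, price_actual)
print(metrics)
```

Calling the same helper once on the sample data and once on the test data keeps the two sets of figures directly comparable.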
A correlation matrix built from ordinal values converted to numeric values is not a solid basis for PCA, so I also ran CART using all of the columns.
As you can see from the table, when all of the columns are used, the clarity column makes its way into the decision criteria.
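One way to check which columns a fitted tree actually splits on is sketched below. It assumes the `rpart` package and uses simulated data, in which price is constructed to depend on carat and clarity while depth is pure noise. The names `toy`, `fit_all` and `used` are illustrative only.

```r
library(rpart)

set.seed(503)
n <- 600
clarity_levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")

# Hypothetical data: price depends on carat and clarity, depth is noise
toy <- data.frame(
  carat   = runif(n, 0.2, 2.5),
  clarity = factor(sample(clarity_levels, n, replace = TRUE),
                   levels = clarity_levels, ordered = TRUE),
  depth   = rnorm(n, mean = 61.7, sd = 1.4)
)
toy$price <- 3000 * toy$carat^1.5 * (0.6 + 0.1 * as.integer(toy$clarity)) +
  rnorm(n, sd = 200)

# Regression tree using every remaining column as a predictor
fit_all <- rpart(price ~ ., data = toy)

# Columns that appear in at least one split ("<leaf>" marks terminal nodes)
used <- setdiff(unique(as.character(fit_all$frame$var)), "<leaf>")
print(used)

# Importance scores also credit columns that only act as surrogate splits
print(fit_all$variable.importance)
```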
The sample and test data were checked in the same way as for the first CART model.
##  price_predict    price_actual   price_difference
##  Min.   : 1050   Min.   :  326   Min.   :-0.78635
##  1st Qu.: 1050   1st Qu.:  950   1st Qu.:-0.15230
##  Median : 3060   Median : 2403   Median : 0.08015
##  Mean   : 3939   Mean   : 3939   Mean   : 0.14412
##  3rd Qu.: 5401   3rd Qu.: 5352   3rd Qu.: 0.35285
##  Max.   :14918   Max.   :18818   Max.   : 3.27984
Price Difference Plots for Sample Data
##  price_predict    price_actual   price_difference
##  Min.   : 1050   Min.   :  326   Min.   :-0.75535
##  1st Qu.: 1050   1st Qu.:  950   1st Qu.:-0.15349
##  Median : 3060   Median : 2394   Median : 0.08056
##  Mean   : 3920   Mean   : 3909   Mean   : 0.14549
##  3rd Qu.: 5401   3rd Qu.: 5252   3rd Qu.: 0.35225
##  Max.   :14918   Max.   :18823   Max.   : 2.52362
Price Difference Plots for Test Data
For this model, the mean price difference in value is 1382.53 in the sample data and 1379.15 in the test data. 42.28% of our predicted prices fell within 20% above or below the actual prices in the sample data, and 42.10% did so in the test data.
The mean difference in value was lower, and the share of predicted values falling between -20% and +20% of the actual values was higher, in the second CART. So I chose to go with all of the columns in the data table.
http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/
https://alice86.github.io/2018/04/08/Factor-Analysis-on-Ordinal-Data-example-in-R-(psych,-homals)/
https://rpubs.com/DocOfi/342740
https://mef-bda503.github.io/archive/fall17/files/intro_to_ml.html#properties
https://mef-bda503.github.io/archive/fall17/files/intro_to_ml_2.html