I analyzed the diamonds dataset to predict price with linear regression.
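The setup chunk is not shown here, so I am assuming the usual tidyverse setup: diamonds ships with ggplot2, and the pipe and verbs below come from dplyr.
# Assumed setup (not shown in the original post): load the tidyverse, which
# provides the diamonds data (via ggplot2) and the %>% pipe / dplyr verbs.
library(tidyverse)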
df <- diamonds %>% glimpse()
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
df %>% group_by(cut) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## cut count
## <ord> <int>
## 1 Ideal 21551
## 2 Premium 13791
## 3 Very Good 12082
## 4 Good 4906
## 5 Fair 1610
df %>% group_by(color) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
## color count
## <ord> <int>
## 1 G 11292
## 2 E 9797
## 3 F 9542
## 4 H 8304
## 5 D 6775
## 6 I 5422
## 7 J 2808
df %>% group_by(clarity) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
## clarity count
## <ord> <int>
## 1 SI1 13065
## 2 VS2 12258
## 3 SI2 9194
## 4 VS1 8171
## 5 VVS2 5066
## 6 VVS1 3655
## 7 IF 1790
## 8 I1 741
All combinations of cut, color, and clarity would give 8 * 7 * 5 = 280 possibilities, but the data contains only 276, so four combinations never occur (a sketch after the table below lists them). Let’s compare the variables in pairs.
df %>% group_by(color, cut, clarity) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` regrouping output by 'color', 'cut' (override with `.groups` argument)
## # A tibble: 276 x 4
## # Groups: color, cut [35]
## color cut clarity count
## <ord> <ord> <ord> <int>
## 1 E Ideal VS2 1136
## 2 G Ideal VS1 953
## 3 D Ideal VS2 920
## 4 G Ideal VS2 910
## 5 F Ideal VS2 879
## 6 G Ideal VVS2 774
## 7 E Ideal SI1 766
## 8 H Ideal SI1 763
## 9 D Ideal SI1 738
## 10 G Premium VS2 721
## # ... with 266 more rows
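As a side check (my own sketch, not part of the original analysis), tidyr::expand() and anti_join() can list the four cut/color/clarity combinations that never occur:
# Sketch: build all 280 possible combinations, then keep only those that
# never appear in the data (should return 4 rows).
df %>%
  tidyr::expand(cut, color, clarity) %>%
  anti_join(df, by = c("cut", "color", "clarity"))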
Cut and color: 35 observed combinations vs. 7 * 5 = 35 possible.
df %>% group_by(color, cut) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` regrouping output by 'color' (override with `.groups` argument)
## # A tibble: 35 x 3
## # Groups: color [7]
## color cut count
## <ord> <ord> <int>
## 1 G Ideal 4884
## 2 E Ideal 3903
## 3 F Ideal 3826
## 4 H Ideal 3115
## 5 G Premium 2924
## 6 D Ideal 2834
## 7 E Very Good 2400
## 8 H Premium 2360
## 9 E Premium 2337
## 10 F Premium 2331
## # ... with 25 more rows
Color and clarity: 56 observed combinations vs. 7 * 8 = 56 possible.
df %>% group_by(color, clarity) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` regrouping output by 'color' (override with `.groups` argument)
## # A tibble: 56 x 3
## # Groups: color [7]
## color clarity count
## <ord> <ord> <int>
## 1 E VS2 2470
## 2 E SI1 2426
## 3 G VS2 2347
## 4 H SI1 2275
## 5 F VS2 2201
## 6 G VS1 2148
## 7 F SI1 2131
## 8 D SI1 2083
## 9 G SI1 1976
## 10 E SI2 1713
## # ... with 46 more rows
Cut and clarity: 40 observed combinations vs. 5 * 8 = 40 possible, so the missing combinations only show up when all three variables are crossed. These three categorical variables have many levels and are quite noisy, so I will leave them out of the linear regression for now.
df2 <- df %>% select(-cut, -color, -clarity)
df2
## # A tibble: 53,940 x 7
## carat depth table price x y z
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 61.5 55 326 3.95 3.98 2.43
## 2 0.21 59.8 61 326 3.89 3.84 2.31
## 3 0.23 56.9 65 327 4.05 4.07 2.31
## 4 0.290 62.4 58 334 4.2 4.23 2.63
## 5 0.31 63.3 58 335 4.34 4.35 2.75
## 6 0.24 62.8 57 336 3.94 3.96 2.48
## 7 0.24 62.3 57 336 3.95 3.98 2.47
## 8 0.26 61.9 55 337 4.07 4.11 2.53
## 9 0.22 65.1 61 337 3.87 3.78 2.49
## 10 0.23 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
Let’s run a multiple linear regression.
lm.fit <- lm(price ~ ., data = df2)
summary(lm.fit)
##
## Call:
## lm(formula = price ~ ., data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23878.2 -615.0 -50.7 347.9 12759.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20849.316 447.562 46.584 < 2e-16 ***
## carat 10686.309 63.201 169.085 < 2e-16 ***
## depth -203.154 5.504 -36.910 < 2e-16 ***
## table -102.446 3.084 -33.216 < 2e-16 ***
## x -1315.668 43.070 -30.547 < 2e-16 ***
## y 66.322 25.523 2.599 0.00937 **
## z 41.628 44.305 0.940 0.34744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1497 on 53933 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8592
## F-statistic: 5.486e+04 on 6 and 53933 DF, p-value: < 2.2e-16
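Before dropping terms one at a time, a quick correlation check (my own addition, not in the original output) shows that carat and the x, y, z dimensions are almost collinear, which helps explain the unstable coefficients for the size variables:
# Sketch: pairwise correlations among the size-related predictors; carat and
# x/y/z should be very strongly correlated with one another.
df2 %>% select(carat, x, y, z) %>% cor()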
z has a high p-value, so let’s remove it and run the regression again. The standard errors decrease while R-squared stays the same.
lm.fit <- lm(price ~ . - z, data = df2)
summary(lm.fit)
##
## Call:
## lm(formula = price ~ . - z, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23872.1 -614.8 -50.5 347.5 12759.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20702.947 419.575 49.343 < 2e-16 ***
## carat 10686.707 63.199 169.095 < 2e-16 ***
## depth -200.718 4.855 -41.344 < 2e-16 ***
## table -102.490 3.084 -33.234 < 2e-16 ***
## x -1293.542 36.063 -35.869 < 2e-16 ***
## y 69.575 25.287 2.751 0.00594 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1497 on 53934 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8592
## F-statistic: 6.583e+04 on 5 and 53934 DF, p-value: < 2.2e-16
Let’s also remove y, which, although significant, still has the highest p-value. The standard error of x drops noticeably while R-squared stays the same.
lm.fit <- lm(price ~ . - z - y, data = df2)
summary(lm.fit)
##
## Call:
## lm(formula = price ~ . - z - y, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23894.2 -615.0 -50.6 346.9 12760.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20765.521 418.984 49.56 <2e-16 ***
## carat 10692.510 63.168 169.27 <2e-16 ***
## depth -201.231 4.852 -41.48 <2e-16 ***
## table -102.824 3.082 -33.37 <2e-16 ***
## x -1226.773 26.678 -45.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1497 on 53935 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8592
## F-statistic: 8.228e+04 on 4 and 53935 DF, p-value: < 2.2e-16
Now let’s try the single most significant predictor, carat. With this one variable we still get a very low p-value and a similar R-squared (0.8493 vs. 0.8592), with a much smaller standard error on the carat coefficient, plus the benefit of a simpler model.
lm.fit <- lm(price ~ carat, data = df2)
summary(lm.fit)
##
## Call:
## lm(formula = price ~ carat, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
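As a small usage sketch (my addition, not in the original analysis), the carat-only model can already be used for rough price predictions:
# Sketch: predict prices for a few hypothetical carat weights with the
# carat-only model fitted above.
new_diamonds <- tibble(carat = c(0.5, 1, 1.5))
predict(lm.fit, newdata = new_diamonds)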
As a next step, we can transform price into price per carat and repeat the analysis.
df3 <- df2 %>% mutate(unit_price = price / carat) %>% select(-price, -carat)
df3
## # A tibble: 53,940 x 6
## depth table x y z unit_price
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 61.5 55 3.95 3.98 2.43 1417.
## 2 59.8 61 3.89 3.84 2.31 1552.
## 3 56.9 65 4.05 4.07 2.31 1422.
## 4 62.4 58 4.2 4.23 2.63 1152.
## 5 63.3 58 4.34 4.35 2.75 1081.
## 6 62.8 57 3.94 3.96 2.48 1400
## 7 62.3 57 3.95 3.98 2.47 1400
## 8 61.9 55 4.07 4.11 2.53 1296.
## 9 65.1 61 3.87 3.78 2.49 1532.
## 10 59.4 61 4 4.05 2.39 1470.
## # ... with 53,930 more rows
This time R-squared is a much worse 0.6273, even though all p-values are significant.
lm.fit <- lm(unit_price ~ ., data = df3)
summary(lm.fit)
##
## Call:
## lm(formula = unit_price ~ ., data = df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7763.7 -701.4 -145.0 492.6 17468.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4058.48 338.08 12.005 < 2e-16 ***
## depth -64.91 4.42 -14.685 < 2e-16 ***
## table -74.98 2.53 -29.634 < 2e-16 ***
## x 1232.84 28.77 42.859 < 2e-16 ***
## y 113.42 20.94 5.416 6.11e-08 ***
## z 155.49 36.37 4.275 1.91e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1229 on 53934 degrees of freedom
## Multiple R-squared: 0.6273, Adjusted R-squared: 0.6273
## F-statistic: 1.815e+04 on 5 and 53934 DF, p-value: < 2.2e-16
Let’s drop y and z and try again. R-squared barely changes, but the standard errors, especially for x, drop significantly.
lm.fit <- lm(unit_price ~ . - y - z, data = df3)
summary(lm.fit)
##
## Call:
## lm(formula = unit_price ~ . - y - z, data = df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7300.5 -700.8 -143.4 492.3 17458.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3591.456 312.409 11.50 <2e-16 ***
## depth -56.516 3.870 -14.60 <2e-16 ***
## table -75.732 2.529 -29.94 <2e-16 ***
## x 1440.922 4.814 299.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1229 on 53936 degrees of freedom
## Multiple R-squared: 0.6269, Adjusted R-squared: 0.6269
## F-statistic: 3.021e+04 on 3 and 53936 DF, p-value: < 2.2e-16
To improve performance, the next step is to encode cut, color, and clarity as dummy variables and isolate how much predictive power they add on their own; a minimal sketch of how that could look follows below.
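A minimal sketch of that step, under my own assumptions (not the original analysis): lm() expands factors into contrast columns automatically, but the ordered factors in diamonds get polynomial contrasts by default, so converting them to plain factors gives ordinary dummy coding.
# Sketch (assumption): recode the ordered factors as plain factors so they are
# dummy (treatment) coded, then fit a model on the three categorical
# predictors alone to isolate their contribution.
df_cat <- df %>%
  mutate(across(c(cut, color, clarity), ~ factor(.x, ordered = FALSE)))
lm.cat <- lm(price ~ cut + color + clarity, data = df_cat)
summary(lm.cat)$r.squared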