I analyzed the diamonds dataset to predict price with linear regression.

library(tidyverse)   # diamonds ships with ggplot2; dplyr/tidyr supply the pipe verbs
df <- diamonds %>% glimpse()
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
df %>% group_by(cut) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   cut       count
##   <ord>     <int>
## 1 Ideal     21551
## 2 Premium   13791
## 3 Very Good 12082
## 4 Good       4906
## 5 Fair       1610
df %>% group_by(color) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
##   color count
##   <ord> <int>
## 1 G     11292
## 2 E      9797
## 3 F      9542
## 4 H      8304
## 5 D      6775
## 6 I      5422
## 7 J      2808
df %>% group_by(clarity) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
##   clarity count
##   <ord>   <int>
## 1 SI1     13065
## 2 VS2     12258
## 3 SI2      9194
## 4 VS1      8171
## 5 VVS2     5066
## 6 VVS1     3655
## 7 IF       1790
## 8 I1        741
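
As a side note, the same tallies can be written more compactly with dplyr's count(), which returns the tally in a column named n rather than count:

df %>% count(cut, sort = TRUE)      # equivalent to the group_by/summarise calls above
df %>% count(color, sort = TRUE)
df %>% count(clarity, sort = TRUE)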

The number of possible cut, color and clarity combinations is 5 * 7 * 8 = 280, but only 276 of them appear in the data, so four combinations never occur. That is worth a closer look. Let's check the three-way grouping first and then compare the pairs.

df %>% group_by(color, cut, clarity) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` regrouping output by 'color', 'cut' (override with `.groups` argument)
## # A tibble: 276 x 4
## # Groups:   color, cut [35]
##    color cut     clarity count
##    <ord> <ord>   <ord>   <int>
##  1 E     Ideal   VS2      1136
##  2 G     Ideal   VS1       953
##  3 D     Ideal   VS2       920
##  4 G     Ideal   VS2       910
##  5 F     Ideal   VS2       879
##  6 G     Ideal   VVS2      774
##  7 E     Ideal   SI1       766
##  8 H     Ideal   SI1       763
##  9 D     Ideal   SI1       738
## 10 G     Premium VS2       721
## # ... with 266 more rows
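
Only 276 of the 280 possible three-way combinations show up. A minimal sketch of how to list the four missing ones, relying on tidyr's complete() expanding every factor level:

df %>%
  distinct(cut, color, clarity) %>%
  mutate(present = TRUE) %>%
  complete(cut, color, clarity) %>%   # expands to all 5 * 7 * 8 = 280 level combinations
  filter(is.na(present)) %>%          # keep only the combinations never observed
  select(-present)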

cut and color: all 5 * 7 = 35 possible pairs appear (35 vs 35).

df %>% group_by(color, cut) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` regrouping output by 'color' (override with `.groups` argument)
## # A tibble: 35 x 3
## # Groups:   color [7]
##    color cut       count
##    <ord> <ord>     <int>
##  1 G     Ideal      4884
##  2 E     Ideal      3903
##  3 F     Ideal      3826
##  4 H     Ideal      3115
##  5 G     Premium    2924
##  6 D     Ideal      2834
##  7 E     Very Good  2400
##  8 H     Premium    2360
##  9 E     Premium    2337
## 10 F     Premium    2331
## # ... with 25 more rows

color and clarity: all 7 * 8 = 56 possible pairs appear (56 vs 56).

df %>% group_by(color, clarity) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` regrouping output by 'color' (override with `.groups` argument)
## # A tibble: 56 x 3
## # Groups:   color [7]
##    color clarity count
##    <ord> <ord>   <int>
##  1 E     VS2      2470
##  2 E     SI1      2426
##  3 G     VS2      2347
##  4 H     SI1      2275
##  5 F     VS2      2201
##  6 G     VS1      2148
##  7 F     SI1      2131
##  8 D     SI1      2083
##  9 G     SI1      1976
## 10 E     SI2      1713
## # ... with 46 more rows

cut and clarity: all 5 * 8 = 40 possible pairs appear (40 vs 40). With 280 possible combinations, these three categorical variables split the data into many small groups, so for a first pass I will drop them from the linear regression.

df2 <- df %>% select(-cut, -color, -clarity)

df2
## # A tibble: 53,940 x 7
##    carat depth table price     x     y     z
##    <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23   61.5    55   326  3.95  3.98  2.43
##  2 0.21   59.8    61   326  3.89  3.84  2.31
##  3 0.23   56.9    65   327  4.05  4.07  2.31
##  4 0.290  62.4    58   334  4.2   4.23  2.63
##  5 0.31   63.3    58   335  4.34  4.35  2.75
##  6 0.24   62.8    57   336  3.94  3.96  2.48
##  7 0.24   62.3    57   336  3.95  3.98  2.47
##  8 0.26   61.9    55   337  4.07  4.11  2.53
##  9 0.22   65.1    61   337  3.87  3.78  2.49
## 10 0.23   59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows
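
Before fitting, it is worth a quick look at how correlated the predictors are: carat and the x, y and z dimensions all measure size, so strong collinearity is expected. A minimal check:

round(cor(df2), 2)   # correlation matrix of all numeric columns, including price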

Let’s run a multiple linear regression.

lm.fit <- lm(price ~ ., data = df2)
summary(lm.fit)
## 
## Call:
## lm(formula = price ~ ., data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23878.2   -615.0    -50.7    347.9  12759.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20849.316    447.562  46.584  < 2e-16 ***
## carat       10686.309     63.201 169.085  < 2e-16 ***
## depth        -203.154      5.504 -36.910  < 2e-16 ***
## table        -102.446      3.084 -33.216  < 2e-16 ***
## x           -1315.668     43.070 -30.547  < 2e-16 ***
## y              66.322     25.523   2.599  0.00937 ** 
## z              41.628     44.305   0.940  0.34744    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1497 on 53933 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8592 
## F-statistic: 5.486e+04 on 6 and 53933 DF,  p-value: < 2.2e-16

z has a high p-value (0.347), so let’s remove it and rerun the regression. The coefficient standard errors decreased slightly while R-squared did not change.

lm.fit <- lm(price ~ . - z, data = df2)
summary(lm.fit)
## 
## Call:
## lm(formula = price ~ . - z, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23872.1   -614.8    -50.5    347.5  12759.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20702.947    419.575  49.343  < 2e-16 ***
## carat       10686.707     63.199 169.095  < 2e-16 ***
## depth        -200.718      4.855 -41.344  < 2e-16 ***
## table        -102.490      3.084 -33.234  < 2e-16 ***
## x           -1293.542     36.063 -35.869  < 2e-16 ***
## y              69.575     25.287   2.751  0.00594 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1497 on 53934 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8592 
## F-statistic: 6.583e+04 on 5 and 53934 DF,  p-value: < 2.2e-16

Let’s also remove y: although its p-value is low, it is still the highest remaining. The standard error of x decreased noticeably (36.1 to 26.7) while R-squared did not change.

lm.fit <- lm(price ~ . - z - y, data = df2)
summary(lm.fit)
## 
## Call:
## lm(formula = price ~ . - z - y, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23894.2   -615.0    -50.6    346.9  12760.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20765.521    418.984   49.56   <2e-16 ***
## carat       10692.510     63.168  169.27   <2e-16 ***
## depth        -201.231      4.852  -41.48   <2e-16 ***
## table        -102.824      3.082  -33.37   <2e-16 ***
## x           -1226.773     26.678  -45.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1497 on 53935 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8592 
## F-statistic: 8.228e+04 on 4 and 53935 DF,  p-value: < 2.2e-16
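
As a side note, the same stepwise reductions can be written with update() instead of editing the formula each time; a minimal sketch, refitting from the full model:

lm.full  <- lm(price ~ ., data = df2)
lm.no_z  <- update(lm.full, . ~ . - z)    # drop z
lm.no_zy <- update(lm.no_z, . ~ . - y)    # then drop y as well
summary(lm.no_zy)                         # same fit as the model above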

Now let’s try the single most significant predictor, carat. With this one variable we still get a very low p-value, a similar R-squared (0.8493 vs 0.8592), and a much smaller coefficient standard error, plus the benefit of a simpler model.

lm.fit <- lm(price ~ carat, data = df2)
summary(lm.fit)
## 
## Call:
## lm(formula = price ~ carat, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
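
If we want a formal comparison of the carat-only model against the full one, the nested models can be compared with an F-test and AIC; a sketch (with roughly 54,000 observations even tiny improvements will test as significant, so the case for the simpler model rests on the similar R-squared rather than the p-value):

lm.full  <- lm(price ~ ., data = df2)
lm.carat <- lm(price ~ carat, data = df2)
anova(lm.carat, lm.full)   # F-test for the nested models
AIC(lm.carat, lm.full)     # lower AIC indicates the better fit/complexity trade-off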

As a next step, we can transform the response into price per carat and repeat the analysis.

df3 <- df2 %>% mutate(unit_price = price / carat) %>% select(-price, -carat)
df3
## # A tibble: 53,940 x 6
##    depth table     x     y     z unit_price
##    <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
##  1  61.5    55  3.95  3.98  2.43      1417.
##  2  59.8    61  3.89  3.84  2.31      1552.
##  3  56.9    65  4.05  4.07  2.31      1422.
##  4  62.4    58  4.2   4.23  2.63      1152.
##  5  63.3    58  4.34  4.35  2.75      1081.
##  6  62.8    57  3.94  3.96  2.48      1400 
##  7  62.3    57  3.95  3.98  2.47      1400 
##  8  61.9    55  4.07  4.11  2.53      1296.
##  9  65.1    61  3.87  3.78  2.49      1532.
## 10  59.4    61  4     4.05  2.39      1470.
## # ... with 53,930 more rows

This time R-squared is much worse at 0.6273, even though all p-values are significant.

lm.fit <- lm(unit_price ~ ., data = df3)
summary(lm.fit)
## 
## Call:
## lm(formula = unit_price ~ ., data = df3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7763.7  -701.4  -145.0   492.6 17468.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4058.48     338.08  12.005  < 2e-16 ***
## depth         -64.91       4.42 -14.685  < 2e-16 ***
## table         -74.98       2.53 -29.634  < 2e-16 ***
## x            1232.84      28.77  42.859  < 2e-16 ***
## y             113.42      20.94   5.416 6.11e-08 ***
## z             155.49      36.37   4.275 1.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1229 on 53934 degrees of freedom
## Multiple R-squared:  0.6273, Adjusted R-squared:  0.6273 
## F-statistic: 1.815e+04 on 5 and 53934 DF,  p-value: < 2.2e-16

Let’s drop y and z and try again. R-squared barely changed, but the standard errors, especially for x (28.8 down to 4.8), dropped significantly.

lm.fit <- lm(unit_price ~ . - y - z, data = df3)
summary(lm.fit)
## 
## Call:
## lm(formula = unit_price ~ . - y - z, data = df3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7300.5  -700.8  -143.4   492.3 17458.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3591.456    312.409   11.50   <2e-16 ***
## depth        -56.516      3.870  -14.60   <2e-16 ***
## table        -75.732      2.529  -29.94   <2e-16 ***
## x           1440.922      4.814  299.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1229 on 53936 degrees of freedom
## Multiple R-squared:  0.6269, Adjusted R-squared:  0.6269 
## F-statistic: 3.021e+04 on 3 and 53936 DF,  p-value: < 2.2e-16

To improve performance, the next step could be to encode cut, color and clarity as dummy variables and isolate their contribution to prediction performance; a sketch of that follows.
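
A minimal sketch of that follow-up: lm() builds contrasts for factor columns automatically, so the categorical variables can simply stay in the formula. Note that diamonds stores cut, color and clarity as ordered factors, so R defaults to polynomial contrasts; converting them to unordered factors yields plain dummy variables.

# Convert ordered factors to plain factors so lm() uses ordinary dummy coding
df4 <- diamonds %>%
  mutate(across(c(cut, color, clarity), ~ factor(.x, ordered = FALSE)))

lm.cat <- lm(price ~ carat + cut + color + clarity, data = df4)
summary(lm.cat)

# The dummy-encoded design matrix itself, if needed explicitly
head(model.matrix(price ~ carat + cut + color + clarity, data = df4))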