---
output: html_document
---
Predicting blood donations
===============================
### Maximilian Press + Morgan Lawless
Various looks at what you can do with models, with an emphasis on parametric (GLM) models and the R model object.
We take a statistical learning point of view on the problem.
Dolph Schluter's R modeling pages are a good resource for general-purpose model fitting. https://www.zoology.ubc.ca/~schluter/R/fit-model/
## Deconstructing the model: K-Nearest Neighbor (kNN)
kNN is a nonparametric, almost stupidly simple method: for each test case, it finds the data points in the training set that are closest to that case and uses them to make a prediction. Despite its simplicity, kNN is asymptotically optimal as a predictor, in the sense that its error rate approaches the best achievable rate as the training set (and k) grows (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).
```{r fig.width=5,fig.height=5,fig.cap='A brief visual introduction to nearest neighbor'}
set.seed(667)
# just using random samples here, with mild covariation
a = runif(20) # a predictor variable
b = a+runif(20) # a variable of interest to be predicted
par(mfrow=c(1,1))
plot(a,b,xlab='predictor variable',ylab='predicted variable',ylim=c(0,2))
# a new data point to predict
c=runif(1)
d=c+runif(1)
points(c,d,pch=19)
abline(v=c-.05)
abline(v=c+.05)
neighbs = which((a > c-.05) & (a<c+.05))
neighbs # these are the nearest neighbors (not using a specific k here, just a window)
knn_est = mean(b[neighbs]) # make a prediction based on neighbs
points(c, knn_est, pch=19,col='red') # plot it
```
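The chunk above uses a fixed window around the new point rather than a fixed number of neighbors. For comparison, here is a minimal fixed-k version of the same idea (my addition, reusing the toy data `a`, `b`, and `c` from above, with k = 3):
```{r}
# fixed-k nearest neighbor prediction for the new point c
k = 3
dists = abs(a - c)           # distance from each training point's predictor
nearest = order(dists)[1:k]  # indices of the k closest training points
knn_k_est = mean(b[nearest]) # predict by averaging their outcomes
knn_k_est
```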
### Discuss!
## Predicting blood donation
Well, hopefully that was instructive. Now let's look at some actual data. This dataset is from a paper whose reference I have lost (it appears to be the UCI "Blood Transfusion Service Center" dataset); the goal is to predict who will show up to donate at blood drives, based on prior donation history.
```{r fig.width=7, fig.height=7}
trans = read.csv('transfusion.data',header=T) # read it in
head(trans)
plot(trans,pch='.')
```
Obviously some of these variables are more meaningful than others. I will somewhat naively fit the model using all of them, ignoring the possibility of interactions.
We will use prediction to evaluate the model: I chose to randomly sample 500 observations for training, and to test on the remaining 248.
```{r}
# training set
set.seed(666) # EXTREME
dim(trans)
trainindex = sample(1:nrow(trans),500)
train = trans[trainindex,]
# test set
test = trans[!(1:nrow(trans) %in% trainindex),]
# some utility functions
source('roc.R')
```
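The `roc.R` file itself isn't shown here. Judging from how `ROC()` is called below (a two-column matrix of scores and 0/1 labels, plotting a curve and reporting AUC), a minimal stand-in might look like the following sketch. This is my guess at the interface, not the actual contents of `roc.R`, so it is named differently to avoid clobbering the real function:
```{r}
# hypothetical stand-in for roc.R's ROC(): column 1 = scores, column 2 = 0/1 labels
ROC_sketch = function(pred) {
	scores = pred[, 1]
	labels = pred[, 2]
	# sweep thresholds from high to low, computing true/false positive rates
	thresholds = sort(unique(scores), decreasing = TRUE)
	tpr = sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
	fpr = sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
	plot(c(0, fpr, 1), c(0, tpr, 1), type = 'l',
		xlab = 'False positive rate', ylab = 'True positive rate')
	abline(0, 1, lty = 2) # chance line
	# AUC via the rank (Mann-Whitney) formula, counting ties as half
	pairs_gt = outer(scores[labels == 1], scores[labels == 0], '>')
	pairs_eq = outer(scores[labels == 1], scores[labels == 0], '==')
	mean(pairs_gt) + 0.5 * mean(pairs_eq)
}
```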
First, fit a linear model, which is ok but not very interesting.
```{r fig.width=5, fig.height=5,fig.cap='Predictions of linear model (training only)'}
# fit the model
linmod = lm(whether.he.she.donated.blood.in.March.2007 ~
Frequency..times., data = train)
str(linmod)
plot(train$Frequency..times.,
jitter(train$whether.he.she.donated.blood.in.March.2007),
xlab='# donation events',ylab='donated in test period (jittered)',
cex = .5 , main='Training set')
# things you can do with the fitted model object
abline(linmod) # add the predicted function to the plot just generated
# return various useful information about the model:
summary(linmod) # print a lot of results, in semi-human-readable table
coef(linmod) # coefficients (parameters)
confint(linmod) # confidence intervals
resid(linmod)[1:10] # residuals on the model - printing out only first ten
anova(linmod) # anova table
# this would plot lots of model fit info, which may or may not be useful:
plot(linmod)
# alternate visualization method
require(visreg)
visreg(linmod)
```
So we got a low p-value, which is good, right? Problem solved, everyone go home.
Except that this is obviously a really crappy model. We can show this by predicting test values (plugging new data that wasn't used to build the model into the fitted model function) and comparing those predictions to the actual values of the test outcome.
```{r fig.width=5, fig.height=5, fig.cap = 'Predictions vs. true values from linear model on test data'}
linpred = predict(linmod,newdata=test)
linpredplot = plot(
linpred,
jitter(test$whether.he.she.donated.blood.in.March.2007),
ylab='True value (jittered)', xlab='Predicted value',
ylim = c(-.2,1.2), xlim = c(0,1), cex = .5, main='Test set')
points( c(0,1), c(0,1), cex = 2, pch = 19 )
```
```{r fig.width=5, fig.height=7, fig.cap='ROC analysis of linear model on test'}
prediction = cbind(linpred,test[,5])
a = ROC(prediction)
```
Not great. There are also some numerical summaries of model fit that various people use (besides $R^2$).
```{r}
# Akaike information criterion:
# 2(num parameters) - 2ln( L(model) ) [lower is better!!!]
AIC(linmod)
# stolen from https://www.kaggle.com/c/bioresponse/forums/t/1576/r-code-for-logloss
# prettied up from that to make more readable
LogLoss = function(actual, predicted) {
	# mean negative log-likelihood of the observed 0/1 outcomes
	# for explanation see https://en.wikipedia.org/wiki/Loss_function
	result = -1/length(actual) *
		sum(
		actual * log(predicted) +     # contribution from actual == 1 cases
		(1-actual) * log(1-predicted) # contribution from actual == 0 cases
		)
	return(result)
}
# note that this evaluates the model on the held-out test set
LogLoss( test$whether.he.she.donated.blood.in.March.2007, linpred )
# the AUC from the ROC curve above is also such a measure.
# you can even use a U-test to sort of evaluate the quality of the predictions:
wilcox.test(
linpred[test$whether.he.she.donated.blood.in.March.2007 == 1],
linpred[test$whether.he.she.donated.blood.in.March.2007 == 0]
)
# or even something as naive as correlations
cor(test$whether.he.she.donated.blood.in.March.2007, linpred,method='spearman')
```
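One caveat worth flagging (my note, not from the original analysis): a linear model can produce predictions outside (0,1), in which case the log terms in `LogLoss()` are undefined and the result is `NaN`. A common workaround is to clamp predictions into the open interval first:
```{r}
# clamp predictions into (0,1) before taking logs, to avoid log(<=0) = NaN/-Inf
eps = 1e-6
linpred_clamped = pmin(pmax(linpred, eps), 1 - eps)
LogLoss(test$whether.he.she.donated.blood.in.March.2007, linpred_clamped)
```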
We can in principle make the model more complicated by adding more variables, and compare it to the old model.
```{r}
multilinmod = lm(whether.he.she.donated.blood.in.March.2007 ~
Frequency..times. + Recency..months.,data = train)
# model comparisons!!
AIC(multilinmod,linmod)
anova(multilinmod,linmod)
multipredict = cbind(predict(multilinmod,newdata=test),test[,5])
a = ROC(multipredict)
c(logLik(multilinmod),logLik(linmod))
c(LogLoss(multipredict[,2],multipredict[,1]),LogLoss(multipredict[,2],linpred))
```
Try instead a logistic regression: a generalized linear model (GLM) of the family "binomial". That is, it models the outcome variable (blood donation) as a binomial (0/1) random variable. The model "generalizes" a linear fit by passing the linear predictor through the logistic function, so fitted values are probabilities between 0 and 1 rather than arbitrary real numbers.
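To make the link function concrete, here is a small illustration (my addition, not part of the original analysis) of the logistic function, which is available in base R as `plogis()`:
```{r fig.width=5,fig.height=5,fig.cap='The logistic function maps the linear predictor to a probability'}
# p = 1 / (1 + exp(-eta)): any real-valued linear predictor becomes a probability
eta = seq(-6, 6, by = 0.1)
plot(eta, plogis(eta), type = 'l',
	xlab = 'linear predictor (eta)', ylab = 'predicted probability')
```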
```{r fig.width=5,fig.height=7,fig.cap='Performance of naive logistic regression'}
trainfit = glm(whether.he.she.donated.blood.in.March.2007 ~
Frequency..times.,family='binomial',data=train )
class(linmod)
class(trainfit)
# plot out predictions of 2 models
par(mfrow = c(1,1) )
plot(train$Frequency..times.,jitter(train$whether.he.she.donated.blood.in.March.2007))
curve( predict( trainfit, data.frame(Frequency..times.=x), type='response' ), add=TRUE )
abline(linmod,col='red')
# caution: AICs across different likelihood families (Gaussian lm vs. binomial glm)
# are not strictly comparable
AIC(trainfit,linmod,multilinmod)
# add EVERYTHING to the model
trainfit = glm(whether.he.she.donated.blood.in.March.2007 ~ Recency..months. *
Frequency..times. * Monetary..c.c..blood.
* Time..months.,family='binomial',data=train
)
# summarize it a little...
AIC(trainfit,multilinmod,linmod)
summary(trainfit)
# automated model selection
stepped = step(trainfit)
summary(stepped)
AIC(stepped)
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of naive logistic regression, by ROC'}
# do some predictions (predict.glm defaults to the link scale, i.e. log-odds;
# that's fine for ROC, which only uses the ranking of the scores)
predictor = predict.glm(stepped,newdata=test)
prediction = cbind(predictor,test[,5])
colnames(prediction) = c( 'prediction','true values')
#curve( predict( trainfit, data.frame(Frequency..times.=x), type='response' ), add=TRUE, col='blue')
# various looks at prediction success
cor(prediction[,1],prediction[,2],method='spearman')
#LogLoss(prediction[,1],prediction[,2])
a=ROC(prediction)
```
So it's okay (i.e. AUC>0.8).
```{r}
curated_fit = glm(whether.he.she.donated.blood.in.March.2007 ~
Frequency..times.
+ Time..months.,family='binomial',
data=train)
curated_prediction = predict.glm(curated_fit,newdata=test)
prediction = cbind(curated_prediction,test[,5])
```
```{r fig.width=5,fig.height=7,fig.cap = 'Performance of logistic regression with reduced model'}
a=ROC(prediction)
```
Didn't really change much (Figure 3). We lost a little AUC, but not much, in exchange for removing two explanatory variables in slavish devotion to Occam's razor. Precision seems to fall apart a bit, though. While logistic regression is nice and simple, it is not doing a super job here, so I will move on to see if anything else does better.
## Naive Bayes
Naive Bayes is an attractively simple classification technique. It resembles the initial logistic regression implemented above in that it assumes the predictor variables are independent. It uses a straightforward application of Bayes' rule to compute the probability that each observation belongs to each class. We only have a binary outcome here, but it is possible that NB will perform better for some reason.
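As a rough sketch of what `naiveBayes()` does under the hood for numeric predictors (my illustration, assuming Gaussian class-conditional densities, which is what e1071 uses for continuous variables):
```{r}
# hand-rolled naive Bayes posterior for one test point, assuming each
# predictor is Gaussian within each class (as e1071 does for numeric data)
nb_posterior = function(x, train_X, train_y) {
	classes = sort(unique(train_y))
	post = sapply(classes, function(cl) {
		X_cl = train_X[train_y == cl, , drop = FALSE]
		prior = mean(train_y == cl)                   # P(class)
		lik = prod(dnorm(x, mean = apply(X_cl, 2, mean),
			sd = apply(X_cl, 2, sd)))                 # product over predictors
		prior * lik
	})
	post / sum(post) # normalized posteriors, in class order (here: 0, then 1)
}
# example: posterior class probabilities for the first test observation
nb_posterior(as.numeric(test[1, 1:4]), train[, 1:4], train[, 5])
```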
```{r}
require(e1071)
# this function wants response to be a factor
classifier = naiveBayes(train[,1:4],as.factor(train[,5]))
class(classifier)
str(classifier)
# predict() returns hard class labels by default; cbind() coerces the factor to 1/2
bayespredict = cbind(predict(classifier,test[,-5]),test[,5])
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of Naive Bayes predictor'}
a=ROC(bayespredict)
```
Well, it turns out that Bayesian statistics is not the answer to everything (Figure 4). Performance is about the same as the reduced logistic regression model. The curve is weirdly step-like; this is most likely because we passed hard class labels to `ROC()` rather than continuous scores, so there are only a couple of distinct thresholds to sweep over.
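e1071's `predict()` can return posterior probabilities instead (`type = 'raw'`), which should give a smoother and more informative ROC curve. A quick check (my addition; the column name `'1'` is assumed to correspond to the positive class):
```{r fig.width=5,fig.height=7,fig.cap='Performance of Naive Bayes using posterior probabilities'}
# type = 'raw' returns a matrix of class posteriors, one column per class level
bayes_raw = predict(classifier, test[,-5], type = 'raw')
bayespredict_raw = cbind(bayes_raw[, '1'], test[, 5])
a = ROC(bayespredict_raw)
```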
## More kNN
Back to kNN, now on the real data. The following figures show performance for k = 2 through 5 (Figures 6-9).
```{r}
library(class)
# knn() returns a factor of class labels; convert back to numeric 0/1 before binding
nn2_pred = knn(train[,1:4], test = test[,1:4], cl = train[,5], k = 2)
nn2_predict = cbind(as.numeric(as.character(nn2_pred)), test[,5])
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN with k=2.'}
a=ROC(nn2_predict)
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN with k=3.'}
nn3_pred = knn(train[,1:4], test = test[,1:4], cl = train[,5], k = 3)
nn3_predict = cbind(as.numeric(as.character(nn3_pred)), test[,5])
a=ROC(nn3_predict)
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN with k=4.'}
nn4_pred = knn(train[,1:4], test = test[,1:4], cl = train[,5], k = 4)
nn4_predict = cbind(as.numeric(as.character(nn4_pred)), test[,5])
a=ROC(nn4_predict)
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN with k=5.'}
nn5_pred = knn(train[,1:4], test = test[,1:4], cl = train[,5], k = 5)
nn5_predict = cbind(as.numeric(as.character(nn5_pred)), test[,5])
a=ROC(nn5_predict)
```
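As with naive Bayes, these step-like ROC curves come from feeding hard class labels to `ROC()`. `class::knn()` can also report the fraction of neighbor votes via `prob = TRUE`; here is a sketch (my addition) of using that as a continuous score:
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN (k=5) using vote fractions as scores'}
# prob = TRUE attaches the winning class's vote fraction as an attribute;
# convert it to a score for class '1' so higher always means 'more likely donor'
nn5_prob = knn(train[,1:4], test[,1:4], cl = train[,5], k = 5, prob = TRUE)
votes = attr(nn5_prob, 'prob')
score1 = ifelse(nn5_prob == '1', votes, 1 - votes)
a = ROC(cbind(score1, test[,5]))
```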