---
output: html_document
---
Predicting blood donations
===============================
### Maximilian Press + Morgan Lawless
Various looks at what you can do with models, with an emphasis on parametric (GLM) models and the R model object.
We take a statistical learning point of view on the problem.
Dolph Schluter's R modeling pages are a good resource for general-purpose model fitting. https://www.zoology.ubc.ca/~schluter/R/fit-model/
## Deconstructing the model: K-Nearest Neighbor (kNN)
kNN is a nonparametric, almost stupidly simple method: for each test case, it finds the data points in the training set that are closest to that case and uses them to make a prediction. Despite its simplicity, kNN is asymptotically optimal as a predictor, in the sense that its error rate approaches the best achievable rate as the training set (and k) grows (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).
```{r fig.width=5,fig.height=5,fig.cap='A brief visual introduction to nearest neighbor'}
set.seed(667)
# just using random samples here, with mild covariation
a = runif(20) # a predictor variable
b = a+runif(20) # a variable of interest to be predicted
par(mfrow=c(1,1))
plot(a,b,xlab='predictor variable',ylab='predicted variable',ylim=c(0,2))
# a new data point to predict
c=runif(1)
d=c+runif(1)
points(c,d,pch=19)
abline(v=c-.05)
abline(v=c+.05)
neighbs = which((a > c-.05) & (a<c+.05))
neighbs # these are the nearest neighbors (not using a specific k here, just a window)
knn_est = mean(b[neighbs]) # make a prediction based on neighbs
points(c, knn_est, pch=19,col='red') # plot it
```
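The chunk above uses a fixed window around the new point rather than a fixed number of neighbors. For comparison, here is a minimal fixed-k version of the same idea (my addition, reusing the toy data `a`, `b`, and `c` from above, with k = 3):
```{r}
# fixed-k nearest neighbor prediction for the new point c
k = 3
dists = abs(a - c)           # distance from each training point's predictor
nearest = order(dists)[1:k]  # indices of the k closest training points
knn_k_est = mean(b[nearest]) # predict by averaging their outcomes
knn_k_est
```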
### Discuss!
## Predicting blood donation
Well, hopefully that was instructive. Now let's look at some actual data. This dataset is from a paper whose reference I have lost (it appears to be the UCI "Blood Transfusion Service Center" dataset); the goal is to predict who will show up to donate at blood drives, based on prior donation history.
```{r fig.width=7, fig.height=7}
trans = read.csv('transfusion.data',header=T) # read it in
head(trans)
plot(trans,pch='.')
```
Obviously some of these variables are more meaningful than others. I will somewhat naively fit the model using all of them, ignoring the possibility of interactions.
We will use prediction to evaluate the model: I chose to randomly sample 500 observations for training, and to test on the remaining 248.
```{r}
# training set
set.seed(666) # EXTREME
dim(trans)
trainindex = sample(1:nrow(trans),500)
train = trans[trainindex,]
# test set
test = trans[!(1:nrow(trans) %in% trainindex),]
# some utility functions
source('roc.R')
```
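The `roc.R` file itself isn't shown here. Judging from how `ROC()` is called below (a two-column matrix of scores and 0/1 labels, plotting a curve and reporting AUC), a minimal stand-in might look like the following sketch. This is my guess at the interface, not the actual contents of `roc.R`, so it is named differently to avoid clobbering the real function:
```{r}
# hypothetical stand-in for roc.R's ROC(): column 1 = scores, column 2 = 0/1 labels
ROC_sketch = function(pred) {
	scores = pred[, 1]
	labels = pred[, 2]
	# sweep thresholds from high to low, computing true/false positive rates
	thresholds = sort(unique(scores), decreasing = TRUE)
	tpr = sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
	fpr = sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
	plot(c(0, fpr, 1), c(0, tpr, 1), type = 'l',
		xlab = 'False positive rate', ylab = 'True positive rate')
	abline(0, 1, lty = 2) # chance line
	# AUC via the rank (Mann-Whitney) formula, counting ties as half
	pairs_gt = outer(scores[labels == 1], scores[labels == 0], '>')
	pairs_eq = outer(scores[labels == 1], scores[labels == 0], '==')
	mean(pairs_gt) + 0.5 * mean(pairs_eq)
}
```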
First, fit a linear model, which is ok but not very interesting.
```{r fig.width=5, fig.height=5,fig.cap='Predictions of linear model (training only)'}
# fit the model
linmod = lm(whether.he.she.donated.blood.in.March.2007 ~
Frequency..times., data = train)
str(linmod)
plot(train$Frequency..times.,
jitter(train$whether.he.she.donated.blood.in.March.2007),
xlab='# donation events',ylab='donated in test period (jittered)',
cex = .5 , main='Training set')
# things you can do with the fitted model object
abline(linmod) # add the predicted function to the plot just generated
# return various useful information about the model:
summary(linmod) # print a lot of results, in semi-human-readable table
coef(linmod) # coefficients (parameters)
confint(linmod) # confidence intervals
resid(linmod)[1:10] # residuals on the model - printing out only first ten
anova(linmod) # anova table
# this would plot lots of model fit info, which may or may not be useful:
plot(linmod)
# alternate visualization method
require(visreg)
visreg(linmod)
```
So we got a low p-value, which is good, right? Problem solved, everyone go home.
Except that this is obviously a really crappy model. We can show this by predicting test values (plugging new data that wasn't used to build the model into the fitted model function) and comparing those predictions to the actual values of the test outcome.
```{r fig.width=5, fig.height=5, fig.cap = 'Predictions vs. true values from linear model on test data'}
linpred = predict(linmod,newdata=test)
linpredplot = plot(
linpred,
jitter(test$whether.he.she.donated.blood.in.March.2007),
ylab='True value (jittered)', xlab='Predicted value',
ylim = c(-.2,1.2), xlim = c(0,1), cex = .5, main='Test set')
points( c(0,1), c(0,1), cex = 2, pch = 19 )
```
```{r fig.width=5, fig.height=7, fig.cap='ROC analysis of linear model on test'}
prediction = cbind(linpred,test[,5])
a = ROC(prediction)
```
Not great. There are also some numerical summaries of model fit that various people use (besides $R^2$).
```{r}
# Akaike information criterion:
# 2(num parameters) - 2ln( L(model) ) [lower is better!!!]
AIC(linmod)
# stolen from https://www.kaggle.com/c/bioresponse/forums/t/1576/r-code-for-logloss
# prettied up from that to make more readable
LogLoss = function(actual, predicted) {
	# mean negative log-likelihood of the observed 0/1 outcomes
	# for explanation see https://en.wikipedia.org/wiki/Loss_function
	result = -1/length(actual) *
		sum(
		actual * log(predicted) +     # contribution from actual == 1 cases
		(1-actual) * log(1-predicted) # contribution from actual == 0 cases
		)
	return(result)
}
# note that this evaluates the model on the held-out test set
LogLoss( test$whether.he.she.donated.blood.in.March.2007, linpred )
# the AUC from the ROC curve above is also such a measure.
# you can even use a U-test to sort of evaluate the quality of the predictions:
wilcox.test(
linpred[test$whether.he.she.donated.blood.in.March.2007 == 1],
linpred[test$whether.he.she.donated.blood.in.March.2007 == 0]
)
# or even something as naive as correlations
cor(test$whether.he.she.donated.blood.in.March.2007, linpred,method='spearman')
```
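One caveat worth flagging (my note, not from the original analysis): a linear model can produce predictions outside (0,1), in which case the log terms in `LogLoss()` are undefined and the result is `NaN`. A common workaround is to clamp predictions into the open interval first:
```{r}
# clamp predictions into (0,1) before taking logs, to avoid log(<=0) = NaN/-Inf
eps = 1e-6
linpred_clamped = pmin(pmax(linpred, eps), 1 - eps)
LogLoss(test$whether.he.she.donated.blood.in.March.2007, linpred_clamped)
```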
We can in principle make the model more complicated by adding more variables, and compare it to the old model.
```{r}
multilinmod = lm(whether.he.she.donated.blood.in.March.2007 ~
Frequency..times. + Recency..months.,data = train)
# model comparisons!!
AIC(multilinmod,linmod)
anova(multilinmod,linmod)
multipredict = cbind(predict(multilinmod,newdata=test),test[,5])
a = ROC(multipredict)
c(logLik(multilinmod),logLik(linmod))
c(LogLoss(multipredict[,2],multipredict[,1]),LogLoss(multipredict[,2],linpred))
```
Try instead a logistic regression: a generalized linear model (GLM) of the family "binomial". That is, it models the outcome variable (blood donation) as a binomial (0/1) random variable. The model "generalizes" a linear fit by passing the linear predictor through the logistic function, so fitted values are probabilities between 0 and 1 rather than arbitrary real numbers.
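To make the link function concrete, here is a small illustration (my addition, not part of the original analysis) of the logistic function, which is available in base R as `plogis()`:
```{r fig.width=5,fig.height=5,fig.cap='The logistic function maps the linear predictor to a probability'}
# p = 1 / (1 + exp(-eta)): any real-valued linear predictor becomes a probability
eta = seq(-6, 6, by = 0.1)
plot(eta, plogis(eta), type = 'l',
	xlab = 'linear predictor (eta)', ylab = 'predicted probability')
```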
```{r fig.width=5,fig.height=7,fig.cap='Performance of naive logistic regression'}
trainfit = glm(whether.he.she.donated.blood.in.March.2007 ~
Frequency..times.,family='binomial',data=train )
class(linmod)
class(trainfit)
# plot out predictions of 2 models
par(mfrow = c(1,1) )
plot(train$Frequency..times.,jitter(train$whether.he.she.donated.blood.in.March.2007))
curve( predict( trainfit, data.frame(Frequency..times.=x), type='response' ), add=TRUE )
abline(linmod,col='red')
# caution: AICs across different likelihood families (Gaussian lm vs. binomial glm)
# are not strictly comparable
AIC(trainfit,linmod,multilinmod)
# add EVERYTHING to the model
trainfit = glm(whether.he.she.donated.blood.in.March.2007 ~ Recency..months. *
Frequency..times. * Monetary..c.c..blood.
* Time..months.,family='binomial',data=train
)
# summarize it a little...
AIC(trainfit,multilinmod,linmod)
summary(trainfit)
# automated model selection
stepped = step(trainfit)
summary(stepped)
AIC(stepped)
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of naive logistic regression, by ROC'}
# do some predictions (predict.glm defaults to the link scale, i.e. log-odds;
# that's fine for ROC, which only uses the ranking of the scores)
predictor = predict.glm(stepped,newdata=test)
prediction = cbind(predictor,test[,5])
colnames(prediction) = c( 'prediction','true values')
#curve( predict( trainfit, data.frame(Frequency..times.=x), type='response' ), add=TRUE, col='blue')
# various looks at prediction success
cor(prediction[,1],prediction[,2],method='spearman')
#LogLoss(prediction[,1],prediction[,2])
a=ROC(prediction)
```
So it's okay (i.e. AUC>0.8).
```{r}
curated_fit = glm(whether.he.she.donated.blood.in.March.2007 ~
Frequency..times.
+ Time..months.,family='binomial',
data=train)
curated_prediction = predict.glm(curated_fit,newdata=test)
prediction = cbind(curated_prediction,test[,5])
```
```{r fig.width=5,fig.height=7,fig.cap = 'Performance of logistic regression with reduced model'}
a=ROC(prediction)
```
Didn't really change much (Figure 3). We lost a little AUC, but not much, in exchange for removing two explanatory variables in slavish devotion to Occam's razor. Precision seems to fall apart a bit, though. While logistic regression is nice and simple, it is not doing a super job here, so I will move on to see if anything else does better.
## Naive Bayes
Naive Bayes is an attractively simple classification technique. It resembles the initial logistic regression implemented above in that it assumes the predictor variables are independent. It uses a straightforward application of Bayes' rule to compute the probability that each observation belongs to each class. We only have a binary outcome here, but it is possible that NB will perform better for some reason.
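As a rough sketch of what `naiveBayes()` does under the hood for numeric predictors (my illustration, assuming Gaussian class-conditional densities, which is what e1071 uses for continuous variables):
```{r}
# hand-rolled naive Bayes posterior for one test point, assuming each
# predictor is Gaussian within each class (as e1071 does for numeric data)
nb_posterior = function(x, train_X, train_y) {
	classes = sort(unique(train_y))
	post = sapply(classes, function(cl) {
		X_cl = train_X[train_y == cl, , drop = FALSE]
		prior = mean(train_y == cl)                   # P(class)
		lik = prod(dnorm(x, mean = apply(X_cl, 2, mean),
			sd = apply(X_cl, 2, sd)))                 # product over predictors
		prior * lik
	})
	post / sum(post) # normalized posteriors, in class order (here: 0, then 1)
}
# example: posterior class probabilities for the first test observation
nb_posterior(as.numeric(test[1, 1:4]), train[, 1:4], train[, 5])
```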
```{r}
require(e1071)
# this function wants response to be a factor
classifier = naiveBayes(train[,1:4],as.factor(train[,5]))
class(classifier)
str(classifier)
# predict() returns hard class labels by default; cbind() coerces the factor to 1/2
bayespredict = cbind(predict(classifier,test[,-5]),test[,5])
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of Naive Bayes predictor'}
a=ROC(bayespredict)
```
Well, it turns out that Bayesian statistics is not the answer to everything (Figure 4). Performance is about the same as the reduced logistic regression model. The curve is weirdly step-like; this is most likely because we passed hard class labels to `ROC()` rather than continuous scores, so there are only a couple of distinct thresholds to sweep over.
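e1071's `predict()` can return posterior probabilities instead (`type = 'raw'`), which should give a smoother and more informative ROC curve. A quick check (my addition; the column name `'1'` is assumed to correspond to the positive class):
```{r fig.width=5,fig.height=7,fig.cap='Performance of Naive Bayes using posterior probabilities'}
# type = 'raw' returns a matrix of class posteriors, one column per class level
bayes_raw = predict(classifier, test[,-5], type = 'raw')
bayespredict_raw = cbind(bayes_raw[, '1'], test[, 5])
a = ROC(bayespredict_raw)
```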
## More kNN
Back to kNN, now on the real data. The following figures show performance for k = 2 through 5 (Figures 6-9).
```{r}
library(class)
# knn() returns a factor of class labels; convert back to numeric 0/1 before binding
nn2_pred = knn(train[,1:4], test = test[,1:4], cl = train[,5], k = 2)
nn2_predict = cbind(as.numeric(as.character(nn2_pred)), test[,5])
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN with k=2.'}
a=ROC(nn2_predict)
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN with k=3.'}
nn3_pred = knn(train[,1:4], test = test[,1:4], cl = train[,5], k = 3)
nn3_predict = cbind(as.numeric(as.character(nn3_pred)), test[,5])
a=ROC(nn3_predict)
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN with k=4.'}
nn4_pred = knn(train[,1:4], test = test[,1:4], cl = train[,5], k = 4)
nn4_predict = cbind(as.numeric(as.character(nn4_pred)), test[,5])
a=ROC(nn4_predict)
```
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN with k=5.'}
nn5_pred = knn(train[,1:4], test = test[,1:4], cl = train[,5], k = 5)
nn5_predict = cbind(as.numeric(as.character(nn5_pred)), test[,5])
a=ROC(nn5_predict)
```
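As with naive Bayes, these step-like ROC curves come from feeding hard class labels to `ROC()`. `class::knn()` can also report the fraction of neighbor votes via `prob = TRUE`; here is a sketch (my addition) of using that as a continuous score:
```{r fig.width=5,fig.height=7,fig.cap='Performance of kNN (k=5) using vote fractions as scores'}
# prob = TRUE attaches the winning class's vote fraction as an attribute;
# convert it to a score for class '1' so higher always means 'more likely donor'
nn5_prob = knn(train[,1:4], test[,1:4], cl = train[,5], k = 5, prob = TRUE)
votes = attr(nn5_prob, 'prob')
score1 = ifelse(nn5_prob == '1', votes, 1 - votes)
a = ROC(cbind(score1, test[,5]))
```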