RFC prediction are inconsistent when using max_depth #52

Open
skjerns opened this issue May 10, 2019 · 9 comments

Comments


skjerns commented May 10, 2019

I created a RandomForestClassifier in Python using sklearn and converted it to C using sklearn-porter. In around 10-20% of cases the prediction of the transpiled code is wrong.

I figured that the problem occurs when specifying max_depth.

Here's some code to reproduce the issue:

import numpy as np
import sklearn_porter
from sklearn.ensemble import RandomForestClassifier

train_x = np.random.rand(1000, 8)
train_y = np.random.randint(0, 4, 1000)

# with the default max_depth=None (fully grown trees), the problem does not occur
rfc = RandomForestClassifier(n_estimators=10)
rfc.fit(train_x, train_y)
porter = sklearn_porter.Porter(rfc, language='c')
print(porter.integrity_score(train_x)) # 1.0

# with max_depth=10 the integrity score drops
rfc = RandomForestClassifier(n_estimators=10, max_depth=10)
rfc.fit(train_x, train_y)
porter = sklearn_porter.Porter(rfc, language='c')
print(porter.integrity_score(train_x)) # 0.829
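For readers unfamiliar with the metric, here is a minimal sketch of what an integrity score measures. It assumes the score is simply the fraction of samples on which the original and the transpiled model predict the same class. The function and the two prediction lists below are hypothetical illustrations, not sklearn-porter's actual code or output.

```python
def integrity_score(reference_preds, ported_preds):
    """Fraction of samples where both implementations predict the same class.

    Hypothetical helper for illustration; not sklearn-porter's implementation.
    """
    if len(reference_preds) != len(ported_preds):
        raise ValueError("prediction lists must have the same length")
    if not reference_preds:
        raise ValueError("need at least one prediction")
    matches = sum(r == p for r, p in zip(reference_preds, ported_preds))
    return matches / len(reference_preds)


# Made-up predictions: the two implementations disagree on one of six samples.
sklearn_preds = [0, 2, 1, 3, 1, 0]  # stands in for rfc.predict(...)
c_preds = [0, 2, 1, 3, 2, 0]        # stands in for the generated C code
score = integrity_score(sklearn_preds, c_preds)
print(score)
```

A score of 1.0 means the ported code reproduces every prediction; anything lower means some samples are classified differently.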

I also saw that Python performs its calculations with double while the C code seems to use float. Might that be an issue? (Unfortunately, changing float -> double did not change anything.)
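The float/double question is reasonable in general, even though it was ruled out here. The sketch below is a standalone illustration that uses only the standard library: it rounds a value to 32-bit precision and shows that a `<=` split comparison can flip. The threshold and feature values are made up for the demonstration.

```python
import struct


def to_float32(value):
    """Round a Python float (a C double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


threshold = 0.1                    # split threshold kept as a double
feature = 0.1                      # feature value as a double
feature_f32 = to_float32(feature)  # same value after a round trip through float

# In pure double arithmetic the sample goes to the left child.
goes_left_double = feature <= threshold
# After rounding the feature to float it is slightly larger than the threshold.
goes_left_mixed = feature_f32 <= threshold
print(goes_left_double, goes_left_mixed)
```

Such flips only affect values that sit almost exactly on a threshold, so they would not by themselves explain a 10-20% mismatch.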

skjerns changed the title from "RFC prediction different from sklearn an C code" to "RFC prediction are inconsistent when using max_depth" on May 15, 2019

skjerns (Author) commented Jun 7, 2019

Looking further into this issue, I believe it has something to do with the final leaf probabilities. They are slightly different when the trees are not grown to full depth, so the final prediction can deviate when the class probabilities are very close to each other.

nok (Owner) commented Jun 25, 2019

Thanks for your work and the hints. I will check the outputs with more tests. Did you check any other languages, or is it only a C issue?

skjerns (Author) commented Jun 25, 2019

I did not check other languages yet, but I assume that they have the same problem. I can check tomorrow.

skjerns (Author) commented Jun 26, 2019

Checked it in Java: Same results. I assume it will be the same in other languages.

nok (Owner) commented Jun 26, 2019

Okay, thank you for double-checking. Then I will dig deeper into the original implementation, in particular into how it behaves under different max_depth settings.

skjerns (Author) commented Jun 26, 2019

I think a good way to approach this is to implement a predict_proba method. I originally assumed that each tree simply predicts a class and the forest takes the majority vote (which is how sklearn-porter implements it). However, that is not how sklearn does it, and this is likely the reason for the discrepancy.

Some more details I found in this stackoverflow comment thread:
https://stackoverflow.com/questions/30814231/using-the-predict-proba-function-of-randomforestclassifier-in-the-safe-and-rig
(see comments)

  1. About prediction precision: I insist but this is not a question of number of trees. Even with a single decision tree you should be able to get probability predictions with more than one digit. A decision tree aims at clustering the inputs based on some rules (the decision), and these clusters are the leaves of the tree. If you have a leaf with 2 non-spam emails and one spam email from your training data, then the probability prediction for any email that belongs to this leaf/cluster (with regards to the rules established by fitting the model) is: 1/3 for spam and 2/3 for non-spam. – Sebastien Jun 20 '15 at 14:49
  2. About the dependencies in predictions: Again the Sklearn definition gives the answer: the probability is computed with regards to the characteristics of the leaf (the one corresponding to the email you test): the number of instances of each class in this leaf. This is set when your model is fitted, so it only depends on the training data. In conclusion: the result is the probability of instance 1 to be spam with 60%, whatever the other 9 instances' probabilities are. – Sebastien Jun 20 '15 at 15:00
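The spam example from the quoted comments can be written out as a short sketch. The helper name and the dictionary layout are assumptions made for illustration. The only fact taken from the quote is that a leaf's class probabilities are the class fractions of the training samples that ended up in it.

```python
from fractions import Fraction


def leaf_probabilities(class_counts):
    """Turn per-class training sample counts of one leaf into probabilities."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("a leaf must contain at least one training sample")
    return {label: Fraction(count, total) for label, count in class_counts.items()}


# The leaf from the quoted example: 2 non-spam emails and 1 spam email.
leaf = {"non-spam": 2, "spam": 1}
probs = leaf_probabilities(leaf)
print(probs)
```

A fully grown tree usually has pure leaves, so every probability is 0 or 1. A depth-limited tree can stop at mixed leaves like this one.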

similarly here: https://scikit-learn.org/stable/modules/tree.html#tree

So I think if a predict_proba method is implemented correctly (instead of majority winner vote), the problems with max_depth will disappear. And another cool feature would be added, class probabilities :)
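To make the difference concrete, here is a small sketch comparing the two aggregation rules on hand-picked leaf probabilities. The per-tree vectors are invented for the example. The soft rule (argmax of the mean probability across trees) matches how scikit-learn documents `RandomForestClassifier.predict`, while the hard rule is the majority vote described above.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def hard_vote(tree_probas):
    """Each tree votes for its most likely class; the most common vote wins."""
    votes = Counter(argmax(p) for p in tree_probas)
    return min(votes, key=lambda c: (-votes[c], c))


def soft_vote(tree_probas):
    """Average the probability vectors over all trees, then take the argmax."""
    n_trees = len(tree_probas)
    n_classes = len(tree_probas[0])
    mean = [sum(p[k] for p in tree_probas) / n_trees for k in range(n_classes)]
    return argmax(mean)


# Depth-limited trees: impure leaves, two weak votes for class 1, one strong for class 0.
impure = [[0.45, 0.55], [0.45, 0.55], [0.95, 0.05]]
# Fully grown trees: pure leaves, every vector is one-hot.
pure = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

print(hard_vote(impure), soft_vote(impure))
print(hard_vote(pure), soft_vote(pure))
```

With pure leaves the mean of the one-hot vectors is just the vote share, so both rules agree. That would be consistent with the mismatch appearing only once max_depth is set.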

skjerns (Author) commented Jun 28, 2019

This seems to be the case, indeed:

So depending on implementation: predicted probability is either (a) the mean terminal leaf probability across all trees or (b) the fraction of trees voting either class. If out-of-bag(OOB) prediction, then only in trees where sample is OOB. For a single fully grown tree, I would guess the predicted probability only could be 0 or 1 for any class, because all terminal nodes are pure(same label). If the single tree is not fully grown and/or more trees are grown, then the predicted probability can be a positive rational number from 0 to 1.

https://stats.stackexchange.com/questions/193424/is-decision-tree-output-a-prediction-or-class-probabilities

So we'd need to change the internal structure such that each tree does not return the class index but a probability vector.
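A rough sketch of the proposed structure, written in Python for readability rather than in the generated C. The nested-dict tree layout and all function names are hypothetical and do not reflect sklearn-porter's templates. The point is only that leaves store a probability vector and the forest averages those vectors before picking a class.

```python
def tree_predict_proba(node, x):
    """Walk one tree and return the probability vector stored in the leaf."""
    while "proba" not in node:  # inner nodes carry a split, leaves carry "proba"
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["proba"]


def forest_predict_proba(trees, x):
    """Mean of the per-tree probability vectors."""
    per_tree = [tree_predict_proba(tree, x) for tree in trees]
    n_classes = len(per_tree[0])
    return [sum(p[k] for p in per_tree) / len(per_tree) for k in range(n_classes)]


def forest_predict(trees, x):
    """Class index with the highest mean probability."""
    proba = forest_predict_proba(trees, x)
    return max(range(len(proba)), key=proba.__getitem__)


# Two tiny hand-written trees with made-up thresholds and leaf probabilities.
tree_a = {"feature": 0, "threshold": 0.5,
          "left": {"proba": [0.75, 0.25]}, "right": {"proba": [0.1, 0.9]}}
tree_b = {"feature": 1, "threshold": 0.3,
          "left": {"proba": [0.4, 0.6]}, "right": {"proba": [0.5, 0.5]}}

sample = [0.2, 0.7]
print(forest_predict_proba([tree_a, tree_b], sample))
print(forest_predict([tree_a, tree_b], sample))
```

Returning the vector instead of a class index also gives predict_proba essentially for free, since predict reduces to an argmax over it.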

nok (Owner) commented Jul 11, 2019

Hello @skjerns, JFYI, I started to implement the predict_proba method for all listed estimators.

I began with the DecisionTreeClassifier estimator and the high-level languages. After that I will focus on the RandomForestClassifier estimator, which uses DecisionTreeClassifier as its base estimator.

@crea-psfc

Hi @nok and @skjerns, I have actually looked into this because I wanted to add feature-contribution analysis to the porter C library: https://github.com/andosa/treeinterpreter . This technique lets you extract the importance of each feature for unseen samples and uncovers the drivers of the Random Forest's final decision. It basically keeps track of the sample population before and after each split and attributes the gains/losses to the splitting feature. I'm mentioning this because it is only a short step from implementing this to implementing the predict_proba method for the forest. I am currently working on that.

@nok, let me know how you wanna proceed and I can commit on a dev branch my changes to the C templates and the __init__.py file.
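The feature-contribution idea described above can be sketched for a single tree as follows. This is a simplified illustration of the general decomposition (prediction = root value + sum of per-feature changes along the decision path). It is not the treeinterpreter code, and the tree layout and values are made up.

```python
def explain_path(node, x, n_features):
    """Split a tree prediction into a bias term plus per-feature contributions.

    Every node stores "value", the class distribution of the training samples
    that reached it. Each split credits the change in that distribution to the
    feature it splits on.
    """
    bias = list(node["value"])
    contributions = [[0.0] * len(bias) for _ in range(n_features)]
    while "feature" in node:  # leaves have no "feature" key
        go_left = x[node["feature"]] <= node["threshold"]
        child = node["left"] if go_left else node["right"]
        for k in range(len(bias)):
            contributions[node["feature"]][k] += child["value"][k] - node["value"][k]
        node = child
    return bias, contributions, list(node["value"])


# Hand-written tree: the root splits on feature 0, its left child on feature 1.
tree = {
    "value": [0.5, 0.5], "feature": 0, "threshold": 0.5,
    "left": {
        "value": [0.75, 0.25], "feature": 1, "threshold": 0.3,
        "left": {"value": [1.0, 0.0]},
        "right": {"value": [0.5, 0.5]},
    },
    "right": {"value": [0.25, 0.75]},
}

bias, contributions, prediction = explain_path(tree, [0.2, 0.7], n_features=2)
print(bias, contributions, prediction)
```

Because the walk already visits the leaf's class distribution, the same traversal yields the per-tree probability vector that a forest-level predict_proba would average.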
