Code Bug Fix: Latest XGBoost and Sklearn giving error

Original Source Link

from xgboost import XGBClassifier

xgb = XGBClassifier(objective="binary:logistic", n_estimators=100, random_state=42, eval_metric=["auc"])

xgb.fit(X_train, y_train)

KeyError                                  Traceback (most recent call last)

~/.pyenv/versions/3.7.4/envs/py3env/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj, include, exclude)
    968
    969 if method is not None:
--> 970     return method(include=include, exclude=exclude)
    971 return None
    972 else:

~/.pyenv/versions/3.7.4/envs/py3env/lib/python3.7/site-packages/sklearn/base.py in _repr_mimebundle_(self, **kwargs)
    461
    462 def get_indices(self, i):
--> 463     """Row and column indices of the i'th bicluster.
    464
    465     Only works if rows_ and columns_ attributes exist.

~/.pyenv/versions/3.7.4/envs/py3env/lib/python3.7/site-packages/sklearn/base.py in __repr__(self, N_CHAR_MAX)
    277 right_lim = re.match(regex, repr_[::-1]).end()
    278
--> 279 ellipsis = '...'
    280 if left_lim + len(ellipsis) < len(repr_) - right_lim:
    281     # Only add ellipsis if it results in a shorter repr

~/.pyenv/versions/3.7.4/lib/python3.7/pprint.py in pformat(self, object)
    142 def pformat(self, object):
    143     sio = _StringIO()
--> 144     self._format(object, sio, 0, 0, {}, 0)
    145     return sio.getvalue()
    146

~/.pyenv/versions/3.7.4/lib/python3.7/pprint.py in _format(self, object, stream, indent, allowance, context, level)
    159     self._readable = False
    160     return
--> 161 rep = self._repr(object, context, level)
    162 max_width = self._width - indent - allowance
    163 if len(rep) > max_width:

~/.pyenv/versions/3.7.4/lib/python3.7/pprint.py in _repr(self, object, context, level)
    391 def _repr(self, object, context, level):
    392     repr, readable, recursive = self.format(object, context.copy(),
--> 393                                             self._depth, level)
    394     if not readable:
    395         self._readable = False

~/.pyenv/versions/3.7.4/envs/py3env/lib/python3.7/site-packages/sklearn/utils/_pprint.py in format(self, object, context, maxlevels, level)
    168 def format(self, object, context, maxlevels, level):
    169     return _safe_repr(object, context, maxlevels, level,
--> 170                       changed_only=self._changed_only)
    171
    172 def _pprint_estimator(self, object, stream, indent, allowance, context,

~/.pyenv/versions/3.7.4/envs/py3env/lib/python3.7/site-packages/sklearn/utils/_pprint.py in _safe_repr(object, context, maxlevels, level, changed_only)
    412 if changed_only:
    413     params = _changed_params(object)
--> 414 else:
    415     params = object.get_params(deep=False)
    416 components = []

~/.pyenv/versions/3.7.4/envs/py3env/lib/python3.7/site-packages/sklearn/utils/_pprint.py in _changed_params(estimator)
     96 init_params = {name: param.default for name, param in init_params.items()}
     97 for k, v in params.items():
---> 98 if (repr(v) != repr(init_params[k]) and
     99     not (is_scalar_nan(init_params[k]) and is_scalar_nan(v))):
    100     filtered_params[k] = v

KeyError: 'base_score'

Name: xgboost
Version: 1.0.2

scikit-learn-0.23.0

Python 3

It looks like this was a bug triggered by the scikit-learn update; see https://github.com/dmlc/xgboost/issues/5668, where it appears to have been resolved.
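
For anyone hitting this with the same versions, here is a minimal hedged sketch of a workaround. The key observation from the traceback is that the KeyError is raised while IPython renders the estimator's repr, not during fit itself; variable names below are illustrative.

import sklearn
import xgboost
from xgboost import XGBClassifier

print(xgboost.__version__, sklearn.__version__)  # 1.0.2 / 0.23.0 is the failing combination

clf = XGBClassifier(objective="binary:logistic", n_estimators=100,
                    random_state=42, eval_metric=["auc"])

# Assigning the return value (or ending the notebook cell with a semicolon)
# suppresses the automatic display that triggers the failing repr.
_ = clf.fit(X_train, y_train)

# Per the linked issue, upgrading xgboost (e.g. pip install -U xgboost)
# reportedly fixes the repr itself.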


Server Bug Fix: Main options on how to deal with imbalanced data

Original Source Link

As far as I can tell, broadly speaking, there are three ways of dealing with binary imbalanced datasets:

Option 1:

  • Create k-fold Cross-Validation samples randomly (or even better create k-fold samples using Stratified k-fold: https://scikit-learn.org/0.16/modules/generated/sklearn.cross_validation.StratifiedKFold.html ).
  • For each fold apply a resampling technique (upsampling, downsampling or a combination of both) separately on the "training" and "test" sets.
  • Use a "traditional" metric for evaluation: for instance the AUC of the ROC curve (TP Rate vs FP Rate).

Option 2:

  • Create k-fold Cross-Validation samples randomly (or even better create k-fold samples using Stratified k-fold).
  • Do not apply any resampling technique.
  • Use an "alternative" metric for evaluation: for instance the AUC of the Precision-Recall curve or something like the F-score (the harmonic mean of Precision and Recall).

Option 3:

  • Use something like XGBoost and tune scale_pos_weight (https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html); see the sketch after this list.
  • Create k-fold Cross-Validation samples randomly (or even better create k-fold samples using Stratified k-fold).
  • Use a "traditional" metric for evaluation: for instance the AUC of the ROC curve (TP Rate vs FP Rate).
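
As a rough illustration of Option 3 (the dataset, names and parameters are assumptions, not from the original question): the XGBoost docs suggest setting scale_pos_weight to roughly the ratio of negative to positive examples.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Assume X, y form a binary classification dataset with y in {0, 1}.
neg, pos = np.bincount(y)
clf = XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=neg / pos,   # common starting point suggested in the XGBoost docs
    eval_metric="auc",
)

# Stratified folds keep the class ratio roughly identical in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())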

My main question is whether I am interpreting the options correctly. Is there any conceptual mistake in what I'm saying? Is it appropriate to use Stratified k-fold in all three cases when dealing with imbalance? Is it unnecessary to apply any resampling when using XGBoost and tuning scale_pos_weight? When some resampling is applied (Options 1 and 3), does it make sense to use a "traditional" metric but not an "alternative" one? In general, does the resampling have to be applied separately to the training and test sets? And so on.

Also, it would be nice if you have any good references on SMOTE and ROSE: how they work, how to apply them, and how to use them with Python.
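
Regarding the SMOTE request, here is a minimal hedged sketch using the imbalanced-learn package (that library choice is my assumption; ROSE is, as far as I know, mainly an R package). The imblearn pipeline applies the sampler only when fitting, i.e. only to the training folds.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Assume X, y form a binary classification dataset.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),                       # oversample minority class on training folds only
    ("clf", XGBClassifier(objective="binary:logistic", eval_metric="auc")),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(pipe, X, y, cv=cv, scoring="average_precision").mean())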

Unbalanced datasets are not a problem, and oversampling will not solve a non-problem. Please note Matthew Drury's highly upvoted comment on that thread:

Honestly, knowing there is someone else out there that is mystified by the endless class balancing questions is comforting.

Many of the purported problems probably stem from using accuracy as a KPI, which is a bad idea.


Code Bug Fix: How to save XGBoost model trained using Python sklearn for easy portability to Java?

Original Source Link

I am using RandomizedSearchCV to train an XGBoost model with the scikit-learn API for XGBoost.

After training this model, how could I export it into the XGBoost internal format?
https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.save_model

I then need to give this model to a team that uses the Java XGBoost library and imports the model like this:

Booster booster = XGBoost.loadModel("model.bin");

Below is the RandomizedSearchCV setup I have with scikit-learn in Python. The pipelines and hyperparameters currently hold only the XGBoost algorithm:

import xgboost as xgb
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

pipelines = {
    'xgb' : make_pipeline(xgb.XGBClassifier(random_state=777)),
}

xgb_hyperparameters = {
    'xgbclassifier__objective': ['binary:logistic'],
    'xgbclassifier__n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600],
    'xgbclassifier__learning_rate': [0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    'xgbclassifier__max_depth': [3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30],
    'xgbclassifier__min_child_weight': [1, 3, 4, 5, 6, 7, 8, 9],
    'xgbclassifier__gamma': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
}


hyperparameters = {
    'xgb' : xgb_hyperparameters
}


fitted_models = {}

for model_name, model_pipeline in pipelines.items():
    print('\nStarting to fit: {}'.format(model_name))

    # Fit each model with a randomized hyperparameter search
    model = RandomizedSearchCV(
        pipelines[model_name],
        hyperparameters[model_name],
        n_iter    = 15,
        n_jobs    = -1, # Use all my computer's power
        cv        = 5,  # The default value of this will change to 5 in version 0.22
        verbose   = 20,
        scoring   = 'precision'
    )

    model.fit(x_train, y_train)
    fitted_models[model_name] = model

    # Calculate predictions
    y_pred = model.predict(x_test)  # y_pred = predicted Y value
    y_pred_proba_prep = model.predict_proba(x_test)
    y_pred_proba = [p[1] for p in y_pred_proba_prep]  # probability of the positive class

    # Calculate all the performance scores:
    roc_score = roc_auc_score(y_test, y_pred_proba)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print('''\n\nModel {} has been fitted.\n
          ROC AUC: {}\n
          Precision: {}\n
          Recall: {}\n
          F1: {}\n
          '''.format(model_name, roc_score, precision, recall, f1)
         )
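
One hedged way to get from the fitted search object to a file the Java loader can read is to pull the trained wrapper out of the best pipeline and save its underlying Booster. The step name below comes from make_pipeline's default naming, and the filename is arbitrary:

# After model.fit(...) has finished:
best_pipeline = fitted_models['xgb'].best_estimator_

# make_pipeline names each step after its class, i.e. 'xgbclassifier'.
best_xgb = best_pipeline.named_steps['xgbclassifier']

# Save in XGBoost's internal binary format, which the Java API's
# XGBoost.loadModel("model.bin") is intended to read.
best_xgb.get_booster().save_model('model.bin')

Note that only the booster itself is exported this way; any preprocessing steps in the pipeline would have to be reproduced on the Java side.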


Server Bug Fix: What is the proper way to use early stopping with cross-validation?

Original Source Link

I am not sure what the proper way is to use early stopping with cross-validation for a gradient boosting algorithm. For a simple train/validation split, we can use the validation set as the evaluation dataset for early stopping, and when refitting we use the best number of iterations.

But in the case of cross-validation like k-fold, my intuition would be to use the validation set of each fold as the evaluation dataset for early stopping, but that means the best number of iterations would differ from one fold to another. So when refitting, what do we use as the final best number of iterations? The mean?

Thanks!

I suspect this is a "no free lunch" situation, and the best thing to do is to experiment with subsets of your data (or, ideally, similar data disjoint from your training data) to see how the final model's ideal number of estimators compares to those of the CV iterations.

For example, if your validation performance rises sharply with additional estimators, then levels out, and finally decreases very slowly, then going too far isn't such a problem but cutting off early is. If instead your validation performance grows slowly to a peak but then plummets with overfitting, then you'll want to set a smaller number of estimators for the final model. And then there are all the other considerations for your model aside from straight validation score; maybe you're particularly averse to overfitting and want to set a smaller number of estimators, say the minimum among the CV iterations.

Another wrench: with more data, your model may want more estimators than any of the cv-estimates. If you have the resources to experiment, also look into this.

Finally, you may consider leaving an early-stopping validation set aside even for the final model. That trades away some extra training data for the convenience of not needing to estimate the optimal number of estimators as above.
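
A rough sketch of that last suggestion, using the fit-time early-stopping arguments from the XGBoost 1.x era; the split size, parameters and variable names are illustrative assumptions:

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Assume X, y hold all available training data; carve out a small early-stopping set.
X_fit, X_es, y_fit, y_es = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)

final_model = XGBClassifier(n_estimators=5000, learning_rate=0.05)
final_model.fit(
    X_fit, y_fit,
    eval_set=[(X_es, y_es)],
    eval_metric="auc",
    early_stopping_rounds=50,   # stop once held-out AUC has not improved for 50 rounds
    verbose=False,
)
print(final_model.get_booster().best_iteration)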

I think some answers to (and comments about) related questions are well addressed in these posts:

  1. https://stats.stackexchange.com/q/402403
  2. https://stats.stackexchange.com/q/361494

In my mind, the TL;DR summary as it relates to your question is that after cross-validation one could (or maybe should) retrain a model using a single very large training set, with a small validation set left in place to determine an iteration at which to stop early. While one certainly can think about ways to determine an early-stopping parameter from the cross-validation folds, and then use all of the data for training the final model, it is not at all clear that this will result in the best performance. It seems reasonable to think that simply using cross-validation to test model performance and determine other hyperparameters, and then retaining a small validation set to determine the early-stopping parameter for the final model training, may yield the best performance.

If one wants to proceed as you suggest, using cross-validation to train many different models on different folds, each set to stop early based on its own validation set, and then use these cross-validation folds to determine an early-stopping parameter for a final model trained on all of the data, my inclination would be to use the mean, as you suggest. This is just a hunch, and I have no evidence to support it (though it does seem to be an opinion mentioned in numerous reputable-seeming sources). I would suggest testing the performance of this choice against other candidates, such as taking the max or min, if you are set on proceeding in this way. I wouldn't take anyone's word for it being the best way to proceed unless they provide proof or evidence for their assertion.

Finally, I want to mention that if one is not necessarily interested in constructing a newly trained final model after cross-validation, but rather just wants to obtain predictions for a specific instance of a problem, a third route is to forgo training a final model altogether. By this I mean that one could train one model for each fold using cross-validation, but record, while the cross-validation loop is running, the predictions that each fold's model makes for the test set. At the end of cross-validation, one is left with one trained model per fold (each with its own early-stopping iteration), as well as one prediction list for the test set from each fold's model. Finally, one can average these predictions across folds to produce a final prediction list for the test set (or combine the individual prediction lists in any other way).
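
A minimal sketch of that third route, with each fold's model stopping early on its own validation fold and the test-set predictions averaged at the end; data and parameter names are illustrative, and X, y are assumed to be NumPy arrays so they can be indexed by fold indices:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# Assume X, y are the training data and X_test is the set we want predictions for.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_preds = []

for train_idx, valid_idx in skf.split(X, y):
    model = XGBClassifier(n_estimators=5000, learning_rate=0.05)
    model.fit(
        X[train_idx], y[train_idx],
        eval_set=[(X[valid_idx], y[valid_idx])],   # each fold stops on its own validation set
        early_stopping_rounds=50,
        verbose=False,
    )
    test_preds.append(model.predict_proba(X_test)[:, 1])

# Average the per-fold predictions instead of training a final model.
final_test_pred = np.mean(test_preds, axis=0)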

Note: This response may be more appropriate as a comment since I don't provide an answer to the question, but it was a bit long for that.


Code Bug Fix: XGBoost native functionality vs Scikit Learn wrapper

Original Source Link

I've recently been playing around with XGBoost for Python. I'm pretty familiar with the scikit-learn library, so my first inclination was to use the scikit-learn wrapper provided by XGB.

Then tonight I started playing around with the native functions provided by XGB. The main difficulty I faced was tuning parameters without the benefit of GridSearchCV. I wrote my own very clunky function to do it for me, which I don't generally like doing. I'm not a good enough programmer to trust myself.

Other than that, the native library was pretty straightforward, and I could see no definite reason to use one over the other.

What extra functionality does the native library provide over the wrapper? Or vice versa?
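
For concreteness, here is a hedged sketch of the same model through both interfaces (parameters and data names are illustrative assumptions). One concrete difference is that the native API has its own cross-validation helper with built-in early stopping, while the wrapper plugs directly into scikit-learn's search tools:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Native interface: data goes into a DMatrix, parameters into a plain dict,
# and xgb.cv provides cross-validation with built-in early stopping.
dtrain = xgb.DMatrix(X_train, label=y_train)
cv_results = xgb.cv(
    {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1},
    dtrain,
    num_boost_round=500,
    nfold=5,
    metrics="auc",
    early_stopping_rounds=20,
)

# Scikit-learn wrapper: the same model drops straight into GridSearchCV and pipelines.
grid = GridSearchCV(
    xgb.XGBClassifier(objective="binary:logistic"),
    {"max_depth": [3, 4, 5], "learning_rate": [0.05, 0.1]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)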
