Code Bug Fix: Train Test Split sklearn based on group variable

Original Source Link

My X is as follows:

Unique ID   Exp start date   Value   Status
001         01/01/2020       4000    Closed
001         12/01/2019       4000    Archived
002         01/01/2020       5000    Closed
002         12/01/2019       5000    Archived

I want to make sure that none of the unique IDs that were in training are included in testing. I am using sklearn test train split. Is this possible?

Note that the stratify parameter of train_test_split will not do this. Stratification preserves the proportion of each label across the training and test sets, so setting it to X['Unique ID'] would, where possible, place every ID in both sets, which is the opposite of what you want; with only two rows per ID it will simply raise an error, since each stratum needs enough members for both splits.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=df['Unique ID'].values)  # does NOT keep IDs apart

I believe you need GroupShuffleSplit (documentation here).

import numpy as np
from sklearn.model_selection import GroupShuffleSplit
X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])

gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)

for train_idx, test_idx in gss.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)

TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]

It can be seen from above that train/test indices are created based on the groups variable.

In your case, the Unique ID column should be used as groups.
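Applied to the question's DataFrame, a minimal sketch might look like the following (the frame below is a made-up stand-in for the data shown above):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame mirroring the question's layout (values are illustrative)
df = pd.DataFrame({
    'Unique ID': ['001', '001', '002', '002', '003', '003'],
    'Value': [4000, 4000, 5000, 5000, 6000, 6000],
    'Status': ['Closed', 'Archived', 'Closed', 'Archived', 'Closed', 'Archived'],
})

gss = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df['Unique ID']))

train, test = df.iloc[train_idx], df.iloc[test_idx]
# No ID appears in both sets
assert set(train['Unique ID']).isdisjoint(set(test['Unique ID']))
```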


Server Bug Fix: What happens if at leaf node both classes have same number of samples?

Original Source Link

I analyzed a small dataset with three features, so I set the decision tree's max_depth to 3. In doing so I found something interesting: there was a leaf node with an equal number of samples from both classes, yet the tree still chose one class. Now I am interested to know how the class is decided in such a scenario. Is it random, or is some other criterion used? I have attached an image of my decision tree model to illustrate the scenario.

This is an implementation detail, and I wouldn’t necessarily rely on this behavior, but presently in sklearn, it will choose the “first” class.

The predict method computes the probability predictions and then takes the argmax, which in the case of a tie returns the first one.
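The tie-breaking is easy to see with a small check, since NumPy's argmax returns the first index when several entries share the maximum:

```python
import numpy as np

# A perfectly tied probability row: argmax picks index 0, i.e. the "first" class
proba = np.array([[0.5, 0.5]])
pred_class = np.argmax(proba, axis=1)
print(pred_class)  # [0]
```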


Code Bug Fix: How to prepare a training set for a Support Vector Regression model which is able to include the rows until row number 1000?

Original Source Link

I have a data frame, df, which has more than 2000 rows and 15 columns, including an expense column. I have completed all of the preprocessing for SVR. Now I want the training set to include the rows up to row number 1000.

The test set will have one single row, row number 1001.

For preparing the training and testing the model I need to take the expense column as the target value and all other columns will be taken as the features.

But I only know the method for splitting it into a training set (50%) and a testing set (50%), which I included below:

from sklearn.svm import SVR
import pandas
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20

csv = pandas.read_csv('data.csv')
train, test = train_test_split(csv, train_size=0.5)

If I prepare the training and testing sets as mentioned above, how should I write the code?

Is this what you want?

csv = pandas.read_csv('data.csv')
# .loc[:1000] keeps rows up to label 1000; [[1001]] keeps row 1001 as a one-row DataFrame
train, test = csv.loc[:1000], csv.loc[[1001]]

To then split the train test sets further into X and Y, simply do:

train_X, train_Y = train.drop(columns=['expense']), train['expense']
test_X, test_Y = test.drop(columns=['expense']), test['expense']

If instead you want to use the column ‘use’ as a response, the following should work:

train, test = dataset.loc[:1000], dataset.loc[1001]

train_X, train_y = train.drop(columns=['use']), train['use']
test_X, test_y = test.drop(columns=['use']), test['use']

SupportVectorRefModel = SVR(), train_y)


Server Bug Fix: Why can’t scikit-learn SVM solve two concentric circles?

Original Source Link

Consider the following dataset (code for generating it is at the bottom of the post):
[figure: scatter plot of the two concentric circles]

Running the following code:

from sklearn.svm import SVC
model_2 = SVC(kernel='rbf', degree=2, gamma='auto', C=100), y_train)
print('accuracy (train): %5.2f'%(metric(y_train, model_2.predict(X_train))))
print('accuracy (test): %5.2f'%(metric(y_test, model_2.predict(X_test))))
print('Number of support vectors:', sum(model_2.n_support_))

I get the following output:

accuracy (train):  0.64
accuracy (test):  0.26
Number of support vectors: 55

I also tried with varying degrees of polynomial kernel and got more or less the same results.

So why does it do such a poor job? I've just learned about SVMs, and I would have thought that a polynomial kernel of 2nd degree could project these points onto a paraboloid so that the result would be linearly separable. Where am I going wrong here?

Reference: The starter code for the snippets in this post comes from this course

Code for generating data:

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.preprocessing

data, labels = sklearn.datasets.make_circles()
idx = np.arange(len(labels))
# train on a random 2/3 and test on the remaining 1/3
idx_train = idx[:2*len(idx)//3]
idx_test = idx[2*len(idx)//3:]
X_train = data[idx_train]
X_test = data[idx_test]

y_train = 2 * labels[idx_train] - 1  # binary -> spin
y_test = 2 * labels[idx_test] - 1

scaler = sklearn.preprocessing.StandardScaler()
normalizer = sklearn.preprocessing.Normalizer()

X_train = scaler.fit_transform(X_train)
X_train = normalizer.fit_transform(X_train)

X_test = scaler.fit_transform(X_test)
X_test = normalizer.fit_transform(X_test)
plt.figure(figsize=(6, 6))
plt.scatter(data[labels == 0, 0], data[labels == 0, 1], color='navy')
plt.scatter(data[labels == 1, 0], data[labels == 1, 1], color='c')

Let’s start with warnings:

  1. All the preprocessing should be fitted on the training set only, then applied to the test set:

    X_test = scaler.transform(X_test)
    X_test = normalizer.transform(X_test)
  2. degree is a hyperparameter for the polynomial kernel and is ignored unless the kernel is poly, so either use it with poly:

    model_2 = SVC(kernel='poly', degree=2, gamma='auto', C=100)

    or drop it with rbf:

    model_2 = SVC(kernel='rbf', gamma='auto', C=100)
  3. While debugging, print the final dataset after going through preprocessing to see if you’ve destroyed the dataset:

[figure: the dataset after preprocessing]

Do not blindly apply preprocessing. Remove the normalisation step: Normalizer rescales each sample to unit norm, which projects both circles onto the same circle and destroys the class structure. Without it you'll have 100% accuracy.
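A minimal corrected version of the pipeline (same parameters as the post; the split here is a simple 2/3 - 1/3 cut, so the exact accuracy may vary slightly):

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.svm import SVC

data, labels = datasets.make_circles(random_state=0)  # 100 shuffled points

# train on the first 2/3, test on the remaining 1/3
n = 2 * len(labels) // 3
X_train, X_test = data[:n], data[n:]
y_train, y_test = labels[:n], labels[n:]

# fit the scaler on the training set only, and skip the Normalizer entirely
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model = SVC(kernel='rbf', gamma='auto', C=100).fit(X_train, y_train)
print('test accuracy:', model.score(X_test, y_test))  # at or near 1.0
```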

@gunes has a very good answer: degree is for poly, and rbf is controlled by gamma and C. In general, it is not surprising that the default parameters do not work well.

See RBF SVM parameters

If you change your code

model_2 = SVC(kernel='rbf', gamma=1000, C=100)

You will see 100% on training but 56% on testing.

The reason, as @gunes mentioned, is that the pre-processing changed the data. This also tells us the RBF kernel is powerful enough to overfit the training data quite well.

The answer is very simple and very short: you are asking the support vector machine to create something impossible; there is no set of support vectors that will constrain the boundary to only those two circles.


Server Bug Fix: ExtraTreesRegressor criterion

Original Source Link

As I understand, ExtraTreesRegressor from sklearn works by doing random splits instead of minimizing a metric like gini for classification or mae for regression.

I don’t understand why there’s a criterion parameter, as the criterion for the splits should be random.

Is it just for code compatibility, or am I missing something?

No, extremely-randomized trees still optimize splits. They pick only one random split point per feature (among the randomly chosen max_features), but which feature is actually used for the split then depends on the chosen criterion.

The criterion parameter is used to measure the quality of each candidate split; it is not involved in choosing the thresholds themselves (those are drawn at random), but it decides which of the random candidates is kept.


  • mse and mae are the only available options, and mse is the default. mae was added in version 0.18; check whether your version has it. A few issues have been reported with the use of mae.


criterion{“mse”, “mae”}, default=”mse”
The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.


Code Bug Fix: Latest XGboost and Sklearn giving error

Original Source Link

xgb = XGBClassifier(objective="binary:logistic", n_estimators=100, random_state=42, eval_metric=["auc"]), y_train)

KeyError                                  Traceback (most recent call last)
~/.pyenv/versions/3.7.4/envs/py3env/lib/python3.7/site-packages/IPython/core/ in __call__(self, obj, include, exclude)
    969         if method is not None:
--> 970             return method(include=include, exclude=exclude)
    971         return None

[... intermediate frames through sklearn's repr and pprint machinery omitted ...]

~/.pyenv/versions/3.7.4/envs/py3env/lib/python3.7/site-packages/sklearn/utils/ in _changed_params(estimator)
     96     init_params = {name: param.default for name, param in init_params.items()}
     97     for k, v in params.items():
---> 98         if (repr(v) != repr(init_params[k]) and
     99             not (is_scalar_nan(init_params[k]) and is_scalar_nan(v))):
    100             filtered_params[k] = v

KeyError: 'base_score'

Name: xgboost
Version: 1.0.2

Python 3

Looks like it was a bug with scikit-learn's update; it appears to have been resolved since.


Code Bug Fix: Plotting multiple confusion matrix side by side

Original Source Link

I am new here; this is my first question, and I hope to get an answer from the experts. I have 5 classifier models, and I am trying to plot their confusion matrices.

from sklearn.naive_bayes import GaussianNB  # needed for GaussianNB below
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import collections

classifiers = {
    "Naive Bayes": GaussianNB(),
    "LogisiticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
}

and then

from sklearn.metrics import confusion_matrix
for key, classifier in classifiers.items(): 
    y_pred =, y_train).predict(X_test)
    cf_matrix=confusion_matrix(y_test, y_pred)

which gives me

Now I am trying to plot them with the code below, but no data is shown on the plots.

fig, axn = plt.subplots(1,5, sharex=True, sharey=True)
cbar_ax = fig.add_axes([.91, .3, .03, .4])

for i, ax in enumerate(axn.flat):
    sns.heatmap(cf_matrix, ax=ax,
                cbar=i == 0,
                vmin=0, vmax=1,
                cbar_ax=None if i else cbar_ax)

fig.tight_layout(rect=[0, 0, .9, 1])

Here is what it looks like

Can someone please help me get this done?

sklearn provides plotting support for confusion matrices. There are two ways to do it: plot_confusion_matrix, which plots directly from a fitted estimator, and ConfusionMatrixDisplay, which plots a precomputed matrix.

I used the second way here, because removing the colorbar was quite verbose with the first (and having multiple colorbars looks very cluttered).

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "Naive Bayes": GaussianNB(),
    "LogisiticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
}

iris = load_iris()
X, y =,

X_train, X_test, y_train, y_test = train_test_split(X, y)

f, axes = plt.subplots(1, 5, figsize=(20, 5), sharey='row')

for i, (key, classifier) in enumerate(classifiers.items()):
    y_pred =, y_train).predict(X_test)
    cf_matrix = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(cf_matrix)
    disp.plot(ax=axes[i], xticks_rotation=45)
    disp.ax_.set_title(key)          # label each subplot with the classifier name
    disp.im_.colorbar.remove()       # drop per-plot colorbars; a shared one is added below
    disp.ax_.set_xlabel('')
    if i != 0:
        disp.ax_.set_ylabel('')      # keep the y label only on the first plot
f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.40, hspace=0.1)

f.colorbar(disp.im_, ax=axes)

[figure: five confusion matrices in a row with a shared colorbar]

You need to store the confusion matrix somewhere, so for if I use an example dataset:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

data = load_breast_cancer()
scaler = StandardScaler()

X_df = pd.DataFrame(, columns=data.feature_names)
X_df = scaler.fit_transform(X_df)
y_df = pd.DataFrame(, columns=['target'])

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=11)

And store it in a similar dictionary:

from sklearn.metrics import confusion_matrix
cf_matrix = dict.fromkeys(classifiers.keys())
for key, classifier in classifiers.items(): 
    y_pred =, y_train.values.ravel()).predict(X_test)
    cf_matrix[key]=confusion_matrix(y_test, y_pred)

Then you can plot it:

fig, axn = plt.subplots(1,5, sharex=True, sharey=True,figsize=(12,2))

for i, ax in enumerate(axn.flat):
    k = list(cf_matrix)[i]
    sns.heatmap(cf_matrix[k], ax=ax,cbar=i==4)

[figure: the resulting row of heatmaps]


Code Bug Fix: Is pandas_ml broken?

Original Source Link

The version info and issue are as given below. I want to know if pandas_ml is broken or am I doing something wrong. Why am I not able to import pandas_ml?

Basic info:
Versions of sklearn and pandas_ml and python are given below:

Python                            3.8.2
scikit-learn                      0.23.0
pandas-ml                         0.6.1


import pandas_ml as pdml

returns the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-47-79d5f9d2381c> in <module>
----> 1 import pandas_ml as pdml
      2 #from pandas_ml import ModelFrame
      3 #mf = pdml.ModelFrame(df.to_dict())
      4 #mf.head()

d:program in <module>
      1 #!/usr/bin/env python
----> 3 from pandas_ml.core import ModelFrame, ModelSeries       # noqa
      4 from import info                         # noqa
      5 from pandas_ml.version import version as __version__     # noqa

d:program in <module>
      1 #!/usr/bin/env python
----> 3 from pandas_ml.core.frame import ModelFrame       # noqa
      4 from pandas_ml.core.series import ModelSeries     # noqa

d:program in <module>
      9 import pandas_ml.imbaccessors as imbaccessors
---> 10 import pandas_ml.skaccessors as skaccessors
     11 import pandas_ml.smaccessors as smaccessors
     12 import pandas_ml.snsaccessors as snsaccessors

d:program in <module>
     13 from pandas_ml.skaccessors.linear_model import LinearModelMethods                 # noqa
     14 from pandas_ml.skaccessors.manifold import ManifoldMethods                        # noqa
---> 15 from pandas_ml.skaccessors.metrics import MetricsMethods                          # noqa
     16 from pandas_ml.skaccessors.model_selection import ModelSelectionMethods           # noqa
     17 from pandas_ml.skaccessors.neighbors import NeighborsMethods                      # noqa

d:program in <module>
    254 _true_pred_methods = (_classification_methods + _regression_methods
    255                       + _cluster_methods)
--> 256 _attach_methods(MetricsMethods, _wrap_target_pred_func, _true_pred_methods)

d:program in _attach_methods(cls, wrap_func, methods)
     92         for method in methods:
---> 93             _f = getattr(module, method)
     94             if hasattr(cls, method):
     95                 raise ValueError("{0} already has '{1}' method".format(cls, method))

AttributeError: module 'sklearn.metrics' has no attribute 'jaccard_similarity_score'

It seems it is indeed. Here is the situation:

Although the function jaccard_similarity_score is not shown in the available ones of sklearn.metrics in the documentation, it was still there under the hood (hence available) until v0.22.2 (source code) in addition to the jaccard_score one. But in the source code of the latest v0.23, it has been removed, and only jaccard_score remains.
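A quick way to check which of the two names an installed scikit-learn exposes:

```python
import sklearn
import sklearn.metrics as metrics

print(sklearn.__version__)
# jaccard_score exists in all recent releases; the old alias is gone from v0.23 on
print(hasattr(metrics, "jaccard_score"))
print(hasattr(metrics, "jaccard_similarity_score"))
```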

This would imply that it could still be possible to use pandas-ml by simply downgrading scikit-learn to v.0.22.2. But unfortunately this will not work either, throwing a different error:

!pip install pandas-ml
# Successfully installed enum34-1.1.10 pandas-ml-0.6.1

import sklearn
# '0.22.2.post1'

import pandas_ml as pdml


AttributeError: module 'sklearn.preprocessing' has no attribute 'Imputer'

I guess it would be possible to find a scikit-learn version that works with it by going back far enough (the last commit in their Github repo was in March 2019), but I'm not sure it is worth the fuss. In any case, they do not even mention scikit-learn (let alone a specific version of it) in their requirements file, which does not seem like sound practice, and the whole project seems rather abandoned.

So after some time and effort on this, I got it working and realized that the concept of "broken" in Python is rather murky. It depends on the combination of libraries you are trying to use and their dependencies. The older releases are all available and can be used, but sometimes it can be a hit-and-trial process to find the correct combination of package versions that gets everything working.

The other thing that I learnt from this exercise is the importance of having a significant expertise in creating and managing the virtual environments when programming with python.

In my case, I got help from some friends with the hit-and-trial part and found that pandas_ml works on Python 3.7. Below is the pip freeze output that can be used to set up a reliable virtual environment for machine learning and deep learning work with libraries like pandas_ml and imbalanced-learn, including some other libraries that have not had a new release in the last few years.

To create a working environment with the right version of packages which would ensure that pandas_ml and imbalanced-learn libraries work, create an environment with the following configuration on Python 3.7.


Hope this helps someone who is looking for the right combination of library versions to setup their machine and deep learning environment in python using pandas_ml and imbalanced-learn packages.


Code Bug Fix: Absence of learning rate and number of iterations in sklearn Linear Regression

Original Source Link

I have found that neither LinearRegression, nor Lasso, nor Ridge in scikit-learn uses a learning rate (what we call alpha) or a number of iterations.

I want to know how exactly they implement Linear Regression under the hood without a learning rate, considering that it's at the heart of Gradient Descent.

These methods work by minimizing an objective function, but here comes the difference between plain Linear Regression and Regularized Regression. On the one hand, Linear Regression fits the optimal coefficients by minimizing the residual sum of squares between the real values and the predicted values, that is, it minimizes an objective function like ||y - Xw||^2, where y holds the real values and Xw the predictions. This problem has a closed-form solution (the normal equations), so no learning rate or iteration count is needed.
For more details of the procedure see Understanding OLS estimation.

On the other hand, Lasso and Ridge Regression incorporate another term that penalizes the coefficients. For example, in Ridge Regression the function to minimize is
||y - Xw||^2 + alpha * ||w||^2, where alpha is the regularization strength (not a learning rate) and w represents the fitted coefficients; see this answer for more details on how the regularization shrinks the coefficient values.
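A sketch comparing scikit-learn's result with the closed-form least-squares solution on synthetic data, with no gradient descent anywhere:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + 3.0 + rng.normal(scale=0.01, size=100)

# scikit-learn solves the least-squares problem directly
model = LinearRegression().fit(X, y)

# the same coefficients fall out of the closed-form solution of ||y - Xw||^2
Xb = np.hstack([X, np.ones((len(X), 1))])   # append an intercept column
w = np.linalg.lstsq(Xb, y, rcond=None)[0]

print(model.coef_, model.intercept_)   # ~[1.5, -2.0], ~3.0
print(w)                               # same values: [coef_0, coef_1, intercept]
```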


Code Bug Fix: inverse_transform output of recurrent neural network

Original Source Link

I'm using TensorFlow to train my LSTM for predicting weather temperature. This is my data frame:

[figure: weather data frame]

I'm trying to predict the weather temperature for the next 12 hours. Before feeding data to the model, I convert it from a DataFrame to a NumPy array with shape (72, 3) and scale it:

scaler = StandardScaler()
scaler =
dataset = scaler.transform(dataset)

After my model is trained and evaluated, I make new predictions with data the model hasn't seen yet, processed the same way (DataFrame -> NumPy -> scale). The model then predicts the temperature for the next 12 hours, but the values are scaled, and when I try to inverse_transform them it throws an error:

non-broadcastable output operand with shape (12,1) doesn’t match the broadcast shape (12,3)

The output of my model has shape (12, 1): the next 12 predicted hours of temperature. I now have predicted values, but they are scaled, and I don't know how to transform them back to real values.
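A common workaround (a sketch, not from the original thread): since StandardScaler stores per-column mean_ and scale_, the temperature column can be un-scaled on its own instead of calling inverse_transform on all three columns. The data and the assumption that temperature is column 0 are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: 72 rows, 3 features, temperature assumed to be column 0
rng = np.random.default_rng(0)
dataset = rng.normal(loc=20, scale=5, size=(72, 3))

scaler = StandardScaler().fit(dataset)
scaled = scaler.transform(dataset)

preds_scaled = scaled[:12, 0:1]          # stand-in for the (12, 1) model output
temp_col = 0
# invert scaling for the temperature column only: x = z * scale + mean
preds = preds_scaled * scaler.scale_[temp_col] + scaler.mean_[temp_col]
print(np.allclose(preds, dataset[:12, 0:1]))  # True
```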
