Confusing probabilities from scikit-learn RandomForestClassifier


I have a time series of integer values which I'm trying to predict. I do this with a sliding window: the model learns to associate 99 consecutive values with the next one. The values are between 0 and 128. X is represented as a cube of n sliding windows, each 99 steps long, with every integer one-hot encoded into a vector of 128 elements. The shape of this array is (n, 99, 128). The shape of Y is (n, 128). I see it as a multi-class problem, since Y takes exactly one outcome.
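For concreteness, a dataset along these lines can be built as follows (a minimal sketch; `series` is an illustrative stand-in for the real integer time series):

import numpy as np

# Illustrative stand-in for the real time series of integers in 0..127.
series = np.random.RandomState(0).randint(0, 128, size=500)

window = 99
n = len(series) - window
one_hot = np.eye(128, dtype=np.float32)[series]            # (len(series), 128)

# n sliding windows of 99 one-hot vectors, each paired with the next value.
X = np.stack([one_hot[i:i + window] for i in range(n)])    # (n, 99, 128)
Y = one_hot[window:]                                       # (n, 128)
print(X.shape, Y.shape)    # (401, 99, 128) (401, 128)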

This works fine with Keras/TensorFlow, but when I try to use RandomForestClassifier from scikit-learn, it complains that the input is 3-D instead of 2-D. So I reshaped the input cube X into a 2-D matrix of shape (n, 99 * 128), as sketched below. The results weren't great, and to understand what was happening I requested the probabilities (see the rf function below).
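Concretely, the flattening is a single reshape (continuing the sketch above):

# Flatten each (99, 128) window into one row of 99 * 128 = 12672 binary features.
X2d = X.reshape(X.shape[0], -1)    # (n, 12672)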

import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf(X_train, Y_train, X_val, Y_val, samples):
    clf = RandomForestClassifier(n_estimators=32, n_jobs=-1)
    clf.fit(X_train, Y_train)
    score = clf.score(X_val, Y_val)
    print('Score of randomforest =', score)

    # inspect some random validation samples
    for i in range(samples):
        index = random.randrange(0, len(X_val) - 1)
        xx = X_val[index].reshape(1, -1)    # single sample -> 2-D
        probs = clf.predict_proba(xx)
        pred = clf.predict(xx)
        y_true = np.argmax(Y_val[index])    # decode the one-hot target
        y_hat = np.argmax(pred)
        print(index, '-', y_true, y_hat, xx.shape, len(probs))
        print(probs)
        print(pred)

The output I get from predict_proba is:

[array([[0.841, 0.159]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), 
 array([[1.]]), array([[1., 0.]]), array([[1., 0.]]), array([[1., 0.]]),
 array([[1., 0.]]), array([[1., 0.]]), array([[0.995, 0.005]]), array([[0.999,
 0.001]]), array([[0.994, 0.006]]), array([[1., 0.]]), array([[0.994, 0.006]]),
 array([[0.977, 0.023]]), array([[0.999, 0.001]]), array([[0.939, 0.061]]),
 array([[0.997, 0.003]]), array([[0.969, 0.031]]), array([[0.997, 0.003]]),
 array([[0.984, 0.016]]), array([[0.949, 0.051]]), array([[1., 0.]]),
 array([[0.95, 0.05]]), array([[1., 0.]]), array([[0.918, 0.082]]), 
 array([[0.887, 0.113]]), array([[1.]]), array([[0.88, 0.12]]), array([[1.]]),
 array([[0.884, 0.116]]), array([[0.941, 0.059]]), array([[1.]]), array([[0.941,
 0.059]]), array([[1.]]), array([[0.965, 0.035]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]])]

The output has the expected length of 128, but why is it a list of 2-D arrays, some containing one element and some two? As far as I understand from the manual, predict_proba should return an array of shape (n_samples, n_classes), so in my example (1, 128).

Could someone point out what I am doing wrong?

Edit 1

I ran experiments along the lines suggested by @Vivek Kumar in his comments (thanks, Vivek). I input sequences of plain integers (X) and match each with the next integer in the sequence (y). This is the code:

def rff(X_train, Y_train, X_val, Y_val, samples, cont=False):
    print('Input data:', X_train.shape, Y_train.shape, X_val.shape, Y_val.shape)
    clf = RandomForestClassifier(n_estimators=64, n_jobs=-1)
    clf.fit(X_train, Y_train)
    score = clf.score(X_val, Y_val)

    y_true = Y_val
    y_prob = clf.predict_proba(X_val)    # a single 2-D array now that y is 1-D
    y_hat = clf.predict(X_val)
    print('y_true', y_true.shape, y_true)
    print('y_prob', y_prob.shape, y_prob)
    print('y_hat', y_hat.shape, y_hat)
    #sum_prob = np.sum(y_true == y_prob)    # disabled: shapes do not match
    sum_hat = np.sum(y_true == y_hat)
    print('Score of randomforest =', score)
    print('Score y_hat', sum_hat / len(X_val))
    #print('Score y_prob', sum_prob / len(X_val))

    # compute some individual samples
    for i in range(samples):
        index = random.randrange(0, len(X_val) - 1)
        y_true_i = Y_val[index]
        #y_prob_i = y_prob[index]
        y_hat_i = y_hat[index]
        print('{:4d} - {:3d}{:3d}'.format(index, y_true_i, y_hat_i))

And its output is:

Input data: (4272, 99) (4272,) (1257, 99) (1257,)
y_true (1257,) [ 0  0  0 ... 69 70 70]
y_prob (1257, 29) [[0.09375  0.       0.       ... 0.078125 0.078125 0.015625]
 [0.109375 0.       0.       ... 0.046875 0.0625   0.0625  ]
 [0.125    0.       0.       ... 0.015625 0.078125 0.015625]
 ...
 [0.078125 0.       0.       ... 0.       0.       0.      ]
 [0.046875 0.       0.       ... 0.       0.       0.      ]
 [0.078125 0.       0.       ... 0.       0.       0.      ]]
y_hat (1257,) [81 81 79 ... 67 67 65]
Score of randomforest = 0.20047732696897375
Score y_hat 0.20047732696897375
 228 -  76 77
  51 -  76  0
 563 -  81  0
 501 -   0 77
 457 -  79 79
 285 -  76 77
 209 -  81  0
1116 -  79  0
 178 -  72 77
1209 -  67 65

The probabilities array now has a consistent size, but its shape, (1257, 29), is completely weird. Where is this 29 coming from? Yet there is some improvement to report: the accuracy has greatly improved, from around 0.0015 to about 0.20.

Any ideas on what the probabilities array represents?

Edit 2

My mistake was that, when going back from the 128 one-hot-encoded values to integers, I did not take into account that I had just 29 unique values. predict_proba neatly predicts these 29 values because those are the ones it learned.

The only question remaining is which classes the probabilities refer to. Suppose the classes to predict are 0 and 101-128; predict_proba returns values for indices 0..28. What is the mapping of indices to classes: 0-->0, 1-->101, 2-->102, ..., 28-->128? I couldn't find any hint about this in the manual.

Answer

You're making great progress, and it's good that you now get a consistently shaped probabilities array. Let's clarify what's going on and answer your final question about the class mapping.

Key Points in Your Current Approach:

  1. One-Hot Encoding & Output Shape:

    • You are encoding each integer value between 0 and 128 into a one-hot vector of length 128.
    • Y values are in a one-hot encoded form, with the shape (n, 128), which means you're dealing with a multi-class classification problem where each of the 128 possible values has a corresponding class.
  2. Reshaping X:

    • The input data (X) is a sequence of 99 time steps, each with 128 features (the one-hot encoding of each integer), giving it a shape of (n, 99, 128). When passing this data to RandomForestClassifier, you reshape it into (n, 99 * 128), flattening all 99 time steps of a sample into a single row of 99 * 128 features.
  3. The Probabilities Output:

    • This is where your original question comes from. Because you fitted with the one-hot Y of shape (n, 128), scikit-learn treats the problem as multi-output (multilabel) classification: each of the 128 columns becomes an independent binary target. predict_proba then returns a Python list of 128 arrays, one per column, each of shape (n_samples, n_classes_seen_in_that_column). A column that is never "hot" in the training data has only one class, which is why some of the arrays contain a single element (see the sketch after this list).
    • With a 1-D integer target, as in your Edit 1, predict_proba returns a single array of shape (n_samples, n_classes), which is what the manual describes: 1257 samples by the number of learned classes.
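Here is a small self-contained sketch (toy data; all names are illustrative) that reproduces this multi-output behaviour:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 21 samples, 5 features, one-hot targets with 4 columns,
# where column 3 is never "hot" in the training set.
X = np.random.RandomState(0).rand(21, 5)
labels = np.arange(21) % 3                  # only classes 0, 1, 2 occur
Y = np.eye(4, dtype=int)[labels]            # one-hot targets, shape (21, 4)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)                               # 2-D Y -> multi-output mode

probs = clf.predict_proba(X[:1])
print(type(probs), len(probs))              # <class 'list'> 4: one array per column
for p in probs:
    print(p.shape)                          # (1, 2) for columns 0-2, (1, 1) for column 3

Fitting the same X with the 1-D labels instead (clf.fit(X, labels)) makes predict_proba return a single (1, 3) array, matching your Edit 1.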

Regarding the 29 Classes and Class Mapping:

In your Edit 2, you noticed that the probabilities array has shape (1257, 29). That is because there are only 29 unique classes in your data, not 128: the Random Forest model can only learn to classify the unique values that are actually present in the training data.

Here’s the reason:

  • If the target variable (Y_train) only contains values from the set {0, 101, 102, ..., 128}, i.e. not all 128 possible integer values are represented in the training data, then the model will only learn those 29 unique classes. This is why predict_proba returns probabilities for only 29 classes, as the toy example below illustrates.
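A toy example (illustrative data) showing that only the labels seen during training end up in the model:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Only 3 distinct labels occur in the training targets.
X = np.random.RandomState(0).rand(6, 4)
y = np.array([0, 101, 128, 0, 101, 128])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.classes_)                  # [  0 101 128] -- the sorted unique labels
print(clf.predict_proba(X).shape)    # (6, 3): one column per *seen* class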

Class Mapping:

  • RandomForestClassifier stores its class labels in clf.classes_, sorted in ascending order (it does not use the order of appearance in the training data). With your 29 labels {0, 101, 102, ..., 128}, the mapping is therefore fixed:

    • Index 0 maps to class 0.
    • Index 1 maps to class 101.
    • Index 2 maps to class 102.
    • ...
    • Index 28 maps to class 128.

    In other words, column i of the array returned by predict_proba corresponds to the class clf.classes_[i].

How to Check the Mapping:

You can check how the classes are internally mapped by printing the class labels used by the RandomForestClassifier:

print(clf.classes_)

This prints the unique class labels in exactly the order used for the columns of predict_proba. For example:

[  0 101 102 ... 128]

Column i of the probabilities corresponds to clf.classes_[i], so you can map the indices back to the original integer values (index 0 maps to class 0, index 1 maps to class 101, and so on).
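A short decoding sketch (assuming clf and X_val from your Edit 1):

import numpy as np

probs = clf.predict_proba(X_val)     # shape (n_samples, 29)
best = np.argmax(probs, axis=1)      # column index of the highest probability
decoded = clf.classes_[best]         # map column indices back to the labels
# `decoded` matches clf.predict(X_val) element for element.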


Solution to the Final Question:

The probabilities returned from predict_proba correspond to the 29 classes that the model has learned. To interpret the result, you simply need to map the index of the probability array back to the original integer classes (which you can do using clf.classes_).

For instance:

  • The probabilities for index 0 correspond to class 0.
  • The probabilities for index 1 correspond to class 101.
  • The probabilities for index 2 correspond to class 102.
  • ... and so on.

Final Recommendations:

  1. Ensure your training data covers all 128 possible classes:

    • If you want to ensure all 128 classes are represented in the model, make sure that your training data (Y_train) contains instances of every possible class (i.e., every integer between 0 and 128).
    • If some classes are missing from the training set, the model will not learn to predict them, and predict_proba will return probabilities only for the classes it has learned (see the padding sketch after the code below, which fills the missing classes in with zeros).
  2. Mapping probabilities to original classes:

    • After calling predict_proba, use clf.classes_ to get the mapping between class indices and the actual class labels (e.g., 0, 101, 102, ..., 128).
# Example of getting the class labels
print("Class labels:", clf.classes_)

# Probabilities for one particular sample (reshape the single row to 2-D,
# just as in your original code)
probabilities = clf.predict_proba(X_val[index].reshape(1, -1))
print(f"Probability of class {clf.classes_[0]} for sample {index}: {probabilities[0][0]}")

Let me know if you need further clarification or have additional questions!