Random forest tree growing algorithm


I'm doing a Random Forest implementation (for classification), and I have some questions regarding the tree growing algorithm mentioned in literature.

When training a decision tree, there are two criteria for stopping its growth: (a) stop when there are no features left to split a node on, and (b) stop when all samples in the node belong to the same class.

Based on that:

1. Consider growing one tree in the forest. When splitting a node of the tree, I randomly select *m* of the *M* total features, and from these *m* features I pick the one with the maximum information gain. After I've found this feature, say f, should I remove it from the feature list before proceeding down to the children of the node? If I don't remove it, the same feature might be selected again further down the tree. If I implement the algorithm without removing the feature selected at a node, then the only way to stop growing the tree is when the leaves become "pure". When I did this, I hit Python's "maximum recursion depth exceeded" error, because the tree couldn't reach that "pure" condition early enough. The RF literature, including Breiman's own papers, says that each tree should be grown to the maximum. What does this mean?

2. At a node split, after selecting the best feature to split on (by information gain), what should the split threshold be? One approach is to have no threshold at all and create one child node for every unique value of the feature; but I also have continuous-valued features, so that would mean creating one child node per sample!

Answer

Let's go through your questions:

1. Feature Selection During Node Splitting

In Random Forest (RF), each decision tree grows independently. The main idea is to add diversity to the trees by selecting a random subset of features at each node split.

Regarding your question about removing the selected feature:

  • No, you do not need to remove the feature from the list for future splits. When you select a random subset of features (say m of the M total features), you evaluate the candidates and split on the best feature in that subset. The feature selected for the split can be reused in other nodes further down the tree, because each node operates independently and a fresh random subset of m features is drawn at every split.

    If you were to remove the selected feature from the list, you would restrict the tree unnecessarily: a useful feature, especially a continuous one, can legitimately be split on again at a different threshold further down the tree. Removing it would limit the variety of available splits. The diversity among trees in the forest comes from the per-node random subsampling of m features (and from bootstrapping the training data), not from discarding features once they have been used.

  • About the "maximum recursion depth" error: in Python this is a RecursionError, raised when the recursive tree-building exceeds the interpreter's recursion limit (1000 stack frames by default in CPython). It typically happens when a node can never become pure, for example when identical samples carry different class labels, so the recursion never bottoms out, or simply because a legitimately deep tree overruns the limit. To avoid it, you can:

    • Add explicit stopping criteria, e.g. a maximum tree depth, a minimum number of samples per leaf, or a minimum information-gain threshold below which a node becomes a leaf (see the sketch below).
    • Alternatively, prune the tree after it has grown (though this is less common in the random forest algorithm).

Breiman's statement about growing trees to the "maximum" means growing each tree as deeply as the data allows, without pruning: keep splitting until a node is pure, until too few samples remain, or until no split yields any further information gain. It does not mean the tree must grow without bound.
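
To make this concrete, here is a minimal sketch (in Python, since that's what you're using) of how the per-node logic might look with the feature list left intact and explicit stopping criteria. The parameter names (max_depth, min_samples_leaf, min_gain) and the best_threshold helper (sketched in the continuous-feature section below) are illustrative, not prescriptive:

```python
import numpy as np

def build_tree(X, y, depth=0, m=None,
               max_depth=20, min_samples_leaf=1, min_gain=1e-7):
    """Recursively grow one tree of the forest (illustrative sketch).

    Assumes X is a 2-D NumPy array and y is a 1-D NumPy array of integer
    class labels. `best_threshold` is sketched further below.
    """
    n_samples, n_features = X.shape
    m = m or int(np.sqrt(n_features))   # a common default: m = sqrt(M)

    # Stopping criteria: pure node, maximum depth reached, or too few samples.
    if len(np.unique(y)) == 1 or depth >= max_depth or n_samples < 2 * min_samples_leaf:
        return {"leaf": True, "class": int(np.bincount(y).argmax())}

    # Draw a fresh random subset of m features at EVERY node; nothing is removed.
    candidate_features = np.random.choice(n_features, size=m, replace=False)

    best = None   # (feature, threshold, gain)
    for f in candidate_features:
        threshold, gain = best_threshold(X[:, f], y)
        if best is None or gain > best[2]:
            best = (f, threshold, gain)

    f, threshold, gain = best
    if gain < min_gain:                 # no useful split left -> make a leaf
        return {"leaf": True, "class": int(np.bincount(y).argmax())}

    left = X[:, f] < threshold
    return {
        "leaf": False, "feature": f, "threshold": threshold,
        "left": build_tree(X[left], y[left], depth + 1, m,
                           max_depth, min_samples_leaf, min_gain),
        "right": build_tree(X[~left], y[~left], depth + 1, m,
                            max_depth, min_samples_leaf, min_gain),
    }
```

Note that the recursion terminates even when no leaf can become pure, because max_depth and min_gain cap the depth, which is exactly what prevents the RecursionError you saw.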

2. Threshold for Splitting Continuous Features

The threshold for continuous features can be a tricky aspect of decision trees.

  • For continuous features, the most common approach is not to create a separate child node for each unique value; that would leave almost every child with a single sample and lead to severe overfitting. Instead, decision trees typically select a single threshold value that best splits the data into two subsets (a binary split).

    For example, if a feature f is continuous, you will:

    • Sort the distinct values of f observed at the node.
    • Try candidate thresholds (typically the midpoints between adjacent distinct values) and keep the one that maximizes information gain or, equivalently, minimizes impurity (e.g., Gini impurity or entropy).

    You don’t need to create a child node for each sample, but instead, you are effectively finding the "cut-off" point where the feature value leads to the best split.

    For example, if f is a continuous variable (e.g., age), you might split on age < 30 vs age >= 30. The algorithm will try different threshold values for this feature and choose the one that maximizes the information gain or minimizes the impurity.

    In summary, create one child node for each side of the split based on the threshold (e.g., left side for f < threshold and right side for f >= threshold), rather than creating a node for each unique value.
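
To illustrate the threshold search, here is a minimal sketch that scores every midpoint between adjacent distinct values by information gain (entropy-based here; Gini works the same way). It is the best_threshold helper assumed in the earlier sketch:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of the class labels in a node (assumes integer labels)."""
    counts = np.bincount(y)
    p = counts[counts > 0] / len(y)
    return -np.sum(p * np.log2(p))

def best_threshold(values, y):
    """Return (threshold, information_gain) for the best binary split of one
    continuous feature. Candidate thresholds are the midpoints between
    adjacent distinct sorted values."""
    parent_entropy = entropy(y)
    order = np.argsort(values)
    values, y = values[order], y[order]

    best_gain, best_t = 0.0, None
    distinct = np.unique(values)
    for lo, hi in zip(distinct[:-1], distinct[1:]):
        t = (lo + hi) / 2.0                      # midpoint between adjacent values
        left, right = y[values < t], y[values >= t]
        # Weighted average of the child entropies
        child_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent_entropy - child_entropy
        if gain > best_gain:
            best_gain, best_t = gain, t

    return best_t, best_gain
```

For the age example above, best_threshold(np.array([25., 32., 47., 51., 62.]), np.array([0, 0, 1, 1, 1])) returns a threshold of 39.5 with a gain of about 0.971 (the full parent entropy), since both resulting children are pure.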

In Summary:

  1. No need to remove the selected feature in Random Forest. Features can be reused in different nodes.
  2. For continuous features, the common approach is to find a threshold to split, not create a child node per sample or unique value. This avoids overfitting and allows for reasonable splits.
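
Finally, to show how the pieces fit together, here is a hedged sketch of assembling the forest around the build_tree function from the earlier sketch: each tree is grown on its own bootstrap sample, and prediction is a majority vote. The function names (build_forest, predict_one, predict) are just placeholders for whatever structure your implementation uses:

```python
import numpy as np

def build_forest(X, y, n_trees=100, **tree_kwargs):
    """Grow a forest: each tree sees a bootstrap sample of the data and uses
    the per-node random feature subset from the build_tree sketch above."""
    forest = []
    for _ in range(n_trees):
        idx = np.random.randint(0, len(X), size=len(X))   # bootstrap sample (with replacement)
        forest.append(build_tree(X[idx], y[idx], **tree_kwargs))
    return forest

def predict_one(tree, x):
    """Route a single sample down one tree to its leaf."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree["class"]

def predict(forest, x):
    """Majority vote across all trees in the forest."""
    votes = np.array([predict_one(tree, x) for tree in forest])
    return int(np.bincount(votes).argmax())
```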

If you need further clarifications or have additional questions, feel free to ask!