
Random Forest Feature Importance

Random forests can rank features directly from the fitted trees. That ranking is useful for debugging, but it is not a complete explanation of model behavior. For reports and decisions, compare the built-in impurity-based ranking with permutation importance on held-out data.

Quick answer

Use `feature_importances_` when you need a fast first look at how the forest split the training data. Use `permutation_importance` when you need to explain which features the fitted model relies on for validation or test performance.

Question	Better method	Why
Which columns did the forest use for splits?	`feature_importances_`	It is computed from impurity decreases inside the trees.
Which columns affect validation performance?	`permutation_importance`	It measures the score drop after shuffling each feature.
Which feature causes the business outcome?	Neither by itself	Feature importance is model reliance, not causal evidence.

How impurity importance is calculated

Scikit-learn's built-in random forest importance is impurity-based importance, also called mean decrease in impurity (MDI), or Gini importance when the split criterion is Gini impurity. Each decision tree chooses splits that reduce its criterion, such as Gini impurity, entropy (log loss), or squared error.

At each split, the tree records how much the split reduced impurity.
The reduction is weighted by how many training samples reached that node.
All reductions credited to the same feature are summed within a tree.
The forest averages those values across trees.
The final scores are normalized so importances sum to 1.
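
The steps above can be checked by hand. The sketch below is one way to do that, assuming the arrays scikit-learn exposes on a fitted tree (`children_left`, `children_right`, `impurity`, `weighted_n_node_samples`, `feature`). The names `manual` and `small_forest` are illustrative and not part of the guide's main example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# One shallow tree keeps the walk over nodes easy to follow.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
t = tree.tree_
n = t.weighted_n_node_samples  # training samples that reached each node

manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no credit
        continue
    # Sample-weighted impurity of the parent minus that of its two children.
    decrease = (
        n[node] * t.impurity[node]
        - n[left] * t.impurity[left]
        - n[right] * t.impurity[right]
    )
    manual[t.feature[node]] += decrease

manual /= manual.sum()  # normalize so the scores sum to 1
tree_matches = np.allclose(manual, tree.feature_importances_)

# The forest score is the average of the per-tree scores, renormalized.
small_forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
per_tree = np.array([est.feature_importances_ for est in small_forest.estimators_])
averaged = per_tree.mean(axis=0)
averaged /= averaged.sum()
forest_matches = np.allclose(averaged, small_forest.feature_importances_)

print("tree matches:", tree_matches)
print("forest matches:", forest_matches)
```

Note that every quantity in the loop comes from the training data stored in the tree, which is why this score says nothing by itself about held-out performance.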

The key limitation

The score is based on training-time split behavior. It does not prove that a feature improves generalization, and it can favor continuous or high-cardinality features with many possible split points.
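
A small synthetic experiment can make this bias visible. The sketch below assumes a deliberately artificial setup: the target is random, and both columns are pure noise, so neither deserves any credit. All names (`X_noise`, `y_noise`, `noise_forest`) are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 1000

X_noise = np.column_stack([
    rng.normal(size=n_rows),          # continuous noise: many candidate thresholds
    rng.integers(0, 2, size=n_rows),  # binary noise: a single candidate threshold
])
y_noise = rng.integers(0, 2, size=n_rows)  # target unrelated to either column

# Fully grown trees (the default) keep splitting until leaves are pure,
# so they memorize noise, mostly through the continuous column.
noise_forest = RandomForestClassifier(n_estimators=100, random_state=0)
noise_forest.fit(X_noise, y_noise)

continuous_share, binary_share = noise_forest.feature_importances_
print(f"continuous noise column: {continuous_share:.3f}")
print(f"binary noise column:     {binary_share:.3f}")
```

Both columns are equally uninformative, yet the continuous one collects most of the normalized credit because it offers far more places to split. Real datasets are less extreme, but the direction of the bias is the same.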

Built-in importance example

Start with a model that has acceptable held-out performance. A clean feature-importance chart from a weak model is not useful evidence.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=500,
    min_samples_leaf=3,
    random_state=42,
    n_jobs=-1,
)
forest.fit(X_train, y_train)

# Score on the held-out test split before interpreting any importances.
predicted_proba = forest.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, predicted_proba)
print("Held-out AUC:", round(auc, 3))

built_in = pd.DataFrame(
    zip(X.columns, forest.feature_importances_),
    columns=["feature", "impurity_importance"],
).sort_values("impurity_importance", ascending=False)

print(built_in.head(10))

The values are relative importances within this fitted forest. A value of 0.20 does not mean the feature explains 20 percent of the outcome. It means the feature received 20 percent of the normalized impurity decrease credited across the forest.

Compare with permutation importance

Permutation importance asks a different question: how much does the model's score decrease when one feature column is shuffled? Run it on validation or test data, and use the same metric you used to evaluate the model.

from sklearn.inspection import permutation_importance

result = permutation_importance(
    forest,
    X_test,
    y_test,
    n_repeats=30,
    random_state=42,
    scoring="roc_auc",
    n_jobs=-1,
)

permutation = pd.DataFrame(
    zip(X.columns, result.importances_mean, result.importances_std),
    columns=["feature", "permutation_mean", "permutation_std"],
).sort_values("permutation_mean", ascending=False)

comparison = built_in.merge(permutation, on="feature")
comparison = comparison.sort_values("permutation_mean", ascending=False)

print(comparison.head(10))
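
To see what the shuffle-and-rescore idea means in practice, it can help to write it out for a single column. The loop below is a simplified illustration, not scikit-learn's implementation. The column choice (`"worst concave points"`) and the repeat count are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=3, random_state=42
)
forest.fit(X_train, y_train)

# Reference score on the untouched held-out data.
baseline = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

column = "worst concave points"  # any column name from X works here
rng = np.random.default_rng(42)
drops = []

for _ in range(10):
    X_shuffled = X_test.copy()
    # Shuffling breaks the link between this column and the target
    # while keeping the column's own distribution unchanged.
    X_shuffled[column] = rng.permutation(X_shuffled[column].to_numpy())
    shuffled_score = roc_auc_score(
        y_test, forest.predict_proba(X_shuffled)[:, 1]
    )
    drops.append(baseline - shuffled_score)

print("Baseline AUC:", round(baseline, 3))
print("Mean AUC drop:", round(float(np.mean(drops)), 4))
print("Std of AUC drop:", round(float(np.std(drops)), 4))
```

The model is never refitted here. The drop measures how much the already-fitted forest leans on that column for this metric, and individual drops can be slightly negative by chance.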

Disagreement is common. A feature can help many tree splits yet add little held-out performance, especially when the forest overfits, the feature has many possible thresholds, or another feature carries similar information.
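
One way to put a number on the disagreement is a rank correlation between the two importance vectors. The sketch below assumes Spearman's rho from `scipy.stats.spearmanr` is an acceptable summary; the smaller forest and repeat count are chosen only to keep the example quick.

```python
import pandas as pd
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=3, random_state=42
)
forest.fit(X_train, y_train)

perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42, scoring="roc_auc"
)

table = pd.DataFrame(
    {
        "impurity": forest.feature_importances_,
        "permutation": perm.importances_mean,
    },
    index=X.columns,
)

# Rank 1 is the most important feature under each method.
table["impurity_rank"] = table["impurity"].rank(ascending=False)
table["permutation_rank"] = table["permutation"].rank(ascending=False)
table["rank_gap"] = (table["impurity_rank"] - table["permutation_rank"]).abs()

rho, _ = spearmanr(table["impurity"], table["permutation"])
print("Spearman rank correlation:", round(float(rho), 3))
print(table.sort_values("rank_gap", ascending=False).head(5))
```

A low correlation is not an error in either method. The features with the largest `rank_gap` are the ones worth explaining in a report.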

Plot the results

Keep plots explicit about the method. A random forest importance chart without the method name invites readers to overinterpret the numbers.

import matplotlib.pyplot as plt

top_n = 12
plot_data = built_in.head(top_n).sort_values("impurity_importance")

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(
    plot_data["feature"],
    plot_data["impurity_importance"],
    color="#2563eb",
)
ax.set_title("Random forest impurity-based importance")
ax.set_xlabel("Normalized mean decrease in impurity")
ax.set_ylabel("")
plt.tight_layout()
plt.show()

For permutation importance, plot the average score drop with uncertainty from repeated shuffles.

plot_data = permutation.head(top_n).sort_values("permutation_mean")

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(
    plot_data["feature"],
    plot_data["permutation_mean"],
    xerr=plot_data["permutation_std"],
    color="#2563eb",
)
ax.set_title("Random forest permutation importance")
ax.set_xlabel("Mean AUC decrease after shuffling")
ax.set_ylabel("")
plt.tight_layout()
plt.show()

If the error bars overlap heavily, avoid claiming that nearby features have a precise order. Report them as a group of similarly important predictors.
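
A rough way to find such a group is to flag every feature whose interval overlaps the top feature's interval. The helper below is a hypothetical heuristic, assuming a one-standard-deviation interval. It is not a formal significance test, and the numbers are made up for illustration.

```python
import numpy as np


def overlaps_top(means, stds):
    """Flag features whose mean +/- 1 std interval overlaps the top feature's."""
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    top = np.argmax(means)
    top_lower = means[top] - stds[top]
    # A feature overlaps if its upper bound reaches the top feature's lower bound.
    return (means + stds) >= top_lower


# Hypothetical values, not results from the breast cancer example.
means = [0.10, 0.09, 0.02]
stds = [0.02, 0.02, 0.005]

flags = overlaps_top(means, stds)
print(flags.tolist())  # [True, True, False]
```

With real results, `means` and `stds` would come from `result.importances_mean` and `result.importances_std`. Features flagged `True` are candidates to report together rather than in a strict order.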

Check stability

Random forests are stochastic, and permutation importance adds another layer of randomness. Before showing a ranking, check whether the top features remain near the top across seeds or data splits.

seeds = [3, 11, 29, 42, 71]
rankings = []

for seed in seeds:
    model = RandomForestClassifier(
        n_estimators=500,
        min_samples_leaf=3,
        random_state=seed,
        n_jobs=-1,
    )
    model.fit(X_train, y_train)

    repeated = permutation_importance(
        model,
        X_test,
        y_test,
        n_repeats=15,
        random_state=seed,
        scoring="roc_auc",
        n_jobs=-1,
    )

    ranking = pd.Series(repeated.importances_mean, index=X.columns)
    rankings.append(ranking.rank(ascending=False, method="average"))

rank_stability = pd.concat(rankings, axis=1)
rank_stability.columns = [f"seed_{seed}" for seed in seeds]

summary = pd.DataFrame()
summary["mean_rank"] = rank_stability.mean(axis=1)
summary["rank_std"] = rank_stability.std(axis=1)
summary = summary.sort_values("mean_rank")

print(summary.head(12))

A feature with a good mean rank but a high rank standard deviation may matter to the model yet be hard to order precisely. That usually calls for cautious wording, not another decimal place.
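
The same check can be run across data splits instead of model seeds. The sketch below assumes three different `train_test_split` seeds are enough to show the idea; the names `split_seeds` and `split_stability` are illustrative, and the smaller forest and repeat count only keep the run short.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

split_seeds = [0, 1, 2]
split_rankings = {}

for split_seed in split_seeds:
    # Vary only the split; the model seed stays fixed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=split_seed, stratify=y
    )
    model = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=3, random_state=42
    )
    model.fit(X_tr, y_tr)

    result = permutation_importance(
        model, X_te, y_te, n_repeats=5, random_state=42, scoring="roc_auc"
    )
    importances = pd.Series(result.importances_mean, index=X.columns)
    split_rankings[f"split_{split_seed}"] = importances.rank(
        ascending=False, method="average"
    )

split_stability = pd.DataFrame(split_rankings)
split_summary = pd.DataFrame(
    {
        "mean_rank": split_stability.mean(axis=1),
        "rank_std": split_stability.std(axis=1),
    }
).sort_values("mean_rank")

print(split_summary.head(12))
```

Rank variation across splits is often larger than variation across model seeds, because the held-out rows themselves change.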

Correlation caveats

Correlated predictors are one of the main reasons random forest importance is difficult to explain. The model may use one member of a correlated group in one tree and another member in a different tree. Credit can be split, swapped, or hidden.

Impurity importance

Correlated features can divide split credit. A feature may look less important because another feature was selected for similar splits.

Permutation importance

Shuffling one correlated feature may cause only a small score drop because the model can still use the remaining correlated feature.

Practical checks include inspecting a correlation matrix, grouping closely related columns, rerunning importance after removing redundant features, and reporting feature groups when a precise individual ranking is not defensible.

corr = X_train.corr(numeric_only=True).abs()

pairs = corr.unstack().reset_index()
pairs.columns = ["feature_a", "feature_b", "correlation"]
# Keep each unordered pair once: this drops self-pairs and mirrored (b, a) rows.
pairs = pairs[pairs["feature_a"] < pairs["feature_b"]]
pairs = pairs.sort_values("correlation", ascending=False)

print(pairs.head(10))
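
Grouping closely related columns can also be automated. The sketch below is one possible approach, assuming average-linkage hierarchical clustering on the distance `1 - |correlation|` with an arbitrary cut at 0.1. The threshold and linkage method are illustrative choices, not recommendations from the sources.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Turn absolute correlation into a distance: 0 means perfectly correlated.
corr = X_train.corr().abs().to_numpy()
distance = np.clip(1.0 - corr, 0.0, None)
distance = (distance + distance.T) / 2.0  # enforce exact symmetry
np.fill_diagonal(distance, 0.0)

condensed = squareform(distance, checks=False)
merge_tree = linkage(condensed, method="average")

# Columns whose clusters merge below the threshold share a group label.
labels = fcluster(merge_tree, t=0.1, criterion="distance")
groups = pd.Series(labels, index=X_train.columns, name="group")

# Show only the groups that contain more than one column.
sizes = groups.value_counts()
multi = groups[groups.map(sizes) > 1]
for label, members in multi.groupby(multi):
    print(label, list(members.index))
```

Each printed group is a candidate to report as a unit, or to reduce to one representative column before rerunning the importance calculation.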

Reporting checklist

A useful random forest feature-importance report should let another practitioner understand what was measured and how much trust to place in the ranking.

State the model, target, dataset split, and validation score.
Label the method: impurity-based importance, permutation importance, or both.
For permutation importance, state the scoring metric and number of repeats.
Show uncertainty or stability checks when the ranking supports a decision.
Call out correlated feature groups, leakage risks, and excluded columns.
Avoid causal language unless the modeling design supports causal claims.
Explain disagreements between built-in and permutation rankings instead of hiding them.

Sources: scikit-learn forest importance example and scikit-learn permutation importance documentation.