Method guide

Permutation Importance

Permutation importance measures how much a fitted model's score drops when the values of one feature are randomly shuffled. It is a useful first method because it ties feature importance to performance on held-out data, not just to model internals.

How it works

First, score the fitted model on a baseline dataset, usually validation or test data. Then pick one feature column, shuffle its values across rows, and score the model again. Shuffling breaks the relationship between that feature and the target while keeping the rest of the data unchanged.

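The importance is the drop in score after shuffling: the baseline score minus the shuffled score, usually averaged over several shuffles. If the score drops a lot, the model was relying on that feature. If the score barely changes, the model either did not need the feature or could recover similar information from other features. A small negative value means the shuffled data happened to score slightly better than the baseline, which typically reflects noise around zero rather than a harmful feature.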
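The procedure described above can be written out by hand in a few lines. The sketch below is illustrative only: the helper name `manual_permutation_importance` is hypothetical, R squared is assumed as the score, and it is not meant to reproduce scikit-learn's implementation exactly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Return the baseline R^2 and an (n_features, n_repeats) array of score drops."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = r2_score(y, model.predict(X))
    drops = np.empty((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffle only column j; every other column stays aligned with y.
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            # Importance for this repeat: baseline score minus shuffled score.
            drops[j, r] = baseline - r2_score(y, model.predict(X_shuffled))
    return baseline, drops


X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)

baseline, drops = manual_permutation_importance(model, X_test, y_test)
print(f"baseline R^2: {baseline:.3f}")
print("mean drop per feature:", drops.mean(axis=1).round(3))
```

Keeping the per-repeat drops, rather than only their mean, is what makes the uncertainty estimates possible later.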

When to use it

Use permutation importance when you want a model-agnostic answer to a practical question: which features matter to predictive performance on a specific dataset and scoring metric?

Use it after you have a model worth explaining.
Use validation or test data when you care about generalization.
Use the scoring metric that matches the real decision problem.
Use repeated shuffles so you can see uncertainty.
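To see why the choice of dataset matters, it can help to compute importances on both the training and the test split. The sketch below is one possible illustration: the `random_noise` column is a hypothetical feature added only for this demonstration, and the random forest is an assumed stand-in for any model flexible enough to overfit.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
# Hypothetical column of pure noise, added only to make overfitting visible.
X = X.assign(random_noise=np.random.default_rng(0).normal(size=len(X)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

columns = {}
splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
for name, (X_part, y_part) in splits.items():
    result = permutation_importance(
        forest, X_part, y_part, n_repeats=5, random_state=0, scoring="r2"
    )
    columns[name] = result.importances_mean

# One row per feature, one column per split.
train_vs_test = pd.DataFrame(columns, index=X.columns)
print(train_vs_test.sort_values("train", ascending=False))
```

A feature that looks important on the training split but not on the test split is a hint that the model memorized it rather than learned something that generalizes.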

Scikit-learn example

from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import pandas as pd

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

result = permutation_importance(
    model,
    X_test,
    y_test,
    n_repeats=30,
    random_state=42,
    scoring="r2",
)

# Collect the per-feature mean and standard deviation across the 30 shuffles.
importance = pd.DataFrame(
    {
        "feature": X_test.columns,
        "mean": result.importances_mean,
        "std": result.importances_std,
    }
).sort_values("mean", ascending=False)

print(importance)

Choose the scoring metric

The metric defines the meaning of the importance score. A feature that matters for AUC may not matter as much for log loss, accuracy, recall, mean absolute error, or R squared.

Problem	Possible metric	What importance means
Ranking classifier	`roc_auc`	Drop in ranking quality
Calibrated classifier	`neg_log_loss`	Worse predicted probabilities
Regression	`r2` or `neg_mean_absolute_error`	Drop in fit or increase in error
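One way to see how the metric changes the picture is to compute importances for the same fitted model under several scorers. The sketch below assumes a binary classification setup that is not part of the example above (the breast cancer dataset and a scaled logistic regression), and it calls `permutation_importance` once per metric for simplicity.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

metrics = ["roc_auc", "neg_log_loss", "accuracy"]
by_metric = {}
for metric in metrics:
    result = permutation_importance(
        clf, X_test, y_test, scoring=metric, n_repeats=10, random_state=0
    )
    by_metric[metric] = result.importances_mean

# One row per feature, one column per metric.
comparison = pd.DataFrame(by_metric, index=X.columns)
print(comparison.sort_values("neg_log_loss", ascending=False).head(5))
```

The columns are in different units, so compare the ordering of features within each column rather than the raw numbers across columns.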

Plot with uncertainty

Because each feature is shuffled several times (`n_repeats`), you can show both the average score decrease and the variation across repeats.

import matplotlib.pyplot as plt

top_n = 12
plot_data = importance.head(top_n).sort_values("mean")

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(
    plot_data["feature"],
    plot_data["mean"],
    xerr=plot_data["std"],
    color="#2563eb",
)
ax.set_title("Permutation importance")
ax.set_xlabel("Mean score decrease")
ax.set_ylabel("")
plt.tight_layout()
plt.show()

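Wide error bars mean the importance estimate is noisy. When the bars of neighboring features overlap, their relative order should not be trusted. This can happen with small validation sets, models whose predictions are themselves noisy, weak features, or too few repeats.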
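A simple numeric companion to the plot is to list only the features whose mean importance sits clearly above zero. The sketch below uses a two-standard-deviation cutoff, which is a rule of thumb assumed here for illustration and not a formal significance test.

```python
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = Ridge(alpha=1.0).fit(X_train, y_train)

result = permutation_importance(
    model, X_test, y_test, n_repeats=30, random_state=42, scoring="r2"
)

clearly_positive = []
# Walk through features from highest to lowest mean importance.
for i in result.importances_mean.argsort()[::-1]:
    mean, std = result.importances_mean[i], result.importances_std[i]
    # Rule of thumb (an assumption, not a formal test): keep features whose
    # mean drop is more than two standard deviations above zero.
    if mean - 2 * std > 0:
        clearly_positive.append(X.columns[i])
        print(f"{X.columns[i]:<6} {mean:.3f} +/- {std:.3f}")
```

Features that fail the cutoff are not necessarily useless; the data simply does not separate their importance from zero.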

Correlated features

Correlated features are a common reason permutation importance looks surprisingly low. If two columns carry the same signal, shuffling one column may not hurt much because the model can still use the other.

Do not read low importance too literally

Low permutation importance means the model did not lose much performance when that column was shuffled. It does not prove the underlying signal is irrelevant.

When correlated features matter, group related columns, remove redundant features, or compare importance after retraining a simpler model.
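Grouping related columns can be sketched by shuffling them together with a single row permutation. The example below uses synthetic data in which two columns are noisy copies of one signal; the data, the linear model, and the helper `group_importance` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
# Columns 0 and 1 are noisy copies of the same signal; column 2 is pure noise.
X = np.column_stack([
    signal + 0.1 * rng.normal(size=n),
    signal + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])
y = signal + 0.1 * rng.normal(size=n)

X_train, X_val = X[:1000], X[1000:]
y_train, y_val = y[:1000], y[1000:]
model = LinearRegression().fit(X_train, y_train)
baseline = r2_score(y_val, model.predict(X_val))


def group_importance(columns, n_repeats=20):
    """Mean R^2 drop when the given columns are shuffled together."""
    drops = []
    for _ in range(n_repeats):
        X_perm = X_val.copy()
        # One row permutation for the whole group keeps the columns
        # consistent with each other while breaking their link to y.
        order = rng.permutation(len(X_val))
        X_perm[:, columns] = X_val[order][:, columns]
        drops.append(baseline - r2_score(y_val, model.predict(X_perm)))
    return float(np.mean(drops))


single_0 = group_importance([0])
single_1 = group_importance([1])
both = group_importance([0, 1])
print(f"col 0: {single_0:.3f}  col 1: {single_1:.3f}  cols 0+1: {both:.3f}")
```

In this setup the drop for the pair is larger than the drop for either column alone, because shuffling a single column leaves its near-duplicate available to the model.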

Check stability

A single run can make a ranking look cleaner than it is. Check whether the important features stay important when you change the random seed, train-validation split, model settings, or number of repeats.

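# Re-run with different shuffle seeds. This varies only the permutation
# randomness; the model and the train-test split stay fixed.
seeds = [1, 7, 21, 42, 99]
mean_importances = []

for seed in seeds:
    result = permutation_importance(
        model,
        X_test,
        y_test,
        n_repeats=20,
        random_state=seed,
        scoring="r2",
    )
    mean_importances.append(result.importances_mean)

# Rows are seeds, columns are features.
stability = pd.DataFrame(mean_importances, columns=X.columns, index=seeds)
# Average importance across seeds, then the spread across seeds.
print(stability.mean().sort_values(ascending=False).head(10))
print(stability.std().sort_values(ascending=False).head(10))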
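Changing the shuffle seed tests only one source of variation. To check stability across train-validation splits as well, one option is to refit the model on several splits and compare the resulting rankings. The sketch below assumes five split seeds and Spearman rank correlation as the agreement measure; both are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

per_split = {}
for split_seed in [0, 1, 2, 3, 4]:
    # A new split and a refit model for every seed.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=split_seed)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    result = permutation_importance(
        model, X_te, y_te, n_repeats=10, random_state=0, scoring="r2"
    )
    per_split[split_seed] = pd.Series(result.importances_mean, index=X.columns)

# Rows are features, columns are split seeds.
importances = pd.DataFrame(per_split)
# Rank 1 is the most important feature within each split.
ranks = importances.rank(ascending=False, method="first").astype(int)
print(ranks)
# Pairwise rank agreement between splits; values near 1 mean similar orderings.
print(importances.corr(method="spearman").round(2))
```

If the top features keep their positions across columns of `ranks`, the ranking is probably safe to report; if they swap places, report the spread instead of a single ordering.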

What to report

The model type and baseline validation score.
The dataset split used for permutation importance.
The scoring metric and why it was chosen.
The number of repeats and whether rankings were stable.
Known correlated features, leakage risks, and excluded columns.

Source: scikit-learn permutation importance documentation.