XGBoost Feature Importance

XGBoost can rank features from the fitted booster itself, or you can measure feature reliance with permutation importance on validation data. A common mistake is to treat every importance score as if it measured the same thing. Gain, weight, cover, total gain, and permutation importance answer different questions.

Quick answer

For most practitioner reports, start with gain or total_gain from the fitted booster, then compare the ranking with permutation importance on held-out data. Built-in XGBoost importance is fast and useful for model diagnostics. Permutation importance is slower, but it is tied to the metric and dataset you care about.

| Method      | Best first use                               | Main caveat                                        |
|-------------|----------------------------------------------|----------------------------------------------------|
| gain        | Which features made useful splits on average | Averages can hide how often a feature was used     |
| total_gain  | Overall contribution to split improvement    | Can favor features used many times                 |
| weight      | Split frequency debugging                    | Frequent splits are not necessarily valuable splits |
| Permutation | Held-out metric reliance                     | Correlated features can mask each other            |

Fit an XGBoost model

The examples below use the scikit-learn estimator interface because it fits naturally into cross-validation, pipelines, metrics, and sklearn.inspection.permutation_importance. Passing importance_type makes the meaning of feature_importances_ explicit.

from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import pandas as pd

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y,
)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.9,
    colsample_bytree=0.9,
    eval_metric="logloss",
    importance_type="gain",
    random_state=42,
)

model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))

Compute importance only for a model whose validation performance is already acceptable. Feature importance from a weak or leaky model mostly describes the modeling problem; it is not a trustworthy explanation.

Use feature_importances_

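The scikit-learn wrapper exposes one score per feature through feature_importances_. For tree boosters, the property uses the estimator's configured importance_type, such as gain, weight, cover, total_gain, or total_cover. The wrapper rescales the scores so they sum to 1 and reports 0 for features that no tree split on, so the values are relative shares rather than raw gain.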

gain_importance = pd.Series(
    model.feature_importances_,
    index=X_train.columns,
    name="gain",
).sort_values(ascending=False)

print(gain_importance.head(10))

This is convenient when you already use the estimator API. It is less flexible than querying the underlying booster, especially when you want several importance types side by side.
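To make the relationship between the two views concrete, here is a minimal sketch of the rescaling that the wrapper appears to apply to raw booster scores: fill unused features with zero, then divide by the total. The dictionary values and the shortened feature list are made up for illustration; only the arithmetic is the point.

```python
# Hypothetical raw average-gain scores, shaped like the dict returned by
# booster.get_score(importance_type="gain"). The values are made up.
raw_gain = {"mean radius": 12.0, "worst area": 6.0, "mean texture": 2.0}

# Full feature list, including one feature that no tree ever split on.
feature_names = ["mean radius", "mean texture", "worst area", "mean symmetry"]

# Fill unused features with 0.0, then divide every score by the total.
filled = [raw_gain.get(name, 0.0) for name in feature_names]
total = sum(filled)
normalized = [value / total if total else 0.0 for value in filled]

for name, value in zip(feature_names, normalized):
    print(f"{name}: {value:.2f}")
```

Because the result is a share of the total, a value from feature_importances_ cannot be compared directly with a raw score from the booster, and it changes whenever the feature set changes.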

Use get_score

model.get_booster().get_score() reads raw, unnormalized importance directly from the fitted booster and returns it as a dict keyed by feature name. XGBoost omits features that were never used in a split, so reindex the result back to your full feature list before sorting, joining, or reporting.

booster = model.get_booster()

gain = pd.Series(
    booster.get_score(importance_type="gain"),
    name="gain",
)

total_gain = pd.Series(
    booster.get_score(importance_type="total_gain"),
    name="total_gain",
)

weight = pd.Series(
    booster.get_score(importance_type="weight"),
    name="weight",
)

booster_importance = pd.concat(
    [gain, total_gain, weight],
    axis=1,
).reindex(X_train.columns).fillna(0)

booster_importance = booster_importance.sort_values(
    "total_gain",
    ascending=False,
)

print(booster_importance.head(10))

Feature name check

If you fit with a NumPy array instead of a DataFrame, the booster may use generated names such as f0, f1, and f2. Prefer fitting with a DataFrame or keep a reliable mapping from model columns back to source feature names.
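If you are already stuck with generated names, a positional mapping is one way to recover readable labels. The sketch below assumes that the generated name f<i> refers to column i of the training array; the scores and the column names (age, income, tenure) are hypothetical.

```python
# Hypothetical output of get_score() for a model fit on a NumPy array.
# Keys are generated names; "f1" is absent because no tree split on it.
scores = {"f0": 4.5, "f2": 1.5}

# Column order used when the training array was built (an assumption here).
source_columns = ["age", "income", "tenure"]

# Map generated names back by position: "f<i>" refers to column i.
name_for = {f"f{i}": name for i, name in enumerate(source_columns)}

# Keep every source column, filling features the booster never used with 0.0.
named_scores = {name: scores.get(key, 0.0) for key, name in name_for.items()}

print(named_scores)
```

This only works if the column order at training time is known for certain, which is why fitting with a DataFrame is the safer default.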

Importance types

XGBoost builds an additive model of trees. Each tree split chooses a feature and threshold that improve the objective. The built-in importance types summarize those split decisions after training.

weight

Counts how many times a feature is used to split data across the trees. It is useful for debugging tree structure, but it does not measure how much each split improved the model. A feature can appear often in small, low-value splits.

gain

Averages the improvement in the training objective for splits that used the feature. This is often the most useful built-in ranking when you want to know which features made strong splits when they were selected.

cover

Averages the coverage of splits that used the feature. Coverage measures how much training data reached those splits, computed from the second-order gradient statistics (Hessians) of the rows involved; for squared-error loss this reduces to a row count. Treat it as a split-reach diagnostic, not as predictive value.

total_gain and total_cover

Sum gain or cover across all splits that used the feature. These can be better than averages when you care about aggregate model usage, but they naturally reward features that appear in many splits.
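The relationship between weight, gain, and total_gain is easy to see on a toy list of splits. The sketch below uses made-up split records rather than a real booster; it only illustrates that average gain equals total gain divided by the number of splits, and that the three scores can rank features differently.

```python
from collections import defaultdict

# Hypothetical split records: (feature used, gain of that split).
# Real values live inside the booster; these numbers are made up.
splits = [
    ("mean radius", 10.0),
    ("mean radius", 2.0),
    ("mean radius", 3.0),
    ("worst area", 12.0),
]

weight = defaultdict(int)        # number of splits per feature
total_gain = defaultdict(float)  # summed gain per feature

for feature, split_gain in splits:
    weight[feature] += 1
    total_gain[feature] += split_gain

# Average gain is total gain divided by the number of splits.
gain = {feature: total_gain[feature] / weight[feature] for feature in weight}

for feature in weight:
    print(feature, weight[feature], total_gain[feature], gain[feature])
```

Here mean radius leads on weight (3 splits) and total_gain (15.0), while worst area leads on average gain (12.0 versus 5.0). Neither ranking is wrong; they describe different aspects of how the trees used each feature.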

Plot importance

For quick inspection, XGBoost includes plot_importance. Use an explicit importance_type and limit the number of displayed features so the chart remains readable.

from xgboost import plot_importance
import matplotlib.pyplot as plt

ax = plot_importance(
    model,
    importance_type="gain",
    max_num_features=15,
    height=0.5,
    show_values=False,
)

ax.set_title("XGBoost feature importance by gain")
ax.set_xlabel("Average gain")
plt.tight_layout()
plt.show()

For reports, you will often get a cleaner result by plotting your own sorted Series. That lets you control labels, normalization, confidence intervals, and side-by-side comparisons.

top_gain = gain_importance.head(15).sort_values()

ax = top_gain.plot.barh(figsize=(7, 5))
ax.set_title("Top XGBoost features by gain")
ax.set_xlabel("Gain importance")
ax.set_ylabel("")
plt.tight_layout()
plt.show()

Compare with permutation importance

Built-in XGBoost importance describes the fitted booster's split behavior. Permutation importance asks a different question: how much does a chosen validation metric drop when one feature is shuffled?
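Before reaching for a library call, it can help to see the mechanics by hand. The following is a minimal sketch of the shuffle-and-rescore idea, not scikit-learn's implementation: toy_model is a hypothetical stand-in for a fitted model, accuracy stands in for the validation metric, and the data are synthetic.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def permutation_drop(model, rows, labels, column, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling one column, over n_repeats shuffles."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


def toy_model(row):
    """Hypothetical stand-in for a fitted model: it only looks at column 0."""
    return int(row[0] > 0.5)


# Synthetic data: column 0 determines the label, column 1 is unrelated.
rows = [[i / 100, (i * 37 % 100) / 100] for i in range(100)]
labels = [int(row[0] > 0.5) for row in rows]

drops = {
    "signal": permutation_drop(toy_model, rows, labels, column=0),
    "noise": permutation_drop(toy_model, rows, labels, column=1),
}
print(drops)
```

Shuffling the column the model ignores leaves the metric unchanged, while shuffling the column it depends on lowers it. In practice, sklearn.inspection.permutation_importance handles the repeats and the scoring for you.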

from sklearn.inspection import permutation_importance

permutation = permutation_importance(
    model,
    X_test,
    y_test,
    n_repeats=20,
    random_state=42,
    scoring="roc_auc",
)

permutation_scores = pd.Series(
    permutation.importances_mean,
    index=X_test.columns,
    name="permutation_auc_drop",
).sort_values(ascending=False)

comparison = pd.concat(
    [
        gain_importance.rename("xgboost_gain"),
        permutation_scores,
    ],
    axis=1,
).fillna(0)

comparison = comparison.sort_values(
    "permutation_auc_drop",
    ascending=False,
)

print(comparison.head(10))

Disagreement is not automatically a problem. It can reveal correlated features, redundant variables, train-test drift, leakage, or features that helped training-time splits without improving held-out performance.
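One way to summarize how much two rankings disagree is a rank correlation. The sketch below computes Spearman's rho by hand for two small, made-up score dictionaries; the feature names a to d and all values are hypothetical, and the shortcut formula used here assumes there are no tied scores.

```python
def ranks(scores):
    """Rank features from 1 (highest score) downward; assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman(a, b):
    """Spearman rank correlation for two dicts with the same keys and no ties."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(rank_a)
    squared_diff = sum((rank_a[key] - rank_b[key]) ** 2 for key in rank_a)
    return 1 - 6 * squared_diff / (n * (n * n - 1))


# Made-up scores: gain ranks a, b, c, d; permutation ranks a, c, b, d.
gain_scores = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05}
perm_scores = {"a": 0.04, "b": 0.01, "c": 0.03, "d": 0.00}

rho = spearman(gain_scores, perm_scores)

# Features whose rank differs between the two methods deserve a closer look.
rank_gain, rank_perm = ranks(gain_scores), ranks(perm_scores)
moved = [name for name in rank_gain if rank_gain[name] != rank_perm[name]]

print(round(rho, 3), moved)
```

A value near 1 means the two methods broadly agree on ordering. With a pandas DataFrame of both scores, DataFrame.corr(method="spearman") should give a comparable number and also handles ties, which matter when many features score zero.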

Interpretation caveats

  • Feature importance is model-specific. A different objective, preprocessing pipeline, random seed, or train-test split can change the ranking.
  • Built-in importance is based on training-time split statistics. It does not prove that a feature improves future performance.
  • Correlated predictors split credit. One feature can look weak because another feature carries similar information.
  • High importance does not imply causality. It means the model used information associated with that feature under the fitted training setup.
  • Leakage features often look extremely important. Investigate any top feature that would not be available, stable, or legitimate at prediction time.
  • One-hot encoded or target-encoded features may need to be grouped back to their source field before the result is meaningful to a stakeholder.
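The last point, grouping encoded columns back to their source field, can be sketched as a simple sum over a column-to-field mapping. The column names, scores, and mapping below are hypothetical. Summing is most defensible for additive scores such as total_gain; for permutation importance, shuffling all columns of a group together is usually a better fit than adding per-column drops.

```python
from collections import defaultdict

# Hypothetical importance scores for encoded columns (made-up values).
encoded_scores = {
    "city_london": 0.10,
    "city_paris": 0.05,
    "city_tokyo": 0.15,
    "age": 0.70,
}

# Hypothetical mapping from each encoded column to its source field.
source_field = {
    "city_london": "city",
    "city_paris": "city",
    "city_tokyo": "city",
    "age": "age",
}

# Sum the scores of all encoded columns that share a source field.
grouped = defaultdict(float)
for column, score in encoded_scores.items():
    grouped[source_field[column]] += score

for field, score in sorted(grouped.items(), key=lambda item: -item[1]):
    print(f"{field}: {score:.2f}")
```

Reporting city as one line is usually easier for a stakeholder to read than three separate dummy columns.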

What to report

A useful XGBoost feature-importance report should make the calculation reproducible and avoid overclaiming. Include the model objective, the validation metric, the dataset split, the importance type, and whether the ranking came from training-time booster statistics or held-out permutation tests.

Good phrasing

"On the validation split, shuffling mean radius reduced ROC AUC more than shuffling any other feature. In the fitted XGBoost model, mean radius also had the highest total gain. This suggests the model relies heavily on this variable, but it does not establish a causal effect."

Avoid phrasing

"Mean radius causes the prediction because it is the most important feature." That conclusion requires a causal design, not a feature importance ranking.

API reference: XGBoost Python API documentation.
