**End-to-End Tabular Modeling and Deployment with AutoGluon**
In this tutorial, we’re going to dive into AutoGluon, a popular AutoML library, to build a production-grade tabular machine learning pipeline. We’ll take a real-world mixed-type dataset from raw ingestion to deployment-ready artifacts, using bagged and stacked ensembles, appropriate evaluation metrics, subgroup and feature-level analysis, and optimization for real-time inference.
Before we begin, make sure you have the required libraries installed. We’ll use Python 3.8 or later, as well as the following dependencies:
```python
!pip -q install -U "autogluon==1.5.0" "scikit-learn>1.3" "pandas>2.0" "numpy>1.24"
```
With our dependencies in place, let’s get started!
**Data Preparation**
We’ll load a real-world mixed-type dataset (the Titanic passenger data from OpenML) and perform some mild preprocessing to prepare a clean training table. We’ll define the target variable, remove extremely leaky columns, and validate the dataset structure. We’ll then create a stratified train-test split to preserve the class balance.
```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load the Titanic dataset (OpenML data_id=40945)
df = fetch_openml(data_id=40945, as_frame=True).frame
target = "survived"
df[target] = df[target].astype(int)

# Drop extremely leaky columns (recorded only after the outcome was known)
drop_cols = [c for c in ["boat", "body", "home.dest"] if c in df.columns]
df = df.drop(columns=drop_cols, errors="ignore")
df = df.replace({None: np.nan})

print("Shape:", df.shape)
print("Target positive rate:", df[target].mean().round(4))
print("Columns:", df.columns.tolist())

# Stratified split to preserve the class balance in both partitions
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df[target], random_state=42)
```
**Model Training**
Next, we’ll detect hardware availability to dynamically choose the most suitable AutoGluon training preset. We’ll configure a persistent model directory and initialize the tabular predictor with an appropriate evaluation metric.
```python
import os
from autogluon.tabular import TabularPredictor

# Detect hardware availability for dynamic preset selection
def has_gpu():
    try:
        import torch
        return torch.cuda.is_available()
    except Exception:
        return False

# Use the heavier preset only when a GPU can absorb the extra compute
presets = "best_quality" if has_gpu() else "high"
save_path = "/content/autogluon_titanic_advanced"
os.makedirs(save_path, exist_ok=True)

predictor = TabularPredictor(
    label=target,
    eval_metric="roc_auc",
    path=save_path,
    verbosity=2,
)
```
**Ensemble Training**
Now it’s time to train a high-quality ensemble using bagging and stacking within a time budget. We’ll rely on AutoGluon’s automated model search to discover robust architectures. We’ll also record training time to understand computational cost.
```python
import time

# Train a bagged + stacked ensemble with AutoGluon's automated model search
start = time.time()
predictor.fit(
    train_data=train_df,
    presets=presets,
    time_limit=7 * 60,
    num_bag_folds=5,
    num_stack_levels=2,
    refit_full=False,
)
train_time = time.time() - start
print(f"Training completed in {train_time:.1f}s with preset='{presets}'")
```
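AutoGluon handles the bagging and stacking internally, so you never build the ensemble by hand. For intuition about what `num_bag_folds` and `num_stack_levels` control, here is a minimal scikit-learn analogue on synthetic data; the model choices and dataset are purely illustrative, not what AutoGluon actually selects.

```python
# Illustrative only: a hand-rolled bagged + stacked ensemble in scikit-learn,
# roughly analogous to what AutoGluon's fit() automates.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Level 0: bagged base learners (analogous to num_bag_folds)
base_learners = [
    ("bagged_tree", BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]
# Level 1: a stacker fit on cross-validated predictions (analogous to num_stack_levels)
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_tr, y_tr)
print(f"Stacked ensemble holdout accuracy: {stack.score(X_te, y_te):.3f}")
```

The stacker only ever sees out-of-fold predictions from the base learners, which is the same leakage-avoidance trick AutoGluon applies at scale.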
**Model Evaluation**
We’ll assess the trained model’s performance on a held-out test set and examine the leaderboard to compare per-model performance. We’ll compute probabilistic and discrete predictions and derive key classification metrics. This gives us a comprehensive view of model accuracy and calibration.
```python
from sklearn.metrics import accuracy_score, classification_report, log_loss, roc_auc_score

# Compare all trained models on the held-out test set
lb = predictor.leaderboard(test_df, silent=True)
print("=== Leaderboard (top 15) ===")
print(lb.head(15))

# Probabilistic and discrete predictions
proba = predictor.predict_proba(test_df)
pred = predictor.predict(test_df)
y_true = test_df[target].values
if isinstance(proba, pd.DataFrame) and 1 in proba.columns:
    y_proba = proba[1].values
else:
    y_proba = np.asarray(proba).reshape(-1)

print("=== Test Metrics ===")
print("ROC-AUC:", round(roc_auc_score(y_true, y_proba), 5))
print("LogLoss:", round(log_loss(y_true, np.clip(y_proba, 1e-6, 1 - 1e-6)), 5))
print("Accuracy:", round(accuracy_score(y_true, pred), 5))
print("Classification report:")
print(classification_report(y_true, pred))
```
**Subgroup and Feature-Level Analysis**
We’ll analyze model behavior by slicing performance across subgroups and computing permutation-based feature importance. We’ll determine how performance varies across significant segments of the data. This helps us assess robustness and interpretability before deployment.
```python
# Slice performance by passenger class
if "pclass" in test_df.columns:
    print("=== Slice AUC by pclass ===")
    for grp, subset in test_df.groupby("pclass"):
        subset_proba = predictor.predict_proba(subset)
        if isinstance(subset_proba, pd.DataFrame) and 1 in subset_proba.columns:
            subset_proba = subset_proba[1].values
        else:
            subset_proba = np.asarray(subset_proba).reshape(-1)
        auc = roc_auc_score(subset[target].values, subset_proba)
        print(f"pclass={grp}: AUC={auc:.4f} (n={len(subset)})")

# Permutation-based feature importance on the test set
fi = predictor.feature_importance(test_df, silent=True)
print("=== Feature importance (top 20) ===")
print(fi.head(20))
```
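Under the hood, permutation importance asks: how much does the metric degrade if we destroy one feature’s relationship to the target by shuffling it? The core idea can be sketched in a few lines of NumPy with any fitted model; the dataset and model here are illustrative, and AutoGluon’s `feature_importance` adds repeats and significance estimates on top of this.

```python
# Minimal sketch of permutation importance: shuffle one column at a time
# and measure how much the metric drops relative to the unshuffled baseline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=800, n_features=6, n_informative=3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, j])  # break the feature-target relationship for column j
    score = roc_auc_score(y, model.predict_proba(X_perm)[:, 1])
    print(f"feature_{j}: importance={baseline - score:.4f}")
```

Features whose shuffling barely moves the metric are candidates for removal, which can also shrink the model’s inference-time input surface.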
**Inference Optimization**
We’ll optimize the trained ensemble for inference by collapsing bagged models and benchmarking latency improvements. We’ll optionally distill the ensemble into faster models and validate persistence by save-reload checks. Finally, we’ll export structured artifacts required for production handoff.
```python
# Collapse bagged ensembles into single refit models for faster inference
t0 = time.time()
refit_map = predictor.refit_full()
t_refit = time.time() - t0
print(f"Refit completed in {t_refit:.1f}s")
print("Refit mapping (sample):", dict(list(refit_map.items())[:5]))

lb_full = predictor.leaderboard(test_df, silent=True)
print("=== Leaderboard after refit_full (top 15) ===")
print(lb_full.head(15))

best_model = predictor.get_model_best()
full_candidates = [m for m in predictor.get_model_names() if m.endswith("_FULL")]

def bench_infer(model_name, df_in, repeats=3):
    times = []
    for _ in range(repeats):
        t1 = time.time()
        _ = predictor.predict(df_in, model=model_name)
        times.append(time.time() - t1)
    return float(np.median(times))

small_batch = test_df.drop(columns=[target]).head(256)
lat_best = bench_infer(best_model, small_batch)
print(f"Best model: {best_model} | median predict() latency on 256 rows: {lat_best:.4f}s")

if full_candidates:
    lb_full_sorted = lb_full.sort_values(by="score_test", ascending=False)
    best_full = lb_full_sorted[lb_full_sorted["model"].str.endswith("_FULL")].iloc[0]["model"]
    lat_full = bench_infer(best_full, small_batch)
    print(f"Best FULL model: {best_full} | median predict() latency on 256 rows: {lat_full:.4f}s")
    print(f"Speedup factor (best / full): {lat_best / max(lat_full, 1e-9):.2f}x")

# Optionally distill the ensemble into smaller, faster student models
try:
    t0 = time.time()
    distill_result = predictor.distill(
        train_data=train_df,
        time_limit=4 * 60,
        augment_method="spunge",
    )
    t_distill = time.time() - t0
    print(f"Distillation completed in {t_distill:.1f}s")
except Exception as e:
    print("Distillation step failed")
    print("Error:", repr(e))

lb2 = predictor.leaderboard(test_df, silent=True)
print("=== Leaderboard after distillation attempt (top 20) ===")
print(lb2.head(20))

# Validate persistence with a save-reload round trip
predictor.save()
reloaded = TabularPredictor.load(save_path)
sample = test_df.drop(columns=[target]).sample(8, random_state=0)
sample_pred = reloaded.predict(sample)
sample_proba = reloaded.predict_proba(sample)
print("=== Reloaded predictor sanity-check ===")
print(sample.assign(pred=sample_pred).head())
print("Probabilities (head):")
print(sample_proba.head())

# Export structured artifacts for production handoff
import json
artifacts = {
    "path": save_path,
    "presets": presets,
    "best_model": reloaded.get_model_best(),
    "model_names": reloaded.get_model_names(),
    "leaderboard_top10": lb2.head(10).to_dict(orient="records"),
}
with open(os.path.join(save_path, "run_summary.json"), "w") as f:
    json.dump(artifacts, f, indent=2)
print("Saved artifact to:", os.path.join(save_path, "run_summary.json"))
print("Done.")
```
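One benefit of the JSON artifact is that downstream services can inspect run metadata with the standard library alone, without installing AutoGluon. A minimal consumer might look like the sketch below; the dictionary values (model names, path) are illustrative placeholders mirroring the `artifacts` schema written above.

```python
# Minimal consumer for a run_summary.json artifact, standard library only.
import json
import os
import tempfile

summary = {  # same schema as the artifacts dict written by the tutorial
    "path": "/content/autogluon_titanic_advanced",
    "presets": "best_quality",
    "best_model": "WeightedEnsemble_L3",
    "model_names": ["LightGBM_BAG_L1", "WeightedEnsemble_L3"],
}
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "run_summary.json")
    with open(p, "w") as f:
        json.dump(summary, f, indent=2)  # write, as the training job would
    with open(p) as f:
        loaded = json.load(f)            # read back, as a serving host would
    print("Best model recorded for serving:", loaded["best_model"])
```

Keeping this handoff file framework-agnostic means deployment tooling only needs AutoGluon itself when it actually loads the predictor to serve predictions.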
**Conclusion**
In this tutorial, we’ve demonstrated how to build an end-to-end AutoGluon workflow that turns raw tabular data into production-ready models with minimal manual intervention, while retaining control over accuracy, robustness, and inference efficiency. We performed systematic subgroup and feature-importance analysis, optimized large ensembles through refitting and distillation, and validated deployment readiness with latency benchmarking and artifact packaging. Together, these steps yield high-performing, interpretable tabular models that are well suited to real-world production environments.
