In this tutorial, we show how to treat prompts as first-class, versioned artifacts and apply rigorous regression testing to large language model behavior using MLflow. We design an evaluation pipeline that logs prompt versions, prompt diffs, model outputs, and multiple quality metrics in a fully reproducible way. By combining classical text metrics with semantic similarity and automated regression flags, we demonstrate how to systematically detect performance drift caused by seemingly small prompt changes. Throughout the tutorial, we focus on building a workflow that mirrors real software engineering practices, applied to prompt engineering and LLM evaluation. Check out the FULL CODES here.
!pip -q install -U "openai>=1.0.0" mlflow rouge-score nltk sentence-transformers scikit-learn pandas
import os, json, time, difflib, re
from typing import List, Dict, Any, Tuple
import mlflow
import pandas as pd
import numpy as np
from openai import OpenAI
from rouge_score import rouge_scorer
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

if not os.getenv("OPENAI_API_KEY"):
    try:
        from google.colab import userdata  # type: ignore
        k = userdata.get("OPENAI_API_KEY")
        if k:
            os.environ["OPENAI_API_KEY"] = k
    except Exception:
        pass

if not os.getenv("OPENAI_API_KEY"):
    import getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY (input hidden): ").strip()

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is required."
We set up the execution environment by installing all required dependencies and importing the core libraries used throughout the tutorial. We securely load the OpenAI API key at runtime, ensuring credentials are never hard-coded in the notebook. We also initialize essential NLP resources so that the evaluation pipeline runs reliably across different environments.
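As a quick optional check (not part of the original notebook), we can confirm that the key libraries imported correctly and that the API key is visible to the process before making any calls:

# Optional sanity check: print library versions and key availability.
import mlflow, openai, sentence_transformers
print("mlflow:", mlflow.__version__)
print("openai:", openai.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))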
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.2
MAX_OUTPUT_TOKENS = 250
ABS_SEM_SIM_MIN = 0.78
DELTA_SEM_SIM_MAX_DROP = 0.05
DELTA_ROUGE_L_MAX_DROP = 0.08
DELTA_BLEU_MAX_DROP = 0.10
mlflow.set_tracking_uri("file:/content/mlruns")
mlflow.set_experiment("prompt_versioning_llm_regression")
client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
EVAL_SET = [
{
"id": "q1",
"input": "Summarize in one sentence: MLflow tracks experiments, runs, parameters, metrics, and artifacts.",
"reference": "MLflow helps track machine learning experiments by logging runs with parameters, metrics, and artifacts."
},
{
"id": "q2",
"input": "Rewrite professionally: 'this model is kinda slow but it works ok.'",
"reference": "The model is somewhat slow, but it performs reliably."
},
{
"id": "q3",
"input": "Extract key fields as JSON: 'Order 5531 by Alice costs $42.50 and ships to Toronto.'",
"reference": '{"order_id":"5531","customer":"Alice","amount_usd":42.50,"city":"Toronto"}'
},
{
"id": "q4",
"input": "Answer briefly: What is prompt regression testing?",
"reference": "Prompt regression testing checks whether prompt changes degrade model outputs compared to a baseline."
},
]
PROMPTS = [
    {
        "version": "v1_baseline",
        "prompt": (
            "You are a precise assistant.\n"
            "Follow the user request carefully.\n"
            "If asked for JSON, output valid JSON only.\n"
            "User: {user_input}"
        )
    },
    {
        "version": "v2_formatting",
        "prompt": (
            "You are a helpful, structured assistant.\n"
            "Respond clearly and concisely.\n"
            "Prefer clean formatting.\n"
            "User request: {user_input}"
        )
    },
    {
        "version": "v3_guardrailed",
        "prompt": (
            "You are a rigorous assistant.\n"
            "Rules:\n"
            "1) If user asks for JSON, output ONLY valid minified JSON.\n"
            "2) Otherwise, keep the answer short and factual.\n"
            "User: {user_input}"
        )
    },
]
We define all experimental configuration, including model parameters, regression thresholds, and MLflow tracking settings. We assemble the evaluation dataset and explicitly declare multiple prompt versions to compare and test for regressions. By centralizing these definitions, we ensure that prompt changes and evaluation logic remain controlled and reproducible.
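Before making any model calls, it can help to confirm that a template renders as intended. Here is a minimal illustrative snippet (not part of the original notebook) that previews the baseline prompt for the first evaluation example:

# Illustrative preview: render the v1 baseline template against the first
# evaluation example to verify the {user_input} placeholder is filled correctly.
preview = PROMPTS[0]["prompt"].format(user_input=EVAL_SET[0]["input"])
print(preview)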
def call_llm(formatted_prompt: str) -> str:
    resp = client.responses.create(
        model=MODEL,
        input=formatted_prompt,
        temperature=TEMPERATURE,
        max_output_tokens=MAX_OUTPUT_TOKENS,
    )
    out = getattr(resp, "output_text", None)
    if out:
        return out.strip()
    try:
        texts = []
        for item in resp.output:
            if getattr(item, "type", "") == "message":
                for c in item.content:
                    if getattr(c, "type", "") in ("output_text", "text"):
                        texts.append(getattr(c, "text", ""))
        return "\n".join(texts).strip()
    except Exception:
        return ""

smooth = SmoothingFunction().method3
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def safe_tokenize(s: str) -> List[str]:
    s = (s or "").strip().lower()
    if not s:
        return []
    try:
        return nltk.word_tokenize(s)
    except LookupError:
        return re.findall(r"\b\w+\b", s)

def bleu_score(ref: str, hyp: str) -> float:
    r = safe_tokenize(ref)
    h = safe_tokenize(hyp)
    if len(h) == 0 or len(r) == 0:
        return 0.0
    return float(sentence_bleu([r], h, smoothing_function=smooth))

def rougeL_f1(ref: str, hyp: str) -> float:
    scores = rouge.score(ref or "", hyp or "")
    return float(scores["rougeL"].fmeasure)

def semantic_sim(ref: str, hyp: str) -> float:
    embs = embedder.encode([ref or "", hyp or ""], normalize_embeddings=True)
    return float(cosine_similarity([embs[0]], [embs[1]])[0][0])
We implement the core LLM invocation and the evaluation metrics used to assess prompt quality. We compute BLEU, ROUGE-L, and semantic similarity scores to capture both surface-level and semantic differences in model outputs. This lets us evaluate prompt changes from multiple complementary perspectives rather than relying on a single metric.
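To build intuition for how the three metrics behave, here is a small illustrative check (assumed, not part of the original notebook) comparing a reference sentence against a close paraphrase; semantic similarity tends to stay high even when n-gram overlap (BLEU, ROUGE-L) is only moderate:

# Toy comparison: a paraphrase typically scores higher on semantic similarity
# than on surface-overlap metrics, which is why the pipeline tracks all three.
ref = "MLflow helps track machine learning experiments by logging runs."
hyp = "MLflow is used to log and track ML experiment runs."
print("BLEU:", round(bleu_score(ref, hyp), 3))
print("ROUGE-L F1:", round(rougeL_f1(ref, hyp), 3))
print("Semantic sim:", round(semantic_sim(ref, hyp), 3))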
def evaluate_prompt(prompt_template: str) -> Tuple[pd.DataFrame, Dict[str, float], str]:
    rows = []
    for ex in EVAL_SET:
        p = prompt_template.format(user_input=ex["input"])
        y = call_llm(p)
        ref = ex["reference"]
        rows.append({
            "id": ex["id"],
            "input": ex["input"],
            "reference": ref,
            "output": y,
            "bleu": bleu_score(ref, y),
            "rougeL_f1": rougeL_f1(ref, y),
            "semantic_sim": semantic_sim(ref, y),
        })
    df = pd.DataFrame(rows)
    agg = {
        "bleu_mean": float(df["bleu"].mean()),
        "rougeL_f1_mean": float(df["rougeL_f1"].mean()),
        "semantic_sim_mean": float(df["semantic_sim"].mean()),
    }
    outputs_jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
    return df, agg, outputs_jsonl

def log_text_artifact(text: str, artifact_path: str):
    mlflow.log_text(text, artifact_path)

def prompt_diff(previous: str, new: str) -> str:
    a = previous.splitlines(keepends=True)
    b = new.splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, fromfile="previous_prompt", tofile="current_prompt"))

def compute_regression_flags(baseline: Dict[str, float], current: Dict[str, float]) -> Dict[str, Any]:
    d_sem = baseline["semantic_sim_mean"] - current["semantic_sim_mean"]
    d_rouge = baseline["rougeL_f1_mean"] - current["rougeL_f1_mean"]
    d_bleu = baseline["bleu_mean"] - current["bleu_mean"]
    flags = {
        "abs_semantic_fail": current["semantic_sim_mean"] < ABS_SEM_SIM_MIN,
        "drop_semantic_fail": d_sem > DELTA_SEM_SIM_MAX_DROP,
        "drop_rouge_fail": d_rouge > DELTA_ROUGE_L_MAX_DROP,
        "drop_bleu_fail": d_bleu > DELTA_BLEU_MAX_DROP,
        "delta_semantic": float(d_sem),
        "delta_rougeL": float(d_rouge),
        "delta_bleu": float(d_bleu),
    }
    flags["regression"] = any([flags["abs_semantic_fail"], flags["drop_semantic_fail"], flags["drop_rouge_fail"], flags["drop_bleu_fail"]])
    return flags
We build the evaluation and regression logic that runs each prompt against the evaluation set and aggregates the results. We log prompt artifacts, prompt diffs, and evaluation outputs to MLflow, ensuring every experiment remains auditable. We also compute regression flags that automatically identify whether a prompt version degrades performance relative to the baseline. Check out the FULL CODES here.
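To see the thresholding logic in isolation, here is a minimal hypothetical example (the values are made up) in which a 0.06 drop in mean semantic similarity exceeds DELTA_SEM_SIM_MAX_DROP (0.05) and flags a regression, even though the absolute score stays above ABS_SEM_SIM_MIN:

# Hypothetical aggregate metrics for a baseline and a candidate prompt version.
baseline_toy = {"bleu_mean": 0.30, "rougeL_f1_mean": 0.50, "semantic_sim_mean": 0.88}
candidate_toy = {"bleu_mean": 0.29, "rougeL_f1_mean": 0.48, "semantic_sim_mean": 0.82}
# drop_semantic_fail is True (0.88 - 0.82 = 0.06 > 0.05), so "regression" is True.
print(compute_regression_flags(baseline_toy, candidate_toy))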
print("Operating immediate versioning + regression testing with MLflow...")
print(f"Monitoring URI: {mlflow.get_tracking_uri()}")
print(f"Experiment: {mlflow.get_experiment_by_name('prompt_versioning_llm_regression').identify}")
run_summary = []
baseline_metrics = None
baseline_prompt = None
baseline_df = None
baseline_metrics_name = None
with mlflow.start_run(run_name=f"prompt_regression_suite_{int(time.time())}") as parent_run:
mlflow.set_tag("job", "prompt_versioning_regression_testing")
mlflow.log_param("mannequin", MODEL)
mlflow.log_param("temperature", TEMPERATURE)
mlflow.log_param("max_output_tokens", MAX_OUTPUT_TOKENS)
mlflow.log_param("eval_set_size", len(EVAL_SET))
for pv in PROMPTS:
ver = pv["version"]
prompt_t = pv["prompt"]
with mlflow.start_run(run_name=ver, nested=True) as child_run:
mlflow.log_param("prompt_version", ver)
log_text_artifact(prompt_t, f"prompts/{ver}.txt")
if baseline_prompt is just not None and baseline_metrics_name is just not None:
diff = prompt_diff(baseline_prompt, prompt_t)
log_text_artifact(diff, f"prompt_diffs/{baseline_metrics_name}_to_{ver}.diff")
else:
log_text_artifact("BASELINE_PROMPT (no diff)", f"prompt_diffs/{ver}.diff")
df, agg, outputs_jsonl = evaluate_prompt(prompt_t)
mlflow.log_dict(agg, f"metrics/{ver}_agg.json")
log_text_artifact(outputs_jsonl, f"outputs/{ver}_outputs.jsonl")
mlflow.log_metric("bleu_mean", agg["bleu_mean"])
mlflow.log_metric("rougeL_f1_mean", agg["rougeL_f1_mean"])
mlflow.log_metric("semantic_sim_mean", agg["semantic_sim_mean"])
if baseline_metrics is None:
baseline_metrics = agg
baseline_prompt = prompt_t
baseline_df = df
baseline_metrics_name = ver
flags = {"regression": False, "delta_bleu": 0.0, "delta_rougeL": 0.0, "delta_semantic": 0.0}
mlflow.set_tag("regression", "false")
else:
flags = compute_regression_flags(baseline_metrics, agg)
mlflow.log_metric("delta_bleu", flags["delta_bleu"])
mlflow.log_metric("delta_rougeL", flags["delta_rougeL"])
mlflow.log_metric("delta_semantic", flags["delta_semantic"])
mlflow.set_tag("regression", str(flags["regression"]).decrease())
for ok in ["abs_semantic_fail","drop_semantic_fail","drop_rouge_fail","drop_bleu_fail"]:
mlflow.set_tag(ok, str(flags[k]).decrease())
run_summary.append({
"prompt_version": ver,
"bleu_mean": agg["bleu_mean"],
"rougeL_f1_mean": agg["rougeL_f1_mean"],
"semantic_sim_mean": agg["semantic_sim_mean"],
"delta_bleu_vs_baseline": float(flags.get("delta_bleu", 0.0)),
"delta_rougeL_vs_baseline": float(flags.get("delta_rougeL", 0.0)),
"delta_semantic_vs_baseline": float(flags.get("delta_semantic", 0.0)),
"regression_flag": bool(flags["regression"]),
"mlflow_run_id": child_run.information.run_id,
})
summary_df = pd.DataFrame(run_summary).sort_values("prompt_version")
print("n=== Aggregated Outcomes (larger is healthier) ===")
show(summary_df)
regressed = summary_df[summary_df["regression_flag"] == True]
if len(regressed) > 0:
print("n🚩 Regressions detected:")
show(regressed[["prompt_version","delta_bleu_vs_baseline","delta_rougeL_vs_baseline","delta_semantic_vs_baseline","mlflow_run_id"]])
else:
print("n✅ No regressions detected underneath present thresholds.")
if len(regressed) > 0 and baseline_df is just not None:
worst_ver = regressed.sort_values("delta_semantic_vs_baseline", ascending=False).iloc[0]["prompt_version"]
worst_prompt = subsequent(p["prompt"] for p in PROMPTS if p["version"] == worst_ver)
worst_df, _, _ = evaluate_prompt(worst_prompt)
merged = baseline_df[["id","output","bleu","rougeL_f1","semantic_sim"]].merge(
worst_df[["id","output","bleu","rougeL_f1","semantic_sim"]],
on="id",
suffixes=("_baseline", f"_{worst_ver}")
)
merged["delta_semantic"] = merged["semantic_sim_baseline"] - merged[f"semantic_sim_{worst_ver}"]
merged["delta_rougeL"] = merged["rougeL_f1_baseline"] - merged[f"rougeL_f1_{worst_ver}"]
merged["delta_bleu"] = merged["bleu_baseline"] - merged[f"bleu_{worst_ver}"]
print(f"n=== Per-example deltas: baseline vs {worst_ver} (constructive delta = worse) ===")
show(
merged[["id","delta_semantic","delta_rougeL","delta_bleu","output_baseline",f"output_{worst_ver}"]]
.sort_values("delta_semantic", ascending=False)
)
print("nOpen MLflow UI (elective) by operating:")
print("!mlflow ui --backend-store-uri file:/content material/mlruns --host 0.0.0.0 --port 5000")
We orchestrate the full prompt regression testing workflow using nested MLflow runs. We compare each prompt version against the baseline, log metric deltas, and report regression outcomes in a structured summary table. This completes a repeatable, engineering-grade pipeline for prompt versioning and regression testing that we can extend to larger datasets and real-world applications.
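As an optional follow-up (a minimal sketch, not from the original notebook), the logged runs can be pulled back out of the local tracking store with mlflow.search_runs to compare prompt versions programmatically; metric and tag columns follow MLflow's "metrics.*" / "tags.*" naming convention:

# Query the experiment's runs as a DataFrame; column availability depends on
# what was actually logged, so we filter the requested columns defensively.
runs = mlflow.search_runs(experiment_names=["prompt_versioning_llm_regression"])
cols = ["tags.mlflow.runName", "metrics.semantic_sim_mean", "metrics.rougeL_f1_mean", "metrics.bleu_mean", "tags.regression"]
print(runs[[c for c in cols if c in runs.columns]])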
In conclusion, we established a practical, research-oriented framework for prompt versioning and regression testing that lets us evaluate LLM behavior with discipline and transparency. We showed how MLflow enables us to track prompt evolution, compare outputs across versions, and automatically flag regressions based on well-defined thresholds. This approach helps us move away from ad hoc prompt tuning and toward measurable, repeatable experimentation. By adopting this workflow, we ensure that prompt updates improve model behavior intentionally rather than introducing hidden performance regressions.
Check out the FULL CODES here.
