(WORK_DIR / "choose.prompty").write_text("""---
identify: Choose
mannequin:
api: chat
configuration:
sort: openai
connection: open_ai_connection
mannequin: gpt-4o-mini
parameters:
temperature: 0
max_tokens: 150
response_format: {sort: json_object}
inputs:
query: {sort: string}
reply: {sort: string}
anticipated: {sort: string}
---
system:
You're an exacting grader. Determine whether or not the assistant's reply comprises the anticipated reality (case-insensitive, permitting affordable phrasing/synonyms). Reply ONLY as JSON: {"rating": 0 or 1, "purpose": "..."}.
person:
Query: {{query}}
Anticipated: {{anticipated}}
Reply: {{reply}}
""")
(WORK_DIR / "eval_flow.py").write_text(textwrap.dedent('''
import json
from pathlib import Path
from promptflow.tracing import hint
from promptflow.core import Prompty
BASE = Path(__file__).mother or father
class Evaluator:
def __init__(self):
self.choose = Prompty.load(supply=BASE / "choose.prompty")
@hint
def __call__(self, query: str, reply: str, anticipated: str) -> dict:
uncooked = self.choose(query=query, reply=reply, anticipated=anticipated)
if isinstance(uncooked, str):
strive: uncooked = json.masses(uncooked)
besides Exception: uncooked = {"rating": 0, "purpose": f"unparseable:{uncooked[:80]}"}
return {"rating": int(uncooked.get("rating", 0)), "purpose": str(uncooked.get("purpose",""))}
def __aggregate__(self, line_results):
"""Run-level aggregation. No matter this returns reveals up in pf.get_metrics()."""
scores = [r["score"] for r in line_results if r]
return {
"accuracy": (sum(scores) / len(scores)) if scores else 0.0,
"handed": sum(scores),
"whole": len(scores),
}
'''))
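
# Optional smoke test of the Evaluator outside pf.run (a sketch only; assumes
# the connection above resolves, and the sample strings are made up):
#   import sys; sys.path.insert(0, str(WORK_DIR))
#   from eval_flow import Evaluator
#   print(Evaluator()(question="Capital of France?",
#                     answer="It is Paris.", expected="Paris"))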
(WORK_DIR / "eval.flex.yaml").write_text(
"$schema: https://azuremlschemas.azureedge.internet/promptflow/newest/Circulate.schema.jsonn"
"entry: eval_flow:Evaluatorn"
)
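
# The flex flow maps pf onto the class: "entry: eval_flow:Evaluator" means
# module eval_flow, class Evaluator. pf instantiates it once, calls __call__
# for each data line, then __aggregate__ once over the collected line results.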
print("n=== Analysis run ===")
eval_run = pf.run(
circulation=str(WORK_DIR / "eval.flex.yaml"),
information=str(data_path),
run=base_run,
column_mapping={
"query": "${information.query}",
"anticipated": "${information.anticipated}",
"reply": "${run.outputs.reply}",
},
stream=True,
)
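
# column_mapping resolves each evaluator input per line: "${data.*}" reads
# columns from the JSONL at data_path, while "${run.outputs.*}" pulls outputs
# of base_run (linked via run=base_run), so the judge grades that run's answers.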
eval_details = pf.get_details(eval_run)
print(eval_details)
print("n=== Aggregated metrics (from __aggregate__) ===")
print(json.dumps(pf.get_metrics(eval_run), indent=2))
import pandas as pd
if "outputs.rating" in eval_details.columns:
s = pd.to_numeric(eval_details["outputs.score"], errors="coerce").fillna(0)
print(f"Guide accuracy: {s.imply():.2%} ({int(s.sum())}/{len(s)})")
