    A Coding Implementation of a Complete Enterprise AI Benchmarking Framework to Evaluate Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks

    By Naveed Ahmad · 02/11/2025


    In this tutorial, we develop a comprehensive benchmarking framework to evaluate different types of agentic AI systems on real-world enterprise software tasks. We design a set of diverse challenges, from data transformation and API integration to workflow automation and performance optimization, and assess how various agents, including rule-based, LLM-powered, and hybrid ones, perform across these domains. By running structured benchmarks and visualizing key performance metrics, such as accuracy, execution time, and success rate, we gain a deeper understanding of each agent's strengths and trade-offs in enterprise environments. Check out the Full Codes here.

    import json
    import time
    import random
    from typing import Dict, List, Any, Callable
    from dataclasses import dataclass, asdict
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    
    @dataclass
    class Task:
       id: str
       name: str
       description: str
       category: str
       complexity: int
       expected_output: Any
    
    
    @dataclass
    class BenchmarkResult:
       task_id: str
       agent_name: str
       success: bool
       execution_time: float
       accuracy: float
       error_message: str = ""
    
    
    class EnterpriseTaskSuite:
       def __init__(self):
           self.tasks = self._create_tasks()
    
    
       def _create_tasks(self) -> List[Task]:
           return [
               Task("data_transform", "CSV Data Transformation",
                    "Transform customer data by aggregating sales", "data_processing", 3,
                    {"total_sales": 15000, "avg_order": 750}),
               Task("api_integration", "REST API Integration",
                    "Parse API response and extract key metrics", "integration", 2,
                    {"status": "success", "active_users": 1250}),
               Task("workflow_automation", "Multi-Step Workflow",
                    "Execute data validation -> processing -> reporting", "automation", 4,
                    {"validated": True, "processed": 100, "report_generated": True}),
               Task("error_handling", "Error Recovery",
                    "Handle malformed data gracefully", "reliability", 3,
                    {"errors_caught": 5, "recovery_success": True}),
               Task("optimization", "Query Optimization",
                    "Optimize database query performance", "performance", 5,
                    {"execution_time_ms": 45, "rows_scanned": 1000}),
               Task("data_validation", "Schema Validation",
                    "Validate data against business rules", "validation", 2,
                    {"valid_records": 95, "invalid_records": 5}),
               Task("reporting", "Executive Dashboard",
                    "Generate KPI summary report", "analytics", 3,
                    {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
               Task("integration_test", "System Integration",
                    "Test end-to-end integration flow", "testing", 4,
                    {"all_systems_connected": True, "latency_ms": 120}),
           ]
    
    
       def get_task(self, task_id: str) -> Task:
           return next((t for t in self.tasks if t.id == task_id), None)

    We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which holds several enterprise-relevant tasks such as data transformation, reporting, and integration. This lays the foundation for consistently evaluating different types of agents across these tasks, as the short usage sketch below illustrates. Check out the Full Codes here.
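
    As a minimal usage sketch (assuming only the classes defined above), we can instantiate the suite, look up a task by its id, and inspect its fields:

    suite = EnterpriseTaskSuite()
    task = suite.get_task("data_transform")   # fetch one benchmark task by id
    print(task.name, task.complexity)         # "CSV Data Transformation", 3
    print(task.expected_output)               # {'total_sales': 15000, 'avg_order': 750}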

    class BaseAgent:
       def __init__(self, name: str):
           self.name = name
    
    
       def execute(self, task: Task) -> Dict[str, Any]:
           raise NotImplementedError
    
    
    class RuleBasedAgent(BaseAgent):
       def execute(self, task: Task) -> Dict[str, Any]:
           time.sleep(random.uniform(0.1, 0.3))
           if task.category == "data_processing":
               return {"total_sales": 15000 + random.randint(-500, 500),
                       "avg_order": 750 + random.randint(-50, 50)}
           elif task.category == "integration":
               return {"status": "success", "active_users": 1250}
           elif task.category == "automation":
               return {"validated": True, "processed": 98, "report_generated": True}
           else:
               return task.expected_output

    We introduce the base agent structure and implement the RuleBasedAgent, which mimics traditional automation logic using predefined rules. We simulate how such agents execute tasks deterministically while maintaining speed and reliability, giving us a baseline for comparison with more advanced agents; a quick standalone call is sketched below. Check out the Full Codes here.
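
    A quick sketch of calling the rule-based agent directly, outside the benchmark loop (assuming the suite and agent classes above):

    agent = RuleBasedAgent("Rule-Based Agent")
    task = EnterpriseTaskSuite().get_task("api_integration")
    print(agent.execute(task))   # e.g. {'status': 'success', 'active_users': 1250}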

    class LLMAgent(BaseAgent):
       def execute(self, task: Task) -> Dict[str, Any]:
           time.sleep(random.uniform(0.2, 0.5))
           accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
           result = {}
           for key, value in task.expected_output.items():
               if isinstance(value, (int, float)):
                   variation = value * (1 - accuracy_boost)
                   result[key] = value + random.uniform(-variation, variation)
               else:
                   result[key] = value
           return result
    
    
    class HybridAgent(BaseAgent):
       def execute(self, task: Task) -> Dict[str, Any]:
           time.sleep(random.uniform(0.15, 0.35))
           if task.complexity <= 2:
               return task.expected_output
           else:
               result = {}
               for key, value in task.expected_output.items():
                   if isinstance(value, (int, float)):
                       variation = value * 0.03
                       result[key] = value + random.uniform(-variation, variation)
                   else:
                       result[key] = value
               return result

    We develop two intelligent agent types: the LLMAgent, representing reasoning-based AI systems, and the HybridAgent, which combines rule-based precision with LLM adaptability. We design these agents to show how learning-based methods improve task accuracy, especially for complex enterprise workflows; the sketch below compares them on a single high-complexity task. Check out the Full Codes here.
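
    As an illustrative sketch (assuming the classes above; outputs are simulated and vary between runs), we can compare both agents on the same high-complexity task:

    task = EnterpriseTaskSuite().get_task("optimization")   # complexity 5/5
    for agent in (LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")):
        print(agent.name, "->", agent.execute(task))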

    class BenchmarkEngine:
       def __init__(self, task_suite: EnterpriseTaskSuite):
           self.task_suite = task_suite
           self.results: List[BenchmarkResult] = []
    
    
       def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
           print(f"n{'='*60}")
           print(f"Benchmarking Agent: {agent.identify}")
           print(f"{'='*60}")
           for job in self.task_suite.duties:
               print(f"nTask: {job.identify} (Complexity: {job.complexity}/5)")
               for i in vary(iterations):
                   outcome = self._execute_task(agent, job, i+1)
                   self.outcomes.append(outcome)
                   standing = "✓ PASS" if outcome.success else "✗ FAIL"
                   print(f"  Run {i+1}: {standing} | Time: {outcome.execution_time:.3f}s | Accuracy: {outcome.accuracy:.2%}")

    Here, we build the core of our benchmarking engine, which manages agent evaluation across the defined task suite. We implement methods to run each agent multiple times per task, log the outcomes, and measure key parameters such as execution time and accuracy, creating a systematic and repeatable benchmarking loop (see the single-agent sketch below). Check out the Full Codes here.
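
    A minimal single-agent driver sketch (the full multi-agent run appears in the __main__ block later; the iteration count here is arbitrary):

    engine = BenchmarkEngine(EnterpriseTaskSuite())
    engine.run_benchmark(HybridAgent("Hybrid Agent"), iterations=2)
    print(len(engine.results))   # 8 tasks x 2 iterations = 16 results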

       def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
           start_time = time.time()
           try:
               output = agent.execute(task)
               execution_time = time.time() - start_time
               accuracy = self._calculate_accuracy(output, task.expected_output)
               success = accuracy >= 0.85
               return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                      execution_time=execution_time, accuracy=accuracy)
           except Exception as e:
               execution_time = time.time() - start_time
               return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                      execution_time=execution_time, accuracy=0.0, error_message=str(e))
    
    
       def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
           if not output:
               return 0.0
           scores = []
           for key, expected_val in expected.items():
               if key not in output:
                   scores.append(0.0)
                   continue
               actual_val = output[key]
               if isinstance(expected_val, bool):
                   scores.append(1.0 if actual_val == expected_val else 0.0)
               elif isinstance(expected_val, (int, float)):
                   diff = abs(actual_val - expected_val)
                   tolerance = abs(expected_val * 0.1)
                   score = max(0, 1 - (diff / (tolerance + 1e-9)))
                   scores.append(score)
               else:
                   scores.append(1.0 if actual_val == expected_val else 0.0)
           return np.mean(scores) if scores else 0.0

    We define the task execution logic and the accuracy computation. We measure each agent's performance by comparing its outputs against the expected results using a scoring mechanism, which keeps the benchmarking process quantitative and fair and shows how closely agents align with enterprise expectations; a worked example of the scoring rule follows. Check out the Full Codes here.
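
    A short worked example of the scoring rule, using hypothetical values: a numeric field earns credit proportional to how far it falls within a 10% tolerance of the expected value, while booleans are all-or-nothing.

    engine = BenchmarkEngine(EnterpriseTaskSuite())
    expected = {"revenue": 100, "report_generated": True}
    output = {"revenue": 95, "report_generated": True}
    # |95 - 100| = 5 against a tolerance of 10 -> score 0.5; the boolean matches -> 1.0
    print(engine._calculate_accuracy(output, expected))   # mean of [0.5, 1.0] ~ 0.75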

       def generate_report(self):
           df = pd.DataFrame([asdict(r) for r in self.results])
           print(f"\n{'='*60}")
           print("BENCHMARK REPORT")
           print(f"{'='*60}\n")
           for agent_name in df['agent_name'].unique():
               agent_df = df[df['agent_name'] == agent_name]
               print(f"{agent_name}:")
               print(f"  Success Rate: {agent_df['success'].mean():.1%}")
               print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
               print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}\n")
           return df
    
    
       def visualize_results(self, df: pd.DataFrame):
           fig, axes = plt.subplots(2, 2, figsize=(14, 10))
           fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight="bold")
           success_rate = df.groupby('agent_name')['success'].mean()
           axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
           axes[0, 0].set_title('Success Rate by Agent', fontweight="bold")
           axes[0, 0].set_ylabel('Success Rate')
           axes[0, 0].set_ylim(0, 1.1)
           for i, v in enumerate(success_rate.values):
               axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha="center", fontweight="bold")
           time_data = df.groupby('agent_name')['execution_time'].mean()
           axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
           axes[0, 1].set_title('Average Execution Time', fontweight="bold")
           axes[0, 1].set_ylabel('Time (seconds)')
           for i, v in enumerate(time_data.values):
               axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha="center", fontweight="bold")
           df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
           axes[1, 0].set_title('Accuracy Distribution', fontweight="bold")
           axes[1, 0].set_xlabel('Agent')
           axes[1, 0].set_ylabel('Accuracy')
           plt.sca(axes[1, 0])
           plt.xticks(rotation=15)
           task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
           df['complexity'] = df['task_id'].map(task_complexity)
           complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack()
           complexity_perf.plot(kind='line', ax=axes[1, 1], marker="o", linewidth=2)
           axes[1, 1].set_title('Accuracy by Task Complexity', fontweight="bold")
           axes[1, 1].set_xlabel('Task Complexity')
           axes[1, 1].set_ylabel('Accuracy')
           axes[1, 1].legend(title="Agent", loc="best")
           axes[1, 1].grid(True, alpha=0.3)
           plt.tight_layout()
           plt.show()
    
    
    if __name__ == "__main__":
       print("Enterprise Software program Benchmarking for Agentic Brokers")
       print("="*60)
       task_suite = EnterpriseTaskSuite()
       benchmark = BenchmarkEngine(task_suite)
       agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
       for agent in agents:
           benchmark.run_benchmark(agent, iterations=3)
       results_df = benchmark.generate_report()
       benchmark.visualize_results(results_df)
       results_df.to_csv('agent_benchmark_results.csv', index=False)
       print("nResults exported to: agent_benchmark_results.csv")

    We generate detailed reports and create visual analytics for performance comparison. We analyze metrics such as success rate, execution time, and accuracy across agents and task complexities. Finally, we export the results to a CSV file, completing a full enterprise-grade evaluation workflow; the exported file can be reloaded later, as sketched below.
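
    Once the run completes, a brief follow-up sketch (assuming the file name used above) reloads the exported CSV for further analysis outside the benchmark session:

    import pandas as pd

    df = pd.read_csv('agent_benchmark_results.csv')
    print(df.groupby('agent_name')[['success', 'accuracy', 'execution_time']].mean())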

    In conclusion, we implemented a robust, extensible benchmarking system that lets us measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures excel at different levels of task complexity and how visual analytics highlight performance trends. This process enables us to evaluate existing agents and provides a strong foundation for building next-generation enterprise AI agents optimized for reliability and intelligence.




