    A Coding Implementation of a Complete Enterprise AI Benchmarking Framework to Evaluate Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks

    By Naveed Ahmad · 02/11/2025


    In this tutorial, we develop a comprehensive benchmarking framework to evaluate different types of agentic AI systems on real-world enterprise software tasks. We design a set of diverse challenges, from data transformation and API integration to workflow automation and performance optimization, and assess how various agents, including rule-based, LLM-powered, and hybrid ones, perform across these domains. By running structured benchmarks and visualizing key performance metrics, such as accuracy, execution time, and success rate, we gain a deeper understanding of each agent's strengths and trade-offs in enterprise environments. Check out the Full Codes here.

    import json
    import time
    import random
    from typing import Dict, List, Any, Callable
    from dataclasses import dataclass, asdict
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    
    @dataclass
    class Task:
       id: str
       name: str
       description: str
       category: str
       complexity: int
       expected_output: Any
    
    
    @dataclass
    class BenchmarkResult:
       task_id: str
       agent_name: str
       success: bool
       execution_time: float
       accuracy: float
       error_message: str = ""
    
    
    class EnterpriseTaskSuite:
       def __init__(self):
           self.tasks = self._create_tasks()
    
    
       def _create_tasks(self) -> List[Task]:
           return [
               Task("data_transform", "CSV Data Transformation",
                    "Transform customer data by aggregating sales", "data_processing", 3,
                    {"total_sales": 15000, "avg_order": 750}),
               Task("api_integration", "REST API Integration",
                    "Parse API response and extract key metrics", "integration", 2,
                    {"status": "success", "active_users": 1250}),
               Task("workflow_automation", "Multi-Step Workflow",
                    "Execute data validation -> processing -> reporting", "automation", 4,
                    {"validated": True, "processed": 100, "report_generated": True}),
               Task("error_handling", "Error Recovery",
                    "Handle malformed data gracefully", "reliability", 3,
                    {"errors_caught": 5, "recovery_success": True}),
               Task("optimization", "Query Optimization",
                    "Optimize database query performance", "performance", 5,
                    {"execution_time_ms": 45, "rows_scanned": 1000}),
               Task("data_validation", "Schema Validation",
                    "Validate data against business rules", "validation", 2,
                    {"valid_records": 95, "invalid_records": 5}),
               Task("reporting", "Executive Dashboard",
                    "Generate KPI summary report", "analytics", 3,
                    {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
               Task("integration_test", "System Integration",
                    "Test end-to-end integration flow", "testing", 4,
                    {"all_systems_connected": True, "latency_ms": 120}),
           ]
    
    
       def get_task(self, task_id: str) -> Task:
           return next((t for t in self.tasks if t.id == task_id), None)

    We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which holds several enterprise-relevant tasks such as data transformation, reporting, and integration. This lays the foundation for consistently evaluating different types of agents across these tasks, as the short usage sketch below illustrates. Check out the Full Codes here.
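
    As a minimal usage sketch (assuming only the classes defined above), we can instantiate the suite, look up a task by its id, and inspect its fields:

    suite = EnterpriseTaskSuite()
    task = suite.get_task("data_transform")   # fetch one benchmark task by id
    print(task.name, task.complexity)         # "CSV Data Transformation", 3
    print(task.expected_output)               # {'total_sales': 15000, 'avg_order': 750}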

    class BaseAgent:
       def __init__(self, name: str):
           self.name = name
    
    
       def execute(self, task: Task) -> Dict[str, Any]:
           raise NotImplementedError
    
    
    class RuleBasedAgent(BaseAgent):
       def execute(self, task: Task) -> Dict[str, Any]:
           time.sleep(random.uniform(0.1, 0.3))
           if task.category == "data_processing":
               return {"total_sales": 15000 + random.randint(-500, 500),
                       "avg_order": 750 + random.randint(-50, 50)}
           elif task.category == "integration":
               return {"status": "success", "active_users": 1250}
           elif task.category == "automation":
               return {"validated": True, "processed": 98, "report_generated": True}
           else:
               return task.expected_output

    We introduce the base agent structure and implement the RuleBasedAgent, which mimics traditional automation logic using predefined rules. We simulate how such agents execute tasks deterministically while maintaining speed and reliability, giving us a baseline for comparison with more advanced agents; a quick standalone call is sketched below. Check out the Full Codes here.
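
    A quick sketch of calling the rule-based agent directly, outside the benchmark loop (assuming the suite and agent classes above):

    agent = RuleBasedAgent("Rule-Based Agent")
    task = EnterpriseTaskSuite().get_task("api_integration")
    print(agent.execute(task))   # e.g. {'status': 'success', 'active_users': 1250}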

    class LLMAgent(BaseAgent):
       def execute(self, task: Task) -> Dict[str, Any]:
           time.sleep(random.uniform(0.2, 0.5))
           accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
           result = {}
           for key, value in task.expected_output.items():
               if isinstance(value, (int, float)):
                   variation = value * (1 - accuracy_boost)
                   result[key] = value + random.uniform(-variation, variation)
               else:
                   result[key] = value
           return result
    
    
    class HybridAgent(BaseAgent):
       def execute(self, task: Task) -> Dict[str, Any]:
           time.sleep(random.uniform(0.15, 0.35))
           if task.complexity <= 2:
               return task.expected_output
           else:
               result = {}
               for key, value in task.expected_output.items():
                   if isinstance(value, (int, float)):
                       variation = value * 0.03
                       result[key] = value + random.uniform(-variation, variation)
                   else:
                       result[key] = value
               return result

    We develop two intelligent agent types: the LLMAgent, representing reasoning-based AI systems, and the HybridAgent, which combines rule-based precision with LLM adaptability. We design these agents to show how learning-based methods improve task accuracy, especially for complex enterprise workflows; the sketch below compares them on a single high-complexity task. Check out the Full Codes here.
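
    As an illustrative sketch (assuming the classes above; outputs are simulated and vary between runs), we can compare both agents on the same high-complexity task:

    task = EnterpriseTaskSuite().get_task("optimization")   # complexity 5/5
    for agent in (LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")):
        print(agent.name, "->", agent.execute(task))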

    class BenchmarkEngine:
       def __init__(self, task_suite: EnterpriseTaskSuite):
           self.task_suite = task_suite
           self.results: List[BenchmarkResult] = []
    
    
       def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
           print(f"n{'='*60}")
           print(f"Benchmarking Agent: {agent.identify}")
           print(f"{'='*60}")
           for job in self.task_suite.duties:
               print(f"nTask: {job.identify} (Complexity: {job.complexity}/5)")
               for i in vary(iterations):
                   outcome = self._execute_task(agent, job, i+1)
                   self.outcomes.append(outcome)
                   standing = "✓ PASS" if outcome.success else "✗ FAIL"
                   print(f"  Run {i+1}: {standing} | Time: {outcome.execution_time:.3f}s | Accuracy: {outcome.accuracy:.2%}")

    Here, we build the core of our benchmarking engine, which manages agent evaluation across the defined task suite. We implement methods to run each agent multiple times per task, log the outcomes, and measure key parameters such as execution time and accuracy, creating a systematic and repeatable benchmarking loop (see the single-agent sketch below). Check out the Full Codes here.
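
    A minimal single-agent driver sketch (the full multi-agent run appears in the __main__ block later; the iteration count here is arbitrary):

    engine = BenchmarkEngine(EnterpriseTaskSuite())
    engine.run_benchmark(HybridAgent("Hybrid Agent"), iterations=2)
    print(len(engine.results))   # 8 tasks x 2 iterations = 16 results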

       def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
           start_time = time.time()
           try:
               output = agent.execute(task)
               execution_time = time.time() - start_time
               accuracy = self._calculate_accuracy(output, task.expected_output)
               success = accuracy >= 0.85
               return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                      execution_time=execution_time, accuracy=accuracy)
           except Exception as e:
               execution_time = time.time() - start_time
               return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                      execution_time=execution_time, accuracy=0.0, error_message=str(e))
    
    
       def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
           if not output:
               return 0.0
           scores = []
           for key, expected_val in expected.items():
               if key not in output:
                   scores.append(0.0)
                   continue
               actual_val = output[key]
               if isinstance(expected_val, bool):
                   scores.append(1.0 if actual_val == expected_val else 0.0)
               elif isinstance(expected_val, (int, float)):
                   diff = abs(actual_val - expected_val)
                   tolerance = abs(expected_val * 0.1)
                   score = max(0, 1 - (diff / (tolerance + 1e-9)))
                   scores.append(score)
               else:
                   scores.append(1.0 if actual_val == expected_val else 0.0)
           return np.mean(scores) if scores else 0.0

    We define the task execution logic and the accuracy computation. We measure each agent's performance by comparing its outputs against the expected results using a scoring mechanism, which keeps the benchmarking process quantitative and fair and shows how closely agents align with enterprise expectations; a worked example of the scoring rule follows. Check out the Full Codes here.
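
    A short worked example of the scoring rule, using hypothetical values: a numeric field earns credit proportional to how far it falls within a 10% tolerance of the expected value, while booleans are all-or-nothing.

    engine = BenchmarkEngine(EnterpriseTaskSuite())
    expected = {"revenue": 100, "report_generated": True}
    output = {"revenue": 95, "report_generated": True}
    # |95 - 100| = 5 against a tolerance of 10 -> score 0.5; the boolean matches -> 1.0
    print(engine._calculate_accuracy(output, expected))   # mean of [0.5, 1.0] ~ 0.75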

       def generate_report(self):
           df = pd.DataFrame([asdict(r) for r in self.results])
           print(f"\n{'='*60}")
           print("BENCHMARK REPORT")
           print(f"{'='*60}\n")
           for agent_name in df['agent_name'].unique():
               agent_df = df[df['agent_name'] == agent_name]
               print(f"{agent_name}:")
               print(f"  Success Rate: {agent_df['success'].mean():.1%}")
               print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
               print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}\n")
           return df
    
    
       def visualize_results(self, df: pd.DataFrame):
           fig, axes = plt.subplots(2, 2, figsize=(14, 10))
           fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight="bold")
           success_rate = df.groupby('agent_name')['success'].mean()
           axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
           axes[0, 0].set_title('Success Rate by Agent', fontweight="bold")
           axes[0, 0].set_ylabel('Success Rate')
           axes[0, 0].set_ylim(0, 1.1)
           for i, v in enumerate(success_rate.values):
               axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha="center", fontweight="bold")
           time_data = df.groupby('agent_name')['execution_time'].mean()
           axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
           axes[0, 1].set_title('Average Execution Time', fontweight="bold")
           axes[0, 1].set_ylabel('Time (seconds)')
           for i, v in enumerate(time_data.values):
               axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha="center", fontweight="bold")
           df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
           axes[1, 0].set_title('Accuracy Distribution', fontweight="bold")
           axes[1, 0].set_xlabel('Agent')
           axes[1, 0].set_ylabel('Accuracy')
           plt.sca(axes[1, 0])
           plt.xticks(rotation=15)
           task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
           df['complexity'] = df['task_id'].map(task_complexity)
           complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack()
           complexity_perf.plot(kind='line', ax=axes[1, 1], marker="o", linewidth=2)
           axes[1, 1].set_title('Accuracy by Task Complexity', fontweight="bold")
           axes[1, 1].set_xlabel('Task Complexity')
           axes[1, 1].set_ylabel('Accuracy')
           axes[1, 1].legend(title="Agent", loc="best")
           axes[1, 1].grid(True, alpha=0.3)
           plt.tight_layout()
           plt.show()
    
    
    if __name__ == "__main__":
       print("Enterprise Software program Benchmarking for Agentic Brokers")
       print("="*60)
       task_suite = EnterpriseTaskSuite()
       benchmark = BenchmarkEngine(task_suite)
       agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
       for agent in agents:
           benchmark.run_benchmark(agent, iterations=3)
       results_df = benchmark.generate_report()
       benchmark.visualize_results(results_df)
       results_df.to_csv('agent_benchmark_results.csv', index=False)
       print("nResults exported to: agent_benchmark_results.csv")

    We generate detailed reports and create visual analytics for performance comparison. We analyze metrics such as success rate, execution time, and accuracy across agents and task complexities. Finally, we export the results to a CSV file, completing a full enterprise-grade evaluation workflow; the exported file can be reloaded later, as sketched below.
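
    Once the run completes, a brief follow-up sketch (assuming the file name used above) reloads the exported CSV for further analysis outside the benchmark session:

    import pandas as pd

    df = pd.read_csv('agent_benchmark_results.csv')
    print(df.groupby('agent_name')[['success', 'accuracy', 'execution_time']].mean())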

    In conclusion, we implemented a robust, extensible benchmarking system that lets us measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures excel at different levels of task complexity and how visual analytics highlight performance trends. This process enables us to evaluate existing agents and provides a strong foundation for building next-generation enterprise AI agents optimized for reliability and intelligence.




