RightNow AI Releases AutoKernel: An Open-Source Framework that Applies an Autonomous Agent Loop to GPU Kernel Optimization for Arbitrary PyTorch Models

By Naveed Ahmad | 06/04/2026 | 9 Mins Read


Writing fast GPU code is one of the most grueling specializations in machine learning engineering. Researchers at RightNow AI want to automate it entirely.

The RightNow AI research team has released AutoKernel, an open-source framework that applies an autonomous LLM agent loop to GPU kernel optimization for arbitrary PyTorch models. The pitch is simple: give it any model before you go to bed, and wake up to faster Triton kernels, no GPU expertise required.

    https://arxiv.org/pdf/2603.21331

Why GPU Kernels Are So Hard to Optimize

A GPU kernel is a function that runs in parallel across thousands of GPU cores. When you run a transformer model like LLaMA or GPT-2, the bulk of compute time is spent inside kernels for operations like matrix multiplication (matmul), softmax, layer normalization, and attention. These kernels live in libraries like cuBLAS and cuDNN, or get generated automatically by PyTorch's compilation pipeline.

The problem is that squeezing maximum performance out of these kernels requires reasoning simultaneously about arithmetic intensity, memory coalescing, register pressure, tile sizes, warp-level synchronization, and tensor core instruction selection: a combination of skills that takes years to develop. A single high-performance matmul kernel may involve 200+ lines of CUDA or Triton code with dozens of interdependent parameters. This expertise is scarce, and the manual tuning process scales poorly as model architectures evolve.

The benchmark suite KernelBench, which evaluates frontier LLMs on 250 GPU kernel problems, found that even the best models matched PyTorch baseline performance in fewer than 20% of cases using one-shot generation. AutoKernel was built directly in response to that gap.

The Loop: Edit, Benchmark, Keep or Revert

AutoKernel's core insight is that an expert kernel engineer's workflow is itself a simple loop: write a candidate, benchmark it, keep improvements, discard regressions, repeat. The framework mechanizes this loop. An LLM agent modifies a single file, kernel.py; a fixed benchmark harness verifies correctness and measures throughput; and the result determines whether the change persists. Crucially, every experiment maps to a git commit. Kept experiments advance the branch; reverted experiments are erased cleanly with git reset. The entire history is browsable with standard git tools, and experiment results are logged to a plain tab-separated results.tsv file that is dependency-free, human-readable, and trivially parseable by the agent.

Each iteration takes roughly 90 seconds: 30 seconds for correctness checking, 30 seconds for performance benchmarking via Triton's do_bench, and 30 seconds for agent reasoning and code modification. At roughly 40 experiments per hour, an overnight 10-hour run yields 300 to 400 experiments across multiple kernels.

This design draws directly from Andrej Karpathy's autoresearch project, which demonstrated that an AI agent running a keep/revert loop on LLM training code could discover 20 optimizations across 700 experiments in two days on a single GPU. AutoKernel transplants this loop to kernel code, with a different search space and a correctness-gated benchmark as the evaluation function instead of validation loss.

The agent reads a 909-line instruction document called program.md, which encodes expert knowledge into a six-tier optimization playbook. The tiers progress from block size tuning (sweeping tile dimensions through powers of two, adjusting num_warps and num_stages) through memory access patterns (coalesced loads, software prefetching, L2 swizzling), compute optimizations (TF32 accumulation, epilogue fusion), advanced techniques (split-K, persistent kernels, Triton autotune, warp specialization), architecture-specific strategies (TMA on Hopper, cp.async on Ampere, adjusted sizes for L4/RTX), and finally kernel-specific algorithms like online softmax for attention and Welford's algorithm for normalization. The instruction document is deliberately comprehensive so the agent can run for 10+ hours without getting stuck.

    https://arxiv.org/pdf/2603.21331

Profiling First, Optimizing Where It Matters

Unlike prior work that treats kernel problems in isolation, AutoKernel starts from a complete PyTorch model. It uses torch.profiler with shape recording to capture per-kernel GPU time, then ranks optimization targets using Amdahl's law, the mathematical principle that the overall speedup you can achieve is bounded by how much of the total runtime that component represents. A 1.5× speedup on a kernel consuming 60% of total runtime yields a 1.25× end-to-end gain. The same speedup on a kernel consuming 5% of runtime yields only 1.03×.
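The arithmetic behind the prioritization is the standard Amdahl formula, shown here as a minimal sketch (the function name is illustrative):

```python
def end_to_end_speedup(share, kernel_speedup):
    """Amdahl's law: overall speedup when one component, occupying
    `share` of total runtime, is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - share) + share / kernel_speedup)

# 1.5x on a kernel that is 60% of runtime: 1 / (0.4 + 0.6/1.5) = 1.25x
```

Sorting kernels by their runtime share before optimizing is what makes per-kernel wins compound into model-level gains.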

The profiler detects GPU hardware from a database of known specs covering both NVIDIA (H100, A100, L40S, L4, A10, RTX 4090/4080/3090/3080) and AMD (MI300X, MI325X, MI350X, MI355X) accelerators. For unknown GPUs, it estimates peak FP16 throughput from SM count, clock rate, and compute capability, making the system usable across a wider range of hardware than just the latest NVIDIA offerings.
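The article does not spell out the fallback formula; a plausible sketch follows, where the per-SM FMA rate is a placeholder assumption (the real value depends on compute capability and tensor-core generation):

```python
def estimate_peak_fp16_tflops(sm_count, clock_ghz, fma_per_sm_per_cycle=1024):
    """Rough peak FP16 estimate for a GPU absent from the spec database,
    from SM count and clock rate. The default FMA rate is a placeholder;
    compute capability would select the real architecture-specific value."""
    flops_per_cycle = fma_per_sm_per_cycle * 2  # one FMA = 2 FLOPs
    return sm_count * clock_ghz * 1e9 * flops_per_cycle / 1e12

# With A100-like inputs (108 SMs, 1.41 GHz) this lands near 312 TFLOPS.
```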

The orchestrator (orchestrate.py) transitions from one kernel to the next when any of four conditions is met: five consecutive reverts, 90% of GPU peak utilization reached, a two-hour elapsed-time budget, or a 2× speedup already achieved on that kernel. This prevents the agent from spending excessive time on kernels with diminishing returns while higher-impact targets wait.
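The four stop conditions amount to a single predicate; this sketch uses the thresholds stated above (the function name and signature are illustrative, not orchestrate.py's actual API):

```python
def should_move_on(consecutive_reverts, peak_utilization, elapsed_hours, speedup):
    """Per-kernel stop conditions: any one of the four triggers a
    transition to the next profiling target."""
    return (consecutive_reverts >= 5      # search has stalled
            or peak_utilization >= 0.90   # near the hardware roofline
            or elapsed_hours >= 2.0       # time budget exhausted
            or speedup >= 2.0)            # good enough; diminishing returns
```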

Five-Stage Correctness Harness

Performance without correctness is useless, and AutoKernel is particularly thorough on this front. Every candidate kernel passes through five validation stages before any speedup is recorded. Stage 1 runs a smoke test on a small input to catch compilation errors and shape mismatches in under a second. Stage 2 sweeps across 8 to 10 input configurations and three data types (FP16, BF16, and FP32) to catch size-dependent bugs like boundary handling and tile-remainder logic. Stage 3 tests numerical stability under adversarial inputs: for softmax, rows of large identical values; for matmul, extreme dynamic range; for normalization, near-zero variance. Stage 4 verifies determinism by running the same input three times and requiring bitwise-identical outputs, which catches race conditions in parallel reductions and non-deterministic atomics. Stage 5 tests non-power-of-two dimensions like 1023, 4097, and 1537 to expose masking bugs and tile-remainder errors.
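The stage-4 determinism gate is simple to sketch; this NumPy stand-in (illustrative names, not the harness's real code) makes "bitwise-identical" literal by comparing raw bytes:

```python
import numpy as np

def passes_determinism_check(kernel_fn, x, runs=3):
    """Stage-4 style gate: run the kernel on the same input several times
    and require bitwise-identical outputs, which catches race conditions
    in parallel reductions and non-deterministic atomics."""
    outputs = [np.asarray(kernel_fn(x)) for _ in range(runs)]
    return all(o.tobytes() == outputs[0].tobytes() for o in outputs[1:])

# Stage-5 style non-power-of-two sizes that expose masking bugs
EDGE_SIZES = [1023, 1537, 4097]
```

Byte comparison is deliberately stricter than `allclose`: two runs that differ only in reduction order fail it even when both are numerically "correct".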

Tolerances are dtype-specific: FP16 uses atol = 10⁻², BF16 uses 2 × 10⁻², and FP32 uses 10⁻⁴. In the paper's full evaluation across 34 configurations on an NVIDIA H100, all 34 passed correctness with zero failures across eager, compiled, and custom kernel outputs.

Dual Backend: Triton and CUDA C++

AutoKernel supports both Triton and CUDA C++ backends within the same framework. Triton is a Python-like domain-specific language that JIT-compiles in 1 to 5 seconds, making it ideal for rapid iteration; the agent can modify block sizes, warp counts, pipeline stages, accumulator precision, and loop structure. Triton routinely reaches 80 to 95% of cuBLAS throughput for matmul. CUDA C++ is included for cases requiring direct access to warp-level primitives, WMMA tensor core instructions (using 16×16×16 fragments), vectorized loads via float4 and half2, bank-conflict-free shared memory layouts, and double buffering. Both backends expose the same kernel_fn() interface, so the benchmark infrastructure runs identically regardless of backend.

The system covers nine kernel types spanning the dominant operations in modern transformer architectures: matmul, flash_attention, fused_mlp, softmax, layernorm, rmsnorm, cross_entropy, rotary_embedding, and reduce. Each has a PyTorch reference implementation in reference.py serving as the correctness oracle, and the benchmark computes throughput in TFLOPS or GB/s alongside roofline utilization against the detected GPU peak.
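To make the oracle idea concrete, here is a NumPy stand-in for a reference implementation of rmsnorm (the actual reference.py oracle is the PyTorch version; this sketch just shows what the candidate kernel is compared against):

```python
import numpy as np

def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference rmsnorm: normalize each row by its root-mean-square
    over the last axis, then apply a learned per-feature scale."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight
```

A candidate Triton kernel would be accepted only if its output matches this oracle within the dtype-specific tolerances.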

Benchmark Results on H100

Measured on an NVIDIA H100 80GB HBM3 GPU (132 SMs, compute capability 9.0, CUDA 12.8) against PyTorch eager and torch.compile with max-autotune, the results for memory-bound kernels are significant. RMSNorm achieves 5.29× over eager and 2.83× over torch.compile at the largest tested size, reaching 2,788 GB/s, or 83% of the H100's 3,352 GB/s peak bandwidth. Softmax reaches 2,800 GB/s with a 2.82× speedup over eager and 3.44× over torch.compile. Cross-entropy achieves 2.21× over eager and 2.94× over torch.compile, reaching 2,070 GB/s. The gains on these kernels come from fusing multi-operation ATen decompositions into single-pass Triton kernels that minimize HBM (High Bandwidth Memory) traffic.
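For memory-bound kernels, the headline metric is simply achieved bandwidth over peak bandwidth; a one-line sketch reproduces the 83% figure quoted above:

```python
def roofline_utilization(achieved_gbs, peak_gbs):
    """Fraction of peak HBM bandwidth a memory-bound kernel sustains."""
    return achieved_gbs / peak_gbs

# RMSNorm at 2,788 GB/s on an H100 (3,352 GB/s peak) is ~83% of roofline
```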

AutoKernel outperforms torch.compile on 12 of the 16 representative configurations benchmarked in the paper, despite torch.compile with max-autotune running its own Triton autotuning. TorchInductor's generic fusion and autotuning does not always find the specialized tiling and reduction strategies that kernel-specific implementations exploit.

Matmul is notably harder; PyTorch's cuBLAS backend is extensively tuned per GPU architecture. The Triton starter reaches 278 TFLOPS, well below cuBLAS. However, at the 2048³ size, AutoKernel beats torch.compile by 1.55×, demonstrating that TorchInductor's matmul autotuning is not always optimal either. Closing the cuBLAS gap remains the primary goal for continued agent iteration.

In community deployment, an AutoKernel-optimized kernel took first place on the vectorsum_v2 B200 leaderboard with a latency of 44.086 µs, outperforming the second-place entry at 44.249 µs and third place at 46.553 µs. A community user also reported that a single AutoKernel prompt, requiring roughly three minutes of agent interaction, produced a Triton FP4 matrix multiplication kernel that outperforms CUTLASS by 1.63× to 2.15× across multiple shapes on H100. CUTLASS is hand-optimized C++ template code specifically designed for NVIDIA tensor cores, which makes this result particularly notable.

    Key Takeaways

    • AutoKernel turns weeks of expert GPU tuning into an overnight autonomous process. By mechanizing the write-benchmark-keep/revert loop that expert kernel engineers already follow, the system runs 300 to 400 experiments per overnight session on a single GPU without any human intervention.
    • Correctness is non-negotiable before any speedup is recorded. Every candidate kernel must pass a five-stage harness covering smoke tests, shape sweeps across 10+ configurations, numerical stability under adversarial inputs, determinism verification, and non-power-of-two edge cases, eliminating the risk of the agent "optimizing" its way to incorrect outputs.
    • Memory-bound kernels see the largest gains over both PyTorch eager and torch.compile. On an NVIDIA H100, AutoKernel's Triton kernels achieve 5.29× over eager on RMSNorm, 2.82× on softmax, and 2.21× on cross-entropy, with the gains coming from fusing multi-operation ATen decompositions into single-pass kernels that minimize HBM traffic.
    • Amdahl's law drives where the agent spends its time. Rather than optimizing kernels in isolation, AutoKernel profiles the complete PyTorch model and allocates effort proportionally to each kernel's share of total GPU runtime, ensuring that improvements compound at the model level, not just the kernel level.

Check out the Paper and Repo.




