
    A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG

    By Naveed Ahmad · 19/04/2026 · 15 Mins Read


    In this tutorial, we demonstrate how to run the Bonsai 1-bit large language model efficiently using GPU acceleration and PrismML’s optimized GGUF deployment stack. We set up the environment, install the required dependencies, download the prebuilt llama.cpp binaries, and load the Bonsai-1.7B model for fast inference on CUDA. As we progress, we examine how 1-bit quantization works under the hood, why the Q1_0_g128 format is so memory-efficient, and how this makes Bonsai practical for lightweight yet capable language model deployment. We also test core inference, benchmarking, multi-turn chat, structured JSON generation, code generation, OpenAI-compatible server mode, and a small retrieval-augmented generation workflow, giving us a complete, hands-on view of how Bonsai operates in real-world use.
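    Before diving in, the memory arithmetic behind the format is worth a quick sanity check. The sketch below follows the scheme described later in the tutorial (one sign bit per weight plus one FP16 scale shared per group of 128); it is back-of-the-envelope arithmetic, not output from the actual packer, and it ignores tensors kept at higher precision.

```python
# Back-of-the-envelope memory estimate for a 1-bit, group-128 GGUF model.
# Scheme (per the tutorial): 1 sign bit per weight + one FP16 (16-bit)
# scale factor shared by every group of 128 weights.

def effective_bits_per_weight(group_size: int = 128, scale_bits: int = 16) -> float:
    """One sign bit plus the amortized share of the group's scale factor."""
    return 1 + scale_bits / group_size

def model_size_gb(n_params: float, bpw: float) -> float:
    """Approximate weight storage in GB (embeddings etc. may stay larger)."""
    return n_params * bpw / 8 / 1e9

bpw = effective_bits_per_weight()                      # 1.125 bits per weight
print(f"{bpw} bpw")
print(f"~{model_size_gb(1.7e9, bpw):.2f} GB for 1.7B params at 1.125 bpw")
print(f"~{model_size_gb(1.7e9, 16):.2f} GB for the same model at FP16")
```

    The ratio between the two printed sizes is the ~14× compression the article quotes for Bonsai-1.7B.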

    import os, sys, subprocess, time, json, urllib.request, tarfile, textwrap
    
    
    try:
        import google.colab
        IN_COLAB = True
    except ImportError:
        IN_COLAB = False
    
    
    def section(title):
        bar = "═" * 60
        print(f"\n{bar}\n  {title}\n{bar}")
    
    
    section("1 · Environment & GPU Check")
    
    
    def run(cmd, capture=False, check=True, **kw):
        return subprocess.run(
            cmd, shell=True, capture_output=capture,
            text=True, check=check, **kw
        )
    
    
    gpu_info = run("nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader",
                   capture=True, check=False)
    if gpu_info.returncode == 0:
        print(" GPU detected:", gpu_info.stdout.strip())
    else:
        print("  No GPU found — inference will run on CPU (much slower).")
    
    
    cuda_check = run("nvcc --version", capture=True, check=False)
    if cuda_check.returncode == 0:
        for line in cuda_check.stdout.splitlines():
            if "release" in line:
                print("   CUDA:", line.strip())
                break
    
    
    print(f"   Python {sys.version.split()[0]}  |  Platform: Linux (Colab)")
    
    
    section("2 · Installing Python Dependencies")
    
    
    run("pip install -q huggingface_hub requests tqdm openai")
    print(" huggingface_hub, requests, tqdm, openai installed")
    
    
    from huggingface_hub import hf_hub_download

    We begin by importing the core Python modules that we need for system operations, downloads, timing, and JSON handling. We check whether we are running inside Google Colab, define a reusable section printer, and create a helper function to run shell commands cleanly from Python. We then verify the GPU and CUDA environment, print the Python runtime details, install the required Python dependencies, and prepare the Hugging Face download utility for the next stages.

    section("3 · Downloading PrismML llama.cpp Prebuilt Binaries")
    
    
    RELEASE_TAG = "prism-b8194-1179bfc"
    BASE_URL    = f"https://github.com/PrismML-Eng/llama.cpp/releases/download/{RELEASE_TAG}"
    BIN_DIR     = "/content/bonsai_bin"
    os.makedirs(BIN_DIR, exist_ok=True)
    
    
    def detect_cuda_build():
        r = run("nvcc --version", capture=True, check=False)
        for line in r.stdout.splitlines():
            if "release" in line:
                try:
                    ver = float(line.split("release")[-1].strip().split(",")[0].strip())
                    if ver >= 13.0: return "13.1"
                    if ver >= 12.6: return "12.8"
                    return "12.4"
                except ValueError:
                    pass
        return "12.4"
    
    
    cuda_build = detect_cuda_build()
    print(f"   Detected CUDA build slot: {cuda_build}")
    
    
    TAR_NAME = f"llama-{RELEASE_TAG}-bin-linux-cuda-{cuda_build}-x64.tar.gz"
    TAR_URL  = f"{BASE_URL}/{TAR_NAME}"
    tar_path = f"/tmp/{TAR_NAME}"
    
    
    if not os.path.exists(f"{BIN_DIR}/llama-cli"):
        print(f"   Downloading: {TAR_URL}")
        urllib.request.urlretrieve(TAR_URL, tar_path)
        print("   Extracting …")
        with tarfile.open(tar_path, "r:gz") as t:
            t.extractall(BIN_DIR)
        for fname in os.listdir(BIN_DIR):
            fp = os.path.join(BIN_DIR, fname)
            if os.path.isfile(fp):
                os.chmod(fp, 0o755)
        print(f" Binaries extracted to {BIN_DIR}")
        bins = sorted(f for f in os.listdir(BIN_DIR) if os.path.isfile(os.path.join(BIN_DIR, f)))
        print("   Available:", ", ".join(bins))
    else:
        print(f" Binaries already present at {BIN_DIR}")
    
    
    LLAMA_CLI    = f"{BIN_DIR}/llama-cli"
    LLAMA_SERVER = f"{BIN_DIR}/llama-server"
    
    
    test = run(f"{LLAMA_CLI} --version", capture=True, check=False)
    if test.returncode == 0:
        print(f"   llama-cli version: {test.stdout.strip()[:80]}")
    else:
        print(f"  llama-cli test failed: {test.stderr.strip()[:200]}")
    
    
    section("4 · Downloading Bonsai-1.7B GGUF Model")
    
    
    MODEL_REPO    = "prism-ml/Bonsai-1.7B-gguf"
    MODEL_DIR     = "/content/bonsai_models"
    GGUF_FILENAME = "Bonsai-1.7B.gguf"
    os.makedirs(MODEL_DIR, exist_ok=True)
    MODEL_PATH = os.path.join(MODEL_DIR, GGUF_FILENAME)
    
    
    if not os.path.exists(MODEL_PATH):
        print(f"   Downloading {GGUF_FILENAME} (~248 MB) from HuggingFace …")
        MODEL_PATH = hf_hub_download(
            repo_id=MODEL_REPO,
            filename=GGUF_FILENAME,
            local_dir=MODEL_DIR,
        )
        print(f" Model saved to: {MODEL_PATH}")
    else:
        print(f" Model already cached: {MODEL_PATH}")
    
    
    size_mb = os.path.getsize(MODEL_PATH) / 1e6
    print(f"   File size on disk: {size_mb:.1f} MB")
    
    
    section("5 · Core Inference Helpers")
    
    
    DEFAULT_GEN_ARGS = dict(
        temp=0.5,
        top_p=0.85,
        top_k=20,
        repeat_penalty=1.0,
        n_predict=256,
        n_gpu_layers=99,
        ctx_size=4096,
    )
    
    
    def build_llama_cmd(prompt, system_prompt="You are a helpful assistant.", **overrides):
        args = {**DEFAULT_GEN_ARGS, **overrides}
        formatted = (
            f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
            f"<|im_start|>user\n{prompt}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )
        safe_prompt = formatted.replace('"', '\\"')
        return (
            f'{LLAMA_CLI} -m "{MODEL_PATH}"'
            f' -p "{safe_prompt}"'
            f' -n {args["n_predict"]}'
            f' --temp {args["temp"]}'
            f' --top-p {args["top_p"]}'
            f' --top-k {args["top_k"]}'
            f' --repeat-penalty {args["repeat_penalty"]}'
            f' -ngl {args["n_gpu_layers"]}'
            f' -c {args["ctx_size"]}'
            f' --no-display-prompt'
            f' -e'
        )
    
    
    def infer(prompt, system_prompt="You are a helpful assistant.", verbose=True, **overrides):
        cmd = build_llama_cmd(prompt, system_prompt, **overrides)
        t0 = time.time()
        result = run(cmd, capture=True, check=False)
        elapsed = time.time() - t0
        output = result.stdout.strip()
        if verbose:
            print(f"\n{'─'*50}")
            print(f"Prompt : {prompt[:100]}{'…' if len(prompt) > 100 else ''}")
            print(f"{'─'*50}")
            print(output)
            print(f"{'─'*50}")
            print(f"  {elapsed:.2f}s  |  ~{len(output.split())} words")
        return output, elapsed
    
    
    print(" Inference helpers ready.")
    
    
    section("6 · Basic Inference — Hello, Bonsai!")
    
    
    infer("What makes 1-bit language models special compared to standard models?")

    We download and prepare the PrismML prebuilt llama.cpp CUDA binaries that power local inference for the Bonsai model. We detect the available CUDA version, choose the matching binary build, extract the downloaded archive, make the files executable, and verify that the llama-cli binary works correctly. After that, we download the Bonsai-1.7B GGUF model from Hugging Face, set up the model path, define the default generation settings, and build the core helper functions that format prompts and run inference.
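    The prompt template these helpers interpolate into the `llama-cli -p` argument follows the ChatML convention (`<|im_start|>role … <|im_end|>`). A minimal stand-alone rendering of that template, assuming the same role markers `build_llama_cmd` uses:

```python
# Minimal ChatML-style prompt renderer, mirroring the template string that
# build_llama_cmd() interpolates into the llama-cli -p argument.

def render_chatml(messages):
    """messages: list of (role, content) tuples; returns the raw prompt string."""
    out = []
    for role, content in messages:
        out.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    out.append("<|im_start|>assistant\n")  # leave the assistant turn open
    return "".join(out)

prompt = render_chatml([
    ("system", "You are a helpful assistant."),
    ("user", "What is a 1-bit LLM?"),
])
print(prompt)
```

    Leaving the final assistant turn open is what cues the model to generate its reply rather than continue the user's text.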

    section("7 · Q1_0_g128 Quantization — What's Happening Under the Hood")
    
    
    print(textwrap.dedent("""
    ╔══════════════════════════════════════════════════════════════╗
    ║           Bonsai Q1_0_g128 Weight Representation             ║
    ╠══════════════════════════════════════════════════════════════╣
    ║  Each weight = 1 bit:  0  →  −scale                          ║
    ║                        1  →  +scale                          ║
    ║  Every 128 weights share one FP16 scale factor.              ║
    ║                                                              ║
    ║  Effective bits per weight:                                  ║
    ║    1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw     ║
    ║                                                              ║
    ║  Memory comparison for Bonsai-1.7B:                          ║
    ║    FP16:            3.44 GB  (1.0×  baseline)                ║
    ║    Q1_0_g128:       0.24 GB  (14.2× smaller!)                ║
    ║    MLX 1-bit g128:  0.27 GB  (12.8× smaller)                 ║
    ╚══════════════════════════════════════════════════════════════╝
    """))
    
    
    print(" Python demo of Q1_0_g128 quantization logic:\n")
    import random
    random.seed(42)
    GROUP_SIZE   = 128
    weights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]
    scale        = max(abs(w) for w in weights_fp16)
    quantized    = [1 if w >= 0 else 0 for w in weights_fp16]
    dequantized  = [scale if b == 1 else -scale for b in quantized]
    mse          = sum((a - b) ** 2 for a, b in zip(weights_fp16, dequantized)) / GROUP_SIZE
    
    
    print(f"  FP16 weights (first 8): {[f'{w:.4f}' for w in weights_fp16[:8]]}")
    print(f"  1-bit repr  (first 8): {quantized[:8]}")
    print(f"  Shared scale:          {scale:.4f}")
    print(f"  Dequantized (first 8): {[f'{w:.4f}' for w in dequantized[:8]]}")
    print(f"  MSE of reconstruction: {mse:.6f}")
    memory_fp16 = GROUP_SIZE * 2
    memory_1bit = GROUP_SIZE / 8 + 2
    print(f"\n  Memory: FP16={memory_fp16}B  vs  Q1_0_g128={memory_1bit:.1f}B  "
          f"({memory_fp16/memory_1bit:.1f}× reduction)")
    
    
    section("8 · Performance Benchmark — Tokens per Second")
    
    
    def benchmark(prompt, n_tokens=128, n_runs=3, **kw):
        timings = []
        for i in range(n_runs):
            print(f"   Run {i+1}/{n_runs} …", end=" ", flush=True)
            _, elapsed = infer(prompt, verbose=False, n_predict=n_tokens, **kw)
            tps = n_tokens / elapsed
            timings.append(tps)
            print(f"{tps:.1f} tok/s")
        avg = sum(timings) / len(timings)
        print(f"\n   Average: {avg:.1f} tok/s  (over {n_runs} runs, {n_tokens} tokens each)")
        return avg
    
    
    print(" Benchmarking Bonsai-1.7B on your GPU …")
    tps = benchmark(
        "Explain the concept of neural network backpropagation step by step.",
        n_tokens=128, n_runs=3,
    )
    
    
    print("\n  Published reference throughputs (from whitepaper):")
    print("  ┌──────────────────────┬─────────┬──────────────┐")
    print("  │ Platform             │ Backend │ TG128 tok/s  │")
    print("  ├──────────────────────┼─────────┼──────────────┤")
    print("  │ RTX 4090             │ CUDA    │     674      │")
    print("  │ M4 Pro 48 GB         │ Metal   │     250      │")
    print(f"  │ Your GPU (measured)  │ CUDA    │  {tps:>7.1f}    │")
    print("  └──────────────────────┴─────────┴──────────────┘")
    
    
    section("9 · Multi-Turn Chat with Context Accumulation")
    
    
    def chat(user_msg, system="You are a helpful assistant.", history=None, **kw):
        if history is None:
            history = []
        history.append(("user", user_msg))
        full = f"<|im_start|>system\n{system}<|im_end|>\n"
        for role, msg in history:
            full += f"<|im_start|>{role}\n{msg}<|im_end|>\n"
        full += "<|im_start|>assistant\n"
        safe = full.replace('"', '\\"').replace('\n', '\\n')
        cmd = (
            f'{LLAMA_CLI} -m "{MODEL_PATH}"'
            f' -p "{safe}" -e'
            f' -n 200 --temp 0.5 --top-p 0.85 --top-k 20'
            f' -ngl 99 -c 4096 --no-display-prompt'
        )
        result = run(cmd, capture=True, check=False)
        reply = result.stdout.strip()
        history.append(("assistant", reply))
        return reply, history
    
    
    print("  Starting a 3-turn conversation about 1-bit models …\n")
    history = []
    turns = [
        "What is a 1-bit language model?",
        "What are the main trade-offs compared to 4-bit or 8-bit quantization?",
        "How does Bonsai specifically address those trade-offs?",
    ]
    for i, msg in enumerate(turns, 1):
        print(f" Turn {i}: {msg}")
        reply, history = chat(msg, history=history)
        print(f" Bonsai: {reply}\n")
        time.sleep(0.5)
    
    
    section("10 · Sampling Parameter Exploration")
    
    
    creative_prompt = "Write a one-sentence description of a futuristic city powered entirely by 1-bit AI."
    configs = [
        ("Precise / Focused",  dict(temp=0.1, top_k=10,  top_p=0.70)),
        ("Balanced (default)", dict(temp=0.5, top_k=20,  top_p=0.85)),
        ("Creative / Varied",  dict(temp=0.9, top_k=50,  top_p=0.95)),
        ("High entropy",       dict(temp=1.2, top_k=100, top_p=0.98)),
    ]
    
    
    print(f'Prompt: "{creative_prompt}"\n')
    for label, params in configs:
        out, _ = infer(creative_prompt, verbose=False, n_predict=80, **params)
        print(f"  [{label}]")
        print(f"    temp={params['temp']}, top_k={params['top_k']}, top_p={params['top_p']}")
        print(f"    → {out[:200]}\n")

    We move from setup into experimentation by first running a basic inference call to confirm that the model is functioning properly. We then explain the Q1_0_g128 quantization format through a visual text block and a small Python demo that shows how 1-bit signs and shared scales reconstruct weights with strong memory savings. After that, we benchmark token generation speed, simulate a multi-turn conversation with accumulated history, and compare how different sampling settings affect the style and variety of the model’s outputs.
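    To see why those settings change the output, here is a small, self-contained sketch of how temperature, top-k, and top-p reshape a toy next-token distribution. This is standard sampler math, not code from the Bonsai stack, and the four-token vocabulary is invented for illustration.

```python
import math

# Toy next-token logits: how temperature, top-k, and top-p reshape them.
logits = {"the": 2.0, "a": 1.5, "bonsai": 0.5, "zebra": -1.0}

def sample_dist(logits, temp=1.0, top_k=None, top_p=None):
    # Temperature divides logits before softmax: low temp sharpens, high flattens.
    scaled = {t: l / temp for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()), key=lambda x: -x[1])
    if top_k is not None:                 # keep only the k most likely tokens
        probs = probs[:top_k]
    if top_p is not None:                 # keep the smallest nucleus with mass >= top_p
        kept, mass = [], 0.0
        for t, p in probs:
            kept.append((t, p))
            mass += p
            if mass >= top_p:
                break
        probs = kept
    z = sum(p for _, p in probs)          # renormalize the surviving tokens
    return {t: p / z for t, p in probs}

print(sample_dist(logits, temp=0.1))              # near-deterministic: "the" dominates
print(sample_dist(logits, temp=1.2, top_p=0.95))  # flatter, more candidates survive
```

    At temp=0.1 nearly all probability mass collapses onto the top token, which is why the "Precise / Focused" config produces repeatable answers, while the "High entropy" config leaves several tokens in play at every step.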

    section("11 · Context Window — Long-Doc Summarisation")
    
    
    long_doc = (
        "The transformer architecture, introduced in 'Attention Is All You Need' (Vaswani et al., 2017), "
        "replaced recurrent and convolutional networks with self-attention mechanisms. The key insight was "
        "that attention weights could be computed in parallel across the whole sequence, unlike RNNs, which "
        "process tokens one at a time. The original model stacked identical layers with multi-head "
        "self-attention and feed-forward sub-layers. Positional encodings inject sequence-order information "
        "since attention is permutation-invariant. Subsequent work removed the encoder (GPT family) or "
        "decoder (BERT family) to specialise for generation or understanding tasks respectively. Scaling "
        "laws (Kaplan et al., 2020) showed that loss decreases predictably with more compute, parameters, "
        "and data. This motivated the emergence of large language models, but these models became "
        "prohibitive for edge and on-device deployment. Quantisation research sought to reduce the "
        "bit-width of weights from FP16/BF16 down to INT8, INT4, and eventually binary (1-bit). BitNet "
        "(Wang et al., 2023) was among the first to demonstrate that training with 1-bit weights from "
        "scratch could approach the quality of higher-precision models at scale. Bonsai (Prism ML, 2026) "
        "extended this to an end-to-end 1-bit deployment pipeline across CUDA, Metal, and mobile runtimes, "
        "achieving 14× memory reduction with the Q1_0_g128 GGUF format."
    )
    
    
    summarize_prompt = f"Summarize the following technical text in 3 bullet points:\n\n{long_doc}"
    print(f"   Input length: ~{len(long_doc.split())} words")
    out, elapsed = infer(summarize_prompt, n_predict=200, ctx_size=2048, verbose=False)
    print(" Summary:")
    for line in out.splitlines():
        print(f"   {line}")
    print(f"\n  {elapsed:.2f}s")
    
    
    section("12 · Structured Output — Forcing JSON Responses")
    
    
    json_system = (
        "You are a JSON API. Reply ONLY with valid JSON, no markdown, no explanation. "
        "Never include ```json fences."
    )
    json_prompt = (
        "Return a JSON object with keys: model_name, parameter_count, "
        "bits_per_weight, memory_gb, top_use_cases (array of 3 strings). "
        "Fill in values for Bonsai-1.7B."
    )
    
    
    raw, _ = infer(json_prompt, system_prompt=json_system, temp=0.1, n_predict=300, verbose=False)
    print("Raw model output:")
    print(raw)
    print()
    
    
    try:
        clean = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        data  = json.loads(clean)
        print(" Parsed JSON:")
        for k, v in data.items():
            print(f"   {k}: {v}")
    except json.JSONDecodeError as e:
        print(f"  JSON parse error: {e} — raw output shown above.")
    
    
    section("13 · Code Generation")
    
    
    code_prompt = (
        "Write a Python function called `quantize_weights` that takes a list of float "
        "weights and a group_size, applies 1-bit Q1_0_g128-style quantization (sign bit + "
        "per-group FP16 scale), and returns the quantized bits and scale list. "
        "Include a docstring and a short usage example."
    )
    code_system = "You are an expert Python programmer. Return clean, well-commented Python code only."
    
    
    code_out, _ = infer(code_prompt, system_prompt=code_system,
                        temp=0.2, n_predict=400, verbose=False)
    print(code_out)
    
    
    exec_ns = {}
    try:
        exec(code_out, exec_ns)
        if "quantize_weights" in exec_ns:
            import random as _r
            test_w = [_r.gauss(0, 0.1) for _ in range(256)]
            bits, scales = exec_ns["quantize_weights"](test_w, 128)
            print(f"\n Function executed successfully!")
            print(f"   Input  : {len(test_w)} weights")
            print(f"   Output : {len(bits)} bits, {len(scales)} scale values")
    except Exception as e:
        print(f"\n  Exec note: {e} (model output may need minor tweaks)")

    We test the model on longer-context and structured tasks to better understand its practical capabilities. We feed a technical passage into a summarization prompt, ask the model to return strict JSON output, and then push it further by generating Python code that we immediately execute in the notebook. This helps us evaluate not only whether Bonsai can answer questions, but also whether it can follow formatting rules, generate usable structured responses, and produce code that works in real execution.
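    Small models drift out of strict JSON often enough that production wrappers usually validate and retry. Below is a hedged sketch of that pattern around an `infer`-style helper; the function name `infer_json`, the retry count, and the required-keys check are our own additions, not part of the original notebook, and the stubbed model is for illustration only.

```python
import json

def infer_json(prompt, required_keys=(), retries=2, infer_fn=None):
    """Call infer_fn(prompt) until the reply parses as JSON with the required keys.

    infer_fn takes a prompt string and returns the model's raw text reply.
    Returns the parsed dict, or None if every attempt fails.
    """
    for _ in range(retries + 1):
        raw = infer_fn(prompt)
        # Strip accidental markdown fences before parsing.
        text = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        try:
            data = json.loads(text)
            if all(k in data for k in required_keys):
                return data
        except json.JSONDecodeError:
            pass
        prompt = prompt + "\nReply with valid JSON only."  # nudge the model on retry
    return None

# Usage with a stubbed model standing in for the real infer() helper:
stub = lambda p: '```json\n{"model_name": "Bonsai-1.7B", "bits_per_weight": 1.125}\n```'
print(infer_json("Describe the model as JSON.", required_keys=("model_name",), infer_fn=stub))
```

    With the real helper one would pass something like `infer_fn=lambda p: infer(p, system_prompt=json_system, temp=0.1, verbose=False)[0]`.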

    section("14 · OpenAI-Compatible Server Mode")
    
    
    SERVER_PORT = 8088
    SERVER_URL  = f"http://localhost:{SERVER_PORT}"
    server_proc = None
    
    
    def start_server():
        global server_proc
        if server_proc and server_proc.poll() is None:
            print("   Server already running.")
            return
        cmd = (
            f"{LLAMA_SERVER} -m {MODEL_PATH} "
            f"--host 0.0.0.0 --port {SERVER_PORT} "
            f"-ngl 99 -c 4096 --no-display-prompt --log-disable 2>/dev/null"
        )
        server_proc = subprocess.Popen(cmd, shell=True,
                                       stdout=subprocess.DEVNULL,
                                       stderr=subprocess.DEVNULL)
        for _ in range(30):
            try:
                urllib.request.urlopen(f"{SERVER_URL}/health", timeout=1)
                print(f" llama-server running at {SERVER_URL}")
                return
            except Exception:
                time.sleep(1)
        print("  Server may still be starting up …")
    
    
    def stop_server():
        global server_proc
        if server_proc:
            server_proc.terminate()
            server_proc.wait()
            print("   Server stopped.")
    
    
    print(" Starting llama-server …")
    start_server()
    time.sleep(2)
    
    
    try:
        from openai import OpenAI
        client   = OpenAI(base_url=f"{SERVER_URL}/v1", api_key="no-key-needed")
        print("\n   Sending request via OpenAI client …")
        response = client.chat.completions.create(
            model="bonsai",
            messages=[
                {"role": "user", "content": "What are three key advantages of 1-bit LLMs for mobile devices?"},
            ],
            max_tokens=200,
            temperature=0.5,
        )
        reply = response.choices[0].message.content
        print(f"\n Server response:\n{reply}")
        usage = response.usage
        print(f"\n   Prompt tokens    : {usage.prompt_tokens}")
        print(f"   Completion tokens: {usage.completion_tokens}")
        print(f"   Total tokens     : {usage.total_tokens}")
    except Exception as e:
        print(f"  OpenAI client error: {e}")
    
    
    section("15 · Mini-RAG — Grounded Q&A with Context Injection")
    
    
    KB = {
        "bonsai_1.7b": (
            "Bonsai-1.7B uses Q1_0_g128 quantization. It has 1.7B parameters, "
            "a deployed size of 0.24 GB, a context length of 32,768 tokens, and is based on "
            "the Qwen3-1.7B dense architecture with GQA attention."
        ),
        "bonsai_8b": (
            "Bonsai-8B uses Q1_0_g128 quantization. It supports up to 65,536 tokens "
            "of context. It achieves 3.0× faster token generation than FP16 on RTX 4090."
        ),
        "quantization": (
            "Q1_0_g128 packs each weight as a single sign bit (0=-scale, 1=+scale). "
            "Each group of 128 weights shares one FP16 scale factor, giving 1.125 bpw."
        ),
    }
    
    
    def rag_query(question):
        q = question.lower()
        relevant = []
        if "1.7" in q or "small" in q:  relevant.append(KB["bonsai_1.7b"])
        if "8b" in q or "context" in q: relevant.append(KB["bonsai_8b"])
        if "quant" in q or "bit" in q:  relevant.append(KB["quantization"])
        if not relevant:                relevant = list(KB.values())
        context    = "\n".join(f"- {c}" for c in relevant)
        rag_prompt = (
            "If the answer is not in the context, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        ans, _ = infer(rag_prompt, n_predict=150, temp=0.1, verbose=False)
        print(f" {question}")
        print(f" {ans}\n")
    
    
    print("Running RAG queries …\n")
    rag_query("What is the deployed file size of the 1.7B model?")
    rag_query("How does Q1_0_g128 quantization work?")
    rag_query("What context length does the 8B model support?")
    
    
    section("16 · Model Family Comparison")
    
    
    print("""
    ┌─────────────────┬──────────┬────────────┬────────────────┬──────────────┬──────────────┐
    │ Model           │ Params   │ GGUF Size  │ Context Len    │ FP16 Size    │ Compression  │
    ├─────────────────┼──────────┼────────────┼────────────────┼──────────────┼──────────────┤
    │ Bonsai-1.7B     │  1.7 B   │  0.25 GB   │ 32,768 tokens  │   3.44 GB    │    14.2×     │
    │ Bonsai-4B       │  4.0 B   │  ~0.6 GB   │ 32,768 tokens  │   ~8.0  GB   │    ~13×      │
    │ Bonsai-8B       │  8.0 B   │  ~0.9 GB   │ 65,536 tokens  │  ~16.0  GB   │    ~13.9×    │
    └─────────────────┴──────────┴────────────┴────────────────┴──────────────┴──────────────┘
    
    
    Throughput (from whitepaper):
     RTX 4090  — Bonsai-1.7B:  674 tok/s (TG128) vs FP16 224 tok/s  →  3.0× faster
     M4 Pro    — Bonsai-1.7B:  250 tok/s (TG128) vs FP16  65 tok/s  →  3.8× faster
    """)
    
    
    section("17 · Cleanup")
    
    
    stop_server()
    print(" Tutorial complete!\n")
    print(" Resources:")
    print("   GitHub:      https://github.com/PrismML-Eng/Bonsai-demo")
    print("   HuggingFace: https://huggingface.co/collections/prism-ml/bonsai")
    print("   Whitepaper:  https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf")
    print("   Discord:     https://discord.gg/prismml")

    We launch the OpenAI-compatible llama-server to interact with Bonsai through the OpenAI Python client. We then build a lightweight Mini-RAG example by injecting relevant context into prompts, compare the broader Bonsai model family in terms of size, context length, and compression, and finally shut down the local server cleanly. This final section shows how Bonsai can fit into API-style workflows, grounded question-answering setups, and broader deployment scenarios beyond simple single-prompt inference.
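    The keyword `if` chains in `rag_query` work for three KB entries but do not scale; a common next step is simple term-overlap scoring over the same knowledge base. The sketch below is our own minimal stand-in for a real embedding retriever, using a shortened version of the tutorial's KB dict for illustration.

```python
import re

# Minimal term-overlap retriever over a KB like the tutorial's — a stand-in
# for embedding-based retrieval, ranking snippets by shared lowercase tokens.

KB = {
    "bonsai_1.7b": "Bonsai-1.7B uses Q1_0_g128 quantization. Deployed size 0.24 GB, context 32,768 tokens.",
    "bonsai_8b": "Bonsai-8B supports up to 65,536 tokens of context.",
    "quantization": "Q1_0_g128 packs each weight as a sign bit; 128 weights share one FP16 scale.",
}

def tokens(s):
    """Lowercase alphanumeric tokens, so '8B' and 'context?' still match."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(question, k=2):
    q = tokens(question)
    # Rank KB snippets by how many tokens they share with the question.
    ranked = sorted(KB.values(), key=lambda text: -len(q & tokens(text)))
    return ranked[:k]

for ctx in retrieve("What context length does the 8B model support?"):
    print("-", ctx)
```

    The top-k snippets would then be joined into the `Context:` block exactly as `rag_query` does, keeping the prompt small while staying grounded in the KB.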

    In conclusion, we built and ran a full Bonsai 1-bit LLM workflow in Google Colab and saw that extreme quantization can dramatically reduce model size while still supporting useful, fast, and versatile inference. We verified the runtime environment, launched the model locally, measured token throughput, and experimented with different prompting, sampling, context handling, and server-based integrations. Along the way, we also connected the practical execution to the underlying quantization logic, helping us understand not just how to use Bonsai, but why its design matters for efficient AI deployment. By the end, we have a compact but advanced setup that demonstrates how 1-bit language models can make high-performance inference more accessible across constrained and mainstream hardware environments.


    Check out the Full Coding Notebook here.


    The post A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG appeared first on MarkTechPost.




    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy.
