In this tutorial, we work hands-on with Qwen3.5 models distilled with Claude-style reasoning and set up a Colab pipeline that lets us switch between a 27B GGUF variant and a lightweight 2B 4-bit version with a single flag. We begin by validating GPU availability, then conditionally install either llama.cpp or transformers with bitsandbytes, depending on the chosen path. Both branches are unified through shared generate_fn and stream_fn interfaces, ensuring consistent inference across backends. We also implement a ChatSession class for multi-turn interaction and build utilities to parse <think> traces, allowing us to explicitly separate reasoning from final outputs during execution.
MODEL_PATH = "2B_HF"
import torch
if not torch.cuda.is_available():
    raise RuntimeError(
        "❌ No GPU! Go to Runtime → Change runtime type → T4 GPU."
    )
gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"✅ GPU: {gpu_name} — {vram_gb:.1f} GB VRAM")
import subprocess, sys, os, re, time
generate_fn = None
stream_fn = None
We initialize the execution by setting the model path flag and checking whether a GPU is available on the system. We retrieve and print the GPU name along with available VRAM to make sure the environment meets the requirements. We also import all required base libraries and define placeholders for the unified generation functions that will be assigned later.
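Before picking a path, it helps to sanity-check whether a model's weights can even fit in the reported VRAM. A rough rule of thumb (our own heuristic, not from the notebook): weight memory ≈ parameter count × bits-per-weight / 8, ignoring KV cache and runtime overhead. Q4_K_M averages roughly 4.8 bits per weight, and NF4 is about 4 bits plus a small overhead for quantization constants.

```python
def approx_weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory estimate for a quantized model, in GB.

    Ignores KV cache, activations, and runtime overhead, so real usage
    is higher -- treat this as a lower bound for fitting a model in VRAM.
    """
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Approximate bits-per-weight figures (assumptions): Q4_K_M ~4.8, NF4 ~4.2.
print(f"27B @ ~4.8 bpw: ~{approx_weight_gb(27, 4.8):.1f} GB")  # in the ballpark of the ~16.5 GB GGUF file
print(f"2B  @ ~4.2 bpw: ~{approx_weight_gb(2, 4.2):.2f} GB")
```

This is why the 27B path relies on partial GPU offload (n_gpu_layers) on a 16 GB T4, while the 2B 4-bit model fits comfortably.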
if MODEL_PATH == "27B_GGUF":
    print("\n📦 Installing llama-cpp-python with CUDA (takes 3-5 min)...")
    env = os.environ.copy()
    env["CMAKE_ARGS"] = "-DGGML_CUDA=on"
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-q", "llama-cpp-python", "huggingface_hub"],
        env=env,
    )
    print("✅ Installed.\n")
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama
    GGUF_REPO = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
    GGUF_FILE = "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf"
    print(f"⏳ Downloading {GGUF_FILE} (~16.5 GB)... grab a coffee ☕")
    model_path = hf_hub_download(repo_id=GGUF_REPO, filename=GGUF_FILE)
    print(f"✅ Downloaded: {model_path}\n")
    print("⏳ Loading into llama.cpp (GPU offload)...")
    llm = Llama(
        model_path=model_path,
        n_ctx=8192,
        n_gpu_layers=40,
        n_threads=4,
        verbose=False,
    )
    print("✅ 27B GGUF model loaded!\n")
    def generate_fn(
        prompt, system_prompt="You are a helpful assistant. Think step-by-step.",
        max_new_tokens=2048, temperature=0.6, top_p=0.95, **kwargs
    ):
        output = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            max_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
        )
        return output["choices"][0]["message"]["content"]
    def stream_fn(
        prompt, system_prompt="You are a helpful assistant. Think step-by-step.",
        max_new_tokens=2048, temperature=0.6, top_p=0.95,
    ):
        print("⏳ Streaming output:\n")
        for chunk in llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            max_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            stream=True,
        ):
            delta = chunk["choices"][0].get("delta", {})
            text = delta.get("content", "")
            if text:
                print(text, end="", flush=True)
        print()
    class ChatSession:
        def __init__(self, system_prompt="You are a helpful assistant. Think step-by-step."):
            self.messages = [{"role": "system", "content": system_prompt}]
        def chat(self, user_message, temperature=0.6):
            self.messages.append({"role": "user", "content": user_message})
            output = llm.create_chat_completion(
                messages=self.messages, max_tokens=2048,
                temperature=temperature, top_p=0.95,
            )
            resp = output["choices"][0]["message"]["content"]
            self.messages.append({"role": "assistant", "content": resp})
            return resp
We handle the 27B GGUF path by installing llama.cpp with CUDA support and downloading the Qwen3.5 27B distilled model from Hugging Face. We load the model with GPU offloading and define a standardized generate_fn and stream_fn for inference and streaming outputs. We also implement a ChatSession class to maintain conversation history for multi-turn interactions.
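One caveat with ChatSession: self.messages grows without bound and will eventually overflow the n_ctx=8192 context window. A minimal mitigation sketch (trim_history is a hypothetical helper, not part of the notebook, and it uses a crude words-as-tokens proxy rather than real tokenization):

```python
def trim_history(messages, max_words=3000):
    """Drop the oldest user/assistant pairs, keeping the system message,
    until a crude word-count proxy for tokens fits the budget."""
    system, turns = messages[:1], messages[1:]
    def total_words(msgs):
        return sum(len(m["content"].split()) for m in msgs)
    # Remove whole user+assistant pairs from the front until we fit.
    while turns and total_words(system + turns) > max_words:
        turns = turns[2:]
    return system + turns

# Simulate a long conversation: 50 turns of ~80-word messages.
history = [{"role": "system", "content": "You are a helpful assistant."}]
for _ in range(50):
    history.append({"role": "user", "content": "question " * 80})
    history.append({"role": "assistant", "content": "answer " * 80})
trimmed = trim_history(history, max_words=1000)
print(len(history), "->", len(trimmed))  # system message survives, oldest turns dropped
```

In practice you would count real tokens (e.g. with the backend's tokenizer) instead of words, but the pair-wise trimming logic is the same.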
elif MODEL_PATH == "2B_HF":
    print("\n📦 Installing transformers + bitsandbytes...")
    subprocess.check_call([
        sys.executable, "-m", "pip", "install", "-q",
        "transformers @ git+https://github.com/huggingface/transformers.git@main",
        "accelerate", "bitsandbytes", "sentencepiece", "protobuf",
    ])
    print("✅ Installed.\n")
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer
    HF_MODEL_ID = "Jackrong/Qwen3.5-2B-Claude-4.6-Opus-Reasoning-Distilled"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    print(f"⏳ Loading {HF_MODEL_ID} in 4-bit...")
    tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        HF_MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    print(f"✅ Model loaded! Memory: {model.get_memory_footprint() / 1e9:.2f} GB\n")
    def generate_fn(
        prompt, system_prompt="You are a helpful assistant. Think step-by-step.",
        max_new_tokens=2048, temperature=0.6, top_p=0.95,
        repetition_penalty=1.05, do_sample=True, **kwargs
    ):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(
                **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
                top_p=top_p, repetition_penalty=repetition_penalty, do_sample=do_sample,
            )
        generated = output_ids[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(generated, skip_special_tokens=True)
    def stream_fn(
        prompt, system_prompt="You are a helpful assistant. Think step-by-step.",
        max_new_tokens=2048, temperature=0.6, top_p=0.95,
    ):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        print("⏳ Streaming output:\n")
        with torch.no_grad():
            model.generate(
                **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
                top_p=top_p, do_sample=True, streamer=streamer,
            )
    class ChatSession:
        def __init__(self, system_prompt="You are a helpful assistant. Think step-by-step."):
            self.messages = [{"role": "system", "content": system_prompt}]
        def chat(self, user_message, temperature=0.6):
            self.messages.append({"role": "user", "content": user_message})
            text = tokenizer.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            with torch.no_grad():
                output_ids = model.generate(
                    **inputs, max_new_tokens=2048, temperature=temperature, top_p=0.95, do_sample=True,
                )
            generated = output_ids[0][inputs["input_ids"].shape[1]:]
            resp = tokenizer.decode(generated, skip_special_tokens=True)
            self.messages.append({"role": "assistant", "content": resp})
            return resp
else:
    raise ValueError("MODEL_PATH must be '27B_GGUF' or '2B_HF'")
We implement the lightweight 2B path using transformers with 4-bit quantization through bitsandbytes. We load the Qwen3.5 2B distilled model efficiently onto the GPU and configure generation parameters for controlled sampling. We again define unified generation, streaming, and chat session logic so that both model paths behave identically during execution.
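Both branches expose the same temperature and top_p knobs. As a quick aside on what temperature actually does (toy logits below, not model output): it rescales logits before the softmax, so low values concentrate probability on the top token while high values flatten the distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for t in (0.1, 0.6, 1.0):
    probs = softmax(logits, t)
    print(f"T={t}: top-token probability = {probs[0]:.3f}")
# Low temperature is near-greedy; high temperature drifts toward uniform.
```

This is why the creative-writing test later sweeps temperature from 0.1 to 1.0: the same prompt yields progressively more varied continuations.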
def parse_thinking(response: str) -> tuple:
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if m:
        return m.group(1).strip(), response[m.end():].strip()
    return "", response.strip()
def display_response(response: str):
    thinking, answer = parse_thinking(response)
    if thinking:
        print("🧠 THINKING:")
        print("-" * 60)
        print(thinking[:1500] + ("\n... [truncated]" if len(thinking) > 1500 else ""))
        print("-" * 60)
    print("\n💬 ANSWER:")
    print(answer)
print("✅ All helpers ready. Running tests...\n")
We define helper functions to extract reasoning traces enclosed inside <think> tags and separate them from final answers. We create a display utility that formats and prints both the thinking process and the response in a structured way. This allows us to inspect how the Qwen-based model reasons internally during generation.
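The parser can be sanity-checked without a GPU by running the same regex on a synthetic response (the sample string below is made up; it just mimics the <think> tag convention the model emits):

```python
import re

def parse_thinking(response: str) -> tuple:
    # Same logic as the helper above: split the <think>...</think> trace
    # from the final answer; if no trace is present, everything is the answer.
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if m:
        return m.group(1).strip(), response[m.end():].strip()
    return "", response.strip()

sample = "<think>3 apples, give away half -> 1.5, buy 5 -> 6.5</think>You end up with 6.5 apples."
thinking, answer = parse_thinking(sample)
print("THINKING:", thinking)
print("ANSWER:", answer)
```

Note the non-greedy `(.*?)` with re.DOTALL: the trace may span multiple lines, and we want to stop at the first closing tag rather than the last.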
print("=" * 70)
print("📝 TEST 1: Basic reasoning")
print("=" * 70)
response = generate_fn(
    "If I have 3 apples and give away half, then buy 5 more, how many do I have? "
    "Explain your reasoning."
)
display_response(response)
print("\n" + "=" * 70)
print("📝 TEST 2: Streaming output")
print("=" * 70)
stream_fn(
    "Explain the difference between concurrency and parallelism. "
    "Give a real-world analogy for each."
)
print("\n" + "=" * 70)
print("📝 TEST 3: Thinking ON vs OFF")
print("=" * 70)
question = "What is the capital of France?"
print("\n--- Thinking ON (default) ---")
resp = generate_fn(question)
display_response(resp)
print("\n--- Thinking OFF (concise) ---")
resp = generate_fn(
    question,
    system_prompt="Answer directly and concisely. Do not use <think> tags.",
    max_new_tokens=256,
)
display_response(resp)
print("\n" + "=" * 70)
print("📝 TEST 4: Bat & ball trick question")
print("=" * 70)
response = generate_fn(
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost? Show full reasoning and verify.",
    system_prompt="You are a precise mathematical reasoner. Set up equations and verify.",
    temperature=0.3,
)
display_response(response)
print("\n" + "=" * 70)
print("📝 TEST 5: Train meeting problem")
print("=" * 70)
response = generate_fn(
    "A train leaves Station A at 9:00 AM at 60 mph toward Station B. "
    "Another leaves Station B at 10:00 AM at 80 mph toward Station A. "
    "The stations are 280 miles apart. When and where do they meet?",
    temperature=0.3,
)
display_response(response)
print("\n" + "=" * 70)
print("📝 TEST 6: Logic puzzle (5 houses)")
print("=" * 70)
response = generate_fn(
    "Five houses in a row are painted different colors. "
    "The red house is left of the blue house. "
    "The green house is in the middle. "
    "The yellow house is not next to the blue house. "
    "The white house is at one end. "
    "What is the order from left to right?",
    temperature=0.3,
    max_new_tokens=3000,
)
display_response(response)
print("\n" + "=" * 70)
print("📝 TEST 7: Code generation — longest palindromic substring")
print("=" * 70)
response = generate_fn(
    "Write a Python function to find the longest palindromic substring "
    "using Manacher's algorithm. Include docstring, type hints, and tests.",
    system_prompt="You are an expert Python programmer. Think through the algorithm carefully.",
    max_new_tokens=3000,
    temperature=0.3,
)
display_response(response)
print("\n" + "=" * 70)
print("📝 TEST 8: Multi-turn conversation (physics tutor)")
print("=" * 70)
session = ChatSession(
    system_prompt="You are a knowledgeable physics tutor. Explain clearly with examples."
)
turns = [
    "What is the Heisenberg uncertainty principle?",
    "Can you give me a concrete example with actual numbers?",
    "How does this relate to quantum tunneling?",
]
for i, q in enumerate(turns, 1):
    print(f"\n{'─'*60}")
    print(f"👤 Turn {i}: {q}")
    print(f"{'─'*60}")
    resp = session.chat(q, temperature=0.5)
    _, answer = parse_thinking(resp)
    print(f"🤖 {answer[:1000]}{'...' if len(answer) > 1000 else ''}")
print("\n" + "=" * 70)
print("📝 TEST 9: Temperature comparison — creative writing")
print("=" * 70)
creative_prompt = "Write a one-paragraph opening for a sci-fi story about AI consciousness."
configs = [
    {"label": "Low temp (0.1)", "temperature": 0.1, "top_p": 0.9},
    {"label": "Med temp (0.6)", "temperature": 0.6, "top_p": 0.95},
    {"label": "High temp (1.0)", "temperature": 1.0, "top_p": 0.98},
]
for cfg in configs:
    print(f"\n🎛️ {cfg['label']}")
    print("-" * 60)
    start = time.time()
    resp = generate_fn(
        creative_prompt,
        system_prompt="You are a creative fiction writer.",
        max_new_tokens=512,
        temperature=cfg["temperature"],
        top_p=cfg["top_p"],
    )
    elapsed = time.time() - start
    _, answer = parse_thinking(resp)
    print(answer[:600])
    print(f"⏱️ {elapsed:.1f}s")
print("\n" + "=" * 70)
print("📝 TEST 10: Speed benchmark")
print("=" * 70)
start = time.time()
resp = generate_fn(
    "Explain how a neural network learns, step by step, for a beginner.",
    system_prompt="You are a patient, clear teacher.",
    max_new_tokens=1024,
)
elapsed = time.time() - start
approx_tokens = int(len(resp.split()) * 1.3)
print(f"~{approx_tokens} tokens in {elapsed:.1f}s")
print(f"~{approx_tokens / elapsed:.1f} tokens/sec")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
import gc
for name in ["model", "llm"]:
    if name in globals():
        del globals()[name]
gc.collect()
torch.cuda.empty_cache()
print(f"\n✅ Memory freed. VRAM: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print("\n" + "=" * 70)
print("🎉 Tutorial complete!")
print("=" * 70)
We run a comprehensive test suite that evaluates the model across reasoning, streaming, logic puzzles, code generation, and multi-turn conversations. We compare outputs under different temperature settings and measure performance in terms of speed and token throughput. Finally, we clean up memory and free GPU resources, keeping the notebook reusable for further experiments.
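The benchmark's words × 1.3 token estimate can be factored into a small reusable helper (same crude approximation as the benchmark, not an exact token count; a real count would use the backend's tokenizer):

```python
def approx_tokens_per_sec(text: str, elapsed_s: float, words_per_token_factor: float = 1.3):
    """Estimate decoding throughput from plain text using the rough
    words * 1.3 ~ tokens heuristic."""
    approx_tokens = int(len(text.split()) * words_per_token_factor)
    return approx_tokens, approx_tokens / elapsed_s

# Example with synthetic text: 100 words generated in 5 seconds.
tokens, tps = approx_tokens_per_sec("word " * 100, 5.0)
print(tokens, f"{tps:.1f} tok/s")  # 130 tokens, 26.0 tok/s
```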
In conclusion, we have a compact but versatile setup for running Qwen3.5-based reasoning models enhanced with Claude-style distillation across different hardware constraints. The script abstracts backend differences while exposing consistent generation, streaming, and conversational interfaces, making it easy to experiment with reasoning behavior. Through the test suite, we probe how the model handles structured reasoning, edge-case questions, and longer multi-step tasks, while also measuring speed and memory usage. What we end up with is not just a demo, but a reusable scaffold for evaluating and extending Qwen-based reasoning systems in Colab without changing the core code.
Check out the Full Notebook and Source Page.
