
    What Is Tokenization Drift and How to Fix It?

    By Naveed Ahmad | 03/05/2026 | Updated: 03/05/2026 | 2 min read


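    # Assumes `tokenizer` and `pairs` are defined earlier in the article (not shown here).
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches

    # First token ID of each phrase, encoded with and without a leading space
    phrases = [p[1] for p in pairs]
    ids_ws  = [tokenizer.encode(" " + w, add_special_tokens=False)[0] for w in phrases]
    ids_nws = [tokenizer.encode(w, add_special_tokens=False)[0] for w in phrases]
    delta   = [abs(a - b) for a, b in zip(ids_ws, ids_nws)]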
     
    x = np.arange(len(phrases))
    width = 0.35
     
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    fig.patch.set_facecolor("#FAFAF8")
     
    # Left: side-by-side token IDs
    ax = axes[0]
    ax.set_facecolor("#FAFAF8")
    bars1 = ax.bar(x - width/2, ids_ws,  width, label="With leading space",    color="#3B6FE0", alpha=0.85)
    bars2 = ax.bar(x + width/2, ids_nws, width, label="Without leading space", color="#E05C3B", alpha=0.85)
    ax.set_xticks(x)
    ax.set_xticklabels(phrases, rotation=30, ha="right", fontsize=9)
    ax.set_ylabel("Token ID", fontsize=10)
    ax.set_title("Token IDs: ' phrase'  vs  'phrase'", fontsize=12, fontweight="bold", pad=12)
    ax.legend(fontsize=9)
    ax.spines[["top", "right"]].set_visible(False)
    ax.grid(axis="y", alpha=0.3)
     
    # Annotate each bar with its token ID
    for bar in bars1:
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                str(int(bar.get_height())), ha="center", va="bottom", fontsize=7, color="#3B6FE0")
    for bar in bars2:
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                str(int(bar.get_height())), ha="center", va="bottom", fontsize=7, color="#E05C3B")
     
    # Right: delta
    ax2 = axes[1]
    ax2.set_facecolor("#FAFAF8")
    # Color each bar by how far apart the two token IDs are
    color_bars = ["#E05C3B" if d > 500 else "#F0A070" if d > 100 else "#A8C4F0" for d in delta]
    bars3 = ax2.bar(x, delta, color=color_bars, alpha=0.9)
    ax2.set_ylabel("Absolute Token ID Distance", fontsize=10)
    ax2.set_title("How Far Apart Are the Token IDs?", fontsize=12, fontweight="bold", pad=12)
    # Fix the tick positions before labelling them, as in the left panel
    ax2.set_xticks(x)
    ax2.set_xticklabels(phrases, rotation=30, ha="right", fontsize=9)
    ax2.spines[["top", "right"]].set_visible(False)
    ax2.grid(axis="y", alpha=0.3)
     
    for bar, d in zip(bars3, delta):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
                 str(d), ha="center", va="bottom", fontsize=9, fontweight="bold")
     
    # Legend patches matching the color thresholds above
    high = mpatches.Patch(color="#E05C3B", alpha=0.9, label="> 500 apart")
    med  = mpatches.Patch(color="#F0A070", alpha=0.9, label="100-500 apart")
    low  = mpatches.Patch(color="#A8C4F0", alpha=0.9, label="< 100 apart")
    ax2.legend(handles=[high, med, low], fontsize=8)
     
    plt.tight_layout(pad=2)
    plt.suptitle("Tokenization Artifacts: One Space, Completely Different Token",
                 fontsize=14, fontweight="bold", y=1.02)
    plt.savefig("tokenization_artifact.png", dpi=150, bbox_inches="tight", facecolor="#FAFAF8")
    plt.show()
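
    The distance and color-bucket logic from the plotting code can be checked on its own, without a tokenizer or matplotlib. The following is a minimal sketch: the helper names `id_deltas` and `bucket_color` are hypothetical, and the token IDs are made up for illustration rather than taken from any real tokenizer.

```python
def id_deltas(ids_ws, ids_nws):
    """Absolute distance between each pair of token IDs."""
    return [abs(a - b) for a, b in zip(ids_ws, ids_nws)]


def bucket_color(d):
    """Same thresholds as the right-hand panel: > 500, > 100, otherwise."""
    if d > 500:
        return "#E05C3B"
    if d > 100:
        return "#F0A070"
    return "#A8C4F0"


# Hypothetical token IDs, invented for illustration; not real tokenizer output.
example_ws = [1000, 2500, 40]    # IDs for " phrase" (with leading space)
example_nws = [1900, 2350, 45]   # IDs for "phrase" (without leading space)

deltas = id_deltas(example_ws, example_nws)
colors = [bucket_color(d) for d in deltas]

print(deltas)  # [900, 150, 5]
print(colors)  # ['#E05C3B', '#F0A070', '#A8C4F0']
```

    Because the comparisons are strict, a distance of exactly 500 falls into the middle bucket and a distance of exactly 100 falls into the lowest one, which the "< 100 apart" legend label does not quite cover.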


