    A Coding Implementation on Document Parsing Benchmarking with LlamaIndex ParseBench Using Python, Hugging Face, and Evaluation Metrics

    By Naveed Ahmad · 29/04/2026 · 10 Mins Read


    In this tutorial, we explore how to use the ParseBench dataset to evaluate document parsing systems in a structured, practical way. We begin by loading the dataset directly from Hugging Face, inspecting its multiple dimensions, such as text, tables, charts, and layout, and transforming it into a unified dataframe for deeper analysis. As we progress, we identify key fields, detect linked PDFs, and build a lightweight baseline using PyMuPDF to extract and compare text. Throughout the process, we focus on creating a flexible pipeline that allows us to understand the dataset schema, evaluate parsing quality, and prepare inputs for more advanced OCR or vision-language models.

    !pip install -q -U datasets huggingface_hub pandas matplotlib rich pymupdf rapidfuzz tqdm


    import json, re, textwrap, random, math
    from pathlib import Path
    from collections import Counter
    import pandas as pd
    import matplotlib.pyplot as plt
    from tqdm.auto import tqdm
    from rich.console import Console
    from rich.table import Table
    from rich.panel import Panel
    from huggingface_hub import hf_hub_download, list_repo_files
    from rapidfuzz import fuzz
    import fitz


    console = Console()
    DATASET_ID = "llamaindex/ParseBench"
    WORKDIR = Path("/content/parsebench_tutorial")
    WORKDIR.mkdir(parents=True, exist_ok=True)


    console.print(Panel.fit("Advanced ParseBench Tutorial on Google Colab", style="bold green"))


    files = list_repo_files(DATASET_ID, repo_type="dataset")
    jsonl_files = [f for f in files if f.endswith(".jsonl")]
    pdf_files = [f for f in files if f.endswith(".pdf")]


    console.print(f"Found {len(jsonl_files)} JSONL files")
    console.print(f"Found {len(pdf_files)} PDF files")


    table = Table(title="ParseBench JSONL Files")
    table.add_column("File")
    table.add_column("Size")
    for f in jsonl_files:
        table.add_row(f, Path(f).stem)
    console.print(table)

    We install all required libraries and set up our working environment for the tutorial. We initialize the dataset source and prepare a workspace to store all outputs. We also fetch and list all JSONL and PDF files from the ParseBench repository to understand the dataset structure.
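
    One quirk worth noting: the table above fills its Size column with the file stem rather than a real size. If actual byte counts are useful, the repository tree can be queried directly; the snippet below is an optional sketch and assumes a recent huggingface_hub release that exposes HfApi.list_repo_tree.

    from huggingface_hub import HfApi

    # Optional sketch: report real file sizes for the JSONL files.
    # Assumes a recent huggingface_hub version that provides HfApi.list_repo_tree.
    api = HfApi()
    tree = api.list_repo_tree(DATASET_ID, repo_type="dataset", recursive=True)
    size_by_path = {entry.path: entry.size for entry in tree if hasattr(entry, "size")}

    for f in jsonl_files:
        size_kb = (size_by_path.get(f) or 0) / 1024
        console.print(f"{f}: {size_kb:.1f} KB")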

    def load_jsonl_from_hf(filename, max_rows=None):
        path = hf_hub_download(repo_id=DATASET_ID, filename=filename, repo_type="dataset")
        rows = []
        with open(path, "r", encoding="utf-8") as fp:
            for i, line in enumerate(fp):
                if max_rows and i >= max_rows:
                    break
                line = line.strip()
                if line:
                    rows.append(json.loads(line))
        return rows, path


    def flatten_dict(d, parent_key="", sep="."):
        items = {}
        if isinstance(d, dict):
            for k, v in d.items():
                new_key = f"{parent_key}{sep}{k}" if parent_key else str(k)
                if isinstance(v, dict):
                    items.update(flatten_dict(v, new_key, sep=sep))
                else:
                    items[new_key] = v
        return items


    dimension_data = {}
    for jf in jsonl_files:
        rows, local_path = load_jsonl_from_hf(jf)
        dimension_data[Path(jf).stem] = rows
        console.print(f"{jf}: {len(rows)} examples loaded")


    summary_rows = []
    for dim, rows in dimension_data.items():
        keys = Counter()
        for r in rows[:100]:
            keys.update(flatten_dict(r).keys())
        summary_rows.append({
            "dimension": dim,
            "examples": len(rows),
            "top_fields": ", ".join([k for k, _ in keys.most_common(12)])
        })


    summary_df = pd.DataFrame(summary_rows)
    display(summary_df)


    plt.figure(figsize=(10, 5))
    plt.bar(summary_df["dimension"], summary_df["examples"])
    plt.title("ParseBench Examples by Dimension")
    plt.xlabel("Dimension")
    plt.ylabel("Number of Examples")
    plt.xticks(rotation=30, ha="right")
    plt.show()


    for dim, rows in dimension_data.items():
        console.print(Panel.fit(f"Sample schema for {dim}", style="bold cyan"))
        if rows:
            console.print(json.dumps(rows[0], indent=2)[:3000])

    We load the JSONL files from the dataset and convert them into usable Python objects. We flatten nested structures so that we can analyze them easily in a tabular format. We also summarize each dimension and visualize the distribution of examples across different parsing tasks.
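
    To make the flattening step concrete, here is a tiny sanity check; the nested record is invented for illustration and is not part of ParseBench.

    # Invented record (not from ParseBench) to illustrate how flatten_dict builds dotted keys.
    toy = {"doc": {"path": "sample.pdf", "meta": {"pages": 3}}, "text": "hello"}
    print(flatten_dict(toy))
    # {'doc.path': 'sample.pdf', 'doc.meta.pages': 3, 'text': 'hello'}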

    all_records = []
    for dim, rows in dimension_data.items():
        for i, r in enumerate(rows):
            flat = flatten_dict(r)
            flat["_dimension"] = dim
            flat["_row_id"] = i
            all_records.append(flat)


    df = pd.DataFrame(all_records)
    console.print(f"Combined dataframe shape: {df.shape}")
    display(df.head())


    missing_report = []
    for col in df.columns:
        missing_report.append({
            "column": col,
            "non_null": int(df[col].notna().sum()),
            "missing": int(df[col].isna().sum()),
            "coverage_pct": round(100 * df[col].notna().mean(), 2)
        })


    missing_df = pd.DataFrame(missing_report).sort_values("coverage_pct", ascending=False)
    display(missing_df.head(40))


    def find_candidate_columns(df, keywords):
        cols = []
        for c in df.columns:
            lc = c.lower()
            if any(k.lower() in lc for k in keywords):
                cols.append(c)
        return cols


    doc_cols = find_candidate_columns(df, ["doc", "pdf", "file", "path", "source", "image"])
    text_cols = find_candidate_columns(df, ["text", "content", "markdown", "ground", "answer", "expected", "target", "reference"])
    rule_cols = find_candidate_columns(df, ["rule", "check", "assert", "criteria", "question", "prompt"])
    bbox_cols = find_candidate_columns(df, ["bbox", "box", "polygon", "coordinates", "layout"])


    console.print("[bold]Possible document columns:[/bold]", doc_cols[:30])
    console.print("[bold]Possible text/reference columns:[/bold]", text_cols[:30])
    console.print("[bold]Possible rule/question columns:[/bold]", rule_cols[:30])
    console.print("[bold]Possible layout columns:[/bold]", bbox_cols[:30])

    We combine all parsed records into a single dataframe for unified analysis. We evaluate missing values and identify which fields are most informative across the dataset. We also detect candidate columns related to documents, text, rules, and layout to guide downstream processing.
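
    If we want to know which of these candidate fields actually appear in which dimension, rather than looking at overall coverage only, a small per-dimension aggregation over the combined dataframe makes that explicit. This is a minimal sketch that reuses df and the _dimension marker added above; the column subset is only illustrative.

    # Per-dimension coverage (fraction of non-null values) for a few candidate columns.
    # The column selection is illustrative; adjust it to the fields you care about.
    cols_of_interest = (doc_cols + text_cols)[:10]
    if cols_of_interest:
        coverage = df.groupby("_dimension")[cols_of_interest].agg(
            lambda s: round(float(s.notna().mean()), 2)
        )
        display(coverage)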

    def pick_first_existing(row, candidates):
        for c in candidates:
            if c in row and pd.notna(row[c]):
                value = row[c]
                if isinstance(value, str) and value.strip():
                    return value
                if not isinstance(value, str):
                    return value
        return None


    def normalize_text(x):
        if x is None or (isinstance(x, float) and math.isnan(x)):
            return ""
        x = str(x)
        x = re.sub(r"\s+", " ", x)
        return x.strip().lower()


    def simple_text_similarity(a, b):
        a = normalize_text(a)
        b = normalize_text(b)
        if not a or not b:
            return None
        return fuzz.token_set_ratio(a, b) / 100


    def locate_pdf_path(value):
        if value is None:
            return None
        value = str(value)
        candidates = []
        if value.endswith(".pdf"):
            candidates.append(value)
            candidates.extend([f for f in pdf_files if f.endswith(value.split("/")[-1])])
        else:
            candidates.extend([
                f for f in pdf_files
                if value in f or Path(f).stem in value or value in Path(f).stem
            ])
        return candidates[0] if candidates else None


    def extract_pdf_text_from_hf(pdf_repo_path, max_pages=2):
        local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
        doc = fitz.open(local_pdf)
        texts = []
        for page_idx in range(min(max_pages, len(doc))):
            texts.append(doc[page_idx].get_text("text"))
        doc.close()
        return "\n".join(texts), local_pdf


    def render_pdf_first_page(pdf_repo_path, zoom=2):
        local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
        doc = fitz.open(local_pdf)
        page = doc[0]
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        out_path = WORKDIR / (Path(pdf_repo_path).stem + "_page1.png")
        pix.save(out_path)
        doc.close()
        return out_path


    sample_records = df.sample(min(25, len(df)), random_state=42).to_dict("records")
    pdf_candidates = []


    for row in sample_records:
        for c in doc_cols:
            pdf_path = locate_pdf_path(row.get(c))
            if pdf_path:
                pdf_candidates.append((row["_dimension"], row["_row_id"], pdf_path))
                break


    pdf_candidates = list(dict.fromkeys(pdf_candidates))
    console.print(f"Detected {len(pdf_candidates)} PDF-linked sampled records")


    if pdf_candidates:
        dim, row_id, pdf_path = pdf_candidates[0]
        console.print(Panel.fit(f"Rendering sample PDF\nDimension: {dim}\nRow: {row_id}\nPDF: {pdf_path}", style="bold yellow"))
        image_path = render_pdf_first_page(pdf_path)
        img = plt.imread(image_path)
        plt.figure(figsize=(10, 12))
        plt.imshow(img)
        plt.axis("off")
        plt.title(f"{dim}: {Path(pdf_path).name}")
        plt.show()
    else:
        console.print("[yellow]No PDF-linked rows were detected from the sample.[/yellow]")

    We define helper functions for text normalization, similarity scoring, and PDF handling. We locate and download PDF files associated with dataset entries and extract their text content. We also render a sample PDF page for visual inspection of the document structure.
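
    As a quick sanity check on the scoring helpers before running them over the dataset, the toy strings below (invented for illustration) show that normalization collapses whitespace and casing, so only genuine content differences lower the score.

    # Toy strings (not from the dataset) to sanity-check normalize_text and simple_text_similarity.
    ref = "Quarterly Revenue:\n  $1.2M  "
    cand = "quarterly revenue: $1.2m"
    print(normalize_text(ref))                       # "quarterly revenue: $1.2m"
    print(simple_text_similarity(ref, cand))         # 1.0 after normalization
    print(simple_text_similarity(ref, "unrelated"))  # noticeably lower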

    preferred_gt_cols = [
        c for c in text_cols
        if any(k in c.lower() for k in ["ground", "expected", "target", "answer", "content", "text", "markdown", "reference"])
    ]


    evaluation_rows = []
    eval_sample = df.sample(min(50, len(df)), random_state=7).to_dict("records")


    for row in tqdm(eval_sample, desc="Running lightweight PDF text extraction baseline"):
        pdf_path = None
        for c in doc_cols:
            pdf_path = locate_pdf_path(row.get(c))
            if pdf_path:
                break


        if not pdf_path:
            evaluation_rows.append({
                "dimension": row.get("_dimension"),
                "row_id": row.get("_row_id"),
                "pdf": None,
                "ground_truth_column": None,
                "similarity_score": None,
                "status": "no_pdf_detected"
            })
            continue


        gt_col = None
        gt = None
        for c in preferred_gt_cols:
            if c in row and pd.notna(row[c]):
                gt_col = c
                gt = row[c]
                break


        if gt is None:
            evaluation_rows.append({
                "dimension": row.get("_dimension"),
                "row_id": row.get("_row_id"),
                "pdf": pdf_path,
                "ground_truth_column": None,
                "similarity_score": None,
                "status": "no_reference_detected"
            })
            continue


        try:
            extracted, local_pdf = extract_pdf_text_from_hf(pdf_path, max_pages=2)
            score = simple_text_similarity(extracted, gt)
            evaluation_rows.append({
                "dimension": row.get("_dimension"),
                "row_id": row.get("_row_id"),
                "pdf": pdf_path,
                "ground_truth_column": gt_col,
                "similarity_score": score,
                "extracted_chars": len(extracted),
                "ground_truth_chars": len(str(gt)),
                "status": "scored"
            })
        except Exception as e:
            evaluation_rows.append({
                "dimension": row.get("_dimension"),
                "row_id": row.get("_row_id"),
                "pdf": pdf_path,
                "ground_truth_column": gt_col,
                "similarity_score": None,
                "status": "error",
                "error": str(e)
            })


    eval_df = pd.DataFrame(evaluation_rows)


    if eval_df.empty:
        eval_df = pd.DataFrame(columns=[
            "dimension", "row_id", "pdf", "ground_truth_column",
            "similarity_score", "extracted_chars", "ground_truth_chars",
            "status", "error"
        ])


    display(eval_df.head(30))


    if "status" in eval_df.columns:
        display(eval_df["status"].value_counts().reset_index().rename(columns={"index": "status", "status": "count"}))


    if not eval_df.empty and "similarity_score" in eval_df.columns:
        valid_eval = eval_df.dropna(subset=["similarity_score"])


        if len(valid_eval):
            console.print(f"Average lightweight text similarity: {valid_eval['similarity_score'].mean():.3f}")


            plt.figure(figsize=(8, 5))
            plt.hist(valid_eval["similarity_score"], bins=10)
            plt.title("Lightweight Baseline Similarity Distribution")
            plt.xlabel("RapidFuzz Token Set Similarity")
            plt.ylabel("Count")
            plt.show()


            per_dim = valid_eval.groupby("dimension")["similarity_score"].mean().reset_index()
            display(per_dim)


            plt.figure(figsize=(9, 5))
            plt.bar(per_dim["dimension"], per_dim["similarity_score"])
            plt.title("Average Baseline Similarity by Dimension")
            plt.xlabel("Dimension")
            plt.ylabel("Average Similarity")
            plt.xticks(rotation=30, ha="right")
            plt.show()
        else:
            console.print("[yellow]No valid similarity scores were produced. This usually means the sampled rows did not contain both detectable PDFs and reference text.[/yellow]")
    else:
        console.print("[yellow]No similarity_score column found.[/yellow]")

    We run a lightweight evaluation pipeline by comparing extracted text with available reference fields. We compute similarity scores and analyze how well simple extraction performs across different dimensions. We also visualize the results to understand performance trends and limitations.
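
    To keep these baseline numbers around for later comparison against stronger parsers, one option is to persist eval_df alongside a rough length-ratio diagnostic. This is a minimal sketch; the output filename is our own choice, and the ratio simply flags extractions that are much shorter or longer than the reference.

    # Persist the baseline results plus a rough length-ratio diagnostic.
    # The filename is arbitrary; values near 1.0 mean extraction length roughly matches the reference.
    if not eval_df.empty and "extracted_chars" in eval_df.columns:
        eval_df["length_ratio"] = (
            eval_df["extracted_chars"] / eval_df["ground_truth_chars"].clip(lower=1)
        )
    eval_out = WORKDIR / "parsebench_baseline_eval.csv"
    eval_df.to_csv(eval_out, index=False)
    console.print(f"Saved baseline evaluation to: {eval_out}")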

    def inspect_dimension(dimension_name, n=3):
        rows = dimension_data.get(dimension_name, [])
        console.print(Panel.fit(f"Inspecting {dimension_name}: {len(rows)} rows", style="bold magenta"))
        for idx, row in enumerate(rows[:n]):
            console.print(f"\n[bold]Example {idx}[/bold]")
            console.print(json.dumps(row, indent=2)[:2500])


    for dim in list(dimension_data.keys())[:5]:
        inspect_dimension(dim, n=1)


    def make_parsebench_subset(dimension=None, n=20, seed=123):
        subset = df.copy()
        if dimension:
            subset = subset[subset["_dimension"] == dimension]
        if len(subset) == 0:
            return subset
        return subset.sample(min(n, len(subset)), random_state=seed)


    subset = make_parsebench_subset(n=20)
    display(subset.head())


    def create_llm_parser_prompt(row):
        dimension = row.get("_dimension", "unknown")
        candidate_truth = pick_first_existing(row, preferred_gt_cols)
        rule_hint = pick_first_existing(row, rule_cols)


        prompt = f"""
    You are evaluating a document parser on ParseBench.


    Dimension:
    {dimension}


    Task:
    Parse the PDF page into a structured representation that preserves the information needed for agentic workflows.


    Relevant benchmark hint or rule:
    {rule_hint if rule_hint is not None else "No obvious rule field detected."}


    Reference field preview:
    {str(candidate_truth)[:1000] if candidate_truth is not None else "No obvious reference field detected."}


    Return:
    1. Markdown representation
    2. Extracted tables as JSON arrays when tables exist
    3. Extracted chart values as JSON when charts exist
    4. Layout-sensitive notes when visual grounding matters
    """
        return textwrap.dedent(prompt).strip()


    prompt_examples = []
    if len(subset):
        for _, row in subset.head(3).iterrows():
            prompt_examples.append(create_llm_parser_prompt(row.to_dict()))


    if prompt_examples:
        console.print(Panel.fit("Example prompt for testing an external OCR or VLM parser", style="bold blue"))
        console.print(prompt_examples[0])
    else:
        console.print("[yellow]No prompt examples could be created because the subset is empty.[/yellow]")


    def compare_parser_outputs(reference, candidate):
        return {
            "token_set_similarity": simple_text_similarity(reference, candidate),
            "partial_ratio": fuzz.partial_ratio(normalize_text(reference), normalize_text(candidate)) / 100 if reference and candidate else None,
            "candidate_length": len(str(candidate)) if candidate else 0,
            "reference_length": len(str(reference)) if reference else 0
        }


    if not eval_df.empty and "similarity_score" in eval_df.columns:
        scored_eval = eval_df.dropna(subset=["similarity_score"])


        if len(scored_eval):
            best = scored_eval.sort_values("similarity_score", ascending=False).head(1)
            worst = scored_eval.sort_values("similarity_score", ascending=True).head(1)


            console.print(Panel.fit("Best lightweight baseline example", style="bold green"))
            display(best)


            console.print(Panel.fit("Worst lightweight baseline example", style="bold red"))
            display(worst)
        else:
            console.print("[yellow]No valid similarity scores were available for best/worst comparison.[/yellow]")


    output_path = WORKDIR / "parsebench_flattened_sample.csv"
    df.head(500).to_csv(output_path, index=False)
    console.print(f"Saved flattened sample to: {output_path}")


    console.print(Panel.fit("""
    Tutorial complete.


    What we build:
    1. Load ParseBench files directly from Hugging Face.
    2. Inspect benchmark dimensions and schemas.
    3. Flatten records into a dataframe.
    4. Detect linked PDFs and render sample pages when possible.
    5. Run a lightweight PyMuPDF extraction baseline.
    6. Score extracted text when reference fields are available.
    7. Generate reusable prompts for OCR, VLM, and document parser evaluation.
    """, style="bold green"))

    We inspect dataset samples and create subsets for experimentation. We generate structured prompts for evaluating external parsing systems, such as OCR and vision-language models. We also compare outputs, identify best and worst cases, and save processed data for future use.
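
    When an external OCR or VLM parser returns output for one of the prompts above, compare_parser_outputs can score it against the detected reference field. The candidate string in this sketch is a placeholder standing in for a real model response.

    # Hypothetical usage: score a placeholder parser output against the first sampled row's reference field.
    if len(subset):
        row = subset.head(1).to_dict("records")[0]
        reference = pick_first_existing(row, preferred_gt_cols)
        candidate = "Placeholder markdown output from an external parser."
        console.print(compare_parser_outputs(reference, candidate))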

    In conclusion, we built a complete workflow that allows us to analyze, evaluate, and experiment with document parsing using the ParseBench dataset. We extracted and compared text content and also generated structured prompts for testing external parsing systems, such as OCR engines and VLMs. This approach helps us move beyond simple text extraction and toward building agent-ready representations that preserve structure, layout, and semantic meaning. We also established a strong foundation that we can extend further for benchmarking, improving parsing models, and integrating document understanding into real-world AI pipelines.


    Check out the Full Codes here. Also, feel free to follow us on Twitter and don't forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.


    The post A Coding Implementation on Document Parsing Benchmarking with LlamaIndex ParseBench Using Python, Hugging Face, and Evaluation Metrics appeared first on MarkTechPost.



    Source link

    Naveed Ahmad

    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy. Read his latest articles.
