    A Coding Implementation on Document Parsing Benchmarking with LlamaIndex ParseBench Using Python, Hugging Face, and Evaluation Metrics

    By Naveed Ahmad · 29/04/2026 · 10 Mins Read


    In this tutorial, we explore how to use the ParseBench dataset to evaluate document parsing systems in a structured, practical way. We begin by loading the dataset directly from Hugging Face, inspecting its multiple dimensions, such as text, tables, charts, and layout, and transforming it into a unified dataframe for deeper analysis. As we progress, we identify key fields, detect linked PDFs, and build a lightweight baseline using PyMuPDF to extract and compare text. Throughout the process, we focus on creating a flexible pipeline that allows us to understand the dataset schema, evaluate parsing quality, and prepare inputs for more advanced OCR or vision-language models.

    !pip install -q -U datasets huggingface_hub pandas matplotlib rich pymupdf rapidfuzz tqdm


    import json, re, textwrap, random, math
    from pathlib import Path
    from collections import Counter
    import pandas as pd
    import matplotlib.pyplot as plt
    from tqdm.auto import tqdm
    from rich.console import Console
    from rich.table import Table
    from rich.panel import Panel
    from huggingface_hub import hf_hub_download, list_repo_files
    from rapidfuzz import fuzz
    import fitz


    console = Console()
    DATASET_ID = "llamaindex/ParseBench"
    WORKDIR = Path("/content/parsebench_tutorial")
    WORKDIR.mkdir(parents=True, exist_ok=True)


    console.print(Panel.fit("Advanced ParseBench Tutorial on Google Colab", style="bold green"))


    files = list_repo_files(DATASET_ID, repo_type="dataset")
    jsonl_files = [f for f in files if f.endswith(".jsonl")]
    pdf_files = [f for f in files if f.endswith(".pdf")]


    console.print(f"Found {len(jsonl_files)} JSONL files")
    console.print(f"Found {len(pdf_files)} PDF files")


    table = Table(title="ParseBench JSONL Files")
    table.add_column("File")
    table.add_column("Size")
    for f in jsonl_files:
        table.add_row(f, Path(f).stem)
    console.print(table)

    We install all required libraries and set up our working environment for the tutorial. We initialize the dataset source and prepare a workspace to store all outputs. We also fetch and list all JSONL and PDF files from the ParseBench repository to understand the dataset structure.
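
    One quirk worth noting: the table above fills its Size column with the file stem rather than a real size. If actual byte counts are useful, the repository tree can be queried directly; the snippet below is an optional sketch and assumes a recent huggingface_hub release that exposes HfApi.list_repo_tree.

    from huggingface_hub import HfApi

    # Optional sketch: report real file sizes for the JSONL files.
    # Assumes a recent huggingface_hub version that provides HfApi.list_repo_tree.
    api = HfApi()
    tree = api.list_repo_tree(DATASET_ID, repo_type="dataset", recursive=True)
    size_by_path = {entry.path: entry.size for entry in tree if hasattr(entry, "size")}

    for f in jsonl_files:
        size_kb = (size_by_path.get(f) or 0) / 1024
        console.print(f"{f}: {size_kb:.1f} KB")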

    def load_jsonl_from_hf(filename, max_rows=None):
        path = hf_hub_download(repo_id=DATASET_ID, filename=filename, repo_type="dataset")
        rows = []
        with open(path, "r", encoding="utf-8") as fp:
            for i, line in enumerate(fp):
                if max_rows and i >= max_rows:
                    break
                line = line.strip()
                if line:
                    rows.append(json.loads(line))
        return rows, path


    def flatten_dict(d, parent_key="", sep="."):
        items = {}
        if isinstance(d, dict):
            for k, v in d.items():
                new_key = f"{parent_key}{sep}{k}" if parent_key else str(k)
                if isinstance(v, dict):
                    items.update(flatten_dict(v, new_key, sep=sep))
                else:
                    items[new_key] = v
        return items


    dimension_data = {}
    for jf in jsonl_files:
        rows, local_path = load_jsonl_from_hf(jf)
        dimension_data[Path(jf).stem] = rows
        console.print(f"{jf}: {len(rows)} examples loaded")


    summary_rows = []
    for dim, rows in dimension_data.items():
        keys = Counter()
        for r in rows[:100]:
            keys.update(flatten_dict(r).keys())
        summary_rows.append({
            "dimension": dim,
            "examples": len(rows),
            "top_fields": ", ".join([k for k, _ in keys.most_common(12)])
        })


    summary_df = pd.DataFrame(summary_rows)
    display(summary_df)


    plt.figure(figsize=(10, 5))
    plt.bar(summary_df["dimension"], summary_df["examples"])
    plt.title("ParseBench Examples by Dimension")
    plt.xlabel("Dimension")
    plt.ylabel("Number of Examples")
    plt.xticks(rotation=30, ha="right")
    plt.show()


    for dim, rows in dimension_data.items():
        console.print(Panel.fit(f"Sample schema for {dim}", style="bold cyan"))
        if rows:
            console.print(json.dumps(rows[0], indent=2)[:3000])

    We load the JSONL files from the dataset and convert them into usable Python objects. We flatten nested structures so that we can analyze them easily in a tabular format. We also summarize each dimension and visualize the distribution of examples across different parsing tasks.
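
    To make the flattening step concrete, here is a tiny sanity check; the nested record is invented for illustration and is not part of ParseBench.

    # Invented record (not from ParseBench) to illustrate how flatten_dict builds dotted keys.
    toy = {"doc": {"path": "sample.pdf", "meta": {"pages": 3}}, "text": "hello"}
    print(flatten_dict(toy))
    # {'doc.path': 'sample.pdf', 'doc.meta.pages': 3, 'text': 'hello'}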

    all_records = []
    for dim, rows in dimension_data.items():
        for i, r in enumerate(rows):
            flat = flatten_dict(r)
            flat["_dimension"] = dim
            flat["_row_id"] = i
            all_records.append(flat)


    df = pd.DataFrame(all_records)
    console.print(f"Combined dataframe shape: {df.shape}")
    display(df.head())


    missing_report = []
    for col in df.columns:
        missing_report.append({
            "column": col,
            "non_null": int(df[col].notna().sum()),
            "missing": int(df[col].isna().sum()),
            "coverage_pct": round(100 * df[col].notna().mean(), 2)
        })


    missing_df = pd.DataFrame(missing_report).sort_values("coverage_pct", ascending=False)
    display(missing_df.head(40))


    def find_candidate_columns(df, keywords):
        cols = []
        for c in df.columns:
            lc = c.lower()
            if any(k.lower() in lc for k in keywords):
                cols.append(c)
        return cols


    doc_cols = find_candidate_columns(df, ["doc", "pdf", "file", "path", "source", "image"])
    text_cols = find_candidate_columns(df, ["text", "content", "markdown", "ground", "answer", "expected", "target", "reference"])
    rule_cols = find_candidate_columns(df, ["rule", "check", "assert", "criteria", "question", "prompt"])
    bbox_cols = find_candidate_columns(df, ["bbox", "box", "polygon", "coordinates", "layout"])


    console.print("[bold]Possible document columns:[/bold]", doc_cols[:30])
    console.print("[bold]Possible text/reference columns:[/bold]", text_cols[:30])
    console.print("[bold]Possible rule/question columns:[/bold]", rule_cols[:30])
    console.print("[bold]Possible layout columns:[/bold]", bbox_cols[:30])

    We combine all parsed records into a single dataframe for unified analysis. We evaluate missing values and identify which fields are most informative across the dataset. We also detect candidate columns related to documents, text, rules, and layout to guide downstream processing.
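
    If we want to know which of these candidate fields actually appear in which dimension, rather than looking at overall coverage only, a small per-dimension aggregation over the combined dataframe makes that explicit. This is a minimal sketch that reuses df and the _dimension marker added above; the column subset is only illustrative.

    # Per-dimension coverage (fraction of non-null values) for a few candidate columns.
    # The column selection is illustrative; adjust it to the fields you care about.
    cols_of_interest = (doc_cols + text_cols)[:10]
    if cols_of_interest:
        coverage = df.groupby("_dimension")[cols_of_interest].agg(
            lambda s: round(float(s.notna().mean()), 2)
        )
        display(coverage)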

    def pick_first_existing(row, candidates):
        for c in candidates:
            if c in row and pd.notna(row[c]):
                value = row[c]
                if isinstance(value, str) and value.strip():
                    return value
                if not isinstance(value, str):
                    return value
        return None


    def normalize_text(x):
        if x is None or (isinstance(x, float) and math.isnan(x)):
            return ""
        x = str(x)
        x = re.sub(r"\s+", " ", x)
        return x.strip().lower()


    def simple_text_similarity(a, b):
        a = normalize_text(a)
        b = normalize_text(b)
        if not a or not b:
            return None
        return fuzz.token_set_ratio(a, b) / 100


    def locate_pdf_path(value):
        if value is None:
            return None
        value = str(value)
        candidates = []
        if value.endswith(".pdf"):
            candidates.append(value)
            candidates.extend([f for f in pdf_files if f.endswith(value.split("/")[-1])])
        else:
            candidates.extend([
                f for f in pdf_files
                if value in f or Path(f).stem in value or value in Path(f).stem
            ])
        return candidates[0] if candidates else None


    def extract_pdf_text_from_hf(pdf_repo_path, max_pages=2):
        local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
        doc = fitz.open(local_pdf)
        texts = []
        for page_idx in range(min(max_pages, len(doc))):
            texts.append(doc[page_idx].get_text("text"))
        doc.close()
        return "\n".join(texts), local_pdf


    def render_pdf_first_page(pdf_repo_path, zoom=2):
        local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
        doc = fitz.open(local_pdf)
        page = doc[0]
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        out_path = WORKDIR / (Path(pdf_repo_path).stem + "_page1.png")
        pix.save(out_path)
        doc.close()
        return out_path


    sample_records = df.sample(min(25, len(df)), random_state=42).to_dict("records")
    pdf_candidates = []


    for row in sample_records:
        for c in doc_cols:
            pdf_path = locate_pdf_path(row.get(c))
            if pdf_path:
                pdf_candidates.append((row["_dimension"], row["_row_id"], pdf_path))
                break


    pdf_candidates = list(dict.fromkeys(pdf_candidates))
    console.print(f"Detected {len(pdf_candidates)} PDF-linked sampled records")


    if pdf_candidates:
        dim, row_id, pdf_path = pdf_candidates[0]
        console.print(Panel.fit(f"Rendering sample PDF\nDimension: {dim}\nRow: {row_id}\nPDF: {pdf_path}", style="bold yellow"))
        image_path = render_pdf_first_page(pdf_path)
        img = plt.imread(image_path)
        plt.figure(figsize=(10, 12))
        plt.imshow(img)
        plt.axis("off")
        plt.title(f"{dim}: {Path(pdf_path).name}")
        plt.show()
    else:
        console.print("[yellow]No PDF-linked rows were detected from the sample.[/yellow]")

    We define helper functions for text normalization, similarity scoring, and PDF handling. We locate and download PDF files associated with dataset entries and extract their text content. We also render a sample PDF page for visual inspection of the document structure.
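
    As a quick sanity check on the scoring helpers before running them over the dataset, the toy strings below (invented for illustration) show that normalization collapses whitespace and casing, so only genuine content differences lower the score.

    # Toy strings (not from the dataset) to sanity-check normalize_text and simple_text_similarity.
    ref = "Quarterly Revenue:\n  $1.2M  "
    cand = "quarterly revenue: $1.2m"
    print(normalize_text(ref))                       # "quarterly revenue: $1.2m"
    print(simple_text_similarity(ref, cand))         # 1.0 after normalization
    print(simple_text_similarity(ref, "unrelated"))  # noticeably lower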

    preferred_gt_cols = [
        c for c in text_cols
        if any(k in c.lower() for k in ["ground", "expected", "target", "answer", "content", "text", "markdown", "reference"])
    ]


    evaluation_rows = []
    eval_sample = df.sample(min(50, len(df)), random_state=7).to_dict("records")


    for row in tqdm(eval_sample, desc="Running lightweight PDF text extraction baseline"):
        pdf_path = None
        for c in doc_cols:
            pdf_path = locate_pdf_path(row.get(c))
            if pdf_path:
                break


        if not pdf_path:
            evaluation_rows.append({
                "dimension": row.get("_dimension"),
                "row_id": row.get("_row_id"),
                "pdf": None,
                "ground_truth_column": None,
                "similarity_score": None,
                "status": "no_pdf_detected"
            })
            continue


        gt_col = None
        gt = None
        for c in preferred_gt_cols:
            if c in row and pd.notna(row[c]):
                gt_col = c
                gt = row[c]
                break


        if gt is None:
            evaluation_rows.append({
                "dimension": row.get("_dimension"),
                "row_id": row.get("_row_id"),
                "pdf": pdf_path,
                "ground_truth_column": None,
                "similarity_score": None,
                "status": "no_reference_detected"
            })
            continue


        try:
            extracted, local_pdf = extract_pdf_text_from_hf(pdf_path, max_pages=2)
            score = simple_text_similarity(extracted, gt)
            evaluation_rows.append({
                "dimension": row.get("_dimension"),
                "row_id": row.get("_row_id"),
                "pdf": pdf_path,
                "ground_truth_column": gt_col,
                "similarity_score": score,
                "extracted_chars": len(extracted),
                "ground_truth_chars": len(str(gt)),
                "status": "scored"
            })
        except Exception as e:
            evaluation_rows.append({
                "dimension": row.get("_dimension"),
                "row_id": row.get("_row_id"),
                "pdf": pdf_path,
                "ground_truth_column": gt_col,
                "similarity_score": None,
                "status": "error",
                "error": str(e)
            })


    eval_df = pd.DataFrame(evaluation_rows)


    if eval_df.empty:
        eval_df = pd.DataFrame(columns=[
            "dimension", "row_id", "pdf", "ground_truth_column",
            "similarity_score", "extracted_chars", "ground_truth_chars",
            "status", "error"
        ])


    display(eval_df.head(30))


    if "status" in eval_df.columns:
        display(eval_df["status"].value_counts().reset_index().rename(columns={"index": "status", "status": "count"}))


    if not eval_df.empty and "similarity_score" in eval_df.columns:
        valid_eval = eval_df.dropna(subset=["similarity_score"])


        if len(valid_eval):
            console.print(f"Average lightweight text similarity: {valid_eval['similarity_score'].mean():.3f}")


            plt.figure(figsize=(8, 5))
            plt.hist(valid_eval["similarity_score"], bins=10)
            plt.title("Lightweight Baseline Similarity Distribution")
            plt.xlabel("RapidFuzz Token Set Similarity")
            plt.ylabel("Count")
            plt.show()


            per_dim = valid_eval.groupby("dimension")["similarity_score"].mean().reset_index()
            display(per_dim)


            plt.figure(figsize=(9, 5))
            plt.bar(per_dim["dimension"], per_dim["similarity_score"])
            plt.title("Average Baseline Similarity by Dimension")
            plt.xlabel("Dimension")
            plt.ylabel("Average Similarity")
            plt.xticks(rotation=30, ha="right")
            plt.show()
        else:
            console.print("[yellow]No valid similarity scores were produced. This usually means the sampled rows did not contain both detectable PDFs and reference text.[/yellow]")
    else:
        console.print("[yellow]No similarity_score column found.[/yellow]")

    We run a lightweight evaluation pipeline by comparing extracted text with available reference fields. We compute similarity scores and analyze how well simple extraction performs across different dimensions. We also visualize the results to understand performance trends and limitations.
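
    To keep these baseline numbers around for later comparison against stronger parsers, one option is to persist eval_df alongside a rough length-ratio diagnostic. This is a minimal sketch; the output filename is our own choice, and the ratio simply flags extractions that are much shorter or longer than the reference.

    # Persist the baseline results plus a rough length-ratio diagnostic.
    # The filename is arbitrary; values near 1.0 mean extraction length roughly matches the reference.
    if not eval_df.empty and "extracted_chars" in eval_df.columns:
        eval_df["length_ratio"] = (
            eval_df["extracted_chars"] / eval_df["ground_truth_chars"].clip(lower=1)
        )
    eval_out = WORKDIR / "parsebench_baseline_eval.csv"
    eval_df.to_csv(eval_out, index=False)
    console.print(f"Saved baseline evaluation to: {eval_out}")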

    def inspect_dimension(dimension_name, n=3):
        rows = dimension_data.get(dimension_name, [])
        console.print(Panel.fit(f"Inspecting {dimension_name}: {len(rows)} rows", style="bold magenta"))
        for idx, row in enumerate(rows[:n]):
            console.print(f"\n[bold]Example {idx}[/bold]")
            console.print(json.dumps(row, indent=2)[:2500])


    for dim in list(dimension_data.keys())[:5]:
        inspect_dimension(dim, n=1)


    def make_parsebench_subset(dimension=None, n=20, seed=123):
        subset = df.copy()
        if dimension:
            subset = subset[subset["_dimension"] == dimension]
        if len(subset) == 0:
            return subset
        return subset.sample(min(n, len(subset)), random_state=seed)


    subset = make_parsebench_subset(n=20)
    display(subset.head())


    def create_llm_parser_prompt(row):
        dimension = row.get("_dimension", "unknown")
        candidate_truth = pick_first_existing(row, preferred_gt_cols)
        rule_hint = pick_first_existing(row, rule_cols)


        prompt = f"""
    You are evaluating a document parser on ParseBench.


    Dimension:
    {dimension}


    Task:
    Parse the PDF page into a structured representation that preserves the information needed for agentic workflows.


    Relevant benchmark hint or rule:
    {rule_hint if rule_hint is not None else "No obvious rule field detected."}


    Reference field preview:
    {str(candidate_truth)[:1000] if candidate_truth is not None else "No obvious reference field detected."}


    Return:
    1. Markdown representation
    2. Extracted tables as JSON arrays when tables exist
    3. Extracted chart values as JSON when charts exist
    4. Layout-sensitive notes when visual grounding matters
    """
        return textwrap.dedent(prompt).strip()


    prompt_examples = []
    if len(subset):
        for _, row in subset.head(3).iterrows():
            prompt_examples.append(create_llm_parser_prompt(row.to_dict()))


    if prompt_examples:
        console.print(Panel.fit("Example prompt for testing an external OCR or VLM parser", style="bold blue"))
        console.print(prompt_examples[0])
    else:
        console.print("[yellow]No prompt examples could be created because the subset is empty.[/yellow]")


    def compare_parser_outputs(reference, candidate):
        return {
            "token_set_similarity": simple_text_similarity(reference, candidate),
            "partial_ratio": fuzz.partial_ratio(normalize_text(reference), normalize_text(candidate)) / 100 if reference and candidate else None,
            "candidate_length": len(str(candidate)) if candidate else 0,
            "reference_length": len(str(reference)) if reference else 0
        }


    if not eval_df.empty and "similarity_score" in eval_df.columns:
        scored_eval = eval_df.dropna(subset=["similarity_score"])


        if len(scored_eval):
            best = scored_eval.sort_values("similarity_score", ascending=False).head(1)
            worst = scored_eval.sort_values("similarity_score", ascending=True).head(1)


            console.print(Panel.fit("Best lightweight baseline example", style="bold green"))
            display(best)


            console.print(Panel.fit("Worst lightweight baseline example", style="bold red"))
            display(worst)
        else:
            console.print("[yellow]No valid similarity scores were available for best/worst comparison.[/yellow]")


    output_path = WORKDIR / "parsebench_flattened_sample.csv"
    df.head(500).to_csv(output_path, index=False)
    console.print(f"Saved flattened sample to: {output_path}")


    console.print(Panel.fit("""
    Tutorial complete.


    What we build:
    1. Load ParseBench files directly from Hugging Face.
    2. Inspect benchmark dimensions and schemas.
    3. Flatten records into a dataframe.
    4. Detect linked PDFs and render sample pages when possible.
    5. Run a lightweight PyMuPDF extraction baseline.
    6. Score extracted text when reference fields are available.
    7. Generate reusable prompts for OCR, VLM, and document parser evaluation.
    """, style="bold green"))

    We inspect dataset samples and create subsets for experimentation. We generate structured prompts for evaluating external parsing systems, such as OCR and vision-language models. We also compare outputs, identify best and worst cases, and save processed data for future use.
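
    When an external OCR or VLM parser returns output for one of the prompts above, compare_parser_outputs can score it against the detected reference field. The candidate string in this sketch is a placeholder standing in for a real model response.

    # Hypothetical usage: score a placeholder parser output against the first sampled row's reference field.
    if len(subset):
        row = subset.head(1).to_dict("records")[0]
        reference = pick_first_existing(row, preferred_gt_cols)
        candidate = "Placeholder markdown output from an external parser."
        console.print(compare_parser_outputs(reference, candidate))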

    In conclusion, we built a complete workflow that allows us to analyze, evaluate, and experiment with document parsing using the ParseBench dataset. We extracted and compared text content and also generated structured prompts for testing external parsing systems, such as OCR engines and VLMs. This approach helps us move beyond simple text extraction and toward building agent-ready representations that preserve structure, layout, and semantic meaning. We also established a strong foundation that we can extend further for benchmarking, improving parsing models, and integrating document understanding into real-world AI pipelines.


    Check out the Full Codes here. Also, feel free to follow us on Twitter and don't forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.


    The post A Coding Implementation on Document Parsing Benchmarking with LlamaIndex ParseBench Using Python, Hugging Face, and Evaluation Metrics appeared first on MarkTechPost.



    Source link

    Naveed Ahmad

    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy. Read his latest articles.
