A Coding Information to Construct Superior Doc Intelligence Pipelines with Google LangExtract, OpenAI Fashions, Structured Extraction, and Interactive Visualization

On this tutorial, we discover tips on how to use Google’s LangExtract library to remodel unstructured textual content into structured, machine-readable data. We start by putting in the required dependencies and securely configuring our OpenAI API key to leverage highly effective language fashions for extraction duties. Additionally, we’ll construct a reusable extraction pipeline that permits us to course of a spread of doc sorts, together with contracts, assembly notes, product bulletins, and operational logs. By way of rigorously designed prompts and instance annotations, we exhibit how LangExtract can determine entities, actions, deadlines, dangers, and different structured attributes whereas grounding them to their precise supply spans. We additionally visualize the extracted data and manage it into tabular datasets, enabling downstream analytics, automation workflows, and decision-making methods.

!pip -q set up -U "langextract[openai]" pandas IPython


import os
import json
import textwrap
import getpass
import pandas as pd


OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY


import langextract as lx
from IPython.show import show, HTML

We set up the required libraries, together with LangExtract, Pandas, and IPython, in order that our Colab atmosphere is prepared for structured extraction duties. We securely request the OpenAI API key from the person and retailer it as an atmosphere variable for secure entry throughout runtime. We then import the core libraries wanted to run LangExtract, show outcomes, and deal with structured outputs.

MODEL_ID = "gpt-4o-mini"


def run_extraction(
   text_or_documents,
   prompt_description,
   examples,
   output_stem,
   model_id=MODEL_ID,
   extraction_passes=1,
   max_workers=4,
   max_char_buffer=1800,
):
   end result = lx.extract(
       text_or_documents=text_or_documents,
       prompt_description=prompt_description,
       examples=examples,
       model_id=model_id,
       api_key=os.environ["OPENAI_API_KEY"],
       fence_output=True,
       use_schema_constraints=False,
       extraction_passes=extraction_passes,
       max_workers=max_workers,
       max_char_buffer=max_char_buffer,
   )


   jsonl_name = f"{output_stem}.jsonl"
   html_name = f"{output_stem}.html"


   lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")
   html_content = lx.visualize(jsonl_name)


   with open(html_name, "w", encoding="utf-8") as f:
       if hasattr(html_content, "knowledge"):
           f.write(html_content.knowledge)
       else:
           f.write(html_content)


   return end result, jsonl_name, html_name


def extraction_rows(end result):
   rows = []
   for ex in end result.extractions:
       start_pos = None
       end_pos = None
       if getattr(ex, "char_interval", None):
           start_pos = ex.char_interval.start_pos
           end_pos = ex.char_interval.end_pos


       rows.append({
           "class": ex.extraction_class,
           "textual content": ex.extraction_text,
           "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
           "begin": start_pos,
           "finish": end_pos,
       })
   return pd.DataFrame(rows)


def preview_result(title, end result, html_name, max_rows=50):
   print("=" * 80)
   print(title)
   print("=" * 80)
   print(f"Complete extractions: {len(end result.extractions)}")
   df = extraction_rows(end result)
   show(df.head(max_rows))
   show(HTML(f'Open interactive visualization: {html_name}
'))

We outline the core utility capabilities that energy the complete extraction pipeline. We create a reusable run_extraction operate that sends textual content to the LangExtract engine and generates each JSONL and HTML outputs. We additionally outline helper capabilities to transform the extraction outcomes into tabular rows and preview them interactively within the pocket book.

contract_prompt = textwrap.dedent("""
Extract contract-risk data so as of look.


Guidelines:
1. Use precise textual content spans from the supply. Don't paraphrase extraction_text.
2. Extract the next lessons when current:
  - celebration
  - obligation
  - deadline
  - payment_term
  - penalty
  - termination_clause
  - governing_law
3. Add helpful attributes:
  - party_name for obligations or cost phrases when related
  - risk_level as low, medium, or excessive
  - class for the enterprise which means
4. Hold output grounded to the precise wording within the supply.
5. Don't merge non-contiguous spans into one extraction.
""")


contract_examples = [
   lx.data.ExampleData(
       text=(
           "Acme Corp shall deliver the equipment by March 15, 2026. "
           "The Client must pay within 10 days of invoice receipt. "
           "Late payment incurs a 2% monthly penalty. "
           "This agreement is governed by the laws of Ontario."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="party",
               extraction_text="Acme Corp",
               attributes={"category": "supplier", "risk_level": "low"}
           ),
           lx.data.Extraction(
               extraction_class="obligation",
               extraction_text="shall deliver the equipment",
               attributes={"party_name": "Acme Corp", "category": "delivery", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="deadline",
               extraction_text="by March 15, 2026",
               attributes={"category": "delivery_deadline", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="party",
               extraction_text="The Client",
               attributes={"category": "customer", "risk_level": "low"}
           ),
           lx.data.Extraction(
               extraction_class="payment_term",
               extraction_text="must pay within 10 days of invoice receipt",
               attributes={"party_name": "The Client", "category": "payment", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="penalty",
               extraction_text="2% monthly penalty",
               attributes={"category": "late_payment", "risk_level": "high"}
           ),
           lx.data.Extraction(
               extraction_class="governing_law",
               extraction_text="laws of Ontario",
               attributes={"category": "legal_jurisdiction", "risk_level": "low"}
           ),
       ]
   )
]


contract_text = """
BluePeak Analytics shall present a production-ready dashboard and underlying ETL pipeline no later than April 30, 2026.
North Ridge Manufacturing will remit cost inside 7 calendar days after last acceptance.
If cost is delayed past 15 days, BluePeak Analytics might droop help companies and cost curiosity at 1.5% monthly.
This Settlement shall be ruled by the legal guidelines of British Columbia.
"""


contract_result, contract_jsonl, contract_html = run_extraction(
   text_or_documents=contract_text,
   prompt_description=contract_prompt,
   examples=contract_examples,
   output_stem="contract_risk_extraction",
   extraction_passes=2,
   max_workers=4,
   max_char_buffer=1400,
)


preview_result("USE CASE 1 — Contract threat extraction", contract_result, contract_html)

We construct a contract intelligence extraction workflow by defining an in depth immediate and structured examples. We offer LangExtract with annotated training-style examples in order that it understands tips on how to determine entities corresponding to obligations, deadlines, penalties, and governing legal guidelines. We then run the extraction pipeline on a contract textual content and preview the structured risk-related outputs.

meeting_prompt = textwrap.dedent("""
Extract motion gadgets from assembly notes so as of look.


Guidelines:
1. Use precise textual content spans from the supply. No paraphrasing in extraction_text.
2. Extract these lessons when current:
  - assignee
  - action_item
  - due_date
  - blocker
  - resolution
3. Add attributes:
  - precedence as low, medium, or excessive
  - workstream when inferable from native context
  - proprietor for action_item when tied to a named assignee
4. Hold all spans grounded to the supply textual content.
5. Protect order of look.
""")


meeting_examples = [
   lx.data.ExampleData(
       text=(
           "Sarah will finalize the launch email by Friday. "
           "The team decided to postpone the webinar. "
           "Blocked by missing legal approval."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="assignee",
               extraction_text="Sarah",
               attributes={"priority": "medium", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="action_item",
               extraction_text="will finalize the launch email",
               attributes={"owner": "Sarah", "priority": "high", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="due_date",
               extraction_text="by Friday",
               attributes={"priority": "medium", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="decision",
               extraction_text="decided to postpone the webinar",
               attributes={"priority": "medium", "workstream": "events"}
           ),
           lx.data.Extraction(
               extraction_class="blocker",
               extraction_text="missing legal approval",
               attributes={"priority": "high", "workstream": "compliance"}
           ),
       ]
   )
]


meeting_text = """
Arjun will put together the revised pricing sheet by Tuesday night.
Mina to substantiate the enterprise buyer's knowledge residency necessities this week.
The group agreed to ship the pilot just for the Oman area first.
Blocked by pending safety evaluate from the consumer's IT group.
Ravi will draft the rollback plan earlier than the manufacturing cutover.
"""


meeting_result, meeting_jsonl, meeting_html = run_extraction(
   text_or_documents=meeting_text,
   prompt_description=meeting_prompt,
   examples=meeting_examples,
   output_stem="meeting_action_extraction",
   extraction_passes=2,
   max_workers=4,
   max_char_buffer=1400,
)


preview_result("USE CASE 2 — Assembly notes to motion tracker", meeting_result, meeting_html)

We design a gathering intelligence extractor that focuses on motion gadgets, selections, assignees, and blockers. We once more present instance annotations to assist the mannequin construction meet data persistently. We execute the extraction on assembly notes and show the ensuing structured activity tracker.

longdoc_prompt = textwrap.dedent("""
Extract product launch intelligence so as of look.


Guidelines:
1. Use precise textual content spans from the supply.
2. Extract:
  - firm
  - product
  - launch_date
  - area
  - metric
  - partnership
3. Add attributes:
  - class
  - significance as low, medium, or excessive
4. Hold the extraction grounded within the unique textual content.
5. Don't paraphrase the extracted span.
""")


longdoc_examples = [
   lx.data.ExampleData(
       text=(
           "Nova Robotics launched Atlas Mini in Europe on 12 January 2026. "
           "The company reported 18% faster picking speed and partnered with Helix Warehousing."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="company",
               extraction_text="Nova Robotics",
               attributes={"category": "vendor", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="product",
               extraction_text="Atlas Mini",
               attributes={"category": "product_name", "significance": "high"}
           ),
           lx.data.Extraction(
               extraction_class="region",
               extraction_text="Europe",
               attributes={"category": "market", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="launch_date",
               extraction_text="12 January 2026",
               attributes={"category": "timeline", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="metric",
               extraction_text="18% faster picking speed",
               attributes={"category": "performance_claim", "significance": "high"}
           ),
           lx.data.Extraction(
               extraction_class="partnership",
               extraction_text="partnered with Helix Warehousing",
               attributes={"category": "go_to_market", "significance": "medium"}
           ),
       ]
   )
]


long_text = """
Vertex Dynamics launched FleetSense 3.0 for industrial logistics groups throughout the GCC on 5 February 2026.
The corporate stated the discharge improves the accuracy of route deviation detection by 22% and reduces handbook evaluate time by 31%.
Within the first rollout part, the platform will help Oman and the United Arab Emirates.
Vertex Dynamics additionally partnered with Falcon Telematics to combine stay driver conduct occasions into the dashboard.


Per week later, FleetSense 3.0 added a risk-scoring module for security managers.
The replace provides supervisors a each day ranked checklist of high-risk journeys and exception occasions.
The corporate described the module as particularly priceless for oilfield transport operations and contractor fleet audits.


By late February 2026, the group introduced a pilot with Desert Haul Providers.
The pilot covers 240 heavy autos and focuses on dashing up incident triage, compliance evaluate, and proof retrieval.
Inside testing confirmed analysts might assemble evaluate packets in beneath 8 minutes as a substitute of the earlier 20 minutes.
"""


longdoc_result, longdoc_jsonl, longdoc_html = run_extraction(
   text_or_documents=long_text,
   prompt_description=longdoc_prompt,
   examples=longdoc_examples,
   output_stem="long_document_extraction",
   extraction_passes=3,
   max_workers=8,
   max_char_buffer=1000,
)


preview_result("USE CASE 3 — Lengthy-document extraction", longdoc_result, longdoc_html)


batch_docs = [
   """
   The supplier must replace defective batteries within 14 days of written notice.
   Any unresolved safety issue may trigger immediate suspension of shipments.
   """,
   """
   Priya will circulate the revised onboarding checklist tomorrow morning.
   The team approved the API deprecation plan for the legacy endpoint.
   """,
   """
   Orbit Health launched a remote triage assistant in Singapore on 14 March 2026.
   The company claims the assistant reduces nurse intake time by 17%.
   """
]


batch_prompt = textwrap.dedent("""
Extract operationally helpful spans so as of look.


Allowed lessons:
- obligation
- deadline
- penalty
- assignee
- action_item
- resolution
- firm
- product
- launch_date
- metric


Use precise textual content solely and fix a easy attribute:
- source_type
""")


batch_examples = [
   lx.data.ExampleData(
       text="Jordan will submit the report by Monday. Late delivery incurs a service credit.",
       extractions=[
           lx.data.Extraction(
               extraction_class="assignee",
               extraction_text="Jordan",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="action_item",
               extraction_text="will submit the report",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="deadline",
               extraction_text="by Monday",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="penalty",
               extraction_text="service credit",
               attributes={"source_type": "contract"}
           ),
       ]
   )
]


batch_results = []
for idx, doc in enumerate(batch_docs, begin=1):
   res, jsonl_name, html_name = run_extraction(
       text_or_documents=doc,
       prompt_description=batch_prompt,
       examples=batch_examples,
       output_stem=f"batch_doc_{idx}",
       extraction_passes=2,
       max_workers=4,
       max_char_buffer=1200,
   )
   df = extraction_rows(res)
   df.insert(0, "document_id", idx)
   batch_results.append(df)
   print(f"Completed doc {idx} -> {html_name}")


batch_df = pd.concat(batch_results, ignore_index=True)
print("nCombined batch output")
show(batch_df)


print("nContract extraction counts by class")
show(
   extraction_rows(contract_result)
   .groupby("class", as_index=False)
   .measurement()
   .sort_values("measurement", ascending=False)
)


print("nMeeting motion gadgets solely")
meeting_df = extraction_rows(meeting_result)
show(meeting_df[meeting_df["class"] == "action_item"])


print("nLong-document metrics solely")
longdoc_df = extraction_rows(longdoc_result)
show(longdoc_df[longdoc_df["class"] == "metric"])


final_df = pd.concat([
   extraction_rows(contract_result).assign(use_case="contract_risk"),
   extraction_rows(meeting_result).assign(use_case="meeting_actions"),
   extraction_rows(longdoc_result).assign(use_case="long_document"),
], ignore_index=True)


final_df.to_csv("langextract_tutorial_outputs.csv", index=False)
print("nSaved CSV: langextract_tutorial_outputs.csv")


print("nGenerated information:")
for identify in [
   contract_jsonl, contract_html,
   meeting_jsonl, meeting_html,
   longdoc_jsonl, longdoc_html,
   "langextract_tutorial_outputs.csv"
]:
   print(" -", identify)

We implement a long-document intelligence pipeline able to extracting structured insights from giant narrative textual content. We run the extraction throughout product launch experiences and operational paperwork, and in addition exhibit batch processing throughout a number of paperwork. We additionally analyze the extracted outcomes, filter key lessons, and export the structured dataset to a CSV file for downstream evaluation.

In conclusion, we constructed a sophisticated LangExtract workflow that converts complicated textual content paperwork into structured datasets with traceable supply grounding. We ran a number of extraction situations, together with contract threat evaluation, assembly motion monitoring, long-document intelligence extraction, and batch processing throughout a number of paperwork. We additionally visualized the extractions and exported the ultimate structured outcomes right into a CSV file for additional evaluation. By way of this course of, we noticed how immediate design, example-based extraction, and scalable processing methods enable us to construct strong data extraction methods with minimal code.

Try the Full Codes here. Additionally, be happy to comply with us on Twitter and don’t neglect to hitch our 120k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Must associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so forth.? Connect with us

Source link

A Coding Information to Construct Superior Doc Intelligence Pipelines with Google LangExtract, OpenAI Fashions, Structured Extraction, and Interactive Visualization

Contained in the Professional-Iran Meme Machine Trolling Trump With AI Lego Cartoons

Ex-Tesla engineer’s startup faucets Pronto to assist automate a copper mine

Amazon to finish help for older Kindle units

A Coding Information to Construct Superior Doc Intelligence Pipelines with Google LangExtract, OpenAI Fashions, Structured Extraction, and Interactive Visualization

Related Posts

Contained in the Professional-Iran Meme Machine Trolling Trump With AI Lego Cartoons

Ex-Tesla engineer’s startup faucets Pronto to assist automate a copper mine

Amazon to finish help for older Kindle units