
PMCGrab

Structured PMC context for biomedical RAG.

Parse PubMed Central and JATS XML into clean, section-aware JSON your retrieval layer can trust.

PMC ID fetch · Local JATS XML · CLI + Python API


[Figure: PMCGrab article parsing workflow]


A PMC ID in. Clean article JSON out.

uv add pmcgrab

from pmcgrab import process_single_pmc

article = process_single_pmc("7181753")

print(article["paper"]["title"])
print([section["title"] for section in article["paper"]["body"]])

PMCGrab turns a PMC ID or local JATS XML file into loss-aware article data you can store, chunk, embed, inspect, audit, or pass to the next system.
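
As a rough illustration of the "chunk" step, the sketch below splits an article into chunks that never cross a section boundary and tags each chunk with its section title. It runs on a hand-written dictionary rather than a live fetch. Only `article["paper"]["title"]`, `article["paper"]["body"]`, and `section["title"]` come from the example above; the `text` key, the `section_chunks` helper, and the `max_chars` parameter are assumptions made for this sketch, so check the real output shape before relying on them.

```python
from typing import Iterator

# Hand-written stand-in for the dict returned by process_single_pmc.
# The "text" key on each section is an assumption for this sketch,
# not a documented PMCGrab field.
article = {
    "paper": {
        "title": "Example article",
        "body": [
            {"title": "Methods", "text": "We sequenced 40 samples. Reads were aligned."},
            {"title": "Discussion", "text": "The results suggest a shared pathway."},
        ],
    }
}


def section_chunks(article: dict, max_chars: int = 40) -> Iterator[dict]:
    """Yield fixed-size chunks that stay inside one section each."""
    paper = article["paper"]
    for section in paper["body"]:
        text = section.get("text", "")
        # Slice per section so a chunk never mixes Methods with Discussion.
        for start in range(0, len(text), max_chars):
            yield {
                "article_title": paper["title"],
                "section": section["title"],
                "text": text[start:start + max_chars],
            }


chunks = list(section_chunks(article))
for chunk in chunks:
    print(chunk["section"], "|", chunk["text"])
```

Character-based slicing is deliberately crude here; a real pipeline would more likely split on sentences or tokens. The point is only that the section label travels with every chunk.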

Why it exists

Biomedical AI fails quietly when the context layer is messy.

If retrieval cannot tell Methods from Discussion, the model gets the wrong evidence with confidence. If a parser drops captions, identifiers, equations, supplements, or licensing metadata, the downstream system inherits that loss and still calls it data.

PMCGrab is a narrow tool for one boundary: PMC and JATS article sources in, clean Python objects and JSON out.

Choose your path

  • Start from zero: Install PMCGrab, fetch your first article, and inspect the JSON shape.

  • Move fast: Use the shortest path if you already know Python packaging and PMC IDs.

  • Use the CLI: Turn PMC IDs, PMIDs, DOIs, ID files, or local XML files into JSON output.

  • Process bulk XML: Parse pre-downloaded PMC/JATS XML from disk without repeated network calls.

  • Read the API: Use Paper, process_single_pmc, and local XML helpers directly.

  • Check the contract: Understand the normalized JSON groups before wiring a pipeline around them.

What you get

  • Schema V4 JSON with article metadata, contributors, content, assets, relations, quality, and provenance.
  • Loss-aware body and abstract parsing for paragraphs, nested sections, lists, definition lists, boxed text, formulas, figures, tables, supplements, and unknown JATS blocks.
  • No raw XML payloads in output JSON; traceability lives in structured source metadata.
  • Two ingestion paths: fetch by PMC ID from NCBI, or parse local JATS XML from a repeatable corpus build.
  • A Python API and CLI that share the same output contract.
  • Release checks built around real use: local XML E2E, opt-in live NCBI E2E, wheel smoke install, CLI tests, parser regressions, and JSON serialization.
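
To make "section-aware" and "loss-aware" concrete, here is a minimal standard-library sketch of the idea. It is not PMCGrab's parser and does not reflect its Schema V4 output. It walks nested JATS `<sec>` elements and records unrecognized child elements instead of silently dropping them. The output keys (`paragraphs`, `sections`, `unknown`), the `parse_sec` helper, and the `<custom-block>` element are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny JATS-like fragment. <custom-block> is a made-up element that
# stands in for any block the parser does not recognize.
JATS_SNIPPET = """
<article>
  <body>
    <sec>
      <title>Methods</title>
      <p>Samples were collected in 2019.</p>
      <sec>
        <title>Statistics</title>
        <p>We used a two-sided test.</p>
        <custom-block>Vendor-specific content.</custom-block>
      </sec>
    </sec>
  </body>
</article>
""".strip()

KNOWN_BLOCKS = {"title", "p", "sec"}


def parse_sec(sec: ET.Element) -> dict:
    """Convert one <sec> into a nested dict, keeping unrecognized children."""
    node = {
        "title": sec.findtext("title", default=""),
        "paragraphs": [],
        "sections": [],
        "unknown": [],
    }
    for child in sec:
        if child.tag == "p":
            node["paragraphs"].append("".join(child.itertext()).strip())
        elif child.tag == "sec":
            # Recurse so nested sections keep their place in the hierarchy.
            node["sections"].append(parse_sec(child))
        elif child.tag not in KNOWN_BLOCKS:
            # Record what was not understood rather than discarding it.
            node["unknown"].append(
                {"tag": child.tag, "text": "".join(child.itertext()).strip()}
            )
    return node


root = ET.fromstring(JATS_SNIPPET)
sections = [parse_sec(sec) for sec in root.findall("./body/sec")]
print(sections[0]["title"], "->", sections[0]["sections"][0]["title"])
```

Keeping an explicit `unknown` bucket is one simple way to make loss visible: a downstream audit can count unrecognized blocks per article instead of assuming the parse was complete.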

What it is not

PMCGrab is not a PDF parser, paywalled full-text scraper, clinical tool, or general web crawler.

It is infrastructure for biomedical literature context. Small scope, clear boundary, useful output.