
Output Format

PMCGrab emits clean paper JSON by default. The default schema is optimized for signal-to-noise: the paper itself, plus figures/images and tables, without parser trace metadata, diagnostics, bibliography records, contributor metadata, or provenance.

Use --full-json in the CLI, or output_style="full" in Python, when you need the metadata-rich V4 contract. V2 and V3 remain available only through the full output path.

Default Paper Shape

{
  "schema": "pmcgrab.paper.v1",
  "has_data": true,
  "identifiers": {
    "pmcid": "PMC7181753",
    "pmid": "32327715",
    "doi": "10.1038/s42003-020-0922-4"
  },
  "paper": {
    "title": "Single-cell transcriptomes of the human skin reveal ...",
    "abstract": [
      {
        "title": "Abstract",
        "kind": "primary",
        "content": [{ "type": "paragraph", "text": "..." }],
        "sections": []
      }
    ],
    "body": [
      {
        "id": "s1",
        "title": "Introduction",
        "content": [{ "type": "paragraph", "text": "..." }],
        "sections": []
      }
    ]
  },
  "assets": {
    "images": [],
    "tables": []
  }
}

Core Groups

Group        Notes
identifiers  Minimal paper identifiers: PMCID, PMID, and DOI.
paper        Title, structured abstract, and nested body section tree.
assets       Clean image/figure records and clean table records.
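
Because the body is a nested section tree, a short recursive walk is enough to print an outline of a paper. The following is a minimal sketch, not part of PMCGrab: the helper name outline and the sample data are hypothetical, and the code relies only on the "title" and "sections" keys shown in the default paper shape.

```python
def outline(sections, depth=0):
    """Yield (depth, title) pairs for a section tree, depth-first."""
    for section in sections:
        yield depth, section.get("title") or "(untitled)"
        # Child sections use the same shape, so recurse one level deeper.
        yield from outline(section.get("sections", []), depth + 1)


# Hypothetical sample shaped like the "body" array in the default schema.
sample_body = [
    {
        "id": "s1",
        "title": "Introduction",
        "content": [{"type": "paragraph", "text": "..."}],
        "sections": [],
    },
    {
        "id": "s2",
        "title": "Methods",
        "content": [],
        "sections": [
            {"id": "s2a", "title": "Sample collection", "content": [], "sections": []}
        ],
    },
]

# Indent each title by two spaces per nesting level.
outline_lines = ["  " * depth + title for depth, title in outline(sample_body)]
print("\n".join(outline_lines))
```

With real output, pass data["paper"]["body"] in place of sample_body.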

Body and abstract content is emitted as readable typed blocks. Supported block types include paragraph, list, definition_list, formula, quote, boxed_text, code, preformat, figure_ref, table_ref, supplementary_ref, and unknown_block.
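
Before writing per-type handling, it can help to see which block types a given paper actually contains. The sketch below tallies them across the section tree. The function name count_block_types and the sample data are hypothetical; only the "type" key is assumed to exist on every block, since the fields of non-paragraph blocks are not described here.

```python
from collections import Counter


def count_block_types(sections):
    """Count blocks by their "type" value across a nested section tree."""
    counts = Counter()
    for section in sections:
        for block in section.get("content", []):
            counts[block["type"]] += 1
        # Merge in the counts from nested subsections.
        counts.update(count_block_types(section.get("sections", [])))
    return counts


# Hypothetical sample; non-paragraph blocks carry only "type" because
# their other fields are not documented in this section.
sample_body = [
    {
        "title": "Results",
        "content": [
            {"type": "paragraph", "text": "..."},
            {"type": "figure_ref"},
        ],
        "sections": [
            {
                "title": "Clustering",
                "content": [
                    {"type": "paragraph", "text": "..."},
                    {"type": "list"},
                ],
                "sections": [],
            }
        ],
    }
]

block_counts = count_block_types(sample_body)
print(block_counts.most_common())
```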

Images And Tables

When image fetching is enabled with --with-images or process_single_pmc_with_assets(), clean image records include local file paths:

{
  "id": "f1",
  "label": "Figure 1",
  "caption": "...",
  "files": [
    {
      "href": "fig1.jpg",
      "local_path": "images/fig1.jpg",
      "status": "downloaded",
      "mime_type": "image/jpeg"
    }
  ]
}
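
To collect the files that were actually saved, one option is to filter image records on the "status" and "local_path" fields shown above. This is a minimal sketch with a hypothetical helper name, downloaded_image_paths. It assumes "downloaded" is the status that marks a usable local file, as in the example record. Other status values are not listed in this section, so the code simply skips anything else.

```python
def downloaded_image_paths(images):
    """Return (image_id, local_path) pairs for files marked as downloaded."""
    paths = []
    for image in images:
        for file_info in image.get("files", []):
            # Keep only entries that report a completed download and a path.
            if file_info.get("status") == "downloaded" and file_info.get("local_path"):
                paths.append((image.get("id"), file_info["local_path"]))
    return paths


# Sample records: the first mirrors the example above, the second is a
# hypothetical figure with no fetched files.
sample_images = [
    {
        "id": "f1",
        "label": "Figure 1",
        "caption": "...",
        "files": [
            {
                "href": "fig1.jpg",
                "local_path": "images/fig1.jpg",
                "status": "downloaded",
                "mime_type": "image/jpeg",
            }
        ],
    },
    {"id": "f2", "label": "Figure 2", "caption": "...", "files": []},
]

image_paths = downloaded_image_paths(sample_images)
print(image_paths)
```

With real output, pass data["assets"]["images"] in place of sample_images.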

Tables keep one canonical row representation:

{
  "id": "t1",
  "label": "Table 1",
  "caption": "...",
  "columns": ["A", "B"],
  "rows": [{ "A": "1", "B": "2" }],
  "footnotes": []
}
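
Since each row is a mapping keyed by column name, a table record can be passed almost directly to Python's csv.DictWriter. The sketch below assumes every row uses only keys listed in "columns", as in the example above. The helper name table_to_csv is hypothetical and not part of PMCGrab.

```python
import csv
import io


def table_to_csv(table):
    """Serialize a clean table record to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=table["columns"],
        lineterminator="\n",
    )
    writer.writeheader()
    # DictWriter raises ValueError if a row has a key outside "columns".
    writer.writerows(table["rows"])
    return buffer.getvalue()


# Sample record copied from the table shape shown above.
sample_table = {
    "id": "t1",
    "label": "Table 1",
    "caption": "...",
    "columns": ["A", "B"],
    "rows": [{"A": "1", "B": "2"}],
    "footnotes": [],
}

csv_text = table_to_csv(sample_table)
print(csv_text)
```

Captions and footnotes are left out of the CSV here. Whether to write them alongside depends on the downstream consumer.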

Access Article Data

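from pmcgrab import process_single_pmc

# Returns the default paper JSON shape as a dict.
data = process_single_pmc("7181753")

# Skip the lookups if nothing came back for this ID.
if data:
    print(data["paper"]["title"])
    print(data["identifiers"]["doi"])
    # First 300 characters of the first abstract block.
    print(data["paper"]["abstract"][0]["content"][0]["text"][:300])
    # Titles of the top-level body sections.
    print([section["title"] for section in data["paper"]["body"]])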

Prepare Vector Chunks

def iter_text_blocks(sections):
    for section in sections:
        for block in section["content"]:
            if block["type"] == "paragraph":
                yield section, block
        yield from iter_text_blocks(section["sections"])


def prepare_for_vector_db(data):
    chunks = []
    metadata_base = {
        "pmcid": data["identifiers"]["pmcid"],
        "doi": data["identifiers"]["doi"],
    }

    for abstract_section in data["paper"]["abstract"]:
        for block in abstract_section["content"]:
            if block["type"] == "paragraph":
                chunks.append(
                    {
                        "content": block["text"],
                        "metadata": {
                            **metadata_base,
                            "type": "abstract",
                            "section": abstract_section["title"],
                        },
                    }
                )

    for section, block in iter_text_blocks(data["paper"]["body"]):
        chunks.append(
            {
                "content": block["text"],
                "metadata": {
                    **metadata_base,
                    "type": "paragraph",
                    "section": section["title"],
                },
            }
        )

    return chunks

Full JSON Compatibility

from pmcgrab import Paper, process_single_pmc

# Metadata-rich V4 contract.
full_v4 = process_single_pmc("7181753", output_style="full")

# V2 and V3 are available only through the full output path.
v2_data = process_single_pmc("7181753", output_style="full", schema_version=2)
v2_paper_dict = Paper.from_pmc("7181753").to_dict(
    output_style="full",
    schema_version=2,
)
v3_data = process_single_pmc("7181753", output_style="full", schema_version=3)

Full V4 keeps article metadata, contributors, bibliography, relations, quality diagnostics, provenance, source paths, and parse coverage. New projects should start with the default pmcgrab.paper.v1 view and opt in to full JSON only when that extra metadata is needed.