Output Format¶

PMCGrab emits clean paper JSON by default. The default schema is optimized for signal-to-noise: the paper itself, plus figures/images and tables, without parser trace metadata, diagnostics, bibliography records, contributor metadata, or provenance.

Use --full-json in the CLI, or output_style="full" in Python, when you need the metadata-rich V4 contract. V2 and V3 remain available only through the full output path.

Default Paper Shape¶

{
  "schema": "pmcgrab.paper.v1",
  "has_data": true,
  "identifiers": {
    "pmcid": "PMC7181753",
    "pmid": "32327715",
    "doi": "10.1038/s42003-020-0922-4"
  },
  "paper": {
    "title": "Single-cell transcriptomes of the human skin reveal ...",
    "abstract": [
      {
        "title": "Abstract",
        "kind": "primary",
        "content": [{ "type": "paragraph", "text": "..." }],
        "sections": []
      }
    ],
    "body": [
      {
        "id": "s1",
        "title": "Introduction",
        "content": [{ "type": "paragraph", "text": "..." }],
        "sections": []
      }
    ]
  },
  "assets": {
    "images": [],
    "tables": []
  }
}

Core Groups¶

Group	Notes
`identifiers`	Minimal paper identifiers: PMCID, PMID, and DOI.
`paper`	Title, structured abstract, and nested body section tree.
`assets`	Clean image/figure records and clean table records.

Body and abstract content is emitted as readable typed blocks. Supported block types include paragraph, list, definition_list, formula, quote, boxed_text, code, preformat, figure_ref, table_ref, supplementary_ref, and unknown_block.

Images And Tables¶

When image fetching is enabled with --with-images or process_single_pmc_with_assets(), clean image records include local file paths:

{
  "id": "f1",
  "label": "Figure 1",
  "caption": "...",
  "files": [
    {
      "href": "fig1.jpg",
      "local_path": "images/fig1.jpg",
      "status": "downloaded",
      "mime_type": "image/jpeg"
    }
  ]
}

Tables keep one canonical row representation:

{
  "id": "t1",
  "label": "Table 1",
  "caption": "...",
  "columns": ["A", "B"],
  "rows": [{ "A": "1", "B": "2" }],
  "footnotes": []
}

Access Article Data¶

from pmcgrab import process_single_pmc

data = process_single_pmc("7181753")

if data:
    print(data["paper"]["title"])
    print(data["identifiers"]["doi"])
    print(data["paper"]["abstract"][0]["content"][0]["text"][:300])
    print([section["title"] for section in data["paper"]["body"]])

Prepare Vector Chunks¶

def iter_text_blocks(sections):
    for section in sections:
        for block in section["content"]:
            if block["type"] == "paragraph":
                yield section, block
        yield from iter_text_blocks(section["sections"])


def prepare_for_vector_db(data):
    chunks = []
    metadata_base = {
        "pmcid": data["identifiers"]["pmcid"],
        "doi": data["identifiers"]["doi"],
    }

    for abstract_section in data["paper"]["abstract"]:
        for block in abstract_section["content"]:
            if block["type"] == "paragraph":
                chunks.append(
                    {
                        "content": block["text"],
                        "metadata": {
                            **metadata_base,
                            "type": "abstract",
                            "section": abstract_section["title"],
                        },
                    }
                )

    for section, block in iter_text_blocks(data["paper"]["body"]):
        chunks.append(
            {
                "content": block["text"],
                "metadata": {
                    **metadata_base,
                    "type": "paragraph",
                    "section": section["title"],
                },
            }
        )

    return chunks

Full JSON Compatibility¶

from pmcgrab import Paper, process_single_pmc

full_v4 = process_single_pmc("7181753", output_style="full")
v2_data = process_single_pmc("7181753", output_style="full", schema_version=2)
v2_paper_dict = Paper.from_pmc("7181753").to_dict(
    output_style="full",
    schema_version=2,
)
v3_data = process_single_pmc("7181753", output_style="full", schema_version=3)

Full V4 keeps article metadata, contributors, bibliography, relations, quality diagnostics, provenance, source paths, and parse coverage. New projects should start with the default pmcgrab.paper.v1 view and opt in to full JSON only when that extra metadata is needed.