Architecture¶

This page documents the current PMCGrab architecture as it exists in the repository. It is intentionally factual: proposed improvements live in docs/development/clean-code-final-plan.md.

Current Shape¶

graph TB
    CLI[pmcgrab.cli.pmcgrab_cli] --> Processing[pmcgrab.application.processing]
    Processing --> Builder[pmcgrab.application.paper_builder]
    Processing --> Model[pmcgrab.model.Paper]
    Builder --> Parser[pmcgrab.parser]
    Parser --> Parsing[pmcgrab.application.parsing.*]
    Parser --> Fetch[pmcgrab.fetch]
    Fetch --> NCBI[NCBI Entrez / local XML]
    Model --> Common[pmcgrab.common.*]
    Parser --> Domain[pmcgrab.domain.value_objects]

PMCGrab has a modern package layout, but it is still partly transitional. Some legacy top-level modules remain public for compatibility, while the primary processing path now runs through pmcgrab.application.

Main Modules¶

Module	Responsibility
`pmcgrab.__init__`	Public package exports and version.
`pmcgrab.__main__`	`python -m pmcgrab` entry point.
`pmcgrab.cli.pmcgrab_cli`	Argparse CLI, ID conversion, progress reporting, and file writing.
`pmcgrab.application.processing`	Pure processing helpers for network PMCID input and local XML input.
`pmcgrab.application.paper_builder`	Builds `Paper` objects from PMCID inputs.
`pmcgrab.parser`	Public parser facade and orchestration for XML-to-dictionary extraction.
`pmcgrab.application.parsing.*`	Focused metadata, contributor, content, and section extraction helpers.
`pmcgrab.model`	`Paper`, `TextSection`, `TextParagraph`, `TextTable`, and serialization helpers.
`pmcgrab.fetch`	Network XML retrieval and local XML parsing.
`pmcgrab.idconvert`	PMC/PMID/DOI normalization and NCBI ID conversion.
`pmcgrab.common.*`	Output schema ownership, serialization, HTML cleanup, and XML text helpers.
`pmcgrab.infrastructure.settings`	Environment-driven settings, email rotation, timeout and rate configuration.
`pmcgrab.bioc`, `oa_service`, `oai`, `litctxp`	Lightweight NCBI service clients.

Data Flow¶

Network PMCID¶

sequenceDiagram
    participant User
    participant CLI
    participant Processing
    participant Builder
    participant Parser
    participant Fetch
    participant Paper

    User->>CLI: pmcgrab --pmcids 7181753
    CLI->>Processing: process_single_pmc("7181753")
    Processing->>Builder: build_paper_from_pmc(...)
    Builder->>Parser: paper_dict_from_pmc(...)
    Parser->>Fetch: get_xml(...)
    Fetch-->>Parser: XML root
    Parser-->>Builder: legacy parser dictionary
    Builder-->>Processing: Paper
    Processing-->>CLI: normalized dictionary
    CLI-->>User: JSON or JSONL file

Local XML¶

sequenceDiagram
    participant User
    participant CLI
    participant Processing
    participant Parser
    participant Fetch
    participant Paper

    User->>CLI: pmcgrab --from-file article.xml
    CLI->>Processing: process_single_local_xml(path)
    Processing->>Parser: paper_dict_from_local_xml(path)
    Parser->>Fetch: parse_local_xml(path)
    Fetch-->>Parser: PMCID and XML root
    Parser-->>Processing: legacy parser dictionary
    Processing->>Paper: Paper(parser_dict)
    Processing-->>CLI: normalized dictionary

Public Contracts¶

The stable high-level APIs are:

from pmcgrab import Paper, process_single_pmc, process_single_local_xml

paper = Paper.from_pmc("7181753")
data = process_single_pmc("7181753")
local = process_single_local_xml("article.xml")

The normalized processing dictionary uses the clean paper output by default: pmcgrab.paper.v1. It contains identifiers, paper.title, paper.abstract, paper.body, assets.images, and assets.tables.

Full output remains available with output_style="full" or --full-json. V2 and V3 remain available only in full output by passing schema_version=2 or schema_version=3.

In full V4 output:

article contains identifiers, title, publication, compliance, and metadata.
contributors contains people, affiliations, and author notes.
content contains abstract records and the ordered section tree.
assets contains full references, tables, figures, equations, and supplementary material.
relations contains inline reference spans, contributor-affiliation links, and resolved target IDs.
quality contains parser status, diagnostics, and output counts.
provenance contains parser version, source, timestamp, and XML source path.

Deprecated or legacy modules such as pmcgrab.processing remain importable for compatibility, but new code should use pmcgrab.application.processing or the top-level exports.

Error Handling¶

The parser supports two caller choices:

suppress_errors=False: acquisition and parsing errors propagate.
suppress_errors=True: acquisition, local XML parsing, and parser errors are converted to empty results where possible.

The CLI treats empty results as failed article processing and continues with the remaining IDs or files.

Test Boundaries¶

Current tests live directly under tests/:

test_public_api.py protects package exports and version consistency.
test_cli_complete.py protects argparse behavior and CLI smoke paths.
test_local_xml.py protects local XML parsing and PMCID extraction.
test_parser.py, test_model.py, and test_application_processing.py protect core parsing, model, and processing behavior.
Service-specific tests cover settings, figures, utilities, HTML cleaning, and regressions.

Known Deepening Work¶

The current implementation works, but the final clean-code plan identifies several deeper improvements:

Make pmcgrab.parser a thinner facade over explicit parser services.
Stabilize table and figure serialization contracts.
Expand CLI subprocess tests around file output and ID conversion modes.
Move more compatibility behavior behind explicit adapters.

Those changes are intentionally staged because they affect public API behavior.