Processing API

Functions for processing PMC articles efficiently.

Primary Processing Function

The recommended way to process PMC articles:

process_single_pmc

pmcgrab.application.processing.process_single_pmc

process_single_pmc(
    pmc_id: str,
) -> dict[str, str | dict | list] | None

Download and parse a single PMC article into normalized dictionary format.

Application-layer function that handles the complete processing pipeline for a single PMC article: fetching XML, parsing content, extracting structured data, and normalizing for JSON serialization. Includes timeout protection and robust error handling.

Parameters:

Name     Type   Description                                              Default
pmc_id   str    String representation of the PMC ID (e.g., "7181753")   required

Returns:

Type: dict[str, str | dict | list] | None

Normalized article dictionary with keys:

- pmc_id: Article identifier
- title: Article title
- abstract: Plain text abstract
- body: Dictionary of section titles mapped to text content
- authors: Normalized author information
- Journal and publication metadata
- Content metadata (funding, ethics, etc.)

Returns None if processing fails or the article has no usable content.

Examples:

>>> article_data = process_single_pmc("7181753")
>>> if article_data:
...     print(f"Title: {article_data['title']}")
...     print(f"Sections: {list(article_data['body'].keys())}")
...     print(f"Authors: {len(article_data['authors'])}")

Note

This function includes a 60-second timeout for network/parsing operations and performs garbage collection for memory management in batch scenarios. All values are normalized using normalize_value() for JSON compatibility.
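
Errors other than the internal timeout propagate out of process_single_pmc (see the source below), and None is returned both on failure and for articles with no usable body. Callers that want a single failure path can wrap the call; the sketch below is illustrative, with a hypothetical fetch_with_retries helper and an arbitrary backoff policy:

# Minimal retry sketch (illustrative, not part of pmcgrab).
# Caveat: None also means "no usable body", so keep attempts low.
import time

from pmcgrab.application.processing import process_single_pmc


def fetch_with_retries(pmc_id: str, attempts: int = 3, backoff: float = 2.0):
    for attempt in range(attempts):
        try:
            data = process_single_pmc(pmc_id)
        except Exception:
            # Non-timeout errors (network, parsing) propagate from the
            # pipeline; treat them the same as a None result here.
            data = None
        if data is not None:
            return data
        if attempt < attempts - 1:
            time.sleep(backoff * 2**attempt)  # exponential backoff between tries
    return None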

Source code in src/pmcgrab/application/processing.py
def process_single_pmc(pmc_id: str) -> dict[str, str | dict | list] | None:
    """Download and parse a single PMC article into normalized dictionary format.

    Application-layer function that handles the complete processing pipeline
    for a single PMC article: fetching XML, parsing content, extracting
    structured data, and normalizing for JSON serialization. Includes
    timeout protection and robust error handling.

    Args:
        pmc_id: String representation of the PMC ID (e.g., "7181753")

    Returns:
        dict[str, str | dict | list] | None: Normalized article dictionary with keys:
            - pmc_id: Article identifier
            - title: Article title
            - abstract: Plain text abstract
            - body: Dictionary of section titles mapped to text content
            - authors: Normalized author information
            - Journal and publication metadata
            - Content metadata (funding, ethics, etc.)
        Returns None if processing fails or article has no usable content.

    Examples:
        >>> article_data = process_single_pmc("7181753")
        >>> if article_data:
        ...     print(f"Title: {article_data['title']}")
        ...     print(f"Sections: {list(article_data['body'].keys())}")
        ...     print(f"Authors: {len(article_data['authors'])}")

    Note:
        This function includes a 60-second timeout for network/parsing operations
        and performs garbage collection for memory management in batch scenarios.
        All values are normalized using normalize_value() for JSON compatibility.
    """
    gc.collect()
    paper_info: dict[str, str | dict | list] = {}
    body_info: dict[str, str] = {}

    try:
        pmc_id_num = int(pmc_id)
        current_email = next_email()

        # Time-boxed network / parsing
        signal.alarm(60)
        try:
            paper = build_paper_from_pmc(
                pmc_id_num, email=current_email, download=True, validate=False
            )
        except TimeoutException:
            return None
        finally:
            signal.alarm(0)

        if paper is None:
            return None

        # ---------------- Text body extraction -------------------------
        body_sections = paper.body
        if body_sections is not None:
            try:
                iter(body_sections)  # Ensure iterable
                sec_counter = 1
                for section in body_sections:
                    try:
                        text = getattr(
                            section, "get_section_text", lambda s=section: str(s)
                        )()
                        title = (
                            section.title
                            if getattr(section, "title", None)
                            else f"Section {sec_counter}"
                        )
                        sec_counter += 1
                        body_info[title] = text
                    except Exception:
                        pass  # Robustness: ignore malformed sections
            except (TypeError, ValueError):
                pass

        # ---------------- Assemble output dict -------------------------
        paper_info["pmc_id"] = str(pmc_id_num)
        paper_info["abstract"] = paper.abstract_as_str() if paper.abstract else ""
        paper_info["has_data"] = str(paper.has_data)
        paper_info["body"] = body_info or {}
        paper_info["title"] = paper.title or ""
        paper_info["authors"] = (
            normalize_value(paper.authors) if paper.authors is not None else ""
        )
        paper_info["non_author_contributors"] = (
            normalize_value(paper.non_author_contributors)
            if paper.non_author_contributors is not None
            else ""
        )
        paper_info["publisher_name"] = (
            normalize_value(paper.publisher_name)
            if paper.publisher_name is not None
            else ""
        )
        paper_info["publisher_location"] = (
            normalize_value(paper.publisher_location)
            if paper.publisher_location is not None
            else ""
        )
        paper_info["article_id"] = (
            normalize_value(paper.article_id) if paper.article_id is not None else ""
        )
        paper_info["journal_title"] = (
            normalize_value(paper.journal_title)
            if paper.journal_title is not None
            else ""
        )
        paper_info["journal_id"] = (
            normalize_value(paper.journal_id) if paper.journal_id is not None else ""
        )
        paper_info["issn"] = (
            normalize_value(paper.issn) if paper.issn is not None else ""
        )
        paper_info["article_types"] = (
            normalize_value(paper.article_types)
            if paper.article_types is not None
            else ""
        )
        paper_info["article_categories"] = (
            normalize_value(paper.article_categories)
            if paper.article_categories is not None
            else ""
        )
        paper_info["published_date"] = (
            normalize_value(paper.published_date)
            if paper.published_date is not None
            else ""
        )
        paper_info["volume"] = (
            normalize_value(paper.volume) if paper.volume is not None else ""
        )
        paper_info["issue"] = (
            normalize_value(paper.issue) if paper.issue is not None else ""
        )
        paper_info["permissions"] = (
            normalize_value(paper.permissions) if paper.permissions is not None else ""
        )
        paper_info["copyright"] = (
            normalize_value(paper.copyright) if paper.copyright is not None else ""
        )
        paper_info["license"] = (
            normalize_value(paper.license) if paper.license is not None else ""
        )
        paper_info["funding"] = (
            normalize_value(paper.funding) if paper.funding is not None else ""
        )
        paper_info["footnote"] = (
            normalize_value(paper.footnote) if paper.footnote is not None else ""
        )
        paper_info["acknowledgements"] = (
            normalize_value(paper.acknowledgements)
            if paper.acknowledgements is not None
            else ""
        )
        paper_info["notes"] = (
            normalize_value(paper.notes) if paper.notes is not None else ""
        )
        paper_info["custom_meta"] = (
            normalize_value(paper.custom_meta) if paper.custom_meta is not None else ""
        )
        paper_info["last_updated"] = normalize_value(getattr(paper, "last_updated", ""))

        # Normalise nested structures one last time
        paper_info = {k: normalize_value(v) for k, v in paper_info.items()}
        if not paper_info.get("body"):
            return None
        return paper_info

    finally:
        with contextlib.suppress(Exception):
            del body_info, paper
        gc.collect()

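The pattern below ties the pieces together: it walks a list of PMC IDs, processes each article with process_single_pmc, prints a short summary of key fields, and persists the full normalized dictionary as JSON.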

# ─── Recommended Processing Pattern ──────────────────────────────────────────
import json
from pathlib import Path

from pmcgrab.application.processing import process_single_pmc
from pmcgrab.infrastructure.settings import next_email

# The PMC IDs we want to process
PMC_IDS = ["7114487", "3084273", "7690653", "5707528", "7979870"]

OUT_DIR = Path("pmc_output")
OUT_DIR.mkdir(exist_ok=True)

for pmcid in PMC_IDS:
    email = next_email()  # for logging only; process_single_pmc pulls its own email from the same pool
    print(f"• Fetching PMC{pmcid} using email {email} …")
    data = process_single_pmc(pmcid)
    if data is None:
        print(f"  ↳ FAILED to parse PMC{pmcid}")
        continue

    # Pretty-print a few key fields
    print(
        f"  Title   : {data['title'][:80]}{'…' if len(data['title']) > 80 else ''}\n"
        f"  Abstract: {data['abstract'][:120]}{'…' if len(data['abstract']) > 120 else ''}\n"
        f"  Authors : {len(data['authors']) if data['authors'] else 0}"
    )

    # Persist full JSON
    dest = OUT_DIR / f"PMC{pmcid}.json"
    with dest.open("w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2, ensure_ascii=False)
    print(f"  ↳ JSON saved to {dest}\n")

Email Management

next_email

pmcgrab.infrastructure.settings.next_email

next_email() -> str

Return the next email address in round-robin rotation.

Provides thread-safe access to the email pool using round-robin rotation. This ensures fair distribution of API requests across available email addresses, which helps with rate limiting and API usage policies.

Returns:

Type: str

Next email address from the configured pool

Examples:

>>> # Get email for NCBI Entrez request
>>> email = next_email()
>>> print(f"Using email: {email}")
>>>
>>> # Multiple calls rotate through pool
>>> emails = [next_email() for _ in range(3)]
>>> print(f"Rotation: {emails}")

Thread Safety

This function is thread-safe and can be called concurrently from multiple threads without requiring external synchronization. The underlying itertools.cycle iterator handles concurrent access safely.

Configuration

The email pool can be customized via the PMCGRAB_EMAILS environment variable. If not set, uses a default pool of test email addresses.

Example environment setup: export PMCGRAB_EMAILS="user1@example.com,user2@example.com"

Note

NCBI Entrez requires a valid email address for API identification. The email is used to identify the requester and enable NCBI to contact users about API usage if necessary. Use real email addresses in production environments.

Source code in src/pmcgrab/infrastructure/settings.py
def next_email() -> str:
    """Return the next email address in round-robin rotation.

    Provides thread-safe access to the email pool using round-robin rotation.
    This ensures fair distribution of API requests across available email
    addresses, which helps with rate limiting and API usage policies.

    Returns:
        str: Next email address from the configured pool

    Examples:
        >>> # Get email for NCBI Entrez request
        >>> email = next_email()
        >>> print(f"Using email: {email}")
        >>>
        >>> # Multiple calls rotate through pool
        >>> emails = [next_email() for _ in range(3)]
        >>> print(f"Rotation: {emails}")

    Thread Safety:
        This function is thread-safe and can be called concurrently from
        multiple threads without requiring external synchronization. The
        underlying itertools.cycle iterator handles concurrent access safely.

    Configuration:
        The email pool can be customized via the PMCGRAB_EMAILS environment
        variable. If not set, uses a default pool of test email addresses.

        Example environment setup:
        export PMCGRAB_EMAILS="user1@example.com,user2@example.com"

    Note:
        NCBI Entrez requires a valid email address for API identification.
        The email is used to identify the requester and enable NCBI to
        contact users about API usage if necessary. Use real email addresses
        in production environments.
    """
    return next(_email_cycle)


This function automatically rotates through available email addresses for NCBI API requests, ensuring proper rate limiting and compliance.
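
A quick way to verify your own addresses are being used is to set PMCGRAB_EMAILS before pmcgrab is imported and watch the rotation. This sketch assumes, per the docstring above, that the pool is read from the environment when the settings module is first imported; the addresses are placeholders:

# Configure the pool via PMCGRAB_EMAILS before importing pmcgrab
# (assumption: the variable is read at import time of the settings module).
import os

os.environ["PMCGRAB_EMAILS"] = "user1@example.com,user2@example.com"

from pmcgrab.infrastructure.settings import next_email

print([next_email() for _ in range(4)])
# Expected round-robin: ['user1@example.com', 'user2@example.com',
#                        'user1@example.com', 'user2@example.com']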