Command Line Interface

PMCGrab provides a command-line interface for retrieving and batch processing PMC articles.

CLI Module

pmcgrab.cli.pmcgrab_cli

Functions

main

main() -> None

Main CLI entry point for batch PMC article processing.

Orchestrates the complete batch processing workflow:

1. Parse command-line arguments
2. Create output directory structure
3. Process PMC IDs in manageable chunks with progress tracking
4. Collect and report processing statistics
5. Write summary results to JSON file

The function processes articles in 100-article chunks to manage memory usage and provide regular progress updates. Each chunk is processed concurrently using the specified number of worker threads.
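The chunking described above can be sketched in isolation. This is a minimal illustration, not the library's code: `iter_chunks` and `CHUNK_SIZE` are hypothetical names introduced here, and the IDs are placeholders rather than real PMC IDs.

```python
from collections.abc import Iterator

CHUNK_SIZE = 100  # matches the 100-article chunk size described above


def iter_chunks(ids: list[str], size: int = CHUNK_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start : start + size]


# 250 placeholder IDs are split into chunks of 100, 100, and 50.
example_ids = [str(n) for n in range(250)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(example_ids)]
print(chunk_sizes)
```

Slicing past the end of a list is safe in Python, so the final chunk simply holds whatever IDs remain.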

Output

Creates individual JSON files for each successfully processed article in the output directory, plus a summary.json file containing processing statistics for all articles.
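As a hedged sketch of how the summary file might be consumed afterwards: the source listing on this page stores one success flag per PMC ID, so the example below assumes summary.json is a flat mapping of ID to boolean. The `summarize` helper is hypothetical, and a temporary file stands in for the output of a real run.

```python
import json
import tempfile
from pathlib import Path


def summarize(summary_path: Path) -> tuple[int, list[str]]:
    """Return (number of successes, list of failed IDs) from a summary file."""
    with open(summary_path, encoding="utf-8") as fh:
        # Assumed layout: {"<pmc id>": true | false, ...}
        results: dict[str, bool] = json.load(fh)
    failed = [pid for pid, ok in results.items() if not ok]
    return len(results) - len(failed), failed


# Demonstration with a throwaway file instead of a real run.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "summary.json"
    demo.write_text(json.dumps({"7181753": True, "3539614": False}), encoding="utf-8")
    ok_count, failed_ids = summarize(demo)

print(ok_count, failed_ids)
```

The list of failed IDs could then be fed back into a second run.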

Examples:

This function is typically called via:

python -m pmcgrab.cli.pmcgrab_cli --pmcids 7181753 3539614
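To show how the invocation above could map onto the attributes that main() reads (args.pmcids, args.output_dir, args.batch_size), here is a minimal argparse sketch. It is not the project's actual `_parse_args()`: `build_parser` is a hypothetical name, and the types and the batch-size default are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the attribute names come from main();
    # the types and the --batch-size default are illustrative assumptions.
    parser = argparse.ArgumentParser(prog="pmcgrab")
    parser.add_argument("--pmcids", nargs="+", required=True)
    parser.add_argument("--output-dir", default="./pmc_output")
    parser.add_argument("--batch-size", type=int, default=10)
    return parser


# Parse the same arguments as the example invocation, without touching sys.argv.
args = build_parser().parse_args(["--pmcids", "7181753", "3539614"])
print(args.pmcids, args.output_dir, args.batch_size)
```

argparse converts the dashes in `--output-dir` and `--batch-size` to underscores, which is why main() can read `args.output_dir` and `args.batch_size`.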

Note

The function assumes that process_pmc_ids() handles the actual file writing for individual articles. It focuses on orchestration, progress tracking, and summary generation.

Source code in src/pmcgrab/cli/pmcgrab_cli.py

def main() -> None:
    """Main CLI entry point for batch PMC article processing.

    Orchestrates the complete batch processing workflow:
    1. Parse command-line arguments
    2. Create output directory structure
    3. Process PMC IDs in manageable chunks with progress tracking
    4. Collect and report processing statistics
    5. Write summary results to JSON file

    The function processes articles in 100-article chunks to manage memory
    usage and provide regular progress updates. Each chunk is processed
    concurrently using the specified number of worker threads.

    Output:
        Creates individual JSON files for each successfully processed article
        in the output directory, plus a summary.json file containing processing
        statistics for all articles.

    Examples:
        This function is typically called via:
            python -m pmcgrab.cli.pmcgrab_cli --pmcids 7181753 3539614

    Note:
        The function assumes that process_pmc_ids() handles the actual file
        writing for individual articles. It focuses on orchestration,
        progress tracking, and summary generation.
    """
    args = _parse_args()
    pmc_ids: list[str] = args.pmcids
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    results = {}
    bar = tqdm(total=len(pmc_ids), desc="Processing PMC IDs", unit="paper")
    for chunk_start in range(0, len(pmc_ids), 100):
        chunk = pmc_ids[chunk_start : chunk_start + 100]
        chunk_results = process_pmc_ids(chunk, batch_size=args.batch_size)
        for pid, success in chunk_results.items():
            # Article files are assumed to have been written already by the
            # processing call above; only the outcome is recorded here.
            results[pid] = success
            bar.update(1)
    bar.close()

    summary_path = out_dir / "summary.json"
    with open(summary_path, "w", encoding="utf-8") as jf:
        json.dump(results, jf, indent=2)
    print(f"Summary written to {summary_path}")

Usage Examples

Basic Commands

# Process single paper
uv run python -m pmcgrab PMC7181753

# Process multiple papers
uv run python -m pmcgrab PMC7181753 PMC3539614 PMC5454911

Advanced Options

# Custom output directory
uv run python -m pmcgrab --output-dir ./results PMC7181753

# Parallel processing
uv run python -m pmcgrab --workers 8 PMC7181753 PMC3539614

# From file input
uv run python -m pmcgrab --input-file pmc_ids.txt --max-retries 3

All Options

--output-dir: Specify output directory (default: ./pmc_output)
--workers: Number of parallel workers (default: 4)
--email: Contact email for NCBI API
--input-file: Read PMC IDs from file
--max-retries: Maximum retry attempts for failed downloads
--batch-size: Number of articles per batch
--timeout: Request timeout in seconds
--verbose: Enable verbose logging
--help: Show help message