Architecture¶

PMCGrab follows clean architecture principles with clear separation of concerns.

Overview¶

graph TB
    CLI[CLI Layer] --> App[Application Layer]
    App --> Domain[Domain Layer]
    App --> Common[Common Utilities]
    App --> Infra[Infrastructure Layer]
    Domain --> Models[Domain Models]
    Common --> Utils[Utilities & Helpers]
    Infra --> External[External APIs]

Layer Descriptions¶

CLI Layer (`pmcgrab.cli`)¶

Command-line interface
Argument parsing
User interaction
Progress reporting

Application Layer (`pmcgrab.application`)¶

Use case orchestration
Business workflow logic
Paper construction
Content parsing coordination

Domain Layer (`pmcgrab.domain`)¶

Core business entities
Value objects
Domain rules
No external dependencies

Common Layer (`pmcgrab.common`)¶

Shared utilities
HTML cleaning
XML processing
Serialization helpers

Infrastructure Layer (`pmcgrab.infrastructure`)¶

External API clients
HTTP utilities
Settings management
I/O operations

Key Components¶

Paper Model¶

The central domain entity representing a PMC article:

@dataclass
class Paper:
    pmcid: str
    title: str
    authors: List[Author]
    abstract: Dict[str, str]
    body: Dict[str, str]
    citations: List[Citation]
    # ... other fields

Parser System¶

Modular parsing system with specialized parsers:

MetadataParser: Article metadata
ContentParser: Main content sections
ContributorParser: Authors and affiliations
SectionParser: Section organization

Processing Pipeline¶

sequenceDiagram
    participant User
    participant CLI
    participant App
    participant Parser
    participant Fetcher

    User->>CLI: pmcgrab PMC123456
    CLI->>App: process_pmc_ids()
    App->>Fetcher: get_xml()
    Fetcher-->>App: XML content
    App->>Parser: parse_paper()
    Parser-->>App: Paper object
    App-->>CLI: Processing result
    CLI-->>User: JSON output

Design Patterns¶

Factory Pattern¶

Used for creating Paper objects from various sources:

class PaperFactory:
    @staticmethod
    def from_pmc(pmcid: str) -> Paper:
        # Construction logic

    @staticmethod
    def from_xml(xml_content: str) -> Paper:
        # Construction logic

Strategy Pattern¶

Different parsing strategies for different content types:

class ContentParser:
    def __init__(self, strategy: ParsingStrategy):
        self.strategy = strategy

    def parse(self, content: str) -> Dict:
        return self.strategy.parse(content)

Builder Pattern¶

Complex Paper object construction:

class PaperBuilder:
    def add_metadata(self, metadata: Dict) -> 'PaperBuilder':
        # Add metadata
        return self

    def add_content(self, content: Dict) -> 'PaperBuilder':
        # Add content
        return self

    def build(self) -> Paper:
        # Construct final Paper object

Data Flow¶

Single Paper Processing¶

Input: PMC ID from user
Fetch: Download XML from NCBI
Parse: Extract structured data
Build: Construct Paper object
Output: Serialize to JSON

Batch Processing¶

Input: List of PMC IDs
Chunk: Split into batches
Parallel: Process batches concurrently
Aggregate: Collect results
Report: Generate summary

Error Handling¶

Error Types¶

class PMCGrabError(Exception):
    """Base exception for PMCGrab"""

class NetworkError(PMCGrabError):
    """Network-related errors"""

class ParsingError(PMCGrabError):
    """XML parsing errors"""

class ValidationError(PMCGrabError):
    """Data validation errors"""

Error Handling Strategy¶

Fail Fast: For critical errors
Graceful Degradation: For parsing issues
Retry Logic: For network errors
User Feedback: Clear error messages

Testing Architecture¶

Test Structure¶

tests/
├── unit/           # Unit tests
├── integration/    # Integration tests
├── e2e/           # End-to-end tests
├── fixtures/      # Test data
└── conftest.py    # Test configuration

Test Categories¶

Unit Tests: Individual components
Integration Tests: Component interactions
End-to-end Tests: Full workflows
Performance Tests: Speed and memory usage

Configuration Management¶

Settings Hierarchy¶

Command line arguments (highest priority)
Environment variables
Configuration files
Default values (lowest priority)

Configuration Schema¶

@dataclass
class Settings:
    email: str
    timeout: int = 30
    max_retries: int = 3
    batch_size: int = 10
    workers: int = 4

Extension Points¶

Custom Parsers¶

Implement the ParserInterface:

class CustomParser(ParserInterface):
    def parse(self, xml_root: Element) -> Dict:
        # Custom parsing logic

Custom Output Formats¶

Implement the SerializerInterface:

class CustomSerializer(SerializerInterface):
    def serialize(self, paper: Paper) -> str:
        # Custom serialization logic

Performance Considerations¶

Optimization Strategies¶

Concurrent Processing: Multiple workers
Caching: XML and parsed data
Memory Management: Streaming for large datasets
Network Optimization: Connection pooling

Monitoring¶

Processing speed metrics
Memory usage tracking
Error rate monitoring
Network latency measurements

This architecture ensures PMCGrab is maintainable, testable, and extensible while providing excellent performance for both single article and batch processing scenarios.