Architecture
PMCGrab follows clean architecture principles with clear separation of concerns.
Overview
graph TB
CLI[CLI Layer] --> App[Application Layer]
App --> Domain[Domain Layer]
App --> Common[Common Utilities]
App --> Infra[Infrastructure Layer]
Domain --> Models[Domain Models]
Common --> Utils[Utilities & Helpers]
Infra --> External[External APIs]
Layer Descriptions
CLI Layer (pmcgrab.cli)
- Command-line interface
- Argument parsing
- User interaction
- Progress reporting
Application Layer (pmcgrab.application)
- Use case orchestration
- Business workflow logic
- Paper construction
- Content parsing coordination
Domain Layer (pmcgrab.domain)
- Core business entities
- Value objects
- Domain rules
- No external dependencies
Common Layer (pmcgrab.common)
- Shared utilities
- HTML cleaning
- XML processing
- Serialization helpers
Infrastructure Layer (pmcgrab.infrastructure)
- External API clients
- HTTP utilities
- Settings management
- I/O operations
Key Components
Paper Model
The central domain entity representing a PMC article:
    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Paper:
        pmcid: str
        title: str
        authors: List[Author]
        abstract: Dict[str, str]
        body: Dict[str, str]
        citations: List[Citation]
        # ... other fields
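As a minimal sketch of how such an entity might be built and serialized, the example below defines stand-in `Author` and `Citation` value objects and converts a `Paper` to JSON with `dataclasses.asdict`. The fields of `Author` and `Citation`, and the empty-collection defaults on `Paper`, are assumptions made for illustration; the real classes may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Author:
    # Hypothetical fields; the real Author class is not shown in this document.
    first_name: str
    last_name: str


@dataclass(frozen=True)
class Citation:
    # Hypothetical field; stands in for the real Citation value object.
    text: str


@dataclass
class Paper:
    pmcid: str
    title: str
    # Defaults are an assumption so the example can omit empty collections.
    authors: List[Author] = field(default_factory=list)
    abstract: Dict[str, str] = field(default_factory=dict)
    body: Dict[str, str] = field(default_factory=dict)
    citations: List[Citation] = field(default_factory=list)


paper = Paper(
    pmcid="PMC123456",
    title="An example article",
    authors=[Author("Ada", "Lovelace")],
    abstract={"Background": "Why the study was done."},
    body={"Introduction": "Opening section text."},
)

# asdict() recurses into nested dataclasses, so the result is JSON-ready.
paper_json = json.dumps(asdict(paper), indent=2)
print(paper_json)
```

Keeping the entity a plain dataclass means it needs nothing beyond the standard library, which fits the "no external dependencies" rule of the domain layer.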
Parser System
Modular parsing system with specialized parsers:
- MetadataParser: Article metadata
- ContentParser: Main content sections
- ContributorParser: Authors and affiliations
- SectionParser: Section organization
Processing Pipeline
sequenceDiagram
participant User
participant CLI
participant App
participant Parser
participant Fetcher
User->>CLI: pmcgrab PMC123456
CLI->>App: process_pmc_ids()
App->>Fetcher: get_xml()
Fetcher-->>App: XML content
App->>Parser: parse_paper()
Parser-->>App: Paper object
App-->>CLI: Processing result
CLI-->>User: JSON output
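The sequence above can be approximated in a few lines of Python. This is only a sketch: the function names `process_pmc_ids`, `get_xml`, and `parse_paper` come from the diagram, but their signatures are assumptions, and the fetcher here returns canned XML rather than contacting NCBI.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Signatures are assumptions; only the function names come from the diagram.
Fetcher = Callable[[str], str]
Parser = Callable[[str], Dict[str, str]]


def fake_get_xml(pmcid: str) -> str:
    """Stand-in for the Fetcher: returns canned XML instead of calling NCBI."""
    return f"<article><article-title>Title of {pmcid}</article-title></article>"


def parse_paper(xml_content: str) -> Dict[str, str]:
    """Stand-in for the Parser: extracts only the title."""
    root = ET.fromstring(xml_content)
    return {"title": root.findtext("article-title", default="")}


def process_pmc_ids(
    pmcids: List[str], get_xml: Fetcher, parse: Parser
) -> List[Dict[str, str]]:
    """App layer: fetch each article, parse it, and collect the results."""
    results = []
    for pmcid in pmcids:
        paper = parse(get_xml(pmcid))
        results.append({"pmcid": pmcid, **paper})
    return results


# CLI layer: hand the IDs to the app layer and print JSON for the user.
output = json.dumps(process_pmc_ids(["PMC123456"], fake_get_xml, parse_paper))
print(output)
```

Passing the fetcher and parser in as arguments keeps the application layer testable without network access.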
Design Patterns
Factory Pattern
Used for creating Paper objects from various sources:
    class PaperFactory:
        @staticmethod
        def from_pmc(pmcid: str) -> Paper:
            # Construction logic
            ...

        @staticmethod
        def from_xml(xml_content: str) -> Paper:
            # Construction logic
            ...
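To make the factory concrete, here is a hedged sketch of what `from_xml` could look like using the standard library's `xml.etree.ElementTree`. The reduced `Paper` class and the element paths (`article-id` with `pub-id-type="pmc"`, `article-title`) are assumptions chosen for the sample document, not PMCGrab's actual construction logic. `from_pmc` is omitted because it would need network access.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Paper:
    # Reduced to two fields to keep the sketch short.
    pmcid: str
    title: str


class PaperFactory:
    @staticmethod
    def from_xml(xml_content: str) -> Paper:
        """Build a Paper from an XML string (element paths are assumptions)."""
        root = ET.fromstring(xml_content)
        pmcid = root.findtext(".//article-id[@pub-id-type='pmc']", default="")
        title = root.findtext(".//article-title", default="")
        return Paper(pmcid=pmcid, title=title)


sample_xml = """
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmc">PMC123456</article-id>
      <title-group><article-title>An example article</article-title></title-group>
    </article-meta>
  </front>
</article>
"""

paper = PaperFactory.from_xml(sample_xml)
print(paper)
```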
Strategy Pattern
Different parsing strategies for different content types:
    class ContentParser:
        def __init__(self, strategy: ParsingStrategy):
            self.strategy = strategy

        def parse(self, content: str) -> Dict:
            return self.strategy.parse(content)
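A runnable sketch of the same idea follows. `ParsingStrategy` is modelled here as a `typing.Protocol`, and the two concrete strategies are hypothetical examples invented for illustration; the strategies PMCGrab actually ships are not described in this document.

```python
import re
from typing import Dict, Protocol


class ParsingStrategy(Protocol):
    def parse(self, content: str) -> Dict[str, str]: ...


class PlainTextStrategy:
    """Hypothetical strategy: treat the content as already-clean text."""

    def parse(self, content: str) -> Dict[str, str]:
        return {"text": content.strip()}


class TagStrippingStrategy:
    """Hypothetical strategy: drop anything that looks like a markup tag."""

    def parse(self, content: str) -> Dict[str, str]:
        return {"text": re.sub(r"<[^>]+>", "", content).strip()}


class ContentParser:
    def __init__(self, strategy: ParsingStrategy):
        self.strategy = strategy

    def parse(self, content: str) -> Dict[str, str]:
        # Delegate to whichever strategy was injected.
        return self.strategy.parse(content)


plain = ContentParser(PlainTextStrategy()).parse("  Results were significant. ")
stripped = ContentParser(TagStrippingStrategy()).parse(
    "<p>Results were <b>significant</b>.</p>"
)
print(plain, stripped)
```

Because `ContentParser` only depends on the `parse` method, a new content type can be supported by adding a strategy class without touching the parser itself.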
Builder Pattern
Complex Paper object construction:
    class PaperBuilder:
        def add_metadata(self, metadata: Dict) -> 'PaperBuilder':
            # Add metadata
            return self

        def add_content(self, content: Dict) -> 'PaperBuilder':
            # Add content
            return self

        def build(self) -> Paper:
            # Construct final Paper object
            ...
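The builder skeleton can be filled in as follows. This is a sketch under assumptions: the reduced `Paper` class, the private `_metadata` and `_content` dictionaries, and the validation in `build` are illustrative choices rather than PMCGrab's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Paper:
    # Reduced field set for the sketch.
    pmcid: str
    title: str
    body: Dict[str, str] = field(default_factory=dict)


class PaperBuilder:
    def __init__(self) -> None:
        self._metadata: Dict[str, str] = {}
        self._content: Dict[str, str] = {}

    def add_metadata(self, metadata: Dict[str, str]) -> "PaperBuilder":
        self._metadata.update(metadata)
        return self  # returning self enables method chaining

    def add_content(self, content: Dict[str, str]) -> "PaperBuilder":
        self._content.update(content)
        return self

    def build(self) -> Paper:
        # Validate at the end so a half-populated Paper is never produced.
        missing = [key for key in ("pmcid", "title") if key not in self._metadata]
        if missing:
            raise ValueError(f"missing metadata: {', '.join(missing)}")
        return Paper(
            pmcid=self._metadata["pmcid"],
            title=self._metadata["title"],
            body=dict(self._content),
        )


paper = (
    PaperBuilder()
    .add_metadata({"pmcid": "PMC123456", "title": "An example article"})
    .add_content({"Introduction": "Opening section text."})
    .build()
)
print(paper)
```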
Data Flow
Single Paper Processing
- Input: PMC ID from user
- Fetch: Download XML from NCBI
- Parse: Extract structured data
- Build: Construct Paper object
- Output: Serialize to JSON
Batch Processing
- Input: List of PMC IDs
- Chunk: Split into batches
- Parallel: Process batches concurrently
- Aggregate: Collect results
- Report: Generate summary
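The batch steps above might be sketched like this. The use of a thread pool is an assumption (the document does not say whether threads or processes are used), and `process_one` is a placeholder for the real fetch-and-parse step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_one(pmcid: str) -> Dict[str, str]:
    """Placeholder for fetch + parse; fails on IDs without the PMC prefix."""
    if not pmcid.startswith("PMC"):
        raise ValueError(f"not a PMC ID: {pmcid}")
    return {"pmcid": pmcid, "status": "ok"}


def process_batch(batch: List[str]) -> List[Dict[str, str]]:
    """Process one batch, recording failures instead of aborting."""
    results = []
    for pmcid in batch:
        try:
            results.append(process_one(pmcid))
        except ValueError as exc:
            results.append({"pmcid": pmcid, "status": f"error: {exc}"})
    return results


def process_all(
    pmcids: List[str], batch_size: int = 10, workers: int = 4
) -> Dict[str, int]:
    batches = list(chunked(pmcids, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns per-batch results in submission order.
        per_batch = list(pool.map(process_batch, batches))
    flat = [result for batch in per_batch for result in batch]
    succeeded = sum(1 for result in flat if result["status"] == "ok")
    return {"total": len(flat), "succeeded": succeeded, "failed": len(flat) - succeeded}


summary = process_all(["PMC1", "PMC2", "bad-id", "PMC3", "PMC4"], batch_size=2)
print(summary)
```

Catching errors inside `process_batch` means one bad ID is reported in the summary without discarding the rest of its batch.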
Error Handling
Error Types
    class PMCGrabError(Exception):
        """Base exception for PMCGrab"""

    class NetworkError(PMCGrabError):
        """Network-related errors"""

    class ParsingError(PMCGrabError):
        """XML parsing errors"""

    class ValidationError(PMCGrabError):
        """Data validation errors"""
Error Handling Strategy
- Fail Fast: For critical errors
- Graceful Degradation: For parsing issues
- Retry Logic: For network errors
- User Feedback: Clear error messages
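Two of these strategies, retry logic and graceful degradation, can be illustrated with a short sketch. The helper names `with_retries` and `parse_section`, the exponential backoff schedule, and the simulated `flaky_fetch` are all assumptions made for the example; only the exception classes come from this document.

```python
import time
from typing import Callable, Dict, Optional


class PMCGrabError(Exception):
    """Base exception for PMCGrab"""


class NetworkError(PMCGrabError):
    """Network-related errors"""


class ParsingError(PMCGrabError):
    """XML parsing errors"""


def with_retries(
    fetch: Callable[[], str], max_retries: int = 3, base_delay: float = 1.0
) -> str:
    """Retry only NetworkError, doubling the delay after each failed attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except NetworkError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the original error
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


def parse_section(parse: Callable[[], str]) -> Optional[str]:
    """Graceful degradation: a section that fails to parse becomes None."""
    try:
        return parse()
    except ParsingError:
        return None


attempts: Dict[str, int] = {"count": 0}


def flaky_fetch() -> str:
    # Simulated fetcher that fails twice before succeeding.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise NetworkError("temporary failure")
    return "<article/>"


# base_delay=0.0 keeps the demonstration instant.
xml = with_retries(flaky_fetch, base_delay=0.0)
print(xml, attempts["count"])
```

Restricting the retry loop to `NetworkError` keeps parsing and validation failures from being retried pointlessly.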
Testing Architecture
Test Structure
tests/
├── unit/ # Unit tests
├── integration/ # Integration tests
├── e2e/ # End-to-end tests
├── fixtures/ # Test data
└── conftest.py # Test configuration
Test Categories
- Unit Tests: Individual components
- Integration Tests: Component interactions
- End-to-end Tests: Full workflows
- Performance Tests: Speed and memory usage
Configuration Management
Settings Hierarchy
- Command line arguments (highest priority)
- Environment variables
- Configuration files
- Default values (lowest priority)
Configuration Schema
    from dataclasses import dataclass

    @dataclass
    class Settings:
        email: str
        timeout: int = 30
        max_retries: int = 3
        batch_size: int = 10
        workers: int = 4
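One way to resolve the settings hierarchy into this schema is with `collections.ChainMap`, which looks keys up in the first mapping that contains them. The sketch below is an assumption about how the layers could be merged: the `load_settings` helper and the `PMCGRAB_` environment variable prefix are hypothetical and are not confirmed by this document.

```python
from collections import ChainMap
from dataclasses import dataclass, fields
from typing import Any, Dict, Mapping


@dataclass
class Settings:
    email: str
    timeout: int = 30
    max_retries: int = 3
    batch_size: int = 10
    workers: int = 4


def load_settings(
    cli_args: Mapping[str, Any],
    env: Mapping[str, str],
    file_values: Mapping[str, Any],
) -> Settings:
    known = {f.name: f for f in fields(Settings)}

    # The PMCGRAB_ prefix is a hypothetical naming scheme for this sketch.
    env_values: Dict[str, Any] = {}
    for name, f in known.items():
        raw = env.get(f"PMCGRAB_{name.upper()}")
        if raw is not None:
            # Environment values are strings; coerce integer fields.
            env_values[name] = int(raw) if f.type is int else raw

    # Unset CLI options (None) must not shadow lower-priority sources.
    cli_values = {k: v for k, v in cli_args.items() if v is not None}

    # Earlier mappings win: CLI, then environment, then configuration file.
    merged = ChainMap(cli_values, env_values, dict(file_values))
    # Fields absent from every source fall back to the dataclass defaults.
    return Settings(**{name: merged[name] for name in known if name in merged})


settings = load_settings(
    cli_args={"timeout": 60, "workers": None},
    env={"PMCGRAB_WORKERS": "8", "PMCGRAB_TIMEOUT": "45"},
    file_values={"email": "user@example.org", "workers": 2, "batch_size": 25},
)
print(settings)
```

In real use the `env` argument would typically be `os.environ`, and `cli_args` the dictionary form of parsed command-line options.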
Extension Points
Custom Parsers
Implement the ParserInterface:
    class CustomParser(ParserInterface):
        def parse(self, xml_root: Element) -> Dict:
            # Custom parsing logic
            ...
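For illustration, a custom parser could look like the sketch below. `ParserInterface` is redefined here as a stand-in abstract base class because its real definition is not shown; the `KeywordParser` class and the `kwd` element it reads are hypothetical, and `Element` is assumed to be the standard library's `xml.etree.ElementTree.Element`.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Dict
from xml.etree.ElementTree import Element


class ParserInterface(ABC):
    # Stand-in for the real interface, which is only named in this document.
    @abstractmethod
    def parse(self, xml_root: Element) -> Dict:
        ...


class KeywordParser(ParserInterface):
    """Hypothetical custom parser that collects <kwd> elements."""

    def parse(self, xml_root: Element) -> Dict:
        # Empty elements have text None, so normalise before stripping.
        keywords = [(kwd.text or "").strip() for kwd in xml_root.iter("kwd")]
        return {"keywords": [k for k in keywords if k]}


root = ET.fromstring(
    "<article><kwd-group>"
    "<kwd>genomics</kwd><kwd> CRISPR </kwd><kwd/>"
    "</kwd-group></article>"
)
parsed = KeywordParser().parse(root)
print(parsed)
```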
Custom Output Formats
Implement the SerializerInterface:
    class CustomSerializer(SerializerInterface):
        def serialize(self, paper: Paper) -> str:
            # Custom serialization logic
            ...
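As a hedged example of a custom output format, the sketch below renders a paper as Markdown. `SerializerInterface` and the reduced `Paper` class are stand-ins defined for the example, and `MarkdownSerializer` is a hypothetical name, not a serializer that PMCGrab is known to provide.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Paper:
    # Reduced field set for the sketch.
    pmcid: str
    title: str
    body: Dict[str, str] = field(default_factory=dict)


class SerializerInterface(ABC):
    # Stand-in for the real interface, which is only named in this document.
    @abstractmethod
    def serialize(self, paper: Paper) -> str:
        ...


class MarkdownSerializer(SerializerInterface):
    """Hypothetical serializer: title as H1, each body section as H2."""

    def serialize(self, paper: Paper) -> str:
        lines = [f"# {paper.title}", "", f"PMCID: {paper.pmcid}"]
        for heading, text in paper.body.items():
            lines += ["", f"## {heading}", "", text]
        return "\n".join(lines) + "\n"


markdown = MarkdownSerializer().serialize(
    Paper("PMC123456", "An example article", {"Introduction": "Opening section text."})
)
print(markdown)
```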
Performance Considerations
Optimization Strategies
- Concurrent Processing: Multiple workers
- Caching: XML and parsed data
- Memory Management: Streaming for large datasets
- Network Optimization: Connection pooling
Monitoring
- Processing speed metrics
- Memory usage tracking
- Error rate monitoring
- Network latency measurements
This architecture is intended to keep PMCGrab maintainable, testable, and extensible while supporting efficient processing of both single articles and large batches.