Core API
The core API provides the main functions for processing PMC articles.
Primary Processing Function
process_single_pmc
pmcgrab.application.processing.process_single_pmc
process_single_pmc(
    pmc_id: str,
) -> dict[str, str | dict | list] | None
Download and parse a single PMC article into normalized dictionary format.
Application-layer function that handles the complete processing pipeline for a single PMC article: fetching XML, parsing content, extracting structured data, and normalizing for JSON serialization. Includes timeout protection and robust error handling.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pmc_id | str | String representation of the PMC ID (e.g., "7181753") | required |
Returns:
Type | Description |
---|---|
dict[str, str \| dict \| list] \| None | Normalized article dictionary, or None if processing fails or the article has no usable content. |
The normalized dictionary contains:
- pmc_id: Article identifier
- title: Article title
- abstract: Plain text abstract
- body: Dictionary of section titles mapped to text content
- authors: Normalized author information
- Journal and publication metadata
- Content metadata (funding, ethics, etc.)
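Because the return value is either a JSON-compatible dictionary or None, callers typically check for None before reading keys. The following is a minimal sketch of that pattern. The helper `summarize_article` and the `sample` dictionary are hypothetical, and the shape of `authors` (a list of dictionaries here) is an assumption, since the documentation above only says it holds normalized author information.

```python
import json


def summarize_article(article):
    """Return a small JSON-safe summary, or None when processing failed."""
    if not article:
        return None
    return {
        "pmc_id": article.get("pmc_id"),
        "title": article.get("title"),
        # body maps section titles to text, so its keys are the section names
        "sections": list((article.get("body") or {}).keys()),
        "author_count": len(article.get("authors") or []),
    }


# Placeholder data shaped like the documented return value (not real output).
sample = {
    "pmc_id": "0000000",
    "title": "Placeholder title",
    "abstract": "Placeholder abstract.",
    "body": {"Introduction": "...", "Methods": "..."},
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
}

summary = summarize_article(sample)
print(json.dumps(summary, indent=2))
```

In real use, `sample` would be replaced by the result of `process_single_pmc(...)`.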
Examples:
>>> article_data = process_single_pmc("7181753")
>>> if article_data:
...     print(f"Title: {article_data['title']}")
...     print(f"Sections: {list(article_data['body'].keys())}")
...     print(f"Authors: {len(article_data['authors'])}")
Note
This function includes a 60-second timeout for network/parsing operations and performs garbage collection for memory management in batch scenarios. All values are normalized using normalize_value() for JSON compatibility.
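To illustrate the general idea of a time-limited call followed by garbage collection, here is a minimal sketch built on the standard library. It is not pmcgrab's implementation: the helper `run_with_timeout` and the demo function `slow_task` are hypothetical, and a worker thread is only one of several ways to enforce such a limit.

```python
import gc
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(func, *args, timeout=60.0):
    """Run func(*args); return its result, or None on timeout or error."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # The worker thread is not killed; it finishes in the background.
        return None
    except Exception:
        # Any error raised inside func is reported as a failed call.
        return None
    finally:
        executor.shutdown(wait=False)
        gc.collect()  # release memory between items in a long batch


def slow_task(seconds):
    """Stand-in for a slow network or parsing step."""
    time.sleep(seconds)
    return "done"


fast = run_with_timeout(slow_task, 0.01, timeout=1.0)
slow = run_with_timeout(slow_task, 0.3, timeout=0.05)
print(fast, slow)
```

A thread-based limit stops the caller from waiting but cannot interrupt the work itself, which is worth keeping in mind when choosing a timeout strategy.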
Source code in src/pmcgrab/application/processing.py
Email Management
next_email
pmcgrab.infrastructure.settings.next_email
Return the next email address in round-robin rotation.
Provides thread-safe access to the email pool using round-robin rotation. This ensures fair distribution of API requests across available email addresses, which helps with rate limiting and API usage policies.
Returns:
Type | Description |
---|---|
str | Next email address from the configured pool |
Examples:
>>> # Get email for NCBI Entrez request
>>> email = next_email()
>>> print(f"Using email: {email}")
>>>
>>> # Multiple calls rotate through pool
>>> emails = [next_email() for _ in range(3)]
>>> print(f"Rotation: {emails}")
Thread Safety
This function is thread-safe and can be called concurrently from multiple threads without requiring external synchronization. The underlying itertools.cycle iterator handles concurrent access safely.
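For code that needs a similar rotation of its own, one conservative option is to guard the iterator with an explicit lock, so the behavior does not depend on interpreter-level guarantees. The sketch below shows that approach. The class `RoundRobinPool` is hypothetical and is not part of pmcgrab.

```python
import itertools
import threading


class RoundRobinPool:
    """Hand out items in round-robin order, guarded by a lock."""

    def __init__(self, items):
        items = list(items)
        if not items:
            raise ValueError("pool must contain at least one item")
        self._cycle = itertools.cycle(items)
        self._lock = threading.Lock()

    def next(self):
        # Only one thread advances the iterator at a time.
        with self._lock:
            return next(self._cycle)


pool = RoundRobinPool(["a@example.com", "b@example.com"])
rotation = [pool.next() for _ in range(3)]
print(rotation)
```

With the lock in place, every caller receives exactly one item per call, and items are handed out in strict rotation regardless of how many threads share the pool.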
Configuration
The email pool can be customized via the PMCGRAB_EMAILS environment variable. If not set, uses a default pool of test email addresses.
Example environment setup: export PMCGRAB_EMAILS="user1@example.com,user2@example.com"
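As a rough illustration of how a comma-separated variable like this can be turned into a pool, consider the sketch below. The function `load_email_pool` and the fallback list `DEFAULT_EMAILS` are hypothetical, and the whitespace trimming is an assumption. pmcgrab's own parsing and default addresses may differ.

```python
import os

# Hypothetical fallback; pmcgrab's actual default pool is not shown here.
DEFAULT_EMAILS = ["placeholder@example.com"]


def load_email_pool(env=None):
    """Parse PMCGRAB_EMAILS into a list, falling back to a default pool."""
    env = os.environ if env is None else env
    raw = env.get("PMCGRAB_EMAILS", "")
    # Split on commas and drop empty entries and surrounding whitespace.
    emails = [part.strip() for part in raw.split(",") if part.strip()]
    return emails or list(DEFAULT_EMAILS)


configured = load_email_pool(
    {"PMCGRAB_EMAILS": "user1@example.com, user2@example.com"}
)
fallback = load_email_pool({})
print(configured, fallback)
```

Passing a plain dictionary as `env` keeps the example independent of the real process environment. Calling `load_email_pool()` with no argument reads `os.environ`.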
Note
NCBI Entrez requires a valid email address for API identification. The email is used to identify the requester and enable NCBI to contact users about API usage if necessary. Use real email addresses in production environments.
Source code in src/pmcgrab/infrastructure/settings.py
Example Usage
from pmcgrab.application.processing import process_single_pmc
from pmcgrab.infrastructure.settings import next_email

# Next address from the rotation pool (process_single_pmc itself takes only the PMC ID)
email = next_email()

# Process a single PMC article
data = process_single_pmc("7114487")
if data:
    print(f"Title: {data['title']}")
    print(f"Authors: {len(data['authors'])}")
    print(f"Sections: {list(data['body'].keys())}")
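The same pattern extends to several articles by collecting successes and failures separately, since a None result signals an article that could not be processed. The sketch below is illustrative only. The helper `process_many` and the stand-in `fake_process` are hypothetical, and the processing function is passed in as a parameter so the example runs without network access.

```python
def process_many(pmc_ids, process_fn):
    """Apply process_fn to each ID; return (results by ID, failed IDs)."""
    results, failed = {}, []
    for pmc_id in pmc_ids:
        data = process_fn(pmc_id)
        if data:
            results[pmc_id] = data
        else:
            # None (or an empty result) means the article was not usable.
            failed.append(pmc_id)
    return results, failed


def fake_process(pmc_id):
    """Stand-in for process_single_pmc; fails for one placeholder ID."""
    if pmc_id == "2":
        return None
    return {"pmc_id": pmc_id, "title": f"Placeholder {pmc_id}"}


results, failed = process_many(["1", "2", "3"], fake_process)
print(sorted(results), failed)
```

In real use, `process_single_pmc` would be passed as `process_fn` in place of the stand-in.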