Core API
The core API provides the main functions for processing PMC articles.
Primary Processing Function
process_single_pmc
pmcgrab.application.processing.process_single_pmc
process_single_pmc(
    pmc_id: str | int,
    *,
    download: bool = False,
    timeout: int = NCBI_TIMEOUT,
    metadata_only: bool = False,
    schema_version: int | None = None,
    output_style: str | None = None,
) -> ArticleOutput | None
Download and parse a single PMC article into a normalized dictionary.
Application-layer function that handles the complete processing pipeline for a single PMC article: fetching XML, parsing content, extracting structured data, and normalizing for JSON serialization. Includes thread-safe timeout protection and robust error handling.
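The timeout protection described above can be pictured with a small sketch. This is not pmcgrab's implementation: `call_with_timeout` is a hypothetical helper, and running the work in a single worker thread is an assumption about one reasonable way to bound a blocking call and turn failures into `None`.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Optional


def call_with_timeout(
    func: Callable[..., Any], timeout: float, *args: Any, **kwargs: Any
) -> Optional[Any]:
    """Run func in a worker thread; return None on timeout or on any error.

    Hypothetical helper for illustration only, not part of pmcgrab.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        # Wait at most `timeout` seconds for the worker to finish.
        return future.result(timeout=timeout)
    except FutureTimeout:
        return None
    except Exception:
        # Mirror the "returns None if processing fails" contract.
        return None
    finally:
        # Do not block on a stuck worker; it finishes in the background.
        executor.shutdown(wait=False)
```

A thread cannot be forcibly cancelled in Python, so a timed-out call in this sketch keeps running until it returns on its own; only the caller stops waiting.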
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pmc_id` | `str \| int` | PMC ID as int or string, with or without the … | required |
| `download` | `bool` | If True, cache raw XML locally in data/ directory for reuse. | `False` |
| `timeout` | `int` | Maximum seconds to wait for network/parsing (default: 60). | `NCBI_TIMEOUT` |
| `metadata_only` | `bool` | If True, allow metadata-only output without body sections. | `False` |
| `schema_version` | `int \| None` | Full-output schema version. Passing a schema version without … | `None` |
| `output_style` | `str \| None` | | `None` |
Returns:
| Type | Description |
|---|---|
| `ArticleOutput \| None` | Normalized article dictionary. The default clean paper output includes … (images and tables). Pass … V4/V3/V2 contracts. Returns None if processing fails or article has no usable content. |
Examples:
>>> article_data = process_single_pmc("7181753")
>>> if article_data:
...     print(f"Title: {article_data['paper']['title']}")
...     sections = article_data["paper"]["body"]
...     print(f"Sections: {[section['title'] for section in sections]}")
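Because the function may return `None`, callers often wrap the lookup in a small guard. The sketch below uses a hypothetical `summarize_article` helper and a hand-written stand-in dictionary; it assumes only the keys shown in the example above (`paper`, `title`, `body`, and a `title` per section), and the real `ArticleOutput` may carry more fields.

```python
from typing import Any, Mapping, Optional


def summarize_article(article: Optional[Mapping[str, Any]]) -> Optional[dict]:
    """Return the title and section titles, or None when processing failed.

    Hypothetical helper for illustration only, not part of pmcgrab.
    """
    if not article:
        return None
    paper = article.get("paper", {})
    sections = paper.get("body") or []  # tolerate a missing or empty body
    return {
        "title": paper.get("title"),
        "section_titles": [section.get("title") for section in sections],
    }


# Stand-in for a process_single_pmc result (illustrative data, not real output).
sample = {
    "paper": {
        "title": "Example title",
        "body": [{"title": "Introduction"}, {"title": "Methods"}],
    }
}
print(summarize_article(sample))
print(summarize_article(None))
```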
Source code in src/pmcgrab/application/processing.py
Email Management
next_email
pmcgrab.infrastructure.settings.next_email
Return the next email address in round-robin rotation.
Thread-safe via a lock-protected index counter.
Returns:
| Type | Description |
|---|---|
| `str` | Next email address from the configured pool |
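Round-robin rotation behind a lock-protected index can be sketched as follows. `EmailRotator` is a hypothetical class written for illustration, and the addresses are placeholders; pmcgrab's actual `next_email` reads from its own configured pool and may be structured differently.

```python
import threading
from typing import Sequence


class EmailRotator:
    """Hand out addresses in round-robin order (illustrative sketch only)."""

    def __init__(self, emails: Sequence[str]) -> None:
        if not emails:
            raise ValueError("email pool must not be empty")
        self._emails = list(emails)
        self._index = 0
        self._lock = threading.Lock()

    def next_email(self) -> str:
        # The lock makes the read-and-advance step atomic across threads.
        with self._lock:
            email = self._emails[self._index]
            self._index = (self._index + 1) % len(self._emails)
            return email


rotator = EmailRotator(["a@example.org", "b@example.org"])
print([rotator.next_email() for _ in range(3)])
```

Without the lock, two threads could read the same index before either advances it, so both would receive the same address.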
Source code in src/pmcgrab/infrastructure/settings.py
Example Usage
from pmcgrab.application.processing import process_single_pmc
from pmcgrab.infrastructure.settings import next_email

# Process a single PMC article
email = next_email()
data = process_single_pmc("7114487")
if data:
    print(f"Title: {data['paper']['title']}")
    print(f"PMCID: {data['identifiers']['pmcid']}")
    print(f"Sections: {[section['title'] for section in data['paper']['body']]}")
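The same pattern extends to several articles. The sketch below is a hypothetical `process_many` helper, not a pmcgrab function: it takes the processing function as an argument, so it runs here against a stub, and it relies only on the documented behavior that a failed article yields `None`.

```python
from typing import Any, Callable, Iterable, Optional, Union

PmcId = Union[str, int]
Processor = Callable[[PmcId], Optional[dict]]


def process_many(
    pmc_ids: Iterable[PmcId], processor: Processor
) -> tuple[dict[str, Any], list[str]]:
    """Apply processor to each ID; split the results into successes and failures.

    Hypothetical helper for illustration only, not part of pmcgrab.
    """
    results: dict[str, Any] = {}
    failed: list[str] = []
    for pmc_id in pmc_ids:
        key = str(pmc_id)
        article = processor(pmc_id)
        if article is None:
            failed.append(key)  # processing failed or no usable content
        else:
            results[key] = article
    return results, failed


def fake_processor(pmc_id: PmcId) -> Optional[dict]:
    """Stub standing in for process_single_pmc; fails for ID 2."""
    if str(pmc_id) == "2":
        return None
    return {"paper": {"title": f"Article {pmc_id}"}}


results, failed = process_many(["1", 2, "3"], fake_processor)
print(sorted(results), failed)
```

With pmcgrab installed, `process_single_pmc` could be passed in place of `fake_processor`, since it accepts a single positional ID.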