Interactive Jupyter Notebook Tutorial¶
Want hands-on experience with PMCGrab? Our interactive Jupyter notebook provides a complete walkthrough from installation to building AI-ready datasets.
What's Inside¶
The notebook covers:
- Single Paper Processing: Start with one paper and explore the output
- Batch Processing: Build a multi-paper dataset
- AI/ML Preparation: Structure data for RAG, vector databases, and LLM training
- Data Export: Save everything in organized formats
- Analysis: Visualize and understand your dataset
Quick Start¶
- Download the notebook:
- View on GitHub (renders properly)
-
Direct download (notebook file)
-
Install dependencies:
# Update uv first (or install if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install PMCGrab and Jupyter
uv add pmcgrab jupyter pandas matplotlib seaborn
- Launch Jupyter:
- Open the notebook and start processing papers!
Prerequisites¶
- Python 3.10+
- Internet connection (to fetch papers from PMC)
- Basic Python knowledge (helpful but not required)
What You'll Build¶
By the end of the notebook, you'll have:
- Individual Paper JSONs: Clean, structured data for each paper
- RAG Chunks: Ready for vector database ingestion
- Training Examples: Structured for LLM fine-tuning
- Dataset Analysis: Statistical overview of your collection
Perfect For¶
- First-time users wanting interactive exploration
- Data scientists building biomedical datasets
- AI researchers preparing training data
- Students learning about scientific text processing
Troubleshooting¶
Common Issues:¶
"Failed to process PMC..."
- Network connectivity issue
- Paper may not be open access
- Try a different PMC ID
"ModuleNotFoundError"
- Make sure you're using
uv run jupyter notebook - Verify installation:
uv run python -c "import pmcgrab; print('OK')"
Notebook won't start
- Check Python version:
python --version(need 3.10+) - Try:
uv add jupyterthenuv run jupyter notebook
Advanced Usage¶
Once you're comfortable with the basics:
- Scale up: Process 100s or 1000s of papers using the CLI
- Integrate: Connect to your vector database or ML pipeline
- Customize: Modify the notebook for your specific use case
Next Steps¶
After completing the notebook:
- Complete Beginner Guide: More detailed explanations
- CLI Reference: Command-line usage
- Advanced Examples: Production workflows
Need Help?¶
Ready to turn scientific literature into AI-ready data?
- View notebook on GitHub (fully rendered)
- Download notebook and start exploring!