jeremy-wayland
Collaborator

Summary

This PR addresses Issue #15 by adding local database management utilities to the Python package. Users can programmatically download the remote SQLite database and launch a local Datasette instance, either for offline access or for environments with restricted connectivity.

🚀 Main Features Implemented

Core Functions Added to apparent/utils.py:

  1. download_and_launch_local_datasette() - The primary function that:

    • Downloads the remote .db file from a configurable URL (default: https://apparent.topology.rocks/us_physician_referral_networks.db)
    • Launches a local Datasette server with comprehensive error handling
    • Supports configurable ports, timeouts, and Datasette settings
    • Returns process information (URL, PID, port, CSV endpoint)
  2. stop_local_datasette() - Process management utility that:

    • Gracefully terminates Datasette processes by port or PID
    • Implements fallback to force-kill if graceful shutdown fails
    • Uses psutil for robust cross-platform process management
  3. Supporting utilities:

    • download_file() - Robust file downloader with progress reporting for large files (~8GB database)
    • update_env_file() - Automatic .env management for seamless local/remote URL switching
    • list_datasette_processes() - Process discovery and monitoring utilities
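
A minimal sketch of the chunked-copy idea behind a downloader with MB-level progress reporting. This is not the code in apparent/utils.py: the function name `copy_with_progress`, its parameters, and the 1 MiB chunk size are assumptions made for illustration.

```python
import io
from typing import BinaryIO, Callable, Optional

MIB = 1024 * 1024  # bytes per reported "MB" step (assumed granularity)


def copy_with_progress(
    src: BinaryIO,
    dst: BinaryIO,
    total_bytes: Optional[int] = None,
    report: Callable[[str], None] = print,
    chunk_size: int = MIB,
) -> int:
    """Copy src to dst in chunks; report once each time a new whole MB is reached."""
    written = 0
    last_reported_mb = -1
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        dst.write(chunk)
        written += len(chunk)
        current_mb = written // MIB
        if current_mb != last_reported_mb:
            last_reported_mb = current_mb
            if total_bytes:
                report(f"Downloaded {current_mb} MB of {total_bytes // MIB} MB")
            else:
                report(f"Downloaded {current_mb} MB")
    return written


if __name__ == "__main__":
    # In-memory stand-in for a network stream, so the sketch runs offline.
    source = io.BytesIO(b"\0" * (2 * MIB))
    copy_with_progress(source, io.BytesIO(), total_bytes=2 * MIB)
```

With the requests library, the source stream could be `requests.get(url, stream=True).raw`, so the ~8GB file never has to be held in memory at once.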

🔧 Technical Implementation Highlights

  • Comprehensive error handling for missing dependencies (datasette, psutil)
  • Configurable Datasette settings (SQL timeout: 500s, max rows: 200k, CSV streaming control)
  • Process lifecycle management with graceful shutdown → force-kill fallback pattern
  • Progress reporting for large file downloads with MB-level granularity
  • Cross-platform compatibility with proper subprocess management
  • Verbose logging mode for debugging and monitoring
  • Automatic directory creation for database storage paths
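
As a rough illustration of how the configurable settings could be passed to Datasette, the sketch below builds a `datasette serve` command line. The helper name `build_datasette_command` is hypothetical, and the mapping of the PR's defaults onto Datasette's `sql_time_limit_ms`, `max_returned_rows`, and `allow_csv_stream` settings is an assumption; the module may wire these differently.

```python
from typing import Dict, List, Optional, Union

SettingValue = Union[int, str, bool]

# Assumed mapping of the defaults listed above onto Datasette setting names.
DEFAULT_SETTINGS: Dict[str, SettingValue] = {
    "sql_time_limit_ms": 500_000,  # 500 s expressed in milliseconds
    "max_returned_rows": 200_000,  # "max rows: 200k"
    "allow_csv_stream": True,      # illustrative value for "CSV streaming control"
}


def build_datasette_command(
    db_path: str,
    port: int = 8001,
    settings: Optional[Dict[str, SettingValue]] = None,
) -> List[str]:
    """Return an argv list for `datasette serve`, with caller overrides applied."""
    merged = {**DEFAULT_SETTINGS, **(settings or {})}
    cmd = ["datasette", "serve", db_path, "--port", str(port)]
    for name, value in merged.items():
        if isinstance(value, bool):
            value = "on" if value else "off"  # Datasette accepts on/off for booleans
        cmd += ["--setting", name, str(value)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_datasette_command("data/us_physician_referral_networks.db")))
```

The resulting list can be handed to `subprocess.Popen` directly, which avoids shell quoting issues across platforms.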

📊 Why This Matters

  • Reduces friction for users who need local access without manually running shell scripts
  • Solves connectivity issues in constrained environments (institutional firewalls, stream redirection problems)
  • Enables programmatic access within Python workflows (Jupyter notebooks, CLI tools, automated pipelines)
  • Supports cloud and HPC deployments (SLURM clusters, JupyterHub setups) and other restricted environments
  • Maintains backward compatibility with existing remote URL workflows

💡 Usage Examples

from apparent.utils import download_and_launch_local_datasette, stop_local_datasette

# Download DB and launch local Datasette server
result = download_and_launch_local_datasette(
    db_path="data/us_physician_referral_networks.db",
    port=8001,
    update_env=True,
    verbose=True
)

print(f"Datasette server running at: {result['url']}")
print(f"CSV endpoint available at: {result['csv_url']}")
print(f"Process ID: {result['pid']}")

# Later, stop the server
stop_local_datasette(port=8001, verbose=True)
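
One way to make sure the server is always stopped, even if the analysis code raises, is to wrap the two calls above in a context manager. The sketch below is not part of this PR: `local_datasette` is a hypothetical helper, and the launch and stop functions are passed in as arguments so the example runs without Datasette installed.

```python
from contextlib import contextmanager


@contextmanager
def local_datasette(launch, stop, port=8001, **launch_kwargs):
    """Start a server via launch(), yield its result, and always call stop()."""
    result = launch(port=port, **launch_kwargs)
    try:
        yield result
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        stop(port=port)


if __name__ == "__main__":
    # Stand-ins for download_and_launch_local_datasette / stop_local_datasette.
    def fake_launch(port, **kwargs):
        return {"url": f"http://localhost:{port}", "pid": None}

    def fake_stop(port):
        print(f"stopped server on port {port}")

    with local_datasette(fake_launch, fake_stop, port=8001) as server:
        print(f"server at {server['url']}")
```

In real use, `download_and_launch_local_datasette` and `stop_local_datasette` would be passed as `launch` and `stop`, with `db_path`, `update_env`, and `verbose` forwarded as keyword arguments to the launch call.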

🧪 Testing

  • Comprehensive test suite with 17 test cases covering all functionality
  • Mock-based testing for external dependencies (requests, subprocess, psutil)
  • Edge case handling (timeouts, missing processes, access denied scenarios)
  • Temporary directory isolation for filesystem operations
  • Process management testing including graceful and forced termination
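
To give a flavour of the mock-based approach, here is a small, self-contained sketch that tests a graceful-then-forced termination helper without starting any real process. It does not reproduce tests/test_utils.py: `terminate_process` is a hypothetical stand-in written against the `subprocess.Popen` interface, whereas the PR's implementation uses psutil.

```python
import subprocess
import unittest
from unittest.mock import MagicMock


def terminate_process(proc, timeout: float = 5.0) -> str:
    """Hypothetical helper: request termination, force-kill if it times out."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the process after the forced kill
        return "killed"


class TerminateProcessTests(unittest.TestCase):
    def test_graceful_shutdown(self):
        proc = MagicMock()  # wait() returns immediately by default
        self.assertEqual(terminate_process(proc), "terminated")
        proc.terminate.assert_called_once()
        proc.kill.assert_not_called()

    def test_force_kill_after_timeout(self):
        proc = MagicMock()
        # First wait() raises a timeout, second wait() (after kill) succeeds.
        proc.wait.side_effect = [
            subprocess.TimeoutExpired(cmd="datasette", timeout=5.0),
            None,
        ]
        self.assertEqual(terminate_process(proc), "killed")
        proc.kill.assert_called_once()


if __name__ == "__main__":
    unittest.main()
```

Mocking the process object keeps the tests fast and deterministic, since no port is bound and no timeout has to elapse.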

📈 Impact

This implementation transforms the existing bash-script-only database access (from tests/run-integration-tests.sh) into a first-class Python API, making the functionality accessible to:

  • Users installing via pip who don't have access to the repository scripts
  • Automated workflows and CI/CD pipelines
  • Cloud-based analysis environments with connectivity restrictions
  • Interactive Python sessions and Jupyter notebooks

Files Changed

  • apparent/utils.py (551 lines added) - New utilities module with all core functionality
  • tests/test_utils.py (419 lines added) - Comprehensive test suite
  • Documentation updates - README and docs reflecting new capabilities

Total: 970+ lines added (implementation and tests)


Closes #15 - Implements the requested download_and_launch_local_db() functionality (named download_and_launch_local_datasette() for clarity) along with related database management utilities.

…environments

Add comprehensive utilities module to enable downloading and running the US physician
referral networks database locally. This addresses connectivity issues and firewall
restrictions that prevent users from accessing the remote Datasette instance.

### New Features:
- `download_and_launch_local_datasette()`: Downloads SQLite database and starts local Datasette server
- `download_file()`: Robust file downloader with progress reporting and error handling
- `update_env_file()`: Automatic .env file management for switching between local/remote URLs
- `stop_local_datasette()`: Graceful termination of local Datasette processes by port or PID
- `list_datasette_processes()`: Process discovery and management utilities
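
For illustration, switching a URL in a `.env` file can be sketched as a replace-or-append on `KEY=value` lines. The helper name `set_env_value` and the key `APPARENT_DB_URL` are hypothetical; the actual `update_env_file()` may use different names and handle quoting or comments differently.

```python
from pathlib import Path
from typing import Union


def set_env_value(env_path: Union[str, Path], key: str, value: str) -> None:
    """Set KEY=value in a .env file, replacing an existing entry or appending one."""
    path = Path(env_path)
    lines = path.read_text().splitlines() if path.exists() else []
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(f"{key}="):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        env_file = Path(tmp) / ".env"
        # APPARENT_DB_URL is an assumed variable name, used only for this demo.
        set_env_value(env_file, "APPARENT_DB_URL", "http://localhost:8001")
        print(env_file.read_text())
```

Other entries in the file are left untouched, so switching back to the remote URL is a second call with the same key.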

### Technical Implementation:
- Comprehensive error handling for missing dependencies (datasette, psutil)
- Configurable Datasette settings (SQL timeout, max rows, CSV streaming)
- Process lifecycle management with graceful shutdown and force-kill fallback
- Progress reporting for large file downloads (~8GB database)
- Cross-platform compatibility with proper subprocess management
- Verbose logging mode for debugging and monitoring

### Testing:
- Complete test suite with 17 test cases covering all functionality
- Mock-based testing for external dependencies (requests, subprocess, psutil)
- Edge case handling (timeouts, missing processes, access denied scenarios)
- Temporary directory isolation for filesystem operations
- Process management testing including graceful and forced termination

### Integration:
- Seamless integration with existing Apparent workflow
- Automatic fallback between remote and local database URLs
- Environment variable management for easy configuration switching
- Compatible with existing data pulling and network analysis functionality

Resolves #15 - Download DB functionality for local development and restricted environments
Development

Successfully merging this pull request may close these issues.

📦 Add download_and_launch_local_db() to support remote DB access directly in Python package

2 participants