LLM-powered analysis of biodiversity-related investment activities in financial reports
BiodiversityASSET is a comprehensive pipeline for extracting, classifying, and analyzing biodiversity-related content from investor reports. The system uses LLMs to evaluate paragraphs across three key dimensions:
- πΏ Biodiversity relevance - Identifies content related to biodiversity and environmental impact
- π° Investment activity - Classifies paragraphs containing concrete investment activities
- π Assetization characteristics - Scores content on intrinsic value, cash flow, and ownership/control
- Key Features
- Quick Start
- Processing Pipeline
- Batch Job Management
- Prompt Customization
- Project Structure
- Output Organization
- Documentation
β¨ Modular Architecture
- Submit batch jobs and monitor progress independently
- Resume workflows from any step using batch IDs
- Cancel running jobs with safety confirmations
π€ LLM-Powered Processing
- OpenAI Batch API integration for cost-effective analysis (~50% cost reduction)
- External prompt system for easy customization
- Support for multiple models and configurations
π Organized Output
- Results saved in batch-specific subfolders
- Clean filenames without ID conflicts
- Individual chunk processing for large datasets
π§ Developer-Friendly
- Comprehensive CLI tools with intuitive options
- Detailed progress monitoring and error handling
- Flexible configuration and custom prompt support
The processing pipeline consists of sequential steps, with LLM-powered batch processing for steps 3-4:
| Step | Purpose | Script | Input | Output |
|---|---|---|---|---|
| 1 | Extract paragraphs from PDFs | extract_pdfs.py |
data/raw/pdfs/ |
extracted_paragraphs_from_pdfs/ |
| 2 | Filter biodiversity content | filter_biodiversity_paragraphs.py |
extracted_paragraphs_from_pdfs/ |
biodiversity_related_paragraphs/ |
| 3a | Submit investment classification | submit_batch_job.py |
biodiversity_related_paragraphs/ |
Returns batch ID |
| 3b | Monitor batch progress | check_batch_status.py |
Batch ID | Status updates |
| 3c | Download investment results | download_batch_results.py |
Batch ID | investment_activity_classification/ |
| 4a | Submit assetization scoring | submit_batch_job.py |
investment_activity_classification/ |
Returns batch ID |
| 4b | Monitor batch progress | check_batch_status.py |
Batch ID | Status updates |
| 4c | Download assetization results | download_batch_results.py |
Batch ID | assetization_features_scoring/ |
π‘ Key Points:
- Steps 3-4 use OpenAI's Batch API for cost-effective processing
- Each batch step can be run independently
- Step 4 requires a completed investment activity classification batch ID
- All results are organized in batch-specific subfolders
Ensure you have uv installed:
π¦ Install uv (click to expand)
Windows:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"Linux/MacOS:
curl -LsSf https://astral.sh/uv/install.sh | sh# Clone the repository
git clone <repository-url>
cd BiodiversityASSET
# Install dependencies
uv sync# Set your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key"
# Or create a .env file
echo "OPENAI_API_KEY=your-openai-api-key" > .envpython scripts/extract_pdfs.pypython scripts/filter_biodiversity_paragraphs.py# Submit the batch job
python scripts/submit_batch_job.py --task investment_activity_classification
# Monitor progress (replace <batch-id> with actual ID)
python scripts/check_batch_status.py --batch-id <batch-id> --wait
# Download results
python scripts/download_batch_results.py --batch-id <batch-id># Submit dependent job (requires investment batch ID)
python scripts/submit_batch_job.py --task assetization_features_scoring --batch-id <investment_batch_id>
# Monitor and download
python scripts/check_batch_status.py --batch-id <assetization_batch_id> --wait
python scripts/download_batch_results.py --batch-id <assetization_batch_id># List all batch jobs with LAST-CHECKED status and timestamps
python scripts/check_batch_status.py --list-jobs
# Check CURRENT status of a specific job
python scripts/check_batch_status.py --batch-id <batch-id>
# Wait for job completion (polls every 30 seconds)
python scripts/check_batch_status.py --batch-id <batch-id> --wait
# Custom polling interval
python scripts/check_batch_status.py --batch-id <batch-id> --wait --poll-interval 60# Cancel a running batch job (requires confirmation)
python scripts/check_batch_status.py --batch-id <batch-id> --cancel=== Batch Jobs (3 found) ===
Batch ID Task Status Last Checked Submitted Paragraphs
-------------------------------------------------------------------------------------------------------------------------------
batch_686fc36b2da08190903bc237510c52f5 investment_activity_class completed 07-10 16:55 2025-07-10T15:43 120
batch_686fd9e4f814819088b69150a57753d6 assetization_features_sc submitted never 2025-07-10T17:19 3
batch_686fdd5143248190aae3f8185f24a415 investment_activity_class in_progress 07-10 14:30 2025-07-10T14:15 274
BiodiversityASSET uses external text files for prompts, making them easy to customize without code changes:
prompts/investment_activity_classification_system_prompt.txt- System prompt for investment activity classificationprompts/assetization_features_scoring_system_prompt.txt- System prompt for assetization features scoringprompts/user_prompt_template.txt- User prompt template applied to each paragraph
# Use custom system prompt
python scripts/submit_batch_job.py --task investment_activity_classification \
--system-prompt prompts/my_custom_system.txt
# Use both custom system and user prompts
python scripts/submit_batch_job.py --task investment_activity_classification \
--system-prompt prompts/my_custom_system.txt \
--user-prompt prompts/my_custom_user.txt
# Use different model with custom prompts
python scripts/submit_batch_job.py --task assetization_features_scoring \
--batch-id <investment_batch_id> \
--model gpt-4o \
--max-tokens 750 \
--system-prompt prompts/my_custom_system.txtBiodiversityASSET/
βββ π data/
β βββ π raw/
β β βββ π pdfs/ # π Input: PDF investor reports
β βββ π processed/
β β βββ π extracted_paragraphs_from_pdfs/ # Step 1: Extracted paragraphs
β β βββ π biodiversity_related_paragraphs/ # Step 2: Filtered biodiversity content
β β βββ π investment_activity_classification/ # Step 3: Investment classification results
β β β βββ π <batch_id>/
β β β βββ π batch_results.jsonl
β β β βββ π chunk_1.csv
β β β βββ π chunk_2.csv
β β βββ π assetization_features_scoring/ # Step 4: Assetization scoring results
β β βββ π <batch_id>/
β β βββ π batch_results.jsonl
β β βββ π assetization_features_scored.csv
β βββ π human_annotations/ # π₯ Manual annotations for evaluation
βββ π prompts/ # π€ LLM prompt templates
β βββ π investment_activity_classification_system_prompt.txt
β βββ π assetization_features_scoring_system_prompt.txt
β βββ π user_prompt_template.txt
βββ π results/
β βββ π batch_jobs/ # π Batch job metadata and raw results
β β βββ π <batch_id>.json
β β βββ π investment_activity_classification_processing/
β β βββ π assetization_features_scoring_processing/
β βββ π evaluation/ # π Evaluation results (future)
βββ π scripts/ # π Python processing scripts
βββ βοΈ pyproject.toml # π¦ Project dependencies
βββ π uv.lock # π Lock file for dependencies
βββ π README.md # π Project documentation
βββ π BATCH_WORKFLOW.md # π Detailed batch processing workflow
βββ π REFACTORING_SUMMARY.md # π Summary of refactoring changes
Results are organized in batch-specific subfolders to prevent conflicts and enable easy tracking:
data/processed/investment_activity_classification/<batch_id>/
βββ π batch_results.jsonl # Raw API responses
βββ π chunk_1.csv # Processed results for chunk 1
βββ π chunk_2.csv # Processed results for chunk 2
Contains: Investment activity scores, explanations, and original paragraph metadata
data/processed/assetization_features_scoring/<batch_id>/
βββ π batch_results.jsonl # Raw API responses
βββ π assetization_features_scored.csv # Scored paragraphs with all dimensions
Contains: Intrinsic value, cash flow, and ownership/control scores with detailed reasoning
- π Conflict-free: Each batch job gets its own subfolder
- π·οΈ Clean naming: Filenames without batch ID suffixes
- π Traceable: Easy to identify which batch produced which results
- π Resumable: Can re-run or reference specific batch outputs
π BATCH_WORKFLOW.md - Detailed step-by-step workflow guide with examples
π REFACTORING_SUMMARY.md - Complete summary of system architecture and changes
We welcome contributions! Please see our contribution guidelines for more information.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use BiodiversityASSET in your research, please cite:
@software{biodiversityasset,
title={BiodiversityASSET: LLM-powered analysis of biodiversity-related investment activities},
author={SoDa},
year={2025},
url={https://github.com/yourusername/BiodiversityASSET}
}