Skip to content

catalinapapari1/BiodiversityASSET_SODA

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

7 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

BiodiversityASSET

LLM-powered analysis of biodiversity-related investment activities in financial reports

Python OpenAI uv

BiodiversityASSET is a comprehensive pipeline for extracting, classifying, and analyzing biodiversity-related content from investor reports. The system uses LLMs to evaluate paragraphs across three key dimensions:

  1. 🌿 Biodiversity relevance - Identifies content related to biodiversity and environmental impact
  2. πŸ’° Investment activity - Classifies paragraphs containing concrete investment activities
  3. πŸ“Š Assetization characteristics - Scores content on intrinsic value, cash flow, and ownership/control

Table of Contents

Key Features

✨ Modular Architecture

  • Submit batch jobs and monitor progress independently
  • Resume workflows from any step using batch IDs
  • Cancel running jobs with safety confirmations

πŸ€– LLM-Powered Processing

  • OpenAI Batch API integration for cost-effective analysis (~50% cost reduction)
  • External prompt system for easy customization
  • Support for multiple models and configurations

πŸ“ Organized Output

  • Results saved in batch-specific subfolders
  • Clean filenames without ID conflicts
  • Individual chunk processing for large datasets

πŸ”§ Developer-Friendly

  • Comprehensive CLI tools with intuitive options
  • Detailed progress monitoring and error handling
  • Flexible configuration and custom prompt support

Processing Pipeline

The processing pipeline consists of sequential steps, with LLM-powered batch processing for steps 3-4:

Step Purpose Script Input Output
1 Extract paragraphs from PDFs extract_pdfs.py data/raw/pdfs/ extracted_paragraphs_from_pdfs/
2 Filter biodiversity content filter_biodiversity_paragraphs.py extracted_paragraphs_from_pdfs/ biodiversity_related_paragraphs/
3a Submit investment classification submit_batch_job.py biodiversity_related_paragraphs/ Returns batch ID
3b Monitor batch progress check_batch_status.py Batch ID Status updates
3c Download investment results download_batch_results.py Batch ID investment_activity_classification/
4a Submit assetization scoring submit_batch_job.py investment_activity_classification/ Returns batch ID
4b Monitor batch progress check_batch_status.py Batch ID Status updates
4c Download assetization results download_batch_results.py Batch ID assetization_features_scoring/

πŸ’‘ Key Points:

  • Steps 3-4 use OpenAI's Batch API for cost-effective processing
  • Each batch step can be run independently
  • Step 4 requires a completed investment activity classification batch ID
  • All results are organized in batch-specific subfolders

Quick Start

Prerequisites

Ensure you have uv installed:

πŸ“¦ Install uv (click to expand)

Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Linux/MacOS:

curl -LsSf https://astral.sh/uv/install.sh | sh

πŸš€ Installation

# Clone the repository
git clone <repository-url>
cd BiodiversityASSET

# Install dependencies
uv sync

βš™οΈ Environment Setup

# Set your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key"

# Or create a .env file
echo "OPENAI_API_KEY=your-openai-api-key" > .env

πŸ“ Basic Usage

1. Extract paragraphs from PDFs

python scripts/extract_pdfs.py

2. Filter biodiversity-related content

python scripts/filter_biodiversity_paragraphs.py

3. Classify investment activities

# Submit the batch job
python scripts/submit_batch_job.py --task investment_activity_classification

# Monitor progress (replace <batch-id> with actual ID)
python scripts/check_batch_status.py --batch-id <batch-id> --wait

# Download results
python scripts/download_batch_results.py --batch-id <batch-id>

4. Score assetization features

# Submit dependent job (requires investment batch ID)
python scripts/submit_batch_job.py --task assetization_features_scoring --batch-id <investment_batch_id>

# Monitor and download
python scripts/check_batch_status.py --batch-id <assetization_batch_id> --wait
python scripts/download_batch_results.py --batch-id <assetization_batch_id>

Batch Job Management

πŸ“Š Monitoring Jobs

# List all batch jobs with LAST-CHECKED status and timestamps
python scripts/check_batch_status.py --list-jobs

# Check CURRENT status of a specific job
python scripts/check_batch_status.py --batch-id <batch-id>

# Wait for job completion (polls every 30 seconds)
python scripts/check_batch_status.py --batch-id <batch-id> --wait

# Custom polling interval
python scripts/check_batch_status.py --batch-id <batch-id> --wait --poll-interval 60

❌ Canceling Jobs

# Cancel a running batch job (requires confirmation)
python scripts/check_batch_status.py --batch-id <batch-id> --cancel

πŸ“‹ Example Job Listing Output

=== Batch Jobs (3 found) ===
Batch ID                              Task                      Status          Last Checked      Submitted         Paragraphs  
-------------------------------------------------------------------------------------------------------------------------------
batch_686fc36b2da08190903bc237510c52f5 investment_activity_class completed       07-10 16:55       2025-07-10T15:43  120         
batch_686fd9e4f814819088b69150a57753d6 assetization_features_sc  submitted       never             2025-07-10T17:19  3           
batch_686fdd5143248190aae3f8185f24a415 investment_activity_class in_progress     07-10 14:30       2025-07-10T14:15  274         

Prompt Customization

BiodiversityASSET uses external text files for prompts, making them easy to customize without code changes:

πŸ“ Default Prompt Files

  • prompts/investment_activity_classification_system_prompt.txt - System prompt for investment activity classification
  • prompts/assetization_features_scoring_system_prompt.txt - System prompt for assetization features scoring
  • prompts/user_prompt_template.txt - User prompt template applied to each paragraph

πŸ› οΈ Using Custom Prompts

# Use custom system prompt
python scripts/submit_batch_job.py --task investment_activity_classification \
    --system-prompt prompts/my_custom_system.txt

# Use both custom system and user prompts
python scripts/submit_batch_job.py --task investment_activity_classification \
    --system-prompt prompts/my_custom_system.txt \
    --user-prompt prompts/my_custom_user.txt

# Use different model with custom prompts
python scripts/submit_batch_job.py --task assetization_features_scoring \
    --batch-id <investment_batch_id> \
    --model gpt-4o \
    --max-tokens 750 \
    --system-prompt prompts/my_custom_system.txt

Project Structure

BiodiversityASSET/
β”œβ”€β”€ πŸ“ data/
β”‚   β”œβ”€β”€ πŸ“ raw/
β”‚   β”‚   └── πŸ“ pdfs/                     # πŸ“„ Input: PDF investor reports
β”‚   β”œβ”€β”€ πŸ“ processed/
β”‚   β”‚   β”œβ”€β”€ πŸ“ extracted_paragraphs_from_pdfs/      # Step 1: Extracted paragraphs
β”‚   β”‚   β”œβ”€β”€ πŸ“ biodiversity_related_paragraphs/     # Step 2: Filtered biodiversity content
β”‚   β”‚   β”œβ”€β”€ πŸ“ investment_activity_classification/  # Step 3: Investment classification results
β”‚   β”‚   β”‚   └── πŸ“ <batch_id>/
β”‚   β”‚   β”‚       β”œβ”€β”€ πŸ“Š batch_results.jsonl
β”‚   β”‚   β”‚       β”œβ”€β”€ πŸ“Š chunk_1.csv
β”‚   β”‚   β”‚       └── πŸ“Š chunk_2.csv
β”‚   β”‚   └── πŸ“ assetization_features_scoring/       # Step 4: Assetization scoring results
β”‚   β”‚       └── πŸ“ <batch_id>/
β”‚   β”‚           β”œβ”€β”€ πŸ“Š batch_results.jsonl
β”‚   β”‚           └── πŸ“Š assetization_features_scored.csv
β”‚   └── πŸ“ human_annotations/            # πŸ‘₯ Manual annotations for evaluation
β”œβ”€β”€ πŸ“ prompts/                          # πŸ€– LLM prompt templates
β”‚   β”œβ”€β”€ πŸ“ investment_activity_classification_system_prompt.txt
β”‚   β”œβ”€β”€ πŸ“ assetization_features_scoring_system_prompt.txt
β”‚   └── πŸ“ user_prompt_template.txt
β”œβ”€β”€ πŸ“ results/
β”‚   β”œβ”€β”€ πŸ“ batch_jobs/                   # πŸ“‹ Batch job metadata and raw results
β”‚   β”‚   β”œβ”€β”€ πŸ“„ <batch_id>.json
β”‚   β”‚   β”œβ”€β”€ πŸ“ investment_activity_classification_processing/
β”‚   β”‚   └── πŸ“ assetization_features_scoring_processing/
β”‚   └── πŸ“ evaluation/                   # πŸ“ˆ Evaluation results (future)
β”œβ”€β”€ πŸ“ scripts/                          # 🐍 Python processing scripts
β”œβ”€β”€ βš™οΈ pyproject.toml                    # πŸ“¦ Project dependencies
β”œβ”€β”€ πŸ”’ uv.lock                           # πŸ” Lock file for dependencies
β”œβ”€β”€ πŸ“– README.md                         # πŸ“š Project documentation
β”œβ”€β”€ πŸ“– BATCH_WORKFLOW.md                 # πŸ”„ Detailed batch processing workflow
└── πŸ“– REFACTORING_SUMMARY.md            # πŸ“ Summary of refactoring changes

Output Organization

Results are organized in batch-specific subfolders to prevent conflicts and enable easy tracking:

πŸ’Ό Investment Activity Classification

data/processed/investment_activity_classification/<batch_id>/
β”œβ”€β”€ πŸ“Š batch_results.jsonl              # Raw API responses
β”œβ”€β”€ πŸ“Š chunk_1.csv                      # Processed results for chunk 1
└── πŸ“Š chunk_2.csv                      # Processed results for chunk 2

Contains: Investment activity scores, explanations, and original paragraph metadata

πŸ“ˆ Assetization Features Scoring

data/processed/assetization_features_scoring/<batch_id>/
β”œβ”€β”€ πŸ“Š batch_results.jsonl              # Raw API responses
└── πŸ“Š assetization_features_scored.csv # Scored paragraphs with all dimensions

Contains: Intrinsic value, cash flow, and ownership/control scores with detailed reasoning

πŸ”‘ Key Benefits

  • πŸ”’ Conflict-free: Each batch job gets its own subfolder
  • 🏷️ Clean naming: Filenames without batch ID suffixes
  • πŸ“ Traceable: Easy to identify which batch produced which results
  • πŸ”„ Resumable: Can re-run or reference specific batch outputs

Documentation

πŸ“– BATCH_WORKFLOW.md - Detailed step-by-step workflow guide with examples

πŸ“ REFACTORING_SUMMARY.md - Complete summary of system architecture and changes


Contributing

We welcome contributions! Please see our contribution guidelines for more information.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use BiodiversityASSET in your research, please cite:

@software{biodiversityasset,
  title={BiodiversityASSET: LLM-powered analysis of biodiversity-related investment activities},
  author={SoDa},
  year={2025},
  url={https://github.com/yourusername/BiodiversityASSET}
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%