@Opsmithe Opsmithe commented Nov 10, 2025

Fixes

Description

This PR adds comprehensive DOAJ API v4 integration to the quantifying commons project, enabling collection and analysis of Creative Commons licensed academic journals. The implementation includes two main components:

  1. scripts/1-fetch/doaj_fetch.py - Main data collection script for DOAJ journals
  2. dev/generate_country_codes.py - Utility for programmatic ISO country code generation

Key Features

  • DOAJ API v4 integration with enhanced metadata collection
  • Creative Commons license analysis (BY, NC, SA, ND combinations)
  • Publisher and geographic distribution analysis
  • Temporal filtering to prevent CC license false positives (default: ≥2002)
  • Automatic country code generation and mapping
  • Comprehensive error handling and data validation
  • Self-contained execution with auto-dependency resolution

Useful Links

Articles:

Journals:

Technical details

API Integration

  • Endpoint: base URL `https://doaj.org/api/v4/`
  • Rate Limiting: 0.5 seconds between requests
  • Pagination: 100 journals per page with automatic pagination
  • Error Handling: Comprehensive exception handling without swallowing errors
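
A minimal sketch of how the pagination and rate limiting listed above could fit together. The names `iter_journals` and `fetch_page`, and the `results` key in the page payload, are assumptions for illustration and are not taken from the script; the HTTP request itself is left to the injected `fetch_page` callable.

```python
import time
from typing import Callable, Iterator

PAGE_SIZE = 100      # journals per page, as listed above
REQUEST_DELAY = 0.5  # seconds between requests, as listed above


def iter_journals(
    fetch_page: Callable[[int, int], dict],
    limit: int,
    delay: float = REQUEST_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield up to `limit` journal records, requesting one page at a time.

    `fetch_page(page, page_size)` is a hypothetical callable that returns
    one decoded page; the "results" key is an assumed payload layout.
    """
    page = 1
    yielded = 0
    while yielded < limit:
        payload = fetch_page(page, PAGE_SIZE)
        results = payload.get("results", [])
        if not results:
            break  # an empty page means there is nothing left to fetch
        for record in results:
            yield record
            yielded += 1
            if yielded >= limit:
                return
        page += 1
        sleep(delay)  # pause before requesting the next page
```

Injecting `fetch_page` and `sleep` keeps the loop testable without network access or real delays.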

Data Quality Measures

  • Date Filtering: Default --date-back=2002 to avoid retroactive CC license false positives
  • License Validation: Only processes journals with valid CC license declarations
  • Country Mapping: ISO 3166-1 alpha-2 codes automatically mapped to readable names

Output Files Generated

data/2025Q4/1-fetch/
├── doaj_1_count.csv                    # License type counts
├── doaj_2_count_by_subject_report.csv  # Subject classification analysis
├── doaj_3_count_by_language.csv        # Language distribution
├── doaj_4_count_by_year.csv            # Temporal analysis by oa_start year
├── doaj_5_count_by_publisher.csv       # Publisher and country analysis
├── doaj_provenance.yaml                # Execution metadata and audit trail
└── iso_country_codes.yaml              # Auto-generated country mapping

Query Strategy

License Extraction

def extract_license_type(license_info):
    """Extract CC license type from DOAJ license information."""
    if not license_info:
        return "UNKNOWN CC legal tool"
    for lic in license_info:
        lic_type = lic.get("type", "")
        if lic_type in CC_LICENSE_TYPES:
            return lic_type
    return "UNKNOWN CC legal tool"
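
The function above returns the first entry whose type is a recognized CC license and falls back to a placeholder otherwise. A self-contained sketch of that behaviour follows; the contents of `CC_LICENSE_TYPES` here are an assumption for illustration, since the PR description does not list them.

```python
# Hypothetical set of recognized types; the script's real list may differ.
CC_LICENSE_TYPES = {
    "CC BY",
    "CC BY-SA",
    "CC BY-NC",
    "CC BY-ND",
    "CC BY-NC-SA",
    "CC BY-NC-ND",
}


def extract_license_type(license_info):
    """Extract CC license type from DOAJ license information."""
    if not license_info:
        return "UNKNOWN CC legal tool"
    for lic in license_info:
        lic_type = lic.get("type", "")
        if lic_type in CC_LICENSE_TYPES:
            return lic_type  # first recognized CC type wins
    return "UNKNOWN CC legal tool"


# A journal listing a non-CC license first still resolves to its CC license.
mixed = [{"type": "Publisher's own license"}, {"type": "CC BY-NC"}]
print(extract_license_type(mixed))

# Missing or unrecognized license data falls back to the placeholder.
print(extract_license_type([]))
print(extract_license_type([{"type": "Publisher's own license"}]))
```

Note that a journal declaring several CC licenses is counted only under the first one found.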

Date Filtering Implementation

# Apply date-back filter if specified
if args.date_back and oa_start and oa_start < args.date_back:
    continue
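
The same condition can be sketched as a standalone predicate to make its edge cases visible. `keep_journal` is a hypothetical helper name, and `oa_start` is assumed to be an integer year or `None`.

```python
def keep_journal(oa_start, date_back):
    """Return True if a journal passes the date-back filter.

    Mirrors the condition above: a journal is skipped only when both
    values are present and oa_start is earlier than date_back.
    """
    if date_back and oa_start and oa_start < date_back:
        return False
    return True


print(keep_journal(1999, 2002))  # started before the cutoff
print(keep_journal(2002, 2002))  # started in the cutoff year
print(keep_journal(None, 2002))  # no oa_start recorded
```

One consequence of this condition is that journals with no recorded `oa_start` are kept rather than filtered out.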

Publisher Analysis

# Extract publisher information (new in v4)
publisher_info = bibjson.get("publisher", {})
if publisher_info:
    publisher_name = publisher_info.get("name", "Unknown")
    publisher_country = publisher_info.get("country", "Unknown")
    publisher_key = f"{publisher_name}|{publisher_country}"
    publisher_counts[license_type][publisher_key] += 1
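
For the nested increment above to work without key checks, `publisher_counts` needs to create missing entries on demand. A minimal sketch, assuming a `defaultdict` of `Counter` objects (the PR description does not show how the script initializes it) and a hypothetical `count_publisher` wrapper:

```python
from collections import Counter, defaultdict


def count_publisher(bibjson, license_type, publisher_counts):
    """Tally one journal's publisher under its license type."""
    publisher_info = bibjson.get("publisher", {})
    if publisher_info:
        name = publisher_info.get("name", "Unknown")
        country = publisher_info.get("country", "Unknown")
        publisher_counts[license_type][f"{name}|{country}"] += 1


# Assumed structure: license type -> Counter of "name|country" keys.
publisher_counts = defaultdict(Counter)

# Illustrative sample records, not real DOAJ data.
sample = [
    ({"publisher": {"name": "Example Press", "country": "DE"}}, "CC BY"),
    ({"publisher": {"name": "Example Press", "country": "DE"}}, "CC BY"),
    ({"publisher": {"name": "Sample House"}}, "CC BY-NC"),
    ({}, "CC BY"),  # no publisher block, so nothing is counted
]
for bibjson, license_type in sample:
    count_publisher(bibjson, license_type, publisher_counts)

# Split on the last "|" so a "|" inside a publisher name survives.
for key, count in publisher_counts["CC BY"].items():
    name, country = key.rsplit("|", 1)
    print(name, country, count)
```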

Auto-Dependency Resolution

# Generate country codes file if it doesn't exist
if not os.path.isfile(country_file):
    LOGGER.info("Country codes file not found, generating it...")
    generate_script = shared.path_join(PATHS["repo"], "dev", "generate_country_codes.py")
    subprocess.run([sys.executable, generate_script], check=True)
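
A self-contained sketch of the same generate-if-missing pattern. `ensure_generated` is a hypothetical helper, not part of the script; it adds a check that the generator actually produced the file.

```python
import os
import subprocess
import sys


def ensure_generated(target_file, generator_args):
    """Run a Python generator only when target_file does not exist yet.

    `generator_args` are the arguments passed to the current interpreter,
    for example a script path. Returns True if the generator was run.
    """
    if os.path.isfile(target_file):
        return False
    # check=True raises CalledProcessError if the generator exits non-zero.
    subprocess.run([sys.executable, *generator_args], check=True)
    if not os.path.isfile(target_file):
        raise FileNotFoundError(f"Generator did not create {target_file}")
    return True
```

Using `sys.executable` runs the generator with the same interpreter and environment as the calling script.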

Tests

Basic Code Execution

# Run with default settings (date-back=2002, limit=1000)
pipenv run ./scripts/1-fetch/doaj_fetch.py --enable-save

# Run with custom parameters
pipenv run ./scripts/1-fetch/doaj_fetch.py --limit 50 --date-back 2020 --enable-save

# Test country code generation
pipenv run ./dev/generate_country_codes.py

# Static analysis
pipenv run pre-commit run --files scripts/1-fetch/doaj_fetch.py

Data Quality Note

Please Note: DOAJ data represents journal-level licensing policies, not individual article licenses. This data should be interpreted as indicators of institutional commitment to CC licensing rather than precise counts of CC-licensed articles. The --date-back=2002 default prevents false positives from journals that retroactively adopted CC licenses.

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

- Migrate from DOAJ API v3 to v4 for enhanced metadata access
- Add comprehensive CC license analysis for academic journals
- Implement publisher and geographic distribution analysis
- Add programmatic ISO 3166-1 alpha-2 country code generation
- Include automatic dependency resolution and error handling
- Apply date filtering (default ≥2002) to prevent false positives
- Generate 5 CSV files plus provenance for comprehensive analysis
- Ensure static analysis compliance and comprehensive testing

This integration enables quantification of institutional commitment
to Creative Commons licensing in the scholarly publishing ecosystem.
@Opsmithe Opsmithe requested review from a team as code owners November 10, 2025 11:13
@Opsmithe Opsmithe requested review from Shafiya-Heena and TimidRobot and removed request for a team November 10, 2025 11:13
@Opsmithe
Author

@TimidRobot, hello. I have implemented a fetch script that collects CC license information from the DOAJ data source through its API. To avoid false positives, the script reads each journal's license field, which holds the journal's actual licenses. I have also added a --date-back argument for filtering, because older articles published before CC licenses existed may not have CC licenses. I'd like a review of this.

Comment on lines 160 to 163
LOGGER.error(f"Failed to generate country codes file: {e}")
raise shared.QuantifyingException(
    f"Critical error generating country codes: {e}", exit_code=1
)
Contributor

I don't think you need to log an error here, as the raised exception will log a message to the terminal.

Comment on lines +326 to +327
if not license_info:
    continue
Contributor

I was wondering, wouldn't it be better to log a warning here saying that you skipped this journal because there is no CC license?

Member

I think this will depend on how many warnings are generated. If the minority of log messages are warnings, I think they'll be helpful. If the majority are, then it becomes noise.

Member

Please remove this script and use pycountry instead (which will also require updated pipenv files).

@TimidRobot
Member

The data returned appears to focus primarily on articles. Given the lack of licensing information on the articles, I think the focus should be on the journals with article information providing context. Even though a lot of the data currently returned is really interesting, I think it is out of scope for this project.

@TimidRobot TimidRobot self-assigned this Nov 14, 2025
@Opsmithe
Author

Opsmithe commented Nov 15, 2025

The data returned appears to focus primarily on articles. Given the lack of licensing information on the articles, I think the focus should be on the journals with article information providing context. Even though a lot of the data currently returned is really interesting, I think it is out of scope for this project.

@TimidRobot, the script actually focuses on journals, as these are the only records with license fields. Articles in the DOAJ database do not have license fields, and a full-text search would be slow and unreliable. I could provide context on the number of articles per journal. Knowing how many journals allow CC licenses, and the total article volume in those journals, could give meaningful context while acknowledging the API limitation that prevents counting actual CC-licensed articles.



Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Add Directory of Open Access Journal(DOAJ) as new source to Improve Quantifying Creative Commons

3 participants