Copilot AI commented Oct 20, 2025

Summary

This PR implements comprehensive processing of course XML archives from edX.org and Open edX instances, extracting all block-level data into the raw layer and providing clean dbt staging models for downstream analysis. Previously, only specific metadata (course info, videos, certificates, policy) was extracted. This change processes the complete XML structure including all blocks and components.

Motivation

The existing extract_edxorg_courserun_metadata asset extracted only targeted metadata fields from course XML archives. The complete course structure—including all chapters, sequentials, verticals, and content blocks—was not being systematically loaded into the data warehouse for comprehensive analysis. This gap limited the ability to analyze:

  • Complete course structure and hierarchy
  • All block types and their XML attributes
  • Direct XML-to-warehouse data lineage
  • Block-level course content analytics

Changes Made

1. XML Block Extraction Function

Added process_course_xml_blocks() in packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py:

  • Extracts all blocks from course XML archives (not just specific metadata)
  • Processes 11+ block types: chapter, sequential, vertical, video, problem, html, discussion, lti, lti_consumer, word_cloud, poll_question
  • Captures complete XML metadata: display names, attributes, file paths
  • Includes type-specific fields: video duration/edx_video_id, problem weight/max_attempts
  • Handles malformed XML gracefully instead of failing the run
  • Returns structured JSONL data ready for raw layer ingestion
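The extraction approach can be sketched as follows. This is a simplified illustration, not the actual `process_course_xml_blocks()` implementation; the function name `extract_blocks` and the exact attribute names (`url_name`, `display_name`) are assumptions based on standard OLX course XML conventions:

```python
import xml.etree.ElementTree as ET

# Block types handled by the extractor (per the list above).
BLOCK_TYPES = {
    "chapter", "sequential", "vertical", "video", "problem", "html",
    "discussion", "lti", "lti_consumer", "word_cloud", "poll_question",
}


def extract_blocks(xml_text: str, course_id: str, xml_path: str) -> list[dict]:
    """Parse one course XML file and return a JSONL-ready row per block.

    Malformed XML is skipped rather than raised, mirroring the
    'handles malformed XML gracefully' behavior described above.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    rows = []
    for elem in root.iter():
        if elem.tag not in BLOCK_TYPES:
            continue
        row = {
            "course_id": course_id,
            "block_type": elem.tag,
            "block_id": elem.get("url_name"),
            "block_display_name": elem.get("display_name"),
            "xml_attributes": dict(elem.attrib),
            "xml_path": xml_path,
        }
        # Type-specific fields, captured only where relevant.
        if elem.tag == "video":
            row["edx_video_id"] = elem.get("edx_video_id")
        elif elem.tag == "problem":
            row["max_attempts"] = elem.get("max_attempts")
            row["weight"] = elem.get("weight")
        rows.append(row)
    return rows
```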

2. Dagster Asset Integration

Updated extract_edxorg_courserun_metadata multi-asset in both:

  • dg_projects/edxorg/edxorg/assets/openedx_course_archives.py
  • dg_projects/openedx/openedx/assets/openedx_course_archives.py

Added new course_xml_blocks output that:

  • Writes to S3 via s3file_io_manager at edxorg/processed_data/course_xml_blocks/{source_system}/{course_id}/{version}.json
  • Leverages existing partitioning strategy (course_id + source_system)
  • Includes metadata: course_id, object_key, block_count
  • Uses SHA256 hashing for data versioning
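The output handling can be sketched like this; it is a hand-rolled illustration of the JSONL serialization, object-key layout, and SHA256 data versioning (the real asset delegates S3 writes to `s3file_io_manager`, and the helper names here are hypothetical):

```python
import hashlib
import json


def serialize_blocks(blocks: list[dict]) -> tuple[bytes, str]:
    """Serialize block rows to JSONL and compute a SHA256 data version.

    The digest lets Dagster detect whether a course archive's extracted
    content actually changed between materializations.
    """
    payload = "\n".join(json.dumps(b, sort_keys=True) for b in blocks).encode()
    return payload, hashlib.sha256(payload).hexdigest()


def object_key(source_system: str, course_id: str, version: str) -> str:
    # Key layout used for the new output (per the PR description).
    return (
        f"edxorg/processed_data/course_xml_blocks/"
        f"{source_system}/{course_id}/{version}.json"
    )
```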

3. dbt Source Definition

Added raw__edxorg__s3__course_xml_blocks source in src/ol_dbt/models/staging/edxorg/_edxorg_sources.yml:

  • Complete column documentation for 12 fields
  • Core identifiers: course_id, block_id, block_type
  • XML metadata: block_display_name, xml_attributes, xml_path
  • Type-specific fields: video (edx_video_id, duration), problem (max_attempts, weight, markdown)
  • Configured for Airbyte S3 source connector ingestion

4. dbt Staging Model

Created src/ol_dbt/models/staging/edxorg/stg__edxorg__s3__course_xml_blocks.sql:

  • Implements deduplication using deduplicate_raw_table macro on (course_id, block_id, block_type)
  • Applies semantic naming convention: courserun_readable_id, coursestructure_xml_block_*, video_*, problem_*
  • Type casts video duration to decimal(38, 4)
  • Standardizes timestamps to ISO8601 via cast_timestamp_to_iso8601 macro
  • Preserves video and problem metadata fields conditionally
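The deduplication semantics can be illustrated in Python — a hand-rolled equivalent of keeping the most recent row per natural key, assuming a `retrieved_at` column orders the versions (the actual model uses the `deduplicate_raw_table` dbt macro in SQL):

```python
def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep the latest row (by retrieved_at) for each natural key.

    Mirrors what the dedup macro does over
    (course_id, block_id, block_type) in SQL.
    """
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = (row["course_id"], row["block_id"], row["block_type"])
        if key not in latest or row["retrieved_at"] > latest[key]["retrieved_at"]:
            latest[key] = row
    return list(latest.values())
```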

5. Model Documentation

Added comprehensive documentation in src/ol_dbt/models/staging/edxorg/_stg__edxorg__models.yml:

  • Model description explaining purpose and content
  • Column-level documentation for all 12 fields
  • Data quality tests:
    • not_null constraints on key fields
    • Compound unique test on [courserun_readable_id, coursestructure_xml_block_id, coursestructure_xml_block_type, coursestructure_xml_retrieved_at]

Data Flow

Course XML Archives (S3)
    ↓
Dagster: extract_edxorg_courserun_metadata
    → process_course_xml_blocks()
    ↓
S3: edxorg/processed_data/course_xml_blocks/
    ↓
Airbyte S3 Source Connector
    ↓
Raw Layer: raw__edxorg__s3__course_xml_blocks
    ↓
dbt Staging: stg__edxorg__s3__course_xml_blocks
    ↓
Ready for marts and analytics

Benefits

For Data Analysts:

  • Access to complete course XML structure (not just select metadata)
  • Video metadata for content analysis
  • Problem metadata for assessment analytics
  • Direct mapping to XML source files for validation

For Data Engineering:

  • Systematic XML processing (no ad-hoc scripts)
  • Proper data lineage and versioning via SHA256
  • Incremental loading via existing partitions
  • Quality tests prevent data issues

For Analytics:

  • Foundation for comprehensive course structure analysis
  • Block-level granularity for detailed reporting
  • Historical tracking of course content changes
  • Rich metadata for content analytics

Integration Notes

This implementation:

  • Follows existing pipeline patterns (Dagster → S3 → Airbyte → dbt)
  • Complements existing raw__edxorg__s3__course_blocks (from JSON structure API)
  • Provides XML-native attributes not available in JSON structure
  • Uses existing partitioning and resource management

Testing

Completed:

  • ✅ Python code passes ruff formatting and linting
  • ✅ Python code passes mypy type checking
  • ✅ YAML files validated for syntax
  • ✅ SQL files formatted with sqlfmt

Deferred (network/environment constraints):

  • ⚠️ dbt compile/test requires package installation (hub.getdbt.com unavailable in build environment)
  • ⚠️ End-to-end testing requires Airbyte configuration and actual XML archives

Post-Merge Actions

  1. Configure the Airbyte S3 source to sync edxorg/processed_data/course_xml_blocks/ into raw__edxorg__s3__course_xml_blocks
  2. Trigger Dagster asset materialization for existing course archives
  3. Run dbt test --select stg__edxorg__s3__course_xml_blocks to validate data quality
  4. Monitor initial data loads and verify block counts
  5. Consider creating intermediate models that join XML blocks with JSON structure blocks

Files Changed

  • 7 files, 284 lines added
  • 3 Python files (extraction logic + asset updates)
  • 3 dbt files (source, staging model, documentation)
  • 1 gitignore update (exclude dbt-generated .user.yml)

Closes #[issue_number]

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • astral.sh
    • Triggering command: curl -LsSf REDACTED (dns block)
  • fishtownanalytics.sinter-collect.com
    • Triggering command: /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/python /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/dbt deps (dns block)
    • Triggering command: /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/python /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/dbt parse --profiles-dir . --project-dir . --target dev (dns block)
  • hub.getdbt.com
    • Triggering command: /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/python /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/dbt deps (dns block)

Original prompt

This section details the original issue to be resolved

<issue_title>Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models</issue_title>
<issue_description>## Summary

Process course XML contents from Open edX and edX.org course archives to load them into the raw layer of the data lakehouse and create dbt staging models for downstream analysis.

Background

Currently, the repository has functionality to extract specific metadata from course XML archives (course metadata, video details, certificate signatories, and policy information) through the extract_edxorg_courserun_metadata multi-asset in src/ol_orchestrate/assets/openedx_course_archives.py. However, the raw XML contents and course structure data are not being systematically loaded into the raw layer of the data warehouse for comprehensive analysis and transformation through dbt.

Current State

The existing implementation:

  • Extracts course XML archives from edX.org and Open edX instances
  • Processes specific elements using functions in src/ol_orchestrate/lib/openedx.py:
    • process_course_xml() - extracts course metadata
    • process_video_xml() - extracts video elements
    • process_policy_json() - extracts policy information
  • Outputs processed data to S3 as JSON/JSONL files
  • Has some staging models in src/ol_dbt/models/staging/edxorg/ that reference raw course structure data

What's Missing

The complete course XML contents, including all course components, blocks, and their relationships, need to be:

  1. Loaded into the raw layer (ol_warehouse_*_raw schemas)
  2. Staged through dbt models for consistent transformation and data quality
  3. Made available for downstream marts and analytics

Requirements

1. Raw Layer Data Loading

Objective: Load course XML contents into raw layer tables

Tasks:

  • Design schema for raw course XML data tables
    • Consider table structure for course blocks/components
    • Include metadata fields (retrieved_at, source_system, course_id, etc.)
    • Determine granularity (one row per block, per file, etc.)
  • Create Dagster assets to extract and load course XML data
    • Extend or create new assets in src/ol_orchestrate/assets/
    • Parse XML structure comprehensively (not just metadata)
    • Handle both edX.org and Open edX course formats
    • Implement incremental loading strategy
  • Configure data quality checks
    • Validate XML parsing completeness
    • Check for required fields
    • Monitor data freshness

Data Sources:

  • edX.org course archives (production and edge)
  • Open edX instance course exports
  • Archives stored in S3 buckets

Target Schema Pattern:

raw__edxorg__s3__course_xml_<entity>
raw__openedx__s3__course_xml_<entity>

2. dbt Staging Models

Objective: Create staging models to transform raw course XML data into clean, typed datasets

Tasks:

  • Generate dbt source definitions
    • Use bin/dbt-create-staging-models.py utility to scaffold sources
    • Define source freshness checks
    • Document all source columns
  • Create staging models
    • Build staging models in src/ol_dbt/models/staging/edxorg/
    • Apply consistent naming conventions (stg__edxorg__s3__course_xml_*)
    • Implement standard transformations:
      • Timestamp standardization (ISO8601)
      • JSON parsing and flattening where appropriate
      • Deduplication logic
      • Type casting
    • Add course_id and block_id semantic renaming
  • Create model documentation
    • Document all columns in YAML
    • Add model descriptions
    • Include examples of use cases
  • Add data quality tests
    • Unique/not null tests for key fields
    • Referential integrity checks
    • Value range validations
    • Freshness tests

Staging Model Pattern:

-- stg__edxorg__s3__course_xml_blocks.sql
with source as (
    select * from {{ source('ol_warehouse_raw_data', 'raw__edxorg__s3__course_xml_blocks') }}
)

, cleaned as (
    select
        course_id as courserun_id
        , block_id as coursestructure_block_id
        , block_type as coursestructure_block_type
        , block_title as coursestructure_block_title
        , {{ cast_timestamp_to_iso8601('retrieved_at') }} as coursestructure_retrieved_at
        , ...
    from source
)

select * from cleaned

3. Integration with Existing Pipeline

Tasks:

  • Update existing assets to output to raw layer
    • Modify extract_edxorg_courserun_metadata if needed
    • Ensure consistency with existing course structure processing
  • Align with existing course_structure staging model
    • Review stg__edxorg__s3__course_structure.sql
    • Ensure new models complement existing structure
    • Update intermediate/mart models if needed
  • Update orchestration schedules
    • Configure partitioning (by course_id and source_system)
    • Set appropriate refresh schedules
    • Handle backfilling for historical data

Technical Considerations

XML Structure

Course XML arc...

Fixes #1714



Copilot AI changed the title from "[WIP] Process course XML contents for data lakehouse and dbt" to "Add comprehensive course XML block extraction and dbt staging models for edX.org and Open edX courses" on Oct 20, 2025
Copilot AI requested a review from blarghmatey October 20, 2025 17:53