Add comprehensive course XML block extraction and dbt staging models for edX.org and Open edX courses #1737
Summary
This PR implements comprehensive processing of course XML archives from edX.org and Open edX instances, extracting all block-level data into the raw layer and providing clean dbt staging models for downstream analysis. Previously, only specific metadata (course info, videos, certificates, policy) was extracted. This change processes the complete XML structure including all blocks and components.
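To make "all blocks and components" concrete, here is a minimal sketch of block-level extraction from a single course XML document. It is not the PR's `process_course_xml_blocks()` implementation: the function name `iter_xml_blocks`, the record field names, and the sample XML are hypothetical, and the sketch assumes each element's `url_name` attribute identifies the block.

```python
import xml.etree.ElementTree as ET
from collections.abc import Iterator


def iter_xml_blocks(xml_text: str, course_id: str) -> Iterator[dict]:
    """Yield one flat record per XML element, depth first."""
    root = ET.fromstring(xml_text.strip())

    def walk(element, parent_id, index):
        # Assumption: `url_name` is the block identifier.
        block_id = element.get("url_name")
        yield {
            "course_id": course_id,
            "block_id": block_id,
            "block_type": element.tag,
            "parent_block_id": parent_id,
            "child_index": index,
            "display_name": element.get("display_name"),
        }
        # Recurse into children, recording each child's position.
        for child_index, child in enumerate(element):
            yield from walk(child, block_id, child_index)

    yield from walk(root, None, 0)


# Hypothetical sample course, not taken from a real archive.
SAMPLE = """
<course url_name="2024_T1" display_name="Demo">
  <chapter url_name="ch1" display_name="Week 1">
    <sequential url_name="seq1">
      <vertical url_name="v1">
        <video url_name="vid1" display_name="Intro"/>
        <problem url_name="p1"/>
      </vertical>
    </sequential>
  </chapter>
</course>
"""

blocks = list(iter_xml_blocks(SAMPLE, "course-v1:Demo+X+2024_T1"))
for block in blocks:
    print(block["block_type"], block["block_id"], block["parent_block_id"])
```

Emitting one flat record per element, with a parent reference and a child index, keeps the hierarchy recoverable downstream without nesting the JSON.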
Motivation
The existing `extract_edxorg_courserun_metadata` asset extracted only targeted metadata fields from course XML archives. The complete course structure, including all chapters, sequentials, verticals, and content blocks, was not being systematically loaded into the data warehouse for comprehensive analysis. This gap limited the ability to analyze:
Changes Made
1. XML Block Extraction Function
Added `process_course_xml_blocks()` in `packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py`:
2. Dagster Asset Integration
Updated `extract_edxorg_courserun_metadata` multi-asset in both:
- `dg_projects/edxorg/edxorg/assets/openedx_course_archives.py`
- `dg_projects/openedx/openedx/assets/openedx_course_archives.py`
Added new `course_xml_blocks` output that:
- `s3file_io_manager` at `edxorg/processed_data/course_xml_blocks/{source_system}/{course_id}/{version}.json`
3. dbt Source Definition
Added `raw__edxorg__s3__course_xml_blocks` source in `src/ol_dbt/models/staging/edxorg/_edxorg_sources.yml`:
4. dbt Staging Model
Created `src/ol_dbt/models/staging/edxorg/stg__edxorg__s3__course_xml_blocks.sql`:
- `deduplicate_raw_table` macro on (course_id, block_id, block_type)
- `courserun_readable_id`, `coursestructure_xml_block_*`, `video_*`, `problem_*`
- `decimal(38, 4)`
- `cast_timestamp_to_iso8601` macro
5. Model Documentation
Added comprehensive documentation in `src/ol_dbt/models/staging/edxorg/_stg__edxorg__models.yml`:
- `not_null` constraints on key fields
- `[courserun_readable_id, coursestructure_xml_block_id, coursestructure_xml_block_type, coursestructure_xml_retrieved_at]`
Data Flow
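The storage and deduplication steps described in the changes above can be sketched in plain Python. This is an illustration only: `build_s3_key`, `deduplicate`, and the `retrieved_at` field are hypothetical names. The real pipeline writes through `s3file_io_manager` and deduplicates with the `deduplicate_raw_table` dbt macro, and the "keep the most recently retrieved row" rule is an assumption about that macro's behavior. The key template and the `(course_id, block_id, block_type)` key are taken from this description.

```python
# Key template as stated in the Dagster asset section of this PR.
KEY_TEMPLATE = (
    "edxorg/processed_data/course_xml_blocks/"
    "{source_system}/{course_id}/{version}.json"
)


def build_s3_key(source_system: str, course_id: str, version: str) -> str:
    """Fill in the object key for one course archive version."""
    return KEY_TEMPLATE.format(
        source_system=source_system, course_id=course_id, version=version
    )


def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep one row per (course_id, block_id, block_type).

    Assumption: the most recently retrieved row wins. Timestamps are
    ISO 8601 strings in one format, so string comparison orders them.
    """
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = (row["course_id"], row["block_id"], row["block_type"])
        current = latest.get(key)
        if current is None or row["retrieved_at"] > current["retrieved_at"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows: the same video block retrieved twice, plus a problem.
COURSE = "course-v1:Demo+X+2024_T1"
rows = [
    {"course_id": COURSE, "block_id": "vid1", "block_type": "video",
     "retrieved_at": "2024-01-01T00:00:00Z"},
    {"course_id": COURSE, "block_id": "vid1", "block_type": "video",
     "retrieved_at": "2024-02-01T00:00:00Z"},
    {"course_id": COURSE, "block_id": "p1", "block_type": "problem",
     "retrieved_at": "2024-01-01T00:00:00Z"},
]

key = build_s3_key("edxorg", COURSE, "abc123")
deduped = deduplicate(rows)
print(key)
print(len(deduped))
```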
Benefits
For Data Analysts:
For Data Engineering:
For Analytics:
Integration Notes
This implementation:
- `raw__edxorg__s3__course_blocks` (from JSON structure API)
Testing
Completed:
Deferred (network/environment constraints):
Post-Merge Actions
- `edxorg/processed_data/course_xml_blocks/` → `raw__edxorg__s3__course_xml_blocks`
- `dbt test --select stg__edxorg__s3__course_xml_blocks` to validate data quality
Files Changed
Closes #[issue_number]
Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
- `astral.sh`
  - `curl -LsSf REDACTED` (dns block)
- `fishtownanalytics.sinter-collect.com`
  - `/home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/python /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/dbt deps` (dns block)
  - `/home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/python /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/dbt parse --profiles-dir . --project-dir . --target dev` (dns block)
- `hub.getdbt.com`
  - `/home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/python /home/REDACTED/work/ol-data-platform/ol-data-platform/.venv/bin/dbt deps` (dns block)
If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
This section details the original issue you should resolve.
<issue_title>Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models</issue_title>
<issue_description>## Summary
Process course XML contents from Open edX and edX.org course archives to load them into the raw layer of the data lakehouse and create dbt staging models for downstream analysis.
Background
Currently, the repository has functionality to extract specific metadata from course XML archives (course metadata, video details, certificate signatories, and policy information) through the `extract_edxorg_courserun_metadata` multi-asset in `src/ol_orchestrate/assets/openedx_course_archives.py`. However, the raw XML contents and course structure data are not being systematically loaded into the raw layer of the data warehouse for comprehensive analysis and transformation through dbt.
Current State
The existing implementation:
- `src/ol_orchestrate/lib/openedx.py`:
  - `process_course_xml()` - extracts course metadata
  - `process_video_xml()` - extracts video elements
  - `process_policy_json()` - extracts policy information
- `src/ol_dbt/models/staging/edxorg/` that reference raw course structure data
What's Missing
The complete course XML contents, including all course components, blocks, and their relationships, need to be:
- `ol_warehouse_*_raw` schemas)
Requirements
1. Raw Layer Data Loading
Objective: Load course XML contents into raw layer tables
Tasks:
- `src/ol_orchestrate/assets/`
Data Sources:
Target Schema Pattern:
2. dbt Staging Models
Objective: Create staging models to transform raw course XML data into clean, typed datasets
Tasks:
- `bin/dbt-create-staging-models.py` utility to scaffold sources
- `src/ol_dbt/models/staging/edxorg/`
Staging Model Pattern:
3. Integration with Existing Pipeline
Tasks:
- `extract_edxorg_courserun_metadata` if needed
- `stg__edxorg__s3__course_structure.sql`
Technical Considerations
XML Structure
Course XML arc...
Fixes #1714