This directory contains RAMP-corrected CTM model outputs formatted as soft data for BME data fusion.
2softdata/
├── README.md # This file
├── softData_UKML_YYYY-YYYY.mat # Cached soft data structures
├── plots/ # Visualization plots
│ ├── UKML_mean_yYYYY_mMM.png
│ └── UKML_variance_yYYYY_mMM.png
└── archived/ # Old versions (if any)
The system requires TWO types of files:
Generated by extractModelSpatialInfo.m from original CSV model outputs.
Located in 1data/CTM/model_output_data/spatial_grids/:
- Naming: {modelName}_spatial_grid.mat (e.g., M3fusion_spatial_grid.mat)
Contents:
- lon - Longitude vector [nGrid × 1]
- lat - Latitude vector [nGrid × 1]
- nGridPoints - Total number of grid points
- yearsChecked - Years verified for consistency
- isConsistent - Boolean flag for spatial consistency
Purpose: Provides spatial grid structure (lon, lat coordinates)
Generation: Run extractModelSpatialInfo.m once to create these files from CSV data
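As a quick sanity check, a spatial grid file can be loaded and validated against the field list above. The sketch below is only illustrative: it writes a small synthetic grid to a temporary file (the name DEMO_spatial_grid.mat is hypothetical) so that it runs without any project data, then checks the fields this README documents.

```matlab
% Build a small synthetic grid so the example runs without project data.
[lonGrid, latGrid] = meshgrid(-10:5:10, 40:5:50);   % 3 x 5 mesh
g.lon = lonGrid(:);                  % [nGrid x 1]
g.lat = latGrid(:);                  % [nGrid x 1]
g.nGridPoints = numel(g.lon);
g.yearsChecked = 2015:2020;
g.isConsistent = true;

% Hypothetical file name following the {modelName}_spatial_grid.mat pattern.
gridFile = fullfile(tempdir, 'DEMO_spatial_grid.mat');
save(gridFile, '-struct', 'g');

% Load the file and confirm the documented fields are present.
S = load(gridFile);
required = {'lon', 'lat', 'nGridPoints', 'yearsChecked', 'isConsistent'};
present = isfield(S, required);
if ~all(present)
    error('Missing fields: %s', strjoin(required(~present), ', '));
end

% lon and lat must both have one entry per grid point.
if numel(S.lon) ~= S.nGridPoints || numel(S.lat) ~= S.nGridPoints
    error('lon/lat length does not match nGridPoints');
end
if ~S.isConsistent
    warning('Spatial grid is not consistent across all years.');
end
```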
Located in 1data/CTM/:
- lambda1_{model}_{year}_v{version}-parallel.parquet - Mean field
- lambda2_{model}_{year}_v{version}-parallel.parquet - Variance field
Where:
- lambda1 = RAMP-corrected mean MDA8 ozone (ppb)
- lambda2 = RAMP-corrected variance (ppb²)
- model = Model name (e.g., UKML)
- version = RAMP calibration version (e.g., v3)
Parquet File Format:
- Columns 1-12: Monthly MDA8 values (Jan-Dec)
- NO spatial information - spatial coordinates come from .mat files
- Row order must match .mat file grid order (critical assumption)
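The format constraints above (12 monthly columns, one row per grid point) can be checked at read time. The following is a minimal sketch, not the actual loadRAMPdata.m logic: it writes a synthetic parquet file with assumed month column names, then reads it back with parquetread (available in MATLAB R2019a and later) and verifies its shape.

```matlab
nGrid = 6;   % stand-in for nGridPoints from the spatial grid .mat file

% ASSUMPTION: column names are illustrative; the real files may differ.
monthNames = {'Jan','Feb','Mar','Apr','May','Jun', ...
              'Jul','Aug','Sep','Oct','Nov','Dec'};
T = array2table(40 + rand(nGrid, 12), 'VariableNames', monthNames);

% Synthetic file in the temp folder, named after the documented pattern.
pqFile = fullfile(tempdir, 'lambda1_DEMO_2017_v3-parallel.parquet');
parquetwrite(pqFile, T);

% Read back and verify the shape before trusting the row order.
lambda1 = parquetread(pqFile);
if width(lambda1) ~= 12
    error('Expected 12 monthly columns, found %d', width(lambda1));
end
if height(lambda1) ~= nGrid
    error('Lambda1 file has %d rows, expected %d (grid size)', ...
        height(lambda1), nGrid);
end

Z = table2array(lambda1);   % [nGrid x 12] numeric matrix
```

A shape check cannot prove that the row order matches the grid order; it only catches the case where the two files clearly come from different grids.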
The workflow:
- Read lon, lat from spatial grid .mat file
- Read lambda1, lambda2 from parquet files (12 monthly columns)
- Match rows assuming same grid order
- Create unified structure with sMS ([lon, lat]), Z (mean), Zv (variance)
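The four workflow steps can be sketched as follows. This is an illustration under stated assumptions rather than the code in loadRAMPdata.m: random arrays stand in for the per-year parquet reads, and the decimal-year time stamps use a mid-month convention, which is an assumption this README does not confirm.

```matlab
years = 2015:2017;
lon = [-100; -95; -100; -95];     % from the spatial grid .mat file
lat = [  35;  35;   40;  40];
nGrid  = numel(lon);
nYears = numel(years);

Z   = zeros(nGrid, 12 * nYears);  % mean field (lambda1)
Zv  = zeros(nGrid, 12 * nYears);  % variance field (lambda2)
tME = zeros(1, 12 * nYears);      % time in decimal years

for k = 1:nYears
    cols = (k - 1) * 12 + (1:12);
    % Stand-ins for the [nGrid x 12] arrays read from parquet;
    % rows are assumed to be in the same order as lon/lat.
    Z(:, cols)  = 40 + 5 * rand(nGrid, 12);   % ppb
    Zv(:, cols) = 1 + rand(nGrid, 12);        % ppb^2
    % ASSUMPTION: mid-month time stamps.
    tME(cols) = years(k) + ((1:12) - 0.5) / 12;
end

% Unified structure with the fields described in this README.
ctmDemo.lon = lon;
ctmDemo.lat = lat;
ctmDemo.sMS = [lon, lat];         % [nGrid x 2]
ctmDemo.tME = tME;                % [1 x nMonths]
ctmDemo.Z   = Z;
ctmDemo.Zv  = Zv;
```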
Important: Run extractModelSpatialInfo.m first to generate spatial grid .mat files!
% Load from parquet files (creates cache on first run)
ctmData = loadRAMPdata('UKML', [2015:2020]);
% Force reload from parquet (ignore cache)
ctmData = loadRAMPdata('UKML', [2015:2020], '1data/CTM', 1);

Output structure:
ctmData.lon % [nGrid × 1] Longitude (degrees)
ctmData.lat % [nGrid × 1] Latitude (degrees)
ctmData.sMS % [nGrid × 2] Spatial coordinates as [lon, lat]
ctmData.tME % [1 × nMonths] Time in decimal years
ctmData.Z % [nGrid × nMonths] Mean field (lambda1)
ctmData.Zv % [nGrid × nMonths] Variance field (lambda2)
ctmData.gridInfo % Grid metadata (nGridPoints, yearsChecked, isConsistent)

Caching: Creates 1data/CTM/CTM_RAMP_UKML_2015-2020_v3.mat (~50-500 MB)
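The cache-or-reload behaviour described above follows a common pattern, sketched below. The file name, the forceReload flag name, and the placeholder data are all hypothetical; the real logic lives in loadRAMPdata.m and may differ.

```matlab
% Hypothetical cache file in the temp folder (not the real cache path).
cacheFile = fullfile(tempdir, 'CTM_RAMP_DEMO_2015-2020_v3.mat');
if exist(cacheFile, 'file') == 2
    delete(cacheFile);            % start the demo from a clean state
end

forceReload = false;              % corresponds to the 4th argument above
sources = cell(1, 2);

for pass = 1:2
    if exist(cacheFile, 'file') == 2 && ~forceReload
        % Fast path: reuse the cached structure.
        cached = load(cacheFile, 'ctmData');
        ctmData = cached.ctmData;
        sources{pass} = 'cache';
    else
        % Slow path: placeholder for the expensive parquet read.
        ctmData = struct('Z',  rand(10, 72, 'single'), ...
                         'Zv', rand(10, 72, 'single'));
        save(cacheFile, 'ctmData');
        sources{pass} = 'parquet';
    end
end
```

The first pass builds the structure and writes the cache; the second pass finds the file and loads it instead.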
% Basic usage
softData = createSoftDataStructure(ctmData, obs);
% With spatial subsetting (e.g., CONUS)
options.spatialBounds = [-125 -65 24 50];
softData = createSoftDataStructure(ctmData, obs, options);
% With spatial thinning (every 2nd grid point)
options.thinningFactor = 2;
softData = createSoftDataStructure(ctmData, obs, options);

Output structure:
softData.sMS % [nPoints × 2] Spatial coordinates as [lon, lat]
softData.lon % [nPoints × 1] Longitude (degrees, for reference)
softData.lat % [nPoints × 1] Latitude (degrees, for reference)
softData.tME % [1 × nMonths] Time vector (aligned with obs)
softData.Z % [nPoints × nMonths] Mean values
softData.Zv % [nPoints × nMonths] Variance values

Memory: ~14-140 MB depending on grid resolution and thinning
% Plot mean and variance for January
plotSoftData(softData, obs, 1, 'both');
% Plot mean only for multiple months
plotSoftData(softData, obs, [1 6 12], 'mean');

Output: PNG files in 2softdata/plots/
% Specify soft data when creating knowledge base
BMEmethod = '11000132'; % Note: digit 2 = '1' for soft data
[KG, KS, BMEparam] = getTOARknowledgeBase(obs, go, cov, softData, BMEmethod);
% KS.softdata now contains:
% .p - [nSoft × 3] coordinates (lon, lat, time)
% .z - [nSoft × 1] residual mean values
% .vs - [nSoft × 1] variance values

'10000132'
  ↑
  Digit 2 = 0 (no CTM data)
'11000132'
  ↑
  Digit 2 = 1 (includes CTM soft data)
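To make the KS.softdata layout concrete, the sketch below flattens a gridded soft data structure ([nPoints × nMonths]) into the point-wise form ([nSoft × 3] coordinates plus value and variance vectors). It is an assumption about how such a conversion could look, not the code inside getTOARknowledgeBase.m, and it omits the step that turns mean values into residuals.

```matlab
sMS = [-100 35; -95 35; -100 40];    % [nPoints x 2] as [lon, lat]
tME = [2015.0417 2015.1250];         % [1 x nMonths] decimal years
Z   = [41 42; 43 44; 45 46];         % [nPoints x nMonths] mean
Zv  = [1.0 1.1; 1.2 1.3; 1.4 1.5];   % [nPoints x nMonths] variance

[nPoints, nMonths] = size(Z);

% Z(:) runs down columns: all points for month 1, then month 2, ...
% The coordinates must be replicated in the same order.
p  = [repmat(sMS, nMonths, 1), repelem(tME(:), nPoints)];  % [nSoft x 3]
z  = Z(:);                                                 % [nSoft x 1]
vs = Zv(:);                                                % [nSoft x 1]
```

Row 4 of p is the first grid point at the second time step, and z(4) is the matching mean value (42 in this toy data).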
Change in BME parameters:
nsmax: Maximum soft data neighbors (digit 6)
- 0 → 0 soft neighbors (hard only)
- 1 → 3 soft neighbors
- 2 → 4 soft neighbors
- 4 → 50 soft neighbors
- 6 → 200 soft neighbors
Example: '11000242' = hard + soft, nsmax=50, nhmax=200
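A method string can be decoded programmatically. The sketch below takes the digit positions and the code-to-count list in this section at face value (digit 2 as the soft data flag, digit 6 as the nsmax code); it is an illustration only, so verify the mapping against getTOARknowledgeBase.m before relying on it.

```matlab
BMEmethod = '11000132';

% Digit 2: '1' means CTM soft data is included.
useSoftData = BMEmethod(2) == '1';

% Code-to-count lookup copied from the list in this README.
codes  = '01246';
counts = [0 3 4 50 200];

% ASSUMPTION: digit 6 carries the nsmax code, as stated above.
idx = find(codes == BMEmethod(6), 1);
if isempty(idx)
    error('Unrecognized neighbor code "%s"', BMEmethod(6));
end
nsmax = counts(idx);

fprintf('soft data: %d, nsmax: %d\n', useSoftData, nsmax);
```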
Reduce to analysis domain only:
options.spatialBounds = [minLon maxLon minLat maxLat];

Benefit: 50-90% memory reduction for regional studies
Keep every Nth grid point:
options.thinningFactor = 2; % Every 2nd point

Benefit: 75% memory reduction, minimal accuracy loss (BME uses nsmax anyway)
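Both reductions amount to a logical mask over grid points. The sketch below assumes a regular lon/lat grid and assumes that thinning keeps every Nth value along each axis, which is consistent with the stated 75% reduction for a factor of 2; the actual createSoftDataStructure.m implementation may differ.

```matlab
% Synthetic regular grid: 15 longitudes x 8 latitudes = 120 points.
[lonGrid, latGrid] = meshgrid(-130:5:-60, 20:5:55);
lon = lonGrid(:);
lat = latGrid(:);
Z = rand(numel(lon), 72, 'single');     % [nGrid x nMonths]

% Spatial subsetting: [minLon maxLon minLat maxLat].
bounds = [-125 -65 24 50];
inBox = lon >= bounds(1) & lon <= bounds(2) & ...
        lat >= bounds(3) & lat <= bounds(4);

% Thinning: keep every Nth distinct longitude and latitude.
thinningFactor = 2;
[~, ~, iLon] = unique(lon);             % index of each point's longitude
[~, ~, iLat] = unique(lat);             % index of each point's latitude
onStride = mod(iLon - 1, thinningFactor) == 0 & ...
           mod(iLat - 1, thinningFactor) == 0;

keep = inBox & onStride;
Zsub = Z(keep, :);

fprintf('kept %d of %d points\n', nnz(keep), numel(lon));
```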
Avoid double-counting:
options.removeHardData = 1;

Benefit: Prevents soft data at same location/time as hard data
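One way to detect collocated points is a pairwise tolerance test, sketched below. The 0.01 degree spatial tolerance comes from the changelog in this README; the time tolerance and the toy coordinates are assumptions, and the real removeHardData logic may be implemented differently.

```matlab
tolDeg  = 0.01;    % spatial tolerance in degrees (from the changelog)
tolTime = 1e-6;    % ASSUMPTION: time tolerance in decimal years

% Coordinates as [lon, lat, time].
softP = [-100     35     2015.0417;
          -95     35     2015.0417;
         -100     40     2015.1250];
hardP = [-100.005 35.002 2015.0417;
          -80     30     2015.1250];

% Pairwise differences via implicit expansion (R2016b or later):
% each result is [nSoft x nHard].
dLon = abs(softP(:, 1) - hardP(:, 1)');
dLat = abs(softP(:, 2) - hardP(:, 2)');
dT   = abs(softP(:, 3) - hardP(:, 3)');

% A soft point is dropped if any hard point matches in all three.
collocated = any(dLon <= tolDeg & dLat <= tolDeg & dT <= tolTime, 2);
softKept = softP(~collocated, :);
```

The pairwise matrices grow as nSoft × nHard, so for full grids this check would need to be done per time step or in chunks.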
Typical sizes for 6 years (2015-2020, 72 months):
| Grid Resolution | Points | Raw Cache | Soft Data | With Thinning (×2) |
|---|---|---|---|---|
| 0.1° × 0.1° | 64,800 | 450 MB | 140 MB | 35 MB |
| 0.25° × 0.25° | 10,368 | 72 MB | 22 MB | 5.5 MB |
| 0.5° × 0.5° | 2,592 | 18 MB | 5.5 MB | 1.4 MB |
All stored as single precision for efficiency
Automated QC during structure creation:
- ✓ Negative variance → set to 0.01
- ✓ Infinite/NaN values → removed
- ✓ Temporal alignment with obs
- ✓ Minimum variance threshold (default: 0.01)
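The variance and validity checks listed above can be sketched on point-wise vectors as follows. This is a minimal illustration with toy data, assuming that "removed" means dropping the affected entries; it is not the QC code itself.

```matlab
minVar = 0.01;                        % default minimum variance threshold

z  = [41;  NaN; 43;   44;    Inf];    % mean values
vs = [1.2; 0.5; -0.3; 0.001; 2.0];    % variance values

% Drop entries where either the mean or the variance is NaN/Inf.
valid = isfinite(z) & isfinite(vs);
z  = z(valid);
vs = vs(valid);

% Flooring at minVar handles negative variances and tiny positive
% ones in a single step.
vs(vs < minVar) = minVar;
```

With this input, two entries are removed and two variances are raised to 0.01.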
Based on typical BME data fusion results:
- R² improvement: +0.02 to +0.05
- RMSE reduction: 1-2 ppb
- R² improvement: +0.10 to +0.20
- RMSE reduction: 3-8 ppb
- R² improvement: +0.05 to +0.10
- RMSE reduction: 2-4 ppb
Run the complete workflow:
test_softdata_workflow

This demonstrates:
- Loading RAMP data
- Creating soft data structure
- Visualization
- BME integration
- Test estimation
Extracts spatial grids from original CSV model outputs and saves as .mat files.
Run this ONCE before using loadRAMPdata:
% Edit the script to select which models to process
% The script will:
% 1. Read first available CSV file to get reference grid
% 2. Verify consistency across all available years
% 3. Save {modelName}_spatial_grid.mat files
% Run the script
extractModelSpatialInfo

Output files:
1data/CTM/model_output_data/spatial_grids/
├── M3fusion_spatial_grid.mat
├── AM4_spatial_grid.mat
├── CAMS_spatial_grid.mat
└── ...
Each .mat file contains:
- lon - Longitude coordinates
- lat - Latitude coordinates
- nGridPoints - Number of points
- yearsChecked - Years verified
- isConsistent - Spatial consistency flag
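The cross-year consistency check can be pictured as comparing each year's coordinates with a reference grid. The sketch below uses in-memory stand-ins for the per-year grids, because the CSV layout is not documented here; the exact-equality comparison is an assumption about how extractModelSpatialInfo.m decides consistency.

```matlab
years = 2015:2017;

% Reference grid from the first available year.
refLon = [-100; -95; -100; -95];
refLat = [  35;  35;   40;  40];

% Stand-ins for grids read from each year's CSV; the last year is
% deliberately truncated to trigger an inconsistency.
grids = struct('lon', {refLon, refLon, refLon(1:3)}, ...
               'lat', {refLat, refLat, refLat(1:3)});

isConsistent = true;
yearsChecked = zeros(1, 0);
for k = 1:numel(years)
    sameGrid = isequal(grids(k).lon, refLon) && ...
               isequal(grids(k).lat, refLat);
    isConsistent = isConsistent && sameGrid;
    yearsChecked(end + 1) = years(k); %#ok<SAGROW>
end

if ~isConsistent
    warning('Spatial grid is not consistent across all years.');
end
```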
Error: Spatial grid file not found: 1data/CTM/model_output_data/spatial_grids/M3fusion_spatial_grid.mat
Please run extractModelSpatialInfo.m first to generate spatial grid files.
Solution: Run extractModelSpatialInfo.m to generate spatial grid .mat files from CSV data
Error: Lambda1 file not found
Solution: Check file naming convention and path. Expected format:
1data/CTM/lambda1_UKML_2017_v3-parallel.parquet
Out of memory error during loading
Solutions:
- Use spatial thinning: options.thinningFactor = 2
- Subset to analysis domain: options.spatialBounds = [...]
- Process fewer years at once
Warning: No temporal overlap between obs and CTM
Solution: Check that obs.tME and ctmData.tME have overlapping time periods
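Because tME holds decimal years, an exact equality test between the two time vectors can fail on rounding. The sketch below checks for overlap with ismembertol and an absolute tolerance; the time vectors and the tolerance value are illustrative assumptions, not values taken from the code.

```matlab
% ASSUMPTION: both vectors use mid-month decimal years.
obsT = 2014 + ((1:24) - 0.5) / 12;   % observations: 2014-2015
ctmT = 2015 + ((1:24) - 0.5) / 12;   % CTM soft data: 2015-2016

% Absolute tolerance in years, far below the ~0.083 monthly spacing.
tol = 1e-3;
[inCtm, loc] = ismembertol(obsT, ctmT, tol, 'DataScale', 1);

if ~any(inCtm)
    warning('No temporal overlap between obs and CTM');
end

obsIdx = find(inCtm);    % columns of obs with a CTM match
ctmIdx = loc(inCtm);     % matching columns of the CTM fields

fprintf('%d overlapping months, %.4f to %.4f\n', ...
    numel(obsIdx), obsT(obsIdx(1)), obsT(obsIdx(end)));
```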
Error: Lambda1 file has 5000 rows, expected 6480 (grid size)
Solution: Parquet file rows must match spatial grid .mat file size. Check:
- Spatial grid .mat file: load and check nGridPoints
- Parquet file row count: should match nGridPoints
- Ensure parquet files and CSV files used the same grid structure
Warning: Spatial grid is not consistent across all years. Proceed with caution.
Information: This warning means extractModelSpatialInfo.m detected that some years have
different grid structures. The reference grid (first available year) is used. Check the
yearsChecked and isConsistent fields in the gridInfo for details.
- v1.3 (2025-01-28): Simplified coordinate system
  - Removed Mercator coordinate conversion
  - sMS field now simply [lon, lat] in degrees
  - Updated distance calculations to use degree-based tolerance (0.01 deg ~1 km)
  - Changed spaceUnit from 'mercator' to 'degrees'
  - Simplified workflow - no coordinate transformations needed
- v1.2 (2025-01-28): .mat file integration for spatial coordinates
  - Added extractModelSpatialInfo.m to extract spatial grids from CSV files
  - Updated loadRAMPdata.m to use .mat file spatial grids (not NetCDF)
  - Spatial grids saved as {modelName}_spatial_grid.mat
  - Added spatial consistency checking across years
  - Breaking change: Requires running extractModelSpatialInfo.m first
- v1.1 (2025-01-28): NetCDF integration for spatial coordinates (deprecated)
  - Added getCTMspatialGrid.m to read lon/lat from NetCDF files
  - Updated loadRAMPdata.m to use NetCDF spatial grid
  - Added Mercator coordinate conversion
  - Fixed createSoftDataStructure.m to use Mercator coordinates
  - Breaking change: Parquet files now have 12 columns (no spatial info)
- v1.0 (2025-01-15): Initial soft data loading system
  - Parquet file loading with caching
  - Soft data structure creation
  - Visualization tools
  - BME integration
For questions about:
- RAMP correction: See RAMP documentation
- Data format: Check parquet file structure
- BME integration: See getTOARknowledgeBase.m documentation