Merged

31 commits
- `b6af82c` feat: add streamlined analysis-only workflow script (neuromechanist, Sep 18, 2025)
- `b4a2dff` docs: add comprehensive cleanup plan for issue #23 (neuromechanist, Sep 18, 2025)
- `2268bd3` docs: add cleanup plan and testing checklist (neuromechanist, Sep 18, 2025)
- `3ff454f` feat: save reference dashboard state before cleanup (neuromechanist, Sep 18, 2025)
- `83e0445` cleanup: remove test output directories (neuromechanist, Sep 18, 2025)
- `e6b41ed` cleanup: remove obsolete analysis directories (neuromechanist, Sep 18, 2025)
- `4739b8d` cleanup: remove obsolete CLI commands and Neo4j modules (neuromechanist, Sep 18, 2025)
- `ca24d1f` cleanup: remove unused graph files from interactive_reports (neuromechanist, Sep 18, 2025)
- `3a87648` fix: correct dashboard generation by passing citations_dir (neuromechanist, Sep 18, 2025)
- `4d9fdef` chore: update .gitignore for test outputs and backups (neuromechanist, Sep 18, 2025)
- `0615bd0` cleanup: remove old analysis files from dashboard_data (neuromechanist, Sep 18, 2025)
- `45f5572` cleanup: remove old theme files from interactive_reports (neuromechanist, Sep 18, 2025)
- `1676f78` feat: add new dataset embeddings (neuromechanist, Sep 18, 2025)
- `b82d870` cleanup: remove test metrics file (neuromechanist, Sep 18, 2025)
- `1ea17b3` fix: repair broken imports and add comprehensive analysis script (neuromechanist, Sep 18, 2025)
- `cd41c41` feat: add standalone analysis modules for dashboard data generation (neuromechanist, Sep 18, 2025)
- `2d916e5` fix: update workflow scripts to use correct dashboard_data path (neuromechanist, Sep 18, 2025)
- `84307af` chore: update .gitignore for generated dashboard outputs (neuromechanist, Sep 18, 2025)
- `2a49199` chore: remove generated files from git tracking (neuromechanist, Sep 18, 2025)
- `4b67029` generalize the themes (neuromechanist, Sep 18, 2025)
- `a0ea2e9` cleanup: remove unused create_interactive_reports CLI command (neuromechanist, Sep 18, 2025)
- `2bfe80e` feat: make theme component use dynamic data from analysis (neuromechanist, Sep 18, 2025)
- `f095682` fix: correct field mappings in modal component (neuromechanist, Sep 18, 2025)
- `d091743` cleanup: remove redundant run_analysis_only.sh script (neuromechanist, Sep 19, 2025)
- `2941164` fix: handle both array and string formats in bridge modal JavaScript (neuromechanist, Sep 19, 2025)
- `3886c05` feat: track individual paper titles in bridge paper analysis (neuromechanist, Sep 19, 2025)
- `a19843a` feat: update modal component to display bridge paper titles and year (neuromechanist, Sep 19, 2025)
- `6846a72` feat: align GitHub workflow with local script improvements (neuromechanist, Sep 19, 2025)
- `b0aa430` refactor: update run_end_to_end_workflow.sh to use dashboard_data path (neuromechanist, Sep 19, 2025)
- `90dae42` docs: update documentation with new workflows and analysis modules (neuromechanist, Sep 19, 2025)
- `d78052d` chore: remove tracked embeddings from repository (neuromechanist, Sep 19, 2025)
74 changes: 49 additions & 25 deletions .github/workflows/update_citations.yml
```diff
@@ -238,23 +238,47 @@ jobs:
         id: generate_dashboard
         run: |
           echo "Generating analysis results for dashboard..."
 
-          # Create results directory structure
-          mkdir -p results/temporal_analysis
-          mkdir -p results/network_analysis
-
-          echo "Running temporal analysis..."
-          dataset-citations-analyze-temporal \
+          # Clean up and prepare directories
+          rm -rf dashboard_data/network dashboard_data/themes dashboard_data/temporal 2>/dev/null || true
+          mkdir -p dashboard_data/{network,themes,temporal,visualizations}
+
+          echo "Generating embeddings for analysis..."
+          dataset-citations-generate-embeddings \
+            --citations citations/json \
+            --datasets datasets \
+            --embeddings-dir embeddings \
+            --embedding-type both \
+            --batch-size 32 \
+            --verbose || echo "Some embeddings may have failed (non-critical)"
+
+          echo "Running UMAP analysis for theme identification..."
+          dataset-citations-analyze-umap \
+            --embeddings-dir embeddings \
+            --output-dir dashboard_data \
+            --embedding-type both \
+            --n-components 2 \
+            --clustering \
+            --clustering-method kmeans \
+            --n-clusters 4 \
+            --create-visualizations \
+            --verbose || echo "UMAP analysis had issues (non-critical)"
+
+          echo "Generating theme analysis with wordclouds..."
+          python -m dataset_citations.analysis.generate_themes \
             --citations-dir citations/json \
-            --output-dir results/temporal_analysis \
-            --log-level INFO || echo "Temporal analysis failed (non-critical)"
-
-          echo "Running network analysis..."
-          dataset-citations-analyze-networks \
+            --output-dir dashboard_data/themes || echo "Theme generation had issues (non-critical)"
+
+          echo "Generating network analysis..."
+          python -m dataset_citations.analysis.generate_network \
             --citations-dir citations/json \
-            --output-dir results/network_analysis \
-            --log-level INFO || echo "Network analysis failed (non-critical)"
-
+            --output-dir dashboard_data/network || echo "Network analysis failed (non-critical)"
+
+          echo "Generating temporal analysis..."
+          python -m dataset_citations.analysis.generate_temporal \
+            --citations-dir citations/json \
+            --output-dir dashboard_data/temporal || echo "Temporal analysis failed (non-critical)"
 
           echo "Generating interactive dashboard..."
           python -c "
           from dataset_citations.dashboard.core import DashboardGenerator
```
```diff
@@ -283,24 +307,24 @@ print(f'Dashboard generated: {output_path}')
         run: |
           # Refresh Git status to detect new or modified files
           git add -A
-          # Check if there are any unstaged changes in the citations, datasets, results, and interactive_reports directories
-          if ! git diff --cached --quiet -- citations/ datasets/ results/ interactive_reports/; then
+
+          # Check if there are any unstaged changes in the relevant directories
+          if ! git diff --cached --quiet -- citations/ datasets/ dashboard_data/ embeddings/ interactive_reports/; then
             echo "CHANGES_DETECTED=true" >> $GITHUB_ENV
             echo "changes_detected=true" >> $GITHUB_OUTPUT
 
             # Count changed files for reporting (including JSON, HTML files)
-            CHANGED_FILES=$(git diff --cached --name-only -- citations/ datasets/ results/ interactive_reports/ | grep -E '\.(csv|pkl|txt|json|html|png)$' | wc -l)
+            CHANGED_FILES=$(git diff --cached --name-only -- citations/ datasets/ dashboard_data/ embeddings/ interactive_reports/ | grep -E '\.(csv|pkl|txt|json|html|png)$' | wc -l)
             echo "CHANGED_FILES=$CHANGED_FILES" >> $GITHUB_ENV
             echo "changed_files=$CHANGED_FILES" >> $GITHUB_OUTPUT
 
             # List changed files for the commit message (including JSON, HTML files)
-            CHANGED_FILES_LIST=$(git diff --cached --name-only -- citations/ datasets/ results/ interactive_reports/ | grep -E '\.(csv|pkl|txt|json|html|png)$')
+            CHANGED_FILES_LIST=$(git diff --cached --name-only -- citations/ datasets/ dashboard_data/ embeddings/ interactive_reports/ | grep -E '\.(csv|pkl|txt|json|html|png)$')
             echo "CHANGED_FILES_LIST<<EOF" >> $GITHUB_ENV
             echo "$CHANGED_FILES_LIST" >> $GITHUB_ENV
             echo "EOF" >> $GITHUB_ENV
-            echo "Detected $CHANGED_FILES changed files in citations, datasets, results, and interactive_reports directories"
-
+            echo "Detected $CHANGED_FILES changed files in tracked directories"
           else
             echo "CHANGES_DETECTED=false" >> $GITHUB_ENV
             echo "changes_detected=false" >> $GITHUB_OUTPUT
```
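The change-detection step above hinges on `git diff --cached --quiet`, which exits non-zero when staged changes exist, and a `grep -E` filter over the reported extensions. A minimal sketch of the idiom in a throwaway repo (the file names here are illustrative, not the repository's real contents):

```shell
# Demonstrate the staged-change detection idiom in a scratch repo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m init

# Stage a couple of illustrative files under dashboard_data/
mkdir -p dashboard_data
echo '{}' > dashboard_data/aggregated_data.json
echo 'notes' > dashboard_data/readme.txt
git add -A

# --quiet makes git diff exit 1 when staged changes exist, 0 when clean
if ! git diff --cached --quiet -- dashboard_data/; then
  STATUS=changed
else
  STATUS=clean
fi

# Count only the extensions the workflow reports on
CHANGED=$(git diff --cached --name-only -- dashboard_data/ | grep -cE '\.(csv|pkl|txt|json|html|png)$')
echo "$STATUS: $CHANGED files"
```

The exit-code contract is why the workflow can use the command directly as an `if` condition instead of parsing output.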
```diff
@@ -323,7 +347,7 @@ print(f'Dashboard generated: {output_path}')
         if: ${{ env.CHANGES_DETECTED == 'true' && steps.update_citations.outcome == 'success' && steps.update_previous.outcome == 'success' }}
         run: |
           # Stage all changes for commit
-          git add citations/ datasets/ results/ interactive_reports/
+          git add citations/ datasets/ dashboard_data/ embeddings/ interactive_reports/
 
           # Create a descriptive commit message
           COMMIT_DATE=$(date +'%Y-%m-%d')
```
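The multi-line `CHANGED_FILES_LIST` export in the workflow uses the `NAME<<EOF` heredoc form because `GITHUB_ENV` is just a file of `KEY=VALUE` entries that the runner parses between steps. A local emulation of that mechanism (the `GITHUB_ENV` file here is a stand-in temp file):

```shell
# Emulate the GITHUB_ENV heredoc pattern for a multi-line value.
GITHUB_ENV=$(mktemp)
CHANGED_FILES_LIST=$(printf '%s\n%s' citations/a.json dashboard_data/b.csv)
{
  echo "CHANGED_FILES_LIST<<EOF"   # open the heredoc-style block
  echo "$CHANGED_FILES_LIST"       # the multi-line value itself
  echo "EOF"                       # close the block
} >> "$GITHUB_ENV"
head -n 1 "$GITHUB_ENV"
```

A plain `KEY=VALUE` line would truncate at the first newline; the delimiter block is the only way to pass a file list through the environment file.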
37 changes: 36 additions & 1 deletion .gitignore
```diff
@@ -52,4 +52,39 @@ CLAUDE.md
 .claude/*
 .playwright-mcp/*
 .context/
-.rules/
+.rules/
+# Test outputs
+test_output*/
+test_debug*/
+test_trace/
+test_modular*/
+
+# Temporary analysis outputs
+after_*.txt
+reference_metrics.txt
+
+# Backup files
+*_old.json
+reference_dashboard*.html
+reference_dashboard_data/
+
+# Generated dashboard outputs (regenerate with run_full_analysis.sh)
+dashboard_data/aggregated_data.json
+dashboard_data/comprehensive_theme_analysis.json
+dashboard_data/network/
+dashboard_data/temporal/
+dashboard_data/themes/
+dashboard_data/visualizations/
+dashboard_data/*.csv
+dashboard_data/*.png
+
+# Interactive report outputs
+interactive_reports/*.html
+interactive_reports/data/
+
+# Embeddings analysis outputs (can be regenerated)
+embeddings/analysis/
+embeddings/metadata/embedding_registry.json
+embeddings/citation_embeddings/
+embeddings/dataset_embeddings/
+
```
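The new rules can be sanity-checked with `git check-ignore`, which reports whether a path matches the ignore patterns. A sketch in a scratch repo using a few of the patterns from the diff (the file names are illustrative):

```shell
# Verify a subset of the new .gitignore rules with git check-ignore.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
printf '%s\n' 'test_output*/' '*_old.json' 'dashboard_data/*.csv' > .gitignore
mkdir -p test_output_run1 dashboard_data
touch test_output_run1/log.txt dashboard_data/export.csv dashboard_data/kept.json

# check-ignore -q exits 0 if the path is ignored, 1 otherwise
ignored() { git check-ignore -q "$1" && echo yes || echo no; }
A=$(ignored test_output_run1/log.txt)      # inside a test_output*/ directory
B=$(ignored dashboard_data/export.csv)     # matches dashboard_data/*.csv
C=$(ignored dashboard_data/kept.json)      # matches no rule in this subset
echo "$A $B $C"
```

Note the trailing slash in `test_output*/`: it makes the pattern match directories only, which then excludes everything inside them.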
92 changes: 92 additions & 0 deletions CLEANUP_PLAN.md
# Cleanup Plan for Issue #23

## Objective
Consolidate all analysis outputs to `dashboard_data/` and remove obsolete components.

## Directories to Remove

### Test Output Directories
- `test_output/` - Old test output
- `test_output_*/` - Multiple test output directories
- `test_modular_output/` - Test modular output
- `test_output_nemar/` - Test NEMAR output
- `.playwright-mcp/` - Screenshot directory (no longer needed)

### Obsolete Analysis Directories
- `results/` - Old results directory (check for needed files first)
- `embeddings/analysis/` - Duplicate analysis in embeddings
- `interactive_reports/data/` - Should be generated from dashboard_data
- `data/` - Old data directory

## Files to Remove
- `dashboard_data/aggregated_data_old.json` - Old aggregated data
- `interactive_reports/*.png` - Unused network images
- `interactive_reports/*.graphml` - Unused graph files
- `interactive_reports/*.gexf` - Unused graph files
- `interactive_reports/*.cx` - Unused graph files

## Consolidation Plan

### Target Structure

```bash
dashboard_data/
├── aggregated_data.json    # Main aggregated data
├── network/                # Network analysis results
│   ├── *.csv               # CSV exports
│   └── *.json              # JSON summaries
├── themes/                 # Theme analysis
│   ├── comprehensive_theme_analysis.json
│   └── theme_*_wordcloud.png
├── temporal/               # Temporal analysis
│   └── *.csv
└── visualizations/         # All generated visualizations
    └── *.png
```

### Dashboard Dependencies
The dashboard needs:
1. `dashboard_data/aggregated_data.json` - Main data source
2. `dashboard_data/themes/*.png` - Wordcloud images
3. `dashboard_data/themes/comprehensive_theme_analysis.json` - Theme data
4. `citations/json/*.json` - Citation data (keep as-is)
5. `datasets/*.json` - Dataset metadata (keep as-is)
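The dependency list above can be turned into a quick pre-flight check before regenerating the dashboard. A sketch (run here against a mock tree in a temp directory; in practice you would run the loop from the repository root):

```shell
# Pre-flight check that the dashboard's data dependencies exist.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Mock tree standing in for the real repository layout
mkdir -p dashboard_data/themes citations/json datasets
echo '{}' > dashboard_data/aggregated_data.json
echo '{}' > dashboard_data/themes/comprehensive_theme_analysis.json

MISSING=0
for path in \
  dashboard_data/aggregated_data.json \
  dashboard_data/themes/comprehensive_theme_analysis.json \
  citations/json \
  datasets
do
  [ -e "$path" ] || { echo "missing: $path"; MISSING=$((MISSING + 1)); }
done
echo "missing dependencies: $MISSING"
```

The wordcloud PNGs (`dashboard_data/themes/*.png`) are globbed rather than fixed names, so they would need a separate `ls`-based check.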

## CLI Commands to Keep

### Essential Commands
- `dataset-citations-discover` - Dataset discovery
- `dataset-citations-update` - Citation updates
- `dataset-citations-retrieve-metadata` - Metadata retrieval
- `dataset-citations-score-confidence` - Confidence scoring
- `dataset-citations-generate-embeddings` - Embedding generation

### Analysis Commands (to simplify)
- `dataset-citations-analyze-temporal` - Keep but simplify
- `dataset-citations-analyze-temporal-themes` - Keep for wordclouds

### Commands to Remove
- `dataset-citations-analyze-networks` - Requires Neo4j, not used
- `dataset-citations-create-network-visualizations` - Obsolete
- `dataset-citations-create-research-context-networks` - Not used
- `dataset-citations-create-theme-networks` - Not used
- `dataset-citations-load-graph` - Neo4j related
- `dataset-citations-export-external-tools` - Not used
- `dataset-citations-analyze-umap` - Replaced by themes
- `dataset-citations-automate-visualization-updates` - Obsolete

## Python Modules to Remove
- `src/dataset_citations/graph/neo4j_*.py` - Neo4j dependencies
- `src/dataset_citations/graph/network_visualizations.py` - Obsolete
- Unused CLI modules corresponding to removed commands

## Workflow Updates
1. Update `run_end_to_end_workflow.sh` to use simplified pipeline
2. Use `run_analysis_only.sh` for analysis updates
3. Remove Neo4j-dependent steps

## Testing After Cleanup
1. Run `run_analysis_only.sh` to verify analysis works
2. Check dashboard displays correctly
3. Verify all wordclouds and stats are generated
4. Run tests to ensure nothing critical was removed
125 changes: 125 additions & 0 deletions DASHBOARD_TEST_CHECKLIST.md
# Dashboard Testing Checklist

## Pre-Cleanup Reference State
Date: 2025-09-17
Branch: fix/issue-23-cleanup-consolidate-outputs

### Current Dashboard Components

#### 1. Header Section
- [x] Title: "NEMAR Dataset Citation Analysis"
- [x] Description text present
- [x] 318 datasets mentioned in description

#### 2. Statistics Cards (Top Row)
- [x] Datasets Analyzed: 318
- [x] High-Confidence Citations: 1062
- [x] Research Bridge Papers: 80
- [x] Confidence Threshold: ≥0.4

#### 3. Main Tabs
- [x] Overview (default selected)
- [x] Network Analysis
- [x] Research Themes

#### 4. Overview Tab Content
##### Left Panel - Citation Quality
- [x] Bar chart showing High-Confidence (1062, 84%) vs Low-Confidence (198, 16%)
- [x] Red and gray colors

##### Middle Panel - Growth Timeline
- [x] Line chart showing cumulative citations from 2018 to 2024
- [x] Blue line with data points

##### Right Panel - Data Modalities
- [x] Donut chart showing EEG (63.2%), MEG (23.7%), EG (2%), iEEG (11.2%)
- [x] Blue, purple, red colors

#### 5. Network Analysis Tab
- [x] Interactive network visualization loads
- [x] Nodes and edges visible
- [x] Zoom/pan controls work
- [x] Node hover shows dataset info

#### 6. Research Themes Tab
- [x] Theme 0 wordcloud visible
- [x] Theme 1 wordcloud visible
- [x] Theme 2 wordcloud visible
- [x] Theme 3 wordcloud visible
- [x] Wordclouds have proper colors and layouts

#### 7. Modal Dialogs (Click on cards)
##### Datasets Modal
- [x] Shows 302 total datasets (inconsistent with the 318 on the main card)
- [x] Shows 290 with high-conf citations
- [x] Shows 96% coverage
- [x] Lists top 20 datasets with citations

##### Citations Modal
- [x] Shows citation quality breakdown
- [x] Shows temporal distribution
- [x] Shows confidence score distribution

##### Bridge Papers Modal
- [x] Lists bridge papers
- [x] Shows datasets bridged
- [x] Shows citation impact

##### Threshold Modal
- [x] Explains confidence scoring
- [x] Shows threshold value

### File Dependencies Check
```bash
# Files that must exist for dashboard to work
interactive_reports/dataset_citations_dashboard_nemar.html # Main dashboard
interactive_reports/dashboard_styles.css # Styles
dashboard_data/aggregated_data.json # Data source
interactive_reports/data/themes/theme_*_wordcloud.png # Wordclouds
interactive_reports/data/themes/comprehensive_theme_analysis.json # Theme data
```

### Console Errors Check
- [ ] No JavaScript errors in browser console
- [ ] All resources load successfully (no 404s)
- [ ] Network visualization initializes without errors

## Testing Procedure After Each Cleanup Step

### Quick Test (after minor changes)
1. Open dashboard HTML in browser
2. Check all 4 statistics cards show numbers
3. Click through all 3 tabs
4. Verify wordclouds visible in Research Themes tab

### Full Test (after major changes)
1. Run `./run_analysis_only.sh`
2. Check output for errors
3. Open new dashboard in browser
4. Go through entire checklist above
5. Compare with reference dashboard (saved copy)

### Critical Failures (stop if any occur)
- Dashboard doesn't open
- Statistics cards show 0 or NaN
- Wordclouds missing
- Network visualization broken
- JavaScript errors in console

## Comparison Method
```bash
# Save reference dashboard before cleanup
cp interactive_reports/dataset_citations_dashboard_nemar.html reference_dashboard.html

# After cleanup, compare key metrics
grep -o "318\|1062\|80\|0.4" reference_dashboard.html | sort | uniq -c > reference_metrics.txt
grep -o "318\|1062\|80\|0.4" interactive_reports/dataset_citations_dashboard_nemar.html | sort | uniq -c > new_metrics.txt
diff reference_metrics.txt new_metrics.txt
```
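The grep/diff comparison above can be wrapped in an explicit pass/fail guard so a script can stop on mismatch. A sketch (the two metric files are emulated here with fixed sample content matching the `sort | uniq -c` output shape):

```shell
# Pass/fail guard around the metric-file comparison.
reference=$(mktemp)
new=$(mktemp)
printf '   3 318\n   2 1062\n' > "$reference"
printf '   3 318\n   2 1062\n' > "$new"

# diff -q exits 0 when the files are identical, 1 when they differ
if diff -q "$reference" "$new" > /dev/null; then
  RESULT=pass
else
  RESULT=fail
fi
echo "metrics comparison: $RESULT"
```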

## Rollback Plan
If critical failure occurs:
1. `git stash` current changes
2. `git checkout main`
3. Analyze what went wrong
4. Try a more conservative cleanup approach