Clean up scattered analysis outputs and consolidate to dashboard_data #24
Merged: neuromechanist merged 31 commits into main from fix/issue-23-cleanup-consolidate-outputs on Sep 19, 2025
- Skips time-consuming discovery and citation scraping
- Focuses only on analysis and dashboard generation
- Generates embeddings, themes, wordclouds, and dashboard
- Consolidates outputs to dashboard_data directory
- Related to issue #23

- Document all obsolete directories and files
- Map dashboard dependencies
- List CLI commands to keep vs remove
- Define target consolidated structure

- Add comprehensive cleanup plan for issue #23
- Create dashboard testing checklist
- Save reference metrics for comparison
- Document all components to preserve

- Save complete dashboard HTML as reference
- Save dashboard data directory copy
- These will be used to verify dashboard still works after cleanup
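The saved reference metrics lend themselves to a mechanical before/after comparison. The following is a minimal sketch, assuming the metrics are available as flat dictionaries; the function name and the metric keys are hypothetical and not taken from the repository.

```python
def compare_metrics(reference: dict, current: dict) -> dict:
    """Return {key: (reference_value, current_value)} for every key that differs."""
    diffs = {}
    for key in sorted(set(reference) | set(current)):
        ref_val = reference.get(key)
        cur_val = current.get(key)
        if ref_val != cur_val:
            diffs[key] = (ref_val, cur_val)
    return diffs


# Hypothetical metric names with illustrative values.
reference = {"datasets": 318, "citations": 1062}
current = {"datasets": 318, "citations": 1062}

# An empty result means the regenerated dashboard matches the reference.
print(compare_metrics(reference, current))
```

Reporting both values for each differing key makes it easier to tell a regression from an intended correction.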
- Remove test_output/ directory
- Remove test_output_nemar/ directory
- Remove test_modular_output/ directory
- These were temporary test directories not needed in production

- Remove results/ directory (old analysis outputs)
- Remove embeddings/analysis/ (duplicate analysis)
- Remove data/ directory (old data location)
- All outputs now consolidated to dashboard_data/

- Remove unused CLI commands:
  - automate_visualization_updates
  - create_network_visualizations
  - create_research_context_networks
  - create_theme_networks
  - export_external_tools
  - load_graph
- Remove Neo4j dependent modules:
  - neo4j_loader.py
  - neo4j_network_analysis.py
  - network_visualizations.py
- These were not used in the current pipeline

- Remove *.png files (unused network visualizations)
- Remove *.graphml, *.gexf, *.cx files (unused graph formats)
- Keep only essential dashboard files

- Fix DashboardGenerator to receive citations_dir parameter
- Update run_analysis_only.sh to pass citations_dir
- Update run_end_to_end_workflow.sh to use correct paths
- Dashboard now correctly shows 318 datasets and 1062 citations

- Add test_output*/ directories to ignore
- Add temporary analysis text files
- Add backup files and reference dashboards
- These are generated during testing and not needed in repo

- Remove old network analysis files
- Remove old temporal analysis files
- Remove old theme wordclouds and analysis
- Remove old visualizations
- These will be regenerated fresh by run_analysis_only.sh
- Keep only aggregated_data.json and README

- Remove old wordcloud images
- Remove old theme analysis JSON
- Update dashboard HTML with corrected statistics
- Dashboard data will be regenerated on demand

- Generated embeddings for 15 new datasets
- Updated embedding registry and metadata
- These are needed for theme analysis and visualization

- Remove broken Neo4j imports from graph module
- Add run_full_analysis.sh script that generates all dashboard components
- Script generates: embeddings, UMAP analysis, themes with wordclouds, network analysis (without Neo4j), temporal analysis, and bridge papers
- Ensures dashboard has all needed data for full functionality

- Add generate_themes.py for wordcloud generation
- Add generate_network.py for network analysis without Neo4j
- Add generate_temporal.py for temporal analysis
- Fix confidence score extraction from nested structure
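The confidence score fix concerns a value that can sit one level deeper than the reader expects. The commit does not show the actual layout, so the sketch below assumes a record where the score is stored either directly or inside a nested object; both key names (`confidence_score`, `confidence_scoring`) are hypothetical.

```python
def extract_confidence(citation: dict, default: float = 0.0) -> float:
    """Read a confidence score stored either flat or one level down.

    Assumed (hypothetical) layouts:
      {"confidence_score": 0.8}
      {"confidence_scoring": {"confidence_score": 0.8}}
    """
    value = citation.get("confidence_score")
    if value is None:
        nested = citation.get("confidence_scoring")
        if isinstance(nested, dict):
            value = nested.get("confidence_score")
    try:
        return float(value)
    except (TypeError, ValueError):
        # Missing or non-numeric score: fall back to the default.
        return default


flat = extract_confidence({"confidence_score": 0.8})
nested = extract_confidence({"confidence_scoring": {"confidence_score": "0.5"}})
missing = extract_confidence({})
```

Returning a default instead of raising keeps a single malformed citation from aborting a whole analysis run, at the cost of hiding bad input; logging the fallback may be worthwhile.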
- Fix DataAggregator and DashboardGenerator initialization in all scripts
- Replace results_dir=Path('.') with results_dir=Path('dashboard_data')
- Use new analysis modules instead of inline Python code
- Ensures aggregator finds the generated analysis files
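The effect of the `results_dir` change can be illustrated with a stand-in lookup. This is a sketch under the assumption that the aggregator searches for analysis files directly beneath `results_dir`; `find_analysis_files` is a hypothetical helper, not the project's `DataAggregator`.

```python
import json
import tempfile
from pathlib import Path


def find_analysis_files(results_dir: Path) -> list:
    """Stand-in lookup: names of JSON files directly under results_dir."""
    return sorted(p.name for p in results_dir.glob("*.json"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)                      # plays the role of Path('.')
    out = root / "dashboard_data"         # plays the role of Path('dashboard_data')
    out.mkdir()
    (out / "aggregated_data.json").write_text(json.dumps({}), encoding="utf-8")

    found_in_root = find_analysis_files(root)
    found_in_out = find_analysis_files(out)

print(found_in_root)  # the project root holds no analysis files
print(found_in_out)   # the consolidated directory does
```

With the lookup rooted at the working directory, files written under dashboard_data/ are never seen, which matches the symptom the commit describes.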
- Add dashboard_data generated files to .gitignore
- Add interactive_reports HTML and data to .gitignore
- Add embeddings analysis outputs to .gitignore
- These files can be regenerated using run_full_analysis.sh

- Remove dashboard_data/aggregated_data.json
- Remove embeddings/metadata/embedding_registry.json
- Remove interactive_reports HTML and JSON data
- These files are now properly ignored by .gitignore

- Remove src/dataset_citations/cli/create_interactive_reports.py
- Remove CLI entry point from pyproject.toml
- This was replaced by the dashboard module and is no longer used

- Read theme names and sizes from comprehensive_theme_analysis.json
- Display citation count for each theme cluster
- Add fallback for when themes haven't been generated yet
- Removes hardcoded theme names
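Reading themes from a generated file with a fallback could look roughly like the following. The schema of comprehensive_theme_analysis.json is not shown in this PR, so the `themes`, `name`, and `size` keys are assumptions, as is the `load_themes` helper itself.

```python
import json
import tempfile
from pathlib import Path


def load_themes(path) -> list:
    """Return theme names and sizes, or [] if the file is missing or unreadable."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return []  # themes have not been generated yet
    themes = data.get("themes", []) if isinstance(data, dict) else []
    return [
        {"name": str(t.get("name", "Unnamed theme")), "citations": int(t.get("size", 0))}
        for t in themes
        if isinstance(t, dict)
    ]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "comprehensive_theme_analysis.json"
    before = load_themes(path)  # file does not exist yet
    path.write_text(
        json.dumps({"themes": [{"name": "Example theme", "size": 12}]}),
        encoding="utf-8",
    )
    after = load_themes(path)
```

An empty list lets the template render a "no themes yet" state instead of failing when the analysis step has not run.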
- Map citation_count to high_confidence_citations for datasets
- Fix bridge papers to use correct field names from CSV (author, num_datasets)
- Add proper field mapping for template compatibility
- Resolves issue with modals showing 0 citations and Unknown authors
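The field mapping described above can be sketched as two small adapters. The field names `citation_count`, `high_confidence_citations`, `author`, and `num_datasets` come from the commit message; the function names, the sample CSV, and the "Unknown" fallback are illustrative assumptions.

```python
import csv
import io


def map_dataset(dataset: dict) -> dict:
    """Expose citation_count under the key the template reads."""
    mapped = dict(dataset)
    mapped["high_confidence_citations"] = dataset.get("citation_count", 0)
    return mapped


def read_bridge_rows(csv_text: str) -> list:
    """Read bridge-paper rows using the CSV's own column names."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "author": row.get("author") or "Unknown",
            "num_datasets": int(row.get("num_datasets") or 0),
        })
    return rows


# Illustrative input only; not data from the repository.
sample_csv = "author,num_datasets\nA. Author,3\n,2\n"
bridge_rows = read_bridge_rows(sample_csv)
dataset = map_dataset({"id": "ds1", "citation_count": 7})
```

Doing the mapping in one place keeps the template free of knowledge about how each upstream file names its columns.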
- run_full_analysis.sh provides all the same functionality
- Reduces maintenance burden of keeping multiple scripts in sync
- Simplifies workflow to a single analysis script

- Added type check for datasets_bridged field
- Prevents JavaScript error when field is already an array
- Fixes bridge modal not opening on click

- Changed from author-only aggregation to paper-level tracking
- Bridge papers now include title, author, and year fields
- Identifies papers citing multiple datasets with full metadata
- CSV output includes paper titles instead of 'Unknown Title'
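Paper-level tracking amounts to grouping citations by paper rather than by author and keeping the papers that cite more than one dataset. A minimal sketch follows, assuming each citation is a dict with `title`, `author`, `year`, and `dataset_id` keys; the record layout and the `find_bridge_papers` name are hypothetical, while `num_datasets` and `datasets_bridged` mirror field names mentioned in the commits.

```python
from collections import defaultdict


def find_bridge_papers(citations, min_datasets=2):
    """Group citations by paper and keep papers citing at least min_datasets datasets."""
    datasets_by_paper = defaultdict(set)
    for c in citations:
        key = (c["title"], c["author"], c["year"])
        datasets_by_paper[key].add(c["dataset_id"])

    rows = [
        {
            "title": title,
            "author": author,
            "year": year,
            "num_datasets": len(datasets),
            "datasets_bridged": sorted(datasets),
        }
        for (title, author, year), datasets in datasets_by_paper.items()
        if len(datasets) >= min_datasets
    ]
    # Most-connected papers first; title breaks ties for stable output.
    rows.sort(key=lambda r: (-r["num_datasets"], r["title"]))
    return rows


# Illustrative records only.
citations = [
    {"title": "Paper A", "author": "X", "year": 2021, "dataset_id": "ds1"},
    {"title": "Paper A", "author": "X", "year": 2021, "dataset_id": "ds2"},
    {"title": "Paper B", "author": "X", "year": 2022, "dataset_id": "ds1"},
]
bridges = find_bridge_papers(citations)
```

Keying on the paper rather than the author means two different papers by the same author citing one dataset each are no longer reported as a bridge.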
- Added bridge_paper_year field to modal data
- Modal now properly displays paper titles from CSV
- Improved bridge paper display with year information

- Added embeddings generation step
- Added UMAP analysis for theme identification
- Added theme analysis with wordclouds using new Python modules
- Updated to use dashboard_data/ instead of results/
- Use new modular Python analysis scripts
- Include dashboard_data/ and embeddings/ in git commits

- Changed DashboardGenerator to use dashboard_data directory
- Consistent with new analysis module structure
- Maintains compatibility with all workflow modes

- Added shell scripts section to README
- Documented new analysis modules in DEVELOPMENT.md
- Updated testing benchmarks and workflows in TESTING.md
- Added examples for running analysis modules independently
- Updated last modified date

- Removed 1165 embeddings pkl files from git tracking
- These are generated files that can be recreated
- Added explicit gitignore entries for embeddings subdirectories
- Keeps repository clean and prevents outdated embeddings
Summary
This PR addresses issue #23 by cleaning up scattered analysis outputs and consolidating everything to the dashboard_data/ directory.

Changes Made

1. Directory Cleanup (atomic commits for easy revert)
- Test output directories (test_output*/, test_modular_output/)
- Obsolete analysis directories (results/, embeddings/analysis/, data/)
- dashboard_data/
- interactive_reports/

2. Code Cleanup

3. Workflow Improvements
- run_analysis_only.sh - streamlined script that skips discovery/scraping
- Dashboard generation fix (citations_dir parameter)
- .gitignore updated to exclude test outputs and backups

4. Documentation
- CLEANUP_PLAN.md documenting the cleanup strategy
- DASHBOARD_TEST_CHECKLIST.md for manual testing

Testing

Before Cleanup

After Cleanup

Test Command

Benefits
- Outputs consolidated in dashboard_data/
- run_analysis_only.sh script for quick iterations

Atomic Commits

Each major change is in a separate commit for easy rollback if needed:
- feat: add streamlined analysis-only workflow script
- cleanup: remove test output directories
- cleanup: remove obsolete analysis directories
- cleanup: remove obsolete CLI commands and Neo4j modules
- fix: correct dashboard generation by passing citations_dir

Fixes #23