Add phase-1 MIDI dataset pipeline with manifest generation and audio->MIDI adapter interface #2

Open
weiweiweiopen wants to merge 1 commit into master from codex/design-dataset-pipeline-architecture

Conversation

@weiweiweiopen
Owner

Motivation

  • Provide a lightweight, portable phase-1 pipeline to scan artist-organized MIDI folders and produce consistent per-artist manifests.
  • Detect and flag duplicate MIDI candidates using exact (sha256) and conservative fuzzy signatures to support downstream curation.
  • Define a stable adapter contract for future phase-2 audio->MIDI transcription backends so transcribed MIDI can reuse the same manifesting flow.
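For illustration, the exact-duplicate half of the second point could look roughly like the sketch below. The function names `sha256_of_file` and `group_exact_duplicates` are hypothetical and not taken from this PR; the fuzzy-signature path is not covered because its definition is not given here.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths: list[Path]) -> dict[str, list[Path]]:
    """Map each digest shared by two or more files to the sorted list of those files."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in paths:
        by_digest[sha256_of_file(path)].append(path)
    # Keep only digests that occur more than once: these are duplicate candidates.
    return {d: sorted(ps) for d, ps in by_digest.items() if len(ps) > 1}
```

Grouping by digest only flags byte-identical files; anything that differs by a single byte (for example, an edited track name) would need the fuzzy signature to be caught.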

Description

  • Add pipeline/core/manifest_models.py to define MidiMetadata, ManifestEntry, and ArtistManifest typed models and CSV flattening helpers.
  • Add pipeline/core/midi_scan.py that implements MIDI file iteration, header validation (_parse_midi_header), SHA256 hashing, fuzzy signature generation, duplicate grouping, and write_artist_manifest for JSON/CSV outputs.
  • Add CLI entrypoint scripts/dataset/phase1_build_manifest.py to scan --midi-library, build manifests, and write outputs under each artist's --manifest-subdir (default .dataset).
  • Add pipeline/adapters/audio2midi/interface.py to declare AudioIngestionRecord, TranscriptionResult, and the AudioToMidiAdapter Protocol for future transcription integrations.
  • Add a top-level README.md describing the pipeline and usage.
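As a rough picture of what header validation involves, here is a minimal sketch based on the Standard MIDI File layout (a 4-byte `MThd` tag, a 4-byte big-endian length, then three 16-bit fields). It is not the PR's `_parse_midi_header`; the names `MidiHeader` and `parse_midi_header` and the choice to return `None` on failure are assumptions made for this example.

```python
import struct
from typing import NamedTuple, Optional


class MidiHeader(NamedTuple):
    midi_format: int  # SMF format: 0, 1, or 2
    track_count: int  # number of track chunks declared by the header
    division: int     # raw 16-bit time-division field


def parse_midi_header(data: bytes) -> Optional[MidiHeader]:
    """Return the parsed header, or None if the bytes do not start with a valid MThd chunk."""
    if len(data) < 14:
        return None
    # Big-endian: 4-byte chunk id, 32-bit length, then three unsigned 16-bit fields.
    chunk_id, length, midi_format, track_count, division = struct.unpack(
        ">4sIHHH", data[:14]
    )
    if chunk_id != b"MThd" or length < 6:
        return None
    if midi_format not in (0, 1, 2):
        return None
    return MidiHeader(midi_format, track_count, division)
```

Only the first 14 bytes are needed, so a scanner can reject non-MIDI files cheaply before hashing them.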

Testing

  • No automated tests were added or executed as part of this change.

Codex Task
