Add phase-1 MIDI dataset pipeline with manifest generation and audio->MIDI adapter interface by weiweiweiopen · Pull Request #2 · weiweiweiopen/ageha

weiweiweiopen · 2026-04-11T05:47:33Z

Provide a lightweight, portable phase-1 pipeline to scan artist-organized MIDI folders and produce consistent per-artist manifests.
Detect and flag candidate duplicate MIDI files using exact (SHA-256) hashes and conservative fuzzy signatures, to support downstream curation.
Define a stable adapter contract for future phase-2 audio->MIDI transcription backends so that transcribed MIDI can reuse the same manifest flow.
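
As a rough illustration of the exact-match half of duplicate detection, the sketch below groups files by the SHA-256 digest of their contents. The function names `sha256_of_file` and `group_exact_duplicates` are hypothetical and are not taken from this PR. The fuzzy signature is left out because its definition is not stated here.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Chunked reads keep memory use flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Group paths with byte-identical contents; only groups of 2+ are returned."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in paths:
        by_hash[sha256_of_file(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Grouping by digest only flags candidates. Which copy to keep is a decision for the curation step.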

Add pipeline/core/manifest_models.py to define MidiMetadata, ManifestEntry, and ArtistManifest typed models and CSV flattening helpers.
Add pipeline/core/midi_scan.py that implements MIDI file iteration, header validation (_parse_midi_header), SHA256 hashing, fuzzy signature generation, duplicate grouping, and write_artist_manifest for JSON/CSV outputs.
Add CLI entrypoint scripts/dataset/phase1_build_manifest.py to scan --midi-library, build manifests, and write outputs under each artist's --manifest-subdir (default .dataset).
Add pipeline/adapters/audio2midi/interface.py to declare AudioIngestionRecord, TranscriptionResult, and the AudioToMidiAdapter Protocol for future transcription integrations.
Add a top-level README.md describing the pipeline and its usage.
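
For readers unfamiliar with header validation, here is a minimal sketch of what checking a Standard MIDI File header can look like. It is not the PR's `_parse_midi_header`; the names `MidiHeader` and `parse_midi_header` are assumptions for illustration. The only thing it relies on is the standard layout: a 4-byte `MThd` tag, a big-endian 32-bit chunk length (at least 6), then 16-bit format, track count, and division fields.

```python
import struct
from typing import NamedTuple, Optional


class MidiHeader(NamedTuple):
    format_type: int  # 0, 1, or 2
    track_count: int
    division: int     # raw 16-bit division field (ticks per quarter note or SMPTE)


def parse_midi_header(data: bytes) -> Optional[MidiHeader]:
    """Return the parsed header, or None if data lacks a valid MThd chunk."""
    # The tag (4 bytes), length (4 bytes), and three 16-bit fields total 14 bytes.
    if len(data) < 14 or data[:4] != b"MThd":
        return None
    length, format_type, track_count, division = struct.unpack(">IHHH", data[4:14])
    # The header chunk body is 6 bytes in practice; a shorter length is malformed.
    if length < 6 or format_type not in (0, 1, 2):
        return None
    return MidiHeader(format_type, track_count, division)
```

Returning `None` rather than raising lets a scanner skip non-MIDI files without interrupting the directory walk, though the actual pipeline may handle invalid files differently.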

Add phase-1 MIDI manifest pipeline with dedup and metadata

fb96c6b

weiweiweiopen added the codex label Apr 11, 2026 — with ChatGPT Codex Connector
