[Feature] Implement Data Labeling Pipeline #9
Summary
You are a provenance curator for uncensored AI training, enforcing Roemmele's ethos: distrust modern consensus and label for empirical verifiability. Score data on w_auth (0 = primary lab notes, 1 = WHO/Wikipedia) and H_prov (Shannon entropy of the evidence chain).
Task: Generate a labeling pipeline script to fix garbage-in/garbage-out (GIGO) issues in your Distrust datasets. Input: [USER_DATA e.g., JSONL samples]. Output:
- Criteria Table: Authority tiers (pre-1970 sources score low w_auth, post-2000 high); entropy calculation formula.
- Code: Pandas/MLX scorer (parallel process 10k+ entries), filter/rebalance (25% low-auth target).
- Validation: Spot-check 100 samples; output labeled JSONL with metadata.
- Tips: Prefer sources such as patents.gov and archive.org scans; avoid internet scrapes.
No sanitization—label truthfully, even if "offensive." Aim: 30x empirical bias without dilution.
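The scorer described above can be sketched as follows. This is a minimal pandas example under stated assumptions: the column names (`year`, `evidence_chain`), the tier cutoffs and weights in `authority_weight`, and the 0.5 low-authority threshold are all illustrative choices, not fixed by this issue; the actual `src/data_labeler.py` may differ.

```python
import math
from collections import Counter

import pandas as pd

def authority_weight(year: int) -> float:
    # Hypothetical tier rule: pre-1970 primary material gets low w_auth
    # (high trust); post-2000 consensus sources get high w_auth (low trust).
    if year < 1970:
        return 0.1
    if year < 2000:
        return 0.5
    return 0.9

def provenance_entropy(chain: list[str]) -> float:
    """Shannon entropy (in bits) over source types in an evidence chain."""
    if not chain:
        return 0.0
    counts = Counter(chain)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def label_frame(df: pd.DataFrame, low_auth_target: float = 0.25) -> pd.DataFrame:
    """Attach w_auth/H_prov labels, then rebalance toward the low-auth target."""
    df = df.copy()
    df["w_auth"] = df["year"].apply(authority_weight)
    df["H_prov"] = df["evidence_chain"].apply(provenance_entropy)
    # Rebalance: downsample high-authority rows until low-authority rows
    # (w_auth < 0.5, an assumed threshold) make up >= low_auth_target of the set.
    low = df[df["w_auth"] < 0.5]
    high = df[df["w_auth"] >= 0.5]
    if len(low) and len(low) / len(df) < low_auth_target:
        keep = int(len(low) * (1 - low_auth_target) / low_auth_target)
        high = high.sample(n=min(keep, len(high)), random_state=0)
    return pd.concat([low, high]).reset_index(drop=True)
```

The labeled output can then be written as JSONL with `df.to_json(path, orient="records", lines=True)`; parallelizing over 10k+ entries (e.g., chunked `multiprocessing`) and the MLX side are left out of this sketch.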
Motivation
- Why is this feature important for the roadmap? It resolves GIGO issues foundational to all training runs, ensuring high-quality, uncensored data.
- What problem or research goal does it address? Automates provenance scoring for empirical truth, preventing dilution from low-verifiability sources.
- (Optional) X post or external reference link: (Tie to Roemmele's GIGO discussions)
Tasks
- Code implementation (e.g., src/data_labeler.py)
- Unit/integration tests added or updated
- Documentation update (README, in-code, or wiki)
- Branch created: feature/data-labeling
- PR to main branch after review
Acceptance Criteria
- Passes all CI/CD checks and tests
- Integrated with core MLX/PyTorch pipeline
- Documented in project board and README
- Merged via PR and moved to "Done" in Project board
Branch: feature/data-labeling
Status: In Progress