Commit eab367b

Merge pull request #12 from Boyeep/chore/add-community-files
docs: add sign language and tooling guides
2 parents 8a8db8f + be733eb

2 files changed: +462 -0 lines

docs/sign-language-template.md (202 additions, 0 deletions)
# Sign Language Stack For This Template

This repo is a good fit for a sign-language project, but the best stack depends on what you mean by "sign language."

## Start With The Problem Shape

There are three common versions of this project:

1. `Static hand signs`
   Example: alphabet letters or a small fixed set of hand poses.
2. `Dynamic signs`
   Example: signs that depend on motion over time, not a single frame.
3. `Full sign-language understanding`
   Example: larger vocabularies where hand shape, motion, body pose, and face cues matter together.

The further you move from static poses toward real sign language, the less a simple object detector can cover on its own.

## Best Recommendation For This Repo

For this template, the strongest path is:

- `Frontend`: keep using the existing Next.js webcam or upload flow
- `Feature extraction`: use `MediaPipe` hand landmarks first
- `Model training`: use `PyTorch`
- `Inference runtime`: export to `ONNX` and run with `ONNX Runtime` in the backend
- `Backend API`: keep FastAPI as the contract boundary

That gives you a practical stack that is:

- fast enough for demos and hackathons
- easier to train than raw image-to-label models
- more stable than trying to force YOLO into a gesture problem
- compatible with this repo's existing "analyze image or frame and return typed results" shape
## What To Use By Project Type

### 1. Static Sign Demo

Use this when you want:

- alphabet recognition
- a small vocabulary
- one signer in front of a webcam
- a fast MVP

Recommended stack:

- `MediaPipe Hand Landmarker`
- a small classifier on top of hand landmarks
- `PyTorch` for training
- `ONNX Runtime` for backend inference

Why:

- landmarks reduce the amount of visual noise
- you do not need a heavy detector for a single webcam user
- training on landmarks is usually easier than training on raw images
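A minimal sketch of the landmark-to-feature step, assuming landmarks arrive as 21 `(x, y, z)` points from the Hand Landmarker. The normalization scheme here (wrist at the origin, scaled by the farthest keypoint) is one common choice for making the classifier translation- and scale-invariant, not something prescribed by this repo:

```python
# Sketch: convert one hand's MediaPipe landmarks into a feature vector
# suitable for a small classifier. The normalization is an assumption.
import numpy as np

def landmarks_to_features(landmarks):
    """landmarks: array-like of shape (21, 3), one (x, y, z) per keypoint.
    Returns a 63-dim feature vector normalized for position and scale."""
    pts = np.asarray(landmarks, dtype=np.float32)
    pts = pts - pts[0]                      # wrist (landmark 0) becomes the origin
    scale = np.linalg.norm(pts, axis=1).max()
    if scale > 0:
        pts = pts / scale                   # farthest keypoint sits at distance 1
    return pts.flatten()                    # (63,) feature vector

feats = landmarks_to_features(np.random.rand(21, 3))
print(feats.shape)
```

Because the features no longer encode where the hand is in the frame or how close it is to the camera, the classifier only has to learn hand shape, which is why landmark models tend to need far less training data than raw-image ones.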
### 2. Dynamic Sign Recognition

Use this when the sign depends on motion across multiple frames.

Recommended stack:

- `MediaPipe Holistic` or at least `hands + pose`
- a sequence model such as `LSTM`, `GRU`, or a small `Transformer`
- `PyTorch` for training
- `ONNX Runtime` for serving

Why:

- many signs are not defined by one frame
- temporal context matters
- body and face cues can matter, not only the hand outline
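The sequence model can be very small. Below is a sketch of a GRU classifier over landmark sequences; the window length (30 frames), feature size (63), and class count (50) are illustrative assumptions, not values from this repo:

```python
# Sketch: a small GRU classifier over per-frame landmark features.
# Shapes are assumptions: 30-frame windows of 63-dim features, 50 signs.
import torch
import torch.nn as nn

class SignGRU(nn.Module):
    def __init__(self, feat_dim=63, hidden=128, n_classes=50):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):          # x: (batch, frames, feat_dim)
        _, h = self.gru(x)         # h: (num_layers, batch, hidden)
        return self.head(h[-1])    # classify from the final hidden state

model = SignGRU()
logits = model(torch.randn(4, 30, 63))   # a batch of 4 frame windows
print(logits.shape)
```

The same PyTorch-to-ONNX export path used for the static classifier applies here, so adding a temporal pipeline does not change the serving story.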
### 3. Larger Or More Realistic Sign-Language Systems

Use this when you want more than a demo and need better linguistic coverage.

Recommended stack:

- `MediaPipe Holistic`
- a sequence model over landmarks and possibly cropped image features
- optional dataset tooling for alignment and labeling
- `ONNX Runtime` or another production runtime

Important note:

If the goal is actual sign language rather than "gesture control," a hands-only pipeline will likely cap out early.
## Where It Fits In This Repo

### Frontend

Use the existing webcam and upload experience as the input layer:

- `frontend/src/components/webcam-console.tsx`
- `frontend/src/components/inference-console.tsx`

That means you can keep the product flow the repo already teaches:

1. capture or upload an image or frame
2. send it to the backend
3. receive typed results
4. render overlays, labels, and metrics

### Backend

The backend is where the actual CV or ML logic should live:

- `backend/app/vision/service.py`
- `backend/app/vision/pipelines.py`
- `backend/app/api/routes/inference.py`

The cleanest extension is to add a new pipeline entry such as:

- `sign-static`
- `sign-sequence`

That keeps the repo's pipeline registry pattern intact.
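As a hypothetical sketch of what adding such an entry could look like, assuming the registry is a name-to-callable mapping (the actual structure of `pipelines.py` may differ, and the helper names here are invented for illustration):

```python
# Hypothetical sketch of a pipeline registry pattern; the real
# backend/app/vision/pipelines.py may be organized differently.
from typing import Callable, Dict

PIPELINES: Dict[str, Callable[[bytes], dict]] = {}

def register(name: str):
    """Decorator that adds a pipeline function under a stable name."""
    def wrap(fn: Callable[[bytes], dict]) -> Callable[[bytes], dict]:
        PIPELINES[name] = fn
        return fn
    return wrap

@register("sign-static")
def sign_static(image_bytes: bytes) -> dict:
    # Real steps would be: decode image -> run hand landmarker ->
    # normalize landmarks -> classify. Placeholder result below.
    return {"label": "A", "confidence": 0.0}

print(sorted(PIPELINES))
```

The route handler can then dispatch on the pipeline name from the request, so adding `sign-sequence` later is a second `@register` call rather than a new endpoint.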
### Contract

If you change the shape of the response, also update:

- `docs/openapi.yaml`
- `frontend/src/generated/openapi.ts`

If you can keep the response close to the existing typed contract, integration stays easier.

## Recommended Output Shape

For a sign-language MVP in this template, I would return:

- top predicted sign label
- confidence score
- optional hand boxes or landmark-derived regions
- metrics such as handedness, frame count, or latency

For dynamic signs, consider adding:

- sequence window size
- temporal confidence
- an optional "still collecting frames" status

Avoid coupling the frontend to raw model internals. Keep the backend responsible for translating model output into product-friendly fields.
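The suggested shape can be sketched as a plain dataclass; in the actual FastAPI backend this would more likely be a Pydantic model mirrored in `docs/openapi.yaml`, and every field name here is illustrative rather than part of the existing contract:

```python
# Sketch of the suggested response shape. Field names are assumptions;
# the real backend would define a Pydantic model tied to the OpenAPI spec.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SignPrediction:
    label: str                                   # top predicted sign
    confidence: float                            # 0.0 .. 1.0
    hand_boxes: List[Tuple[float, float, float, float]] = field(
        default_factory=list)                    # optional landmark-derived regions
    handedness: Optional[str] = None             # "Left" / "Right"
    frame_count: int = 1                         # 1 for static, >1 for sequences
    latency_ms: float = 0.0                      # backend-measured inference time

pred = SignPrediction(label="HELLO", confidence=0.91, handedness="Right")
print(pred.label, pred.confidence)
```

Keeping model internals (logits, landmark tensors) out of this shape is what lets you swap the static classifier for a sequence model later without touching the frontend.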
## When To Use YOLO

`YOLO` is useful when you need detection, such as:

- multiple people in frame
- signer localization in a wide camera view
- hand or person detection before a second-stage recognizer

It is usually not my first recommendation for a single-user webcam sign demo because:

- you still need recognition after detection
- landmarks are often a better representation for sign tasks
- it adds training and inference complexity early

## When To Use A Hosted Model

A hosted model can be useful for:

- quick experiments
- low-ops prototypes
- testing ideas before local deployment

But for sign-language interaction, local inference is often better because of:

- lower latency
- lower recurring cost
- better privacy
- fewer network dependencies during demos
## Suggested Build Order
178+
179+
1. `MVP`
180+
Add a `sign-static` backend pipeline using hand landmarks and a small classifier.
181+
2. `Webcam loop`
182+
Reuse the current webcam page and submit captured frames to the same inference endpoint.
183+
3. `Temporal model`
184+
Add a second pipeline for dynamic signs using short frame sequences.
185+
4. `Contract refinement`
186+
Expand the API only when the frontend truly needs more than label, confidence, and review metadata.
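Step 3's temporal pipeline needs somewhere to accumulate frames before it can classify, which is also where the "still collecting frames" status comes from. A minimal sketch, assuming fixed-size windows of per-frame features (the window length is an arbitrary example):

```python
# Sketch: a sliding window of per-frame landmark features for the
# temporal pipeline. Window size is an illustrative assumption.
from collections import deque

class FrameWindow:
    def __init__(self, size: int = 30):
        self.size = size
        self.buf = deque(maxlen=size)   # oldest frame drops off automatically

    def push(self, features) -> None:
        self.buf.append(features)

    def ready(self) -> bool:
        # Until this is True, the API can return a
        # "still collecting frames" status instead of a prediction.
        return len(self.buf) == self.size

w = FrameWindow(size=3)
w.push([0.0])
w.push([0.1])
print(w.ready())   # not enough frames yet
w.push([0.2])
print(w.ready())   # window full, ready to classify
```

Because `deque(maxlen=...)` evicts the oldest entry on overflow, the same buffer keeps serving predictions on every new frame once it has filled, giving a rolling window rather than discrete batches.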
## Simple Decision Guide

- If you want a fast hackathon demo: `MediaPipe Hand Landmarker + small classifier`
- If you want real-time local inference: `PyTorch -> ONNX -> ONNX Runtime`
- If you want broader sign understanding: `MediaPipe Holistic + sequence model`
- If you need person or hand detection in messy scenes: add `YOLO` as a helper, not the whole solution

## Official References

- MediaPipe Hand Landmarker: <https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker>
- MediaPipe Gesture Recognizer: <https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer>
- MediaPipe Gesture customization: <https://ai.google.dev/edge/mediapipe/solutions/customization/gesture_recognizer>
- MediaPipe Holistic Landmarker: <https://ai.google.dev/edge/mediapipe/solutions/vision/holistic_landmarker>
- ONNX Runtime docs: <https://onnxruntime.ai/docs/>
- Ultralytics YOLO docs: <https://docs.ultralytics.com/>
