# Sign Language Stack For This Template

This repo is a good fit for a sign-language project, but the best stack depends on what you mean by "sign language."

## Start With The Problem Shape

There are three common versions of this project:

1. `Static hand signs`
   Example: alphabet letters or a small fixed set of hand poses.
2. `Dynamic signs`
   Example: signs that depend on motion over time, not a single frame.
3. `Full sign-language understanding`
   Example: larger vocabularies where hand shape, motion, body pose, and face cues matter together.

The further you move from static poses toward real sign language, the less a simple object detector suffices on its own.

## Best Recommendation For This Repo

For this template, the strongest path is:

- `Frontend`: keep using the existing Next.js webcam or upload flow
- `Feature extraction`: use `MediaPipe` hand landmarks first
- `Model training`: use `PyTorch`
- `Inference runtime`: export to `ONNX` and run with `ONNX Runtime` in the backend
- `Backend API`: keep FastAPI as the contract boundary

That gives you a practical stack that is:

- fast enough for demos and hackathons
- easier to train than raw image-to-label models
- more stable than trying to force YOLO into a gesture problem
- compatible with this repo's existing "analyze image or frame and return typed results" shape

## What To Use By Project Type

### 1. Static Sign Demo

Use this when you want:

- alphabet recognition
- a small vocabulary
- one signer in front of a webcam
- a fast MVP

Recommended stack:

- `MediaPipe Hand Landmarker`
- a small classifier on top of hand landmarks
- `PyTorch` for training
- `ONNX Runtime` for backend inference

Why:

- landmarks reduce the amount of visual noise
- you do not need a heavy detector for a single webcam user
- training on landmarks is usually easier than training on raw images
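
The "small classifier on top of hand landmarks" approach depends on normalizing the raw landmark coordinates first, so the classifier sees hand shape rather than hand position or camera distance. A minimal sketch of that preprocessing step, assuming MediaPipe's 21-point hand output with the wrist at index 0; the function name is ours, not a MediaPipe API:

```python
import numpy as np

def normalize_landmarks(landmarks: np.ndarray) -> np.ndarray:
    """Make 21 MediaPipe hand landmarks translation- and scale-invariant.

    landmarks: array of shape (21, 3) with (x, y, z) per point;
    index 0 is the wrist in MediaPipe's convention.
    """
    # Translate so the wrist sits at the origin.
    centered = landmarks - landmarks[0]
    # Scale by the largest distance from the wrist so hand size
    # and camera distance stop mattering.
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / scale if scale > 0 else centered

# The flattened 63-dim vector is what a small classifier would consume.
features = normalize_landmarks(np.random.rand(21, 3)).flatten()
```

Training a classifier on these 63-dimensional vectors is a far smaller problem than training on raw webcam frames.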

### 2. Dynamic Sign Recognition

Use this when the sign depends on motion across multiple frames.

Recommended stack:

- `MediaPipe Holistic` or at least `hands + pose`
- a sequence model such as `LSTM`, `GRU`, or a small `Transformer`
- `PyTorch` for training
- `ONNX Runtime` for serving

Why:

- many signs are not defined by one frame
- temporal context matters
- body and face cues can matter, not only the hand outline
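
Because temporal context matters, the backend has to accumulate frames before a sequence model can classify anything. A minimal sketch of that buffering step, independent of the model library; the class name and the 30-frame default are assumptions, not repo code:

```python
from collections import deque

class SequenceBuffer:
    """Collect per-frame landmark vectors until a full window is ready.

    Dynamic-sign models classify a whole window of frames, not one frame.
    """

    def __init__(self, window_size=30):
        self.window_size = window_size
        self.frames = deque(maxlen=window_size)

    def push(self, landmark_vector):
        """Add one frame; return the full window once enough frames exist."""
        self.frames.append(landmark_vector)
        if len(self.frames) < self.window_size:
            return None  # "still collecting frames"
        return list(self.frames)

buf = SequenceBuffer(window_size=3)
first = buf.push([0.0])   # None: still collecting
second = buf.push([0.1])  # None: still collecting
window = buf.push([0.2])  # full window, ready to classify
```

Because `deque(maxlen=...)` drops the oldest frame automatically, the window slides forward one frame at a time once it is full, which is what a live webcam loop needs.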

### 3. Larger Or More Realistic Sign-Language Systems

Use this when you want more than a demo and need better linguistic coverage.

Recommended stack:

- `MediaPipe Holistic`
- a sequence model over landmarks and possibly cropped image features
- optional dataset tooling for alignment and labeling
- `ONNX Runtime` or another production runtime

Important note:

If the goal is actual sign language rather than "gesture control," a hands-only pipeline will likely hit a ceiling early, because sign languages carry grammatical information in facial expression and body posture, not just hand shape.

## Where It Fits In This Repo

### Frontend

Use the existing webcam and upload experience as the input layer:

- `frontend/src/components/webcam-console.tsx`
- `frontend/src/components/inference-console.tsx`

That means you can keep the product flow the repo already teaches:

1. capture or upload an image or frame
2. send it to the backend
3. receive typed results
4. render overlays, labels, and metrics

### Backend

The backend is where the actual CV or ML logic should live:

- `backend/app/vision/service.py`
- `backend/app/vision/pipelines.py`
- `backend/app/api/routes/inference.py`

The cleanest extension is to add a new pipeline entry such as:

- `sign-static`
- `sign-sequence`

That keeps the repo's pipeline registry pattern intact.
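
Extending the registry could look roughly like this. This is a hypothetical sketch of the pattern, not the repo's actual code; the real registry in `backend/app/vision/pipelines.py` may use different names and signatures:

```python
from typing import Callable, Dict

# Hypothetical mirror of a string-keyed pipeline registry.
PIPELINES: Dict[str, Callable[[bytes], dict]] = {}

def register(name: str):
    """Decorator that adds a pipeline function under a string key."""
    def wrap(fn: Callable[[bytes], dict]):
        PIPELINES[name] = fn
        return fn
    return wrap

@register("sign-static")
def sign_static(image_bytes: bytes) -> dict:
    # Placeholder: run landmark extraction + classifier here.
    return {"label": "A", "confidence": 0.0}

# The route handler would dispatch by pipeline name.
result = PIPELINES["sign-static"](b"")
```

The point of the pattern is that adding `sign-sequence` later is one more decorated function, with no changes to the route handler.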

### Contract

If you change the shape of the response, also update:

- `docs/openapi.yaml`
- `frontend/src/generated/openapi.ts`

Keeping the response close to the existing typed contract keeps integration simple.

## Recommended Output Shape

For a sign-language MVP in this template, I would return:

- top predicted sign label
- confidence score
- optional hand boxes or landmark-derived regions
- metrics such as handedness, frame count, or latency

For dynamic signs, consider adding:

- sequence window size
- temporal confidence
- optional "still collecting frames" status

Avoid coupling the frontend to raw model internals; keep the backend responsible for translating model output into product-friendly fields.
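
That translation layer can be sketched as a small dataclass plus one mapping function; the field names here are suggestions, not the repo's actual typed contract:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class SignResult:
    """Product-facing response fields; names are suggestions only."""
    label: str
    confidence: float
    handedness: Optional[str] = None
    collecting: bool = False  # "still collecting frames" status

def to_result(probs: Dict[str, float],
              handedness: Optional[str] = None) -> SignResult:
    """Translate raw class probabilities into product-friendly fields,
    so the frontend never sees model internals like logits."""
    label = max(probs, key=probs.get)
    return SignResult(label=label, confidence=probs[label],
                      handedness=handedness)

r = to_result({"A": 0.1, "B": 0.8, "C": 0.1}, handedness="Right")
```

In this repo the equivalent shape would live in the backend's Pydantic models and flow into `docs/openapi.yaml`, but the separation of concerns is the same.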

## When To Use YOLO

`YOLO` is useful when you need detection, such as:

- multiple people in frame
- signer localization in a wide camera view
- hand or person detection before a second-stage recognizer

It is usually not my first recommendation for a single-user webcam sign demo because:

- you still need recognition after detection
- landmarks are often a better representation for sign tasks
- it adds training and inference complexity early

## When To Use A Hosted Model

A hosted model can be useful for:

- quick experiments
- low-ops prototypes
- testing ideas before local deployment

But for sign-language interaction, local inference is often better because of:

- lower latency
- lower recurring cost
- better privacy
- fewer network dependencies during demos

## Suggested Build Order

1. `MVP`
   Add a `sign-static` backend pipeline using hand landmarks and a small classifier.
2. `Webcam loop`
   Reuse the current webcam page and submit captured frames to the same inference endpoint.
3. `Temporal model`
   Add a second pipeline for dynamic signs using short frame sequences.
4. `Contract refinement`
   Expand the API only when the frontend truly needs more than label, confidence, and review metadata.

## Simple Decision Guide

- If you want a fast hackathon demo: `MediaPipe Hand Landmarker + small classifier`
- If you want real-time local inference: `PyTorch -> ONNX -> ONNX Runtime`
- If you want broader sign understanding: `MediaPipe Holistic + sequence model`
- If you need person or hand detection in messy scenes: add `YOLO` as a helper, not the whole solution

## Official References

- MediaPipe Hand Landmarker: <https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker>
- MediaPipe Gesture Recognizer: <https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer>
- MediaPipe Gesture customization: <https://ai.google.dev/edge/mediapipe/solutions/customization/gesture_recognizer>
- MediaPipe Holistic Landmarker: <https://ai.google.dev/edge/mediapipe/solutions/vision/holistic_landmarker>
- ONNX Runtime docs: <https://onnxruntime.ai/docs/>
- Ultralytics YOLO docs: <https://docs.ultralytics.com/>