janhilgard/README.md

Jan Hilgard — end-to-end AI builder (need → infra), ex-founder (Hosting90 exit)

I build AI products end-to-end — from the business need, to the agentic flows that run them, down to the infrastructure underneath: inference, proxies, data. When the economics demand it I go all the way down to the metal — that's where the inference work below comes from.

Previously CTO at Miranda Media; before that I founded and ran Hosting90 (Czech hosting/cloud) for 18 years, exiting in 2020. That background in running infrastructure at scale informs how I think about LLM serving: reliability, latency, and resource efficiency over benchmark headlines. Open to fractional-CTO and advisory work.

What I'm working on

  • vllm-mlx — Core contributor to vllm-mlx (80+ PRs; second external contributor by join order), an OpenAI-compatible inference server built on MLX for Apple Silicon. Specific contributions:
      • MTP speculative decoding for Qwen3-Next (PR #82, merged) — Multi-Token Prediction with an always-advance strategy and rejection sampling; 1.43x verified / 1.76x optimistic speedup on M3 Ultra
      • Draft-model speculative decoding (PR #45, merged) — HybridEngine that shares a single model instance across speculative and batched modes; 1.2–1.4x throughput improvement
      • Prefix caching, KV cache quantization, Anthropic Messages API integration, MoE model support
      • Assigned collaborator on the MTP roadmap for the Qwen3-Next / MiMo / Qwen3.5 family, alongside the repo owner
  • vllm-mlx-dashboard — Real-time monitoring dashboard for local LLM inference servers (llama.cpp + vllm-mlx), built with Next.js.
  • Data & access infrastructure — self-built residential LTE proxy pool (9 modems, CGNAT rotation) feeding production scraping for AI products. Writeup: https://hilgard.cz/writing/lte-proxy-pool
  • M3 Ultra (256 GB) benchmarking — sustained throughput, quantization tradeoffs, batch size effects at the edge of consumer hardware.
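
For readers unfamiliar with the rejection-sampling step mentioned under speculative decoding above, here is a minimal sketch of the textbook acceptance rule. It is not the code from PR #82 or PR #45, and it does not model the always-advance strategy. The function name `accept_or_resample` and the list-of-probabilities interface are hypothetical, chosen only for illustration. The idea: a drafted token is kept with probability min(1, p/q), where p is the target model's probability and q is the draft's; on rejection, a replacement is drawn from the normalized residual max(0, p - q), so the output still follows the target distribution.

```python
import random


def accept_or_resample(token, p_target, q_draft, rng=random):
    """Verify one drafted token (hypothetical helper, textbook rule).

    p_target / q_draft: probability lists over the same vocabulary,
    from the target model and the draft model respectively.
    Returns (accepted, token_to_emit).
    """
    p, q = p_target[token], q_draft[token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case (p == q everywhere): fall back to the target.
        residual = list(p_target)
    new_token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, new_token


if __name__ == "__main__":
    rng = random.Random(0)
    p, q = [0.7, 0.3], [0.3, 0.7]
    counts = [0, 0]
    for _ in range(20000):
        drafted = rng.choices([0, 1], weights=q, k=1)[0]
        _, emitted = accept_or_resample(drafted, p, q, rng)
        counts[emitted] += 1
    # Emitted tokens should follow the target distribution p, not q.
    print([c / 20000 for c in counts])
```

The closer the draft distribution is to the target, the more drafted tokens are accepted, which is where speedups of the kind quoted above come from.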

Background

  • Now: independent — AI infrastructure builder, open to fractional-CTO & advisory
  • 2026: CTO, Miranda Media (built their AI products)
  • 2002–2020: Founder & CEO, Hosting90 (Czech hosting/cloud, exited)
  • Focus areas: inference optimization, agentic systems, local LLM deployment, data/access infrastructure

Tech I work with

Python · TypeScript · MLX · vLLM · Next.js · Apple Silicon · llama.cpp · KV cache quantization

Reach me

Pinned

  1. vllm-mlx-dashboard (Public)

    Real-time monitoring dashboard for local LLM inference servers (vllm-mlx + llama.cpp) — built with Next.js

    TypeScript 5

  2. vllm-mlx (Public)

    Forked from waybarrios/vllm-mlx

    OpenAI-compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ …

    Python