Senior Cloud-Native & AIOps Engineer with 10+ years architecting, automating, and scaling distributed ML/AI systems across AWS, Azure, and GCP. Currently at Apple (AI/ML Infrastructure), building production-grade ML platforms serving millions of users.
Uptime: 99.95% | MTTR: ↓73% | Cost: ↓25% | Scale: Petabyte+
- Build scalable ML infrastructure with Kubernetes (EKS/GKE)
- Design observability systems (OpenTelemetry, Datadog, Grafana)
- Optimize cloud costs through automation & right-sizing
- Implement Zero Trust security & compliance (SOC2, HIPAA, GDPR)
- Mentor teams on MLOps, SRE, and production best practices
AI/ML Infrastructure
- Multi-cloud Kubernetes (EKS/GKE/AKS) • GPU/TPU optimization • Model serving • Feature stores • Real-time inference
Observability & SRE
- OpenTelemetry • SLI/SLO monitoring • 73% MTTR reduction • Incident response • Chaos engineering
Cost Optimization
- 25% cost reduction • Cluster right-sizing • FinOps • Resource automation • Multi-cloud management
Security & Compliance
- Zero Trust • IAM/RBAC • Vault • SOC2/HIPAA/GDPR • Policy-as-code
| 99.95% Uptime | Maintained across distributed AI/ML platforms |
| 73% MTTR Reduction | Through unified observability & automation |
| 25% Cost Savings | Via intelligent optimization & right-sizing |
| Zero-Downtime | Petabyte-scale migrations with no interruption |
| 60+ Incidents | Led critical production incident resolutions |
- AWS Cloud Practitioner
- Certified Kubernetes Application Developer (CKAD)
- HashiCorp Terraform Associate
- AI & Machine Learning for Business
- AWS: Design and Implement Systems
I'm passionate about MLOps architecture, multi-cloud Kubernetes, observability, and production ML systems. Always happy to discuss SRE best practices, cost optimization strategies, and building reliable infrastructure at scale.

