Machine Learning (Ops) Engineer
Newbridge Alliance · Singapore
Our client's ML Platform team enables 100+ ML scientists and engineers to train, deploy, and monitor models that serve 10M+ QPS across recommendation, search, ads, and GenAI products. The platform powers e-commerce and content experiences similar to TikTok Shop, with a focus on reliability, speed, and developer velocity. They treat ML infrastructure as a product and operate at the scale of major social-commerce platforms.

The Role
We are hiring an MLOps Engineer to build and scale the core ML platform used by all ML teams. You will own systems for training, serving, experimentation, and monitoring. Your work directly impacts how fast these teams can ship new models to production and how reliably they serve millions of users.

What You’ll Do
- Model Serving: Build and operate low-latency, high-throughput online inference services for deep learning models and LLMs. Optimize with vLLM, Triton, TensorRT, GPU scheduling, and autoscaling.
- Training Infrastructure: Scale distributed training on GPU clusters using Kubernetes, Ray, DeepSpeed, or Megatron. Improve job scheduling, checkpointing, and resource utilization.
- ML Platform Products: Develop internal tools for the full ML lifecycle: feature store, model registry, experiment tracking, workflow orchestration, and CI/CD for ML.
- GenAI Infra: Build infrastructure for LLM fine-tuning, RAG evaluation, vector database management, and cost/latency monitoring for GenAI workloads.
- Data & Feature Platform: Maintain real-time and batch feature pipelines. Ensure data quality, lineage, and SLAs for Spark, Flink, and Kafka jobs.
- Observability: Implement monitoring, alerting, and debugging tools for model performance, data drift, training failures, and online serving.
- Developer Experience: Reduce friction for ML teams. Provide SDKs, CLI tools, and documentation. Run internal office hours and gather requirements.
- Reliability: Own SLOs for critical ML services. Lead incident response and postmortems. Drive capacity planning and cost optimization.

Minimum Qualifications
- Education: BS/MS in Computer Science, Engineering, or a related field.
- Experience: Background in software engineering, DevOps, or ML engineering, with 3+ years building ML infrastructure or platform services.
- Programming: Strong proficiency in Python, Go, or Java. Solid understanding of software design, testing, and distributed systems.
- Cloud & Containers: Production experience with Kubernetes, Docker, and AWS/GCP/Azure. Familiarity with Terraform or other infrastructure-as-code tools.
- ML Systems: Understanding of ML workflows. Experience with at least one of: model serving, distributed training, feature stores, or workflow orchestrators like Airflow/Kubeflow.
- Data Systems: Experience with Spark, Kafka, or similar large-scale data tools.
- Problem Solving: Ability to debug complex systems across ML, data, and infra layers.

Preferred Qualifications
- Built ML platforms supporting 50+ ML engineers or 100+ models in production.
- Deep expertise in GPU inference optimization: batching, quantization, CUDA, vLLM, Triton Inference Server.
- Experience with LLM infra: fine-tuning pipelines, vector DBs like Milvus/Weaviate, prompt/version management.
- Knowledge of ML framework internals: PyTorch, TensorFlow, JAX.
- Experience with Ray, Kubeflow, MLflow, Feast, or Tecton.
- Background in high-QPS online services, SRE, or performance engineering.
- Contributions to open-source ML infra projects.