AI Platform Engineer

Kuok Singapore · Singapore

Sector
AI
Function
Product & Engineering
Level
Mid-Level
Employment type
Full Time
Posted
2026-06-04
Source
mycareersfuture

About the Role

We are seeking a passionate AI Platform Engineer to build and own the infrastructure layer that every AI use case in Kuok Group runs on: the LLM gateway, the deployment platform, CI/CD pipelines, model serving, observability, cost controls, and the eval pipeline infrastructure, end to end. This role reports to the Principal AI Architect.

This is a T-shaped role: broad cloud and DevOps foundations, with deep specialism in LLM infrastructure. The ideal candidate is as comfortable provisioning environments and managing release pipelines as they are configuring a model gateway, wiring up LangSmith traces, and building an eval harness.

Working closely with the Head, AI Platform on architecture direction and with the LLM Ops / MLOps Engineer on the observability and eval layer, this person will be the backbone of the platform that Applied AI Engineers depend on to ship confidently and at pace.

Key Responsibilities

Deployment Platform & CI/CD
- Design, build, and maintain CI/CD pipelines for all AI use cases, from code commit through staging to production, with automated release gates and rollback capability
- Own environment provisioning and infra-as-code (Terraform or equivalent): staging, UAT, and production environments should be reproducible, version-controlled, and auditable
- Manage the deployment platform end to end: release scheduling, environment promotion, incident response, and post-deployment validation
- Champion good deployment hygiene: automated pipelines, version-controlled configuration, and documented environment differences as standard practice

LLM Gateway & Model Serving
- Build and operate the LLM gateway layer (LiteLLM or equivalent): API access controls, rate limiting, model routing, and failover across Azure-backed endpoints
- Manage model serving configuration: endpoint management, load balancing, latency SLOs, and model switching without disrupting live use cases
- Own secrets and access management for all model API credentials and service accounts across environments
- Maintain a prompt and model version registry so that every production use case can be traced to a specific model version and prompt configuration

Observability, Cost & Controls
- Instrument all deployed use cases with LLM observability tooling (LangSmith or equivalent): traces, latency, token counts, and error rates as standard
- Build and maintain cost telemetry dashboards: per-use-case token consumption, compute spend, and alerting on cost anomalies
- Implement and maintain token budget controls and rate limits across business units (BUs); keeping cost visible and predictable is a shared responsibility that starts at the platform layer
- Own general platform monitoring and reliability: uptime, alerting, on-call runbooks, and incident response for platform-layer issues

Eval Pipeline Infrastructure
- Build the infrastructure layer for LLM evaluation pipelines: test harnesses, regression runners, and LLM-as-judge scaffolding used by Applied AI Engineers per use case
- Work with the LLM Ops / MLOps Engineer on eval pipeline design
- Ensure eval pipeline runs are logged, versioned, and traceable; eval results should be reproducible
- Support evals as a consistent deployment gate, working with the team to ensure every use case has a passing eval run on the current model version before moving to production

Standards & Collaboration
- Maintain platform documentation (architecture diagrams, runbooks, environment specs, and onboarding guides) so institutional knowledge is shared and accessible across the team
- Work within the Head, AI Platform's engineering standards: all platform changes go through code review before deployment
- Support the QA / Dev Engineers (Applied AI cluster) on integration and regression testing where it touches the platform layer
- Proactively surface platform-layer risks and capacity constraints to the Head, AI Platform

Requirements

Must-Have
- Solid cloud and DevOps engineering foundations: you have built and operated CI/CD pipelines, managed environments with IaC, and handled production deployments and rollbacks on at least one major cloud platform (Azure, AWS, or GCP); comfortable working across Linux and Windows Server, and familiar with core networking concepts (VPC/VNET, DNS, firewalls, and load balancers)
- Hands-on experience with LLM infrastructure: you have configured and operated a model gateway or API proxy layer, managed multi-model routing, and dealt with rate limits and failover in a live environment
- LLM observability experience: you have instrumented production AI systems with tracing and monitoring tooling and used the data to diagnose issues
- Cost telemetry and token controls: you understand how LLM API costs are structured and have built or operated dashboards and controls to keep spend visible and bounded
- Strong Python skills and comfort with the full LLM deployment tooling ecosystem; equally at home in application code and infrastructure configuration
- Strong appreciation for documentation and configuration management: environments as code, clear runbooks, and written context that helps the team move faster together

Strong Advantage
- Experience with eval pipeline infrastructure: test harness design, regression frameworks, LLM-as-judge scaffolding, or automated output quality checks
- Security and access management experience in an AI context: IAM, RBAC, secrets management, API credential rotation, encryption at rest and in transit, and least-privilege access design for model-serving environments
- Familiarity with MLOps practices: model versioning, A/B traffic splitting, canary deployments for model updates
- Experience supporting engineering teams as a platform provider: you understand that your internal customers are the engineers shipping use cases, and you design for their velocity as well as for reliability
- Exposure to enterprise multi-tenant environments: managing shared infrastructure across multiple teams or business units with different access and cost boundaries; familiarity with virtualisation platforms (VMware, Hyper-V, or Nutanix) is a plus

AI, Dashboards, Design, Terraform, Endpoint Management, Kubernetes, CloudWatch, Azure