AI Engineer (ML Systems & Infrastructure)

Swapetech · Singapore

Sector: AI
Function: Product & Engineering
Level: Mid-Level
Employment type: Full Time
Posted: 2026-06-22
Source: mycareersfuture

About the Role

We are looking for exceptional AI Engineers to build the next generation of AI infrastructure and Machine Learning Systems (MLSys).

This role focuses on large-scale system infrastructure rather than model research. You will work on the core foundations that power large-scale AI training and inference systems, including Kubernetes cluster management, RDMA networking, unified KV Cache architecture, observability platforms, distributed systems, GPU orchestration, and CUDA kernel optimisation.

You will collaborate closely with AI researchers, infrastructure architects, networking engineers, and platform teams to maximize the efficiency, scalability, and reliability of AI systems.

Key Responsibilities

AI Infrastructure & Kubernetes
- Design, deploy, and operate large-scale Kubernetes-based AI infrastructure.
- Develop cluster governance frameworks, scheduling policies, resource isolation, and multi-tenancy capabilities.
- Build and optimize GPU orchestration platforms using Kubernetes, Slurm, Volcano, Kueue, Ray, and related technologies.
- Improve cluster utilization, reliability, elasticity, and operational efficiency.

RDMA & High-Performance Networking
- Design and optimize RDMA, InfiniBand, RoCE, and high-speed Ethernet fabrics for distributed AI workloads.
- Optimize GPU-to-GPU and GPU-to-NIC communication paths.
- Improve distributed communication efficiency for large-scale training and inference.
- Analyze and eliminate networking bottlenecks across AI clusters.

Unified KV Cache & Distributed Memory Systems
- Design and implement unified KV Cache architecture across:
  - GPU HBM
  - CPU Memory
  - RDMA-accessible Memory
  - NVMe SSD
  - Distributed Storage
- Develop efficient KV Cache sharing, migration, offloading, and scheduling mechanisms.
- Optimize latency and throughput for large-scale inference systems.

CUDA & System Performance Optimisation
- Develop and optimize CUDA kernels for training and inference workloads.
- Profile and optimize GPU compute, memory, communication, and scheduling efficiency.
- Contribute to low-level optimization of AI frameworks and inference engines.
- Work on technologies such as FlashAttention, TensorRT, Triton, NCCL, CUTLASS, and custom operators.

Observability & Reliability
- Build end-to-end observability platforms for AI infrastructure.
- Design monitoring, logging, tracing, alerting, and troubleshooting frameworks.
- Develop performance dashboards and SLO-driven operational systems.
- Improve maintainability, debuggability, and operational excellence of AI platforms.

Automation & Platform Engineering
- Build automation tools for deployment, provisioning, monitoring, and operations.
- Develop Infrastructure-as-Code (IaC) solutions using Terraform, Ansible, and related tools.
- Build CI/CD pipelines and engineering productivity platforms.
- Improve platform scalability and operational efficiency.

Required Qualifications

Education
- Bachelor's degree or above in Computer Science, Software Engineering, Electrical Engineering, or related fields.

Technical Skills
- Strong software engineering and programming skills.
- Excellent system design capability and strong engineering craftsmanship.
- Strong coding standards and code quality awareness.
- Strong sense of ownership, accountability, and execution.

System Fundamentals
Strong understanding of:
- Operating Systems
- Computer Networks
- Distributed Systems
- Data Structures and Algorithms
- Linux Internals

Programming Languages
Proficiency in one or more of:
- C++
- Go
- Python
- Rust

AI Infrastructure Experience
Hands-on experience in one or more of:
- Kubernetes
- GPU Infrastructure
- Distributed Systems
- AI Infrastructure
- HPC (High Performance Computing)
- Cloud-Native Platforms

Networking Experience
Experience with:
- RDMA
- InfiniBand
- RoCE/RoCEv2
- GPUDirect
- NCCL
- UCX
- High-Speed Ethernet

GPU & Performance Engineering
Experience with:
- CUDA
- GPU Performance Optimization
- Multi-GPU Systems
- Distributed Training
- Distributed Inference

Preferred Qualifications
- Experience building large-scale AI training or inference clusters.
- Experience with vLLM, SGLang, TensorRT-LLM, Triton, DeepSpeed, Megatron-LM, Ray, or similar frameworks.
- Experience with unified KV Cache systems, memory hierarchy optimisation, or distributed storage systems.
- Experience with Kubernetes GPU Operator and NVIDIA Network Operator.
- Experience with Prometheus, Grafana, Loki, OpenTelemetry, and observability platforms.
- Experience contributing to open-source projects such as vLLM, FlashAttention, CUTLASS, TVM, MLIR, Triton, Kubernetes, and NCCL.
- Experience working across AI Infrastructure, HPC, Networking, and Silicon Systems is highly desirable.
