Senior AI Infrastructure & Networking Engineer
Genesis Networks · Singapore
We are seeking an expert Senior AI Infrastructure & Networking Engineer to lead the architecture, deployment, and optimization of our next-generation AI Factory. In this role, you will be responsible for building and scaling high-density GPU supercomputing clusters (up to 512+ nodes) featuring NVIDIA Blackwell Ultra B300 systems. You will bridge the gap between heavy physical infrastructure (liquid cooling/busbar power) and advanced logical fabrics, ensuring predictable, line-rate, and lossless transport for massive generative AI training and reasoning workloads.

Key Responsibilities

AI Fabric Architecture & Deployment: Design, build, and optimize high-throughput, ultra-low-latency East-West compute networks using NVIDIA Spectrum-X Ethernet platforms (Spectrum-4 ASICs) and/or NVIDIA Quantum-X800 InfiniBand switching.

Performance Tuning for Lossless Networking: Configure and fine-tune critical Layer 2/3 lossless transport mechanisms, including Remote Direct Memory Access over Converged Ethernet (RoCE v2), Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and DCQCN.

Rail-Optimized Topologies: Implement and maintain non-blocking, multi-plane, full fat-tree network topologies mapped to 8-GPU server architectures to maximize collective communication performance via NCCL (NVIDIA Collective Communications Library).

SmartNIC & DPU Management: Deploy and manage high-speed compute network interfaces, including ConnectX-8 SuperNICs (800 Gb/s) and BlueField-3 DPUs for isolated infrastructure management, storage acceleration, and multi-tenant security.

Full-Stack Orchestration & Automation: Drive infrastructure-as-code deployments using Ansible and Terraform. Initialize and monitor the NVIDIA Network Operator within core Kubernetes orchestration layers.

Telemetry & Validation: Utilize deep network telemetry tools such as NVIDIA NetQ and "What Just Happened" (WJH) to stream real-time switch diagnostics. Conduct line-rate cluster benchmarking using ib_write_bw and ib_write_lat to eliminate physical layer bottlenecks.

Cross-Functional Infrastructure Alignment: Collaborate closely with data center facility teams on high-density environment metrics (~15–20 kW+ per rack, liquid-cooled rows, Coolant Distribution Units (CDUs), and Rear Door Heat Exchangers). Ensure operational verification aligns with international standards (e.g., IDCA G-Grade or Uptime Institute).

Required Technical Skills & Qualifications

Education: Bachelor’s or Master’s degree in Computer Science, Network Engineering, Systems Engineering, or a related technical discipline.

AI Networking Expertise: Proven track record of configuring RoCE v2, adaptive routing, and traffic optimization specifically for machine learning/HPC workloads.

Hardware Familiarity: Deep understanding of high-density scale-up and scale-out systems (NVIDIA HGX/DGX architectures, PCIe switching, OSFP/QSFP112 optical and copper assemblies).

Software & Cluster Management: Experience with cluster deployment suites like NVIDIA Mission Control, Base Command Manager, Run:ai, or similar enterprise MLOps frameworks.

Routing Protocols: Strong proficiency with advanced datacenter networking protocols, particularly eBGP IPv6 unnumbered underlays and EVPN/VXLAN overlays for multi-tenant isolation.

Cabling & Layer 1 Validation: Experience managing complex structured fiber trunking (MPO-12/MPO-24 APC) and executing layer-1 diagnostics (ibdiagnet, iblinkinfo).

Preferred Certifications

NVIDIA Certified Professional - AI Networking (NCP-AIN) (Highly Preferred)

NVIDIA Certified Expert - Cloud End-to-End Fabric (NCE-CEF)

Advanced networking tracks from major vendors (e.g., CCIE, JNCIE, or Nokia Service Routing Architect) combined with proven data center fabric experience.

What We Offer

Opportunity to work with first-of-its-kind, world-class AI supercomputing technologies (NVIDIA Blackwell Ultra).

High-impact role shaping the foundational architecture for enterprise generative AI and large-scale LLM initiatives.

Competitive salary, comprehensive benefits package, and continuous learning paths for advanced AI operations certifications.