AI Training Infras Engineer

Fragment Works · Singapore

Sector: AI
Function: Product & Engineering
Level: Mid-Level
Employment type: Full Time
Posted: 2026-06-09
Source: mycareersfuture

About Fragment Works and Moment AI

FRAGMENT WORKS PTE. LTD. is a Singapore-incorporated technology company operating the Moment AI platform at https://momentvideo.ai/.

Moment AI is a leading VC-backed video AI foundation model company developing advanced video AI technologies and business-to-business infrastructure for enterprise customers. The company focuses on building scalable, production-grade video AI systems that support real-world commercial applications across video generation, video understanding, model deployment, and model-sales operations.

Through API-based capabilities, Moment AI serves companies in the short-video, creator economy, advertising, media, and AI application sectors. Its platform is designed to help enterprises integrate video AI into their products and workflows efficiently, enabling automated content creation, intelligent video analysis, and next-generation AI-powered media experiences.

Moment AI is founded by a team of video industry veterans with deep operational and technical experience in scaling short-video platforms to millions of users. The founding team brings together battle-tested entrepreneurs, infrastructure builders, and AI researchers with hands-on experience across video platforms, creator ecosystems, high-concurrency video infrastructure, and multimodal AI.

About the Role

We are building our own foundation model for video generation, based on DiT and Flow Matching architectures. We are looking for a Training Infrastructure Engineer who can turn cutting-edge research code into a stable, scalable, and high-throughput training system running on large-scale GPU clusters.

This role is ideal for an engineer who enjoys solving deep systems problems at the intersection of distributed training, CUDA performance, video data pipelines, model training stability, and large-scale ML infrastructure.
You will work closely with researchers and platform engineers to ensure that our video generation training stack can reliably produce results at the thousand-GPU scale.

Key Responsibilities

You will design, optimise, and maintain large-scale distributed training systems for video generation foundation models. This includes implementing and improving training strategies such as FSDP, tensor parallelism, context parallelism, and Ulysses-style sequence parallelism, with a strong focus on improving throughput, scaling efficiency, and MFU.

You will build and optimise PB-scale video data pipelines, including NVDEC-based video decoding, VAE latent caching, variable-resolution bucket sampling, and efficient data loading for high-throughput model training.

You will work on memory and performance optimisation across the training stack, including FlashAttention, FP8 mixed precision, Triton kernels, CUDA-aware profiling, activation checkpointing strategies, and communication-computation overlap.

You will also be responsible for training stability and reliability.
This includes identifying the root causes of loss spikes, divergence, slow nodes, communication bottlenecks, checkpoint failures, and data-related instability, as well as designing mechanisms for fast checkpoint recovery and automatic exclusion of problematic nodes.

Requirements

The ideal candidate has strong hands-on experience with PyTorch distributed training and a solid understanding of CUDA architecture, GPU memory hierarchy, NCCL communication, and performance profiling.

You should have source-level familiarity with at least one major large-scale training framework, such as Megatron-LM, DeepSpeed, PyTorch FSDP, or TorchTitan, and be comfortable reading, modifying, and debugging framework internals.

You should have at least one year of practical experience training models on large GPU clusters of 256 GPUs or more, with proven experience in debugging distributed training failures and improving system-level training efficiency.

Strong candidates will be able to reason across the full training stack, from data ingestion and model parallelism to kernel-level optimisation and fault-tolerant training operations.

Preferred Qualifications

Experience with DiT, diffusion models, Flow Matching, or video generation models would be highly advantageous.

Experience processing large-scale video datasets, building video decoding pipelines, or working with VAE latent caching systems would be a strong plus.

Hands-on experience writing or optimising Triton, CUDA, or CUTLASS kernels would be valuable.

Familiarity with open-source video generation projects such as HunyuanVideo, Wan, CogVideoX, or similar systems at the source-code level would also be beneficial.

What We Offer

You will work with real large-scale compute resources, including access to thousand-GPU-level training infrastructure.
You will join a team that treats large-scale model training as a rigorous engineering discipline, not just a research experiment.

We provide an environment where engineers can work on technically meaningful infrastructure problems, collaborate closely with frontier model researchers, and contribute to open-source or publication work where appropriate and within applicable compliance boundaries.

Employment Practices

We are committed to fair and merit-based hiring. All candidates will be assessed based on job-related skills, experience, and ability to perform the role. We welcome applications from qualified candidates and do not discriminate on the basis of age, race, gender, religion, marital status, family responsibilities, disability, or other non-job-related characteristics.


AI Optimization Caché Training Design GPU Artificial Intelligence Throughput CUDA