SRE Leader
Bybit · Malaysia
About Us
Established in March 2018, Bybit is one of the fastest growing cryptocurrency derivatives exchanges, with more than 70 million registered users. We offer a professional platform where crypto traders can find an ultra-fast matching engine, excellent customer service and multilingual community support. We provide innovative online spot and derivatives trading services, mining and staking products, as well as API support, to retail and institutional clients around the world, and strive to be the most reliable exchange for the emerging digital asset class. Our core values define us. We listen, care, and improve to create a faster, fairer, and more humane trading environment for our users. Our innovative, highly advanced, user-friendly platform has been designed from the ground up using best-in-class infrastructure to provide our users with the industry's safest, fastest, fairest, and most transparent trading experience. Built on customer-centric values, we endeavour to provide professional, 24/7 multilingual customer support that helps in a timely manner. As of today, Bybit is one of the most trusted, reliable, and transparent cryptocurrency derivatives platforms in the space.
Core responsibilities
Building the reliability engineering system
Establish a company-wide SLO/SLA system: define quantifiable reliability indicators (availability, latency, error rate) for each line of business, and use error budgets to drive release cadence and investment decisions
Build an MTTD/MTTR measurement system, set severity-tiered targets, and optimize continuously: P-1 target MTTD
Build fault self-healing capabilities: an automated fault detection → diagnosis → recovery pipeline that reduces reliance on manual intervention
Promote chaos engineering: run regular fault drills to proactively uncover weak points in the system
Establish a change risk control system: standardized canary releases, pre-change impact assessment, automatic rollback
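As a rough illustration of how an error budget can gate the pace of change, here is a minimal Python sketch. The function names, the 25% "restricted" threshold, and the three policy labels are assumptions made for the example, not Bybit's actual policy.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so no budget consumed
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def change_policy(remaining: float) -> str:
    """Map remaining budget to a release policy (thresholds are hypothetical)."""
    if remaining <= 0.0:
        return "freeze"      # budget exhausted: reliability work only
    if remaining < 0.25:
        return "restricted"  # low budget: small, well-tested changes only
    return "normal"


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.2f}", change_policy(remaining))
```

With a 99.9% SLO over one million requests, 250 failures leave roughly three quarters of the budget, so changes proceed at the normal pace in this sketch.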
Cost Governance System (Key Points)
Build a data-driven, closed-loop cost governance process, automated end to end: cost visibility → attribution analysis → optimization decisions → execution and verification → continuous monitoring
Establish a scientific capacity planning model based on the correlation between business indicators (QPS/TPS/number of users) and resource consumption, rather than gut-feel N-fold over-reservation
Drive adoption of a FinOps culture:
Cost chargeback and showback by line of business and application
Define cost efficiency metrics ($/transaction, $/user, $/QPS) and benchmark them against the industry
Embed cost assessment into the resource request process so that 100% of new resources undergo capacity assessment
Automated cost optimization engine:
Automatic identification of underutilized resources with downsizing recommendations (AI-based anomaly detection and prediction models)
Automated purchase decision system for Reserved Instances/Savings Plans
Optimized auto-scaling strategies: pre-scaling based on predictive models to reduce over-reservation
Automatic reclamation and lifecycle management of idle resources
Goal: reduce annual cloud cost by 15-20% without affecting business SLOs.
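The capacity planning model mentioned above, which correlates business indicators with resource consumption, could be sketched as follows. This is a minimal illustration assuming a simple linear relationship between QPS and CPU cores; the function names, the 30% headroom, and the sample figures are hypothetical.

```python
import math


def fit_linear(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def required_instances(forecast_qps: float, slope: float, intercept: float,
                       cores_per_instance: int, headroom: float = 0.3) -> int:
    """Instances needed for a forecast load, with proportional headroom."""
    cores = (slope * forecast_qps + intercept) * (1.0 + headroom)
    return max(1, math.ceil(cores / cores_per_instance))


if __name__ == "__main__":
    # Hypothetical history: peak QPS vs. CPU cores consumed.
    qps = [1000.0, 2000.0, 3000.0, 4000.0]
    cores = [12.0, 22.0, 32.0, 42.0]
    slope, intercept = fit_linear(qps, cores)
    print(required_instances(10_000, slope, intercept, cores_per_instance=16))
```

Real workloads are rarely this linear, so a production model would likely need per-service fits and validation against observed peaks. The point of the sketch is only that the reservation follows from measured data rather than from a fixed multiplier.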
Automated Operations (Key Focus)
Toil elimination system: measure team toil ratio (target
Full GitOps/IaC adoption:
100% of infrastructure defined as code, with all changes going through PR review and automated pipelines
Environment consistency: use IaC to detect and automatically remediate configuration drift across dev/staging/prod
AIOps build-out:
AI-based alert aggregation, root cause analysis, and remediation suggestions
Automatic detection of log/metric anomalies, moving from reactive alerting to proactive discovery
Knowledge-base AI: query operational status and execute standard operations in natural language
Self-service platform build-out:
Business teams can complete more than 80% of routine operations (scaling up and down, configuration changes, permission requests) by themselves
Target: more than 60% of operations tickets handled automatically
On-call system optimization:
Alert accuracy > 95% (eliminating alert fatigue)
Build automated runbook execution capability
On-call quality measurement and continuous improvement
Financial-Grade Cloud Isolation and Multi-Site Compliance Deployment (Key Focus)
Design and operation of a financial-grade network isolation architecture:
Design and implement network isolation strategies across multiple accounts, VPCs, and regions
Standardized management of security groups, endpoints, and dedicated connections across compliance sites
Zero Trust network architecture rollout: micro-segmentation, least privilege, dynamic access control
Rapid build-out of compliance sites:
Goal: cut infrastructure deployment time for a new compliance site from weeks to hours (fully automated)
Standardized compliance site templates: one-click delivery of network topology, security policy, middleware, and monitoring
Automated inter-site isolation verification: regular automated scans ensure no cross-site data leakage
Multi-cloud and multi-region operations:
Unified operations abstraction layer across AWS/Tencent Cloud/Huawei Cloud that hides underlying differences
Cross-region disaster recovery architecture design: RPO/RTO definition and drill-based verification
Data sovereignty guarantees for independently deployed compliance sites (data residency, encryption, audit)
Financial-grade assurance for the core wallet/trading path:
Operations assurance for the cold/hot wallet isolation architecture
Zero-downtime change capability for the trading path
Multi-active/disaster recovery failover SOPs and periodic drills
Team Building and Talent Cultivation
Drive the team's transformation from "traditional operations" to "Site Reliability Engineering": solving operations problems with engineering methods
Establish an SRE competency model and growth path: the abilities expected at each level from P5 to P7 and how to measure them
Establish knowledge capture and sharing mechanisms: runbooks, post-mortem culture, internal tech talks
Eliminate single-person dependencies: at least 2 people can independently handle each core system
Talent pipeline: develop 2-3 senior SREs who can independently own reliability for a line of business
Job requirements
Required conditions
More than 10 years of experience in infrastructure/operations/SRE, including more than 5 years leading an SRE/infrastructure team of more than 10 people
Deep understanding of SRE methodology: SLO/SLI/Error Budget, Toil Management, Capacity Planning, and Incident Management as hands-on practice, not just concepts
Hands-on experience with large-scale cost management:
Has managed environments with annual cloud spend above $5 million
Systematic FinOps experience (data-driven cost optimization, not cutting resources on gut feel)
Capable of capacity modeling: able to predict resource requirements based on business metrics
In-depth experience with automated operations:
Successful cases of reducing toil from > 50% to
Proficient in IaC tools (Terraform/Pulumi/CloudFormation) and experienced in large-scale implementation
Experience exploring and implementing AIOps or intelligent operations
Operations experience in financial-grade/compliance environments:
Infrastructure operations experience in the financial industry (banks, exchanges, payments) or in environments with equivalent security requirements
Familiar with multi-account/multi-VPC network isolation architecture design
Experience with independent deployment and operations across multiple regions and compliance sites
Understanding of the infrastructure requirements of data sovereignty and of compliance frameworks such as PCI-DSS and SOC2
Multi-cloud experience: AWS (required) + at least one other cloud (Tencent Cloud/GCP/Azure)
Programming ability: able to build operations tools and automation systems in Go/Python (building systems, not just writing scripts).
Bonus points
SRE management experience in cryptocurrency exchanges, traditional securities firms, or payment companies
Experience operating Kubernetes at scale (100+ clusters/10,000+ nodes)
Familiar with high-availability architectures for trading systems (primary/standby failover, multi-active deployment, zero-downtime release)
Experience in building internal cost platforms or FinOps tools
Hands-on chaos engineering experience (Chaos Monkey/Litmus/in-house tooling)
Has participated in infrastructure preparation for compliance audits such as SOC2/ISO27001/PCI-DSS
The leadership traits we value
- Engineering thinking: when facing an operations problem, the first reaction is "how do we make the system prevent this class of problem" rather than "be careful next time".
- Data-driven: all decisions are based on metrics; "feels okay" and "it has always been like this" are not accepted.
- Internalized cost awareness: not passively running cost optimization projects, but building cost efficiency into everyday architectural decisions.
- Scale thinking: when designing a solution, ask "If the number of compliance sites increases from 10 to 30, will this still work?"
- Talent developer: able to develop "conventional" engineers into independent SRE experts, with method, patience, and standards.
Why Join Us
At Bybit, we are committed to fostering a supportive and enriching work environment. Our benefits include:
- Study Growth Fund: We support your professional development and continuous learning.
- Internal Events: Participate in regular team-building activities, workshops, and events designed to promote collaboration and innovation.
- Global Collaboration: Be part of a diverse, international team, working alongside colleagues from around the world.
- Career Advancement: Access opportunities for growth and advancement within a rapidly expanding global company.
- Internal Mobility: Grow with us. Your long-term development is important to us. We offer internal job opportunities to help build your career path.