Site Reliability Engineer - USDS
Site Reliability Engineering (SRE) at TikTok combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. On our team, you'll have the opportunity to manage the complex challenges of scale while applying expertise in coding, algorithms, complexity analysis, and large-scale system design. We embrace a culture of diversity, intellectual curiosity, openness, and problem-solving. We encourage close collaboration while promoting self-direction.
In order to enhance collaboration and cross-functional partnerships, among other things, at this time, our organization follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.
Responsibilities:
- Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
- Work closely with software engineering teams to design, deploy and operate elements to ensure that systems are functionally robust.
- Ensure system scalability to handle growth in web traffic and data.
- Implement monitoring tools and set up metrics to keep track of system health and performance.
- Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
- Conduct performance tests to find and address system bottlenecks.
- Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
- Practice sustainable user support, incident response, and blameless postmortems.
Minimum Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field, with 3+ years of experience.
- Proven work experience as a Site Reliability Engineer, Systems Engineer, or similar software engineering role.
- Proficiency in high-level programming languages (e.g., Python, Go, Java, and shell scripting).
- Experience in network architecture, database modeling, cloud systems and large-scale distributed systems.
- Strong understanding of Linux operating systems and open-source technologies.
Preferred Qualifications:
- Experience with containers and container orchestration platforms such as Docker, Kubernetes or equivalent.
- Knowledge of monitoring tools and methodologies (such as Prometheus, Grafana).
- Excellent problem-solving skills, strategic thinking, and a strong ability to debug complex systems.
- Exceptional communication skills and the ability to effectively collaborate with cross-functional teams.
Location: San Jose