Site Reliability Engineer - USDS, San Jose

Site Reliability Engineering (SRE) at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. On our team, you'll have the opportunity to manage the complex challenges of scale while using expertise in coding, algorithms, complexity analysis, and large-scale system design. We embrace a culture of diversity, intellectual curiosity, openness, and problem-solving. We encourage close collaboration while promoting self-direction.

To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager or department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities:
- Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
- Work closely with software engineering teams to design, deploy, and operate elements to ensure that systems are functionally robust.
- Ensure system scalability to handle growth in web traffic and data.
- Implement monitoring tools and set up metrics to keep track of system health and performance.
- Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
- Conduct performance tests to find and address system bottlenecks.
- Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
- Practice sustainable user support, incident response, and blameless postmortems.

Minimum Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field, with 3+ years of experience.
- Proven work experience as a Site Reliability Engineer, Systems Engineer, or in a similar software engineering role.
- Proficiency in high-level programming languages (e.g., Python, Go, Java, and shell scripting).
- Experience in network architecture, database modeling, cloud systems, and large-scale distributed systems.
- Strong understanding of Linux operating systems and open-source technologies.

Preferred Qualifications:
- Experience with containers and container orchestration platforms such as Docker, Kubernetes, or equivalent.
- Knowledge of monitoring tools and methodologies (such as Prometheus and Grafana).
- Excellent problem-solving skills, strategic thinking, and a strong ability to debug complex systems.
- Exceptional communication skills and the ability to collaborate effectively with cross-functional teams.

Location: San Jose