Senior Site Reliability Engineer, San Francisco, CA

Salesforce hosts web services and applications written by thousands of internal developers and tens of thousands of customers to provide the largest business automation cloud on the planet. The underlying infrastructure that enables this innovation and value is evolving to fully embrace lights-out operations, single-click deployment to tens of thousands of nodes, and services that self-heal and self-optimize. The platform is a multi-substrate Kubernetes and microservices platform, including Kubernetes (k8s), service mesh, and ingress, that powers Core CRM and a growing set of applications across Salesforce.

We are seeking a Site Reliability Engineer / DevOps Engineer to join our team and help build and operate the next-generation Microservices Platform, leveraging Service Mesh and Ingress Gateway load balancing. Our goal is to transform our software stack by adopting more cloud-native and AI-driven operational practices to build a highly reliable, self-healing, and scalable service mesh.

In this role, you are responsible for the high availability of the microservices supporting service mesh and ingress gateway on a large fleet of 1000+ clusters running various technologies such as Kubernetes, Docker, network load balancers, service mesh, and Istio. You’ll gain valuable experience troubleshooting real production issues, which will expand your knowledge of the architecture. You will contribute code to drive availability improvements for services. You will help improve the platform's visibility by implementing the necessary monitoring and metrics with Prometheus, Grafana, and other monitoring frameworks. You will drive automation efforts in Python/Golang/Puppet/Jenkins to eliminate manual work in day-to-day operations. You will drive improvements to CI/CD pipelines built on Terraform, Spinnaker, and Argo. You’ll implement AIOps automation, monitoring, and self-healing mechanisms to proactively fix issues and reduce MTTR and operational toil. You will get a chance to improve your communication and collaboration skills by working with various other Infrastructure teams across Salesforce. You will interact with a highly innovative and creative team of developers and architects. You will evaluate new technologies to solve problems as needed.

Job Requirements:

- 3+ years of experience in SRE/DevOps/Systems Engineering roles
- Experience operating large-scale cluster management systems (e.g. Kubernetes) for a mission-critical service
- Strong working experience with Kubernetes, Docker, container orchestration, service mesh, and ingress gateway
- Good knowledge of network technologies, such as TCP/IP, DNS, TLS termination, HTTP proxies, and load balancers
- Excellent troubleshooting skills, with the ability to learn new technologies in complex distributed systems
- Strong experience with observability tools like Prometheus, Grafana, Splunk, and Elasticsearch
- Strong working experience with Linux systems administration
- Good knowledge of Linux internals
- Good experience in scripting/programming languages such as Python and Golang
- Experience with AWS, Terraform, Spinnaker, and ArgoCD
- Ability to manage multiple projects simultaneously, meet deadlines, and adapt to shifting priorities
- Excellent problem-solving, analytical, and communication skills, with a strong ability to work effectively in a team environment

Location: San Francisco, CA
Salary: $200
Category: Engineering