Principal Engineer, GPU Platform


About the Team

The Applied Engineering team works across research, engineering, product, and design to bring OpenAI’s technology to consumers and businesses. You’ll join the team responsible for running the infrastructure that supports the models backing ChatGPT and the API. The systems we support include inference Kubernetes clusters, GPU health, InfiniBand performance, node lifecycle, and more. We seek to learn from deployment and distribute the benefits of AI, while ensuring that this powerful tool is used responsibly and safely. Safety is more important to us than unfettered growth.

About the Role

The inference compute team builds and maintains the infrastructure abstractions that allow OpenAI to run models at scale. This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Design and build the inference infrastructure that powers our products, enabling reliability and performance
Ensure our infrastructure can scale to the next order of magnitude
Like all other teams, we are responsible for the reliability of the systems we build. This includes an on-call rotation to respond to critical incidents as needed.
You might thrive in this role if you:

Have 10+ years of experience building core infrastructure
Have experience running GPU clusters at scale
Have experience operating orchestration systems such as Kubernetes at scale
Take pride in building and operating scalable, reliable, secure systems
Are comfortable with ambiguity and rapid change
This role is exclusively based in our San Francisco HQ. We offer relocation assistance to new employees.
Location:
San Francisco
