Member of Technical Staff, Essential AI

ML Infrastructure, Platform Engineer

$191.3-225k

+ Equity

AWS

Docker

Kubernetes

GCP

Senior and Expert level

San Francisco Bay Area

5 days a week in office

AI products for enterprise

Open for applications

21-100 employees

B2B, Artificial Intelligence, Enterprise, Internal tools, Business Intelligence, SaaS

Company mission

To build an open platform to fuel and accelerate AI breakthroughs globally.

Role

Who you are

A strong understanding of the architectures of new AI accelerators such as TPUs, IPUs, and HPUs, and their tradeoffs
Knowledge of parallel computing concepts and distributed systems
Prior experience tuning the performance of LLM training and/or inference workloads; experience with MLPerf or internal production workloads is valued
6+ years of relevant industry experience leading the design of large-scale production ML infra systems
Experience building and training large language models with frameworks such as Megatron and DeepSpeed, and with deployment frameworks such as vLLM, TGI, and TensorRT-LLM
Comfortable working under the hood with kernel languages such as OpenAI Triton and Pallas, and compilers such as XLA
Experience with INT8/FP8 training and inference, quantization and/or distillation
Knowledge of container technologies such as Docker and Kubernetes, and cloud platforms such as AWS and GCP
Intermediate fluency with networking fundamentals such as VPCs, subnets, routing tables, and firewalls

What the job involves

The ML Infra Platform Engineer will be responsible for architecting and building the compute infra that powers the training and serving of our models
This requires a full understanding of the complete backend stack, from frameworks to compilers to runtimes to kernels
Running and training models at scale often requires solving novel system problems
As an Infra Systems Engineer, you'll be responsible for identifying these problems and then developing systems that optimize the throughput and robustness of distributed systems
With proven experience building large-scale platforms, you will be responsible for building and advancing the systems that allow research and engineering organizations to iteratively develop, test, and deploy new features reliably, with high velocity, and with a fast, frictionless development cycle
Design, build, and maintain scalable machine learning infrastructure to support our model training, inference and applications
Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs. Work on parallelism methods to make training fast and reliable
You will help oversee and drive the vision of how we should build, test, and deploy models, while taking ownership of and transforming the state-of-the-art development experience for research
Develop tools and frameworks to automate and streamline ML experimentation and management
Collaborate with other researchers and product engineers to bring magical product experiences through large language models
Work on lower levels of the stack to build high-performing training and serving infrastructure, including researching new techniques and writing custom kernels as needed to achieve improvements
Be willing to optimize performance and efficiency across different accelerators

Our take

Essential AI is on a mission to make frontier AI accessible to everyone. The team builds open models, easy-to-use tools, and reliable pipelines so anyone can experiment, iterate, and create faster. Founded by a team with experience at Google and other tech giants, the company combines deep know-how with a curiosity-driven approach to AI.

Even though it's still early days, Essential AI is making waves by sharing cutting-edge models and open science along the way. This approach doesn't just accelerate research; it also invites collaboration, helps new talent jump in, and turns complex AI breakthroughs into real-world tools people can actually use.

With funding and a growing team, Essential AI is now focused on building products that can really change how businesses work, making workflows smarter and productivity smoother. By partnering with builders and developers around the world, the company is shaping the future of AI in a way that's open, practical, and pretty exciting.

Freddie

Company Specialist at Welcome to the Jungle

Company

Funding (2 rounds)

Dec 2023

$56.5m

SERIES A

Mar 2023

$8.3m

SEED

Total funding: $64.8m

Company HQ

SoMa, San Francisco, CA

Leadership

Ashish Vaswani

(Co-Founder & CEO)

Previously co-founder and Chief Scientist of Adept AI Labs and Staff Research Scientist at Google Brain.
