Member of Technical Staff, Essential AI

ML Infrastructure, Platform Engineer

$191.3-225k

+ Equity

AWS
Docker
Kubernetes
GCP
Senior and Expert level
San Francisco Bay Area

5 days a week in office

Essential AI

AI products for enterprise

Job no longer available

1-20 employees

B2B
Artificial Intelligence
Enterprise
Internal tools
Business Intelligence
SaaS

Company mission

To deepen the partnership between humans and computers, unlocking collaborative capabilities that far exceed what could be achieved today.

Role

Who you are

  • A strong understanding of the architectures of new AI accelerators such as TPU, IPU, and HPU, and of their tradeoffs
  • Knowledge of parallel computing concepts and distributed systems
  • Prior experience in performance tuning of LLM training and/or inference workloads. Experience with MLPerf or internal production workloads will be valued
  • 6+ years of relevant industry experience in leading the design of large-scale & production ML infra systems
  • Experience with training and building large language models using frameworks such as Megatron and DeepSpeed, and with deployment frameworks such as vLLM, TGI, and TensorRT-LLM
  • Comfortable working under the hood with kernel languages like OpenAI Triton and Pallas, and with compilers like XLA
  • Experience with INT8/FP8 training and inference, quantization and/or distillation
  • Knowledge of container technologies like Docker and Kubernetes, and of cloud platforms like AWS and GCP
  • Intermediate fluency with network fundamentals such as VPCs, subnets, routing tables, and firewalls

What the job involves

  • The ML Infra Platform Engineer will be responsible for architecting and building the compute infra that powers the training and serving of our models
  • This requires a full understanding of the complete backend stack, from frameworks to compilers to runtimes to kernels
  • Running and training models at scale often requires solving novel system problems
  • As an Infra Systems Engineer, you'll be responsible for identifying these problems and then developing systems that optimize the throughput and robustness of distributed systems
  • With proven experience building large-scale platforms, you will be responsible for building and advancing the systems that allow research and engineering organizations to iteratively develop, test, and deploy new features reliably, with high velocity, and with a fast, frictionless development cycle
  • Design, build, and maintain scalable machine learning infrastructure to support our model training, inference and applications
  • Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs. Work on parallelism methods to improve training in a fast and reliable way
  • You will help oversee and drive the vision of how we should build, test, and deploy models, while taking ownership of and transforming the state-of-the-art development experience for research
  • Develop tools and frameworks to automate and streamline ML experimentation and management
  • Collaborate with other researchers and product engineers to bring magical product experiences through large language models
  • Work on lower levels of the stack to build high-performing training and serving infrastructure, including researching new techniques and writing custom kernels as needed to achieve improvements
  • Be willing to optimize performance and efficiency across different accelerators

Company

Funding (2 rounds)

Dec 2023

$56.5m

SERIES A

Mar 2023

$8.3m

SEED

Total funding: $64.8m

Our take

Essential AI shifted out of stealth mode in 2023 with the $56.5 million backing of Google and Nvidia, no mean feat for a company founded the year before. It develops full-stack AI products to support productivity by automating time-consuming and monotonous workflows.

Bringing forward experience and knowledge from careers at giants like Google, the company is fuelled by a deep understanding of the sector. Though still in its early stages, its promise of benefiting enterprise operations by deepening the relationship between computers and humans led to its exciting raise in funding.

This Series A will allow the company to build a diverse team to delve into a new market, building tools that will drastically change the scale of productivity for its customers. With ambitious goals, we’re excited to see the specifics of what Essential AI will bring to the space moving forward.

Freddie

Company Specialist at Welcome to the Jungle