What is the primary focus of the Senior ML Systems Engineer role at Cohere?

The Senior ML Systems Engineer at Cohere will focus on building, maintaining, and evolving the training framework for large-scale language models. This involves designing distributed training systems, improving performance and stability, and developing essential tooling for ML teams.

What technical skills are most important for this Senior ML Systems Engineer position?

Key technical skills include strong engineering experience in large-scale distributed training or HPC systems, deep familiarity with JAX internals or similar distributed training libraries, and experience with multi-node cluster orchestration tools like Kubernetes or Slurm. Comfort in debugging performance across various layers of the ML stack is also crucial.

Does Cohere require experience with specific ML frameworks for this role?

While deep familiarity with JAX internals is highly preferred, Cohere also values experience with other ML frameworks such as PyTorch, DeepSpeed, or Megatron. Contributions to these frameworks are considered a plus.

What kind of impact can a Senior ML Systems Engineer have at Cohere?

This role offers the opportunity for massive impact by owning critical components of the training stack. You will shape the infrastructure for frontier-scale models, build tools that accelerate research, and contribute to the widespread adoption of AI by enterprises.

Is this Senior ML Systems Engineer role remote or hybrid?

Cohere is remote-friendly, offering flexibility for this role. They have offices in several global locations, but remote work is supported, potentially with a co-working benefit if not near an office.

How does Cohere handle accommodations for applicants with disabilities?

Cohere is committed to an inclusive work environment and provides equal opportunities. Applicants requiring accommodations during the recruitment process are encouraged to submit an Accommodations Request Form so their needs can be met.

What are the opportunities for professional growth as a Senior ML Systems Engineer at Cohere?

The role provides opportunities to work on challenging and consequential ML systems problems, collaborate with a world-class team, and have end-to-end ownership. This exposure to frontier AI infrastructure allows for significant professional development and impact.

What are the key performance indicators for a Senior ML Systems Engineer at Cohere?

Key performance indicators would likely revolve around improving training throughput and stability, enhancing the reliability and scalability of the training framework, and increasing developer velocity through effective tooling and system design.

Senior ML Systems Engineer, Frameworks & Tooling

Cohere · New York, NY

Hybrid
Full-time
$180,000 / year

Job highlights

Build and own LLM training framework.
Design distributed training abstractions.
Improve training throughput and stability.
Develop tooling for ML teams.
Debug performance across ML systems.

About the role

About Cohere

Cohere is the leading security-first enterprise AI company. We build cutting-edge foundation AI models and end-to-end products that are designed to solve real-world business problems. We’re training and deploying frontier models for enterprises who are building AI systems. We believe that our work is instrumental to the widespread adoption of AI and we are looking for folks that want to be part of that. We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. Cohere is a team of researchers, engineers, designers, and more, who are all passionate about their craft. We are a global technology company co-headquartered in Toronto and San Francisco, with key offices in London, New York City, Montreal, Seoul, Germany and Paris. Join us!

The Role

We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs. If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

What You’ll Work On

Build and own the training framework responsible for large-scale LLM training.
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100).
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training.
Investigate and resolve performance bottlenecks across the ML systems stack.
Build robust systems that ensure reproducible, debuggable, large-scale runs.

You Might Be a Good Fit If You Have

Strong engineering experience in large-scale distributed training or HPC systems.
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
Experience working with containerized environments (Docker, Singularity/Apptainer).
A track record of building tools that increase developer velocity for ML teams.
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.

Nice to Have

Experience with training LLMs or other large transformer architectures.
Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).
Experience with data pipeline optimization, sharded datasets, or caching strategies.
Background in performance engineering, profiling, or low-level systems.
Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

Why Join Us

You’ll work on some of the most challenging and consequential ML systems problems today.
You’ll collaborate with a world-class team working fast and at scale.
You’ll have end-to-end ownership over critical components of the training stack.
You’ll shape the next generation of infrastructure for frontier-scale models.
You’ll build tools and systems that directly accelerate research and model quality.

Sample Projects

Build a high-performance data loading and caching pipeline.
Implement performance profiling across the ML systems stack.
Develop internal metrics and monitoring for training runs.
Build reproducibility and regression testing infrastructure.
Develop a performant fault-tolerant distributed checkpointing system.

How And Where We Work

Cohere is remote-friendly. We have offices in Toronto, San Francisco, New York City, London, Paris, Montreal, and more coming soon. For those in the office: a daily lunch program, plenty of snacks, and regular community and social events. For those not near an office: a co-working benefit so you can work alongside others in your city.

If any of the above doesn’t line up exactly with your experience, we still encourage you to apply. We strive to create an inclusive work environment for all; we welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs. We may use AI-enabled tools to screen and assess applicants against the criteria for this position. This helps our recruiters identify potentially qualified candidates, but it doesn't limit the applications our recruiters may review or consider.

Key skills/competency

Senior ML Systems Engineer
Large-scale distributed training
HPC systems
JAX internals
Multi-node cluster orchestration
Performance debugging
Containerized environments
Developer tooling
ML frameworks
LLM training

Skills & topics

Senior ML Systems Engineer
Machine Learning
Distributed Systems
HPC
LLM Training
JAX
Kubernetes
Performance Tuning
MLOps
Software Engineering

How to get hired

Tailor your resume: Highlight experience with large-scale distributed training, HPC systems, and ML frameworks like JAX.
Showcase tooling skills: Emphasize your track record in building tools that improve developer velocity for ML teams.
Prepare for technical interviews: Be ready to discuss debugging performance issues across CUDA, NCCL, networking, and data pipelines.
Demonstrate system design: Practice designing distributed training abstractions and robust, reproducible ML systems.
Research Cohere: Understand their focus on enterprise AI and their mission to drive AI adoption.

Technical preparation

Master JAX internals and distributed training concepts.
Practice debugging CUDA, NCCL, and network performance.
Design scalable distributed training architectures.
Build and test ML tooling and CI/CD pipelines.

Behavioral questions

Describe a complex system you designed.
How do you handle trade-offs in engineering?
Share an experience debugging distributed systems.
How do you collaborate with research teams?
