
Senior ML Systems Engineer, Frameworks & Tooling
Cohere · New York, NY
- Hybrid
- Full-time
- $180,000 / year
Job highlights
- Build and own the LLM training framework.
- Design distributed training abstractions.
- Improve training throughput and stability.
- Develop tooling for ML teams.
- Debug performance across ML systems.
About the role
About Cohere
Cohere is the leading security-first enterprise AI company. We build cutting-edge foundation AI models and end-to-end products that are designed to solve real-world business problems. We’re training and deploying frontier models for enterprises that are building AI systems. We believe that our work is instrumental to the widespread adoption of AI and we are looking for folks who want to be part of that. We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. Cohere is a team of researchers, engineers, designers, and more, who are all passionate about their craft. We are a global technology company co-headquartered in Toronto and San Francisco, with key offices in London, New York City, Montreal, Seoul, Germany and Paris. Join us!
The Role
We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs. If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.
What You’ll Work On
- Build and own the training framework responsible for large-scale LLM training.
- Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
- Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).
- Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
- Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training.
- Investigate and resolve performance bottlenecks across the ML systems stack.
- Build robust systems that ensure reproducible, debuggable, large-scale runs.
You Might Be a Good Fit If You Have
- Strong engineering experience in large-scale distributed training or HPC systems.
- Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
- Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
- Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
- Experience working with containerized environments (Docker, Singularity/Apptainer).
- A track record of building tools that increase developer velocity for ML teams.
- Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
- Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.
Nice to Have
- Experience with training LLMs or other large transformer architectures.
- Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
- Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).
- Experience with data pipeline optimization, sharded datasets, or caching strategies.
- Background in performance engineering, profiling, or low-level systems.
- Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).
Why Join Us
- You’ll work on some of the most challenging and consequential ML systems problems today.
- You’ll collaborate with a world-class team working fast and at scale.
- You’ll have end-to-end ownership over critical components of the training stack.
- You’ll shape the next generation of infrastructure for frontier-scale models.
- You’ll build tools and systems that directly accelerate research and model quality.
Sample Projects
- Build a high-performance data loading and caching pipeline.
- Implement performance profiling across the ML systems stack.
- Develop internal metrics and monitoring for training runs.
- Build reproducibility and regression testing infrastructure.
- Develop a performant fault-tolerant distributed checkpointing system.
How And Where We Work
Cohere is remote-friendly. We have offices in Toronto, San Francisco, New York City, London, Paris, Montreal, and more coming soon. For those in the office: a daily lunch program, plenty of snacks, and regular community and social events. For those not near an office: a co-working benefit so you can work alongside others in your city.
If any of the above doesn’t line up exactly with your experience, we still encourage you to apply. We strive to create an inclusive work environment for all; we welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs. We may use AI-enabled tools to screen and assess applicants against the criteria for this position. This helps our recruiters identify potentially qualified candidates, but it doesn't limit the applications our recruiters may review or consider.
Key skills/competency
- Senior ML Systems Engineer
- Large-scale distributed training
- HPC systems
- JAX internals
- Multi-node cluster orchestration
- Performance debugging
- Containerized environments
- Developer tooling
- ML frameworks
- LLM training
Skills & topics
- Senior ML Systems Engineer
- Machine Learning
- Distributed Systems
- HPC
- LLM Training
- JAX
- Kubernetes
- Performance Tuning
- MLOps
- Software Engineering
How to get hired
- Tailor your resume: Highlight experience with large-scale distributed training, HPC systems, and ML frameworks like JAX.
- Showcase tooling skills: Emphasize your track record in building tools that improve developer velocity for ML teams.
- Prepare for technical interviews: Be ready to discuss debugging performance issues across CUDA, NCCL, networking, and data pipelines.
- Demonstrate system design: Practice designing distributed training abstractions and robust, reproducible ML systems.
- Research Cohere: Understand their focus on enterprise AI and their mission to drive AI adoption.
Frequently asked questions
- What is the primary focus of the Senior ML Systems Engineer role at Cohere?
- The Senior ML Systems Engineer at Cohere will focus on building, maintaining, and evolving the training framework for large-scale language models. This involves designing distributed training systems, improving performance and stability, and developing essential tooling for ML teams.
- What technical skills are most important for this Senior ML Systems Engineer position?
- Key technical skills include strong engineering experience in large-scale distributed training or HPC systems, deep familiarity with JAX internals or similar distributed training libraries, and experience with multi-node cluster orchestration tools like Kubernetes or Slurm. Comfort in debugging performance across various layers of the ML stack is also crucial.
- Does Cohere require experience with specific ML frameworks for this role?
- The posting lists deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops among the qualifications. Contributions to ML frameworks such as PyTorch, JAX, DeepSpeed, Megatron, or xFormers are listed as nice to have.
- What kind of impact can a Senior ML Systems Engineer have at Cohere?
- This role offers the opportunity for massive impact by owning critical components of the training stack. You will shape the infrastructure for frontier-scale models, build tools that accelerate research, and contribute to the widespread adoption of AI by enterprises.
- Is this Senior ML Systems Engineer role remote or hybrid?
- The listing is marked Hybrid and based in New York, NY. Cohere describes itself as remote-friendly, with offices in several global locations and a co-working benefit for those not near an office.
- How does Cohere handle accommodations for applicants with disabilities?
- Cohere is committed to an inclusive work environment and provides equal opportunities. Applicants requiring accommodations during the recruitment process are encouraged to submit an Accommodations Request Form so their needs can be met.
- What are the opportunities for professional growth as a Senior ML Systems Engineer at Cohere?
- The role provides opportunities to work on challenging and consequential ML systems problems, collaborate with a world-class team, and have end-to-end ownership. This exposure to frontier AI infrastructure allows for significant professional development and impact.
