GPU / CUDA Engineer

Bespoke Labs · MENA

  • Hybrid
  • Contract
  • $150,000 / year
  • MENA

Job highlights

  • Optimize CUDA kernels for AI systems.
  • Build tooling for peak system performance.
  • Work with cutting-edge GPU hardware.
  • Integrate kernels into major ML frameworks.
  • Contribute to open-source GPU libraries.

About the role

About Bespoke Labs

Bespoke Labs is a VC-backed applied AI research startup in Mountain View, CA, building core infrastructure and RL environments to train and evaluate intelligent agents. Home of OpenThoughts (100K+ monthly downloads, 200+ models trained) and Terminal Bench, a leading agentic coding benchmark used by frontier labs. Founded by ex-Google DeepMind and UC Berkeley faculty, advised by Jeff Dean.

The Role

You'll write, optimize, and debug CUDA kernels that directly power our AI systems, from training and inference to RL workloads. You'll also build the tooling (profilers, inference engines, benchmarks) that keeps our systems at peak performance.

What You'll Do

  • Write and optimize CUDA kernels for GEMM, attention, MoE, and graph operations
  • Use PTX assembly and architecture-specific techniques for Hopper/Blackwell hardware
  • Apply memory coalescing, warp-level programming, tensor cores, and compute/memory overlap
  • Integrate kernels into PyTorch, vLLM, Megatron, and TorchTitan
  • Profile and debug with Nsight Systems, Nsight Compute, and Torch Profiler
  • Build internal tooling and contribute to open-source GPU libraries

What We Need - Must Have

  • Hands-on CUDA kernel optimization experience (kernel hacking strongly preferred)
  • Strong grasp of GPU architecture — memory hierarchy, warp execution, synchronization
  • Proficient in C/C++ for high-performance systems
  • Experience in profiling and resolving GPU bottlenecks

What We Need - Nice to Have

  • FlashAttention or Transformer kernel optimization
  • CUTLASS, Triton, Thrust, or CUB experience
  • Distributed/multi-GPU (NVLink, NCCL) background
  • Open-source GPU contributions or published research

Why Join

High ownership. Frontier research. Real production impact. A small, elite team is building the infrastructure that the next generation of AI runs on.

Key skills/competencies

  • GPU CUDA Engineer
  • CUDA Kernel Optimization
  • GPU Architecture
  • C/C++
  • High-Performance Computing
  • Profiling and Debugging
  • PTX Assembly
  • Tensor Cores
  • Distributed Systems
  • AI Infrastructure

Skills & topics

  • GPU
  • CUDA
  • Engineer
  • AI
  • Machine Learning
  • Deep Learning
  • Optimization
  • C++
  • High Performance Computing
  • Kernel Development

How to get hired

  • Tailor your resume: Highlight CUDA kernel optimization and GPU architecture experience.
  • Showcase contributions: Emphasize open-source GPU library involvement or research.
  • Prepare for technical interviews: Be ready to discuss GPU profiling and C/C++ performance tuning.
  • Research Bespoke Labs: Understand their work in applied AI and RL environments.
  • Demonstrate problem-solving: Prepare examples of resolving GPU bottlenecks.

Technical preparation

  • Master CUDA kernel optimization techniques.
  • Deep dive into GPU architecture specifics.
  • Practice C/C++ high-performance coding.
  • Study profiling tools like Nsight Systems.

Behavioral questions

  • Describe a challenging GPU optimization problem.
  • How do you debug complex kernel issues?
  • Tell me about your experience with ML frameworks.
  • How do you stay updated on GPU advancements?

Frequently asked questions

What kind of CUDA optimization experience is Bespoke Labs looking for in a GPU CUDA Engineer?
Bespoke Labs is seeking hands-on CUDA kernel optimization experience, with a strong preference for 'kernel hacking'. This means a deep understanding of writing and fine-tuning CUDA kernels for performance, particularly for AI workloads like GEMM, attention, and MoE operations.

How important is knowledge of specific GPU hardware like Hopper or Blackwell for this role?
Hopper/Blackwell experience is not listed among the must-haves, but using PTX assembly and architecture-specific techniques for this hardware is part of the day-to-day work, so proficiency with them is a significant advantage.

What ML frameworks will I integrate CUDA kernels into as a GPU CUDA Engineer at Bespoke Labs?
As a GPU CUDA Engineer, you will integrate optimized kernels into PyTorch, vLLM, Megatron, and TorchTitan, so that the high-performance kernels directly benefit the AI systems being developed.

Does Bespoke Labs encourage contributions to open-source GPU projects?
Yes. Contributing to open-source GPU libraries is part of the role, and prior open-source GPU contributions or published research are listed as nice-to-haves, as is experience with CUTLASS, Triton, Thrust, or CUB.

What kind of impact can a GPU CUDA Engineer have at Bespoke Labs?
As a GPU CUDA Engineer, you'll have high ownership and direct impact on frontier AI research and production systems. You will be building the core infrastructure that the next generation of AI runs on, working within a small, elite team.

Is this GPU CUDA Engineer role remote or on-site?
The listing is tagged as a hybrid contract role with MENA as the location. Bespoke Labs itself is based in Mountain View, CA, and the description does not mention a fully remote arrangement.