Production Engineer
CoreWeave
Job Overview
Job Description
Role Overview: Production Engineer at CoreWeave
As a Production Engineer, you will be instrumental in ensuring the reliability and stability of CoreWeave’s essential cloud infrastructure for AI. Working closely with the Production Engineer Team Lead and other engineers, your responsibilities will span incident response, platform reliability, and continuous operational improvements. This role offers an excellent opportunity for individuals eager to enhance their technical skills, contribute significantly to a high-performing team, and drive operational excellence across CoreWeave’s innovative cloud services.
You will join a dynamic team focused on monitoring infrastructure health, troubleshooting issues, and participating in both daily operational tasks and critical incident resolution. As you gain experience, you will progress into more complex areas of incident management, process optimization, and system reliability.
Key Responsibilities
- Incident Management & Support: Assist in rapid incident response to identify and resolve service disruptions under senior guidance. Contribute to documenting incidents, root cause analysis (RCA), and post-incident reviews (PIRs). Help develop and maintain incident response playbooks and participate in stakeholder communication during incidents.
- Operational Support & Reliability: Monitor system performance using tools like Prometheus and Grafana. Support the implementation of automation and process improvements to enhance efficiency in incident detection and recovery. Contribute to defining KPIs and SLAs for incident management and collaborate across teams for platform resilience and disaster recovery.
- Team Collaboration & Development: Work collaboratively with engineers on system troubleshooting and workflow refinement. Engage in knowledge-sharing activities and participate in training and mentorship to advance technical skills and responsibilities.
Required Qualifications
- 4 years of experience in cloud operations, site reliability engineering (SRE), or similar technical roles.
- Understanding of cloud platforms and container orchestration (e.g., Kubernetes, AWS, GCP) and basic knowledge of cloud infrastructure.
- Familiarity with incident management practices (e.g., ITIL, SRE best practices).
- Experience with monitoring and alerting tools (e.g., Prometheus, Grafana) or a strong willingness to learn.
- Basic experience with scripting or automation tools (e.g., Python, Bash, Terraform, Ansible).
- Strong communication skills to articulate technical concepts clearly to diverse audiences.
- Ability to thrive in a fast-paced, high-pressure environment while adapting quickly.
Preferred Qualifications
- Exposure to Kubernetes, containerization, and distributed systems.
- Familiarity with change management processes and post-incident analysis.
- Experience with automated systems or self-healing infrastructure.
- A strong desire for continuous learning and growth in cloud operations, reliability engineering, and incident management.
What CoreWeave Offers
CoreWeave provides a competitive base salary range of $139,000 to $204,000, determined by your qualifications, experience, and market location. Our comprehensive total rewards package includes a discretionary bonus, equity awards, and a full benefits program. We prioritize both market alignment and internal equity in our compensation decisions.
Our benefits include 100% company-paid medical, dental, and vision insurance, life insurance, short- and long-term disability, FSA, HSA, tuition reimbursement, ESPP, mental wellness support, family-forming support, paid parental leave, flexible childcare with Kinside, 401(k) with employer match, and flexible PTO. Enjoy catered lunch daily in our offices and data centers, a casual work environment, and a culture focused on innovative disruption.
Our Workplace
We embrace a hybrid work environment. Remote work is considered for candidates who live more than 30 miles from an office, depending on specific role requirements. New hires attend onboarding at one of our hubs, and teams gather quarterly for collaboration.
Key skills and competencies
- Cloud Operations
- Site Reliability Engineering (SRE)
- Kubernetes
- Prometheus
- Grafana
- Incident Management
- Root Cause Analysis (RCA)
- Python
- Bash
- Terraform
- Ansible
How to Get Hired at CoreWeave
- Research CoreWeave's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor to understand their 'Essential Cloud for AI' vision.
- Tailor your resume for SRE: Customize your resume to highlight experience with cloud platforms, SRE best practices, incident management, and tools like Kubernetes, Prometheus, and Grafana for a Production Engineer role.
- Showcase technical expertise: During interviews, be prepared to discuss specific examples of your experience in cloud operations, troubleshooting complex issues, and implementing automation.
- Demonstrate problem-solving skills: CoreWeave values adaptability; emphasize your ability to learn quickly and solve technical challenges in a fast-paced, high-pressure environment.
- Network with CoreWeave employees: Connect with current or past employees on LinkedIn to gain insights into the company culture and interview process.