12 days ago

Senior Principal Site Reliability Engineer

Oracle

Hybrid
Full Time
$180,000

Job Overview

Job Title: Senior Principal Site Reliability Engineer
Job Type: Full Time
Category: Commerce
Experience: 5 Years
Degree: Master
Offered Salary: $180,000
Location: Hybrid

Job Description

About the Role: Senior Principal Site Reliability Engineer

Are you a creative person who loves a challenge? Solve the complex puzzles you’ve been dreaming of as our Senior Principal Site Reliability Engineer. If you have a passion for innovation in tech, we want you on our team! Thrive in this crucial automation role. Oracle is a technology leader that’s changing how the world does business. We’re looking for an experienced and self-motivated person. We appreciate you taking the time to review the list of qualifications and to apply for the position.

Come and join us! Building on our Cloud momentum, Oracle has formed a new organization: Oracle Health. This team will focus on product deployment, sustainability, troubleshooting, and product strategy for Oracle Health while building out a complete platform supporting modernized, automated healthcare. This is a net-new line of business, built with an entrepreneurial spirit that promotes an energetic and creative environment. We are unencumbered and will need your contribution to make it a world-class engineering center with a focus on excellence.

As a Senior Principal Site Reliability Engineer, you will be responsible for defining and deploying key services with a deep focus on architecture, production operations, capacity planning, performance management, deployment, and release engineering. You will work with multiple cross-functional teams to deliver new and outstanding experiences to our collaborators while ensuring reliability and performance.

What You'll Do - Key Responsibilities

  • Own the full service lifecycle: design, implementation, deployment, on-call, and continuous improvement—maintaining high code and reliability standards.
  • Define and meet service-level objectives (availability, latency, durability) while reducing toil through automation, observability, and self-healing mechanisms.
  • Lead architecture, analysis, design, implementation, and production operations for Core System Framework solutions, with strong documentation and runbooks.
  • Create and maintain clear, version-controlled documentation—architectural diagrams, SOPs, runbooks, and incident playbooks—to ensure repeatable operations, auditability, and fast onboarding.
  • Design, write, and deploy software that improves the availability, scalability, and efficiency of platform services.
  • Develop designs, architectures, standards, and methods for large-scale distributed systems.
  • Build automation to prevent problem recurrence; drive real-time monitoring, alerting, and self-healing into production systems.
  • Conduct capacity planning and demand forecasting; perform software performance analysis, system tuning, and optimization.
  • Contribute to and support platform services across architecture, provisioning, configuration, deployment, and ongoing operations.
  • Partner with distributed teams to prototype and launch new platform services.
  • Stay current on emerging technologies and introduce innovations that improve reliability, security, and developer productivity.

Leadership and Collaboration

  • Mentor and guide engineers in distributed systems design, high-scale data processing, and operational excellence.
  • Set and raise engineering standards across multiple teams; model best practices in reliability, security, and automation.
  • Collaborate closely with storage, networking, observability, and security teams to deliver platform features and secure-by-default designs.

On-call and Operations

  • Participate in an on-call rotation; lead incident response, postmortems, and follow-through on corrective actions to drive continuous improvement.

Key Requirements & Experience

  • The ability to acquire and maintain a federal security clearance is vital for this role, which requires US citizenship.
  • Experience developing and operating large-scale distributed services and applications.
  • Container administration and development using Kubernetes, Docker, Mesos, or similar.
  • Infrastructure automation through Terraform, Chef, Ansible, Puppet, Packer or similar.
  • Experience with cloud orchestration frameworks, including development and SRE support of these systems.
  • Experience with CI/CD pipelines, including VCS (Git, SVN, etc.), GitLab Runners, Jenkins, and Rundeck.
  • Working with or supporting production, test, and development environments for medium to large user environments.
  • Experience in developing scripts to automate software deployments and installations using PowerShell or Bash.
  • Knowledge of cloud compute technologies, network monitoring, data processing and analytics.
  • Experience with a modern programming language such as Go, Java, Python, C++, or equivalent.
  • Experience working with fault-tolerant, highly available, high-throughput, distributed, scalable systems.
  • Experience operating services in a major cloud such as AWS, OCI, or Azure.

About Oracle

Only Oracle brings together the data, infrastructure, applications, and expertise to power everything from industry innovations to life-saving care. And with AI embedded across our products and services, we help customers turn that promise into a better future for all. Discover your potential at a company leading the way in AI and cloud solutions that impact billions of lives.

True innovation starts when everyone is empowered to contribute. That’s why we’re committed to growing a workforce that promotes opportunities for all with competitive benefits that support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.

Key Skills/Competency

  • Site Reliability Engineering (SRE)
  • DevOps
  • Cloud Operations
  • Distributed Systems
  • Kubernetes & Docker
  • Infrastructure as Code (IaC)
  • CI/CD
  • Automation Scripting
  • Incident Response
  • Capacity Planning

Tags:

Site Reliability Engineer
Site Reliability
DevOps
Automation
Cloud Operations
Architecture
Distributed Systems
Capacity Planning
Incident Response
Performance
Mentoring
Kubernetes
Docker
Terraform
Ansible
CI/CD
Go
Python
Java
AWS
OCI
Azure

How to Get Hired at Oracle

  • Research Oracle's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your resume: Highlight SRE, DevOps, cloud, and automation experience relevant to Oracle Health Federal Operations.
  • Showcase technical prowess: Prepare for in-depth questions on Kubernetes, Terraform, and distributed systems architecture.
  • Demonstrate problem-solving: Be ready to discuss incident response, post-mortems, and automation strategies to reduce toil.
  • Emphasize collaboration & leadership: Illustrate your experience in mentoring, cross-functional teamwork, and setting engineering standards.
