
Staff Database Platform Reliability Engineer
Rackspace Technology · Gurgaon Rural, Haryana, India
- Hybrid
- Full-time
- $150,000 / year
Job highlights
- Ensure multi-cloud DBaaS platform reliability and performance.
- Combine database expertise with SRE principles.
- Build highly available, automated, resilient platforms.
- Drive operational standards and automation frameworks.
- Requires 8-10+ years in DBA/Platform Engineering.
About the role
We are seeking an experienced Site Reliability Engineer (SRE) / Database Reliability Engineer (DBRE) to ensure the reliability, performance, scalability, and operational excellence of our multi-cloud DBaaS platform across Microsoft Azure, Amazon Web Services, and Google Cloud Platform. This role combines deep database expertise with SRE principles to build highly available, automated, and resilient database platforms. The DBRE Lead will drive operational standards, automation frameworks, and reliability engineering practices across distributed cloud environments.
What We’re Looking For
- 8-10+ years in DBA / Platform Engineering
- Strong multi-cloud experience (Azure / AWS / GCP – at least two)
- Deep HA/DR & performance tuning expertise
- Automation-first mindset (Terraform, scripting, CI/CD)
- Experience in SaaS/DBaaS environments preferred
Key Skills and Competencies
A Site Reliability Engineer (SRE) in a Database-as-a-Service (DBaaS) support role typically needs the following mandatory skills:
Database Administration (DBA) Skills
- Primary Database: MySQL
- Secondary Databases: PostgreSQL, SQL Server
- Database Backup & Recovery: Tools and strategies for database backups and disaster recovery.
- Performance Tuning: Query optimization, indexing strategies, and database performance troubleshooting.
- Database Security: User management, roles, access control, and auditing.
Cloud Infrastructure Knowledge (DBaaS)
- Cloud Platforms: AWS (RDS, Aurora), Azure (Cosmos DB, SQL Database), GCP (Cloud SQL, Firestore).
- Infrastructure as Code (IaC): Terraform, CloudFormation, Kubernetes.
- Kubernetes & Containers: Running databases in containers, orchestrated with platforms such as Kubernetes.
- Observability Tools: ELK stack (Elasticsearch, Logstash, Kibana).
- Database Migration: Migrating databases across different platforms or cloud environments.
- Database Scaling: Vertical and horizontal scaling techniques in cloud environments.
Site Reliability Engineering (SRE) Principles
- Incident Management: Handling database outages, incident response, and on-call rotations.
- Monitoring and Alerting: Tools like Prometheus, Grafana, Datadog, CloudWatch.
- Service Level Objectives (SLOs) / Service Level Agreements (SLAs): Defining and meeting uptime and performance targets.
- Disaster Recovery Planning: Designing and maintaining high availability (HA) and disaster recovery (DR) solutions.
Scripting and Automation
- Scripting Languages: Python, shell scripting (Bash), PowerShell.
- Automation Tools: Ansible, Puppet, Chef.
- Infrastructure Automation: Automating database deployment, patching, and scaling.
Networking and Infrastructure
- Networking Basics: TCP/IP, DNS, firewalls, load balancers.
- Database Connectivity: Connection pooling, failover strategies, and multi-region deployment.
- Storage and Disk Management: Understanding IOPS, latency, and throughput.
Expertise in Linux OS
- Operating Systems: RHEL, Ubuntu, CentOS.
- File Systems: Understanding of file systems (ext4, XFS, etc.), permissions, and ownership (chmod, chown, ACLs).
- Process Management: Knowledge of process monitoring, management, and troubleshooting (ps, top, htop, kill, pkill, etc.).
- Monitoring Tools: Proficiency with tools like top, htop, vmstat, iostat, sar, and dstat to monitor CPU, memory, disk I/O, and network usage.
- Log Analysis: Ability to analyze system logs (/var/log/, journalctl, dmesg) for troubleshooting.
- Resource Limits: Understanding of resource limits (CPU, memory, disk, network) and how they impact database performance.
- Partitioning and Storage: Knowledge of partitioning tools (fdisk, parted) and file system management (mkfs, mount, umount). Understanding of RAID configurations and Logical Volume Management (LVM) for storage scalability.
Troubleshooting and Debugging
- Log Analysis: Reading and analyzing database and system logs.
- Root Cause Analysis (RCA): Performing in-depth analysis after major incidents.
- Query Performance: Analyzing slow queries, deadlocks, and resource contention.
Soft Skills
- Communication Skills: Clear communication with stakeholders and engineering teams.
- Problem-Solving: Ability to troubleshoot complex database issues under pressure.
- Collaboration: Working closely with DevOps, Infrastructure, and Engineering teams.
About Rackspace Technology
We are the multicloud solutions experts. We combine our expertise with the world’s leading technologies (across applications, data and security) to deliver end-to-end solutions. We have a proven record of advising customers based on their business challenges, designing solutions that scale, building and managing those solutions, and optimizing returns into the future. Named a best place to work year after year by Fortune, Forbes and Glassdoor, we attract and develop world-class talent. Join us on our mission to embrace technology, empower customers and deliver the future.
More on Rackspace Technology
Though we’re all different, Rackers thrive through our connection to a central goal: to be a valued member of a winning team on an inspiring mission. We bring our whole selves to work every day. And we embrace the notion that unique perspectives fuel innovation and enable us to best serve our customers and communities around the globe. We welcome you to apply today and want you to know that we are committed to offering equal employment opportunity without regard to age, color, disability, gender reassignment or identity or expression, genetic information, marital or civil partner status, pregnancy or maternity status, military or veteran status, nationality, ethnic or national origin, race, religion or belief, sexual orientation, or any legally protected characteristic. If you have a disability or special need that requires accommodation, please let us know.
Key skills/competency
- Staff Database Platform Reliability Engineer
- Site Reliability Engineering (SRE)
- Database Administration (DBA)
- Multi-cloud Platforms (Azure, AWS, GCP)
- Performance Tuning
- Automation (Terraform, Scripting, CI/CD)
- High Availability (HA) / Disaster Recovery (DR)
- Kubernetes
- Linux OS
- Incident Management
Skills & topics
- Database Reliability Engineer
- SRE
- DBA
- MySQL
- PostgreSQL
- SQL Server
- AWS
- Azure
- GCP
- Terraform
- Python
- Linux
- Performance Tuning
- High Availability
- Disaster Recovery
- Kubernetes
- CI/CD
- Platform Engineering
- Multi-cloud
How to get hired
- Tailor your resume: Highlight multi-cloud experience, SRE principles, and database expertise. Quantify achievements in performance tuning and automation.
- Showcase IaC and automation skills: Emphasize experience with Terraform, scripting (Python, Bash), and CI/CD pipelines.
- Demonstrate cloud platform knowledge: Detail your experience with AWS, Azure, and GCP database services (RDS, Aurora, Cosmos DB, SQL Database, Cloud SQL).
- Prepare for technical interviews: Be ready to discuss complex troubleshooting scenarios, HA/DR strategies, and performance optimization techniques.
- Understand SRE culture: Research Rackspace Technology's commitment to reliability, innovation, and customer success.
Frequently asked questions
- What are the primary databases managed by the Staff Database Platform Reliability Engineer at Rackspace Technology?
  The Staff Database Platform Reliability Engineer at Rackspace Technology primarily manages MySQL databases, with secondary support for PostgreSQL and SQL Server. This role focuses on ensuring their reliability, performance, and scalability on multi-cloud platforms.
- What cloud platforms does this role involve supporting for the DBaaS platform?
  This role involves supporting the DBaaS platform across major cloud providers: Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP). Strong experience with at least two of these is required.
- What level of experience is required for the Staff Database Platform Reliability Engineer role?
  We are looking for experienced professionals with 8-10+ years in Database Administration (DBA) or Platform Engineering. This experience should include significant work with multi-cloud environments and SRE principles.
- Does Rackspace Technology offer opportunities for professional development for this role?
  Rackspace Technology is committed to attracting and developing world-class talent. While specific development programs aren't detailed here, the company's culture emphasizes continuous learning and innovation, suggesting ample opportunities for growth in advanced cloud and database technologies.
- What are the key SRE principles relevant to this DBRE role?
  Key SRE principles for this role include robust incident management, implementing effective monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog), defining and adhering to Service Level Objectives (SLOs) and Agreements (SLAs), and ensuring comprehensive disaster recovery planning for high availability.
- How important is automation in this Staff Database Platform Reliability Engineer position?
  Automation is crucial. An 'automation-first mindset' is explicitly sought, with experience in Infrastructure as Code tools like Terraform, scripting languages (Python, Bash), and CI/CD practices being highly valued for automating database deployment, patching, and scaling.
- What kind of performance tuning expertise is expected for this role?
  Deep expertise in performance tuning is essential. This includes query optimization, implementing effective indexing strategies, and general database performance troubleshooting to ensure the DBaaS platform runs efficiently.