Site Reliability Engineer, Retail & Banking Technology
ING Hubs Romania
Job Description
About ING Hubs Romania
ING Hubs Romania is a powerhouse delivering 130 services across software development, data management, non-financial risk & compliance, audit, and retail operations. Serving 24 ING units globally, the team comprises over 2000 high-performing engineers and professionals. Established in 2015 as ING's software development hub, it has expanded its competencies significantly, operating from Bucharest and Cluj-Napoca with over 1800 colleagues dedicated to Data and Analytics Tech, Tech Foundation and Channels, Retail Core Banking and Architecture, and Global Products and Technology Services.
The organization fosters a flexible and highly collaborative environment that encourages fair and constructive feedback. Impact is a core driver, with a shared desire to deliver innovative solutions and make a positive difference.
The Mission: Empowering Digital Banking
ING aims to be Europe's leading digital banking brand, providing an empowering, personalized, and differentiated customer experience. The Site Reliability Engineer, Retail & Banking Technology will be instrumental in transforming working methodologies to achieve this ambition.
The R&BT SRE Team
The Retail & Banking Technology (R&BT) Site Reliability Engineering (SRE) team is a multidisciplinary group of senior engineers specializing in development and operations across applications and infrastructure. Their main objective is to continuously and structurally enhance the reliability and maintainability of IT environments associated with R&BT Platforms, managed across various international ING domains.
- Objective: Site Reliability Engineering (SRE) enhances the reliability and scalability of BTP platform services through collaborative efforts, prioritizing availability, performance, efficiency, and observability.
- Measurement: SRE targets increased MTBF, decreased MTTR, and minimized operational toil.
- Approach: This is facilitated by automation, standardized procedures, and the adoption of SRE best practices.
- Cultivate a Reliability Mindset: The aim is to foster a culture of reliability throughout the BTP organization, encouraging proactive behaviors and attitudes.
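For readers less familiar with the two measurements named above, here is a minimal sketch of how MTBF (mean time between failures) and MTTR (mean time to restore) can be derived from a list of outages. The function name `mtbf_and_mttr`, the input shape, and the sample incidents are assumptions made for illustration; they do not describe how ING collects or reports these figures.

```python
from datetime import datetime, timedelta


def mtbf_and_mttr(incidents, window_start, window_end):
    """Return (MTBF, MTTR) as timedeltas for one observation window.

    incidents: list of (outage_start, outage_end) datetime pairs.
    Assumption: outages do not overlap and lie fully inside the window.
    """
    if not incidents:
        raise ValueError("at least one incident is needed to compute a mean")
    # Total time the service was down across all outages.
    downtime = sum((end - start for start, end in incidents), timedelta())
    # Everything else in the window counts as uptime.
    uptime = (window_end - window_start) - downtime
    count = len(incidents)
    return uptime / count, downtime / count


# Hypothetical sample: two outages (1 hour and 3 hours) in a 30-day window.
sample_incidents = [
    (datetime(2024, 1, 5, 10), datetime(2024, 1, 5, 11)),
    (datetime(2024, 1, 20, 2), datetime(2024, 1, 20, 5)),
]
mtbf, mttr = mtbf_and_mttr(
    sample_incidents, datetime(2024, 1, 1), datetime(2024, 1, 31)
)
print("MTBF:", mtbf)
print("MTTR:", mttr)
```

Longer MTBF and shorter MTTR both raise availability, which is why the two are tracked together.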
Your Day-to-Day Responsibilities
- Ensure Service Level Objective (SLO) levels are established and consistently met.
- Optimize Observability tooling, particularly Grafana dashboards, for enhanced insights.
- Report on Global SRE targets and Key Performance Indicators (KPIs).
- Conduct yearly Well-Architected Reviews and Observability Assessments for all critical components.
- Champion an 'Always Available' mindset and behavior within the R&BT organization, providing necessary resources, skills, guidance, and training to DevOps teams.
- Define and improve standards for logging, monitoring, and alerting, actively tracking end-to-end platform performance using white-box and black-box monitoring tools.
- Enhance incident response practices and actively engage in escalated and critical incidents. On-call duty is not currently required but may be introduced in the future.
- Participate in Root Cause Analysis, prioritizing and implementing recommendations through improvement plans with responsible Squads / DevOps teams.
- Track and trace actions derived from post-mortems and Emirs.
- Drive continuous improvement for all R&BT Platform services by analyzing service levels, functional/technical setup, code, DevOps practices, and incident root causes.
- Roll out new resilience features across the organization.
- Set up and maintain automatic reporting and feedback loops.
- Contribute to automating Build, Test, and Deployment practices via the CI/CD pipeline.
- Assist in tuning application resources and updating high-availability deployment patterns for container and VM-based environments.
- Initiate and contribute to new SRE initiatives such as AI Ops, Chaos Engineering, Public Cloud migrations, and Error Budgeting.
- Participate in and initiate experiments with new tools and concepts, evaluating their value against set goals.
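The SLO and Error Budgeting items in the list above can be made concrete with a small calculation. The sketch below assumes an availability SLO expressed as a successful call rate; the function names, the 99.9% target, and the call counts are hypothetical and are not taken from ING's tooling or targets.

```python
def call_success_rate(successful_calls, total_calls):
    """Availability measured as the share of calls that succeeded."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    return successful_calls / total_calls


def error_budget_remaining(slo_target, successful_calls, total_calls):
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target: required success rate, e.g. 0.999 for "three nines".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    # The budget is the number of failures the SLO tolerates in this period.
    allowed_failures = (1 - slo_target) * total_calls
    actual_failures = total_calls - successful_calls
    return 1 - actual_failures / allowed_failures


# Hypothetical month: 1,000,000 calls, 600 of which failed, against a 99.9% SLO.
rate = call_success_rate(999_400, 1_000_000)
remaining = error_budget_remaining(0.999, 999_400, 1_000_000)
print(f"success rate: {rate:.4%}, error budget left: {remaining:.0%}")
```

A team that has spent most of its budget would typically slow down risky releases and prioritize reliability work until the budget recovers.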
What You’ll Bring To The Team
As a Site Reliability Engineer, Retail & Banking Technology, you will be an operations expert with over 4 years of experience applying Agile DevOps principles. You will possess a solid understanding of how technology setup and ITSM processes impact service level objectives like Availability (time-based, successful call rate, response times), MTTR, and MTBF. A good grasp of microservices architecture, high availability/resilience patterns, and experience in building systems with multiple layers of redundancy to withstand failures is essential.
Proven Experience In:
- Working as a Site Reliability Engineer or DevOps Engineer.
- Scripting in Ruby, Python, Bash, or PowerShell.
- Setting up Build and Deployment pipelines in Azure DevOps (ADO).
- Establishing white-box monitoring and formulating meaningful metrics for Grafana and TraceING.
- Eliminating toil through automation and process optimization.
- Coordinating/leading incident response and Post-mortem / Root Cause Analysis activities.
- Understanding IT Service Management processes (ING Global Way of Working) and their relation to SRE objectives.
- Understanding Public Cloud concepts.
Prior Work Experience With Tools:
- CI/CD Pipeline: OnePipeline / Azure DevOps / Kingsroad.
- Cloud computing and container orchestration: Linux VMs and Kubernetes container platforms (OpenShift + AKS knowledge and certifications are a plus).
- Touchpoint service mesh and SDK/Merak.
- Logging/monitoring/alerting: Kafka, ELK, Prometheus, IAT. Experience with black-box monitoring tools like Rigor/Splunk and AI Ops tools like Loom is a plus.
- Backlog management: Azure Boards.
- ITSM: SNOW.
The Ideal Candidate Has:
- A Bachelor’s or Master’s degree in computer science or a related field.
- Experience coaching and training DevOps engineers on technical subjects.
- Previous experience as a consumer of R&BT Platforms, preferably Touchpoint Platform.
- Understanding of the ING application risk journey.
Key Skills/Competencies
- Site Reliability Engineering
- DevOps
- Automation
- Observability
- Incident Management
- Microservices
- Kubernetes
- Azure DevOps
- Scripting (Python, Bash)
- Monitoring & Alerting
How to Get Hired at ING Hubs Romania
- Research ING Hubs Romania's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
- Tailor your Site Reliability Engineer resume: Highlight SRE achievements, DevOps experience, and specific tools mentioned like Kubernetes, Azure DevOps, and Grafana.
- Showcase your technical expertise: Prepare to discuss your experience with microservices, incident response, and scripting (Python/Bash) relevant to the Site Reliability Engineer, Retail & Banking Technology role.
- Demonstrate a reliability mindset: Be ready to provide examples of how you've fostered an 'Always Available' approach and contributed to continuous improvement.
- Network within ING Hubs Romania: Connect with current employees on LinkedIn to gain insights and potentially learn about internal referrals for SRE positions.