SatSure

ML Ops Engineer

SatSure · Bengaluru, Karnataka, India

  • On site
  • Full-time
  • ₹1,800,000 / year

Job highlights

  • Bridge data science and operations with MLOps expertise.
  • Design and build CI/CD pipelines for ML models.
  • Deploy and optimize Kubernetes for deep learning workloads.
  • Containerize and deploy PyTorch models efficiently.
  • Manage AWS cloud infrastructure for ML projects.

About the role

About SatSure

SatSure is a deep tech, decision intelligence company working at the nexus of agriculture, infrastructure, and climate action. We focus on creating impact for millions in the developing world by making insights from Earth observation data accessible to all. Join us to be at the forefront of building a deep tech company from India that solves global problems.

The Opportunity: ML Ops Engineer

We are seeking a dedicated ML Ops Engineer to join our team, bridging the critical gap between Data Science and Operations. This role requires specific expertise in applying DevOps principles to Machine Learning workflows, rather than a generalist software engineering background. You will be instrumental in building robust CI/CD pipelines for models and managing Kubernetes clusters for deep learning training and inference. You should also be fluent in the language and tools of Data Scientists, including PyTorch, tensors, and GPUs.

Roles & Responsibilities

  • ML Pipelines: Design and build CI/CD pipelines specifically for ML workflows, encompassing training triggers, model versioning, and rigorous testing, utilizing tools such as Jenkins, Bitbucket, or GitHub Actions.
  • Orchestration: Deploy, configure, and optimize Kubernetes clusters to seamlessly support containerized deep learning applications, with a focus on managing GPU resources and implementing effective node scaling strategies.
  • Model Serving: Collaborate closely with Data Scientists to containerize and efficiently deploy PyTorch models using Docker and industry-standard serving frameworks like KServe or NVIDIA Triton Inference Server.
  • Infrastructure: Manage core cloud infrastructure on AWS for robust data processing and secure model storage, leveraging services such as S3, ECR, and IAM.
  • GitOps: Implement and maintain GitOps practices to streamline the entire lifecycle management of both infrastructure and ML configurations.
  • Monitoring: Establish comprehensive monitoring systems for both system health (CPU/Memory) and critical ML metrics like Model Drift and Performance, using tools such as Prometheus, Grafana, or ELK.
  • Automation: Automate repetitive tasks related to dataset management and environment setup, primarily using Python.

Qualifications

  • 1 to 3 years of relevant experience as an ML Engineer, MLOps Engineer, or Platform Engineering professional.
  • Mandatory: Functional understanding of Machine Learning/Deep Learning concepts and hands-on experience with the PyTorch framework.
  • Mandatory: Prior experience working with Kubernetes and CI/CD in a production environment.
  • Bachelor’s degree in Computer Science, IT, or a related field. Non-IT degrees with directly relevant experience will also be considered.

Must-have Skills

  • Core MLOps: Proven practical experience in deploying ML/DL models in production systems, demonstrating a clear understanding of the distinctions between deploying traditional web applications and deep learning models.
  • Kubernetes: Strong hands-on experience with K8s, including deployments, services, and ingress, with a preference for experience scheduling GPU workloads.
  • CI/CD & GitOps: Proficiency in building robust pipelines (e.g., Jenkins/Bitbucket) and a solid understanding of GitOps workflows (e.g., ArgoCD/Flux).
  • ML Fundamentals: Working knowledge of PyTorch and Python, including the ability to read model code, comprehend training/inference loops, optimize PyTorch models, and debug environment issues (e.g., CUDA, dependencies).
  • Containerization: Expert-level Docker skills, including multi-stage builds and strategies for reducing image sizes, especially for large ML dependencies.
  • Cloud: Hands-on experience with essential AWS services such as EC2, S3, and ECR.
  • Linux/Scripting: Strong command of Linux internals and shell scripting.

Good-to-have

  • Experience with ML workflow tools like KServe, Triton Inference Server, and MLflow.
  • Experience profiling and optimizing PyTorch models for production inference on accelerator platforms such as NVIDIA GPUs, TPUs, and AWS Inferentia.
  • Background in processing Geospatial or Remote Sensing data.

Competencies

  • Technical Translator: Ability to effectively understand requirements from Data Scientists and translate them into robust, scalable infrastructure components.
  • Debugging: Excellent troubleshooting skills for complex distributed systems, including diagnosing issues like pod crashes during inference.
  • Collaboration: Strong communication skills to foster effective teamwork within a cross-functional environment.

Benefits

  • Medical Health Cover for you and your family, including unlimited online doctor consultations.
  • Access to mental health experts for you and your family.
  • Dedicated allowances for learning and skill development.
  • Comprehensive leave policy with casual leaves, paid leaves, marriage leaves, and bereavement leaves.

Interview Process

  • Intro call
  • Assessment (Focus on Kubernetes/Docker/ML Deployment)
  • Interview rounds (ideally up to 3 rounds)
  • Culture Round / HR round

Key skills/competency

  • MLOps
  • Kubernetes
  • CI/CD
  • PyTorch
  • Docker
  • AWS
  • GitOps
  • Machine Learning
  • Deep Learning
  • Python

Skills & topics

  • ML Ops Engineer
  • MLOps
  • Kubernetes
  • CI/CD
  • PyTorch
  • Docker
  • AWS
  • Machine Learning
  • Deep Learning
  • Python
  • GitOps
  • Model Serving
  • Cloud Infrastructure
  • Data Science
  • Automation
  • Monitoring
  • Jenkins
  • Bitbucket
  • GitHub Actions
  • S3

How to get hired

  • Research SatSure's culture: Study their mission, values, recent news, and employee testimonials on LinkedIn and Glassdoor.
  • Tailor your resume: Customize your resume to highlight MLOps, Kubernetes, PyTorch, and AWS experience, aligning with SatSure's requirements.
  • Showcase ML project experience: Prepare to discuss practical examples of deploying ML/DL models in production environments.
  • Master Kubernetes and CI/CD: Demonstrate strong hands-on expertise in Kubernetes and building CI/CD pipelines during technical assessments.
  • Understand ML fundamentals: Be ready to discuss PyTorch model optimization, training/inference loops, and environment debugging.

Technical preparation

  • Review Kubernetes GPU scheduling.
  • Practice Docker multi-stage builds.
  • Optimize PyTorch model inference.
  • Study AWS S3, ECR, IAM.

Behavioral questions

  • Describe complex system debugging.
  • Share a data scientist collaboration example.
  • Explain translating ML needs to infra.
  • Discuss a challenging automation project.

Frequently asked questions

What kind of projects does an ML Ops Engineer at SatSure work on?
As an ML Ops Engineer at SatSure, you will primarily work on projects involving Earth observation data, particularly at the intersection of agriculture, infrastructure, and climate action. Your role will focus on building and maintaining the infrastructure and pipelines that enable data scientists to deploy and manage deep learning models for these critical applications.

What specific cloud platform experience is essential for the ML Ops Engineer role at SatSure?
For the ML Ops Engineer position at SatSure, strong experience with AWS services is essential. You should be proficient with EC2 for compute, S3 for storage, and ECR for container registries, as these form the backbone of their cloud infrastructure for data processing and model storage.

How important is PyTorch knowledge for this ML Ops Engineer role at SatSure?
PyTorch knowledge is mandatory for the ML Ops Engineer role at SatSure. You're expected to have a functional understanding of the framework, be able to read and understand model code, optimize PyTorch models, and debug related environment issues like CUDA dependencies.

What is the interview process like for an ML Ops Engineer at SatSure?
The interview process for the ML Ops Engineer at SatSure typically begins with an introductory call. This is followed by a technical assessment focused on Kubernetes, Docker, and ML deployment. Subsequently, there will be up to three interview rounds, concluding with a culture or HR round.

Does SatSure support learning and development for ML Ops Engineers?
Yes, SatSure is committed to employee growth. They offer dedicated allowances for learning and skill development, ensuring ML Ops Engineers have resources to stay updated with the latest technologies and advance their expertise in the field.

What specific containerization skills are expected for this role at SatSure?
For the ML Ops Engineer role at SatSure, expert-level Docker skills are required. This includes proficiency in multi-stage builds and techniques for reducing image sizes, especially crucial when dealing with large machine learning dependencies.

What is SatSure's approach to health and well-being for employees?
SatSure provides comprehensive health and well-being benefits, including medical health cover for employees and their families. This also extends to unlimited online doctor consultations and access to mental health experts, reflecting a holistic approach to employee welfare.