
ML Operations Engineer - Associate Vice President
Citi · Irving, TX
- On site
- Full-time
- $160,000 / year
Job highlights
- Operate and scale AI/ML applications for Citi.
- Build and automate robust ML pipelines.
- Manage ML lifecycle with MLflow and Ray Tune.
- Deploy models using Docker and Kubernetes.
- Integrate with data platforms and ensure reliability.
About the role
We are seeking an experienced MLOps Engineer to join our DevOps and Infrastructure Engineering team. This role is crucial for operationalizing, scaling, and maintaining our Artificial Intelligence (AI) and Machine Learning (ML) applications. The successful candidate will leverage their expertise to ensure seamless, scalable, and reliable deployment and management of AI/ML models, working closely with data scientists and ML engineers. This position requires strong proficiency in Python, hands-on experience with Ray Tune for hyperparameter optimization, and MLflow for experiment tracking and model lifecycle management.
Key Responsibilities:
- ML Pipeline Development & Automation: Design, build, and maintain robust and scalable end-to-end ML pipelines for data ingestion, preprocessing, model training, validation, and deployment.
- CI/CD for ML: Implement and manage Continuous Integration/Continuous Delivery (CI/CD) pipelines specifically tailored for machine learning workflows, ensuring automated testing, versioning, and deployment of ML artifacts.
- Experiment Tracking & Model Management: Utilize MLflow extensively for experiment tracking, reproducible runs, managing model versions, and maintaining a centralized model registry.
- Hyperparameter Optimization: Leverage Ray Tune for efficient and distributed hyperparameter optimization to enhance model performance and accelerate experimentation.
- Containerization & Orchestration: Package ML models and their dependencies using Docker and deploy/manage them effectively on Kubernetes clusters.
- Data Platform Integration: Integrate with and optimize existing data platforms, including Apache Iceberg, Apache Spark, and Apache Flink, to ensure efficient data processing and feature engineering for ML models.
- Data Storage & Streaming: Work with PostgreSQL, Oracle, and MongoDB for diverse data storage needs, and utilize Kafka for real-time data streaming to support various ML applications.
- Monitoring & Observability: Implement comprehensive monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana) for ML models in production, tracking model performance, data drift, and infrastructure health to ensure reliability and facilitate automated retraining or rollback.
- Scripting & Automation: Develop automation scripts and tools using Python and Bash/Go to streamline MLOps processes and integrate various systems.
- Collaboration: Act as a vital link between data scientists, ML engineers, and infrastructure teams, facilitating clear communication and ensuring that ML solutions are production-ready.
Required Qualifications:
- Experience: 3-5 years of hands-on experience in an MLOps, DevOps, or Machine Learning Engineering role, with a proven track record of deploying and managing ML models in production environments.
- Programming: Expert-level proficiency in Python for ML development, scripting, and automation.
- MLOps Tooling: Demonstrated hands-on experience with Ray Tune for hyperparameter optimization and MLflow for experiment tracking and model management.
- Containerization & Orchestration: Strong experience with Docker and Kubernetes (including Helm).
- CI/CD: Experience implementing CI/CD practices for software and/or ML pipelines.
- Data Technologies: Familiarity or experience with Apache Spark, Apache Iceberg, Apache Flink, and Kafka.
- Databases: Experience with PostgreSQL, Oracle, and MongoDB.
- Workflow Orchestration: Experience with Apache Airflow.
- Infrastructure as Code: Experience with HashiCorp Terraform.
- Operating Systems: Proficiency in Linux/Unix environments.
Desirable Skills:
- Experience with cloud platforms (AWS, Azure, GCP) and managing cloud-native ML infrastructure.
- Knowledge of deep learning frameworks such as TensorFlow or PyTorch.
- Experience with generative AI technologies (e.g., LLMs, prompt engineering, RAG pipelines).
- Understanding of distributed computing and big data processing techniques.
Key skills/competencies
- MLOps
- DevOps
- Machine Learning Engineering
- Python
- Ray Tune
- MLflow
- Docker
- Kubernetes
- CI/CD
- Apache Spark
How to get hired
- Tailor your resume: Highlight your Python, MLOps tools (MLflow, Ray Tune), Docker, and Kubernetes experience.
- Showcase project impact: Quantify achievements in ML pipeline automation, CI/CD, and model deployment.
- Prepare for technical questions: Review MLOps concepts, containerization, orchestration, and data technologies.
- Demonstrate collaboration: Be ready to discuss how you've worked with data scientists and engineers.
- Research Citi's values: Understand their commitment to technology and innovation in finance.
Frequently asked questions
- What specific experience does Citi look for in an ML Operations Engineer?
  - Citi seeks candidates with 3-5 years of hands-on experience in MLOps, DevOps, or Machine Learning Engineering. Key requirements include expert Python proficiency, experience with MLOps tooling like Ray Tune and MLflow, and strong skills in Docker and Kubernetes for containerization and orchestration.
- How important are data platform and database skills for this role at Citi?
  - Experience with data platforms such as Apache Spark, Apache Iceberg, and Apache Flink, along with databases like PostgreSQL, Oracle, and MongoDB, is highly valued. Familiarity with Kafka for streaming is also important for supporting various ML applications.
- What are the primary responsibilities of an ML Operations Engineer at Citi?
  - The primary responsibilities include designing and automating ML pipelines, implementing CI/CD for ML, managing the ML model lifecycle using tools like MLflow, optimizing hyperparameters with Ray Tune, and deploying models using Docker and Kubernetes.
- Does Citi require experience with cloud platforms for this ML Operations Engineer position?
  - While not strictly required, experience with cloud platforms like AWS, Azure, or GCP and managing cloud-native ML infrastructure is considered a desirable skill for this role at Citi.
- What programming languages are essential for the ML Operations Engineer role at Citi?
  - Expert-level proficiency in Python is essential for ML development, scripting, and automation. Experience with Bash/Go for scripting and automation is also beneficial.
- How does Citi ensure the reliability of deployed ML models?
  - Citi implements comprehensive monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana) for ML models in production. This includes tracking model performance, data drift, and infrastructure health to ensure reliability and enable automated retraining or rollback.
- What is the role of an ML Operations Engineer in collaborating with other teams at Citi?
  - The ML Operations Engineer acts as a vital link between data scientists, ML engineers, and infrastructure teams. They facilitate clear communication and ensure that ML solutions are production-ready, contributing to the overall success of AI/ML initiatives.
- What opportunities are there for learning new technologies in this role at Citi?