Senior DevOps Engineer @ TextLayer
About TextLayer
TextLayer helps enterprises and funded startups deploy advanced AI systems without rewriting their infrastructure. We bridge the gap between AI potential and practical implementation in sectors like fintech and healthtech.
The Role
The Senior DevOps Engineer will architect production-grade monitoring, logging, and tracing systems for AI workloads. This role includes implementing OpenTelemetry pipelines, building deployment workflows with Infrastructure as Code, and creating resilient observability solutions for LLM applications and conversational AI systems.
Key Responsibilities
- Design and maintain OpenTelemetry-based observability infrastructure.
- Build and scale ELK stack deployments for log aggregation and visualization.
- Implement tracing and monitoring for LLM inference and AI workflows.
- Develop data ingestion pipelines for high-volume telemetry data.
- Configure and optimize OpenSearch clusters for real-time analytics.
- Deploy and manage observability platforms like Langfuse and OpenLLMetry.
- Implement IaC using Terraform, CloudFormation, and similar tools.
- Build automated alerting and incident response systems.
- Collaborate with engineering teams to ensure services are properly instrumented for telemetry.
- Optimize data retention, indexing strategies, and query performance.
What You Will Bring
You bring deep expertise in observability infrastructure, hands-on experience with OpenTelemetry and the ELK stack, and a passion for scaling AI workloads. Strong skills in IaC, container orchestration, and scripting are required.
Required Qualifications
- 4+ years in DevOps/Infrastructure engineering with a focus on observability.
- Expert-level experience with OpenTelemetry implementation and customization.
- Production experience with the ELK stack and cluster management.
- Strong knowledge of distributed tracing, metrics collection, and log aggregation.
- Experience with container orchestration (Kubernetes, Docker) and cloud platforms (AWS/GCP/Azure).
- Proficiency in IaC tools like Terraform, Ansible, and CloudFormation.
- Experience with high-throughput data ingestion and real-time analytics systems.
- Strong scripting skills in Python and Bash.
- Knowledge of observability best practices, SLIs/SLOs, and incident response.
- Familiarity with monitoring tools like Prometheus, Grafana, or Datadog.
Bonus Points
- Experience with LLMOps observability tools (Langfuse, LiteLLM, etc.).
- Proficiency in Golang, Rust, or C/C++.
- Knowledge of AI/ML system monitoring patterns and telemetry.
- Experience with OpenSearch, ClickHouse, and conversational AI analytics.
- Contributions to open-source observability or LLMOps projects.
- Familiarity with eval-driven development and automated AI system testing frameworks.
Key Skills and Competencies
- OpenTelemetry
- ELK
- Observability
- Monitoring
- Infrastructure as Code
- Kubernetes
- Terraform
- Scripting
- Telemetry
- AI Workloads
How to Get Hired at TextLayer
🎯 Tips for Getting Hired
- Research TextLayer's culture: Explore its mission, values, and projects.
- Tailor your resume: Highlight DevOps and observability achievements.
- Showcase technical skills: Demonstrate OpenTelemetry and IaC experience.
- Practice interview questions: Prepare for technical and behavioral queries.