Senior DevOps Engineer @ Textlayer
Job Details
About Textlayer
Textlayer helps enterprises and funded startups deploy advanced AI systems without rewriting their infrastructure. We work with organizations across fintech, healthtech, and other sectors to bridge the gap between AI potential and practical implementation.
Our approach combines deep technical expertise with proven frameworks like TextLayer Core to accelerate development and ensure production-ready results. We support bespoke AI workflows and agentic systems, enabling clients to adopt AI within their existing tech stacks.
We are on a mission to address the implementation gap faced by over 85% of enterprise clients adding AI to their operations and products.
The Role
The Senior DevOps Engineer will architect production-grade monitoring, logging, and tracing systems designed specifically for AI workloads. This includes implementing OpenTelemetry-based data collection pipelines, robust deployment workflows using IaC, and resilient observability solutions that provide deep insight into LLM applications and conversational AI systems.
Key Responsibilities
- Design and maintain OpenTelemetry-based observability infrastructure for distributed AI systems and LLM applications.
- Build and scale ELK stack deployments for log aggregation and visualization.
- Implement comprehensive tracing and monitoring solutions for LLM inference, RAG pipelines, and AI Agent workflows.
- Develop and maintain data ingestion pipelines for high-volume telemetry data.
- Configure and optimize OpenSearch clusters for real-time analytics and trace reconstruction.
- Deploy and manage LLM observability platforms such as Langfuse and OpenLLMetry.
- Implement Infrastructure as Code using Terraform, CloudFormation, and Ansible for reproducible deployments.
- Build automated alerting and incident response systems for AI application performance and reliability.
- Collaborate with engineering teams to instrument AI applications with telemetry and observability hooks.
- Optimize data retention, indexing strategies, and query performance for large-scale observability data.
What You Will Bring
You should have deep expertise in observability infrastructure, hands-on experience with OpenTelemetry and the ELK stack, and sound knowledge of AI/ML system monitoring challenges. A passion for scalable, reliable infrastructure and a proactive approach to automation and incident management are essential.
Required Qualifications
- 4+ years of DevOps/Infrastructure engineering experience focusing on observability and monitoring.
- Expert-level experience with OpenTelemetry implementation and configuration.
- Production experience with ELK stack including cluster management and optimization.
- Strong knowledge of distributed tracing, metrics collection, and log aggregation architectures.
- Experience with container orchestration (Kubernetes, Docker) and cloud infrastructure (AWS/GCP/Azure).
- Proficiency with Infrastructure as Code tools (Terraform, Ansible, CloudFormation).
- Experience building high-throughput data ingestion pipelines and real-time analytics systems.
- Strong scripting skills (Python, Bash/Sh) for automation and tooling.
- Knowledge of observability best practices, SLI/SLO definitions, and incident response.
- Experience with monitoring tools like Prometheus, Grafana, or Datadog.
Bonus Points
- Experience with LLMOps observability tools such as Langfuse and LiteLLM.
- Experience with Golang, Rust, or C/C++.
- Knowledge of AI/ML system monitoring patterns and LLM application telemetry.
- Experience with OpenSearch and ClickHouse for analytics workloads.
- Familiarity with conversational AI analytics and trace reconstruction techniques.
- Experience instrumenting LLM applications, RAG systems, or AI Agent workflows.
- Background in time-series databases and vector search optimization.
- Contributions to open-source observability or LLMOps projects.
- Knowledge of eval-driven development and automated AI system testing.
Employment Details
Employment Type: Full Time
Location: Remote - Canada
Compensation: $200,000 - $220,000 CAD base salary
Start Date: Flexible; immediate start preferred
How to Apply
Apply directly via our portal: Apply Here
Key Skills and Competencies
- DevOps
- Observability
- OpenTelemetry
- ELK
- IaC
- Monitoring
- Automation
- Cloud
- Data Ingestion
- Tracing
How to Get Hired at Textlayer
🎯 Tips for Getting Hired
- Research Textlayer's culture: Study their mission, values, and projects online.
- Customize your resume: Highlight DevOps and observability skills.
- Showcase relevant projects: Demonstrate AI systems and IaC expertise.
- Prepare technical answers: Focus on OpenTelemetry and ELK experience.