Senior DevOps Engineer
@ Textlayer

Hybrid
CA$210,000
Contractor
Posted 23 days ago

Job Details

About Textlayer

Textlayer helps enterprises and funded startups deploy advanced AI systems without rewriting their infrastructure. We work with organizations across fintech, healthtech, and other sectors to bridge the gap between AI potential and practical implementation.

Our approach combines deep technical expertise with proven frameworks like TextLayer Core to accelerate development and ensure production-ready results. We support bespoke AI workflows and agentic systems, enabling clients to adopt AI within their existing tech stacks.

We are on a mission to address the implementation gap faced by over 85% of enterprise clients adding AI to their operations and products.

The Role

The Senior DevOps Engineer will architect production-grade monitoring, logging, and tracing systems designed specifically for AI workloads. This includes implementing OpenTelemetry-based data collection pipelines, robust deployment workflows using Infrastructure as Code (IaC), and resilient observability solutions that provide deep insight into LLM applications and conversational AI systems.

Key Responsibilities

  • Design and maintain OpenTelemetry-based observability infrastructure for distributed AI systems and LLM applications.
  • Build and scale ELK stack deployments for log aggregation and visualization.
  • Implement comprehensive tracing and monitoring solutions for LLM inference, RAG pipelines, and AI Agent workflows.
  • Develop and maintain data ingestion pipelines for high-volume telemetry data.
  • Configure and optimize OpenSearch clusters for real-time analytics and trace reconstruction.
  • Deploy and manage LLM observability platforms such as Langfuse and OpenLLMetry.
  • Implement Infrastructure as Code using Terraform, CloudFormation, and Ansible for reproducible deployments.
  • Build automated alerting and incident response systems for AI application performance and reliability.
  • Collaborate with engineering teams to instrument AI applications with telemetry and observability hooks.
  • Optimize data retention, indexing strategies, and query performance for large-scale observability data.

What You Will Bring

You should have deep expertise in observability infrastructure, hands-on experience with OpenTelemetry and the ELK stack, and sound knowledge of AI/ML system monitoring challenges. A passion for scalable, reliable infrastructure and a proactive approach to automation and incident management are essential.

Required Qualifications

  • 4+ years of DevOps/Infrastructure engineering experience focusing on observability and monitoring.
  • Expert-level experience with OpenTelemetry implementation and configuration.
  • Production experience with ELK stack including cluster management and optimization.
  • Strong knowledge of distributed tracing, metrics collection, and log aggregation architectures.
  • Experience with container orchestration (Kubernetes, Docker) and cloud infrastructure (AWS/GCP/Azure).
  • Proficiency with Infrastructure as Code tools (Terraform, Ansible, CloudFormation).
  • Experience building high-throughput data ingestion pipelines and real-time analytics systems.
  • Strong scripting skills (Python, Bash/Sh) for automation and tooling.
  • Knowledge of observability best practices, SLI/SLO definitions, and incident response.
  • Experience with monitoring tools like Prometheus, Grafana, or Datadog.

Bonus Points

  • Experience with LLMOps observability tools such as Langfuse and LiteLLM.
  • Experience with Golang, Rust, or C/C++.
  • Knowledge of AI/ML system monitoring patterns and LLM application telemetry.
  • Experience with OpenSearch and ClickHouse for analytics workloads.
  • Familiarity with conversational AI analytics and trace reconstruction techniques.
  • Experience instrumenting LLM applications, RAG systems, or AI Agent workflows.
  • Background in time-series databases and vector search optimization.
  • Contributions to open-source observability or LLMOps projects.
  • Knowledge of eval-driven development and automated AI system testing.

Employment Details

Employment Type: Full Time

Location: Remote - Canada

Compensation: $200,000 - $220,000 CAD base salary

Start Date: Flexible, but an immediate start is preferred

How to Apply

Apply directly via our portal: Apply Here

Key skills/competencies

  • DevOps
  • Observability
  • OpenTelemetry
  • ELK
  • IaC
  • Monitoring
  • Automation
  • Cloud
  • Data Ingestion
  • Tracing

How to Get Hired at Textlayer

🎯 Tips for Getting Hired

  • Research Textlayer's culture: Study their mission, values, and projects online.
  • Customize your resume: Highlight DevOps and observability skills.
  • Showcase relevant projects: Demonstrate AI systems and IaC expertise.
  • Prepare technical answers: Focus on OpenTelemetry and ELK experience.

📝 Interview Preparation Advice

Technical Preparation

  • Review OpenTelemetry documentation and best practices.
  • Practice ELK stack deployment and cluster tuning.
  • Brush up on Terraform and CloudFormation scripts.
  • Study container orchestration basics with Kubernetes.

Behavioral Questions

  • Describe a challenging project and your role.
  • Explain problem-solving in team settings.
  • Discuss teamwork under tight deadlines.
  • Share how you manage stress and prioritize tasks.
