Site Reliability Engineer (SRE)

Job Type

Full Time

Experience

3+ years

Location

Remote

Job Description

We are seeking a skilled and passionate Site Reliability Engineer to join our engineering team. In this role, you will bridge the gap between software development and operations, ensuring our systems are scalable, reliable, and performant. You will apply software engineering principles to infrastructure and operations challenges, driving automation, observability, and continuous improvement across our platform.

Key Responsibilities

  • Design, build, and maintain scalable, highly available infrastructure and services

  • Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets

  • Lead incident response efforts, including on-call rotations, root cause analysis, and post-mortem reviews

  • Develop and maintain automation tools and scripts to eliminate toil and improve operational efficiency

  • Partner with development teams to embed reliability practices into the software development lifecycle

  • Build and improve monitoring, alerting, and observability frameworks across the stack

  • Evaluate and implement infrastructure-as-code (IaC) solutions to manage cloud environments

  • Conduct capacity planning and performance tuning to support business growth

  • Identify and drive improvements to deployment pipelines and CI/CD processes

  • Champion a culture of reliability, blameless post-mortems, and continuous learning

Qualifications

  • 3–6 years of experience in site reliability engineering, DevOps, or a related field

  • Strong proficiency in at least one programming or scripting language (Python, Go, Bash, or similar)

  • Hands-on experience with cloud platforms (AWS, Azure, or GCP)

  • Experience with containers and container orchestration (Docker, Kubernetes)

  • Proficiency with infrastructure-as-code tools (Terraform, Ansible, or CloudFormation)

  • Solid understanding of networking concepts (DNS, TCP/IP, load balancing, CDNs)

  • Experience with CI/CD pipelines and DevOps tooling (Jenkins, GitHub Actions, ArgoCD, etc.)

  • Strong troubleshooting and problem-solving skills in distributed systems environments

Preferred Qualifications

  • Relevant certifications such as AWS Solutions Architect, CKA (Certified Kubernetes Administrator), or Microsoft Azure Administrator

  • Experience with observability tools such as Datadog, Prometheus, Grafana, or New Relic

  • Familiarity with service mesh technologies (Istio, Linkerd)

  • Experience in a regulated industry (financial services, healthcare, etc.)

  • Background in software development or systems engineering

  • Knowledge of chaos engineering principles and tools (e.g., Gremlin, Chaos Monkey)
