Site Reliability Engineer (SRE)
Job Type
Full Time
Experience
3+ years
Location
Remote
Job Description
We are seeking a skilled and passionate Site Reliability Engineer to join our engineering team. In this role, you will bridge the gap between software development and operations, ensuring our systems are scalable, reliable, and performant. You will apply software engineering principles to infrastructure and operations challenges, driving automation, observability, and continuous improvement across our platform.
Key Responsibilities
Design, build, and maintain scalable, highly available infrastructure and services
Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets
Lead incident response efforts, including on-call rotations, root cause analysis, and post-mortem reviews
Develop and maintain automation tools and scripts to eliminate toil and improve operational efficiency
Partner with development teams to embed reliability practices into the software development lifecycle
Build and improve monitoring, alerting, and observability frameworks across the stack
Evaluate and implement infrastructure-as-code (IaC) solutions to manage cloud environments
Conduct capacity planning and performance tuning to support business growth
Identify and drive improvements to deployment pipelines and CI/CD processes
Champion a culture of reliability, blameless post-mortems, and continuous learning
Qualifications
3–6 years of experience in site reliability engineering, DevOps, or a related field
Strong proficiency in at least one programming or scripting language (Python, Go, Bash, or similar)
Hands-on experience with cloud platforms (AWS, Azure, or GCP)
Experience with containers and container orchestration (Docker, Kubernetes)
Proficiency with infrastructure-as-code tools (Terraform, Ansible, or CloudFormation)
Solid understanding of networking concepts (DNS, TCP/IP, load balancing, CDNs)
Experience with CI/CD pipelines and DevOps tooling (Jenkins, GitHub Actions, ArgoCD, etc.)
Strong troubleshooting and problem-solving skills in distributed systems environments
Preferred Qualifications
Relevant certifications such as AWS Certified Solutions Architect, CKA (Certified Kubernetes Administrator), or Microsoft Certified: Azure Administrator Associate
Experience with observability tools such as Datadog, Prometheus, Grafana, or New Relic
Familiarity with service mesh technologies (Istio, Linkerd)
Experience in a regulated industry (financial services, healthcare, etc.)
Background in software development or systems engineering
Knowledge of chaos engineering principles and tools (e.g., Gremlin, Chaos Monkey)