Coupang

Staff Reliability Engineer

Reposted 9 Days Ago
In-Office
Bengaluru, Bengaluru Urban, Karnataka
Expert/Leader
This role focuses on developing and maintaining Grafana dashboards, managing monitoring systems, automating processes, and collaborating across teams to ensure IT service reliability.
The summary above was generated by AI

Company Introduction

We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did we ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we are collectively disrupting the multi-billion-dollar commerce industry from the ground up and establishing an unparalleled reputation for being a leading and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.

Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview: 
 
To keep Coupang's IT services stable, the IT Reliability Engineering team operates monitoring systems and processes for IT infrastructure and applications. The team is responsible for ensuring and improving monitoring visibility. When an event or incident occurs, the team collaborates with the engineering team to resolve it and manages the relevant metrics. To ensure continuity of service, the team regularly conducts disaster recovery (DR) tests.

Key Responsibilities:

Strategic Vision & Leadership
  • Define and drive the observability strategy and roadmap, aligning with business and technology goals.
  • Establish a mature observability framework covering infrastructure, network, applications, and end-user experience.
  • Advocate for observability best practices across engineering, operations, and product teams.

Monitoring & Tool Implementation
  • Lead the design, implementation, and optimization of observability platforms (e.g., Prometheus, Grafana, Datadog, New Relic, Splunk).
  • Evaluate and onboard new tools and technologies to enhance visibility and telemetry across systems.
  • Ensure scalable and resilient monitoring architectures are in place for hybrid and cloud-native environments.
 
Gap Analysis & Continuous Improvement
 
  • Conduct gap assessments in existing monitoring setups and identify areas for improvement.
  • Implement automated solutions to address low-hanging fruit and reduce manual overhead.
  • Continuously refine monitoring configurations to improve signal-to-noise ratio and reduce alert fatigue.

End-to-End Observability
  • Build and maintain end-to-end visibility across infrastructure, network, applications, and user journeys.
  • Integrate observability tools with incident management, ticketing, and reporting systems.
  • Develop and enforce tagging strategies, metrics standards, and log enrichment practices.

Collaboration & Enablement
  • Partner with DevOps, SRE, and application teams to embed observability into CI/CD pipelines and development workflows.
  • Provide technical guidance and training to teams on observability tools and practices.
  • Support incident response and post-mortem analysis with automated diagnostics and telemetry insights.

Data-Driven Insights
  • Leverage observability data to generate actionable insights for performance tuning, capacity planning, and reliability engineering.
  • Create dashboards and reports that provide meaningful visibility to stakeholders at all levels.
 
Qualifications:

Observability & Monitoring Tools
 
  • Prometheus, Grafana, Zabbix, SolarWinds
  • Datadog, New Relic, Dynatrace, Splunk, Helix
  • OpenTelemetry (for standardized telemetry collection)

Infrastructure & Automation

  • Terraform, Ansible, Puppet, Chef (IaC tools)
  • Scripting languages: Python, Bash, PowerShell
  • REST APIs: Experience integrating and automating observability tools via APIs
 
Cloud & Container Platforms
 
  • AWS, Azure, Google Cloud Platform
  • Kubernetes and Docker (monitoring containerized environments)
  • Cloud-native monitoring tools: CloudWatch, Azure Monitor, GCP Operations Suite
 
CI/CD & DevOps Tooling
  • Jenkins, GitLab CI, GitHub Actions
  • Git (version control)
  • Integration of observability into CI/CD pipelines
 
Data Analysis & Visualization
 
  • Experience with metrics, logs, and traces
  • Building dashboards and custom visualizations
  • Familiarity with SQL or time-series databases (e.g., InfluxDB, TimescaleDB)
 
Alerting & Incident Management
 
  • Tools like PagerDuty, xMatters, VictorOps, ServiceNow, Jira, Helix
  • Knowledge of alert tuning, event correlation, and automated diagnostics
 
Architecture & Design
  • Understanding of distributed systems, microservices, and network protocols
  • Ability to design scalable observability architectures
 
Preferred Qualifications:
 
  • 15+ years of hands-on experience in monitoring, observability, and infrastructure operations.
  • Proven track record of designing and implementing observability platforms in complex environments.
  • Experience in gap analysis and optimization of monitoring setups across infrastructure, network, applications, and end-user layers.
  • Strong background in SRE.
  • Deep expertise in observability tools (Prometheus, Grafana, Dynatrace, etc.)
  • Strong skills in Infrastructure as Code, automation scripting, and API integrations.
  • Familiarity with cloud-native architectures and microservices.
  • Experience integrating observability into CI/CD pipelines and incident management workflows.
 
Soft Skills
 
  • Strategic thinker with a vision for mature observability practices.
  • Excellent communication and collaboration skills to work across teams.
  • Ability to mentor and guide teams on observability principles and tooling.

Type of work:

· Hybrid - Coupang's hybrid work model is designed to enable a culture of collaboration that acts as a catalyst to enrich the experience of employees. Employees are required to work at least 3 days in the office per week, with the flexibility to work from home 2 days a week, depending on the role requirement. Some businesses may require more time in the office due to the nature of the work.


Details to consider

· Those eligible for employment protection (recipients of veteran’s benefits, the disabled, etc.) may receive preferential treatment for employment in accordance with applicable laws.

Privacy Notice

· Your personal information will be collected and managed by Coupang as stated in the Application Privacy Notice located below: https://www.coupang.jobs/privacy-policy/




Top Skills

Ansible
AWS
Azure
CI/CD Tools
CloudWatch
Docker
Elasticsearch
GCP
Go
Grafana
InfluxDB
Kubernetes
Loki
Prometheus
Python
Shell
Terraform
Zabbix
