MontyCloud

Staff Engineer - Site Reliability Engineering

Posted 14 Days Ago
In-Office
Bengaluru, Karnataka, IND
Senior level
Lead reliability and operational excellence for a cloud-native SaaS platform, focusing on automation, AI-driven operations, and system scalability.
The summary above was generated by AI
Role Overview
MontyCloud is seeking a highly experienced Staff Site Reliability Engineer (SRE) to lead reliability, scalability, and operational excellence for our cloud-native, AI-driven SaaS platform. This role calls for strategic, organization-wide impact, combining deep expertise in distributed systems with modern practices in automation, observability, and AI-driven operations (AIOps). You will define reliability standards, influence system architecture, and build intelligent systems that enable engineering teams to operate efficiently and proactively.
As a Staff SRE, you will champion automation-first and AI-augmented reliability engineering, reducing operational toil, improving system resilience, and driving a culture of ownership and continuous improvement across teams.

Key Responsibilities
  • Define and drive organization-wide reliability strategy, including SLIs, SLOs, SLAs, and error budgets.
  • Influence system architecture to ensure high availability, scalability, fault tolerance, and operability.
  • Design and build scalable automation frameworks and internal platforms to reduce operational toil and enable self-service capabilities.
  • Leverage AI/ML-driven approaches to enhance observability, anomaly detection, and predictive incident prevention.
  • Implement and optimize AI-assisted incident management, including alert triage, root cause analysis, and automated remediation workflows.
  • Lead implementation of centralized observability (metrics, logs, traces) and define effective alerting and monitoring strategies.
  • Drive proactive performance optimization, capacity planning, and system efficiency improvements using data-driven insights.
  • Lead incident management, including critical incident response, resolution, and blameless postmortems with a focus on systemic fixes.
  • Design and improve incident and change management workflows, integrating observability with ITSM tools (e.g., ServiceNow, Jira Service Management, PagerDuty).
  • Automate incident detection, triage, escalation, and remediation workflows to minimize manual intervention.
  • Champion resilience practices such as disaster recovery, chaos engineering, and failure testing.
  • Partner with engineering teams to improve CI/CD reliability, release safety, and deployment strategies (e.g., canary, blue-green).
  • Continuously reduce MTTR, change failure rate, and operational overhead through automation and engineering improvements.
  • Drive cloud cost optimization and resource efficiency, including optimization of AI/ML workloads and inference costs.
  • Collaborate with data and ML teams to ensure reliability, scalability, and observability of AI/ML systems, including monitoring for drift and performance degradation.
  • Mentor engineers and act as a technical leader, influencing best practices and elevating reliability standards across teams.
  • Foster a culture of ownership, an automation-first mindset, and AI-augmented operational excellence.

Desired Skills and Requirements

Must Have
  • Problem-solving skills
  • Cloud: AWS
  • Programming/Scripting: Python, Go
  • Containerization: Kubernetes, containers, microservices architectures
  • Infrastructure as Code (IaC): Terraform, CloudFormation
  • Automation/Configuration Management: Ansible, Puppet, Chef
  • Monitoring/Observability: Datadog, Prometheus, Grafana, Splunk, AWS CloudWatch, AWS X-Ray
  • Reliability Engineering: SLIs, SLOs, SLAs, error budgets
  • Incident Management & Reliability Frameworks
  • CI/CD and Release engineering: experience with Jenkins, GitLab CI, etc.
  • ITSM & Incident Tools: ServiceNow, Jira Service Management, PagerDuty, Opsgenie
  • AI/ML & AIOps for observability, alerting, incident analysis, and automation
  • System Design, Scalability, Performance Engineering, and Reliability Trade-offs
  • Distributed Systems expertise

Good-to-Have
  • General Dev Experience: Internal Developer Platforms (IDP) & Platform Engineering
  • Chaos Engineering Tools: e.g., Gremlin, Chaos Monkey
  • Resilience Testing
  • Security, Compliance, and Governance in Cloud Environments
  • Application Development
  • Agile Methodology
  • FinOps & Cloud Cost Optimization

Experience
  • 8+ years of experience in Site Reliability Engineering / DevOps / Platform Engineering in SaaS platform environments.
  • 3 years of experience specifically in managing and optimizing SaaS platforms.
  • 3 years of expert knowledge and hands-on experience with AWS.
  • 4 years of experience using automation tools like Ansible, Puppet, or Chef.
  • 4 years of experience with scripting in Python or similar languages.
  • 3 years of experience using tools like Splunk, New Relic, Datadog, AWS CloudWatch, or AWS X-Ray.
  • 3 years of experience leading disaster recovery efforts in current and previous roles.
  • 3 years of experience implementing chaos engineering practices in live environments.
  • 4 years of active involvement in on-call rotations and incident management.
  • 4+ years of end-to-end application development experience, showcasing familiarity with the complete software development lifecycle and a strong ability to design, implement, and deploy functional, scalable applications.
  • 3 years of experience leading post-mortem analysis sessions following major incidents.

Education
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • Equivalent practical experience in large-scale SaaS or cloud-native environments is highly valued.
