aion

Site Reliability Engineer

Reposted 2 Days Ago

In-Office

Bengaluru, Bengaluru Urban, Karnataka

Mid level

The Site Reliability Engineer will design monitoring systems, manage infrastructure automation, implement CI/CD pipelines, and ensure high availability across distributed systems using best SRE practices.

The summary above was generated by AI

About AION

AION is building a next-generation AI cloud platform, transforming the future of high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.

By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises. The platform's innovative Proof of Compute Contribution (PoCC) protocol rewards contributors based on performance, creating a transparent and efficient ecosystem.

Integrated with Tether (USD₮ & USD₮0) for stability and regulatory clarity, AION eliminates volatility, ensuring predictable costs and seamless transactions. With cutting-edge partnerships and a USD-backed economy, AION is pioneering the commoditization of high-performance compute, empowering global innovation and bridging the AI wealth gap.

Led by high-pedigree founders with previous exits, AION is well funded by major VCs and has strategic global partnerships. Headquartered in the US with a global presence, the company is building its initial core team in India.

Who you are

You are a reliability-focused engineer with deep expertise in cloud-native systems and infrastructure automation. You thrive on building robust monitoring solutions and creating self-healing infrastructure. You understand the challenges of maintaining high availability across distributed systems and have experience implementing SRE best practices. You're passionate about creating production-ready environments that can scale efficiently and recover automatically from failures.

Technical Skills & Experience

3-8 years of experience in Site Reliability Engineering or DevOps (exceptional candidates with different experience profiles will be considered)
A Tier-1 college education or previous work experience at FAANG or top startups is preferred but not required
Cloud Platforms: Deep expertise with AWS, GCP, or Azure infrastructure services
Kubernetes: Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
Infrastructure as Code: Strong experience with Terraform, Pulumi, or similar IaC tools
Observability: Expertise in implementing comprehensive monitoring using Prometheus, Grafana, and the ELK stack
Service Mesh: Experience with Istio, Linkerd, or similar service mesh technologies
Networking: Understanding of network architectures, DNS, load balancing, and security groups
CI/CD: Knowledge of automated deployment pipelines and GitOps workflows
Scripting: Proficiency in Bash, Python, or Go for automation scripts
Container Technologies: Deep understanding of Docker, containerd, and OCI specifications
Security: Knowledge of infrastructure security best practices and compliance requirements
Incident Management: Experience with incident response, post-mortems, and developing SOP documentation

Key Responsibilities

Design and implement comprehensive monitoring and alerting systems across all AION platforms.
Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes.
Create and maintain runbooks and playbooks for handling common operational scenarios and incidents.
Implement service mesh solutions for observability, traffic management, and security.
Design and implement logging systems that provide visibility into complex distributed systems.
Own capacity planning and resource optimization across cloud environments.
Implement CI/CD pipelines for reliable and consistent deployments across all environments.
Design and build self-healing systems that automatically recover from common failure modes.
Develop infrastructure for both the compute platform and data annotation services with consistent reliability practices.
Design and implement disaster recovery strategies and testing procedures.
Create and maintain production, staging, and development environments with appropriate isolation.
Collaborate with security teams to implement infrastructure security best practices and compliance requirements.

Location

Individuals in this role are expected to relocate to Bangalore, though exceptions can be made. We offer a hybrid working setup with 3 days a week in the office. Employees have the flexibility to work from anywhere for a few months each year.

Why Join Us

Be part of a mission-driven team at the intersection of web3 and AI, tackling some of the most exciting challenges in the industry.
Join an AI startup on the ground floor, with the opportunity to make a significant impact on the company and the industry.
Collaborate with top-tier talent from the tech industry.
Competitive salary and benefits package.
Flexible work environment with opportunities for professional growth and development.

If you are a skilled and motivated Site Reliability Engineer with a passion for building reliable, scalable infrastructure for cutting-edge compute systems, we would love to hear from you.

Top Skills

AWS

Azure

Bash

Docker

ELK Stack

GCP

Grafana

Istio

Kubernetes

Linkerd

Prometheus

Pulumi

Python

Terraform

Bengaluru, Karnataka, India
