Senior Site Reliability Engineer

Nexla

Reposted 20 Hours Ago

In-Office

Bengaluru, Bengaluru Urban, Karnataka, IND

Senior level

As a Senior Site Reliability Engineer at Nexla, you'll manage AWS EKS infrastructure, implement CI/CD pipelines, and ensure system reliability while collaborating with engineering teams.

The summary above was generated by AI

About Nexla

Nexla is the leading integration platform, built with AI, for AI. Nexla takes a metadata-driven approach to converge diverse integrations across Data, Documents, Agents, Applications, and APIs into a single design pattern. We accelerate the development of solutions for GenAI, Analytics, and Inter-company data. Nexla makes data users and developers up to 10x more productive by delivering a true blend of no-code, low-code, and pro-code interfaces.

Leading companies including DoorDash, LinkedIn, Johnson & Johnson, and LiveRamp trust Nexla for mission-critical data. Nexla was named in the 2022, 2023, and 2024 Gartner Magic Quadrant™ for Data Integration Tools and is top-rated by customers on Gartner Peer Insights. The company is headquartered in San Mateo, California.

At Nexla, our culture is built around our core values: Have Empathy, Be Curious, Be Intellectually Honest, Achieve Excellence, and Remember to Relax. We put our customers at the heart of everything we do, foster a data-driven mindset, take ownership of our work, and believe in the power of teamwork to achieve ambitious goals.

Role
You will own the reliability of the distributed data systems at the heart of Nexla - the streaming runtime and processing engines that move hundreds of billions of rows per day for top-tier enterprises. This is an SRE role for our big data stack: Kafka, Spark, Flink, Ray, Redis, and data warehouses, all running on Kubernetes.

This is not a cloud-provisioning role. We are looking for someone who has lived inside stateful, high-throughput systems in production: someone who has chased down a broker outage, a checkpoint stall, a crashlooping cache, and a sink that silently stopped writing, and who fixes the architecture rather than the symptom. If keeping a large, busy data platform alive and fast is the kind of problem you find satisfying, you will have a lot of fun working with us. This is a unique opportunity to shape the foundation of a product that is defining the next wave of intelligent, context-aware data movement.

Responsibilities

Streaming & Data Plane Reliability: Own the health of our Kafka-based runtime (managed via Strimzi on Kubernetes) - broker health, topic lifecycle and count management, partition and throughput tuning, certificate/secret rotation, and version upgrades - at a scale of hundreds of thousands of topics and hundreds of billions of rows per day.
Distributed Processing Engines: Operate and tune distributed-system workloads in production, in collaboration with backend teams - resource allocation, autoscaling, checkpointing, backpressure, and failure recovery for both batch and streaming jobs.
Stateful Services: Run Redis clusters and other stateful systems reliably - failover, persistence, liveness/readiness tuning, and capacity planning under heavy and bursty load.
Kubernetes & Operators: Take end-to-end ownership of Amazon EKS, Google GKE and the operators (Strimzi and others) running our stateful data workloads - cluster lifecycle, scaling, version upgrades, and resource governance.
Observability: Build deep, data-aware monitoring - consumer lag, throughput, partition skew, job latency, error rates - not just host and CPU metrics. Make the data plane's behavior legible before it breaks.
Incident Management: Lead root-cause analysis for distributed-systems failures (broker outages, crashloops, sink decommissions, control-plane race conditions) and drive durable fixes. Mitigate fast, but design out the recurrence.
Infrastructure as Code & Automation: Provision and manage cloud infrastructure with Terraform; build operational runbooks and automation, including for air-gapped / private enterprise installs (pre-staged images, operator-facing procedures).
Collaboration: Partner with platform, runtime, and connector engineering - and with SREs and support - to ship and scale new data-movement features reliably in a large-scale Linux environment.

Qualifications

Experience: 8+ years in infrastructure, SRE, or DevOps, with significant time spent operating production distributed data systems (not just application/cloud infra).
Kafka: Deep, hands-on operational experience running Kafka at scale in production - ideally on Kubernetes via Strimzi - including upgrades, topic/partition management, performance tuning, and TLS/secret rotation.
Distributed Processing (Strong Plus): Production experience operating one or more of Spark, Flink, or Ray - resource tuning, checkpointing, failure recovery.
Stateful Systems (Must Have): Production experience with Redis (clustering, persistence, failover) and a solid understanding of operating stateful workloads on Kubernetes (StatefulSets, PVCs, probes, operators).
Data Warehouses: Familiarity operating against Snowflake, BigQuery, or similar, and an understanding of JDBC connectivity and sink reliability.
Kubernetes & EKS: Strong hands-on EKS - cluster creation, scaling, version upgrades, and operator management.
Infrastructure as Code: Advanced proficiency with Terraform.
Programming: Proficiency in Python (or similar) for automation and tooling. Comfort reading and debugging JVM-based systems is a strong plus.
Reliability Mindset: Demonstrated ownership of incident management, RCA, capacity planning, and performance tuning for high-throughput systems.
CI/CD: Solid understanding of CI/CD methodology (Jenkins, GitHub Actions, or GitLab CI) for containerized and non-containerized apps. Supporting, not the core of the role.
Nice to Have: Configuration management (Ansible preferred); broader AWS services (IAM, VPC, EC2, S3, Lambda); AWS CloudFormation.
Soft Skills: Excellent communication and organizational skills; ability to coordinate effectively within a team and with customers.

Why This Might Be Worth It

You own the hard part. The stateful, distributed systems that move billions of rows are the platform's most demanding reliability problems - and they'd be yours.
Impact at scale from day one. Your work keeps mission-critical data flowing for companies like DoorDash and LinkedIn.
The AI wave is real for us. We're not bolting AI onto a legacy product. Intelligent connectors, context-aware data movement, and agentic workflows are the core of what we're building next - on top of the runtime you'd run.

Small team, big problems. Direct access to the CTO, real influence over product direction, and the autonomy to make significant technical bets.
Recognized platform, startup energy. Enterprise validation with the speed and ownership of an early-stage company.

Location
Pune (preferred) or Bengaluru

Why Build Your Future at Nexla?
We are standing at the precipice of the GenAI revolution, but the biggest bottleneck isn't the models, it's the data. By joining Nexla, you aren’t just entering a company; you are stepping into the critical layer of the modern data stack that powers the AI economy. We are the Data Fabric that enables industry titans like LinkedIn, DoorDash, and J&J to turn messy, siloed data into ready-to-use products for RAG and predictive models. This is your opportunity to move beyond simple tooling and build the actual infrastructure that democratizes data access for the next decade of innovation. If you want to solve the hardest problems in data engineering and own a piece of a market projected to hit billions, your career belongs here.
