We are a B2B WealthTech startup based in Abu Dhabi and backed by BNY Mellon (America’s oldest bank and the first company to list on the NYSE) and Lunate (a new $50B AUM alternative asset management firm based in Abu Dhabi, UAE). The company has raised $300M to build a state-of-the-art wealth technology platform.
Our mission is to power and grow our clients’ Wealth franchises through differentiated experiences, financial solutions, and insights. Our digital wealth management platform will enable banks and other financial institutions in the Middle East to grow and further penetrate affluent, HNW, and UHNW investor segments.
While still leveraging the capabilities and knowledge of large organizations, our fintech is a startup with truly cross-functional and agile teams.
For more information, please visit www.alpheya.com
Role
We are building an SRE team that owns production reliability end-to-end across data ingestion pipelines, backend services, Kubernetes deployments, and observability.
This is not a tickets-only ops role. You will debug complex production issues, ship permanent fixes (code/config), and harden the system so issues don’t repeat. You’ll work across ingestion/ETL (Snowflake and other sources), application services (Go/Node.js), and platform operations (Kubernetes/Helm), with strong emphasis on incident response and reliability engineering.
What you’ll do
- Own production reliability for ingestion workflows end-to-end, including SLAs and incident response.
- Lead and execute incident response for ingestion + application failures: triage, mitigation, stakeholder comms, and coordination across teams.
- Debug and resolve ingestion and data mapping issues (client-specific FinTech files, schema changes, edge cases) and ensure correctness post-fix.
- Operate ingestion services/workers on Kubernetes: troubleshoot rollouts, config/secrets, scaling, resource bottlenecks, node/pod issues, and runtime failures.
- Handle data recovery safely: replays/backfills, idempotency checks, dedupe strategies, reconciliation queries, and data-quality validation.
- Diagnose database issues (PostgreSQL/CNPG): performance bottlenecks, locks, indexing, query tuning, migrations, and operational risks.
- Build ingestion + application observability: dashboards and alerts for freshness, throughput, lag, error rate, retries, DLQ volume, processing latency, and per-tenant success metrics.
- Drive prevention: improve runbooks/service passports, post-deploy validation, regression testing, and operational standards.
- Partner with application/data engineers on schema evolution, data contracts, and reliability patterns (timeouts, retries, backpressure, safe degradation).
Requirements
- Strong knowledge of SQL and PostgreSQL.
- Ability to debug production issues across data + backend services + infrastructure, not just within one layer.
- Working understanding of backend systems in Go (preferred) and Node.js: able to navigate codebases, follow request flows, debug production issues, and contribute small-to-medium fixes (not only scripts).
- Working understanding of distributed backend systems and APIs (GraphQL + gRPC/RPC): able to follow request flows across services, identify contract/schema issues, and troubleshoot latency/error patterns end-to-end.
- Experience with ETL/ELT pipelines and messaging systems.
- Understanding of data formats (CSV, JSON, Parquet).
- Familiarity with MySQL and Snowflake.
- Exposure to Kubernetes, Docker, and AKS for running data jobs.
- Ability to debug ingestion errors and runtime failures.
Good to Have
- Experience with Temporal workflows or distributed systems.
- Prior exposure to observability stacks (Prometheus, Grafana, Loki, Tempo).
- Interest in transitioning towards SRE/Platform Engineering.



