Site Reliability Engineer (SRE) - Automation & Tooling
Position Overview:
We are looking for a talented and motivated Site Reliability Engineer (SRE) with a strong focus on automation and tooling. As part of our dynamic engineering team, you will play a crucial role in building and maintaining reliable, scalable, and efficient cloud infrastructure. You will work closely with development, operations, and product teams to enhance our systems and services while championing the best practices of SRE.
Key Responsibilities:
- Design, develop, and implement automated systems to improve the reliability, performance, and scalability of our services.
- Create and maintain tooling that facilitates rapid deployment, monitoring, and management of our infrastructure.
- Collaborate with cross-functional teams to integrate automation solutions with existing workflows and pipelines.
- Identify and resolve performance bottlenecks and ensure high availability of critical services.
- Develop and follow SRE best practices to enhance system reliability and operational efficiency.
- Contribute to incident response and postmortem analysis to continuously improve our systems.
- Participate in on-call rotations to support 24/7 operations.
- Foster a culture of continuous improvement through proactive monitoring, performance tuning, and capacity planning.
- Advocate for cloud-agnostic architecture principles and assist in the integration and management of multi-cloud environments.
Qualifications:
- B.S./M.S. in Computer Science, Engineering, or a related field, or equivalent industry experience.
- 5+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.
- Strong proficiency in automation tools and frameworks (e.g., Ansible, Terraform, Chef, Puppet).
- Extensive experience with scripting and programming languages (e.g., Python, Go, Bash).
- Solid understanding of cloud platforms (AWS, GCP, Azure) and cloud-agnostic architectural principles.
- 5+ years of hands-on experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Expertise in monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack).
- Familiarity with CI/CD pipelines and relevant tools (e.g., Jenkins, GitLab).
- Strong understanding of networking concepts, distributed systems, and microservices architecture.
- Demonstrated knowledge of incident management, post-incident analysis, and service-level concepts (SLIs, SLOs, SLAs).
- Excellent problem-solving skills and the ability to work in a fast-paced, collaborative environment.
- Strong communication skills and the ability to convey complex technical concepts to non-technical stakeholders.
Preferred Qualifications:
- Experience with the Scaled Agile Framework (SAFe) methodology.
- Knowledge of security best practices in cloud environments.
- Previous experience building or maintaining cloud-agnostic solutions.
What We Do
Telesign provides continuous trust to leading global enterprises by connecting, protecting, and defending their digital identities. Telesign verifies over five billion unique phone numbers a month, representing half of the world’s mobile users, and provides critical insight into the remaining billions. The company’s powerful AI and extensive data science deliver identity with a unique combination of speed, accuracy, and global reach. Telesign solutions prevent fraud, secure communications, and enable the digital economy by allowing companies and customers to engage with confidence.
Why Work With Us
We exist to make the digital world a more trustworthy place for everyone. At Telesign, experience has taught us the smallest ideas can have the greatest impact on safety and trust. We believe that individuality is your superpower and we invite you to bring your unique talents to help Telesign innovate, get things done, and defend the digital world.