NVIDIA

Compute Cluster SRE Engineer, GPU - HPC

Sorry, this job was removed at 08:03 p.m. (IST) on Wednesday, Oct 09, 2024

Bengaluru, Bengaluru Urban, Karnataka

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU, the engine of modern visual computing, the field has expanded to encompass video games, movie production, product design, medical diagnosis, and scientific research. Today, we stand at the beginning of the next era, the AI computing era, ignited by a new computing model: GPU deep learning. This new model, in which deep neural networks are trained to recognize patterns from massive amounts of data, has proven deeply effective at solving some of the most complex problems in everyday life.

Farm GPU compute cluster SREs work to maintain large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. This is a highly specialized discipline that demands knowledge across different systems: Slurm/LSF, Unix administration, scripting, capacity management, and open-source technologies. Farm GPU SREs are responsible for developing solutions around our large compute cluster to make it work efficiently and to improve the user experience for customers as well as for the engineers supporting the cluster. Much of our software development focuses on eliminating manual work through automation, performance tuning, and growing the efficiency of production systems. Practices such as limiting time spent on reactive operational work, holding blameless postmortems, and proactively identifying potential outages feed the iterative improvement that is key to product quality and make for interesting, dynamic day-to-day work. We promote self-direction to work on meaningful projects, and we strive to build an environment that provides the support and mentorship needed to learn and grow.

What you will be doing:

Design, implement, and support large-scale infrastructure with monitoring, logging, and alerting to meet promised uptime.
Support services before they go live through activities such as capacity management, and provide the best possible resolution of user support issues.
Maintain infrastructure and services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.
Understand complex and vast infrastructure and support it during on-call weeks.
Work with different subject-matter experts (SMEs) to provide customers with quality resolutions to production issues.

What we need to see:

BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) or equivalent.
4+ years of hands-on industry experience in the above-mentioned areas.
Must have experience with Linux system administration (Ubuntu, CentOS/Red Hat).
Must have experience setting up and administering HPC cluster schedulers such as Slurm and/or LSF.
Experience in one or more of the following: Python, Perl, Bash.
Good understanding of open-source IT automation tools such as Ansible.
Interest in crafting, analyzing, and fixing large-scale distributed systems.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to debug and optimize code and automate routine tasks.

Ways to stand out from the crowd:

Experience with Bright Cluster Manager (BCM).
Understanding of InfiniBand or Ethernet concepts.
Experience with high-speed storage solutions such as Lustre and GPFS.
Experience with MPI and PyTorch.

6, Chinappa Layout, Laxmi Sagar Layout, Mahadevapura, Bengaluru, Karnataka, India, 560048
