NVIDIA’s Infrastructure, Planning and Processes (IPP) organization is seeking a hard-working and experienced Site Reliability/DevOps Engineer with a strong background in infrastructure management, monitoring, automation, and system administration to join our Sanity Operations Team in Pune. The IPP organization provides infrastructure, products, and services for multiple software teams, including the GPU, Mobile, and Automotive divisions, working on NVIDIA’s extraordinary products and services.
The team is responsible for hosting, enabling, and running the large-scale private cloud systems and services for our in-house testing CI framework. The cloud hosts a heterogeneous mix of machines and devices with various operating systems (Windows, Linux, Android, etc.), running with NVIDIA GPUs and Tegra processors.
What you’ll be doing:
- Create resilient, scalable, and efficient test and deployment pipelines.
- Design and implement complex automation platforms to identify and resolve operational inefficiencies.
- Triage software, hardware, and infrastructure issues and maintain high availability for our infrastructure and services.
- Deploy and monitor critical high-performance, large-scale services running on geo-distributed systems.
- Continuously strive for efficient utilization and management of the infrastructure.
- Automate processes that enable developers to adopt self-service practices while ensuring compliance with security standards.
- Work with architects and engineers across teams to review designs and solutions during the development and deployment phases.
- Collaborate with our other engineering teams to deliver reliable, robust, and high-performance capabilities from the underlying infrastructure.
- Mine and analyze data from multiple sources to identify scaling and optimization opportunities.
What we need to see:
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, or equivalent experience, with 8+ years of experience in a DevOps environment.
- Strong hands-on experience configuring, maintaining, and building upon deployments of industry-standard tools (e.g. Kubernetes, Jenkins, Docker, CMake, GitLab, Jira).
- Working experience monitoring and maintaining large-scale infrastructure applications running in a microservice-based architecture.
- Proficiency with virtualization architecture, with strong experience in Kubernetes, VMs, and Docker.
- Experience with continuous integration and continuous delivery systems such as GitLab, GitOps, Jenkins, Packer, and Terraform.
- Strong Python scripting skills, with a proven background of using and writing JSON/REST APIs.
- Fluency in querying MySQL or comparable SQL/NoSQL databases.
- Solid understanding of configuration management tools like Chef, Puppet, and Ansible.
- Working experience with Perforce, Git, or any other version control system is necessary.
- Experience with telemetry and alerting systems such as Kibana, Elasticsearch, Grafana, and Prometheus to create rich visualizations of system health over time.
- Ability to self-manage, show leadership, mentor others, and communicate well.
Ways to stand out from the crowd:
- Understanding of networking concepts like TCP/IP and firewall management.
- Exposure to web apps and dashboards built on frameworks like Django, AngularJS, and VueJS.
- High-level understanding of build and test systems.
- Experience building regression detection systems by analyzing real-time production data, emphasizing important metrics.
- Innovating with industry-standard tools and collaborating with the open source community.
- Outstanding interpersonal and communication skills.
NVIDIA Bengaluru, Karnataka, IND Office
6, Chinappa Layout, Laxmi Sagar Layout, Mahadevapura, Bengaluru, Karnataka, India, 560048