While technology is the heart of our business, a global and diverse culture is the heart of our success. We love our people, and we take pride in offering them a culture built on transparency, diversity, integrity, learning and growth.
If working in an environment that encourages you to innovate and excel, not just in your professional life but in your personal life too, interests you, you would enjoy your career with Quantiphi!
Required Experience: 3 to 6 Years
Roles and Responsibilities:
- Design, deploy, and maintain distributed systems using Kubernetes and Slurm for optimal resource utilization and workload management.
- Lead the configuration and optimization of Multi-GPU, Multi-Node Deep Learning job scheduling, ensuring efficient computation and data processing.
- Collaborate with cross-functional teams to understand project requirements and translate them into technical solutions.
- Work with on-prem NVIDIA GPU servers.
- Develop and maintain complex shell scripts for system automation tasks, improving efficiency and reducing manual intervention.
- Monitor system performance, identify bottlenecks, and implement the adjustments needed to ensure high availability and reliability.
- Troubleshoot and resolve technical issues related to the distributed system, job scheduling, and deep learning processes.
- Stay up to date with industry trends and emerging technologies in distributed systems, deep learning, and automation.
Skill Set Needed:
- Strong communication and collaboration skills to work effectively within a cross-functional team.
- Proficiency in Python.
- Hands-on experience with MLOps tools such as MLflow, Kubeflow, and AutoML.
- Understanding of at least one ML framework (PyTorch or TensorFlow) is good to have.
- Experience in shell scripting and Linux.
- Good understanding of logical networks.
- Understanding of NLP (preferred) or Computer Vision.
- Familiarity with the cloud-native stack.
- Proven experience in designing, deploying, and managing distributed systems, with a focus on Kubernetes and Slurm.
- Sufficient understanding of AI model training and deployment, and a strong background in Multi-GPU, Multi-Node Deep Learning job scheduling and resource management.
- Proficiency in Linux systems, particularly Ubuntu, and the ability to navigate and troubleshoot related issues.
- Extensive experience creating complex shell scripts for automation and system orchestration.
- Familiarity with continuous integration and deployment (CI/CD) processes.
- Excellent problem-solving skills and the ability to diagnose and resolve technical issues promptly.
Good to Have:
- Prior experience with, or strong awareness of, the NVIDIA ecosystem, such as Triton Inference Server and CUDA.
- Strong working knowledge of Slurm, Kubernetes, Linux, and AI deployment tools.
If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!