Software Architect (Docker/Slurm)
Role description
This service requires a specialized technical profile, a Software Architect with Docker and Slurm (Simple Linux Utility for Resource Management) expertise, to carry out the installation, configuration, and maintenance of a computing infrastructure with 4 NVIDIA L40s GPUs. The infrastructure must be based on Docker for container management and Slurm as the job and resource queue manager, with vGPU technology enabled on the GPUs and monitoring provided by tools such as Prometheus, cAdvisor, DCGM Exporter, and Grafana. The integration and management of MIG (Multi-Instance GPU) should also be taken into account if required.
Primary Duties & Responsibilities
- Install, configure, and maintain a computing infrastructure with 4 NVIDIA L40s GPUs.
- Build the infrastructure on Docker for container management and Slurm as the job and resource queue manager, with vGPU technology enabled on the GPUs.
- Deploy and operate monitoring tools such as Prometheus, cAdvisor, DCGM Exporter, and Grafana.
- Integrate and manage MIG (Multi-Instance GPU) if required.
Education & Requirements
- Minimum 5 years of experience installing and configuring Docker- and Slurm-based infrastructures.
- Demonstrable experience in vGPU configuration and, if required, MIG (Multi-Instance GPU) configuration.
- Advanced knowledge of Docker container management and GPU integration using the NVIDIA Container Toolkit.
- Ability to configure MIG and vGPU on NVIDIA GPUs.
- Experience configuring and customizing Grafana for visualization of resource usage metrics.
Preferred
- Experience setting up infrastructure for AI model training and testing.