HPC Support Engineer

Publication Date:  Oct 22, 2024
Ref. No:  522680
Location: Timisoara, RO

Eviden, part of the Atos Group, with an annual revenue of circa €5 billion, is a global leader in data-driven, trusted and sustainable digital transformation. As a next-generation digital business with leading worldwide positions in digital, cloud, data, advanced computing and security, it brings deep expertise to all industries in more than 47 countries. By uniting unique high-end technologies across the full digital continuum with 47,000 world-class talents, Eviden expands the possibilities of data and technology, now and for generations to come.

 

HPC Support Engineer:

 

A High-Performance Computing (HPC) support engineer plays a vital role in maintaining and optimizing computing environments, which are used by research institutions, industries, and organizations for tasks that require significant computational power, such as scientific simulations, large-scale data analysis, machine learning, and engineering computations.

 

Role Expectations:

  • HPC systems are often clusters of interconnected servers. The engineer is responsible for the administration of these clusters, which includes installation, configuration, and maintenance of hardware and software.

  • Linux is the dominant OS in HPC environments. The engineer ensures that the OS is updated, secure, and optimized for high-performance workloads.

  • HPC environments use job schedulers (e.g., SLURM, PBS, or LSF) to allocate resources efficiently. The engineer manages these schedulers to ensure optimal job performance, queue management, and fair distribution of resources among users.

  • Design and manage backup solutions for large volumes of data, minimizing data loss in case of hardware failures or other disasters.

  • Work with SMC (Smart Management Center), the foundation for hosting the infrastructure and application microservices dedicated to managing an HPC supercomputer.

  • Support and maintain technology standards, processes, and policies related to the on-premises and cloud infrastructure in scope.

  • Contribute to international projects by providing consultancy regarding HPC infrastructure architectures (on-premises and cloud).

  • Suggest system changes in accordance with documented SOPs.

  • Produce and maintain appropriate documentation and diagrams describing system setups and overall inventory.

 

Capabilities and Expertise:

  • System Administration: Red Hat expertise.

  • Networking: expertise in configuring and troubleshooting network setups within HPC clusters, including an understanding of low-latency interconnects such as InfiniBand or Omni-Path.

  • Scripting Proficiency: use of scripting languages such as Bash, Python, or Perl to automate routine tasks such as cluster monitoring, user onboarding, or job submission.

  • Configuration Management: familiarity with tools such as Ansible, Puppet, or Chef to automate the deployment and configuration of cluster nodes and services.

  • System Monitoring: experience implementing and managing monitoring tools (e.g., Prometheus, Grafana) to track system health, detect performance bottlenecks, and identify potential hardware or software failures.

  • Storage Management: familiarity with large-scale storage systems such as GPFS, Lustre, or NFS, and the ability to troubleshoot file system issues.

             

Nice to have:

  • Supercomputing: knowledge and understanding of advanced supercomputing platforms (e.g., Cray, IBM Blue Gene).

  • Job Scheduling: experience submitting jobs to schedulers such as LSF, Slurm, or Grid Engine.

  • Parallel Computing: experience with parallelism (shared and distributed memory architectures), MPI (Message Passing Interface), and OpenMP.

 

Why Join Us?

  • Training and Certifications: Access ongoing training and certifications for current and emerging technologies.

  • Hybrid Schedule: Enjoy the flexibility of a hybrid work schedule.

  • Flexible Benefits: Receive a range of benefits, including private medical insurance, company-supported CSR, sports and leisure activities, lunch vouchers, mobile phones, and laptops.

  • Reimbursement: Get a yearly fixed amount for reimbursement.

  • Performance Bonus: Earn an annual performance bonus based on your achievements.

  • Career Advancement: Explore numerous opportunities for professional growth and career advancement.

  • Extra Vacation Days: Take advantage of additional vacation days to relax and recharge.

 

 

 

Let’s grow together.