HPC AI Benchmark specialist
Bezons, FR
About Eviden
Eviden is the Atos Group brand for hardware and software products with c. € 1 billion in revenue, operating in 36 countries and comprising four business units: advanced computing, cybersecurity products, mission-critical systems and vision AI. As a next-generation technology leader, Eviden offers a unique combination of hardware and software technologies for businesses, public sector and defense organizations and research institutions, helping them to create value out of their data. Bringing together more than 4,500 world-class talents and holding more than 2,100 patents, Eviden provides a strong portfolio of innovative and eco-efficient solutions in AI, computing, security, data and applications.
Bull, currently Eviden's Big Data & Security (BDS) division, delivers some of the world's most powerful High-Performance Computing (HPC) solutions. As a leader in Europe, it advises and supports its clients in solving the most complex scientific problems of today and tomorrow. As part of our development and ambitious future programs, we are recruiting a Benchmark ML Engineer.
You will work within the Applications and Performance team (more than 30 Engineers) whose main mission is to respond to HPC & AI calls for tenders. This team plans and guarantees the performance of customers' scientific applications on proposed supercomputers.
The position can be based at one of several Eviden HPC sites in France: Grenoble (38, preferably), Les Clayes-sous-Bois (78), Bruyères-le-Châtel (91), Bordeaux (33), Rennes (35), Toulouse (31) or Montpellier (34). Other sites in Europe are also possible.
Your role will be the preparation, execution, and analysis of AI applications, mainly benchmarks. These benchmarks generally consist of training or inference runs on a reference dataset under defined rules, to evaluate the performance of the system in terms of time or throughput. Usual benchmarks are taken from the MLPerf suite (Training and Inference: Datacenter).
The purpose of the benchmarking activity is to characterize the application on a current system in order to project its behavior onto target systems: latency vs throughput, accuracy vs performance, scaling efficiency, …
Target systems are computing clusters usually equipped with accelerators (e.g. NVIDIA GPU, AMD GPU, Intel Gaudi, …). Therefore, the benchmarks are run in a multi-node multi-accelerator framework. Part of the work consists of estimating the performance of systems that do not exist yet (new technology, larger size, etc.).
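To give an idea of the kind of characterization involved, here is a minimal sketch in Python of a scaling-efficiency calculation. It assumes throughput has been measured at several node counts; the function name and all figures are hypothetical and serve only as an illustration, not as the team's actual tooling or results.

```python
def scaling_efficiency(measurements):
    """Return the scaling efficiency for each node count.

    `measurements` maps a node count to a measured throughput
    (for example samples per second). Efficiency is expressed
    relative to ideal linear scaling from the smallest run.
    """
    base_nodes = min(measurements)
    base_throughput = measurements[base_nodes]
    return {
        nodes: throughput / (base_throughput * nodes / base_nodes)
        for nodes, throughput in sorted(measurements.items())
    }


# Hypothetical measurements: node count -> samples per second.
measured = {1: 1000.0, 2: 1900.0, 4: 3600.0, 8: 6400.0}

efficiency = scaling_efficiency(measured)
for nodes, value in efficiency.items():
    print(f"{nodes} node(s): {value:.0%} of ideal linear scaling")
```

A trend like this one, where efficiency drops as the node count grows, is typically what has to be understood before projecting results onto a larger or newer system.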
Surrounded by passionate and attentive experts, you will have a range of responsibilities:
· AI benchmark analysis:
o Literature review on the application considered
o Code exploration (if available)
o Matching hyperparameters to the hardware architecture
· Benchmark preparation:
o Software environment (usually in a container)
o Training and job scripts
o Dataset preparation
o Hyperparameter search
o Documentation
· Benchmark execution and performance estimation:
o Test executions
o Analysis
o Reports
The position places you in an environment operating at the highest technological level, in direct contact with the various entities of the division, in liaison with our partners (AMD, Intel, NVIDIA, ...) and in collaboration with our customers.
You may be required to make short trips, mainly within France and Europe.
Your profile:
· Relevant higher-education or university degree
· Ideally, several years of experience in the field of AI
· Autonomous with team spirit
Ideal skills:
· Machine / Deep Learning
o Good understanding of the fundamentals of deep learning
o Experience in training classic neural network architectures (MLP, CNN, RNN)
o Transformers and LLMs
o Large-scale distributed training strategies (data parallelism, tensor parallelism, pipeline parallelism, FSDP, DeepSpeed, ...)
· Execution environment:
o Containers (Docker, Singularity)
o Job scheduler (mainly SLURM)
o Knowledge of hardware architectures (GPU, network, storage) is a plus
· Languages and frameworks:
o Python
o Bash
o PyTorch, TensorFlow
ML benchmarking requires strong resilience and an ability to question things. You should have a strong sense of curiosity and be able to work closely with other team members while also taking your own initiative. Strong-willed and questioning by nature, you know how to solve complex technical optimization problems. Organized and rigorous in your approach, you are able to balance time and priorities, while maintaining an open dialogue to find the best compromise for a given problem.
Let’s grow together.