The AWS ParallelCluster team is looking for a senior System Development Engineer to join our broader HPC Organization.
Our team owns a core set of technologies that allow our customers to plan, schedule, and execute HPC workloads across AWS compute services and capabilities.
As a team member, you'll have the opportunity to operate and engineer systems on a global scale, while touching and influencing large parts of the underlying AWS services. You'll be involved in the design and development of products that provide fully featured HPC infrastructures on demand. An HPC infrastructure is complex by nature: it involves provisioning multiple resources, such as compute, networking, and storage, and deploying and configuring the different operating systems and software tools that enable our customers to run their HPC workloads. We don’t expect you to be an expert in, or necessarily even be familiar with, all of these technologies, but we do expect you to be excited to learn about them and use them to delight our customers.
You'll focus on operational excellence by integrating key DevOps methodologies and tools, such as infrastructure as code and configuration management, into the product architecture as capabilities for building HPC infrastructures. You'll improve and apply these practices and tools to help us innovate faster by automating and streamlining software development and infrastructure management processes. You will become deeply familiar with the architecture of our systems and will drive prioritization of operational issues.
This position involves on-call responsibilities, typically one week every two months. We work to ensure that our systems are fault tolerant so that alarms won't wake us up in the middle of the night. When we do get paged, we take action to ensure we will not get paged again for the same issue.
The team is dedicated to supporting new team members. You will grow in the team through one-on-one mentoring and thorough, but kind, code reviews.
Over the years, we have developed a strong sense of team trust and we are looking for a new teammate who is enthusiastic, empathetic, motivated, and reliable.
BASIC QUALIFICATIONS
· Bachelor's degree in Computer Science / Engineering (or related STEM subjects), or equivalent experience
· Several years of professional experience in systems development or Linux/Unix system engineering
· Several years of programming experience with at least one modern language such as Python, Java, C#, C++
· Experience with DevOps practices and tools for continuous delivery, infrastructure as code, software deployment automation / configuration management
· Experience designing and building highly available distributed systems, and operating processes that reduce manual effort and increase overall efficiency
· Technically sound in software development activities and life cycles, including coding standards, source control management, testing, and operations
· English as working language is a requirement
PREFERRED QUALIFICATIONS
· Master's degree in Computer Science / Engineering, or equivalent experience
· Experience with configuration management systems such as Chef, Ansible, or Puppet
· Experience specifying, designing, and implementing system health, performance monitoring tools, and software management tools
· Experience developing automation to solve problems at scale
· Experience with HPC batch schedulers (LSF, PBS, GridEngine, Slurm, etc.) or other HPC and cluster management technologies
· Familiarity with the AWS platform