HPC Systems Administrator
@ Vector Institute

Toronto, ON
CA$110,000
On Site
Full Time
Posted 13 hours ago

Job Details

Position Summary

The Vector Institute is seeking an HPC Systems Administrator to join our growing team in Toronto as we continue our work of establishing Canada as a centre of expertise for AI. You will be involved in building and maintaining High-Performance Computing environments for world-class research in Machine Learning.

Key Responsibilities

  • Support HPC compute clusters with 250+ nodes, 10,000+ cores, and 1,200+ GPUs.
  • Support a GPU-enabled workstation office environment.
  • Provide guidance and support to the research community.
  • Develop and maintain tools for automatic installation and configuration.
  • Perform hardware/software upgrades and maintenance.
  • Install scientific software and libraries across various operating systems.
  • Support researchers with all their computing needs.
  • Maintain network infrastructure and system security.
  • Handle enterprise IT operations.

Key Success Measures

  • Ensure smooth system functioning with proactive troubleshooting and maintenance.
  • Deliver strong support for both research and enterprise IT needs.
  • Build and maintain tools for local and cloud infrastructure administration.

Profile of the Ideal Candidate

A degree or diploma in computer science or engineering and more than three years of hands-on Linux/UNIX systems administration experience in a research environment are required. The role demands experience managing HPC grids and job schedulers such as Slurm, strong programming and scripting skills, and a problem-solving attitude. Excellent communication skills and the ability to work autonomously in a fast-paced environment are essential.

Qualifications And Assets

  • Experience with HPC workload management systems (Slurm, SGE, Moab/Torque).
  • Experience with large scale-out storage (SAN/NAS) and file systems (ZFS, GPFS).
  • Good understanding of high-speed internetworking (100GE, InfiniBand).
  • Experience supporting data management, backups, archives and monitoring.
  • Familiarity with application tools/databases (MySQL, PostgreSQL) and open source infrastructure (OpenLDAP, NFS, OpenZFS, 2FA systems).

Equal Opportunity

At the Vector Institute, we support diversity and welcome candidates from all backgrounds, including underrepresented groups. If you require accommodations during the recruitment process, please contact hr@vectorinstitute.ai.

Key Skills and Competencies

  • HPC
  • Linux
  • Systems Administration
  • Slurm
  • Networking
  • Automation
  • Scripting
  • Security
  • Storage
  • Research Support

How to Get Hired at Vector Institute

🎯 Tips for Getting Hired

  • Research Vector Institute: Understand their AI and research focus.
  • Customize your resume: Highlight HPC and Linux skills.
  • Showcase relevant projects: Emphasize automation and troubleshooting.
  • Prepare technical insights: Review HPC and scheduler systems.
  • Practice communication: Be clear about problem-solving examples.

📝 Interview Preparation Advice

Technical Preparation

  • Review HPC cluster management manuals.
  • Study scheduler software documentation.
  • Practice Linux system troubleshooting.
  • Test installation of scientific software.

Behavioral Questions

  • Describe a challenging troubleshooting incident.
  • Explain how you prioritize in fast-paced environments.
  • Describe how you handle multiple tasks simultaneously.
  • Give an example of proactive problem resolution.

Frequently Asked Questions