Working at MIT offers opportunities that just aren’t found anywhere else, including generous and unique benefits that help ensure that MIT employees are healthy and supported and enjoy a fulfilling work/life balance. Discover more about what it's like to work at MIT.

We welcome people from all walks of life to bring their talent, ideas, and experience to our community. We value diversity and strongly encourage applications from individuals of all identities and backgrounds – like yours. If you want to be part of our exceptional, multicultural, collaborative, and inclusive community, then take a look at this opportunity.

Lead Site Reliability Engineer

Job Number: 24909
Functional Area: Information Technology
Department: Office of Research Computing and Data
School Area: VP Research

Employment Type: Full-time (Hybrid)
Employment Category: Exempt
Visa Sponsorship Available: No
Schedule:

Posting Description

LEAD SITE RELIABILITY ENGINEER, Office of Research Computing and Data (ORCD), to build and advance SRE functions in collaboration with a diverse team of systems engineers; play a pivotal part in the strategic transformation of infrastructure planning, design, delivery, and operations in support of ORCD's continued growth; and build and foster cross-functional collaboration between engineering and operations teams across MIT, ensuring alignment with institutional objectives and long-term strategic initiatives.

Find the full job description here: https://orcd.mit.edu/about-orcd/jobs

Job Requirements

REQUIRED: Bachelor’s degree in engineering, computer science, or a related field, or equivalent industry experience; a minimum of seven years of experience in site reliability engineering or a related field; deep and broad expertise across multiple technical domains, including Linux, networking, and virtualization; ability to drive innovation in system architecture and lead transformative design initiatives from the ground up; robust analytical and structured problem-solving skills, coupled with excellent communication and interpersonal abilities; and a deep understanding of Linux, LDAP, virtualization, and configuration management in a large Linux-based engineering environment. PREFERRED: 10+ years of experience in site reliability engineering; experience working within an HPC/research computing environment; and the ability to analyze network traffic to identify technical issues and suspicious activity. Job #24909-11

4/8/2025