Engineering Manager, Production Engineering

FULL TIME
Lead / Staff

Salary

No salary data

Ghost Score

Better than ~65% of category

Engineering jobs

Freshness

Posted 1 week ago

Job Description

Crusoe is on a mission to accelerate the abundance of energy and intelligence, operating as a vertically integrated AI infrastructure company. The Engineering Manager will lead the Production Engineering team, focusing on reliability improvements and managing the health of services delivered to enterprise customers.

Responsibilities:

  • Leading and growing a team of SREs embedded within Crusoe's AI product areas, setting technical direction and fostering a culture of ownership and continuous improvement
  • Contributing as an IC: reviewing code, building tooling, and driving automation to reduce toil and improve the reliability and scalability of production services
  • Owning SLA/SLO performance, incident response, and on-call health for service offerings; leading blameless post-mortems and driving systemic remediation
  • Partnering with embedded product and platform engineering teams to influence infrastructure design, observability strategy, and operational readiness for new and existing services
  • Defining and tracking reliability, performance, and operational maturity metrics across the team; translating data into prioritized roadmap investments
  • Serving as a technical escalation point for high-severity production incidents affecting enterprise customers, and collaborating with Cloud Support and Customer Success on resolution and communication

Qualifications:

  • 5+ years of software or infrastructure engineering experience, with at least 1–2 years in an engineering management or tech lead role
  • Strong SRE or production engineering background, with hands-on experience with incident management, SLO frameworks, runbooks, and on-call operations
  • Solid coding ability; comfortable writing production-grade code in Go, Python, or similar languages to build tooling and automation
  • Experience working with or embedding into cross-functional product teams, and influencing engineering decisions across organizational boundaries
  • Familiarity with container orchestration and cloud-native infrastructure: Kubernetes, distributed systems, and cloud service architectures
  • Strong communication skills; able to clearly represent technical risk and operational status to both engineering peers and business stakeholders

Required Skills: Site Reliability Engineering (SRE), Production Engineering, Incident Management, SLO Frameworks, Runbooks, On-call Operations, Coding in Go, Coding in Python, Tooling, Automation, Container Orchestration, Kubernetes, Cloud-native Infrastructure, Distributed Systems, Cloud Service Architectures, Cross-functional Team Collaboration

Ghost Score Breakdown

No salary (mandate state violation)
+ pts
No company logo
+ pts
Fresh posting (4-7 days)
+ pts
Known scam/ghost company
Reposted listing
Expired deadline
High job-to-employee ratio
Recruiting agency
Overall: 17/100 (Low Ghost Risk)

Application Tips

  • Top skills mentioned: Python, Go, Kubernetes. Make sure your resume highlights these.
  • This listing shows strong signals of being a real opportunity — apply with confidence.
