Engineering Manager, Production Engineering
Full time
Level: Lead/Staff
Salary: No salary data
Ghost Score: better than ~65% of Engineering jobs
Freshness: Posted 1 week ago
Job Description
Crusoe is on a mission to accelerate the abundance of energy and intelligence, operating as a vertically integrated AI infrastructure company. The Engineering Manager will lead the Production Engineering team, focusing on reliability improvements and managing the health of services delivered to enterprise customers.
Responsibilities:
- Leading and growing a team of SREs embedded within Crusoe's AI product areas, setting technical direction and fostering a culture of ownership and continuous improvement
- Contributing hands-on as an individual contributor (IC): reviewing code, building tooling, and driving automation to reduce toil and improve the reliability and scalability of production services
- Owning SLA/SLO performance, incident response, and on-call health for service offerings; leading blameless post-mortems and driving systemic remediation
- Partnering with embedded product and platform engineering teams to influence infrastructure design, observability strategy, and operational readiness for new and existing services
- Defining and tracking reliability, performance, and operational maturity metrics across the team, and translating that data into prioritized roadmap investments
- Serving as a technical escalation point for high-severity production incidents affecting enterprise customers, and collaborating with Cloud Support and Customer Success on resolution and communication
Qualifications:
- 5+ years of software or infrastructure engineering experience, including at least 1–2 years in an engineering management or tech lead role
- Strong SRE or production engineering background, with hands-on experience in incident management, SLO frameworks, runbooks, and on-call operations
- Solid coding ability: comfortable writing production-grade code in Go, Python, or similar languages to build tooling and automation
- Experience working with or embedding into cross-functional product teams, and influencing engineering decisions across organizational boundaries
- Familiarity with container orchestration and cloud-native infrastructure, including Kubernetes, distributed systems, and cloud service architectures
- Strong communication skills, with the ability to clearly represent technical risk and operational status to both engineering peers and business stakeholders
Required Skills:
Site Reliability Engineering (SRE), Production Engineering, Incident Management, SLO Frameworks, Runbooks, On-call Operations, Coding in Go, Coding in Python, Tooling, Automation, Container Orchestration, Kubernetes, Cloud-native Infrastructure, Distributed Systems, Cloud Service Architectures, Cross-functional Team Collaboration
Ghost Score Breakdown
- No salary (mandate state violation)
- No company logo (+ pts)
- Fresh posting, 4–7 days (+ pts)
- Known scam/ghost company (+ pts)
- Reposted listing
- Expired deadline
- High job-to-employee ratio
- Recruiting agency
Overall: 17/100 (Low Ghost Risk)
Application Tips
- Top skills mentioned: Python, Go, Kubernetes. Make sure your resume highlights these.
- This listing shows strong signals of being a real opportunity — apply with confidence.