LLM Evaluation Engineer
Full-time
Mid-level
Salary: No salary data (vs. Engineering avg)
Ghost Score: Worse than category average (Engineering jobs)
Freshness: Posted 5 months ago
Job Description
ThirdLaw is building the control layer for AI in the enterprise, focusing on safety, compliance, and operational risks associated with LLMs and AI agents. The LLM Evaluation Engineer will develop the evaluation layer of the ThirdLaw platform, ensuring that LLM prompts and outputs adhere to enterprise policies through real-time evaluation logic and integration of various AI components.
Responsibilities:
- Design and build real-time evaluation logic that determines whether LLM prompts or outputs violate enterprise policies
- Implement evaluation strategies using a mix of semantic similarity, foundation model scoring, rule-based systems, and statistical checks
- Integrate model outputs with downstream enforcement actions (e.g., redaction, escalation, blocking)
- Prototype, tune, and productize small language models and prompt templates for classification, labeling, or scoring
- Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage layers
- Build tools to observe, debug, and improve evaluator performance across real-world data distributions
- Define abstractions for reusable evaluation components that can scale across use cases
Qualifications:
- 7+ years of experience in ML systems or AI engineering roles, with at least 1–2 years working directly with LLMs, NLP pipelines, or semantic search
- Deep understanding of foundation models (e.g., OpenAI, Claude, Mistral, Llama) and how to work with them via APIs or open source
- Hands-on experience with vector search (e.g., FAISS, Qdrant, Weaviate) and embeddings pipelines
- Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
- Strong in Python, with familiarity using libraries like Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
- Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production
Required Skills:
Machine learning systems, AI engineering, Large language models (LLMs), Natural language processing (NLP) pipelines, Semantic search, Foundation models, API integration, Open source foundation models, Vector search, Embeddings pipelines, Real-time evaluation logic, Semantic similarity, Classifier scoring, Rule-based systems, Python programming, Hugging Face Transformers, LangChain, PyTorch, TensorFlow, Model behavior analysis, Prompt configuration testing, Debugging production systems, Clear written communication
Ghost Score Breakdown
Posted 90+ days ago
+ pts: No salary info
+ pts: No company logo
+ pts: Known scam/ghost company
Reposted listing
Expired deadline
High job-to-employee ratio
Recruiting agency
Overall: 75/100 (Likely Ghost)
Application Tips
- Top skills mentioned: Python, machine learning, TensorFlow. Make sure your resume highlights these.
- This listing shows some ghost job indicators. Proceed with caution and verify the role is actively being filled.