LLM Evaluation Engineer

Full-time
Mid-level

Salary

No salary data


Ghost Score

Worse than category average


Freshness

Posted 5 months ago

Job Description

ThirdLaw is building the control layer for AI in the enterprise, focusing on safety, compliance, and operational risks associated with LLMs and AI agents. The LLM Evaluation Engineer will develop the evaluation layer of the ThirdLaw platform, ensuring that LLM prompts and outputs adhere to enterprise policies through real-time evaluation logic and the integration of various AI components.

Responsibilities:

  • Design and build real-time evaluation logic that determines whether LLM prompts or outputs violate enterprise policies
  • Implement evaluation strategies using a mix of semantic similarity, foundation model scoring, rule-based systems, and statistical checks
  • Integrate model outputs with downstream enforcement actions (e.g., redaction, escalation, blocking)
  • Prototype, tune, and productize small language models and prompt templates for classification, labeling, or scoring
  • Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage layers
  • Build tools to observe, debug, and improve evaluator performance across real-world data distributions
  • Define abstractions for reusable evaluation components that can scale across use cases

Qualifications:

  • 7+ years of experience in ML systems or AI engineering roles, with at least 1–2 years working directly with LLMs, NLP pipelines, or semantic search
  • Deep understanding of foundation models (e.g., OpenAI, Claude, Mistral, Llama) and how to work with them via APIs or open source
  • Hands-on experience with vector search (e.g., FAISS, Qdrant, Weaviate) and embeddings pipelines
  • Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
  • Strong in Python, with familiarity using libraries like Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
  • Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production

Required Skills: Machine learning systems, AI engineering, large language models (LLMs), natural language processing (NLP) pipelines, semantic search, foundation models, API integration, open-source foundation models, vector search, embeddings pipelines, real-time evaluation logic, semantic similarity, classifier scoring, rule-based systems, Python programming, Hugging Face Transformers, LangChain, PyTorch, TensorFlow, model behavior analysis, prompt configuration testing, debugging production systems, clear written communication
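For candidates gauging what "evaluation logic mixing semantic similarity and rule-based checks" might look like in practice, here is a minimal, hypothetical sketch (not ThirdLaw's actual implementation). It layers cheap regex rules in front of a cosine-similarity check against policy exemplar embeddings; the exemplar names, the toy 3-d vectors, and the `evaluate` function are all illustrative assumptions, and a real system would use embeddings from an actual model.

```python
import re
from math import sqrt

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical policy exemplars: embeddings of known-violating text.
# In practice these would come from a real embedding model; here they
# are toy 3-d vectors purely for illustration.
POLICY_EXEMPLARS = {
    "pii_leak": [0.9, 0.1, 0.0],
    "credential_request": [0.1, 0.9, 0.0],
}

# Rule-based layer: cheap pattern checks that short-circuit the pipeline.
RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "pii_leak"),            # SSN-like
    (re.compile(r"(?i)\bpassword\s*[:=]"), "credential_request"),  # credential
]

def evaluate(text, embedding, threshold=0.8):
    """Return (verdict, policy): 'block' with the violated policy, or ('allow', None)."""
    for pattern, policy in RULES:
        if pattern.search(text):
            return "block", policy
    for policy, exemplar in POLICY_EXEMPLARS.items():
        if cosine(embedding, exemplar) >= threshold:
            return "block", policy
    return "allow", None
```

The layering mirrors the listing's emphasis on real-time operation: rules run in microseconds and catch unambiguous violations, while the embedding comparison handles paraphrased or fuzzier cases, e.g. `evaluate("my password: hunter2", [0.0, 0.0, 1.0])` blocks on the rule layer alone.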

Ghost Score Breakdown

  • Posted 90+ days ago (+ pts)
  • No salary info (+ pts)
  • No company logo (+ pts)
  • Known scam/ghost company
  • Reposted listing
  • Expired deadline
  • High job-to-employee ratio
  • Recruiting agency

Overall: 75/100 (Likely Ghost)

Application Tips

  • Top skills mentioned: Python, machine learning, TensorFlow. Make sure your resume highlights these.
  • This listing shows some ghost job indicators. Proceed with caution and verify the role is actively being filled.
