LLM Evaluation Engineer

Full-time
Mid-level

Salary

No salary data


Ghost Score

Worse than category average


Freshness

Posted 5 months ago

Job Description

ThirdLaw is building the control layer for AI in the enterprise, focusing on safety, compliance, and operational risks associated with LLMs and AI agents. The LLM Evaluation Engineer will develop the evaluation layer of the ThirdLaw platform, ensuring that LLM prompts and outputs adhere to enterprise policies through real-time evaluation logic and the integration of various AI components.

Responsibilities:

  • Design and build real-time evaluation logic that determines whether LLM prompts or outputs violate enterprise policies
  • Implement evaluation strategies using a mix of semantic similarity, foundation model scoring, rule-based systems, and statistical checks
  • Integrate model outputs with downstream enforcement actions (e.g., redaction, escalation, blocking)
  • Prototype, tune, and productize small language models and prompt templates for classification, labeling, or scoring
  • Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage layers
  • Build tools to observe, debug, and improve evaluator performance across real-world data distributions
  • Define abstractions for reusable evaluation components that can scale across use cases

Qualifications:

  • 7+ years of experience in ML systems or AI engineering roles, with at least 1–2 years working directly with LLMs, NLP pipelines, or semantic search
  • Deep understanding of foundation models (e.g., OpenAI, Claude, Mistral, Llama) and how to work with them via APIs or open source
  • Hands-on experience with vector search (e.g., FAISS, Qdrant, Weaviate) and embeddings pipelines
  • Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
  • Strong in Python, with familiarity using libraries like Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
  • Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production

Required Skills: Machine learning systems, AI engineering, large language models (LLMs), natural language processing (NLP) pipelines, semantic search, foundation models, API integration, open-source foundation models, vector search, embeddings pipelines, real-time evaluation logic, semantic similarity, classifier scoring, rule-based systems, Python programming, Hugging Face Transformers, LangChain, PyTorch, TensorFlow, model behavior analysis, prompt configuration testing, debugging production systems, clear written communication
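For candidates gauging what "evaluation logic mixing semantic similarity and rule-based checks" might look like in practice, here is a minimal, hypothetical sketch (not ThirdLaw's actual implementation). It layers cheap regex rules in front of a cosine-similarity check against policy exemplar embeddings; the exemplar names, the toy 3-d vectors, and the `evaluate` function are all illustrative assumptions, and a real system would use embeddings from an actual model.

```python
import re
from math import sqrt

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical policy exemplars: embeddings of known-violating text.
# In practice these would come from a real embedding model; here they
# are toy 3-d vectors purely for illustration.
POLICY_EXEMPLARS = {
    "pii_leak": [0.9, 0.1, 0.0],
    "credential_request": [0.1, 0.9, 0.0],
}

# Rule-based layer: cheap pattern checks that short-circuit the pipeline.
RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "pii_leak"),            # SSN-like
    (re.compile(r"(?i)\bpassword\s*[:=]"), "credential_request"),  # credential
]

def evaluate(text, embedding, threshold=0.8):
    """Return (verdict, policy): 'block' with the violated policy, or ('allow', None)."""
    for pattern, policy in RULES:
        if pattern.search(text):
            return "block", policy
    for policy, exemplar in POLICY_EXEMPLARS.items():
        if cosine(embedding, exemplar) >= threshold:
            return "block", policy
    return "allow", None
```

The layering mirrors the listing's emphasis on real-time operation: rules run in microseconds and catch unambiguous violations, while the embedding comparison handles paraphrased or fuzzier cases, e.g. `evaluate("my password: hunter2", [0.0, 0.0, 1.0])` blocks on the rule layer alone.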

Ghost Score Breakdown

  • Posted 90+ days ago (+ pts)
  • No salary info (+ pts)
  • No company logo (+ pts)
  • Known scam/ghost company
  • Reposted listing
  • Expired deadline
  • High job-to-employee ratio
  • Recruiting agency

Overall: 75/100 (Likely Ghost)

Application Tips

  • Top skills mentioned: Python, machine learning, TensorFlow. Make sure your resume highlights these.
  • This listing shows some ghost job indicators. Proceed with caution and verify the role is actively being filled.
