LLM Dataset Engineer

Full-time
Mid-level

Salary

No salary data

Ghost Score

Worse than category average

Freshness

Posted 2 months ago

Job Description

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a high-efficiency serving platform. They are seeking a highly technical LLM Dataset Engineer to lead the strategy, creation, and curation of the massive datasets that power their foundation models, ensuring world-class performance in reasoning, safety, and multimodal understanding.

Responsibilities:

  • Own the end-to-end creation of pre-training datasets for LLMs.
  • Design and implement sophisticated pipelines for data cleaning, exact/fuzzy deduplication, and high-quality signal extraction from petabytes of raw, unstructured data.
  • Lead the development of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference-modeling data (RLHF/DPO).
  • Drive the acquisition and processing of vision and video data, navigating the complexities of multimodal alignment, video compression, and temporal data consistency.
  • Develop high-throughput data processing scripts in Python, leveraging multiprocessing and multithreading to handle massive-scale ingestion and transformation without bottlenecks.
  • Conduct deep-dive statistical analysis of training corpora to identify biases, gaps in knowledge, and quality regressions, ensuring the model's training "diet" is well balanced.
  • Design pipelines that generate high-reasoning synthetic data to fill gaps in natural datasets, using existing models for data labeling and refinement.

Qualifications:

  • 5+ years of industry experience in Data Science or Machine Learning, with a proven track record of building and managing datasets for foundation models.
  • Deep proficiency in Python: expert-level skills with a focus on high-performance code, including multiprocessing, multithreading, and efficient memory management for large-scale data tasks.
  • Petabyte-scale experience: demonstrated experience with petabyte-scale datasets that have been directly used to train production-grade LLMs or Large Vision Models.
  • Dataset reconstruction: experience building massive LLM training sets from scratch, including raw web crawls (e.g., Common Crawl) and specialized domain data.
  • Post-training expertise: hands-on experience building datasets for RLHF, DPO, and multi-turn instruction following, including management of human-labeling workflows and quality gold sets.
  • Data tooling: mastery of data-at-scale frameworks such as Spark or Ray, and high-performance data-loading formats (e.g., WebDataset, Parquet).

Required Skills: Data Science, Machine Learning, Python, Multiprocessing, Multithreading, Memory Management, Petabyte-Scale Data Handling, Dataset Reconstruction, Web Crawling, Post-Training Dataset Creation, RLHF, DPO, Human Labeling Workflow Management, Data Frameworks - Spark, Data Frameworks - Ray, Data Loading Formats - WebDataset, Data Loading Formats - Parquet
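If you are preparing for a role like this, the data-cleaning responsibilities above can be illustrated with a minimal sketch: exact deduplication by content hash, parallelized with Python's `multiprocessing`. This is not Sciforium's pipeline; all function names here are illustrative, and production systems would add fuzzy (e.g., MinHash-based) deduplication on top.

```python
import hashlib
from multiprocessing import Pool


def content_hash(doc: str) -> str:
    # Normalize whitespace so trivially different copies collide.
    normalized = " ".join(doc.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def exact_dedup(docs: list[str], workers: int = 4) -> list[str]:
    # Hash documents in a worker pool, then keep the first document per hash.
    with Pool(workers) as pool:
        hashes = pool.map(content_hash, docs)
    seen, unique = set(), []
    for doc, h in zip(docs, hashes):
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    corpus = ["hello  world", "hello world", "goodbye"]
    # The two whitespace variants collapse to one document.
    print(exact_dedup(corpus))
```

At petabyte scale the same idea runs as a distributed shuffle-and-reduce in Spark or Ray rather than a single-machine pool, but the hash-then-keep-first structure is unchanged.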

Ghost Score Breakdown

  • Posted 60-89 days ago (+ pts)
  • No salary (mandate state violation) (+ pts)
  • No company logo (+ pts)
  • Known scam/ghost company
  • Reposted listing
  • Expired deadline
  • High job-to-employee ratio
  • Recruiting agency

Overall: 55/100 (Suspicious)

Application Tips

  • Top skills mentioned: python, machine_learning, spark. Make sure your resume highlights these.
  • This listing shows some ghost job indicators. Proceed with caution and verify the role is actively being filled.
