LLM Dataset Engineer

Full-time
Mid-level

Salary

No salary data

Ghost Score

Worse than category average

Freshness

Posted 2 months ago

Job Description

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a high-efficiency serving platform. They are seeking a highly technical LLM Dataset Engineer to lead the strategy, creation, and curation of the massive datasets that power their foundation models, ensuring world-class performance in reasoning, safety, and multimodal understanding.

Responsibilities:

  • Own the end-to-end creation of pre-training datasets for LLMs.
  • Design and implement sophisticated pipelines for data cleaning, exact/fuzzy deduplication, and high-quality signal extraction from petabytes of raw, unstructured data.
  • Lead the development of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference-modeling data (RLHF/DPO).
  • Drive the acquisition and processing of vision and video data, navigating the complexities of multimodal alignment, video compression, and temporal data consistency.
  • Develop high-throughput data processing scripts in Python, leveraging multiprocessing and multithreading to handle massive-scale ingestion and transformation without bottlenecks.
  • Conduct deep-dive statistical analysis of training corpora to identify biases, gaps in knowledge, and quality regressions, ensuring the model's training "diet" is well balanced.
  • Design pipelines that generate high-reasoning synthetic data to fill gaps in natural datasets, using existing models for data labeling and refinement.

Qualifications:

  • 5+ years of industry experience in Data Science or Machine Learning, with a proven track record of building and managing datasets for foundation models.
  • Deep proficiency in Python: expert-level skills with a focus on high-performance code, including multiprocessing, multithreading, and efficient memory management for large-scale data tasks.
  • Petabyte-scale experience: demonstrated experience with petabyte-scale datasets that have been directly used to train production-grade LLMs or Large Vision Models.
  • Dataset reconstruction: experience building massive LLM training sets from scratch, including raw web crawls (e.g., Common Crawl) and specialized domain data.
  • Post-training expertise: hands-on experience building datasets for RLHF, DPO, and multi-turn instruction following, including management of human-labeling workflows and quality gold sets.
  • Data tooling: mastery of data-at-scale frameworks such as Spark or Ray, and high-performance data-loading formats (e.g., WebDataset, Parquet).

Required Skills: Data Science, Machine Learning, Python, Multiprocessing, Multithreading, Memory Management, Petabyte-Scale Data Handling, Dataset Reconstruction, Web Crawling, Post-Training Dataset Creation, RLHF, DPO, Human Labeling Workflow Management, Data Frameworks - Spark, Data Frameworks - Ray, Data Loading Formats - WebDataset, Data Loading Formats - Parquet
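If you are preparing for a role like this, the data-cleaning responsibilities above can be illustrated with a minimal sketch: exact deduplication by content hash, parallelized with Python's `multiprocessing`. This is not Sciforium's pipeline; all function names here are illustrative, and production systems would add fuzzy (e.g., MinHash-based) deduplication on top.

```python
import hashlib
from multiprocessing import Pool


def content_hash(doc: str) -> str:
    # Normalize whitespace so trivially different copies collide.
    normalized = " ".join(doc.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def exact_dedup(docs: list[str], workers: int = 4) -> list[str]:
    # Hash documents in a worker pool, then keep the first document per hash.
    with Pool(workers) as pool:
        hashes = pool.map(content_hash, docs)
    seen, unique = set(), []
    for doc, h in zip(docs, hashes):
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    corpus = ["hello  world", "hello world", "goodbye"]
    # The two whitespace variants collapse to one document.
    print(exact_dedup(corpus))
```

At petabyte scale the same idea runs as a distributed shuffle-and-reduce in Spark or Ray rather than a single-machine pool, but the hash-then-keep-first structure is unchanged.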

Ghost Score Breakdown

  • Posted 60-89 days ago (+ pts)
  • No salary (mandate state violation) (+ pts)
  • No company logo (+ pts)
  • Known scam/ghost company
  • Reposted listing
  • Expired deadline
  • High job-to-employee ratio
  • Recruiting agency

Overall: 55/100 (Suspicious)

Application Tips

  • Top skills mentioned: python, machine_learning, spark. Make sure your resume highlights these.
  • This listing shows some ghost job indicators. Proceed with caution and verify the role is actively being filled.
