Senior LLM Evaluation & Reinforcement Fine-Tuning Engineer

Pune, Maharashtra, India

1 month ago

Amazon, Design, Analysis, Experimental design, Statistics

Salary Not Disclosed

Job Description

Company Description

Genrise is a leading ecommerce content agent that specializes in identifying content gaps, creating high-performing product copy, and tailoring it for every marketplace. We deliver on-brand content for platforms like Amazon, Walmart, and Target, 10x faster. Our innovative approach ensures top-ranking content, making us a preferred choice for ecommerce businesses.

Role Description

We're looking for a hands-on technical expert who has actually written evals for large language models and has direct experience with reinforcement fine-tuning (e.g., RLHF, RLAIF, or RFT variants). You'll split your time between building/owning our LLM evaluation stack, leveraging best practices in experimental design, measurement, and trustworthy deployment. If you love turning fuzzy product goals into measurable evaluations, care deeply about scientific rigor, and enjoy building cool tech, this is for you.

What you'll do

- Design, implement, and maintain robust evaluation suites for LLMs (task- and domain-specific; regression and exploratory).
- Lead or contribute to reinforcement fine-tuning projects (reward modeling, preference data pipelines, safety/quality constraints, offline/online tuning loops).
- Define success metrics, sampling strategies, and statistical tests; ensure reproducibility and leakage prevention.
- Build data generation and curation pipelines for evals (human + synthetic), including rubric design and inter-annotator agreement.
- Partner with research, product, and infra to ship models with quantifiable improvements and clear trade-off documentation.
- Teach & mentor: run workshops, code walkthroughs, and evaluations office hours; raise the scientific bar across the org.
- Write clear experiment reports and decision memos; contribute to internal best-practice guides.

What we're looking for (must-haves)

- Recent, hands-on experience delivering 1–2+ real projects where you authored LLM evals end-to-end (design → implementation → analysis).
- Demonstrated experience with reinforcement fine-tuning for LLMs (RLHF/RLAIF/RFT): reward modeling, preference data, or policy optimization.
- Strong scientific foundation: experimental design, statistics, hypothesis testing, error analysis.
- Machine learning depth: transformers, tokenization, finetuning, sampling/decoding, data quality, overfitting/leakage controls.
- Proficiency with Python, PyTorch/JAX, and common LLM tooling (HF, vLLM, Triton, Ray/SLURM, Weights & Biases, etc.).
- Excellent written and verbal communication; proven ability to teach and mentor engineers/researchers.

Nice to have

- Safety evals, hallucination/robustness/red-teaming experience.
- Evaluation of tool use/agents, code generation, retrieval-augmented tasks.
- Knowledge of ranking/recommender systems or bandits.
- Infra for eval orchestration (sharding, caching, dataset versioning).
- Contributions to open-source eval frameworks or benchmark leaderboards.