Senior Software Development Engineer – LLM Inference Framework
FULL TIME
senior
Salary
No salary data
Ghost Score
Better than ~65% of jobs in the Engineering category
Freshness
Posted 2 weeks ago
Job Description
AMD is a company dedicated to building great products that accelerate next-generation computing experiences. The Senior Software Development Engineer will be responsible for building and optimizing production-grade inference runtimes for large language models on AMD GPUs, driving performance, scalability, and reliability.
Responsibilities:
- Architect and optimize distributed LLM inference runtimes based on in-house LLM engines or open-source stacks such as vLLM, SGLang, and llm-d
- Design and improve TP/PP/EP (MoE) hybrid execution, including KV-cache management, attention dispatch, and token scheduling
- Implement and optimize multi-node inference pipelines using RCCL, RDMA, and collective-based execution
- Drive throughput, latency, and memory efficiency across single-GPU and multi-GPU clusters
- Optimize continuous batching, speculative decoding, KV-cache paging, prefix caching, and multi-turn serving
- Work with AMD GPU libraries (AITER, hipBLASLt, RCCL, ROCm runtime) to ensure inference frameworks efficiently use FP8/FP4 GEMM and FlashAttention/MLA
- Collaborate with compiler teams (Triton, LLVM, ROCm) to unblock framework-level performance
- Upstream features and performance fixes into vLLM, SGLang, and llm-d
- Enable customer PoCs and production deployments on AMD platforms
- Build and maintain benchmark-grade inference pipelines
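The KV-cache paging responsibility above can be sketched as a toy block allocator: instead of reserving memory for a sequence's maximum length up front, cache space is handed out in fixed-size blocks from a shared pool. This is a minimal single-process sketch; the class name, block size, and method signatures are illustrative assumptions, not vLLM's actual API.

```python
class PagedKVCache:
    """Toy paged KV-cache: each sequence's tokens are mapped onto
    fixed-size blocks drawn from a shared free pool, so memory is
    reserved per block rather than per maximum sequence length."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens cached so far

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; return its physical block id."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full, or first token
            if not self.free:
                raise MemoryError("cache exhausted: preempt or evict a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.tables[seq_id][-1]

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

A 17-token sequence then occupies two 16-token blocks, and releasing it makes both immediately reusable by other requests; that reuse is what lets continuous batching pack many concurrent sequences into fixed GPU memory.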
Qualifications:
- Hands-on understanding of vLLM, SGLang, or similar inference stacks
- Experience with distributed inference scaling and a proven track record of contributing to upstream open-source projects
- Strong experience integrating optimized GPU performance into machine-learning frameworks (e.g., PyTorch, TensorFlow) for high-throughput and scalable inference
- Strong background in NVIDIA, AMD, or similar GPU architectures and kernel development
- Expertise in Python, preferably with C/C++ experience, including debugging, performance tuning, and test design for large-scale systems
- Experience running large-scale workloads on heterogeneous GPU clusters, optimizing for efficiency and scalability
- Understanding of compiler and runtime systems, including LLVM, ROCm, and GPU code generation
- Master's or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field
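The tensor-parallel (TP) execution named in this posting can be illustrated with a minimal single-process sketch: a weight matrix is split column-wise across simulated ranks, each rank computes its shard of the GEMM, and an "all-gather" concatenates the partial outputs. The function names and list-based math below are illustrative assumptions, not any framework's API.

```python
def matmul(x, w):
    """Dense x (m x k) @ w (k x n) on plain Python lists."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_columns(w, ranks):
    """Shard w column-wise into `ranks` roughly equal pieces."""
    n = len(w[0])
    step = (n + ranks - 1) // ranks
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(ranks)]

def column_parallel_matmul(x, w, ranks=2):
    """Each simulated rank multiplies its own column shard; the
    final concatenation plays the role of an all-gather collective."""
    partials = [matmul(x, shard) for shard in split_columns(w, ranks)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

Column-parallel layers need no communication on the input side; in a real runtime the concluding all-gather (or the decision to keep shards local for a following row-parallel layer) is implemented with RCCL/NCCL collectives, which is where the RCCL and RDMA skills above come in.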
Required Skills:
Large language model (LLM) inference frameworks, Distributed inference runtimes, Tensor parallelism, Pipeline parallelism, Expert parallelism (MoE), GPU runtime, Kernel development, Performance optimization, Scalability optimization, RCCL, RDMA, FP8/FP4 GEMM, FlashAttention, MLA, Python, C/C++, Debugging, Performance tuning, Test design, Machine-learning framework integration (PyTorch, TensorFlow), GPU architectures (NVIDIA, AMD), High-performance computing on GPU clusters, Compiler and runtime systems (LLVM, ROCm), GPU code generation
Ghost Score Breakdown
- No salary (mandate state violation)
- + pts: No company logo
- + pts: Fresh posting (4-7 days)
- + pts: Known scam/ghost company
- Reposted listing
- Expired deadline
- High job-to-employee ratio
- Recruiting agency

Overall: 17/100 (Low Ghost Risk)
Application Tips
- Top skills mentioned: python, machine_learning, tensorflow. Make sure your resume highlights these.
- This listing shows strong signals of being a real opportunity — apply with confidence.