Senior Software Development Engineer – LLM Inference Framework

Full-time · Senior

Salary: No salary data

Ghost Score: Better than ~65% of Engineering jobs in category

Freshness: Posted 2 weeks ago

Job Description

AMD is a company dedicated to building great products that accelerate next-generation computing experiences. The Senior Software Development Engineer will build and optimize production-grade inference runtimes for large language models on AMD GPUs, driving performance, scalability, and reliability.

Responsibilities:

  • Architect and optimize distributed LLM inference runtimes based on in-house LLM engines or open-source stacks such as vLLM, SGLang, and llm-d
  • Design and improve TP / PP / EP (MoE) hybrid execution, including KV-cache management, attention dispatch, and token scheduling
  • Implement and optimize multi-node inference pipelines using RCCL, RDMA, and collective-based execution
  • Drive throughput, latency, and memory efficiency across single-GPU and multi-GPU clusters
  • Optimize continuous batching, speculative decoding, KV-cache paging, prefix caching, and multi-turn serving
  • Work with AMD GPU libraries (AITER, hipBLASLt, RCCL, ROCm runtime) to ensure inference frameworks efficiently use FP8 / FP4 GEMM and FlashAttention / MLA
  • Collaborate with compiler teams (Triton, LLVM, ROCm) to unblock framework-level performance
  • Upstream features and performance fixes into vLLM, SGLang, and llm-d
  • Enable customer PoCs and production deployments on AMD platforms
  • Build and maintain benchmark-grade inference pipelines

Qualifications:

  • Hands-on understanding of vLLM, SGLang, or similar inference stacks
  • Experience with distributed inference scaling and a proven track record of contributing to upstream open-source projects
  • Strong experience integrating optimized GPU performance into machine-learning frameworks (e.g., PyTorch, TensorFlow) for high-throughput, scalable inference
  • Strong background in NVIDIA, AMD, or similar GPU architectures and kernel development
  • Expertise in Python and, preferably, experience in C/C++, including debugging, performance tuning, and test design for large-scale systems
  • Experience running large-scale workloads on heterogeneous GPU clusters, optimizing for efficiency and scalability
  • Understanding of compiler and runtime systems, including LLVM, ROCm, and GPU code generation
  • Master's or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field

Required Skills: Large language model (LLM) inference frameworks; distributed inference runtimes; tensor, pipeline, and expert (MoE) parallelism; GPU runtime and kernel development; performance and scalability optimization; RCCL; RDMA; FP8/FP4 GEMM; FlashAttention; MLA; Python; C/C++; debugging; performance tuning; test design; ML framework integration (PyTorch, TensorFlow); NVIDIA and AMD GPU architectures; high-performance computing on GPU clusters; compilers and runtime systems (LLVM, ROCm, GPU code generation)

Ghost Score Breakdown

Point-adding factors:

  • No salary (mandate state violation)
  • No company logo
  • Fresh posting (4-7 days)

Factors checked but not flagged:

  • Known scam/ghost company
  • Reposted listing
  • Expired deadline
  • High job-to-employee ratio
  • Recruiting agency

Overall: 17/100 (Low Ghost Risk)

Application Tips

  • Top skills mentioned: python, machine_learning, tensorflow. Make sure your resume highlights these.
  • This listing shows strong signals of being a real opportunity — apply with confidence.
