I write GPU kernels and build inference systems, working from memory-bandwidth-bound CUDA ops on H100s up through paged KV-cache allocators and continuous-batching schedulers. I care about making models run fast at every layer of the stack.
Currently: Math & CS at Stanford · GPU kernels at Hazy Research · building an LLM inference engine from scratch · ML systems & inference infra.
About
I'm obsessed with making models run fast. That means writing CUDA kernels and reasoning carefully about memory hierarchies: HBM bandwidth, coalescing, shared-memory layout, bank conflicts. At Hazy Research, I'm writing custom kernels for ThunderKittens, including a speed-of-light RMSNorm kernel for H100. I'm also building AgentServe, an LLM inference engine implemented from scratch: Llama 3.2 from the ground up (RMSNorm, RoPE, grouped-query attention, SwiGLU), a paged KV-cache block allocator, and a continuous-batching scheduler with agent-aware scheduling policies. I want to go deeper on inference: kernel fusion, speculative decoding, disaggregated serving. There's a lot of stack left to understand.
Outside of work, I'm a huge football and basketball fan and a proud Wisconsinite. I love reading and am working on getting better at it; I'll be documenting my reading journey and insights on this website.
Selected Work
GPU Kernels for Modern DL
Writing custom CUDA kernels for ThunderKittens, a research-grade GPU kernel framework. Focused on memory-bandwidth-bound ops for LLM inference: tiled kernels, shared-memory orchestration, and warp-level primitives, all profiled with Nsight Compute to push each kernel toward speed-of-light.
FiberFold: Predicting 3D Chromatin Organization
Worked on improving a deep learning model combining CNNs and transformers to predict 3D genome organization. Focused on model evaluation and refinement using Hi-C maps, developing metrics to assess prediction accuracy across different genomic regions. Also tackled other challenging genomics problems in the lab, working with single-molecule sequencing data to understand chromatin structure and function.
Agentic Systems for Enterprise
Built end-to-end agentic workflows for enterprise systems like Salesforce and Workday. Developed a vendor-agnostic multi-agent platform that provides a single view across IT architecture with enterprise-grade governance and secure execution.
Projects
tk-rmsnorm: Speed-of-Light RMSNorm Kernel
An RMSNorm kernel written in ThunderKittens-style primitives, targeting near-peak HBM bandwidth on H100. RMSNorm is purely memory-bandwidth-bound at decode, which makes it a clean target for speed-of-light engineering: vectorized 128-bit loads, swizzled shared-memory staging for gamma, fp32 accumulators for the reduction, and a fused residual variant. It achieves ~80% of peak HBM bandwidth in the prefill regime.
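For readers unfamiliar with the op, here is a minimal reference sketch of what the kernel computes, plus the arithmetic behind a "fraction of peak bandwidth" figure. It is not the kernel itself: the function names, the `eps` default, and the one-read-one-write traffic model are assumptions made for illustration, and Python floats stand in for the fp32 accumulators.

```python
import math

def rmsnorm(x, gamma, eps=1e-6, residual=None):
    """Reference RMSNorm over one row (hypothetical helper, not the CUDA kernel)."""
    # Assumed form of the fused residual variant: add the residual before normalizing.
    if residual is not None:
        x = [a + b for a, b in zip(x, residual)]
    # Accumulate the sum of squares in high precision (fp32 on the GPU, fp64 here).
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

def bandwidth_fraction(rows, hidden, bytes_per_elem, seconds, peak_bytes_per_s):
    """Achieved / peak bandwidth, assuming x is read once and y written once."""
    # gamma traffic is ignored here; it is small and reused across rows.
    bytes_moved = 2 * rows * hidden * bytes_per_elem
    return bytes_moved / seconds / peak_bytes_per_s

y = rmsnorm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

Because the op does only a few FLOPs per element loaded, the time it takes is set almost entirely by `bytes_moved`, which is why the fraction of peak bandwidth is the natural metric for it.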
AgentServe: Agent-Aware LLM Inference
An OpenAI-compatible LLM inference engine built from scratch in Python. Llama 3.2 implemented from the ground up (RMSNorm, RoPE, grouped-query attention, SwiGLU), a paged KV-cache block allocator, and a continuous-batching scheduler with agent-aware policies that reorder requests to unblock tool calls in agent DAGs. Benchmarked against FIFO baselines and vLLM.
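To give a feel for the paged KV-cache idea, here is a minimal sketch of a block allocator. It is illustrative only and not AgentServe's actual code: the class name, method names, and free-list policy are all assumptions. Each sequence holds a table of fixed-size physical blocks, and a new block is claimed only when the current one fills up.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator (hypothetical; names are illustrative)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        # Free list of physical block ids; pop() hands out the lowest id first.
        self.free = list(range(num_blocks - 1, -1, -1))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:
            # The last block is full (or the sequence is new): claim a fresh one.
            if not self.free:
                raise MemoryError("out of KV-cache blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.tables[seq_id][-1], n % self.block_size

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free.extend(reversed(self.tables.pop(seq_id, [])))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=2, block_size=2)
slots = [alloc.append_token("a") for _ in range(3)]
```

Since blocks are claimed on demand rather than reserved for the maximum sequence length, a scheduler can admit a new request whenever the free list is non-empty and reclaim memory as soon as a request finishes.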
Priori AI
Local intelligence for clinical confidence. Simplifies prior authorization by bringing AI to the point of care: it runs locally on physicians' devices for maximum privacy while dramatically improving insurance approval workflows.
LiDAR Sim2Real Translation
Range-view diffusion model for translating synthetic LiDAR data to real-world distributions. Physics-aware approach incorporating beam angles, dropout patterns, and intensity falloff for improved sim-to-real transfer.
Paper Notes
Paper notes coming soon...
Writing
More writing coming soon...
Contact
The best way to reach me is by email. I'm always open to talking about GPU programming, LLM inference infrastructure, ML systems, and interesting collaborations.
Email: krishs04@stanford.edu