Krish Sharma – ML & Systems

I write GPU kernels and build inference systems, working from memory-bandwidth-bound CUDA ops on H100s up through paged KV-cache allocators and continuous-batching schedulers. I care about making models run fast at every layer of the stack.

Currently: Math & CS at Stanford · GPU kernels at Hazy Research · building an LLM inference engine from scratch · ML systems & inference infra.

GitHub · LinkedIn

About

I'm obsessed with making models run fast. That means writing CUDA kernels and reasoning carefully about memory hierarchies: HBM bandwidth, coalescing, shared-memory layout, bank conflicts. At Hazy Research, I'm writing custom kernels for ThunderKittens, including a speed-of-light RMSNorm kernel for H100. I'm also building AgentServe, an LLM inference engine implemented from scratch: Llama 3.2 from the ground up (RMSNorm, RoPE, grouped-query attention, SwiGLU), a paged KV-cache block allocator, and a continuous-batching scheduler with agent-aware scheduling policies. I want to go deeper on inference: kernel fusion, speculative decoding, disaggregated serving. There's a lot of stack left to understand.

Outside of work, I'm a huge football and basketball fan and a proud Wisconsinite. I love reading and am working on getting better at it; I'll be documenting my reading journey and insights on this website.

Selected Work

GPU Kernels for Modern DL

Stanford · Hazy Research

Writing custom CUDA kernels for ThunderKittens, a research-grade GPU kernel framework. Focused on memory-bandwidth-bound ops for LLM inference, using tiled kernels, shared-memory orchestration, and warp-level primitives, and profiling with Nsight Compute to push kernels toward speed-of-light.

FiberFold: Predicting 3D Chromatin Organization

Stanford Medicine · Altemose Lab

Worked on improving a deep learning model combining CNNs and transformers to predict 3D genome organization. Focused on model evaluation and refinement using Hi-C maps, developing metrics to assess prediction accuracy across different genomic regions. Also tackled other challenging genomics problems in the lab, working with single-molecule sequencing data to understand chromatin structure and function.

Agentic Systems for Enterprise

Tessera Labs · Engineering

Built end-to-end agentic workflows for enterprise systems like Salesforce and Workday. Developed a vendor-agnostic multi-agent platform that provides a single view across IT architecture with enterprise-grade governance and secure execution.

Projects

tk-rmsnorm: Speed-of-Light RMSNorm Kernel

CUDA · ThunderKittens · H100

An RMSNorm kernel written in ThunderKittens-style primitives, targeting near-peak HBM bandwidth on H100. RMSNorm is purely memory-bandwidth-bound at decode, which makes it a clean target for speed-of-light engineering: vectorized 128-bit loads, swizzled shared-memory staging for gamma, fp32 accumulators for the reduction, and a fused residual variant. Achieves ~80% of peak HBM bandwidth in the prefill regime.
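For readers new to the op, here is a minimal plain-Python reference for what the kernel computes and how a fraction-of-peak figure can be derived. It is a sketch, not the kernel itself. The function names, the eps default, the add-then-normalize definition of the fused residual variant, and the byte-counting model (one read of x, one write of y, one read of gamma) are all assumptions made for illustration.

```python
import math


def rmsnorm(x, gamma, eps=1e-6):
    """Reference RMSNorm over one row: y_i = x_i / sqrt(mean(x^2) + eps) * gamma_i.

    Python floats are double precision; a GPU kernel would typically
    accumulate the sum of squares in fp32 even for 16-bit inputs.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


def fused_residual_rmsnorm(x, residual, gamma, eps=1e-6):
    """Assumed fused variant: add the residual, then normalize the sum.

    Returns (normalized, updated_residual) so the sum is only formed once.
    """
    h = [a + b for a, b in zip(x, residual)]
    return rmsnorm(h, gamma, eps), h


def hbm_fraction(rows, hidden, bytes_per_elem, elapsed_s, peak_bytes_per_s):
    """Achieved fraction of peak bandwidth under a simple traffic model.

    Counts one read of x, one write of y, and one read of gamma.
    """
    bytes_moved = (2 * rows * hidden + hidden) * bytes_per_elem
    return bytes_moved / elapsed_s / peak_bytes_per_s
```

Because the op does very little arithmetic per byte, the ratio returned by hbm_fraction is the natural scoreboard: the kernel time is compared against the time needed just to move the data.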

AgentServe: Agent-Aware LLM Inference

LLM Inference · Systems · Scheduling

An OpenAI-compatible LLM inference engine built from scratch in Python. Llama 3.2 implemented from the ground up (RMSNorm, RoPE, grouped-query attention, SwiGLU), a paged KV-cache block allocator, and a continuous-batching scheduler with agent-aware policies that reorder requests to unblock tool calls in agent DAGs. Benchmarked against FIFO baselines and vLLM.
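To make the paged KV-cache idea concrete, here is a minimal sketch of a block allocator. It is not AgentServe's actual code: the class name, method names, and the free-list design are hypothetical, chosen only to show how sequences map to fixed-size physical blocks through a per-sequence block table.

```python
class BlockAllocator:
    """Hypothetical paged KV-cache allocator: fixed-size blocks from a free list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        # Reversed so that pop() hands out block 0 first.
        self.free = list(range(num_blocks - 1, -1, -1))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:
            # The sequence has no block yet, or its last block is full.
            if not self.free:
                raise MemoryError("out of KV-cache blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.tables[seq_id][-1], n % self.block_size

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free.extend(reversed(self.tables.pop(seq_id, [])))
        self.lengths.pop(seq_id, None)
```

A scheduler built on something like this can admit a new request only when enough free blocks exist, and can reclaim memory the moment a sequence finishes, which is what makes continuous batching practical.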

Priori AI

Healthcare · AI · EHR Integration

Local intelligence for clinical confidence. Simplifies prior authorization by bringing AI to the point of care: it runs locally on physicians' devices for maximum privacy while dramatically improving insurance approval workflows.

LiDAR Sim2Real Translation

Computer Vision · Diffusion Models · LiDAR

Range-view diffusion model for translating synthetic LiDAR data to real-world distributions. Physics-aware approach incorporating beam angles, dropout patterns, and intensity falloff for improved sim-to-real transfer.
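For context on the range-view representation, here is a minimal sketch of the standard spherical projection that turns a LiDAR point cloud into a 2D range image. It is illustrative only and not the project's code: the function name, the image size, and the field-of-view parameters are assumptions.

```python
import math


def project_to_range_view(points, height, width, fov_up_deg, fov_down_deg):
    """Project (x, y, z) points into a height x width range image.

    Rows index elevation (top row = fov_up), columns index azimuth.
    Cells with no return hold 0.0; on collisions the nearest point wins.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)  # negative when below the horizon
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # skip degenerate points at the sensor origin
        yaw = math.atan2(y, x)
        pitch = math.asin(z / r)
        col = int(0.5 * (1.0 - yaw / math.pi) * width)
        row = int((fov_up - pitch) / fov * height)
        # Clamp points that land on or just outside the image bounds.
        col = min(max(col, 0), width - 1)
        row = min(max(row, 0), height - 1)
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r
    return image
```

In this layout each image row corresponds roughly to one beam elevation, which is why effects such as per-beam dropout and range-dependent intensity falloff are convenient to model in range view.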

Paper Notes

Paper notes coming soon...

Writing

More writing coming soon...

Contact

The best way to reach me is by email. I'm always open to talking about GPU programming, LLM inference infrastructure, ML systems, and interesting collaborations.

Email: krishs04@stanford.edu