@bbrowning
bbrowning / proxy_guidelines.md
Last active May 7, 2026 13:00
Guidelines for proxies between clients and vLLM

vLLM Proxy Implementation Guidelines

1. Introduction

Problem Statement

Many projects and products implement network proxies that sit between clients and vLLM, including:

  • Content filtering and guardrail systems - Safety and compliance enforcement
  • Traffic management and routing layers - Load balancing and service mesh integration
@bbrowning
bbrowning / chat_template_gemma_large_fixed.jinja
Created April 17, 2026 00:24
gemma 4 chat template that works with opencode - download the .jinja file and tell vllm to use it via `--chat-template chat_template_gemma_large_fixed.jinja`
{%- macro format_parameters(properties, required) -%}
{%- set standard_keys = ['description', 'type', 'properties', 'required', 'nullable'] -%}
{%- set ns = namespace(found_first=false) -%}
{%- for key, value in properties | dictsort -%}
{%- set add_comma = false -%}
{%- if key not in standard_keys -%}
{%- if ns.found_first %},{% endif -%}
{%- set ns.found_first = true -%}
{{ key }}:{
{%- if value['description'] -%}
@bbrowning
bbrowning / 0_instructions.md
Last active April 14, 2026 01:47
A simple proxy to convert a non-streaming Chat Completions request into a streaming one

Simple non-streaming to streaming Chat Completions Proxy

This is a simple proxy I use to run non-streaming evals (like BFCL multi_turn) against the vLLM server's streaming request/response path. Run the vLLM server as usual, run the proxy (via `python proxy.py`), and point BFCL at http://localhost:8001/v1 instead of http://localhost:8000/v1 to exercise the streaming path.

This means you can start vLLM once and run BFCL twice, once streaming and once non-streaming, just by changing OPENAI_BASE_URL, to verify basic correctness of the streaming reasoning and tool call parsers.

The entire script was written by Gemini in one shot, but it seems to work so far in basic testing.
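
The preview above does not include proxy.py itself, but the core idea is small enough to sketch. The snippet below is a rough illustration, not the gist's actual code: it assumes FastAPI, httpx, and uvicorn, forwards each non-streaming request to vLLM with stream=True, and stitches the SSE chunks back into a single Chat Completions response. It only reassembles text content; the real proxy also has to merge tool call and reasoning deltas.

```python
# Hypothetical sketch only -- not the proxy.py from this gist.
import json

import httpx
import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed upstream vLLM
app = FastAPI()


@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> JSONResponse:
    body = await request.json()
    body["stream"] = True  # force vLLM's streaming path

    content_parts: list[str] = []
    finish_reason = None
    model = body.get("model", "")

    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", VLLM_URL, json=body) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data: "):
                    continue
                data = line[len("data: "):]
                if data.strip() == "[DONE]":
                    break
                chunk = json.loads(data)
                choices = chunk.get("choices") or []
                if not choices:
                    continue  # e.g. a usage-only chunk
                choice = choices[0]
                delta = choice.get("delta") or {}
                if delta.get("content"):
                    content_parts.append(delta["content"])
                if choice.get("finish_reason"):
                    finish_reason = choice["finish_reason"]
                model = chunk.get("model", model)

    # Hand the caller a single, non-streaming-shaped response.
    return JSONResponse({
        "id": "proxy-reassembled",
        "object": "chat.completion",
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "".join(content_parts)},
            "finish_reason": finish_reason or "stop",
        }],
    })


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001)
```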

@bbrowning
bbrowning / instructions.md
Last active April 19, 2026 14:33
Compile recent vLLM builds from source on DGX Spark

Compiling vLLM main from source on DGX Spark

I do all this SSH'd into the DGX Spark from another machine, so everything here is terminal commands.

Install Python dev dependencies and uv

sudo apt install python3-dev
curl -LsSf https://astral.sh/uv/install.sh | sh
@bbrowning
bbrowning / sm120_nvfp4_moe.diff
Created November 21, 2025 20:56
Changes required to get latest main of vLLM running Qwen3 MoE NVFP4 on DGX Spark
diff --git a/csrc/ops.h b/csrc/ops.h
index f8bdc61aa..933c64db0 100644
--- a/csrc/ops.h
+++ b/csrc/ops.h
@@ -218,6 +218,7 @@ bool cutlass_scaled_mm_supports_fp4(int64_t cuda_device_capability);
bool cutlass_scaled_mm_supports_fp8(int64_t cuda_device_capability);
bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability);
bool cutlass_group_gemm_supported(int64_t cuda_device_capability);
+bool cutlass_moe_mm_supports_fp4(int64_t cuda_device_capability);
@bbrowning
bbrowning / Dockerfile.dgx_spark
Created November 21, 2025 00:14
Dockerfile to create vLLM v0.11.2 containers for DGX Spark
# A crude copy of vLLM's normal Dockerfile that installs
# a released version on DGX Spark
ARG CUDA_VERSION=13.0.2
ARG PYTHON_VERSION=3.12
ARG VLLM_VERSION=0.11.2
ARG BASE_IMAGE=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04
ARG PYTORCH_CUDA_INDEX_BASE_URL=https://download.pytorch.org/whl
@bbrowning
bbrowning / test_grammar.py
Created September 11, 2025 17:27
Llguidance, vllm guided_grammar, and Hermes models
import json
from openai import OpenAI
def hermes_grammar_from_tools(tools: list[dict]) -> str:
    tool_funcs = ""
    for tool in tools:
        tool_funcs += " | " if tool_funcs else ""
        tool_funcs += f"fun_{tool['function']['name']}"
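
The preview cuts off before the grammar gets used, so here is a hedged sketch of the pattern this gist is exercising: vLLM's structured output support accepts a grammar via the OpenAI client's extra_body as guided_grammar, so the output of hermes_grammar_from_tools (defined above) can constrain a Hermes-style model to emit only well-formed tool calls. The base_url, model name, and tool definition below are assumptions for illustration, not values from the gist.

```python
# Hedged usage sketch; the endpoint, model, and tool here are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Build the grammar from the tool definitions using the function above.
grammar = hermes_grammar_from_tools(tools)

resp = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",  # assumed Hermes-style model
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    extra_body={"guided_grammar": grammar},  # vLLM structured output extension
)
print(resp.choices[0].message.content)
```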
@bbrowning
bbrowning / 0_instructions.md
Last active September 3, 2025 17:36
Running BFCL against models deployed to vLLM

Running BFCL tests against vLLM

Clone the gorilla repo and install BFCL dependencies:

git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
python -m venv venv
source venv/bin/activate
pip install -e .
@bbrowning
bbrowning / pydantic_agent_test.py
Created July 25, 2025 00:01
An example of how to use Pydantic AI with Llama Stack and the Responses API
# Dependencies:
# pip install openai pydantic-ai
# This example uses the web_search builtin tool, so it assumes you
# have a valid TAVILY_API_KEY environment variable set before starting
# your Llama Stack server.
# Usage:
#
# ollama run llama3.2:3b
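
The preview stops inside the comment header, so here is a hedged sketch of the overall shape of such a script, not the gist's actual code. It assumes Llama Stack's OpenAI-compatible endpoint lives at http://localhost:8321/v1/openai/v1 and that a recent pydantic-ai exposes OpenAIResponsesModel and OpenAIProvider; it also leaves out the web_search builtin-tool wiring the real example includes. Treat every value below as an assumption that depends on your Llama Stack and pydantic-ai versions.

```python
# Hypothetical sketch -- endpoint, model name, and class names are assumptions.
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider

provider = OpenAIProvider(
    base_url="http://localhost:8321/v1/openai/v1",  # assumed Llama Stack endpoint
    api_key="none",
)
model = OpenAIResponsesModel("llama3.2:3b", provider=provider)
agent = Agent(model, system_prompt="Answer concisely.")

result = agent.run_sync("What is Llama Stack?")
print(result.output)  # older pydantic-ai versions expose result.data instead
```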
@bbrowning
bbrowning / llama4_pythonic.ebnf
Created May 22, 2025 12:38
EBNF grammar (for use with Tatsu) for Llama 4 Pythonic tool calling parsing
@@grammar::Llama4
start
=
expression $
;
expression
=