Tofu abhishekmishragithub

@abhishekmishragithub
abhishekmishragithub / README.md
Last active February 23, 2026 14:26
Smallest AI Pulse STT - Streaming Debug Example (fixes latency & full_transcript issues)

Pulse STT Streaming Debug Example

Debug example for Smallest AI Pulse STT WebSocket streaming with proper latency control.

Issues This Solves

  1. High latency (10+ seconds) — Fixed by throttling audio to real-time pace
  2. No restored/full transcript — Fixed by enabling full_transcript=true
  3. Poor Hindi transcription — Fixed by using correct language code (hi or multi)
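A minimal sketch of the throttling fix in issue 1, assuming 16 kHz 16-bit PCM and a generic async send callable — the actual Pulse STT message framing, parameter names, and chunk size may differ:

```python
import asyncio

# Assumed streaming parameters -- adjust to match the actual Pulse STT API.
SAMPLE_RATE = 16000      # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHUNK_MS = 100           # send 100 ms of audio per message

CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

async def stream_audio(ws_send, pcm_audio: bytes) -> None:
    """Send PCM audio in real-time-sized chunks instead of all at once.

    `ws_send` is any async callable that writes one message to the socket
    (e.g. `websocket.send`). Sleeping for each chunk's playback duration
    keeps the server buffer near real time; pushing the whole file at once
    is what produces the 10+ second latency described above.
    """
    for offset in range(0, len(pcm_audio), CHUNK_BYTES):
        await ws_send(pcm_audio[offset:offset + CHUNK_BYTES])
        await asyncio.sleep(CHUNK_MS / 1000)  # pace at playback speed
```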

I don't know what's wrong with me today. I woke up feeling so heavy, like there's this weight on my chest that won't go away. Work has been so stressful lately, and I feel like I'm falling behind on everything. I just want to crawl back into bed and disappear for a while. I don't even know who to talk to about this.

@abhishekmishragithub
abhishekmishragithub / kimi-k2-thinking-eval-benchamark.md
Last active November 14, 2025 12:16
kimi-k2-thinking-eval-benchamark

Kimi-K2-Thinking – Local Evaluation (vLLM + LM Evaluation Harness)

1. Setup

  • Model: kimi-k2-thinking (Moonshot Kimi-K2 Thinking)
  • Serving backend: vLLM
  • Serve command (summary):
    • tensor-parallel-size=8
    • distributed-executor-backend=mp
@abhishekmishragithub
abhishekmishragithub / kv_cache_transformers.md
Created October 24, 2025 11:03
KV cache in transformers models - llms

🔑 What is K and V in Transformers?

Every decoder transformer layer (like in Llama) has self-attention.

For each token being processed, the model computes 3 vectors:

| Name | Meaning | Role |
|------|---------|------|
| Q = Query | "What am I looking for?" | Used to match against past context |
| K = Key | "How should I be looked up?" | Compared against the Q |
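The point of the cache is that past K and V vectors are stored and reused at each decoding step, so only the new token's projections are computed. A toy single-head sketch (plain Python, not the real Llama code; dimensions and "projections" are stand-ins):

```python
import math

def attention_step(q, cache_k, cache_v):
    """One decoding step of single-head attention over a KV cache.

    q: query vector for the new token.
    cache_k / cache_v: lists of K and V vectors for all tokens so far.
    The caller appends the new token's K and V before calling; past
    entries are reused, never recomputed.
    """
    d = len(q)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in cache_k]
    # Softmax over the scores.
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    # Weighted sum of the cached values.
    return [sum(w * v[i] for w, v in zip(weights, cache_v)) for i in range(d)]

# Decoding loop: each step appends one K/V pair instead of redoing all of them.
cache_k, cache_v = [], []
for step in range(3):
    k = v = q = [float(step + 1)] * 4   # stand-in projections for the new token
    cache_k.append(k)
    cache_v.append(v)
    out = attention_step(q, cache_k, cache_v)
```

Without the cache, every step would recompute K and V for the entire prefix, turning generation quadratic in practice.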
@abhishekmishragithub
abhishekmishragithub / llm_benchmarking_glossary.md
Last active September 25, 2025 05:16
llm benchmarking terms / glossary

RPS (Requests per Second): how many requests the system completed per second. RPS = successful_requests / benchmark_duration_sec

TPS (Tokens per Second): an ambiguous term; be explicit about which variant is meant:

  • Output Token Throughput (tok/s) = total_output_tokens / duration

  • Total Token Throughput (tok/s) = (input_tokens + output_tokens) / duration

  • Per-request TPS (for sequential runs) = tokens_returned / request_latency
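The definitions above, as a worked example (all numbers made up for illustration):

```python
# Benchmark-level metrics.
successful_requests = 120
benchmark_duration_sec = 60.0
total_input_tokens = 90_000
total_output_tokens = 30_000

rps = successful_requests / benchmark_duration_sec                   # 2.0 req/s
output_tps = total_output_tokens / benchmark_duration_sec            # 500.0 tok/s
total_tps = (total_input_tokens + total_output_tokens) / benchmark_duration_sec  # 2000.0 tok/s

# Per-request TPS for a single sequential request.
tokens_returned = 256
request_latency = 4.0  # seconds
per_request_tps = tokens_returned / request_latency                  # 64.0 tok/s
```

Note that per-request TPS from sequential runs will usually be far below the server's aggregate throughput under concurrent load, so the two should not be compared directly.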

Python + UV + Neovim Setup Guide

What You Get

Your Neovim setup now includes:

🐍 Python Development Features

  • LSP Support: Pyright for type checking + Ruff for linting/formatting
  • Auto-formatting: Automatic code formatting with Ruff on save
  • Debugging: Full debugging support with nvim-dap

Ghostty Keyboard Shortcuts

Default keyboard shortcuts for Ghostty terminal emulator. Platform-specific differences are noted where applicable.

Window Management

| Action | Windows/Linux | macOS |
|--------|---------------|-------|
| New window | Ctrl+Shift+N | Cmd+N |
| Close window | Alt+F4 | Cmd+Shift+W |

✅ Step 1: Install the plugins (if not already installed)

🔹 A. zsh-autosuggestions

If you haven’t already:

git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions

mental model (what to remember)

Think in layers of memory, each with different lifetime & size:

  1. Working buffer (short-term)
    The last K messages (e.g., 10–30) from the current chat. Fast, no processing.

  2. Running summary (compressed short-term)
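Layer 1 can be sketched as a fixed-size window: old messages fall off the front (in a full system they would feed the running summary of layer 2 before being dropped). Hypothetical helper names, not from any specific framework:

```python
from collections import deque

class WorkingBuffer:
    """Layer 1: keep only the last K messages of the current chat."""

    def __init__(self, k: int = 20):
        # deque with maxlen silently evicts the oldest entry on overflow.
        self.messages = deque(maxlen=k)

    def add(self, role: str, text: str) -> None:
        self.messages.append({"role": role, "content": text})

    def context(self) -> list:
        """Messages to include in the next prompt, oldest first."""
        return list(self.messages)
```

Because nothing is summarized or embedded, this layer is fast but forgets everything older than K messages, which is exactly what the later layers compensate for.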

Here’s a step-by-step guide to deploy a Python backend to Google App Engine (GAE) using the Standard Environment — suitable for most web apps (Flask, FastAPI, etc.).

✅ Prerequisites

  • Python 3.7 – 3.10 (GAE Standard supports specific versions)
  • GCP project created
  • gcloud CLI installed and authenticated (gcloud init)

📁 1. Project Structure (example for Flask)

my-agent-app/