Skip to content

Instantly share code, notes, and snippets.

@ashokvarmamatta
Last active April 9, 2026 09:25
Show Gist options
  • Select an option

  • Save ashokvarmamatta/2305e6d9a2c7b5ac7da8fc5c1c95e489 to your computer and use it in GitHub Desktop.

Select an option

Save ashokvarmamatta/2305e6d9a2c7b5ac7da8fc5c1c95e489 to your computer and use it in GitHub Desktop.
Run Gemma 4 On Your Android Phone — Complete Beginner's Guide (LiteRT LM SDK, Kotlin, on-device AI)

Run Gemma 4 On Your Android Phone — Complete Beginner's Guide (LiteRT LM SDK, Kotlin, on-device AI)

🧠 Run Gemma 4 On Your Android Phone

The Complete Beginner's Guide — From Zero to On-Device AI

Typing SVG

LiteRT LM Gemma 4 Android Kotlin Privacy


📖 What This Guide Covers

You want to run an AI model ON YOUR PHONE.
No cloud. No API key. No internet (after download).
This guide shows you EXACTLY how — from scratch.

         ┌──────────────────────────────┐
         │  1. Pick a model             │  ← Which Gemma to download?
         │  2. Add the SDK             │  ← One Gradle line
         │  3. Download the model       │  ← 2.6 GB, one-time
         │  4. Load it                  │  ← Engine + Conversation
         │  5. Chat with it             │  ← Send text, get response
         │  6. Advanced stuff           │  ← Images, audio, thinking
         └──────────────────────────────┘

🤔 Wait, What Does "On-Device AI" Mean?

NORMAL AI (Cloud)                    ON-DEVICE AI (This Guide)
─────────────────                    ────────────────────────
Your phone                           Your phone
    │                                     │
    │  "What is gravity?"                 │  "What is gravity?"
    │                                     │
    ▼                                     ▼
Internet ──► OpenAI/Google server    THE MODEL RUNS HERE
    │            │                   ON YOUR PHONE'S CPU/GPU
    │            │ (processes)            │
    │            ▼                        │ (processes locally)
    │        Response                     ▼
    ◄────────────┘                   Response
    │                                     │
    ▼                                     ▼
You see the reply                    You see the reply

❌ Needs internet                    ✅ Works in airplane mode
❌ Data goes to cloud                ✅ Data never leaves phone
❌ Costs money (API fees)            ✅ 100% free after download
❌ Server can be down                ✅ Always available

📦 Step 1 — Pick Your Model

Model Size RAM Needed What It Can Do Best For
🥇 Gemma 4 E2B 2.6 GB 8 GB Text + Vision + Audio + Thinking Most phones. Start here.
🥈 Gemma 4 E4B 3.7 GB 12 GB Same but smarter Flagship phones (Pixel 9, S25 Ultra)
🥉 Gemma 3n E2B 3.7 GB 8 GB Text + Vision + Audio (no thinking) Previous gen, still solid
Gemma 3 1B 584 MB 6 GB Text only Low-end phones, fast responses
🧪 DeepSeek R1 1.5B 1.8 GB 6 GB Text only (reasoning) Logical/math tasks

💡 Don't know which to pick?Gemma 4 E2B. It's the newest, smallest for its power, and works on most modern phones.

Where to download

Every model is on HuggingFace. Direct links:

Model Download Link
Gemma 4 E2B https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm
Gemma 4 E4B https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm
Gemma 3n E2B https://huggingface.co/google/gemma-3n-E2B-it-litert-lm
Gemma 3 1B https://huggingface.co/litert-community/Gemma3-1B-IT
DeepSeek R1 https://huggingface.co/litert-community/DeepSeek-R1-Distill-Qwen-1.5B

🔧 Step 2 — Add the SDK to Your Android Project

You need one library: LiteRT LM (Google's on-device LLM engine).

2a. Version catalog (gradle/libs.versions.toml)

[versions]
kotlin = "2.2.0"          # Must be 2.2.0+ (LiteRT LM requires it)
litertlm = "0.10.0"
ksp = "2.2.0-2.0.2"       # If you use Room, switch from kapt to KSP

[libraries]
litertlm = { group = "com.google.ai.edge.litertlm", name = "litertlm-android", version.ref = "litertlm" }

[plugins]
kotlin-compose = { id = "org.jetbrains.kotlin.plugin.compose", version.ref = "kotlin" }

2b. App build file (app/build.gradle.kts)

plugins {
    alias(libs.plugins.android.application)
    alias(libs.plugins.kotlin.android)
    alias(libs.plugins.kotlin.compose)   // ← Required for Kotlin 2.0+
}

android {
    compileSdk = 35
    defaultConfig {
        minSdk = 31    // Android 12+ required
    }
    // ⚠️ Remove composeOptions { kotlinCompilerExtensionVersion = "..." }
    //    The kotlin-compose plugin handles this now
}

dependencies {
    implementation(libs.litertlm)   // ← This is the only new dependency
}

2c. Sync Gradle

Click "Sync Now" in Android Studio. If you see errors:

Error Fix
Metadata version 2.3.0, expected 1.9.0 Upgrade Kotlin to 2.2.0
kapt fails with Room Switch Room from kapt to ksp
composeOptions error Remove composeOptions block, add kotlin-compose plugin

⚠️ Big gotcha: LiteRT LM uses Kotlin 2.3 metadata. Your project MUST use Kotlin 2.2.0+. This cascades: Room needs 2.7+, kapt→KSP, old Compose compiler plugin replaced by kotlin-compose. See BUG-35 in ZeroClaw for the full story.


📥 Step 3 — Download the Model to the Phone

Two approaches: in-app download or manual push via ADB.

Option A: Download in your app (recommended)

// Use WorkManager for reliable background download with resume support
// Full example: ModelDownloadWorker.kt in ZeroClawAndroid

val url = "https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it.litertlm?download=true"
val destDir = File(context.filesDir, "models")
destDir.mkdirs()
val destFile = File(destDir, "gemma-4-E2B-it.litertlm")

// Simple download (for testing — use WorkManager for production)
withContext(Dispatchers.IO) {
    val conn = URL(url).openConnection() as HttpURLConnection
    conn.connect()
    conn.inputStream.use { input ->
        FileOutputStream(destFile).use { output ->
            input.copyTo(output)
        }
    }
}
// destFile.absolutePath is your model path

Option B: Push via ADB (for development)

# Download the model file to your computer first, then:
adb push gemma-4-E2B-it.litertlm /data/local/tmp/

# Or push to app's files directory:
adb push gemma-4-E2B-it.litertlm /storage/emulated/0/Android/data/YOUR.PACKAGE.NAME/files/models/

Resume support (for large downloads)

// If download fails mid-way, resume from where it stopped:
val startByte = if (tmpFile.exists()) tmpFile.length() else 0L
val conn = URL(url).openConnection() as HttpURLConnection
if (startByte > 0) {
    conn.setRequestProperty("Range", "bytes=$startByte-")
}
// HTTP 206 = resumed, 200 = started over

🚀 Step 4 — Load the Model (Engine + Conversation)

This is where the magic happens. The LiteRT LM SDK has two main objects:

Engine          = the brain (loads model weights into memory)
Conversation    = the chat session (sends messages, gets replies)

You create ONE Engine, then create Conversations from it.
Engine is heavy (5-30 sec to load). Conversation is light (instant).

Complete loading code

import com.google.ai.edge.litertlm.Backend
import com.google.ai.edge.litertlm.Content
import com.google.ai.edge.litertlm.Contents
import com.google.ai.edge.litertlm.Conversation
import com.google.ai.edge.litertlm.ConversationConfig
import com.google.ai.edge.litertlm.Engine
import com.google.ai.edge.litertlm.EngineConfig
import com.google.ai.edge.litertlm.SamplerConfig

// ── Step 1: Configure the engine ────────────────────────

val modelPath = "/path/to/gemma-4-E2B-it.litertlm"

val engineConfig = EngineConfig(
    modelPath = modelPath,
    backend = Backend.CPU(),          // ← text inference on CPU (safe default)
    visionBackend = Backend.GPU(),    // ← REQUIRED for image input! (null = no vision)
    audioBackend = Backend.CPU(),     // ← for audio input (null = no audio)
    maxNumTokens = 4096               // ← max tokens for input + output combined
)

// ── Step 2: Create and initialize the engine ────────────
//    This loads the model into memory. Takes 5-30 seconds.
//    Do this on a background thread!

val engine = Engine(engineConfig)
engine.initialize()    // ← blocking call, run on Dispatchers.IO

// ── Step 3: Create a conversation ───────────────────────

val samplerConfig = SamplerConfig(
    topK = 64,            // consider top 64 tokens at each step
    topP = 0.95,          // nucleus sampling: keep tokens until 95% probability
    temperature = 1.0     // 1.0 = balanced, 0.0 = deterministic, 2.0 = creative
)

val conversation = engine.createConversation(
    ConversationConfig(
        samplerConfig = samplerConfig,
        // Optional: set a system prompt
        systemInstruction = Contents.of(listOf(
            Content.Text("You are a helpful assistant. Be concise.")
        ))
    )
)

println("✅ Model loaded and ready!")

What each parameter does

┌─────────────────────────────────────────────────────────────┐
│                    EngineConfig                               │
│                                                               │
│  modelPath     = where the .litertlm file is on disk         │
│  backend       = CPU or GPU (see section below)              │
│  maxNumTokens  = total budget for input + output tokens      │
│                  4096 = good default                          │
│                  32768 = max for Gemma 4 (uses more RAM)     │
│                                                               │
├─────────────────────────────────────────────────────────────┤
│                    SamplerConfig                              │
│                                                               │
│  temperature   = randomness of output                        │
│                  0.0 = always picks most likely word          │
│                  1.0 = balanced (default, good for chat)     │
│                  2.0 = very creative/random                  │
│                                                               │
│  topK          = only consider the top K most likely tokens  │
│                  64 = good default                            │
│                  1 = greedy (always pick the best)            │
│                                                               │
│  topP          = nucleus sampling threshold                  │
│                  0.95 = consider tokens until 95% cumulative │
│                  1.0 = consider all tokens                   │
└─────────────────────────────────────────────────────────────┘

💬 Step 5 — Chat With the Model

5a. Simple blocking call

// Send a message and get the complete response
val input = Contents.of(listOf(Content.Text("What is photosynthesis?")))

// This blocks until the full response is generated
val response = conversation.generateResponse(input)
println(response)
// Output: "Photosynthesis is the process by which green plants..."

5b. Streaming (token-by-token) ⭐ Recommended

import com.google.ai.edge.litertlm.Message
import com.google.ai.edge.litertlm.MessageCallback

val input = Contents.of(listOf(Content.Text("Explain gravity in simple terms")))

conversation.sendMessageAsync(
    input,
    object : MessageCallback {
        override fun onMessage(message: Message) {
            // Called for EACH token as it's generated
            val token = message.toString()
            print(token)  // prints word-by-word: "Gravity" "is" "a" "force" ...
            
            // Check for thinking content (Gemma 4 only)
            val thinking = message.channels["thought"]?.toString()
            if (!thinking.isNullOrEmpty()) {
                println("[THINKING] $thinking")
            }
        }

        override fun onDone() {
            println("\n✅ Generation complete!")
        }

        override fun onError(throwable: Throwable) {
            println("❌ Error: ${throwable.message}")
        }
    },
    emptyMap()  // extra context (pass mapOf("enable_thinking" to "true") for thinking mode)
)

5c. Multi-turn conversation (the Conversation remembers!)

// Turn 1
conversation.sendMessageAsync(
    Contents.of(listOf(Content.Text("My name is Alex"))),
    callback, emptyMap()
)
// AI: "Nice to meet you, Alex!"

// Turn 2 — the AI remembers turn 1!
conversation.sendMessageAsync(
    Contents.of(listOf(Content.Text("What's my name?"))),
    callback, emptyMap()
)
// AI: "Your name is Alex!"

// No manual history management needed — the Conversation object handles it.

5d. Reset conversation (clear history)

// Close old conversation, create new one on same engine
conversation.close()
val newConversation = engine.createConversation(
    ConversationConfig(samplerConfig = samplerConfig)
)
// New conversation has no memory of previous messages

🖼️ Step 6 — Send Images (Vision)

Gemma 4 can understand images! Send a photo and ask about it.

// Load image as PNG byte array
val bitmap: Bitmap = // ... load from camera, gallery, etc.
val stream = ByteArrayOutputStream()
bitmap.compress(Bitmap.CompressFormat.PNG, 100, stream)
val imageBytes = stream.toByteArray()

// Build contents: image first, then text
val contents = Contents.of(listOf(
    Content.ImageBytes(imageBytes),                    // ← the image
    Content.Text("What do you see in this image?")     // ← the question
))

conversation.sendMessageAsync(contents, callback, emptyMap())
// AI: "I see a golden retriever playing in a park with a red frisbee..."

⚠️ Important: Add the image BEFORE the text in the Contents list. The SDK processes them in order.

🚨 CRITICAL — visionBackend is REQUIRED! If your EngineConfig does not include visionBackend = Backend.GPU(), sending Content.ImageBytes will cause a native SIGSEGV crash (null pointer in liblitertlm_jni.so). This crash cannot be caught by try/catch — it kills the entire app. Make sure your engine is configured like this:

val engineConfig = EngineConfig(
    modelPath = modelPath,
    backend = Backend.CPU(),
    visionBackend = Backend.GPU(),  // ← WITHOUT THIS, IMAGE INPUT CRASHES!
    maxNumTokens = 4096
)

Without visionBackend, no vision executor is created, so the image bytes hit a null pointer in the native layer.

📝 Supported models: Only Gemma 4 E2B/E4B and Gemma 3n support vision. Gemma 3 1B and DeepSeek are text-only.


🎤 Step 7 — Send Audio

// Audio must be raw PCM bytes (not MP3/AAC)
// Sample rate: 16000 Hz, mono, 16-bit

val audioBytes: ByteArray = // ... record from microphone or load WAV

val contents = Contents.of(listOf(
    Content.AudioBytes(audioBytes),
    Content.Text("Transcribe this audio and summarize it")
))

conversation.sendMessageAsync(contents, callback, emptyMap())

📝 Supported models: Gemma 4 E2B/E4B and Gemma 3n support audio. Others are text-only.


💭 Step 8 — Thinking Mode (Chain-of-Thought)

Gemma 4 can show its reasoning process before answering. Like watching it think.

// Enable thinking via extra context
val extraContext = mapOf("enable_thinking" to "true")

conversation.sendMessageAsync(
    Contents.of(listOf(Content.Text("If I have 3 boxes with 5 apples each, and I give away 7, how many remain?"))),
    object : MessageCallback {
        override fun onMessage(message: Message) {
            val text = message.toString()
            val thinking = message.channels["thought"]?.toString()

            if (!thinking.isNullOrEmpty()) {
                // This is the AI's internal reasoning
                println("🧠 Thinking: $thinking")
                // "Let me calculate: 3 boxes × 5 apples = 15 apples total.
                //  If I give away 7: 15 - 7 = 8 apples remain."
            }
            if (text.isNotEmpty()) {
                // This is the final answer
                println("💬 Answer: $text")
                // "You have 8 apples remaining."
            }
        }

        override fun onDone() { println("✅ Done") }
        override fun onError(t: Throwable) { println("${t.message}") }
    },
    extraContext   // ← this enables thinking mode
)
What you see:

🧠 Thinking: Let me break this down step by step.
🧠 Thinking: 3 boxes × 5 apples = 15 total apples.
🧠 Thinking: 15 - 7 = 8 apples remaining.
💬 Answer: You have 8 apples remaining.
✅ Done

📝 Only Gemma 4 supports thinking mode. Other models ignore the enable_thinking context.


⚡ Step 9 — CPU vs GPU

CPU (Default — Use This)

val engineConfig = EngineConfig(
    modelPath = modelPath,
    backend = Backend.CPU(),
    maxNumTokens = 4096
)

✅ Works on all phones ✅ Stable, no crashes ✅ Uses ~2-3 GB RAM ❌ Slower generation (5-15 tok/s depending on phone)

GPU (Advanced — High-End Phones Only)

val engineConfig = EngineConfig(
    modelPath = modelPath,
    backend = Backend.GPU(),
    maxNumTokens = 4096
)

✅ 2-5x faster generation ❌ Loads entire model into GPU VRAMWILL CRASH (SIGSEGV) on phones with < 12 GB RAM ❌ Competes with Android's RenderThread for GPU → can freeze UI

⚠️  WARNING: GPU MODE CRASH EXPLAINED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Gemma 4 E2B = 2.6 GB model file
GPU loading  = ~3 GB VRAM needed
Android UI   = also uses GPU for drawing

Phone has 8 GB RAM total:
  - Android OS:     ~2 GB
  - Your app:       ~1 GB
  - Model on GPU:   ~3 GB
  - RenderThread:   needs GPU too → SIGSEGV (Fatal signal 11)
                    ═══════════════════════════════════
                    App crashes. Not catchable in Java.

FIX: Use CPU. Or only use GPU on 12GB+ RAM phones.

How to safely try GPU with fallback

fun createEngine(modelPath: String): Engine {
    // Try GPU first on high-end devices, fall back to CPU
    val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memInfo)
    val totalRamGb = memInfo.totalMem / (1024L * 1024 * 1024)

    val backend = if (totalRamGb >= 12) {
        Log.d("LLM", "Device has ${totalRamGb}GB RAM — using GPU")
        Backend.GPU()
    } else {
        Log.d("LLM", "Device has ${totalRamGb}GB RAM — using CPU (GPU needs 12GB+)")
        Backend.CPU()
    }

    val config = EngineConfig(
        modelPath = modelPath,
        backend = backend,
        maxNumTokens = 4096
    )
    return Engine(config).also { it.initialize() }
}

🧹 Step 10 — Cleanup (Don't Leak Memory!)

// When you're done with the model (app closing, switching models, etc.)

conversation.close()    // ← close conversation FIRST
engine.close()          // ← then close engine

// If you want to cancel generation mid-way:
conversation.cancelProcess()

⚠️ Always close in order: conversation first, then engine. Closing engine without closing conversation can leak native memory.


📋 Complete Copy-Paste Example

Drop this into any Activity or ViewModel and it works:

import android.os.Bundle
import android.util.Log
import androidx.activity.ComponentActivity
import androidx.lifecycle.lifecycleScope
import com.google.ai.edge.litertlm.*
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch

class ChatActivity : ComponentActivity() {

    private var engine: Engine? = null
    private var conversation: Conversation? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        val modelPath = "${filesDir}/models/gemma-4-E2B-it.litertlm"

        lifecycleScope.launch(Dispatchers.IO) {
            // Load model
            Log.d("LLM", "Loading model...")
            val config = EngineConfig(
                modelPath = modelPath,
                backend = Backend.CPU(),
                maxNumTokens = 4096
            )
            engine = Engine(config).also { it.initialize() }

            conversation = engine!!.createConversation(
                ConversationConfig(
                    samplerConfig = SamplerConfig(topK = 64, topP = 0.95, temperature = 1.0),
                    systemInstruction = Contents.of(listOf(
                        Content.Text("You are a helpful, concise assistant.")
                    ))
                )
            )
            Log.d("LLM", "✅ Model ready!")

            // Chat
            chat("Hello! What can you do?")
            chat("What is the capital of Japan?")
            chat("What did I just ask you?")  // Tests memory
        }
    }

    private fun chat(userMessage: String) {
        Log.d("LLM", "👤 You: $userMessage")
        val sb = StringBuilder()

        conversation?.sendMessageAsync(
            Contents.of(listOf(Content.Text(userMessage))),
            object : MessageCallback {
                override fun onMessage(message: Message) {
                    sb.append(message.toString())
                }
                override fun onDone() {
                    Log.d("LLM", "🤖 AI: $sb")
                }
                override fun onError(throwable: Throwable) {
                    Log.e("LLM", "❌ Error: ${throwable.message}")
                }
            },
            emptyMap()
        )
    }

    override fun onDestroy() {
        conversation?.close()
        engine?.close()
        super.onDestroy()
    }
}

🔧 Troubleshooting

Problem Cause Fix
Metadata version 2.3.0, expected 1.9.0 Kotlin too old Upgrade to Kotlin 2.2.0
kapt build failure with Room Room 2.6 incompatible with Kotlin 2.2 Upgrade Room to 2.7+, switch kapt→KSP
SIGSEGV (Fatal signal 11) on model load GPU out of memory Switch to Backend.CPU()
SIGSEGV when sending Content.ImageBytes Missing visionBackend in EngineConfig Add visionBackend = Backend.GPU() to EngineConfig — without it, no vision executor is created and image bytes hit a null pointer
SIGSEGV on image with GPU backend too Model + vision both on GPU = OOM Keep backend = CPU(), only visionBackend = GPU()
Model takes 30+ seconds to load Normal for first load Load on background thread, show progress
Model file not found Wrong path Check context.filesDir path, verify file exists
Response is garbage/random Temperature too high Lower temperature to 0.7-1.0
App killed by Android Model uses too much RAM Use smaller model (Gemma 3 1B = 584 MB)
composeOptions error Old Compose compiler setup Remove composeOptions, add kotlin-compose plugin
CancellationException on response User cancelled or timeout Handle gracefully, not a real error

📊 Performance Benchmarks

Tested on mid-range Android phone (8 GB RAM, Snapdragon 7 Gen 2):

Model Load Time Speed (CPU) RAM Usage
Gemma 4 E2B ~15 sec 8-12 tok/s ~3.5 GB
Gemma 3 1B ~3 sec 15-25 tok/s ~1.2 GB
DeepSeek R1 1.5B ~5 sec 10-15 tok/s ~2.0 GB

Performance varies by device. Flagship phones (Pixel 9, S25 Ultra) are 2-3x faster.


🔗 Resources

What Link
LiteRT LM SDK https://ai.google.dev/edge/litert
Gemma 4 Models https://huggingface.co/litert-community
Google AI Edge Gallery (reference app) https://github.com/google-ai-edge/gallery
ZeroClaw Android (production example) https://github.com/ashokvarmamatta/ZeroClawAndroid
Kotlin 2.2 Migration Guide https://kotlinlang.org/docs/whatsnew22.html

Built with LiteRT LM by Google AI Edge

Guide by @ashokvarmamatta

Learned by building ZeroClaw Android — 180 phases, 37 tools, 10 channels, Gemma 4 on-device

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment