LLMs may spontaneously switch to Chinese mid-reasoning regardless of prompt language — observed in both OpenAI's o1 and Chinese models (DeepSeek, Qwen, GLM)
The papers summarized below point to three possible causes: internal circuit competition, strategic reasoning advantages acquired during training, and the statistical distribution of the training data.
Mechanistic interpretability research suggests that multilingual LLMs possess two distinct internal subsystems that govern generation:
- Language Sub-circuits: These act as "lingual keys" that detect and maintain language patterns.
- Semantic Sub-circuits: These function as "contextual values" that retrieve language-agnostic meaning and concepts.
In normal generation, these circuits converge smoothly. However, unintended code-switching occurs when semantic circuits dominate, overriding the language-specific pathways. When the model focuses intensely on the "semantics" of a complex reasoning problem, the linguistic control circuit may weaken, allowing the model to lapse into a language that might be more strongly associated with the underlying concept or reasoning step.
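To make the failure mode concrete, here is a minimal sketch of how such lapses can be detected in a generated chain of thought. The helper name and example string are illustrative, and matching by Unicode script is only a rough proxy (the CJK ranges also cover Japanese kanji):

```python
import re

# CJK Unified Ideographs plus Extension A: a rough proxy for "Chinese
# script" (it also matches Japanese kanji).
CJK = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]+")

def find_language_switches(chain_of_thought: str) -> list[tuple[int, str]]:
    """Return (offset, span) pairs where the text lapses into CJK script."""
    return [(m.start(), m.group()) for m in CJK.finditer(chain_of_thought)]

cot = "The integral diverges, so we 取极限 as x approaches infinity."
for offset, span in find_language_switches(cot):
    print(f"switch at char {offset}: {span!r}")  # switch at char 29: '取极限'
```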
Evidence from reasoning-focused models like DeepSeek-R1 suggests that language mixing is often a strategic behavior rather than a flaw.
- Training Origins: Research identifies Reinforcement Learning with Verifiable Rewards (RLVR) as the critical stage where this behavior emerges.
- Reasoning Efficiency: Enforcing a single language (monolingual decoding; see the sketch after this list) can actually degrade performance. For instance, in some tests, preventing language mixing reduced reasoning accuracy on the MATH500 benchmark by 5.6 percentage points.
- Language Choice: Models may switch to Chinese mid-reasoning because their trained weights encode, in effect, a "perception" that switching languages will benefit the reasoning at that specific step.
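One way to see what "enforcing monolingual decoding" means in practice: with Hugging Face transformers, a custom LogitsProcessor can mask every vocabulary entry that decodes to CJK text. This is a plausible sketch of the constraint Li et al. study, not necessarily their implementation; the class name is mine.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class BanCJKLogitsProcessor(LogitsProcessor):
    """Approximate monolingual (English-only) decoding by masking every
    vocabulary entry whose decoded text contains a CJK character."""

    def __init__(self, tokenizer):
        # One-off scan over the vocabulary; slow but fine for a sketch.
        # Byte-fragment tokens that only form CJK characters in
        # combination slip through this per-token check.
        self.banned = torch.tensor([
            tid for tid in range(len(tokenizer))
            if any("\u4e00" <= ch <= "\u9fff" for ch in tokenizer.decode([tid]))
        ])

    def __call__(self, input_ids, scores):
        scores[:, self.banned] = float("-inf")  # CJK tokens can never win
        return scores

# Usage (model/tokenizer loading omitted):
# out = model.generate(**inputs,
#     logits_processor=LogitsProcessorList([BanCJKLogitsProcessor(tok)]))
```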
The statistical distribution of the training corpus also plays a significant role in spontaneous language triggers.
- Token Frequency: Analysis of the vocabularies of OpenAI's models (including o1) and Chinese models (like Qwen and GLM) reveals that certain Chinese tokens are disproportionately prevalent (a vocabulary-inspection sketch follows this list).
- Polluted Chinese (PoC) Tokens: Some of these models have vocabularies "polluted" by tokens from adult content or gambling sites, because those terms appeared frequently in the pre-training data. For example, judging by GPT-4o's vocabulary, the name of a specific Japanese adult-film star appeared 2.6 times more frequently in training data than the common greeting 您好 ("hello").
- Triggering Switches: Because these tokens are concentrated in the pre-training corpus, they form strong associations. If a reasoning chain inadvertently touches on a semantic concept linked to these high-frequency Chinese clusters, the model may spontaneously output related Chinese tokens, even if they appear nonsensical in context.
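The flavour of this vocabulary analysis is easy to reproduce with the tiktoken library. The sketch below scans GPT-4o's o200k_base encoding for long all-CJK tokens; the four-character threshold is an assumption of mine, not the papers' exact methodology.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding

def is_cjk(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"

long_chinese = []
for token_id in range(enc.n_vocab):
    try:
        raw = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # id not mapped to a token
    try:
        text = raw.decode("utf-8").strip()
    except UnicodeDecodeError:
        continue  # partial UTF-8 fragment, not a standalone string
    # "long Chinese token": 4+ characters, all CJK (threshold is my choice)
    if len(text) >= 4 and all(is_cjk(c) for c in text):
        long_chinese.append(text)

print(f"{len(long_chinese)} long all-CJK tokens in o200k_base")
print(long_chinese[:20])  # eyeball what the training data promoted
```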
The sources note that while this is observed across several architectures, the causes vary slightly:
- OpenAI Models (o1, GPT-4o): These show significant "pollution" in their Chinese vocabularies (over 23% of long Chinese tokens in some GPT versions relate to pornography or gambling), which can trigger garbled Chinese output.
- Chinese Models (DeepSeek, Qwen, GLM): These generally have much cleaner vocabularies with fewer "polluted" tokens (e.g., DeepSeek-V3 has only 0.17% such tokens, compared with 46.6% in some GPT vocabularies). In these models, switching is more likely attributable to strategic reasoning benefits or to the high density of Chinese-language reasoning data used during their RLVR stages.
1. DeepSeek-R1: language mixing as an emergent reasoning behaviour
- DeepSeek-R1-Zero (pure RL, no supervised fine-tuning) spontaneously developed language mixing in its chain of thought — switching between English and Chinese mid-reasoning.
- The DeepSeek team tried to suppress this with a language-consistency reward (sketched below); it reduced mixing but also degraded reasoning accuracy.
- Li et al. (2025, EMNLP): enforcing monolingual decoding on DeepSeek-R1 reduced accuracy by 5.6 percentage points on MATH500. Language mixing is a reasoning strategy the model discovers during RL training, not a bug.
- Source: Li et al. (2025) "The Impact of Language Mixing on Bilingual LLM Reasoning." EMNLP 2025. https://arxiv.org/abs/2507.15849
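The DeepSeek-R1 report describes the language-consistency reward as the proportion of target-language words in the chain of thought. The sketch below approximates that at the character level; the script-based counting and the function name are my simplifications, not the paper's code.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")

def language_consistency_reward(cot: str, target: str = "en") -> float:
    """Proportion of the chain of thought written in the target script;
    a rough stand-in for counting target-language words."""
    cjk, latin = len(CJK.findall(cot)), len(LATIN.findall(cot))
    total = cjk + latin
    if total == 0:
        return 1.0  # nothing scripted, nothing to penalise
    return (latin if target == "en" else cjk) / total

print(language_consistency_reward("First simplify, 然后取极限, then conclude."))
# ~0.83: the Chinese span costs reward, nudging RL away from mixing
```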
2. Mechanistic cause: semantic regime overrides language regime
- Xiao et al. (2026, submitted ICLR) identified two distinct neural circuits in multilingual LLMs:
- Language regime: "lingual key" — detects and maintains the current output language.
- Semantic regime: "contextual value" — retrieves language-agnostic semantics.
- Normally these converge. During code-switching, the semantic circuit dominates, overriding the language pathway and destabilising output language.
- Multilingual neurons are localised to ~0.019% of all neurons; fine-tuning just this subset reduced the code-switching rate by 20.8% (a toy identification sketch follows this section).
- Source: Xiao et al. (2026) "How Do Language Models Speak Languages? A Case Study on Unintended Code-Switching." ICLR 2026 submission. https://openreview.net/forum?id=HIXPyQ1aMq
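Xiao et al.'s localisation procedure isn't reproduced here, but a common heuristic from the language-specific-neurons literature gives the flavour: rank neurons by how differently they activate on English versus Chinese inputs and keep the top fraction. The data below is synthetic; in practice the activations would come from forward hooks on real MLP layers.

```python
import torch

# Toy stand-in for hidden activations recorded while a model reads
# English vs. Chinese text: shape (n_examples, n_neurons).
torch.manual_seed(0)
n_neurons = 100_000
acts_en = torch.randn(512, n_neurons)
acts_zh = torch.randn(512, n_neurons)
acts_zh[:, :19] += 3.0  # plant a handful of language-sensitive neurons

# Rank neurons by mean activation difference across the two languages.
sensitivity = (acts_en.mean(0) - acts_zh.mean(0)).abs()

# Keep ~0.019% of neurons, echoing the fraction Xiao et al. report.
k = max(1, round(n_neurons * 0.00019))  # -> 19
multilingual_idx = sensitivity.topk(k).indices
print(f"selected {k}/{n_neurons} neurons:", multilingual_idx.sort().values)
```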
3. Multilingual LLMs think in English
- Schut et al. (2025) applied a logit lens to LLMs processing French, German, Dutch, and Mandarin inputs: intermediate layers first produce representations closest to English for semantically loaded words, before the model translates into the target language (a logit-lens sketch follows this section).
- Activation steering is more effective when steering vectors are computed in English rather than the input/output language.
- English-centric training data creates an English-dominant internal representation space — the model reasons in English, then translates.
- Source: Schut, Gal & Farquhar (2025) "Do Multilingual LLMs Think In English?" https://arxiv.org/abs/2502.15603
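A bare-bones logit lens is straightforward with transformers: project each layer's residual stream through the final layer norm and the unembedding, then read off the top token at every depth. The module names below (transformer.ln_f, lm_head) are GPT-2-specific, and gpt2 is only a small stand-in for the multilingual models Schut et al. actually study; the sketch shows the mechanics, not their results.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in; the paper uses larger multilingual models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Le chat est assis sur le", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: decode every intermediate residual state as if it were
# the final one, to see which language surfaces first.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, repr(tok.decode(logits.argmax(-1))))
```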
4. Tokeniser pollution from Chinese training data
- Zhang et al. (2025, EMNLP): GPT tokenisers contain "Polluted Chinese" (PoC) tokens, i.e. Chinese subword fragments that don't correspond to meaningful characters or words, created because BPE operates on UTF-8 bytes without understanding Chinese text structure (a byte-level demo follows this section).
- PoC tokens cause garbled Chinese output. The model has learned statistical patterns for tokens that are semantically meaningless.
- Source: Zhang et al. (2025) "Speculating LLMs' Chinese Training Data Pollution from Their Tokens." EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1327.pdf
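The mechanism is visible directly in tiktoken: a Chinese character occupies three UTF-8 bytes, and when a rare character lacks a dedicated token, BPE assembles it from byte fragments that are not valid UTF-8 on their own. The example characters are my pick and may tokenise differently across encodings.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

text = "魍魉"  # rare characters, likely to lack dedicated tokens
for tid in enc.encode(text):
    raw = enc.decode_single_token_bytes(tid)
    try:
        print(tid, raw, "->", raw.decode("utf-8"))
    except UnicodeDecodeError:
        print(tid, raw, "-> partial UTF-8 fragment (garbled on its own)")
```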
5. Chinese prompting does NOT save tokens (debunking a myth)
- Claim: Chinese prompts are more token-efficient for coding tasks. Ren et al. (2026) tested this on SWE-bench Lite.
- Result: no consistent efficiency advantage. Token cost is model-dependent (MiniMax-2.7 costs 1.28x more in Chinese; GLM-5 costs slightly less), and the success rate when prompting in Chinese is generally lower across all models tested (see the token-count check below).
- Source: Ren et al. (2026) "Mythbuster: Chinese Language Is Not More Efficient Than English in Vibe Coding." https://arxiv.org/abs/2604.14210
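The arithmetic behind the claim is easy to check for any one tokenizer. The snippet below compares an English instruction with a Chinese rendering under GPT-4o's o200k_base encoding (the translation pair is mine); counts swing with each model's tokenizer, which is why the efficiency claim fails to generalise.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o-family encoding

en = "Fix the failing unit test and refactor the helper function."
zh = "修复失败的单元测试并重构辅助函数。"  # same instruction in Chinese

print("EN tokens:", len(enc.encode(en)))
print("ZH tokens:", len(enc.encode(zh)))
```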