LLMs are amazingly smart and amazingly dumb. Talking to an LLM throws me into euphoria one moment and despair the next. It feels like a 10x productivity boost is within reach, but in practice it's 2 a.m. and I'm screaming obscenities at Claude. This is no way to live. It's time to stop, breathe out, and think about how to use an LLM as leverage.
LLM-assisted development splits into two parts:
- Micro: a small task that takes up minimal context, where the LLM has to follow up to 100 instructions (for Claude 3.7, ~95% accuracy, per "Writing a good CLAUDE.md"). In this mode we don't hit the "dumb zone", and the LLM can follow every instruction without degradation
- Macro: a pile of already written code and a pile of instructions, all of which has to be sliced into "Micro" contexts
Programming includes many subtasks: design, testing, security, performance, monitoring, maintenance, and so on. Each branch has its own specifics, but I assume an LLM works in roughly the same way in each of them. This article focuses on design and testing. Adding performance, security, and monitoring to the stack would yield higher-quality code as the output, but I'm simply going to skip that.
This article covers the micro level specifically, as the foundation. If we can get 10x leverage at the micro level, then scaling it to macro and burning thousands of dollars/tokens a day is almost trivial.
The Main Question we will try to answer:
How, at the micro level, do you get a program from an LLM
that does what you want, without bugs,
with minimal programmer time spent?
To answer the Main Question, we take one simple algorithm, "text chunking for embeddings", one JS module, and implement it in different ways with Claude. The algorithm is non-trivial, meaning Claude (Opus 4.6, March 2026) did not one-shot it from a simple vibe-coding prompt. It easily fits into the context, and Claude is not overloaded with instructions.
The resources we are limited in, and which we will measure:
- Programmer time. There is no point in using an LLM if you can write it all yourself in the same time. We want 10x to justify the bill
- Context size and number of instructions. Both are very limited. The less text in the system, the better everything works
There are several methodologies for working with an LLM that we will look at:
- Vibe coding
- Planning mode
- Spec Driven Development
- Frameworks like spec-kit or OpenSpec
- Calling it in a loop with a prompt
We won't look at corporate tools that launch hordes of agents for crazy money (I've seen this from Dex Horthy: Research Plan Implement), because we don't have that kind of money. They are also more focused on macro and are split into specialized micro agents plus deterministic scaffolding. In the end the micro agents still read and write md/code.
The methodologies can be grouped by how the input is given:
- The programmer writes in chat > the LLM writes code
- The programmer writes in chat > the LLM writes md > the LLM reads md and writes code
- The programmer writes md > the LLM reads md and writes code
And also by how the result is checked:
- The programmer reads the code
- The programmer reads the md
Brew some tea and let's see which methodology and approach works best. It'll be fun.
I implemented the text-chunking-for-embeddings algorithm in various ways dozens of times.
Vibe coding, and any attempt by any means to write specs without the programmer's involvement, count as outright failures. The other approaches reached the goal one way or another.
Hand-written specs came out best: the fastest, the most compact, the easiest to understand. Second place goes to OpenSpec: fast but bulky. Then planning mode, convenient and inconvenient at the same time. Then Spec-Kit, super bloated and slow.
How, at the micro level, do you get a program from an LLM
that does what you want, without bugs,
with minimal programmer time spent?
- .md files are the new source code of the program; they must be stored in the repository, and we compile them into code via the LLM
- Write the .md files by hand only and keep the LLM out of them. The repository needs a strict separation between files that humans write and read and files that the LLM writes; don't mix them
- An LLM is not capable of designing software, but it can be very good at writing code from spec files. The program gets written 6 times faster than by hand, and product quality holds up
- Describe the test cases yourself in the .md files. The LLM cheats and cannot test itself
- The programmer must validate the code the LLM compiled. Validating the test cases is the main route for validating the code. It is many times faster than reading the program's source
- Keep .md files as small and as information-dense as possible, and load only the ones the context really needs
There are three main findings:
It can google or ask the operator, but it cannot do this on its own. "An LLM cannot think."
I tried a very simple program: "split text into chunks for subsequent embedding". The LLM correctly understood that separators, a max chunk length, and overlap are needed. But it did not understand why they are needed. The LLM split words in the middle or produced chunks without overlap even though overlap was configured. The result was of dreadful quality. Asking an LLM to design software is fundamentally wrong; it is rowing against the current
Good examples of bad architecture:
It's like a Fourier transform, only for text. A Fourier transform is not intelligent, but it is extremely useful: a task that is painful in one domain may be trivial in another.
You can go from English to JS, from JS to a discovered CVE vulnerability, and from the CVE to an exploit program.
Given a brief description of the algorithm, a few types, and a few tests, the LLM very reliably produces code that satisfies all the requirements and does NOT CONTAIN bugs.
The source in .md files is 3 times smaller than in .ts files, which is amazing compression, not to mention how fast it is for the programmer to write: overall, development with the LLM was 6 times faster, purely because you don't have to spell out every detail of the algorithm
A good example of ideal translation:
3. LLM code has to be validated like any other code: through tests. But an LLM cannot write its own tests
An LLM writes tests so that they are easier for it to pass, which is both funny and sad.
A few hand-crafted cases plus PBT (property-based testing) give a far better result than 1000 lines of LLM-written slop tests.
Tests should be composed so that they are easy to read yet still check the gnarly logic. And you do need to read them: they are the main proof that the work was done correctly
Examples of bad tests:
- Loop tests: the code contains bugs, but the tests are picked so that they stay green
- Spec-kit tests: a lot of text, super painful to validate
- OpenSpec tests: despite looking thorough, they don't test the overlap text
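To make "a few hand-picked cases plus PBT" concrete, here is a minimal sketch that needs no test library. Everything in it is an assumption for illustration: `chunkByLength` is a hypothetical stand-in for the code under test, and the tiny seeded generator only imitates what a real PBT library such as fast-check does far better (shrinking, richer generators).

```typescript
// Hypothetical stand-in for the code under test: plain fixed-length slicing.
function chunkByLength(text: string, chunkLen: number): string[] {
  if (chunkLen <= 0) throw new Error("chunkLen must be positive");
  const out: string[] = [];
  for (let i = 0; i < text.length; i += chunkLen) {
    out.push(text.slice(i, i + chunkLen));
  }
  return out;
}

// Small seeded generator (LCG), so a failing run can be reproduced.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

// Random short text over a tiny alphabet that includes separators.
function randomText(rnd: () => number, maxLen: number): string {
  const alphabet = "ab .";
  const len = Math.floor(rnd() * (maxLen + 1));
  let s = "";
  for (let i = 0; i < len; i++) {
    s += alphabet.charAt(Math.floor(rnd() * alphabet.length));
  }
  return s;
}

// Checks two properties on random inputs and returns the first
// counterexample as JSON, or null when every run passed.
function checkChunkProperties(runs: number, seed: number = 1): string | null {
  const rnd = makeRandom(seed);
  for (let i = 0; i < runs; i++) {
    const text = randomText(rnd, 12);
    const chunkLen = 1 + Math.floor(rnd() * 5);
    const chunks = chunkByLength(text, chunkLen);
    const recoversText = chunks.join("") === text;
    const sizesOk = chunks.every((c) => c.length > 0 && c.length <= chunkLen);
    if (!recoversText || !sizesOk) {
      return JSON.stringify({ text, chunkLen, chunks });
    }
  }
  return null;
}
```

The point of the shape: one example you can check by eye, plus a handful of properties that are each one line long. That is the part a human has to read and trust.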
Without any doubt, an LLM is a Chinese room.
The contrast between "thinking" and "translating between domains" is simply striking. It is why Claude sometimes seems dumb and sometimes seems like a genius. It is why you must not let Claude write .md files.
Visually, both "thinking" and "translating" are just text, some prompt for the LLM. But in one case you don't know what needs to be done, and in the other you know but find it tedious to type in.
If the LLM doesn't know the right answer, it won't invent it. It may guess by chance, but our domains are so large that chance is not realistic.
Sure, you can send the LLM to google or ask someone else (other LLMs, lol), but on its own it won't add 2 + 2 correctly unless it has seen 2 + 2 = 4 millions of times.
The real leverage appears when you describe the task imprecisely, with lots of gaps and holes, but with obvious shapes that fit into those holes.
For example, give the LLM a couple of types, one hand-picked test case, and some property for PBT, so that all these ideas intersect in a multidimensional space and pin down a solution that is clear to the LLM.
And this is where our friend Claude shines. No matter how much I practice vim and touch typing, I will never be able to produce code 10x faster.
These holes, gaps, and assumptions are what creates the leverage. More holes, more leverage.
In JS or C you have to describe every step of the algorithm, every bracket: a lot of detail. But if you describe the problem in the thinking domain of .md, outlining the shape of the solution through a clever intersection of ideas, the details are not needed at all.
The LLM translates the .md into the code domain and restores all the brackets and dots. I write a program 6 times faster in .md than in TS, and I haven't even practiced writing short .md files yet. A quick look showed that a significant part of the file can be deleted and Claude will still produce the same correct solution.
A small .md written by a human in clumsy, ungrammatical English, but containing ideas that intersect in interesting ways, can unambiguously define a hefty working algorithm in code + tests + monitoring + performance + security.
Meanwhile a huge, pretty, ornate wall of LLM text will lead to an insanely dumb bug. That is exactly why you should never let an LLM write specs and .md files.
The LLM will NOT write what it doesn't know, because it cannot think. But it will write what everyone already knows, and clog both its own context and the programmer's with meaningless information. A HUGE amount of meaningless, nicely formatted information.
In my runs the LLM produced specs 4 times larger than what I wrote myself, and I haven't practiced writing small specs yet; the figure could easily reach 10x.
You can also set the LLM loose on "research": googling, downloading source code, reading it, and producing a condensed report. Remember that this is not real human research: the novelty of ideas in such a report comes not from a thought process but from search and indexing.
The LLM will read a heap of text and can find a needle in a huge haystack of information (a guy from Anthropic bragged about finding a vulnerability in the Linux kernel). But the LLM cannot draw conclusions.
So there is no point in stuffing the report itself into the .md. Read it, draw your own conclusions, and write in clumsy English: "chunks never break words if can!".
The developer's skill becomes: how do you write a small, lazy md that is easy to verify? Distilling the essence of an algorithm is an art in itself. What worked well for me:
- Describe part of the shape of the input and output data. Think about making the algorithm easy to test; the inputs/outputs should help with that
- An interesting hand-made test case
- Interesting PBT properties
- The tests should be verifiable by eye over a beer: few of them, visually compact, dense, composed by you
- You can and should compile the md several times from a clean context and watch what problems the LLM runs into. It would be cool to set up a watch script and recompile on every file change
You'll be surprised how little you can describe and how much you get back.
Most popular LLM tools (planning mode, spec-kit, OpenSpec, Dex Horthy's Research Plan Implement) follow this flow:
- The programmer writes in chat
- The LLM writes md
- The programmer verifies the md
- The LLM reads the md and writes code and tests
- The programmer verifies the code
Which, to me, is wildly inefficient.
Keep in mind, though, that these are also Macro tools aimed at development that doesn't fit into a single context. But at their very core they have exactly the same flow at the Micro level too.
The LLM writes hefty md files that in fact do nothing but seem useful. My tests with Spec-Kit and OpenSpec clearly confirmed this. Yet both the programmer and the LLM have to clog their context with these md files.
The same happens with LLM-written tests: they are numerous, monotonous, and not very effective.
I propose this flow:
- The programmer writes the md, tests included
- The LLM writes the code
- The programmer verifies the tests
We simply remove the extra step: instead of chatting, we put the specs straight into md. Division of labor: the human thinks, the LLM writes. It's more convenient for everyone: the human has less to read, less to write, and an easier time verifying. The LLM doesn't have to think and has less to read.
Verifying the program code itself is hard because algorithms, unlike tests, are tricky and large. But you can simply skim it for some superficial insight or for anything truly outrageous.
On a mac mini over ssh/tmux:
- Create separate users who cannot see each other's files
- Install a separate Claude for each user
- The Claudes are tied to one account because I don't have the money for 10 Claudes. We assume here that the Claudes don't talk to each other over the internet or give each other hints
- Launch Claude as `claude --dangerously-skip-permissions`
- Implement the same algorithm
- Compare the implementations with each other
We measure:
- Programmer time spent
- Context size
- Errors
Specifics:
- We won't think about managing context. The algorithm is small and fits easily
- For reliability I ran each case several times and made sure it behaves consistently. The article is already long, so I won't attach everything
For RAG I needed embeddings for chat logs. A pile of chats has to be cut into roughly equal pieces and fed to an LLM to get vectors for search. But if you cut bluntly by string length, the phrase
Why did the prod server crash yesterday
may be sliced into
y did the pr, od server cr - which the LLM will interpret poorly.
So we need to cut not only by length but also on separators: line breaks, spaces, punctuation
But if we do that, the chat log:
- the time has come
- will you restart the prod server?
- ok, fine
will be cut into "- will you restart the prod server?" and "- ok, fine"
"ok, fine" carries little information without context, so chunks must not be cut back to back. A small overlap is needed. Ideally the chunks should look like this:
Chunk 1
- the time has come
- will you restart the prod server?
Chunk 2
- will you restart the prod server?
- ok, fine
Now the chunks have context and everything is found more reliably.
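The intermediate step, cutting on separators but without overlap, can be sketched like this. It is a rough illustration, not the implementation used later in the article: `splitOnSeparators` is a hypothetical helper, and the chat log and the limit of 40 characters are made-up inputs.

```typescript
// Greedy splitter: take up to chunkLen characters, then pull the cut back
// to the rightmost occurrence of the highest-priority separator in that window.
// Separators are listed in descending priority. No overlap is produced.
function splitOnSeparators(text: string, chunkLen: number, separators: string[]): string[] {
  if (chunkLen <= 0) throw new Error("chunkLen must be positive");
  const chunks: string[] = [];
  let pos = 0;
  while (pos < text.length) {
    let end = Math.min(pos + chunkLen, text.length);
    if (end < text.length) {
      const windowText = text.slice(pos, end);
      for (const sep of separators) {
        if (sep.length === 0) continue; // an empty separator would never advance
        const at = windowText.lastIndexOf(sep);
        if (at >= 0) {
          end = pos + at + sep.length; // the separator stays with the chunk
          break;
        }
      }
    }
    // If no separator was found, the chunk is cut at the raw length limit.
    chunks.push(text.slice(pos, end));
    pos = end;
  }
  return chunks;
}

const chatLog = "- the time has come\n- will you restart the prod server?\n- ok, fine";
const chatChunks = splitOnSeparators(chatLog, 40, ["\n", " "]);
console.log(chatChunks);
```

With these inputs every line of the log ends up in its own chunk, so the last chunk is just "- ok, fine" with nothing around it. Words are no longer cut, but the context problem is still there, which is what the overlap is for.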
- The left boundary of the overlap is the leftmost separator, but it can never cut words. The separator itself is not included in the overlap
- The right boundary of the text is the rightmost occurrence of the highest-priority separator; it may cut words (if there are no separators). The text includes the separator
- A max chunk length can be set
- A max overlap length can be set
- Whenever possible we try to add overlap to the chunk, and spend the remaining budget on moving forward
- The position in the text to start chunking from can be set, so that the first chunk can have overlap
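The "overlap first, then spend what is left going forward" rule is the one most easily misread, so here is a small sketch of just that part. The helper names `overlapStart` and `forwardBudget` are invented for illustration, and the sketch is simplified (it ignores the edge case near the beginning of the text); it is not the full algorithm.

```typescript
// Left boundary: look back at most maxOverlap characters from mid and start
// right after the leftmost separator found there. Starting after a separator
// means the overlap never begins in the middle of a word, and the separator
// itself is excluded. No separator in the window means no overlap (returns mid).
function overlapStart(text: string, mid: number, maxOverlap: number, separators: string[]): number {
  const from = Math.max(0, mid - maxOverlap);
  const windowText = text.slice(from, mid);
  let start = mid;
  for (const sep of separators) {
    if (sep.length === 0) continue;
    const at = windowText.indexOf(sep);
    if (at < 0) continue;
    start = Math.min(start, from + at + sep.length);
  }
  return start;
}

// Whatever the overlap consumed is subtracted from the chunk budget;
// the remainder is how far the chunk may extend to the right of mid.
function forwardBudget(chunkLen: number, mid: number, start: number): number {
  return chunkLen - (mid - start);
}

const sample = "skip some. Hello world. Some long text";
const seps = [". ", ".", " "];

const firstStart = overlapStart(sample, 11, 9, seps);
console.log(firstStart, forwardBudget(30, 11, firstStart)); // 5 24

const secondStart = overlapStart(sample, 24, 9, seps);
console.log(secondStart, forwardBudget(30, 24, secondStart)); // 17 23
```

For the sample text with a start position of 11, the overlap begins at index 5 ("some. "), which leaves 24 of the 30 characters for moving forward.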
While it is now clear how to produce code with an LLM at the micro level, it is still unclear how to scale this to the macro level, or how to add SDD to a mature existing project.
It would also be interesting to figure out how to build an LLM agent and give it RAG so that non-programmers can send complex queries against company data. The main problem here is keeping the LLM from lying, since manual queries come with no automated tests. Perhaps every query should carry a way to verify it, but that complicates the queries.
An advanced programmer is disappointed with the LLM: it solved the problem, but through code duplication. The tragedy of the century. A C compiler inlines and duplicates functions all the time; it's not a bug, it's a feature.
Duplication is harmful in the thinking domain because changes become fragile and the system gets harder to understand. But in the domain of raw code it may even help performance. The advanced programmer is butthurt, and I understand him.
When I write in C, I don't know how it will be translated into machine code under the hood. I have a rough idea of the algorithms, and I also know that everything is covered by fierce tests verified by humans. C can be trusted.
You can treat an LLM the same way. I write algorithms and architecture in .md, and they are converted into code. The code is covered by fierce tests. I write the tests myself in .md too. In this case the code from the LLM can be trusted. In this special case you can stop worrying about understanding every line of code.
This is where PBT and other advanced testing methods come to the fore. Maybe a 100% coverage report. Definitely protocol verification as in TS. Maybe guided fuzz testing. Performance tests. You can watch for memory leaks. You can feed in production logs and build complex self-repair pipelines.
There is always a huge temptation to ask the LLM to write the md files and get 100x, but that will simply nuke the system. You need a clear understanding of the LLM's responsibilities and the programmer's.
With division of labor and an understanding of each role, an LLM gives 10x leverage. Programmers are needed more than ever.
- Time: 1:30
- Success, 0 errors
- Context size: none, no LLM involved
We drink invigorating Chinese tea and write the code by hand. As always, the task looked like 10 minutes but, reality being what it is, stretched to 1:30. Time estimates in development are unreliable.
Both unit tests and property-based tests were written. I'm fairly confident the program works as expected in 99.9% of cases. If there are bugs, they are quite obscure, for example the overlap can never reach its maximum because the separator is dropped.
The code quality is premium, but the development is extremely tedious. Every detail has to be specified and all the code typed in, which is long and dreary.
The tests written here will be used to check the other implementations. We will never show them to Claude and will use them purely for validation.
```typescript
import assert from 'assert'

export type Chunk = {
  start: number
  mid: number
  end: number
  text: string
}

export const splitStringOnChunks = ({
  text,
  start: textStart = 0,
  separators,
  chunkLen,
  overlap = 0,
}: {
  text: string
  start?: number
  separators: string[]
  chunkLen: number
  overlap?: number
}): Chunk[] => {
  assert(chunkLen > 0)
  assert(textStart >= 0)
  assert(separators.length > 0)
  assert(separators.every(Boolean))
  assert(overlap >= 0)
  assert(overlap < chunkLen)

  const chunks: Chunk[] = []
  if (!text) return chunks

  let mid = textStart
  while (mid < text.length) {
    // Left boundary: start right after the leftmost separator in the overlap window
    let start = 0
    const overlapFrom = mid - overlap
    if (overlapFrom > 0) {
      start = mid
      const potOverlap = text.slice(overlapFrom, mid)
      for (const sep of separators) {
        let idx = potOverlap.indexOf(sep)
        if (idx < 0) continue
        idx += overlapFrom + sep.length
        start = idx < start ? idx : start
      }
    }

    // Right boundary: the budget left after the overlap, cut at the
    // rightmost occurrence of the highest-priority separator
    let end = Math.min(mid + chunkLen - (mid - start), text.length)
    if (end < text.length) {
      const potText = text.slice(mid, end)
      for (const sep of separators) {
        const pos = potText.lastIndexOf(sep)
        if (pos >= 0) {
          end = mid + pos + sep.length
          break
        }
      }
    }

    chunks.push({
      start,
      mid,
      end,
      text: text.slice(start, end),
    })
    assert(end > mid)
    mid = end
  }
  return chunks
}
```
```typescript
import fc from 'fast-check'
import _ from 'lodash'
import { it, expect } from 'vitest'
import { splitStringOnChunks } from './splitStringOnChunks'

it('splitStringOnChunks: base cases', () => {
  expect(
    splitStringOnChunks({
      text: 'skip some. Hello world. Some long text',
      start: 11,
      chunkLen: 30,
      overlap: 9,
      separators: ['. ', '.', ' '],
    }).map((x) => x.text),
  ).toStrictEqual(['some. Hello world. ', 'world. Some long text'])

  expect(
    splitStringOnChunks({
      text: 'skip some. Hello world. Some long text',
      start: 11,
      chunkLen: 30,
      overlap: 9,
      separators: [' ', '.', '. '],
    }).map((x) => x.text),
  ).toStrictEqual(['some. Hello world. Some long ', 'long text'])
})

const paramsGen = fc.record({
  text: fc.string({ minLength: 0, maxLength: 10 }),
  start: fc.integer({ min: 0 }),
  chunkLen: fc.integer({ min: 0 }),
  overlap: fc.integer({ min: 0 }),
  separators: fc.array(fc.string({ minLength: 1, maxLength: 2 }), { minLength: 1, maxLength: 2 }),
})

it('splitStringOnChunks: pbt', () => {
  fc.assert(
    fc.property(fc.context(), paramsGen, (ctx, params) => {
      fc.pre(params.overlap <= params.chunkLen - 1)
      const chunks = splitStringOnChunks(params)
      ctx.log(JSON.stringify(chunks))

      // deterministic
      expect(chunks).toStrictEqual(splitStringOnChunks(params))

      // a nonexistent separator has no effect
      if (!params.text.includes('*'))
        expect(chunks).toStrictEqual(
          splitStringOnChunks({ ...params, separators: ['*', ...params.separators] }),
        )

      const effectiveText = params.text.slice(params.start)
      if (effectiveText) {
        // if there is text there is output
        expect(chunks.length).toBeGreaterThan(0)
      } else {
        // no output if there is nothing to chunk
        expect(chunks.length).toStrictEqual(0)
        return
      }

      // can recover the original text
      expect(chunks.map((x) => params.text.slice(x.mid, x.end)).join('')).toStrictEqual(
        effectiveText,
      )

      // chunks are sane
      chunks.forEach((chunk) => {
        expect(chunk.start).toBeGreaterThanOrEqual(0)
        expect(chunk.start).toBeLessThanOrEqual(chunk.mid)
        expect(chunk.mid).toBeLessThan(chunk.end)
        expect(params.text.slice(chunk.start, chunk.end)).toStrictEqual(chunk.text)
        expect(chunk.text.length).toBeLessThanOrEqual(params.chunkLen)
        expect(chunk.mid - chunk.start).toBeLessThanOrEqual(params.overlap)
        // a chunk that is not the last and is shorter than chunkLen should end with a separator
        if (chunk.end < params.text.length && chunk.text.length < params.chunkLen)
          expect(params.separators.some((x) => chunk.text.endsWith(x))).toBeTruthy()
      })
    }),
    {
      numRuns: 10_000,
    },
  )
})
```
- Failure. There are unfixable errors
We test the dumbest option. In vibe-coding mode we describe the algorithm to Claude and move iteratively, checking by hand how the program works.
Key traits of vibe coding:
- We think very little
- We don't let Claude write tests/automated checks or run them
- We check the program's behavior ourselves
There are two key problems with vibe coding:
Because there are no proper tests, Claude cannot diagnose and fix errors. Claude tries to run things itself, but it comes up with trivial cases that pass easily and don't actually catch the problem. The program fails on the secret test cases from "Writing the code ourselves".
Also, Claude did not elicit all the requirements from me up front. Looking at the program's behavior, I could not tell that the requirements were not met. That is a big problem too.
You won't get far on vibe coding.
The initial request for the algorithm
I want to create a function `splitStringOnChunks()` in `./src/splitStringOnChunks.ts`
It split incoming text on chunks with overlap support for the future vector embeddings
It should support overlap, should allow to specify start position in text, shuld support chunkLen and overlap length
Should support separators. They should be ordered in desc of importance
Left boundary should just take the most left separator, if none, then no overlap
Right boundary should take the most important right separator, if none, then allowed to split words
It wrote the code
run it with interesting parameters and output to ./out.json
Hard to tell whether it works
too big strings. make small strings and without new lines, this will be easy for me to check
Fixing cosmetics
The left overlap boundary should not include separators
Found the first bug
I think there is a bug in this chunk and we should break on more important separator ". "
{
"index": 2,
"start": 26,
"end": 43,
"length": 17,
"text": "ggg hhh iii. jjj "
},
Fixed, but I'm not so sure; the examples are too simple
seems good. but check for bugs. and give some other examples of execution, I'm not sure
I don't see any bugs from the outside; I read the cases and they look good
are we done?
Although everything looked fine from the outside, running it against the tests from "Writing the code ourselves" revealed errors
I wounder how the algo works regard budget for the right side. Do we initially try to match overlap and then use unspent budget to
match right side? Or does it work other way around?
Yes, we simply misunderstood each other. This requirement was never stated, but I had implied it. And the examples given did not contain this case, so we missed it. My bad.
Yes, I want other way around, I want initially try to match overlap, and then try to match right boudnary
It says it fixed it, but in fact the same error remains.
are you sure it works?
No, it isn't sure, and it changed something in the code. This is where the tedious grind with Claude begins: you try to talk it into solving one obvious problem and it just stalls. It doesn't fix it.
show me the example when we first try to match left overlap and the right part and it clearly seen that we used some budget first on the left overlap and the right one didn't consume as much as it could
And in reality everything works fine on the example it gave, while my secret example fails. So the bug exists, but Claude flatly doesn't see it, because it picked an example that doesn't expose it. It sort of solved the problem without solving it.
This is the end of vibe coding; from here on I would have to think and show Claude examples I came up with myself. Let's wrap up. I also want to show it the secret example, maybe that helps.
Run code on this example and tell me what you think
splitStringOnChunks({
text: 'skip some. Hello world. Some long text',
start: 11,
chunkLen: 30,
overlap: 9,
separators: ['. ', '.', ' '],
}).map((x) => x.text),
It says it looked and everything is fine. I say: look, damn it, why does the first chunk have no overlap???
Good catch. The first chunk skips overlap because of... - Ah, of course, my example helped it, and now everything works. But this is no longer vibe coding.
```typescript
export interface SplitStringOnChunksOptions {
  text: string;
  chunkLen: number;
  overlap?: number;
  start?: number;
  separators?: string[];
}

export interface Chunk {
  text: string;
  start: number;
  mid: number;
  end: number;
}

export function splitStringOnChunks(options: SplitStringOnChunksOptions): Chunk[] {
  const {
    text,
    chunkLen,
    overlap = 0,
    start = 0,
    separators = ["\n\n", "\n", ". ", " "],
  } = options;

  if (chunkLen <= 0) throw new Error("chunkLen must be positive");
  if (overlap < 0) throw new Error("overlap must be non-negative");
  if (overlap >= chunkLen) throw new Error("overlap must be less than chunkLen");
  if (start < 0 || start >= text.length) return [];

  const chunks: Chunk[] = [];
  let mid = start;

  while (mid < text.length) {
    // Step 1: Spend overlap budget: look back from mid to find left boundary
    let chunkStart = mid;
    if (overlap > 0 && chunks.length > 0) {
      const overlapStart = Math.max(0, mid - overlap);
      const overlapRegion = text.substring(overlapStart, mid);
      // Find the leftmost separator in the overlap region
      let leftmost = -1;
      for (const sep of separators) {
        const idx = overlapRegion.indexOf(sep);
        if (idx !== -1) {
          const candidate = overlapStart + idx;
          if (leftmost === -1 || candidate < leftmost) {
            leftmost = candidate;
          }
        }
      }
      if (leftmost !== -1) {
        chunkStart = leftmost + separators.find(s => text.startsWith(s, leftmost))!.length;
      }
    }

    const overlapUsed = mid - chunkStart;
    const forwardBudget = chunkLen - overlapUsed;

    // Step 2: Spend remaining budget forward: find right boundary
    const rawEnd = Math.min(mid + forwardBudget, text.length);
    let end = rawEnd;
    if (rawEnd < text.length) {
      // Look for the most important separator in the back half of the forward region
      let found = false;
      for (const sep of separators) {
        const searchStart = Math.max(mid, rawEnd - Math.floor(chunkLen / 2));
        const region = text.substring(searchStart, rawEnd);
        const lastIdx = region.lastIndexOf(sep);
        if (lastIdx !== -1) {
          end = searchStart + lastIdx + sep.length;
          found = true;
          break;
        }
      }
      if (!found) {
        end = rawEnd;
      }
    }

    chunks.push({ text: text.substring(chunkStart, end), start: chunkStart, mid, end });
    if (end >= text.length) break;
    mid = end;
  }

  return chunks;
}
```
- Failure. There are unfixable errors
We try working on the program with a loop of /clear + instruction. We apply the same mutation many times from a clean context and see what comes out.
Let's ask the LLM to take on all the intellectual work and all the high-level architectural decisions. Let Claude write our function from scratch without specs.
This approach gives a better result than vibe coding. However, there are still subtle unfixable bugs. More importantly, the chosen algorithm is dreadful. Claude must not be given architectural decisions; it botches them.
Write a TypeScript function that will split text on chunks for embeddings. Test the function with tests.
Claude correctly understood that embeddings need both overlap and prioritized separators. It sketched out a function and tests. Visually everything looks as civilized as can be. In practice it's complete trash:
"hello world lol internet" maxChunkSize=6 overlap=3
turns into the chunks
"hello"
"lo"
"o"
"world"
"ld"
"d lol"
"ol"
"l"
"internet"
Something is wrong with the overlap. To be sure, let's repeat the loop a couple of times.
/clear
There is a function here, improve it
Write a TypeScript function that will split text on chunks for embeddings. Test the function with tests
We repeat 3 more times. The program grows with every iteration. After the first iteration we had 90 lines of program and 100 lines of tests. After the fourth: 420 lines of program and 690 lines of tests.
And the tokens are not burned in vain. The funny thing is that the iterative approach FIXED bugs. The program improves with iterations, while I expected it to degrade. However, there is a trickier class of bugs that still doesn't get fixed no matter how much I iterate:
It is subtle; even explaining this bug is a real pain.
- If a chunk has room for overlap, we try to find characters to the left of the chunk and add them to it
- If there are 2 separator points there, and the leftmost one exceeds the chunk size
- While the second one does not, and could be used for the overlap
- Then we skip the overlap, when we should simply have taken a smaller overlap
Example: in the first case the overlap is LARGER than in the second, and that BLOCKS the overlap:
maxChunkSize=12 overlap=6 text=hello a a worldasdf
"hello a a" | "worldasdf"
maxChunkSize=12 overlap=3 text=hello a a worldasdf
"hello a a" | "a worldasdf"
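The "just take a smaller overlap" behavior can be sketched as below. This is a hypothetical helper written for this explanation, not Claude's code and not the hand-written implementation: `overlapStartWithFallback` and its `budget` parameter (the free room left in the chunk) are invented names, and the design it patches (overlap paid from leftover chunk budget) is the one Claude chose.

```typescript
// Collect every position right after a separator inside the look-back window,
// try them from the leftmost (largest overlap) to the rightmost (smallest),
// and return the first one whose overlap fits into the remaining budget.
// Returns mid when nothing fits, meaning the chunk gets no overlap.
function overlapStartWithFallback(
  text: string,
  mid: number,
  maxOverlap: number,
  budget: number,
  separators: string[],
): number {
  const windowStart = Math.max(0, mid - maxOverlap);
  const candidates: number[] = [];
  for (const sep of separators) {
    if (sep.length === 0) continue;
    let idx = text.indexOf(sep, windowStart);
    while (idx !== -1 && idx + sep.length <= mid) {
      candidates.push(idx + sep.length);
      idx = text.indexOf(sep, idx + 1);
    }
  }
  candidates.sort((a, b) => a - b);
  for (const start of candidates) {
    const used = mid - start;
    if (used > 0 && used <= budget) return start;
  }
  return mid;
}

// The failing example from above: the chunk body "worldasdf" starts at index 10
// and is 9 characters long, so with maxChunkSize=12 only 3 characters are free.
const bugText = "hello a a worldasdf";
const withBigOverlap = overlapStartWithFallback(bugText, 10, 6, 3, [" "]);
const withSmallOverlap = overlapStartWithFallback(bugText, 10, 3, 3, [" "]);
console.log(bugText.slice(withBigOverlap), "|", bugText.slice(withSmallOverlap));
```

With the fallback, both overlap=6 and overlap=3 yield the same second chunk, "a worldasdf": the larger setting no longer blocks the overlap, it just settles for what fits.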
Let's try to compress the program and add PBT. Several iterations
There is a function here, improve it
Write a TypeScript function that will split text on chunks for embeddings
Make the function and tests more compact. There is so much code, I think lots of stuff is excessive and redundant
Remove redundant stuff
After compression: 200 lines of program and 117 lines of tests. However, the program did not get any less dumb:
- The bugs remain
- The fundamental idea of the overlap itself is wrong
With the chosen algorithm (even if the bugs were magically fixed), a huge share of chunks will simply skip the overlap. In effect the algorithm runs at a third of its power. That is, you cannot set a 20% overlap and have all your chunks carry it; what came out is something quite different:
- If, when a chunk is created, part of its maxChunkSize budget is left over
- Then that part may be spent on overlap, within the "overlap" budget.
This is an insane design. Claude picked a very dumb solution, and iterations cannot get away from it; it is stuck there. A total failure. Fundamentally Claude cannot think, and I fear that however you prompt it, Claude will not be able to come up with correct specs for itself.
export interface ChunkOptions {
maxChunkSize?: number;
minChunkSize?: number;
overlap?: number;
separators?: string[];
sizeOf?: (text: string) => number;
}
export interface Chunk {
text: string;
start: number;
end: number;
}
interface Span {
start: number;
end: number;
}
const DEFAULTS: Required<ChunkOptions> = {
maxChunkSize: 1000,
minChunkSize: 100,
overlap: 200,
separators: ["\n# ", "\n## ", "\n### ", "\n\n", "\n", ". ", "? ", "! ", " "],
sizeOf: (s: string) => s.length,
};
function measure(text: string, start: number, end: number, sizeOf: (s: string) => number): number {
return sizeOf(text.slice(start, end));
}
export function chunkTextWithMeta(text: string, options?: ChunkOptions): Chunk[] {
const { maxChunkSize, minChunkSize, overlap, separators, sizeOf } = { ...DEFAULTS, ...options };
if (maxChunkSize <= 0) throw new Error("maxChunkSize must be positive");
if (minChunkSize < 0) throw new Error("minChunkSize must be non-negative");
if (overlap < 0) throw new Error("overlap must be non-negative");
if (overlap >= maxChunkSize) throw new Error("overlap must be less than maxChunkSize");
const trimmedFull = trimSpan(text, 0, text.length);
if (!trimmedFull.text) return [];
if (measure(text, trimmedFull.start, trimmedFull.end, sizeOf) <= maxChunkSize) {
return [trimmedFull];
}
const segments = recursiveSplit(text, 0, text.length, maxChunkSize, separators, sizeOf);
const trimmed = segments
.map((s) => trimSpan(text, s.start, s.end))
.filter((s) => s.text.length > 0);
const chunks = mergeSmall(text, trimmed, minChunkSize, maxChunkSize, sizeOf);
if (overlap === 0 || chunks.length <= 1) return chunks;
return applyOverlap(text, chunks, overlap, maxChunkSize, separators, sizeOf);
}
export function chunkText(text: string, options?: ChunkOptions): string[] {
return chunkTextWithMeta(text, options).map((c) => c.text);
}
function recursiveSplit(
text: string, start: number, end: number,
maxChunkSize: number, separators: string[], sizeOf: (s: string) => number,
): Span[] {
if (measure(text, start, end, sizeOf) <= maxChunkSize) return [{ start, end }];
for (let sepIdx = 0; sepIdx < separators.length; sepIdx++) {
const parts = splitOn(text, start, end, separators[sepIdx]);
if (parts.length <= 1) continue;
// Greedily merge parts into segments that fit
const segments: Span[] = [];
let segStart = parts[0].start, segEnd = parts[0].end;
for (let i = 1; i < parts.length; i++) {
if (measure(text, segStart, parts[i].end, sizeOf) <= maxChunkSize) {
segEnd = parts[i].end;
} else {
segments.push({ start: segStart, end: segEnd });
segStart = parts[i].start;
segEnd = parts[i].end;
}
}
segments.push({ start: segStart, end: segEnd });
const finer = separators.slice(sepIdx + 1);
const result: Span[] = [];
for (const seg of segments) {
if (measure(text, seg.start, seg.end, sizeOf) > maxChunkSize) {
result.push(...(finer.length > 0
? recursiveSplit(text, seg.start, seg.end, maxChunkSize, finer, sizeOf)
: hardSplit(text, seg.start, seg.end, maxChunkSize, sizeOf)));
} else {
result.push(seg);
}
}
return result;
}
return hardSplit(text, start, end, maxChunkSize, sizeOf);
}
function splitOn(text: string, start: number, end: number, sep: string): Span[] {
const parts: Span[] = [];
let pos = start;
while (pos < end) {
const idx = text.indexOf(sep, pos);
if (idx === -1 || idx >= end) { parts.push({ start: pos, end }); break; }
parts.push({ start: pos, end: idx + sep.length });
pos = idx + sep.length;
}
return parts;
}
function hardSplit(
text: string, start: number, end: number,
maxChunkSize: number, sizeOf: (s: string) => number,
): Span[] {
const result: Span[] = [];
let pos = start;
while (pos < end) {
if (measure(text, pos, end, sizeOf) <= maxChunkSize) { result.push({ start: pos, end }); break; }
let lo = pos + 1, hi = end;
while (lo < hi) {
const mid = lo + Math.floor((hi - lo + 1) / 2);
if (measure(text, pos, mid, sizeOf) <= maxChunkSize) lo = mid; else hi = mid - 1;
}
result.push({ start: pos, end: lo });
pos = lo;
}
return result;
}
function trimSpan(text: string, start: number, end: number): { text: string; start: number; end: number } {
const raw = text.slice(start, end);
const trimmed = raw.trim();
if (!trimmed) return { text: "", start, end: start };
const lead = raw.length - raw.trimStart().length;
const trail = raw.length - raw.trimEnd().length;
return { text: trimmed, start: start + lead, end: end - trail };
}
function mergeSmall(
text: string,
segments: { text: string; start: number; end: number }[],
minChunkSize: number, maxChunkSize: number, sizeOf: (s: string) => number,
): { text: string; start: number; end: number }[] {
if (segments.length <= 1 || minChunkSize === 0) return segments;
const result: { text: string; start: number; end: number }[] = [];
for (const seg of segments) {
if (sizeOf(seg.text) >= minChunkSize || result.length === 0) {
result.push({ ...seg });
continue;
}
const prev = result[result.length - 1];
const merged = trimSpan(text, prev.start, seg.end);
if (measure(text, merged.start, merged.end, sizeOf) <= maxChunkSize) {
result[result.length - 1] = merged;
} else {
result.push({ ...seg });
}
}
return result;
}
function applyOverlap(
text: string, chunks: Chunk[], overlap: number,
maxChunkSize: number, separators: string[], sizeOf: (s: string) => number,
): Chunk[] {
const result: Chunk[] = [chunks[0]];
for (let i = 1; i < chunks.length; i++) {
const chunk = chunks[i];
const rawStart = Math.max(chunk.start - overlap, chunks[i - 1].start);
const snapped = snapToSep(text, rawStart, chunk.start, separators);
const trimmed = trimSpan(text, snapped, chunk.end);
if (!trimmed.text) continue;
result.push(measure(text, trimmed.start, trimmed.end, sizeOf) > maxChunkSize ? chunk : trimmed);
}
return result;
}
function snapToSep(text: string, pos: number, bound: number, separators: string[]): number {
const region = text.slice(pos, bound);
for (const sep of separators) {
const idx = region.indexOf(sep);
if (idx !== -1) return pos + idx + sep.length;
}
return pos;
}
Details
import { describe, it, expect } from "vitest";
import { chunkText, chunkTextWithMeta } from "./chunk-text.js";
describe("chunkText", () => {
it("returns empty for empty/whitespace input", () => {
expect(chunkText("")).toEqual([]);
expect(chunkText(" \n\n ")).toEqual([]);
});
it("returns single chunk when text fits", () => {
expect(chunkText("Short text.", { maxChunkSize: 500, overlap: 0 })).toEqual(["Short text."]);
});
it("splits on sentence boundaries when paragraphs are too large", () => {
const text = "First sentence. Second sentence. Third sentence.";
expect(chunkText(text, { maxChunkSize: 25, overlap: 0, minChunkSize: 0 })).toEqual([
"First sentence.",
"Second sentence.",
"Third sentence.",
]);
});
it("falls through to finer separators for oversized chunks", () => {
const text =
"Short paragraph.\n\n" +
"This is a much longer paragraph with multiple sentences. " +
"It needs to be split on sentence boundaries. " +
"Because paragraph splitting alone is not enough.\n\n" +
"Final paragraph.";
const result = chunkText(text, { maxChunkSize: 60, overlap: 0, minChunkSize: 0 });
result.forEach((c) => expect(c.length).toBeLessThanOrEqual(60));
expect(result.some((c) => c.includes("Short paragraph"))).toBe(true);
expect(result.some((c) => c.includes("Final paragraph"))).toBe(true);
});
it("hard-cuts when no separators match", () => {
expect(
chunkText("abcdefghijklmnopqrstuvwxyz", { maxChunkSize: 10, overlap: 0, minChunkSize: 0, separators: [] }),
).toEqual(["abcdefghij", "klmnopqrst", "uvwxyz"]);
});
it("respects custom separators", () => {
expect(
chunkText("part1|part2|part3", { maxChunkSize: 10, overlap: 0, minChunkSize: 0, separators: ["|"] }),
).toEqual(["part1|", "part2|", "part3"]);
});
it("merges small fragments into neighbors", () => {
expect(chunkText("Hello world. Bye", { maxChunkSize: 18, overlap: 0, minChunkSize: 5 })).toEqual([
"Hello world. Bye",
]);
});
it("splits on markdown headers", () => {
const text = "# Intro\n\nIntro text.\n\n## Section One\n\nContent one.\n\n## Section Two\n\nContent two.";
const result = chunkText(text, { maxChunkSize: 40, overlap: 0, minChunkSize: 0 });
expect(result.some((c) => c.includes("Intro"))).toBe(true);
expect(result.some((c) => c.includes("Section One"))).toBe(true);
expect(result.some((c) => c.includes("Section Two"))).toBe(true);
});
it("throws on invalid options", () => {
expect(() => chunkText("hello", { maxChunkSize: 0 })).toThrow("maxChunkSize must be positive");
expect(() => chunkText("hello", { overlap: -1 })).toThrow("overlap must be non-negative");
expect(() => chunkText("hello", { minChunkSize: -1 })).toThrow("minChunkSize must be non-negative");
expect(() => chunkText("hello", { maxChunkSize: 10, overlap: 10 })).toThrow("overlap must be less than");
});
});
describe("overlap", () => {
it("consecutive chunks share overlapping content", () => {
const text =
"Alpha bravo charlie. Delta echo foxtrot. Golf hotel india. " +
"Juliet kilo lima. Mike november oscar.";
const result = chunkTextWithMeta(text, { maxChunkSize: 50, overlap: 20, minChunkSize: 0 });
expect(result.some((c, i) => i > 0 && c.start < result[i - 1].end)).toBe(true);
});
it("snaps overlap to separator boundaries", () => {
const text = "Sentence one here. Sentence two here. Sentence three here.";
const result = chunkTextWithMeta(text, { maxChunkSize: 25, overlap: 15, minChunkSize: 0 });
for (const chunk of result) {
expect(chunk.text).toMatch(/^[A-Z]/);
expect(text.slice(chunk.start, chunk.end)).toBe(chunk.text);
}
});
});
describe("custom sizeOf", () => {
const wordCount = (s: string): number => {
const t = s.trim();
return t.length === 0 ? 0 : t.split(/\s+/).length;
};
it("chunks by word count", () => {
const text = "one two three four five six seven eight nine ten";
const result = chunkText(text, { maxChunkSize: 4, overlap: 0, minChunkSize: 0, sizeOf: wordCount });
result.forEach((c) => expect(wordCount(c)).toBeLessThanOrEqual(4));
expect(result.join(" ")).toContain("ten");
});
it("hard-splits with custom sizeOf", () => {
const result = chunkText("aaa bbb ccc ddd eee fff", {
maxChunkSize: 2, overlap: 0, minChunkSize: 0, sizeOf: wordCount, separators: [],
});
result.forEach((c) => expect(wordCount(c)).toBeLessThanOrEqual(2));
});
});
import { describe, it, expect } from "vitest";
import fc from "fast-check";
import { chunkText, chunkTextWithMeta } from "./chunk-text.js";
const validOptions = () =>
fc
.record({
maxChunkSize: fc.integer({ min: 1, max: 500 }),
overlapFraction: fc.double({ min: 0, max: 0.9, noNaN: true }),
minChunkFraction: fc.double({ min: 0, max: 0.9, noNaN: true }),
})
.map(({ maxChunkSize, overlapFraction, minChunkFraction }) => ({
maxChunkSize,
overlap: Math.floor(maxChunkSize * overlapFraction),
minChunkSize: Math.floor(maxChunkSize * minChunkFraction),
}));
describe("property-based tests", () => {
it("never produces chunks exceeding maxChunkSize", () => {
fc.assert(
fc.property(fc.string({ minLength: 1, maxLength: 2000 }), validOptions(), (text, opts) => {
for (const chunk of chunkText(text, { ...opts, minChunkSize: 0 })) {
expect(chunk.length).toBeLessThanOrEqual(opts.maxChunkSize);
}
}),
{ numRuns: 300 },
);
});
it("preserves all non-whitespace content", () => {
fc.assert(
fc.property(fc.string({ minLength: 1, maxLength: 1000 }), (text) => {
const chunks = chunkText(text, { maxChunkSize: 50, overlap: 0, minChunkSize: 0 });
expect(chunks.join("").replace(/\s+/g, "")).toBe(text.replace(/\s+/g, ""));
}),
{ numRuns: 300 },
);
});
it("chunk metadata offsets match chunk text", () => {
fc.assert(
fc.property(fc.string({ minLength: 1, maxLength: 1000 }), validOptions(), (text, opts) => {
for (const chunk of chunkTextWithMeta(text, opts)) {
expect(text.slice(chunk.start, chunk.end)).toBe(chunk.text);
}
}),
{ numRuns: 300 },
);
});
it("overlap=0 produces non-overlapping spans", () => {
fc.assert(
fc.property(fc.string({ minLength: 1, maxLength: 1000 }), (text) => {
const chunks = chunkTextWithMeta(text, { maxChunkSize: 50, overlap: 0, minChunkSize: 0 });
for (let i = 1; i < chunks.length; i++) {
expect(chunks[i].start).toBeGreaterThanOrEqual(chunks[i - 1].end);
}
}),
{ numRuns: 200 },
);
});
it("all chunks are trimmed", () => {
fc.assert(
fc.property(fc.string({ minLength: 1, maxLength: 1000 }), validOptions(), (text, opts) => {
for (const chunk of chunkTextWithMeta(text, opts)) {
expect(chunk.text).toBe(chunk.text.trim());
}
}),
{ numRuns: 200 },
);
});
it("custom sizeOf is respected", () => {
const wordCount = (s: string): number => {
const trimmed = s.trim();
return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length;
};
fc.assert(
fc.property(
fc.array(fc.constantFrom("alpha", "bravo", "charlie", "delta", "echo", "fox"), {
minLength: 5,
maxLength: 100,
}),
fc.integer({ min: 2, max: 10 }),
(words, maxTokens) => {
for (const chunk of chunkText(words.join(" "), { maxChunkSize: maxTokens, overlap: 0, minChunkSize: 0, sizeOf: wordCount })) {
expect(wordCount(chunk)).toBeLessThanOrEqual(maxTokens);
}
},
),
{ numRuns: 200 },
);
});
});
- Time: 0:40
- Not a one-shot. There were bugs, but we fixed them with my help
- 255 lines of tests produced in total
Vibe coding left us with two problems:
- Claude did not find out all my requirements up front. And I had no way to diagnose that from the outside: on the examples everything looked fine
- The program had bugs. Automated tests are absolutely necessary
Planning mode attacks the first problem: it lets Claude ask us questions and clarify details that are not obvious to it. As for the second, let's simply see whether Claude proposes a solution on its own and comes up with an approach that yields bug-free code. In the end it proposed tests and wrote them itself.
Claude honestly asked me for details and clarified the requirements before starting development, thanks for that. However, the code had a bug, and Claude flatly refused to see it as a bug until I rubbed its nose in it. This is a general problem, the same one we saw with vibe coding.
When Claude invents its own test cases, it slacks off. It picks cases that are easy for it to pass, not cases that show the program works correctly. And I have to wade through this pile of tedium, reading pointless checks of the same thing over and over, only to find a dumb mistake in the end.
Planning has another huge problem: during iterative discussion of the requirements, Claude can randomly rewrite pieces of them in unexpected places. In practice you have to reread the requirements in full after every question from Claude, which is infuriating.
Details
The initial request. We made sure planning mode is on. Let's try asking for best practices.
Write ./src/splitStringOnChunks/splitStringOnChunks.ts
It should export splitStringOnChunks() that splits text on chunks for embeddings
splitStringOnChunks accepts
chunkLen - max chunk length
overlap - overlap len
start - position from where start chunking
separators - list of separators in desc of importance order
Use the best practice to produce reliable code that I will be able to confirm as valid
Then Claude and I talk back and forth for a while, discussing the approach. Claude proposes an algorithm that is not the one I like. We discuss it. It proposes tests. Everything looks fine. I accept the plan.
180 lines of tests, and they are readable. Visually everything works great, so let's check it against our secret tests from 'Writing the code ourselves'.
It fails on our tests, bummer. I don't want to hand it our test cases, that would be cheating. I want it to find the bug on its own.
Can you check the original requirements and double check that they hold
I found a bug. On line 47, overlapWindowStart = currentPos - overlap can go negative (or before start) when currentPos is small relative to overlap. For example, if the first chunk is short and overlap is large:
OK, so we can just ask it the same thing several times in a loop and wait until its solution settles. The problem is that I cannot reset the context and start from scratch: I would lose our requirements, since they are not written down anywhere. All I can do is keep asking in a loop.
It says everything is fine. I run our tests: they fail.
I'll try asking directly.
I think the overlap logic is wrong in cases, where we should first give chunkLen budget to overlap and then use the rest for chunk body. Can you check it with a test?
It wrote a lot of complicated tests that I can't make sense of. I ask it to come up with one concrete case
no, I mean make a concrete test where you combine start/overlap and concrete strings in an interesting 1 case. I want to check an easy to observe case
I can see the bug, but Claude flat out cannot. When both start and overlap are set, the first chunk gets no overlap
I think something is wrong with this case
Are you sure it is ok?
const text = "SKIP|aa|bb|cc|dd|ee|ff";
const result = splitStringOnChunks({
text,
chunkLen: 9,
overlap: 6,
start: 5,
separators: ["|"],
});
assert.deepEqual(result, [
{ start: 5, mid: 5, end: 14, text: "aa|bb|cc|" },
{ start: 11, mid: 14, end: 20, text: "cc|dd|ee|" },
{ start: 17, mid: 20, end: 22, text: "ee|ff" },
]);
It didn't get it. The problem looks like the one from vibe coding: the first chunk has no overlap.
why the first chunk has no overlap?
You're right — that's the bug. When start > 0, the first chunk should still attempt overlap into the text before start. There's text there ("SKIP|") that could provide context.
The fix is simple: remove the isFirstChunk guard. When start=0, the overlap window would naturally be empty (text[0..0]), producing no overlap anyway.
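The fix Claude describes can be illustrated with a small standalone sketch of the overlap step. `resolveChunkStart` is a hypothetical helper name, not something from the generated module; it only mirrors the "leftmost separator in the overlap window" logic and deliberately has no first-chunk guard.

```typescript
// Hypothetical helper (an assumption, not part of the generated module):
// decides where a chunk starts, given where its new content begins.
function resolveChunkStart(
  text: string,
  currentPos: number,
  overlap: number,
  separators: string[],
): number {
  // The overlap window is the text right before currentPos, at most `overlap` chars.
  const windowStart = Math.max(0, currentPos - overlap);
  const overlapWindow = text.slice(windowStart, currentPos);

  // The leftmost separator of any priority wins.
  let leftmost = -1;
  let sepLen = 0;
  for (const sep of separators) {
    const idx = overlapWindow.indexOf(sep);
    if (idx !== -1 && (leftmost === -1 || idx < leftmost)) {
      leftmost = idx;
      sepLen = sep.length;
    }
  }

  // No separator in the window: skip overlap and start at currentPos.
  if (leftmost === -1) return currentPos;
  // Otherwise start right after the separator.
  return windowStart + leftmost + sepLen;
}

const sample = "SKIP|aa|bb|cc|dd|ee|ff";
const startAtZero = resolveChunkStart(sample, 0, 6, ["|"]); // window is empty
const startAtFive = resolveChunkStart(sample, 5, 6, ["|"]); // window is "SKIP|"
const startAtFourteen = resolveChunkStart(sample, 14, 6, ["|"]); // window is "bb|cc|"
```

With `currentPos = 0` the window is empty, so no guard is needed for that case. In this particular sample the window before `start = 5` is "SKIP|", whose only separator sits at its very end, so the first chunk still begins at 5 without any overlap text, which agrees with the expected values in the test above.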
We could stop here. But let's run it through our tests one more time. And the 'Writing the code ourselves' tests pass!
The code came out awfully large, though. My bet is that one iteration of refactoring or performance optimization would shrink it; I have seen that happen before.
Details
export type Chunk = {
start: number;
mid: number;
end: number;
text: string;
};
export interface SplitStringOnChunksOptions {
text: string;
chunkLen: number;
overlap: number;
start?: number;
separators?: string[];
}
const DEFAULT_SEPARATORS = ["\n\n", "\n", ". ", " "];
export function splitStringOnChunks(options: SplitStringOnChunksOptions): Chunk[] {
const {
text,
chunkLen,
overlap,
start = 0,
separators = DEFAULT_SEPARATORS,
} = options;
if (chunkLen <= 0) throw new Error("chunkLen must be greater than 0");
if (overlap < 0) throw new Error("overlap must be non-negative");
if (overlap >= chunkLen) throw new Error("overlap must be less than chunkLen");
if (start < 0) throw new Error("start must be non-negative");
if (start >= text.length) return [];
const chunks: Chunk[] = [];
let currentPos = start;
while (currentPos < text.length) {
let chunkStart: number;
let overlapUsed: number;
if (overlap === 0 || currentPos === 0) {
chunkStart = currentPos;
overlapUsed = 0;
} else {
// Step 1: resolve overlap — find leftmost separator in overlap window
const overlapWindowStart = Math.max(0, currentPos - overlap);
const overlapWindow = text.slice(overlapWindowStart, currentPos);
let leftmostPos = -1;
let leftmostSepLen = 0;
for (const sep of separators) {
const idx = overlapWindow.indexOf(sep);
if (idx !== -1 && (leftmostPos === -1 || idx < leftmostPos)) {
leftmostPos = idx;
leftmostSepLen = sep.length;
}
}
if (leftmostPos !== -1) {
// Start after the separator
chunkStart = overlapWindowStart + leftmostPos + leftmostSepLen;
overlapUsed = currentPos - chunkStart;
} else {
// No separator found — no overlap
chunkStart = currentPos;
overlapUsed = 0;
}
}
const mid = currentPos;
// Step 2: resolve right boundary
const remainingBudget = chunkLen - overlapUsed;
const rightLimit = chunkStart + overlapUsed + remainingBudget; // = chunkStart + chunkLen
if (rightLimit >= text.length) {
// Remaining text fits in budget
const chunkText = text.slice(chunkStart);
if (chunkText.trim().length > 0) {
chunks.push({ start: chunkStart, mid, end: text.length, text: chunkText });
}
break;
}
// Search backward from rightLimit for a separator (priority order)
const searchWindow = text.slice(currentPos, rightLimit);
let splitOffset = -1;
let splitSepLen = 0;
for (const sep of separators) {
const idx = searchWindow.lastIndexOf(sep);
if (idx !== -1) {
splitOffset = idx;
splitSepLen = sep.length;
break; // first separator in list = highest priority
}
}
if (splitOffset !== -1) {
const absoluteSplitEnd = currentPos + splitOffset + splitSepLen;
const chunkText = text.slice(chunkStart, absoluteSplitEnd);
if (chunkText.trim().length > 0) {
chunks.push({ start: chunkStart, mid, end: absoluteSplitEnd, text: chunkText });
}
currentPos = absoluteSplitEnd;
} else {
// Hard cut
const chunkText = text.slice(chunkStart, rightLimit);
if (chunkText.trim().length > 0) {
chunks.push({ start: chunkStart, mid, end: rightLimit, text: chunkText });
}
currentPos = rightLimit;
}
}
return chunks;
}
Details
import { describe, it } from "node:test";
import assert from "node:assert/strict";
import { splitStringOnChunks, type Chunk } from "./splitStringOnChunks.js";
function texts(chunks: Chunk[]): string[] {
return chunks.map((c) => c.text);
}
describe("splitStringOnChunks", () => {
describe("input validation", () => {
it("throws if chunkLen <= 0", () => {
assert.throws(
() => splitStringOnChunks({ text: "hello", chunkLen: 0, overlap: 0 }),
/chunkLen must be greater than 0/
);
});
it("throws if overlap < 0", () => {
assert.throws(
() => splitStringOnChunks({ text: "hello", chunkLen: 10, overlap: -1 }),
/overlap must be non-negative/
);
});
it("throws if overlap >= chunkLen", () => {
assert.throws(
() => splitStringOnChunks({ text: "hello", chunkLen: 10, overlap: 10 }),
/overlap must be less than chunkLen/
);
});
it("throws if start < 0", () => {
assert.throws(
() => splitStringOnChunks({ text: "hello", chunkLen: 10, overlap: 0, start: -1 }),
/start must be non-negative/
);
});
});
describe("edge cases", () => {
it("returns empty array for empty string", () => {
assert.deepEqual(
splitStringOnChunks({ text: "", chunkLen: 10, overlap: 0 }),
[]
);
});
it("returns empty array if start >= text length", () => {
assert.deepEqual(
splitStringOnChunks({ text: "hello", chunkLen: 10, overlap: 0, start: 100 }),
[]
);
});
it("returns single chunk when text fits in chunkLen", () => {
const result = splitStringOnChunks({ text: "hello world", chunkLen: 20, overlap: 0 });
assert.deepEqual(result, [
{ start: 0, mid: 0, end: 11, text: "hello world" },
]);
});
});
describe("basic splitting (no overlap)", () => {
it("splits on space separator", () => {
const result = splitStringOnChunks({
text: "aaa bbb ccc ddd",
chunkLen: 8,
overlap: 0,
separators: [" "],
});
assert.deepEqual(texts(result), ["aaa bbb ", "ccc ddd"]);
assert.deepEqual(result[0], { start: 0, mid: 0, end: 8, text: "aaa bbb " });
assert.deepEqual(result[1], { start: 8, mid: 8, end: 15, text: "ccc ddd" });
});
it("hard-cuts when no separator found", () => {
const result = splitStringOnChunks({
text: "abcdefghijklmnop",
chunkLen: 5,
overlap: 0,
separators: [" "],
});
assert.deepEqual(texts(result), ["abcde", "fghij", "klmno", "p"]);
});
});
describe("separator priority on right boundary", () => {
it("prefers higher-priority separator", () => {
const text = "hello world\nfoo bar baz";
const result = splitStringOnChunks({
text,
chunkLen: 15,
overlap: 0,
separators: ["\n", " "],
});
assert.deepEqual(texts(result), ["hello world\n", "foo bar baz"]);
});
it("falls back to lower-priority separator when higher not found", () => {
const text = "hello world foo bar";
const result = splitStringOnChunks({
text,
chunkLen: 10,
overlap: 0,
separators: ["\n", " "],
});
assert.deepEqual(texts(result), ["hello ", "world foo ", "bar"]);
});
});
describe("overlap", () => {
it("snaps overlap to leftmost separator in overlap window", () => {
const result = splitStringOnChunks({
text: "aaa bbb ccc ddd eee",
chunkLen: 8,
overlap: 5,
separators: [" "],
});
assert.deepEqual(texts(result), ["aaa bbb ", "bbb ccc ", "ccc ddd ", "ddd eee"]);
// Second chunk: start=4 (overlap), mid=8 (new content), end=12
assert.deepEqual(result[1], { start: 4, mid: 8, end: 12, text: "bbb ccc " });
});
it("handles overlap larger than currentPos without going negative", () => {
// First chunk ends at pos 3 ("ab "), then overlap=4 > currentPos=3
const result = splitStringOnChunks({
text: "ab cd ef gh",
chunkLen: 5,
overlap: 4,
separators: [" "],
});
assert.deepEqual(texts(result), ["ab ", "cd ", "cd ef", "ef gh"]);
// Verify text integrity
for (const chunk of result) {
assert.equal(chunk.text, "ab cd ef gh".slice(chunk.start, chunk.end));
}
});
it("skips overlap when no separator in overlap window", () => {
const result = splitStringOnChunks({
text: "abcdefghijklmnop",
chunkLen: 8,
overlap: 3,
separators: [" "],
});
assert.deepEqual(texts(result), ["abcdefgh", "ijklmnop"]);
// No overlap → mid equals start
assert.equal(result[1].start, 8);
assert.equal(result[1].mid, 8);
});
it("chunk.text matches text.slice(start, end)", () => {
const text = "aaa bbb ccc ddd eee";
const result = splitStringOnChunks({
text,
chunkLen: 8,
overlap: 5,
separators: [" "],
});
for (const chunk of result) {
assert.equal(chunk.text, text.slice(chunk.start, chunk.end));
}
});
});
describe("overlap + start combined", () => {
it("overlap consumes from chunkLen budget, rest goes to body", () => {
//
// text = "SKIP|aa|bb|cc|dd|ee|ff"
// 0123456789012345678901
// ^start=5
//
// chunkLen=9, overlap=6, separators=["|"]
//
// Chunk 1 (first, no overlap):
// search text[5..14] = "aa|bb|cc|"
// last "|" at idx 8 → end=14
// → "aa|bb|cc|" start=5 mid=5 end=14
//
// Chunk 2 (currentPos=14):
// overlap window text[8..14] = "bb|cc|"
// leftmost "|" at idx 2 → chunkStart=8+2+1=11, overlapUsed=3
// remaining=9-3=6, search text[14..20] = "dd|ee|"
// last "|" at idx 5 → end=20
// → "cc|dd|ee|" start=11 mid=14 end=20
// overlap="cc|" (11..14), body="dd|ee|" (14..20)
//
// Chunk 3 (currentPos=20):
// overlap window text[14..20] = "dd|ee|"
// leftmost "|" at idx 2 → chunkStart=17, overlapUsed=3
// remaining=6, rightLimit=17+9=26 >= 22 → last chunk
// → "ee|ff" start=17 mid=20 end=22
// overlap="ee|" (17..20), body="ff" (20..22)
//
const text = "SKIP|aa|bb|cc|dd|ee|ff";
const result = splitStringOnChunks({
text,
chunkLen: 9,
overlap: 6,
start: 5,
separators: ["|"],
});
// First chunk: overlap window text[max(0,5-6)..5] = "SKIP|"
// leftmost "|" at idx 4 → chunkStart=0+4+1=5, overlapUsed=0
// (separator is right at boundary, so no actual overlap text)
// remaining=9, search text[5..14] = "aa|bb|cc|" → end=14
assert.deepEqual(result, [
{ start: 5, mid: 5, end: 14, text: "aa|bb|cc|" },
{ start: 11, mid: 14, end: 20, text: "cc|dd|ee|" },
{ start: 17, mid: 20, end: 22, text: "ee|ff" },
]);
});
});
describe("start offset", () => {
it("begins chunking from the start position", () => {
const result = splitStringOnChunks({
text: "skip this aaa bbb ccc",
chunkLen: 8,
overlap: 0,
start: 10,
separators: [" "],
});
assert.deepEqual(texts(result), ["aaa bbb ", "ccc"]);
assert.equal(result[0].start, 10);
assert.equal(result[0].mid, 10);
});
});
describe("multi-char separators", () => {
it("splits on paragraph boundaries", () => {
const text = "para one\n\npara two\n\npara three";
const result = splitStringOnChunks({
text,
chunkLen: 15,
overlap: 0,
separators: ["\n\n", " "],
});
assert.deepEqual(texts(result), ["para one\n\n", "para two\n\n", "para three"]);
});
});
describe("default separators", () => {
it("uses default separators when none provided", () => {
const text = "Hello world. This is a test.\n\nNew paragraph here.";
const result = splitStringOnChunks({
text,
chunkLen: 30,
overlap: 0,
});
assert.deepEqual(texts(result), ["Hello world. This is a test.\n\n", "New paragraph here."]);
});
});
});
- Time: 0:20
- 0 bugs, one-shot
- The md file is 73 lines, the tests are 133
Planning mode works, and it is faster than typing the code by hand, but it suffers from several problems:
- You cannot reset the context and start over. The spec discussions live outside git. They can be moved there, though.
- The requirements have to be reread after every iteration. Claude can change them at random. If they were in git, there would at least be a diff. But surprise: there can be a diff even though the requirements have not actually changed and were merely reworded. In short, a pain.
- When Claude writes tests for itself, it slacks off terribly. It produces lots of near-identical test cases that don't cover everything, yet you still have to read them
We attack all of these problems:
- We write the specs ourselves, in git, and don't let Claude change them. That makes the specs predictable. We simply compile the specs into a program. If we don't like the result, we extend the specs and reset the context. We also ask Claude to point out any errors or contradictions it ran into during implementation
- We write the examples in the specs ourselves, and they become the unit and PBT tests. We don't let Claude slack off: our cases are few, but they are serious
The result is amazing. It took me 20 minutes to write src/splitStringOnChunks/spec.md, but Claude one-shotted the algorithm from it. I also tried deleting parts of the spec, and it still one-shotted it; it seems you can describe the problem very loosely and it will fill in the gaps. The main thing is that the test cases and the specs are written by a human.
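To make the central rule of spec.md concrete, here is a minimal hand-written sketch of the right-boundary search. It is an illustration, not Claude's output: `findRightBoundary` and its parameters are hypothetical names. It assumes the rule as stated in the spec: walk the separators in priority order and, for the first one that occurs in the new-data region, cut right after its rightmost occurrence; if none occurs, fall back to a hard cut.

```typescript
// Illustrative sketch; the function name and signature are assumptions.
// Returns the exclusive end index of a chunk whose new data starts at `mid`
// and which may not extend past `limit`.
function findRightBoundary(
  text: string,
  mid: number,
  limit: number,
  separators: string[],
): number {
  const maxEnd = Math.min(limit, text.length);
  // The end of the text is treated as the most important separator.
  if (maxEnd === text.length) return maxEnd;

  for (const sep of separators) {
    // Skip separators that cannot fit between mid and maxEnd.
    if (sep.length > maxEnd - mid) continue;
    // Rightmost occurrence that ends at or before maxEnd.
    const idx = text.lastIndexOf(sep, maxEnd - sep.length);
    // It must also start inside the new-data region.
    if (idx >= mid) return idx + sep.length; // the separator stays in the chunk
  }
  return maxEnd; // no usable separator: hard cut
}

// Both spec cases use the same text; the first chunk starts at 5 and mid is 12.
const specText = "some hello. String. that we want";
const endCase1 = findRightBoundary(specText, 12, 5 + 22, [". ", " "]);
const endCase2 = findRightBoundary(specText, 12, 5 + 22, [" ", ". "]);
```

Swapping the separator order is the only difference between the two calls, and it moves the first cut from after "String. " to after "that ", which is exactly the difference between Case 1 and Case 2 in the spec.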
Details
compile src/splitStringOnChunks/spec.md into code
run the code
output any errors or contradictions you found
for testing use vitest
for pbt use fast-check
try to put all invaritans into single domain run
./src/splitStringOnChunks/spec.md
create './splitStringOnChunks.ts' that will export splitStringOnChunks() function
this function splits input text on chunks for embeddings, support overlap and separators
inputs
text - the original text
start - from what position start chunking. this allows for the first chunk having overlap
chunkLen - the soft boundary of max length of a chunk
overlap - the soft boundary of lenfth of left overlap
separators - ordered array of non empty string separators. ordered in desc of importance
outputs an array of Chunk[] each Chunk consists of
start - index in the original text from where the chunk starting
mid - from where new data starts and overlap ends. So overlap is [start, mid) and new data is [mid, end)
end - where chunk ends
text - the full chunk text eg [start, end)
[start, end) should be no more than "chunkLen"
[start, mid) should be no more than "overlap"
the right boundary should be the most important the most right separator, so separator priority is important
if there are no separators then we allow hard cut
the left boundary should be the most left any separator, so we ignore separator priority
if there are no separator then we make start === mid, eg, skipping overlap for this case
beginning and the text and the end of the text are the most important separators
first we try to match left boundary with overlap, then we use the rest of chunkLen budget to match the right boundary
eg, we try to add overlap to all chunks
don't include separators for the left boundary
but include separators for the right boundary
## Testing
### Case 1
text "some hello. String. that we want"
start 12
chunkLen 22
overlap 10
separators [". ", " "]
result chunks
"hello. String. "
"String. that we want"
### Case 2
text "some hello. String. that we want"
start 12
chunkLen 22
overlap 10
separators [" ", ". "]
result chunks
"hello. String. that "
"that we want"
### Property-Based Testing
run pbt on the domain
text: length [0, 10]
start: [0, 11]
chunkLen: [1, 11]
overlap: [0, 11]
separators: [1, 2] separators of strings of length [1, 2]
calling splitStringOnChunks() with the same arguments produces the same resutl
if you join all [mid, end) you will get the original string respecting the string "start" param
empty input text produces no chunks
non emty text produces at least 1 chunk
for each chunk
start <= mid
mid < end
text is [start, end)
if there is no overlap start === midНа что получаем ответ:
Errors/Contradictions Found in the Spec
1. start >= text.length with non-empty text: The PBT domain allows start ∈ [0, 11] and text.length ∈ [0, 10], so start can exceed text.length. The invariant "non-empty text produces at least 1 chunk" conflicts with this — when start >= text.length, there's nothing left to chunk. I resolved this by only asserting "at least 1 chunk" when start < text.length.
2. "soft boundary" vs hard constraints: The spec calls chunkLen and overlap "soft boundaries", but the PBT invariants assert [start, end) ≤ chunkLen and [start, mid) ≤ overlap as hard constraints. These can conflict when overlap >= chunkLen (no room for new data). I handle this by shrinking the overlap to ensure forward progress, which may violate the "soft" overlap boundary while preserving the chunkLen constraint.
3. "if there is no overlap start === mid": The spec doesn't clarify whether "no overlap" means the overlap parameter is 0, or that the chunk happened to get no overlap (e.g., no separator found in the overlap region). Both cases result in start === mid in my implementation, but the invariant's intent is ambiguous. I tested the overlap === 0 case explicitly.
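The invariants from the spec's Property-Based Testing section, together with Claude's resolution of the first contradiction, can also be written as a plain checker that does not depend on any test library. This is a hand-written sketch under assumptions: `checkInvariants` and `ChunkLike` are hypothetical names, "no overlap" is read as `overlap === 0`, and determinism is left out because checking it requires calling the function twice.

```typescript
interface ChunkLike {
  start: number;
  mid: number;
  end: number;
  text: string;
}

// Returns the list of violated invariants; an empty list means all of them hold.
function checkInvariants(
  text: string,
  start: number,
  overlap: number,
  chunks: ChunkLike[],
): string[] {
  const violations: string[] = [];

  if (start >= text.length) {
    // Covers empty text and the case where start lies past the end of the text.
    if (chunks.length !== 0) violations.push("nothing to chunk, but chunks were produced");
    return violations;
  }
  if (chunks.length === 0) violations.push("non-empty text produced no chunks");

  // Joining the new-data parts must reproduce the text from `start` onward.
  const joined = chunks.map((c) => text.slice(c.mid, c.end)).join("");
  if (joined !== text.slice(start)) violations.push("joined [mid, end) differs from the text after start");

  for (const c of chunks) {
    if (c.start > c.mid) violations.push("start > mid");
    if (c.mid >= c.end) violations.push("mid >= end");
    if (c.text !== text.slice(c.start, c.end)) violations.push("text is not [start, end)");
    if (overlap === 0 && c.start !== c.mid) violations.push("overlap is 0 but start !== mid");
  }
  return violations;
}

// The expected chunks of Case 1 from the spec, with their offsets.
const caseText = "some hello. String. that we want";
const case1Chunks: ChunkLike[] = [
  { start: 5, mid: 12, end: 20, text: "hello. String. " },
  { start: 12, mid: 20, end: 32, text: "String. that we want" },
];
const case1Violations = checkInvariants(caseText, 12, 10, case1Chunks);
```

A checker like this could be called from inside a single property over the whole domain, which is one way to read the "all invariants in a single domain run" request in the prompt.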
Details
export interface Chunk {
  start: number
  mid: number
  end: number
  text: string
}
export function splitStringOnChunks({
  text,
  start,
  overlap,
  chunkLen,
  separators,
}: {
  text: string
  start: number
  chunkLen: number
  overlap: number
  separators: string[]
}): Chunk[] {
  if (text.length === 0 || start >= text.length) return []
  const chunks: Chunk[] = []
  let currentMid = start
  while (currentMid < text.length) {
    // === Left boundary (overlap) ===
    const searchStart = Math.max(0, currentMid - overlap)
    let chunkStart = currentMid // default: no overlap
    if (searchStart < currentMid) {
      if (searchStart === 0) {
        // Beginning of text is the most important separator
        chunkStart = 0
      } else {
        // Find leftmost any separator in [searchStart, currentMid)
        let leftmostPos = -1
        let leftmostEnd = -1
        for (const sep of separators) {
          for (let p = searchStart; p + sep.length <= currentMid; p++) {
            if (text.substring(p, p + sep.length) === sep) {
              if (leftmostPos === -1 || p < leftmostPos) {
                leftmostPos = p
                leftmostEnd = p + sep.length
              }
              break // only need leftmost occurrence for this separator
            }
          }
        }
        if (leftmostPos >= 0) {
          chunkStart = leftmostEnd // don't include separator
        }
        // else: no separator found, chunkStart stays at currentMid (skip overlap)
      }
    }
    // Ensure there's room for at least 1 char of new data
    if (chunkStart + chunkLen <= currentMid) {
      chunkStart = currentMid + 1 - chunkLen
      if (chunkStart < 0) chunkStart = 0
    }
    // === Right boundary ===
    const maxEnd = Math.min(chunkStart + chunkLen, text.length)
    let chunkEnd: number
    // End of text is the most important separator
    if (text.length <= chunkStart + chunkLen) {
      chunkEnd = text.length
    } else {
      chunkEnd = maxEnd // default: hard cut
      // Try each separator in priority order
      for (const sep of separators) {
        let found = false
        // Find rightmost occurrence starting at or after currentMid
        for (let p = maxEnd - sep.length; p >= currentMid; p--) {
          if (text.substring(p, p + sep.length) === sep) {
            chunkEnd = p + sep.length
            found = true
            break
          }
        }
        if (found) break
      }
    }
    // Ensure forward progress
    if (chunkEnd <= currentMid) {
      chunkEnd = Math.min(currentMid + 1, text.length)
    }
    chunks.push({
      start: chunkStart,
      mid: currentMid,
      end: chunkEnd,
      text: text.substring(chunkStart, chunkEnd),
    })
    currentMid = chunkEnd
  }
  return chunks
}
import { describe, it, expect } from 'vitest';
import fc from 'fast-check';
import { splitStringOnChunks } from './splitStringOnChunks';
describe('splitStringOnChunks', () => {
  describe('Case 1', () => {
    it('splits with separators [". ", " "]', () => {
      const chunks = splitStringOnChunks({
        text: 'some hello. String. that we want',
        start: 12,
        chunkLen: 22,
        overlap: 10,
        separators: ['. ', ' '],
      });
      expect(chunks.map((c) => c.text)).toEqual([
        'hello. String. ',
        'String. that we want',
      ]);
      // Verify chunk structure
      expect(chunks[0]).toEqual({
        start: 5,
        mid: 12,
        end: 20,
        text: 'hello. String. ',
      });
      expect(chunks[1]).toEqual({
        start: 12,
        mid: 20,
        end: 32,
        text: 'String. that we want',
      });
    });
  });
  describe('Case 2', () => {
    it('splits with separators [" ", ". "]', () => {
      const chunks = splitStringOnChunks({
        text: 'some hello. String. that we want',
        start: 12,
        chunkLen: 22,
        overlap: 10,
        separators: [' ', '. '],
      });
      expect(chunks.map((c) => c.text)).toEqual([
        'hello. String. that ',
        'that we want',
      ]);
      expect(chunks[0]).toEqual({
        start: 5,
        mid: 12,
        end: 25,
        text: 'hello. String. that ',
      });
      expect(chunks[1]).toEqual({
        start: 20,
        mid: 25,
        end: 32,
        text: 'that we want',
      });
    });
  });
  describe('Property-Based Testing', () => {
    const textArb = fc.string({ minLength: 0, maxLength: 10 });
    const startArb = fc.integer({ min: 0, max: 11 });
    const chunkLenArb = fc.integer({ min: 1, max: 11 });
    const overlapArb = fc.integer({ min: 0, max: 11 });
    const separatorArb = fc.array(
      fc.string({ minLength: 1, maxLength: 2 }),
      { minLength: 1, maxLength: 2 },
    );
    const arbs = fc.tuple(textArb, startArb, chunkLenArb, overlapArb, separatorArb);
    it('all invariants hold', () => {
      fc.assert(
        fc.property(arbs, ([text, start, chunkLen, overlap, separators]) => {
          const params = { text, start, chunkLen, overlap, separators };
          const chunks = splitStringOnChunks(params);
          // Deterministic: same args produce same result
          const chunks2 = splitStringOnChunks(params);
          expect(chunks).toEqual(chunks2);
          // Empty text produces no chunks
          if (text.length === 0) {
            expect(chunks).toHaveLength(0);
            return;
          }
          // Non-empty text with valid start produces at least 1 chunk
          if (text.length > 0 && start < text.length) {
            expect(chunks.length).toBeGreaterThanOrEqual(1);
          }
          // Joining [mid, end) reproduces the original text from start
          if (start < text.length) {
            const joined = chunks.map((c) => text.substring(c.mid, c.end)).join('');
            expect(joined).toBe(text.substring(start));
          }
          // Per-chunk invariants
          for (const chunk of chunks) {
            // start <= mid
            expect(chunk.start).toBeLessThanOrEqual(chunk.mid);
            // mid < end
            expect(chunk.mid).toBeLessThan(chunk.end);
            // text is [start, end)
            expect(chunk.text).toBe(text.substring(chunk.start, chunk.end));
            // chunk length constraint: [start, end) <= chunkLen
            expect(chunk.end - chunk.start).toBeLessThanOrEqual(chunkLen);
            // overlap constraint: [start, mid) <= overlap
            expect(chunk.mid - chunk.start).toBeLessThanOrEqual(overlap);
            // if overlap param is 0, start === mid
            if (overlap === 0) {
              expect(chunk.start).toBe(chunk.mid);
            }
          }
        }),
        { numRuns: 10000 },
      );
    });
  });
});
- Time: 0:30
- One-shot, no bugs
- The md files total 292 lines, the tests 643
Obviously, SDD delivers. The next step is to scale it to large projects that don't fit into the context and consist of a pile of evolving features. For that there are frameworks like spec-kit, for example. For a small task it's overkill, but it's still interesting to see how it works.
Speckit one-shotted the algorithm and produced the most readable code, respect. No wonder, though, given how many prompts are stuffed into it. We ran 'Method #3' on raw Claude Code, and of course the code quality was lower.
A fun detail: spec-kit wrote its own tests. There are a hell of a lot of them, of course, and you have to wade through them. But the point is that it wrote them itself, and there were no bugs.
The main problem with speckit is the enormous amount of text you have to read. In our custom SDD ('Method #3') the specs took 70 lines; in speckit it's 300. And those 300 don't include another 200 lines of sliced tasks: I assume those are instructions purely for the LLM and don't need to be read. The second problem is that it's slow and eats tokens. The tasks took 20 minutes to run.
You could say: oh, it's just overkill for such a small piece of functionality; if the functionality were large, the overhead would be smaller relative to the useful output. I don't buy it. I think the overhead will grow in proportion to the size of the task. You'll have to read a lot and wait a long time, which will seriously hurt the quality of the work.
And then there's the length of one iteration: 30 minutes of my work + 30 minutes of agent work = an hour per iteration. I'd like fast iterations, but the framework is heavy.
I'm following the official tutorial.
Fill the constitution with the bare minimum requirements for a nodejs typescript backend
/speckit.specify I need you to build a splitStringOnChunks() function that will split incoming text on chunks with
overlap support for my embeddings. Input should be a single object with:
text - original text
chunkLen - max len of a chunk, it is recomended len, but chunks could be smaller
overlap - max len of an overlap, it is recomended len, but overlap could be smaller
separators - a list of non empty string separators that are sorted in desc of priority
Output is an array of chunks, each of them has
start - index in the original text where the chunk starts
mid - index where the new text data starts and overlap ends. [start, mid) - is the overlap
end - where the chunk ends
text - the chunk text [start, end)
This produced a huge specification file that takes a long time to read. At least the file is in git. While reading it, though, I found places where the LLM slipped up, namely our long-suffering case of overlap for a non-zero start. The specs need changes.
Feature Branch: 001-string-chunk-splitter
Created: 2026-03-26
Status: Draft
Input: User description: "Build a splitStringOnChunks() function that splits text into chunks with overlap support for embeddings"
A developer passes a long text string to splitStringOnChunks() with a desired chunk length and a list of separators. The function splits the text into chunks that respect separator boundaries, each chunk no longer than chunkLen. The first chunk has no overlap (start === mid), and subsequent chunks include overlap from the previous chunk.
Why this priority: This is the core functionality — without basic chunking, no other behavior matters.
Independent Test: Can be tested by passing a multi-paragraph text with chunkLen=100, overlap=20, separators=["\n\n", "\n", ". ", " "] and verifying the returned chunks cover the entire text with correct start, mid, end, and text fields.
Acceptance Scenarios:
- Given a text of 500 characters, chunkLen=100, overlap=20, separators=["\n\n", "\n", " "], When splitStringOnChunks() is called, Then it returns an array of chunks where each chunk's text.length <= chunkLen, overlap regions are <= overlap characters, and every character in the original text is covered.
- Given any output chunks, When inspecting indices, Then chunk.text === text.slice(chunk.start, chunk.end) for every chunk, chunk.mid >= chunk.start, and the first chunk has start === mid === 0.
- Given consecutive chunks, When examining their boundaries, Then chunks[i+1].start < chunks[i].end (overlap exists) and chunks[i+1].mid === chunks[i].end (new content starts where previous chunk ended).
A developer provides an ordered list of separators (e.g., ["\n\n", "\n", ". ", " "]). The function prefers splitting at higher-priority separators first and only falls back to lower-priority ones when the higher-priority ones do not produce a valid split within the chunk length.
Why this priority: Separator-aware splitting is critical for embedding quality — splitting mid-sentence degrades retrieval accuracy.
Independent Test: Can be tested by providing text with multiple separator types and verifying that chunks split at paragraph boundaries before sentence boundaries before word boundaries.
Acceptance Scenarios:
- Given text with paragraph breaks and chunkLen large enough to fit a paragraph, When splitting, Then chunks split at "\n\n" boundaries rather than at "\n" or " ".
- Given a paragraph longer than chunkLen, When no paragraph separator fits, Then the function falls back to the next separator in the list (e.g., "\n", then ". ", then " ").
A developer passes edge-case inputs: empty text, text shorter than chunkLen, text with no matching separators, or very small chunkLen/overlap values. The function handles these gracefully without errors.
Why this priority: Robustness ensures the function is reliable in production pipelines without requiring callers to pre-validate input.
Independent Test: Can be tested by passing empty strings, single-character texts, and texts with no separator matches, verifying correct output or empty arrays.
Acceptance Scenarios:
- Given an empty string as text, When splitStringOnChunks() is called, Then it returns an empty array.
- Given text shorter than chunkLen, When splitting, Then a single chunk is returned with start=0, mid=0, end=text.length, text=originalText.
- Given text with no matching separators and chunkLen smaller than text.length, When splitting, Then the function splits at hard chunkLen boundaries as a last resort.
- What happens when overlap >= chunkLen? The overlap MUST be clamped to be strictly less than chunkLen to guarantee forward progress.
- What happens when a separator is longer than the remaining text? It is skipped.
- What happens when chunkLen is 0 or negative? Return an empty array.
- What happens when separators is an empty array? Fall back to hard splitting at chunkLen boundaries.
- FR-001: The function MUST accept a single object parameter with fields: text (string), chunkLen (number), overlap (number), and separators (string array).
- FR-002: The function MUST return an array of chunk objects, each with start (number), mid (number), end (number), and text (string).
- FR-003: For every chunk, text MUST equal originalText.slice(start, end).
- FR-004: The first chunk MUST have start === mid === 0.
- FR-005: For consecutive chunks [i] and [i+1]: chunks[i+1].mid MUST equal chunks[i].end (no gaps in new content).
- FR-006: Each chunk's length (end - start) MUST NOT exceed chunkLen.
- FR-007: Each chunk's overlap length (mid - start) MUST NOT exceed overlap.
- FR-008: The function MUST prefer splitting at higher-priority separators (earlier in the array) over lower-priority ones.
- FR-009: When no separator produces a valid split within chunkLen, the function MUST hard-split at the chunkLen boundary.
- FR-010: The union of all chunks' new content ranges [mid, end) MUST cover the entire original text with no gaps.
- FR-011: Overlap for chunk [i+1] MUST start from a separator boundary found within the overlap window of chunk [i], preferring higher-priority separators.
- ChunkInput: The input configuration object containing text, chunkLen, overlap, and separators.
- Chunk: The output object representing a single chunk with positional indices (start, mid, end) and the extracted text.
- SC-001: Every character in the original text appears in at least one chunk's new content range [mid, end).
- SC-002: No chunk exceeds the specified chunkLen in total length.
- SC-003: Overlap regions respect the overlap maximum length.
- SC-004: When separators are available within the chunk window, splits occur at separator boundaries at least 90% of the time (hard splits are rare).
- SC-005: The function processes a 1MB text input in under 1 second.
- The function is a pure, synchronous utility — no I/O, no async operations.
- Separators are non-empty strings as stated in the input contract.
- chunkLen and overlap are positive integers; invalid values result in an empty array.
- The function is deterministic: same input always produces the same output.
- Multi-byte characters (UTF-16 surrogate pairs) are not specially handled; indices refer to JavaScript string indices (UTF-16 code units).
/speckit.specify "FR-004" is a wrong spec. For non zero "start" if "overlap" configured there might be overlap for the first chunk
/speckit.specify I want left boundary to ignore separators order. Just try to pick the most left one. If no found - make chunk start = mid. I also do not want to include separators into the overlap. For the right boundary include separators and for them try find the high priority one, if no found - allowed to hard split
/speckit.specify I want my chunks to have overlaps. So first try to match left boundary, only then use unspent chunkLen budget on the right boundary
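The left-boundary rule requested in these corrections can be sketched as follows. This is a minimal sketch under stated assumptions, not the code speckit generated: `resolveLeftBoundary` is a hypothetical helper name, and breaking ties toward the longer separator is the rule the clarified spec settles on later. The example input is the one from the unit tests earlier in the article, and it also shows why the original FR-004 was wrong for a non-zero start.

```typescript
// Hypothetical helper (not from the generated code): given where the new
// content of a chunk starts (`mid`), find where its overlap starts.
function resolveLeftBoundary(
  text: string,
  mid: number,
  overlap: number,
  separators: string[], // assumed non-empty strings; priority is ignored here
): number {
  const windowStart = Math.max(0, mid - overlap)
  let best = -1 // position of the leftmost separator found so far
  let bestLen = 0
  for (const sep of separators) {
    // indexOf returns the first match at or after windowStart
    const pos = text.indexOf(sep, windowStart)
    // The separator must end inside the window, i.e. not run past mid
    if (pos === -1 || pos + sep.length > mid) continue
    if (best === -1 || pos < best || (pos === best && sep.length > bestLen)) {
      best = pos
      bestLen = sep.length
    }
  }
  // No separator in the window: no overlap, so start === mid.
  // Otherwise the overlap begins right after the separator.
  return best === -1 ? mid : best + bestLen
}

const sample = 'some hello. String. that we want'
// start = 12, overlap = 10: the leftmost separator in [2, 12) is the space at 4
console.log(resolveLeftBoundary(sample, 12, 10, ['. ', ' '])) // 5
// start = 0: the window is empty, so the first chunk has start === mid === 0
console.log(resolveLeftBoundary(sample, 0, 10, ['. ', ' '])) // 0
```

With start = 12 the first chunk gets start = 5 and mid = 12, so "the first chunk MUST have start === mid === 0" cannot hold in general.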
At this point I'm happy with the specs. I ask it to ask me some more questions, in case we forgot something.
/speckit.clarify
The next step is to tell it our architecture. Speckit draws a hard line between implementation details and feature specs.
/speckit.plan use TypeScript, nodejs, vitest, fast-check
The command produced a plan file. Now we have several .md files; we read everything that's new.
specs/001-string-chunk-splitter/plan.md
Branch: 001-string-chunk-splitter | Date: 2026-03-26 | Spec: spec.md
Input: Feature specification from /specs/001-string-chunk-splitter/spec.md
Implement a splitStringOnChunks() utility function that splits text into overlapping chunks for embedding pipelines. The function uses asymmetric boundary rules: left boundary (overlap) is resolved first by picking the leftmost separator regardless of priority, then the right boundary uses the remaining chunkLen budget with priority-ordered separator search. Pure synchronous TypeScript function with no dependencies beyond the runtime.
Language/Version: TypeScript 6.x with strict: true (Node.js LTS)
Primary Dependencies: None for production; vitest + fast-check for testing
Storage: N/A
Testing: vitest (unit/property tests), fast-check (property-based testing)
Target Platform: Node.js (LTS)
Project Type: Library (single exported utility function)
Performance Goals: Process 1MB text input in under 1 second
Constraints: Pure synchronous function, no external dependencies, UTF-16 code unit indices
Scale/Scope: Single function with types — minimal surface area
GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.
| Principle | Status | Notes |
|---|---|---|
| I. Strict TypeScript | PASS | strict: true already configured in tsconfig.json. All types will be explicit. No any usage. |
| II. Simplicity First | PASS | Single exported function + 2 type definitions. No abstractions, no classes, no patterns beyond a loop. |
| III. Explicit Error Handling | PASS | Invalid inputs return empty array (documented in spec). No thrown exceptions for expected edge cases. |
| Technology Stack | PASS | TypeScript, tsc build, src/ → dist/, no new production dependencies. |
| Development Workflow | PASS | Code in src/, build via tsc, no console.log. |
Gate result: ALL PASS — no violations, no complexity tracking needed.
specs/001-string-chunk-splitter/
├── plan.md
├── research.md
├── data-model.md
├── quickstart.md
├── contracts/
│ └── splitStringOnChunks.md
└── tasks.md
src/
├── index.ts # Re-exports public API
└── splitStringOnChunks.ts # Function implementation + types
tests/
├── splitStringOnChunks.test.ts # Unit tests (vitest)
└── splitStringOnChunks.prop.ts # Property-based tests (fast-check)
Structure Decision: Single project layout. The feature is one function file plus types. Tests live in a top-level tests/ directory. No models/services/cli subdirectories needed — that would violate Simplicity First.
specs/001-string-chunk-splitter/data-model.md
The input configuration object for splitStringOnChunks().
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | — | The original text to split |
| chunkLen | number | Yes | — | Maximum total length of a chunk (overlap + new content) |
| overlap | number | Yes | — | Maximum length of overlap region |
| separators | string[] | Yes | — | Non-empty strings, sorted by descending priority |
| start | number | No | 0 | Offset where new content begins in the text |
Validation rules:
- chunkLen must be > 0, otherwise return empty array
- overlap must be >= 0; clamped to chunkLen - 1 if >= chunkLen
- separators entries must be non-empty strings
- start must be >= 0 and < text.length (if text is non-empty)
A single chunk in the output array.
| Field | Type | Description |
|---|---|---|
| start | number | Index in original text where the chunk begins (overlap start) |
| mid | number | Index where new content begins (overlap ends). [start, mid) is overlap |
| end | number | Index where the chunk ends (exclusive) |
| text | string | The chunk text: originalText.slice(start, end) |
Invariants:
- 0 <= start <= mid <= end <= text.length
- end - start <= chunkLen
- mid - start <= overlap
- chunk.text === originalText.slice(chunk.start, chunk.end)
- ChunkInput → produces → Chunk[]
- For consecutive chunks [i] and [i+1]: chunks[i+1].mid === chunks[i].end
- First chunk: mid === input.start
- Last chunk: end === text.length (or end of processable text)
N/A — this is a pure stateless function. No lifecycle or state management.
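The invariants and relationships listed in data-model.md are mechanical enough to check in a few lines. The sketch below is an illustration only: `checkChunks` is a hypothetical helper that is not part of the generated project, and it reads "end of processable text" as simply text.length. The sample chunks are the expected output from the unit tests earlier in the article.

```typescript
interface Chunk {
  start: number
  mid: number
  end: number
  text: string
}

// Hypothetical helper: returns the list of violated invariants (empty = all hold).
function checkChunks(
  text: string,
  input: { start: number; chunkLen: number; overlap: number },
  chunks: Chunk[],
): string[] {
  const errors: string[] = []
  let expectedMid = input.start // first chunk: mid === input.start
  chunks.forEach((c, i) => {
    if (!(0 <= c.start && c.start <= c.mid && c.mid <= c.end && c.end <= text.length)) {
      errors.push(`chunk ${i}: indices out of order`)
    }
    if (c.end - c.start > input.chunkLen) errors.push(`chunk ${i}: longer than chunkLen`)
    if (c.mid - c.start > input.overlap) errors.push(`chunk ${i}: overlap too long`)
    if (c.text !== text.slice(c.start, c.end)) errors.push(`chunk ${i}: text mismatch`)
    if (c.mid !== expectedMid) errors.push(`chunk ${i}: gap or rewind in new content`)
    expectedMid = c.end // next chunk: mid === previous end
  })
  // Assumption: the last chunk must reach the very end of the text
  if (chunks.length > 0 && expectedMid !== text.length) {
    errors.push('last chunk does not reach the end of the text')
  }
  return errors
}

const sampleText = 'some hello. String. that we want'
const sampleChunks: Chunk[] = [
  { start: 5, mid: 12, end: 20, text: 'hello. String. ' },
  { start: 12, mid: 20, end: 32, text: 'String. that we want' },
]
console.log(checkChunks(sampleText, { start: 12, chunkLen: 22, overlap: 10 }, sampleChunks)) // []
```

A checker like this is the same idea as the property-based test, just packaged as a function that reports which rule broke.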
specs/001-string-chunk-splitter/research.md
Decision: Use vitest as test runner with fast-check for property-based testing.
Rationale: vitest is fast, TypeScript-native, and has built-in assertion support. fast-check provides property-based testing which is ideal for this function — the invariants (no gaps, chunk length limits, overlap constraints) are natural properties to verify over randomized inputs.
Alternatives considered:
- Jest: Heavier, slower TypeScript support, no advantage for this use case.
- Node test runner: Built-in but less ergonomic for property-based testing integration.
Decision: Single-pass linear scan from start (or 0) to end of text, building one chunk at a time.
Rationale: The function processes text left-to-right. For each chunk:
- Resolve left boundary by scanning the overlap window [mid - overlap, mid) for the leftmost separator occurrence.
- Compute remaining budget: chunkLen - (mid - start).
- Resolve right boundary by searching [mid, mid + remainingBudget) for rightmost occurrence of highest-priority separator.
- Emit chunk, advance mid to end of the new content.
This is O(n * s * k) worst case where n = text length, s = number of separators, k = max separator length. For typical use (short separators, reasonable chunk sizes), this is effectively O(n).
Alternatives considered:
- Pre-indexing all separator positions: Higher memory, more complex code, marginal benefit for typical inputs.
- Regex-based splitting: Less control over boundary selection strategy, harder to implement asymmetric rules.
Decision: For right boundary, iterate separators by priority and use lastIndexOf semantics within the window. For left boundary, scan all separators in the overlap window and pick the one at the earliest position (longest match on tie).
Rationale: lastIndexOf within a substring gives the rightmost occurrence naturally. For the left boundary, a single scan collecting all separator positions is straightforward.
Alternatives considered:
- Using a single combined regex: Would lose per-separator priority control and complicate the longest-match-on-tie rule.
Decision: Zero production dependencies. Only dev dependencies: vitest, fast-check, existing typescript and @types/node.
Rationale: Constitution mandates minimal dependencies. This is a pure string manipulation function — no external libraries needed.
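The right-boundary decision from research.md (iterate separators by priority, take the rightmost match inside the remaining budget, hard-split otherwise) might look roughly like this. It is a sketch, not the generated implementation: `resolveRightBoundary` is a hypothetical name, and treating "the rest of the text fits into the budget" as an immediate end-of-text boundary is my assumption.

```typescript
// Hypothetical helper: returns the exclusive end index of a chunk.
function resolveRightBoundary(
  text: string,
  mid: number,
  remainingBudget: number, // chunkLen - (mid - start), assumed >= 1
  separators: string[], // highest priority first
): number {
  const maxEnd = mid + remainingBudget
  // Assumption: if the rest of the text fits, the end of the text is the boundary
  if (maxEnd >= text.length) return text.length
  for (const sep of separators) {
    // A separator longer than the budget can never fit into the window
    if (sep.length > remainingBudget) continue
    // Rightmost match that still ends at or before maxEnd
    const pos = text.lastIndexOf(sep, maxEnd - sep.length)
    // It must start at or after mid, otherwise it belongs to the overlap
    if (pos >= mid) return pos + sep.length // the separator stays in the chunk
  }
  return maxEnd // no separator fits: hard split
}

const sentence = 'some hello. String. that we want'
// First chunk of the earlier test cases: start = 5, mid = 12, chunkLen = 22,
// so the budget is 22 - (12 - 5) = 15 and the window is [12, 27)
console.log(resolveRightBoundary(sentence, 12, 15, ['. ', ' '])) // 20
console.log(resolveRightBoundary(sentence, 12, 15, [' ', '. '])) // 25
```

Swapping the separator order moves the boundary from 20 to 25, which is the difference between Case 1 and Case 2 in the unit tests.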
specs/001-string-chunk-splitter/spec.md
Feature Branch: 001-string-chunk-splitter
Created: 2026-03-26
Status: Draft
Input: User description: "Build a splitStringOnChunks() function that splits text into chunks with overlap support for embeddings"
A developer passes a long text string to splitStringOnChunks() with a desired chunk length and a list of separators. The function splits the text into chunks that respect separator boundaries, each chunk no longer than chunkLen. Subsequent chunks include overlap from the previous chunk. When a non-zero start offset is provided with overlap > 0, even the first chunk can include overlap text from before the starting position.
Why this priority: This is the core functionality — without basic chunking, no other behavior matters.
Independent Test: Can be tested by passing a multi-paragraph text with chunkLen=100, overlap=20, separators=["\n\n", "\n", ". ", " "] and verifying the returned chunks cover the entire text with correct start, mid, end, and text fields.
Acceptance Scenarios:
- Given a text of 500 characters, chunkLen=100, overlap=20, separators=["\n\n", "\n", " "], When splitStringOnChunks() is called, Then it returns an array of chunks where each chunk's text.length <= chunkLen, overlap regions are <= overlap characters, and every character in the original text is covered.
- Given any output chunks, When inspecting indices, Then chunk.text === text.slice(chunk.start, chunk.end) for every chunk, and chunk.mid >= chunk.start. When start=0, the first chunk has start === mid === 0. When start > 0 and overlap > 0, the first chunk has mid === start and start < mid (overlap from text before the starting offset).
- Given consecutive chunks, When examining their boundaries, Then chunks[i+1].mid === chunks[i].end (new content starts where previous chunk ended). If a separator was found in the overlap window, chunks[i+1].start < chunks[i+1].mid (overlap exists). If no separator was found, chunks[i+1].start === chunks[i+1].mid (no overlap).
A developer provides an ordered list of separators. The right boundary (chunk end) respects separator priority — preferring higher-priority separators and falling back to lower-priority ones, with hard split as last resort. The left boundary (overlap start) ignores priority and picks the leftmost separator in the overlap window to maximize overlap. Separators are excluded from overlap text but included in chunk text at the right boundary.
Why this priority: Asymmetric boundary handling maximizes embedding context (wide overlaps) while preserving semantic coherence at chunk boundaries (priority-aware right splits).
Independent Test: Can be tested by providing text with multiple separator types and verifying that right boundaries prefer high-priority separators, left boundaries pick the leftmost separator regardless of type, and separator text is not part of the overlap.
Acceptance Scenarios:
- Given text with paragraph breaks and chunkLen large enough to fit a paragraph, When splitting at the right boundary, Then chunks split at "\n\n" boundaries rather than at "\n" or " ".
- Given a paragraph longer than chunkLen, When no paragraph separator fits at the right boundary, Then the function falls back to the next separator in the list (e.g., "\n", then ". ", then " ").
- Given an overlap window containing both "\n" and " " separators, When determining the left boundary, Then the function picks whichever separator occurrence is leftmost (earliest position), regardless of priority order.
- Given an overlap window with no separator occurrences, When determining the left boundary, Then the chunk has start === mid (no overlap for this chunk).
- Given a separator found at the left boundary, When the overlap is constructed, Then the overlap text starts immediately after the separator; the separator itself is NOT included in [start, mid).
- Given chunkLen=100 and an overlap of 30 characters is resolved for a chunk, When determining the right boundary, Then the right boundary search window is 70 characters (100 - 30), not 100. The overlap budget is spent first, and the right boundary gets the remainder.
A developer passes edge-case inputs: empty text, text shorter than chunkLen, text with no matching separators, or very small chunkLen/overlap values. The function handles these gracefully without errors.
Why this priority: Robustness ensures the function is reliable in production pipelines without requiring callers to pre-validate input.
Independent Test: Can be tested by passing empty strings, single-character texts, and texts with no separator matches, verifying correct output or empty arrays.
Acceptance Scenarios:
- Given an empty string as text, When splitStringOnChunks() is called, Then it returns an empty array.
- Given text shorter than chunkLen, When splitting, Then a single chunk is returned with start=0, mid=0, end=text.length, text=originalText.
- Given text with no matching separators and chunkLen smaller than text.length, When splitting, Then the function splits at hard chunkLen boundaries as a last resort.
- What happens when overlap >= chunkLen? The overlap MUST be clamped to be strictly less than chunkLen to guarantee forward progress.
- What happens when a separator is longer than the remaining text? It is skipped.
- What happens when chunkLen is 0 or negative? Return an empty array.
- What happens when separators is an empty array? Fall back to hard splitting at chunkLen boundaries.
- FR-001: The function MUST accept a single object parameter with fields: text (string), chunkLen (number), overlap (number), separators (string array), and optionally start (number, defaults to 0) indicating the offset in the text where new content begins.
- FR-002: The function MUST return an array of chunk objects, each with start (number), mid (number), end (number), and text (string).
- FR-003: For every chunk, text MUST equal originalText.slice(start, end).
- FR-004: The first chunk's mid MUST equal the input start offset. When start === 0, the first chunk has start === mid === 0 (no overlap). When start > 0 and overlap > 0, the first chunk's start MAY be less than mid, providing overlap context from text before the starting offset. The overlap length (mid - chunk.start) MUST NOT exceed overlap.
- FR-005: For consecutive chunks [i] and [i+1]: chunks[i+1].mid MUST equal chunks[i].end (no gaps in new content).
- FR-006: Each chunk's length (end - start) MUST NOT exceed chunkLen.
- FR-007: Each chunk's overlap length (mid - start) MUST NOT exceed overlap.
- FR-008: Boundary resolution order: The left boundary (overlap) MUST be resolved first. The right boundary MUST then use only the unspent chunkLen budget: remainingBudget = chunkLen - (mid - start). This prioritizes overlap presence over maximizing new content per chunk.
- FR-012: Right boundary (chunk end): The function MUST iterate separators from highest to lowest priority. For each separator, find the rightmost occurrence within the remaining budget window [mid, mid + remainingBudget). If found, split there (separator text included in the chunk, before the split point). If not found, try the next separator. When no separator produces a valid split within the remaining budget, the function MUST hard-split at mid + remainingBudget.
- FR-009: The union of all chunks' new content ranges [mid, end) MUST cover the entire original text with no gaps.
- FR-010: Left boundary (overlap start): When determining where overlap begins for a chunk, the function MUST ignore separator priority order and instead pick the leftmost separator occurrence found within the overlap window. When multiple separators start at the same position, the longest matching separator MUST be chosen (overlap begins after the full separator, avoiding partial separator artifacts). If no separator is found in the overlap window, the chunk MUST have start === mid (no overlap).
- FR-011: Separator text MUST NOT be included in the overlap. The overlap [start, mid) MUST begin immediately after the separator occurrence used to determine the left boundary.
- ChunkInput: The input configuration object containing text, chunkLen, overlap, separators, and optionally start (offset where new content begins, defaults to 0).
- Chunk: The output object representing a single chunk with positional indices (start, mid, end) and the extracted text.
- SC-001: Every character in the original text appears in at least one chunk's new content range [mid, end).
- SC-002: No chunk exceeds the specified chunkLen in total length.
- SC-003: Overlap regions respect the overlap maximum length.
- SC-004: When separators are available within the chunk window, splits occur at separator boundaries at least 90% of the time (hard splits are rare).
- SC-005: The function processes a 1MB text input in under 1 second.
- Q: Right boundary separator search strategy — rightmost or leftmost occurrence per priority level? → A: Rightmost occurrence per priority level (maximizes chunk size while respecting separator priority).
- Q: Left boundary — when multiple separators start at the same position, pick longest or shortest? → A: Longest match (avoids partial separator artifacts in overlap text).
- The function is a pure, synchronous utility — no I/O, no async operations.
- Separators are non-empty strings as stated in the input contract.
- chunkLen and overlap are positive integers; invalid values result in an empty array.
- The function is deterministic: same input always produces the same output.
- Multi-byte characters (UTF-16 surrogate pairs) are not specially handled; indices refer to JavaScript string indices (UTF-16 code units).
Now we slice all the specs into tasks.
/speckit.tasks break this down
specs/001-string-chunk-splitter/tasks.md
Input: Design documents from /specs/001-string-chunk-splitter/
Prerequisites: plan.md (required), spec.md (required for user stories), research.md, data-model.md, contracts/
Tests: Included — vitest for unit tests, fast-check for property-based tests (per plan.md tech stack).
Organization: Tasks are grouped by user story to enable independent implementation and testing of each story.
- [P]: Can run in parallel (different files, no dependencies)
- [Story]: Which user story this task belongs to (e.g., US1, US2, US3)
- Include exact file paths in descriptions
- Single project: src/, tests/ at repository root
Purpose: Project initialization, dev dependencies, and type definitions
- T001 Install vitest and fast-check as dev dependencies via npm install -D vitest fast-check
- T002 Add vitest configuration to package.json: add "test": "vitest run" and "test:watch": "vitest" scripts
- T003 Define ChunkInput and Chunk interfaces and export splitStringOnChunks stub (returns []) in src/splitStringOnChunks.ts per contracts/splitStringOnChunks.md
- T004 Re-export public API from src/index.ts: export splitStringOnChunks, ChunkInput, and Chunk from ./splitStringOnChunks.js
- T005 Create empty test files: tests/splitStringOnChunks.test.ts (unit) and tests/splitStringOnChunks.prop.ts (property-based)
Checkpoint: npm run build succeeds. npm test runs (tests may be empty/skipped). Types are importable.
Purpose: Input validation logic that all user stories depend on
- T006 Implement input validation in src/splitStringOnChunks.ts: return [] for empty text, chunkLen <= 0, or start >= text.length. Clamp overlap to chunkLen - 1 if overlap >= chunkLen. Default start to 0 if not provided.
Checkpoint: Foundation ready — stub function validates inputs and returns [] for invalid cases. User story implementation can now begin.
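For reference, the validation step in T006 is small enough to sketch in full. This is an illustration under assumptions, not the generated code: `normalizeInput` is a hypothetical helper that returns null where the real function would return [], and it implements only the rules T006 names (negative overlap or start values are not handled).

```typescript
interface ChunkInput {
  text: string
  chunkLen: number
  overlap: number
  separators: string[]
  start?: number
}

// Hypothetical helper: null means "the function should return []".
function normalizeInput(input: ChunkInput): Required<ChunkInput> | null {
  const start = input.start ?? 0 // default start to 0
  if (input.text.length === 0 || input.chunkLen <= 0 || start >= input.text.length) {
    return null
  }
  // Clamp overlap so every chunk has room for at least one new character
  const overlap = Math.min(input.overlap, input.chunkLen - 1)
  return { ...input, start, overlap }
}

console.log(normalizeInput({ text: 'abcdef', chunkLen: 5, overlap: 10, separators: [] }))
// { text: 'abcdef', chunkLen: 5, overlap: 4, separators: [], start: 0 }
```

Clamping overlap to chunkLen - 1 is what guarantees forward progress later in the loop.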
Goal: Core chunking loop that splits text into chunks with overlap, producing correct start, mid, end, text fields. Uses hard-split only (no separator logic yet).
Independent Test: Pass a multi-line text with chunkLen=100, overlap=20, separators=[] and verify chunks cover the entire text with correct indices and no gaps.
NOTE: Write these tests FIRST, ensure they FAIL before implementation
- T007 [P] [US1] Write unit tests in tests/splitStringOnChunks.test.ts for basic chunking: text coverage (FR-009), index correctness (FR-003), first chunk mid === start (FR-004), continuity chunks[i+1].mid === chunks[i].end (FR-005), chunk size limit (FR-006), overlap size limit (FR-007). Use separators: [] (hard-split only).
- T008 [P] [US1] Write unit tests in tests/splitStringOnChunks.test.ts for start offset: when start > 0 with overlap > 0, verify first chunk has mid === start and overlap from text before start. When start === 0, verify start === mid === 0.
- T009 [US1] Implement core chunking loop in src/splitStringOnChunks.ts: iterate from start to end of text, for each chunk resolve left boundary (hard: mid - overlap clamped to 0), set end = start + chunkLen clamped to text.length, emit { start, mid, end, text: text.slice(start, end) }, advance mid = end. No separator logic yet, hard splits only.
- T010 [US1] Verify all T007 and T008 tests pass via npm test
Checkpoint: Function splits text into chunks with correct indices and overlap via hard boundaries. All US1 tests pass.
Goal: Add separator-aware splitting with asymmetric rules: left boundary picks leftmost separator (priority-agnostic, separator excluded from overlap), right boundary picks rightmost occurrence of highest-priority separator within remaining budget.
Independent Test: Pass text with mixed separators (["\n\n", "\n", ". ", " "]) and verify right boundaries prefer high-priority separators, left boundaries pick leftmost separator regardless of type, separator text excluded from overlap, and budget is shared (left first, right gets remainder).
NOTE: Write these tests FIRST, ensure they FAIL before implementation
- T011 [P] [US2] Write unit tests in tests/splitStringOnChunks.test.ts for right boundary: highest-priority separator preferred (FR-012), rightmost occurrence within budget, fallback to lower-priority separators, hard-split when no separator found.
- T012 [P] [US2] Write unit tests in tests/splitStringOnChunks.test.ts for left boundary: leftmost separator picked regardless of priority (FR-010), longest match on tie, separator text excluded from overlap (FR-011), start === mid when no separator in overlap window.
- T013 [P] [US2] Write unit tests in tests/splitStringOnChunks.test.ts for budget sharing: left boundary resolved first, right boundary uses remainingBudget = chunkLen - (mid - start) (FR-008). Test with chunkLen=100, overlap of 30 chars → right window is 70 chars.
- T014 [US2] Implement left boundary resolver in src/splitStringOnChunks.ts: scan overlap window [mid - overlap, mid) for all separator occurrences, pick leftmost position, on tie pick longest separator, set start to position after separator end. If no separator found, start = mid.
- T015 [US2] Implement right boundary resolver in src/splitStringOnChunks.ts: compute remainingBudget = chunkLen - (mid - start), iterate separators by priority, for each find rightmost occurrence in [mid, mid + remainingBudget) via lastIndexOf semantics, split at first separator found. Hard-split at mid + remainingBudget if none found.
- T016 [US2] Integrate boundary resolvers into the core loop: replace hard-split logic with left-then-right resolution per FR-008. Ensure end is set after the separator text (included in chunk).
- T017 [US2] Verify all T011, T012, T013 tests pass via npm test
Checkpoint: Function uses asymmetric boundary rules. All US1 and US2 tests pass.
Goal: Handle all edge cases gracefully: empty text, short text, no matching separators, overlap >= chunkLen, chunkLen <= 0, empty separators array, separators longer than remaining text.
Independent Test: Pass empty strings, single-character texts, texts with no matching separators, and extreme parameter values. Verify correct output or empty arrays without errors.
NOTE: Write these tests FIRST, ensure they FAIL before implementation
- T018 [P] [US3] Write unit tests in tests/splitStringOnChunks.test.ts for edge cases: empty text → [], chunkLen <= 0 → [], overlap >= chunkLen → clamped, text shorter than chunkLen → single chunk, empty separators array → hard split, separator longer than remaining text → skipped.
- T019 [US3] Review and harden edge case handling in src/splitStringOnChunks.ts: ensure separator search skips separators longer than the search window, empty separators array falls back to hard split, and all validation from T006 is complete. Fix any failing edge case tests.
- T020 [US3] Verify all T018 tests pass via npm test
Checkpoint: All edge cases handled. All US1, US2, and US3 tests pass.
Purpose: Property-based tests, performance validation, build verification
- T021 [P] Write property-based tests in tests/splitStringOnChunks.prop.ts using fast-check: generate random text, chunkLen, overlap, separators. Assert invariants: coverage (FR-009), continuity (FR-005), size limit (FR-006), overlap limit (FR-007), text consistency (FR-003), forward progress (ordering).
- T022 [P] Write performance test in tests/splitStringOnChunks.test.ts: generate 1MB text string, verify splitStringOnChunks completes in under 1 second (SC-005).
- T023 Verify npm run build produces dist/splitStringOnChunks.js and dist/index.js with zero errors
- T024 Run quickstart.md validation: execute the usage example from specs/001-string-chunk-splitter/quickstart.md and verify output is correct
- Setup (Phase 1): No dependencies — can start immediately
- Foundational (Phase 2): Depends on Setup completion — BLOCKS all user stories
- User Story 1 (Phase 3): Depends on Foundational. Core loop — no separator logic.
- User Story 2 (Phase 4): Depends on US1 completion (refines the same function with separator logic)
- User Story 3 (Phase 5): Depends on US2 completion (hardens existing implementation)
- Polish (Phase 6): Depends on all user stories being complete
- Tests MUST be written and FAIL before implementation
- Implementation tasks are sequential (they modify the same file)
- Story complete before moving to next priority
- T007, T008 can run in parallel (different test sections, same file but independent describe blocks)
- T011, T012, T013 can run in parallel (different test sections)
- T021, T022 can run in parallel (different test files/sections)
- T003, T005 can run in parallel (different files)
# Launch all test tasks for User Story 2 together:
Task: "T011 - Right boundary tests in tests/splitStringOnChunks.test.ts"
Task: "T012 - Left boundary tests in tests/splitStringOnChunks.test.ts"
Task: "T013 - Budget sharing tests in tests/splitStringOnChunks.test.ts"
# Then implement sequentially (same file):
Task: "T014 - Left boundary resolver"
Task: "T015 - Right boundary resolver"
Task: "T016 - Integrate into core loop"
- Complete Phase 1: Setup
- Complete Phase 2: Foundational (input validation)
- Complete Phase 3: User Story 1 (hard-split core loop)
- STOP and VALIDATE: All US1 tests pass, function works with separators: []
- Demo: function correctly chunks text with overlap via hard boundaries
- Complete Setup + Foundational → Foundation ready
- Add User Story 1 → Test → MVP: basic chunking works
- Add User Story 2 → Test → Separator-aware splitting works
- Add User Story 3 → Test → All edge cases handled
- Polish → Property tests + performance validated
- [P] tasks = different files or independent test sections, no dependencies
- [Story] label maps task to specific user story for traceability
- All user stories modify the same source file (src/splitStringOnChunks.ts), so stories are sequential (US1 → US2 → US3)
- Tests for each story can be written in parallel within the story
- Commit after each completed phase
- Stop at any checkpoint to validate independently
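The budget-sharing rule from T013 and T015 is easy to misread, so here is a small arithmetic sketch of it. The helper name rightWindow is hypothetical and is not part of the generated module; it assumes only the formula remainingBudget = chunkLen - (mid - start) quoted in the tasks above.

```typescript
// Hypothetical helper (not from the generated code): reproduces only the
// budget arithmetic described in T013/T015.
function rightWindow(chunkLen: number, start: number, mid: number): [number, number] {
  const overlapUsed = mid - start; // characters already spent on the left overlap
  const remainingBudget = chunkLen - overlapUsed; // what is left for new content
  // Half-open search window for the right boundary: [mid, mid + remainingBudget)
  return [mid, mid + remainingBudget];
}

// chunkLen = 100 and an overlap that resolved to 30 chars (start = 170, mid = 200)
const [winStart, winEnd] = rightWindow(100, 170, 200);
console.log(winStart, winEnd, winEnd - winStart); // 200 270 70
```

The left boundary is resolved first, so a longer overlap directly shrinks the window in which the right boundary may look for a separator.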
And only now do we get to the coding
/speckit.implement
We wait 20 minutes
export interface ChunkInput {
  text: string;
  chunkLen: number;
  overlap: number;
  separators: string[];
  start?: number;
}

export interface Chunk {
  start: number;
  mid: number;
  end: number;
  text: string;
}

export function splitStringOnChunks(input: ChunkInput): Chunk[] {
  const { text, chunkLen, separators } = input;
  const start = input.start ?? 0;
  if (text.length === 0 || chunkLen <= 0 || start >= text.length) {
    return [];
  }
  const overlap = Math.min(input.overlap, chunkLen - 1);
  const chunks: Chunk[] = [];
  let mid = start;
  while (mid < text.length) {
    // Resolve left boundary (overlap start)
    const overlapStart = Math.max(0, mid - overlap);
    const chunkStart = resolveLeftBoundary(text, overlapStart, mid, separators);
    // Resolve right boundary using remaining budget
    const remainingBudget = chunkLen - (mid - chunkStart);
    const chunkEnd = resolveRightBoundary(text, mid, remainingBudget, separators);
    chunks.push({
      start: chunkStart,
      mid,
      end: chunkEnd,
      text: text.slice(chunkStart, chunkEnd),
    });
    mid = chunkEnd;
  }
  return chunks;
}

function resolveLeftBoundary(
  text: string,
  windowStart: number,
  mid: number,
  separators: string[],
): number {
  if (separators.length === 0 || windowStart >= mid) {
    return mid;
  }
  // Find the leftmost separator occurrence in [windowStart, mid)
  // On tie (same position), pick the longest separator
  let bestPos = -1;
  let bestLen = 0;
  for (const sep of separators) {
    if (sep.length === 0) continue;
    // Scan for all occurrences of this separator in the window
    let searchFrom = windowStart;
    while (searchFrom < mid) {
      const idx = text.indexOf(sep, searchFrom);
      if (idx === -1 || idx >= mid) break;
      if (bestPos === -1 || idx < bestPos || (idx === bestPos && sep.length > bestLen)) {
        bestPos = idx;
        bestLen = sep.length;
      }
      // Only need the first (leftmost) occurrence of this separator
      break;
    }
  }
  if (bestPos === -1) {
    // No separator found: no overlap
    return mid;
  }
  // Overlap starts after the separator (separator excluded from overlap)
  return Math.min(bestPos + bestLen, mid);
}

function resolveRightBoundary(
  text: string,
  mid: number,
  budget: number,
  separators: string[],
): number {
  const windowEnd = Math.min(mid + budget, text.length);
  if (windowEnd >= text.length) {
    return text.length;
  }
  if (separators.length === 0) {
    return windowEnd;
  }
  // Iterate separators by priority (highest first)
  // For each, find the rightmost occurrence in [mid, windowEnd)
  for (const sep of separators) {
    if (sep.length === 0) continue;
    // Find rightmost occurrence: search backwards from windowEnd
    const searchWindow = text.slice(mid, windowEnd);
    const lastIdx = searchWindow.lastIndexOf(sep);
    if (lastIdx !== -1) {
      // Split after the separator (separator included in chunk)
      return mid + lastIdx + sep.length;
    }
  }
  // No separator found: hard split
  return windowEnd;
}
import { describe, it, expect } from "vitest";
import * as fc from "fast-check";
import { splitStringOnChunks } from "../src/splitStringOnChunks.js";
describe("splitStringOnChunks - property-based tests", () => {
const separatorsArb = fc.array(
fc.constantFrom(" ", "\n", "\n\n", ". ", ", ", "; ", " "),
{ minLength: 0, maxLength: 4 },
);
it("coverage: union of [mid, end) covers entire text from start", () => {
fc.assert(
fc.property(
fc.string({ minLength: 1, maxLength: 500 }),
fc.integer({ min: 5, max: 200 }),
fc.integer({ min: 0, max: 100 }),
separatorsArb,
(text, chunkLen, overlap, separators) => {
const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
if (chunks.length === 0) return;
expect(chunks[0]!.mid).toBe(0);
expect(chunks[chunks.length - 1]!.end).toBe(text.length);
},
),
{ numRuns: 200 },
);
});
it("continuity: chunks[i+1].mid === chunks[i].end", () => {
fc.assert(
fc.property(
fc.string({ minLength: 1, maxLength: 500 }),
fc.integer({ min: 5, max: 200 }),
fc.integer({ min: 0, max: 100 }),
separatorsArb,
(text, chunkLen, overlap, separators) => {
const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
for (let i = 0; i < chunks.length - 1; i++) {
expect(chunks[i + 1]!.mid).toBe(chunks[i]!.end);
}
},
),
{ numRuns: 200 },
);
});
it("size limit: chunk.end - chunk.start <= chunkLen", () => {
fc.assert(
fc.property(
fc.string({ minLength: 1, maxLength: 500 }),
fc.integer({ min: 5, max: 200 }),
fc.integer({ min: 0, max: 100 }),
separatorsArb,
(text, chunkLen, overlap, separators) => {
const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
for (const chunk of chunks) {
expect(chunk.end - chunk.start).toBeLessThanOrEqual(chunkLen);
}
},
),
{ numRuns: 200 },
);
});
it("overlap limit: chunk.mid - chunk.start <= clamped overlap", () => {
fc.assert(
fc.property(
fc.string({ minLength: 1, maxLength: 500 }),
fc.integer({ min: 5, max: 200 }),
fc.integer({ min: 0, max: 100 }),
separatorsArb,
(text, chunkLen, overlap, separators) => {
const clampedOverlap = Math.min(overlap, chunkLen - 1);
const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
for (const chunk of chunks) {
expect(chunk.mid - chunk.start).toBeLessThanOrEqual(clampedOverlap);
}
},
),
{ numRuns: 200 },
);
});
it("text consistency: chunk.text === text.slice(chunk.start, chunk.end)", () => {
fc.assert(
fc.property(
fc.string({ minLength: 1, maxLength: 500 }),
fc.integer({ min: 5, max: 200 }),
fc.integer({ min: 0, max: 100 }),
separatorsArb,
(text, chunkLen, overlap, separators) => {
const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
for (const chunk of chunks) {
expect(chunk.text).toBe(text.slice(chunk.start, chunk.end));
}
},
),
{ numRuns: 200 },
);
});
it("forward progress: chunks[i].mid < chunks[i+1].mid", () => {
fc.assert(
fc.property(
fc.string({ minLength: 1, maxLength: 500 }),
fc.integer({ min: 5, max: 200 }),
fc.integer({ min: 0, max: 100 }),
separatorsArb,
(text, chunkLen, overlap, separators) => {
const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
for (let i = 0; i < chunks.length - 1; i++) {
expect(chunks[i]!.mid).toBeLessThan(chunks[i + 1]!.mid);
}
},
),
{ numRuns: 200 },
);
});
});
import { describe, it, expect } from "vitest";
import { splitStringOnChunks, type Chunk } from "../src/splitStringOnChunks.js";
// Helper to verify invariants on any chunk result
function assertChunkInvariants(
chunks: Chunk[],
text: string,
chunkLen: number,
overlap: number,
inputStart: number,
): void {
if (chunks.length === 0) return;
// FR-004: First chunk mid === input start
expect(chunks[0]!.mid).toBe(inputStart);
for (let i = 0; i < chunks.length; i++) {
const chunk = chunks[i]!;
// FR-003: text consistency
expect(chunk.text).toBe(text.slice(chunk.start, chunk.end));
// FR-006: chunk size limit
expect(chunk.end - chunk.start).toBeLessThanOrEqual(chunkLen);
// FR-007: overlap size limit
expect(chunk.mid - chunk.start).toBeLessThanOrEqual(overlap);
// Basic ordering
expect(chunk.start).toBeLessThanOrEqual(chunk.mid);
expect(chunk.mid).toBeLessThanOrEqual(chunk.end);
}
// FR-005: continuity
for (let i = 0; i < chunks.length - 1; i++) {
expect(chunks[i + 1]!.mid).toBe(chunks[i]!.end);
}
// FR-009: coverage — union of [mid, end) covers [inputStart, text.length)
expect(chunks[0]!.mid).toBe(inputStart);
expect(chunks[chunks.length - 1]!.end).toBe(text.length);
}
describe("splitStringOnChunks", () => {
// =====================
// US1: Basic Text Chunking
// =====================
describe("US1: basic chunking (hard-split, no separators)", () => {
it("splits text into chunks covering the entire input", () => {
const text = "a".repeat(500);
const chunks = splitStringOnChunks({
text,
chunkLen: 100,
overlap: 20,
separators: [],
});
assertChunkInvariants(chunks, text, 100, 20, 0);
expect(chunks.length).toBeGreaterThan(1);
});
it("returns single chunk when text is shorter than chunkLen", () => {
const text = "hello world";
const chunks = splitStringOnChunks({
text,
chunkLen: 100,
overlap: 10,
separators: [],
});
expect(chunks).toHaveLength(1);
expect(chunks[0]).toEqual({
start: 0,
mid: 0,
end: text.length,
text,
});
});
it("first chunk has start === mid === 0 when start is 0", () => {
const text = "a".repeat(200);
const chunks = splitStringOnChunks({
text,
chunkLen: 50,
overlap: 10,
separators: [],
});
expect(chunks[0]!.start).toBe(0);
expect(chunks[0]!.mid).toBe(0);
});
it("maintains continuity: chunks[i+1].mid === chunks[i].end", () => {
const text = "a".repeat(300);
const chunks = splitStringOnChunks({
text,
chunkLen: 80,
overlap: 15,
separators: [],
});
for (let i = 0; i < chunks.length - 1; i++) {
expect(chunks[i + 1]!.mid).toBe(chunks[i]!.end);
}
});
it("no chunk exceeds chunkLen", () => {
const text = "a".repeat(500);
const chunks = splitStringOnChunks({
text,
chunkLen: 60,
overlap: 10,
separators: [],
});
for (const chunk of chunks) {
expect(chunk.end - chunk.start).toBeLessThanOrEqual(60);
}
});
it("overlap does not exceed overlap param", () => {
const text = "a".repeat(500);
const chunks = splitStringOnChunks({
text,
chunkLen: 60,
overlap: 10,
separators: [],
});
for (const chunk of chunks) {
expect(chunk.mid - chunk.start).toBeLessThanOrEqual(10);
}
});
});
describe("US1: start offset", () => {
it("first chunk has mid === start when start > 0", () => {
const text = "a".repeat(200);
const chunks = splitStringOnChunks({
text,
chunkLen: 50,
overlap: 10,
separators: [],
start: 50,
});
expect(chunks[0]!.mid).toBe(50);
});
it("first chunk has overlap when start > 0 and overlap > 0 with separators", () => {
// Use text with separators so overlap can be resolved
const text = "aaa bbb ccc ddd eee fff ggg hhh iii jjj kkk lll mmm nnn";
const chunks = splitStringOnChunks({
text,
chunkLen: 20,
overlap: 10,
separators: [" "],
start: 20,
});
expect(chunks[0]!.mid).toBe(20);
expect(chunks[0]!.start).toBeLessThan(chunks[0]!.mid);
});
it("first chunk has no overlap when start > 0 but no separators in window", () => {
const text = "a".repeat(200);
const chunks = splitStringOnChunks({
text,
chunkLen: 50,
overlap: 10,
separators: [],
start: 50,
});
// No separators → no overlap per FR-010
expect(chunks[0]!.start).toBe(chunks[0]!.mid);
});
it("covers text from start to end", () => {
const text = "a".repeat(200);
const startOffset = 50;
const chunks = splitStringOnChunks({
text,
chunkLen: 50,
overlap: 10,
separators: [],
start: startOffset,
});
assertChunkInvariants(chunks, text, 50, 10, startOffset);
});
});
// =====================
// US2: Asymmetric Boundary Rules
// =====================
describe("US2: right boundary — separator priority", () => {
it("prefers highest-priority separator at right boundary", () => {
// Text with both \n\n and spaces. chunkLen big enough to reach \n\n
const text = "Hello world\n\nSecond paragraph here";
const chunks = splitStringOnChunks({
text,
chunkLen: 30,
overlap: 5,
separators: ["\n\n", " "],
});
// First chunk should end at or after \n\n (position 13) since it's highest priority
expect(chunks[0]!.end).toBe(13); // "Hello world\n\n"
});
it("picks rightmost occurrence of highest-priority separator", () => {
const text = "aaa bbb ccc ddd eee";
const chunks = splitStringOnChunks({
text,
chunkLen: 19,
overlap: 0,
separators: [" "],
});
// With chunkLen=19 and full text=19, should be one chunk
expect(chunks).toHaveLength(1);
expect(chunks[0]!.end).toBe(19);
});
it("falls back to lower-priority separator when higher not found", () => {
const text = "Hello world this is text";
const chunks = splitStringOnChunks({
text,
chunkLen: 15,
overlap: 0,
separators: ["\n\n", " "],
});
// No \n\n in text, should fall back to " "
// First chunk: budget=15, rightmost " " in [0,15) is at pos 11 ("Hello world ")
expect(chunks[0]!.end).toBe(12); // "Hello world " — end after the space
});
it("hard-splits when no separator found in budget window", () => {
const text = "abcdefghijklmnopqrstuvwxyz";
const chunks = splitStringOnChunks({
text,
chunkLen: 10,
overlap: 0,
separators: ["\n"],
});
// No \n in text, should hard-split at 10
expect(chunks[0]!.end).toBe(10);
});
});
describe("US2: left boundary — leftmost separator", () => {
it("picks leftmost separator regardless of priority", () => {
// Overlap window should contain both " " and "\n"
// " " at earlier position should win even though "\n" is higher priority
const text = "aaa bbb\nccc ddd eee fff ggg";
const chunks = splitStringOnChunks({
text,
chunkLen: 15,
overlap: 10,
separators: ["\n", " "],
});
// Second chunk: mid should be at chunks[0].end
// Overlap window looks back from mid. The leftmost separator in the window wins.
if (chunks.length > 1) {
const chunk1 = chunks[1]!;
// The overlap should start after the leftmost separator in the window
expect(chunk1.start).toBeLessThan(chunk1.mid);
}
});
it("picks longest match when separators start at same position", () => {
// Construct text where \n\n and \n overlap at the same position
// and only one occurrence exists in the overlap window
const text = "aaaa\n\nbbbbbbbbbb cccccccccc";
// 01234 56789...
const chunks = splitStringOnChunks({
text,
chunkLen: 20,
overlap: 10,
separators: ["\n\n", "\n"],
});
// Second chunk: mid = first chunk's end
// Overlap window looks back from mid. The \n\n at index 4 should be found.
// Since \n\n (len 2) and \n (len 1) both start at index 4,
// \n\n wins → overlap starts at index 6 (after \n\n), not index 5 (after \n)
if (chunks.length > 1) {
const chunk1 = chunks[1]!;
// Overlap starts after the \n\n separator
expect(chunk1.start).toBe(6); // position after \n\n
}
});
it("separator excluded from overlap text (FR-011)", () => {
const text = "hello world foo bar baz qux";
const chunks = splitStringOnChunks({
text,
chunkLen: 15,
overlap: 10,
separators: [" "],
});
for (let i = 1; i < chunks.length; i++) {
const chunk = chunks[i]!;
if (chunk.start < chunk.mid) {
// Overlap text should not start with the separator
const overlapText = text.slice(chunk.start, chunk.mid);
expect(overlapText[0]).not.toBe(" ");
}
}
});
it("no overlap when no separator in overlap window", () => {
const text = "abcdefghijklmnopqrstuvwxyz";
const chunks = splitStringOnChunks({
text,
chunkLen: 10,
overlap: 5,
separators: ["\n"],
});
// No \n anywhere, so no overlap for any chunk
for (const chunk of chunks) {
expect(chunk.start).toBe(chunk.mid);
}
});
});
describe("US2: budget sharing", () => {
it("right boundary uses remaining budget after overlap", () => {
// Build text where overlap will consume some of the chunkLen budget
const text = "aaa bbb ccc ddd eee fff ggg hhh iii jjj";
const chunks = splitStringOnChunks({
text,
chunkLen: 20,
overlap: 10,
separators: [" "],
});
for (const chunk of chunks) {
// Total chunk length must not exceed chunkLen
expect(chunk.end - chunk.start).toBeLessThanOrEqual(20);
}
// Verify all invariants hold with separators
assertChunkInvariants(chunks, text, 20, 10, 0);
});
it("chunkLen=100 with 30-char overlap leaves 70 for right boundary", () => {
// Create text with a separator at position 30 so overlap resolves to ~30 chars
const prefix = "a".repeat(29) + " "; // 30 chars, separator at 29
const body = "b".repeat(100); // plenty of text
const text = prefix + body;
const chunks = splitStringOnChunks({
text,
chunkLen: 100,
overlap: 35,
separators: [" "],
start: 30,
});
// First chunk: mid=30, overlap window [0,30), leftmost " " at 29
// start = 30 (after the space at 29), so overlap = 0 chars
// Actually the space is at index 29, overlap starts after it = 30 = mid
// So start === mid (separator right at boundary edge)
// Let's just verify invariants
assertChunkInvariants(chunks, text, 100, 35, 30);
});
});
// =====================
// US3: Edge Case Handling
// =====================
describe("US3: edge cases", () => {
it("returns empty array for empty text", () => {
expect(splitStringOnChunks({
text: "",
chunkLen: 10,
overlap: 5,
separators: [" "],
})).toEqual([]);
});
it("returns empty array for chunkLen <= 0", () => {
expect(splitStringOnChunks({
text: "hello",
chunkLen: 0,
overlap: 0,
separators: [],
})).toEqual([]);
expect(splitStringOnChunks({
text: "hello",
chunkLen: -5,
overlap: 0,
separators: [],
})).toEqual([]);
});
it("clamps overlap when overlap >= chunkLen", () => {
const text = "hello world foo bar";
const chunks = splitStringOnChunks({
text,
chunkLen: 10,
overlap: 100, // way larger than chunkLen
separators: [" "],
});
// Should not hang or crash — overlap clamped to chunkLen - 1
expect(chunks.length).toBeGreaterThan(0);
for (const chunk of chunks) {
expect(chunk.end - chunk.start).toBeLessThanOrEqual(10);
expect(chunk.mid - chunk.start).toBeLessThanOrEqual(9); // clamped to chunkLen - 1
}
});
it("returns single chunk when text is shorter than chunkLen", () => {
const text = "hi";
const chunks = splitStringOnChunks({
text,
chunkLen: 100,
overlap: 10,
separators: [" "],
});
expect(chunks).toHaveLength(1);
expect(chunks[0]).toEqual({
start: 0,
mid: 0,
end: 2,
text: "hi",
});
});
it("falls back to hard split with empty separators array", () => {
const text = "abcdefghij" + "klmnopqrst";
const chunks = splitStringOnChunks({
text,
chunkLen: 10,
overlap: 5,
separators: [],
});
// With no separators, should hard-split and have no overlaps
expect(chunks.length).toBeGreaterThan(1);
for (const chunk of chunks) {
expect(chunk.start).toBe(chunk.mid); // no overlap without separators
}
});
it("skips separator longer than remaining text in search window", () => {
const text = "abcde";
const chunks = splitStringOnChunks({
text,
chunkLen: 3,
overlap: 0,
separators: ["abcdef"], // longer than text
});
// Should not crash, should hard-split
expect(chunks.length).toBeGreaterThan(0);
assertChunkInvariants(chunks, text, 3, 0, 0);
});
it("returns empty array when start >= text.length", () => {
expect(splitStringOnChunks({
text: "hello",
chunkLen: 10,
overlap: 0,
separators: [],
start: 5,
})).toEqual([]);
expect(splitStringOnChunks({
text: "hello",
chunkLen: 10,
overlap: 0,
separators: [],
start: 100,
})).toEqual([]);
});
});
// =====================
// Performance
// =====================
describe("performance", () => {
it("processes 1MB text in under 1 second (SC-005)", () => {
// Build 1MB text with spaces every ~6 chars
const segment = "abcde ";
const repeats = Math.ceil(1_000_000 / segment.length);
const text = segment.repeat(repeats);
const startTime = performance.now();
const chunks = splitStringOnChunks({
text,
chunkLen: 1000,
overlap: 200,
separators: ["\n\n", "\n", ". ", " "],
});
const elapsed = performance.now() - startTime;
expect(elapsed).toBeLessThan(1000);
expect(chunks.length).toBeGreaterThan(0);
});
});
});
- Time: 0:20
- One-shot, no errors
- Size: 197 lines of md, 261 lines of tests
OpenSpec one-shotted the algorithm very quickly, and the code is solid; it was much faster than speckit. Either I have gotten better at explaining specs to LLMs after dozens of experiments, or it really is a good framework.
OpenSpec suffers from the same problem as the other frameworks: heavy overhead. The feature takes 197 lines of md files versus 70 in my hand-written specs. And all 197 of those lines have to be read and maintained. That would be fine if every line were useful, but the information density is low, there is a lot of repetition, and the experience is simply inconvenient.
The tests exist, but they suffer from the same low information density. They check the same things over and over in the same way. If I did not have the master tests from 'Writing the code ourselves', I could not say with confidence that everything works as intended.
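One possible way to raise that density is to state each invariant once and push every scenario through the same checker. The sketch below is an illustration, not the master tests mentioned above: findViolations is a hypothetical, dependency-free helper, and the Chunk shape is assumed to match the interface used in the generated code.

```typescript
interface Chunk { start: number; mid: number; end: number; text: string; }

// Hypothetical helper: returns the list of violated invariants.
// An empty list means the chunks are consistent with the source text.
function findViolations(
  chunks: Chunk[], text: string, chunkLen: number, overlap: number, inputStart = 0,
): string[] {
  const errors: string[] = [];
  if (chunks.length === 0) return errors;
  if (chunks[0]!.mid !== inputStart) errors.push("first mid !== start offset");
  if (chunks[chunks.length - 1]!.end !== text.length) errors.push("last end !== text.length");
  chunks.forEach((c, i) => {
    if (c.text !== text.slice(c.start, c.end)) errors.push(`chunk ${i}: text mismatch`);
    if (c.end - c.start > chunkLen) errors.push(`chunk ${i}: longer than chunkLen`);
    if (c.mid - c.start > overlap) errors.push(`chunk ${i}: overlap too long`);
    if (!(c.start <= c.mid && c.mid < c.end)) errors.push(`chunk ${i}: bad ordering`);
    if (i > 0 && c.mid !== chunks[i - 1]!.end) errors.push(`chunk ${i}: gap or repeat`);
  });
  return errors;
}

// Hand-built sample: chunkLen = 8, overlap = 4
const sampleText = "aaa bbb ccc";
const sampleChunks: Chunk[] = [
  { start: 0, mid: 0, end: 8, text: "aaa bbb " },
  { start: 4, mid: 8, end: 11, text: "bbb ccc" },
];
console.log(findViolations(sampleChunks, sampleText, 8, 4).length); // 0
```

With a checker like this, each unit or property test only has to supply inputs, and a failure message names the broken invariant directly.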
Initial request
/opsx:propose we are building splitStringOnChunks() TypeScript nodejs function that will cut incoming text on chunks for embeddings
encoding. Supports left overlap and separators.
As an input accepts a single object with:
text - original text
chunkLen - max len of a chunk
overlap - max len of an overlap
start - from what index start chunking text. this way the first chunk can have an overlap
separators - an array of non empty string separators in desc of importance order
As an output there should be an array of chunks, each chunk has:
start - from what index in the original text the chunk starts
mid - from where the new chunk text starts and overlap ends, [start, mid) - is the overlap
end - where the chunk ends
text - the full chunk text [start, end)
first we try to find left boundary and match the most left separator without any priority, just try to find the biggest overlap
if no separator found we skip overlap for this chunk mid = start
then we use unspent chunkLen budget and try to match the right boundary
try to find the most priority the most right separator
if no separator found we may split the text midword
for the left boundary I don't want include separatos in the chunk
for the right boundary I want to include separtors in the chunk
This produced several files
openspec/changes/split-string-on-chunks/proposal.md
We need a text chunking utility for preparing text for embeddings encoding. The function must split text into chunks of bounded size with configurable overlap to preserve context across chunk boundaries, and use separator-aware splitting to produce cleaner chunk boundaries.
- Add a new splitStringOnChunks() TypeScript function that:
  - Accepts a configuration object with text, chunkLen, overlap, start, and separators
  - Returns an array of chunk descriptors with start, mid, end, and text fields
  - Supports left overlap: finds the largest overlap by matching the most-left separator (no priority), excludes separator from chunk text
  - Supports right boundary: uses remaining chunkLen budget, matches the most-right occurrence of the highest-priority separator, includes separator in chunk text
  - Falls back to mid-word splitting when no separator matches
text-chunking: Core text chunking logic with overlap and separator-aware boundary detection
- New source file under src/ for the chunking function
- New test file for the chunking function
- No external dependencies required: pure string manipulation
openspec/changes/split-string-on-chunks/design.md
We need a utility function to split text into chunks suitable for embeddings encoding. The function operates on raw strings with index-based positioning, enabling callers to control chunking from any starting point and carry overlap from previous chunks. No external dependencies are needed — this is pure string manipulation.
Goals:
- Provide a single, stateless function splitStringOnChunks() that chunks text with overlap
- Support separator-aware boundary detection with configurable separator priority
- Left boundary: find largest possible overlap by matching the most-left separator (no priority ordering), exclude separator from chunk text
- Right boundary: use remaining chunkLen budget, prefer the most-right occurrence of the highest-priority separator, include separator in chunk text
- Gracefully handle edge cases: no separators found, text shorter than chunkLen, empty text
Non-Goals:
- Streaming or async chunking
- Token-based length (we operate on character length only)
- Recursive/hierarchical chunking strategies
- Encoding-aware splitting (e.g., UTF-16 surrogate pairs)
The function takes one config object and returns an array of chunk descriptors. This keeps the API surface minimal and easy to extend without breaking changes.
For the overlap region [start - overlap, start), we search for any separator occurrence and pick the one closest to the beginning of the overlap window. This maximizes overlap size. The separator itself is excluded from the chunk — the chunk starts after the separator. If no separator is found, mid = start (no overlap for this chunk).
Alternative considered: Priority-based left boundary — rejected because maximizing overlap is more valuable than separator importance on the left side.
For the right boundary, we consume the remaining chunkLen budget after the overlap. Within that window, we find the most-right occurrence of the highest-priority separator. The separator is included in the chunk text. If no separator is found, we split mid-word at the budget limit.
Alternative considered: Most-right of any separator — rejected because right boundary quality matters more for semantic coherence, so separator priority should be respected.
Each chunk returns start, mid, end, and text. The mid field separates overlap from new content: [start, mid) is overlap, [mid, end) is new. This lets callers distinguish overlap from fresh content without re-parsing.
- Multi-character separators: Searching for all separators at every position has O(text × separators × maxSepLen) cost → acceptable for typical chunk sizes (a few thousand chars) and small separator lists.
- Mid-word splits: When no separator matches the right boundary, we split mid-word. This is a deliberate trade-off: we guarantee bounded chunk size over clean boundaries.
- Overlap skip: If no separator is found in the overlap window, the chunk has no overlap (mid = start). Callers should be aware that overlap is best-effort.
openspec/changes/split-string-on-chunks/tasks.md
- 1.1 Create src/splitStringOnChunks.ts with the function signature, input type, and output type
- 1.2 Implement left boundary overlap detection: search overlap window for most-left separator, exclude separator from chunk, fall back to mid=start
- 1.3 Implement right boundary detection: use remaining chunkLen budget, find most-right occurrence of highest-priority separator, include separator, fall back to mid-word split
- 1.4 Implement the main chunking loop: iterate from start to end of text, producing chunk descriptors and advancing position
- 2.1 Add tests for basic chunking: no overlap, single separator, full text coverage
- 2.2 Add tests for left overlap: separator found (most-left), separator excluded from chunk, no separator falls back to mid=start
- 2.3 Add tests for right boundary: priority-based separator selection, fallback to lower-priority, mid-word split when no separator
- 2.4 Add tests for edge cases: empty text, start at end, text shorter than chunkLen, multi-character separators
- 2.5 Add tests for budget accounting: overlap consumes budget, chunk length never exceeds chunkLen
openspec/changes/split-string-on-chunks/specs/text-chunking/spec.md
The splitStringOnChunks() function SHALL accept a single object with the following properties:
- text (string): the original text to chunk
- chunkLen (number): maximum length of a chunk
- overlap (number): maximum length of overlap
- start (number): index in the original text from which to begin chunking
- separators (string[]): array of non-empty string separators in descending order of importance
The function SHALL return an array of chunk objects.
- WHEN called with { text: "hello world foo bar", chunkLen: 10, overlap: 0, start: 0, separators: [" "] }
- THEN the function returns an array of chunk objects covering the entire text
- WHEN called with { text: "", chunkLen: 10, overlap: 0, start: 0, separators: [" "] }
- THEN the function returns an empty array
- WHEN called with { text: "hello", chunkLen: 10, overlap: 0, start: 5, separators: [" "] }
- THEN the function returns an empty array
Each chunk object SHALL have the following properties:
- start (number): index in the original text where the chunk starts
- mid (number): index where the new content starts (overlap is [start, mid))
- end (number): index where the chunk ends (exclusive)
- text (string): the full chunk text equal to text.slice(start, end)
- WHEN a chunk is returned with start: 2, end: 8
- THEN chunk.text SHALL equal originalText.slice(2, 8)
- WHEN a chunk has no overlap (first chunk with start: 0 and overlap: 0)
- THEN chunk.mid SHALL equal chunk.start
When overlap > 0 and start > 0, the function SHALL search for the left boundary in the region [start - overlap, start). It SHALL find the most-left occurrence of any separator (regardless of priority) to maximize overlap size. The separator itself SHALL NOT be included in the chunk — the chunk starts after the separator.
- WHEN text is "aaa.bbb.ccc.ddd", start is 12, overlap is 8, separators are ["."]
- THEN the chunk's start SHALL be 4 (after the first . found in the overlap window [4, 12)), giving maximum overlap
- WHEN text is "hello world test", start is 11, overlap is 6, separators are [" "]
- THEN the chunk starts after the space separator (the space is not part of the chunk text)
- WHEN no separator is found in the overlap region
- THEN mid SHALL equal start and the chunk has no overlap (the chunk starts at start)
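The first scenario is easy to check by hand. Below is a minimal sketch, assuming a hypothetical helper separatorHits (not part of the generated module) that lists the separator positions lying fully inside the overlap window. For these inputs the only hits inside [4, 12) are at indices 7 and 11, so the left-most start after a separator is 8, the value the generated test further down asserts; 4 would follow the "." at index 3, which lies just before the window.

```typescript
// Hypothetical helper (an assumption for illustration): list every separator
// occurrence that lies fully inside [windowStart, windowEnd), left to right.
function separatorHits(
  text: string,
  windowStart: number,
  windowEnd: number,
  separators: string[],
): { sep: string; index: number }[] {
  const hits: { sep: string; index: number }[] = [];
  for (const sep of separators) {
    if (sep.length === 0) continue; // the spec requires non-empty separators
    let from = windowStart;
    for (;;) {
      const index = text.indexOf(sep, from);
      if (index < 0 || index + sep.length > windowEnd) break;
      hits.push({ sep, index });
      from = index + 1;
    }
  }
  return hits.sort((a, b) => a.index - b.index);
}

// Values from the first left-boundary scenario.
const scenarioText = "aaa.bbb.ccc.ddd";
const scenarioStart = 12;
const scenarioOverlap = 8;
const windowStart = Math.max(0, scenarioStart - scenarioOverlap); // 4

const hits = separatorHits(scenarioText, windowStart, scenarioStart, ["."]);
const firstHit = hits[0];
// The chunk starts right after the left-most separator in the window,
// or at scenarioStart when the window contains no separator.
const chunkStart = firstHit
  ? firstHit.index + firstHit.sep.length
  : scenarioStart; // 8
```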
After determining the left boundary, the function SHALL use the remaining chunkLen budget to find the right boundary. It SHALL search for the most-right occurrence of the highest-priority separator within the budget window. The separator SHALL be included in the chunk text.
- WHEN text is "hello world. foo bar", start is 0, chunkLen is 18, separators are [". ", " "]
- THEN the chunk ends after ". " (the higher-priority separator), even though " " appears further right
- WHEN the highest-priority separator is not found in the right boundary window
- THEN the function SHALL try the next separator in priority order
- WHEN a separator is matched at the right boundary
- THEN the chunk text SHALL include the separator characters
- WHEN no separator is found within the chunkLen budget
- THEN the chunk SHALL end at exactly start + chunkLen (or end of text), splitting mid-word if necessary
The function SHALL produce chunks that together cover the entire text from start to the end. Each chunk's mid value SHALL equal the previous chunk's end value (or start for the first chunk when no overlap is found), ensuring no text is skipped.
- WHEN text is "the quick brown fox jumps over the lazy dog" with chunkLen 15, overlap 5, start 0, separators [" "]
- THEN concatenating the non-overlap portions text.slice(chunk.mid, chunk.end) of all chunks SHALL produce the original text from start onward
- WHEN the remaining text is shorter than chunkLen
- THEN the last chunk SHALL contain all remaining text without padding
The total length of each chunk (end - start) SHALL NOT exceed chunkLen. The overlap consumes part of the budget, and the right boundary uses the remainder.
- WHEN overlap is found spanning 4 characters
- THEN the right boundary search window is limited to chunkLen - 4 characters from mid
- WHEN any chunk is produced
- THEN chunk.end - chunk.start SHALL be less than or equal to chunkLen
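The coverage, text and budget requirements above can be folded into one reusable check. This is a minimal sketch, not something OpenSpec generated: checkChunkInvariants is a hypothetical helper, and the sample chunk lists are written by hand.

```typescript
interface Chunk {
  start: number;
  mid: number;
  end: number;
  text: string;
}

// Hypothetical checker (an assumption for illustration): returns a list of
// violated requirements; an empty list means all of them hold.
function checkChunkInvariants(
  text: string,
  chunks: Chunk[],
  chunkLen: number,
  start: number,
): string[] {
  const problems: string[] = [];
  let expectedMid = start; // the first chunk's new content begins at start
  chunks.forEach((c, i) => {
    if (c.text !== text.slice(c.start, c.end)) {
      problems.push(`chunk ${i}: text differs from text.slice(start, end)`);
    }
    if (c.end - c.start > chunkLen) {
      problems.push(`chunk ${i}: length ${c.end - c.start} exceeds chunkLen ${chunkLen}`);
    }
    if (c.mid !== expectedMid) {
      problems.push(`chunk ${i}: mid ${c.mid} does not continue from ${expectedMid}`);
    }
    expectedMid = c.end;
  });
  if (start < text.length && expectedMid !== text.length) {
    problems.push(`coverage stops at ${expectedMid}, text length is ${text.length}`);
  }
  return problems;
}

// Hand-written chunk lists for "aaa bbb ccc" with chunkLen 8.
const coveredText = "aaa bbb ccc";
const goodChunks: Chunk[] = [
  { start: 0, mid: 0, end: 8, text: "aaa bbb " },
  { start: 4, mid: 8, end: 11, text: "bbb ccc" },
];
const gapChunks: Chunk[] = [
  { start: 0, mid: 0, end: 8, text: "aaa bbb " },
  { start: 5, mid: 9, end: 11, text: "bb ccc" }, // skips index 8
];
const noProblems = checkChunkInvariants(coveredText, goodChunks, 8, 0);
const gapProblems = checkChunkInvariants(coveredText, gapChunks, 8, 0);
```

The first list passes with no problems, while the second is reported because its second chunk's mid does not continue from the previous end.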
I read it through and it all looks fine, so let's compile it into code
/opsx:apply
It works faster than speckit: the coding took less than 5 minutes. We check the implementation against our secret tests. Everything is green: a one-shot.
interface SplitStringOnChunksInput {
  text: string;
  chunkLen: number;
  overlap: number;
  start: number;
  separators: string[];
}

interface Chunk {
  start: number;
  mid: number;
  end: number;
  text: string;
}

/**
 * Find the most-left occurrence of any separator in the window [windowStart, windowEnd).
 * Returns the index right after the separator, or -1 if none found.
 */
function findLeftBoundary(
  text: string,
  windowStart: number,
  windowEnd: number,
  separators: string[],
): number {
  let bestPos = -1; // position after separator
  for (const sep of separators) {
    const idx = text.indexOf(sep, windowStart);
    if (idx >= 0 && idx + sep.length <= windowEnd) {
      const afterSep = idx + sep.length;
      if (bestPos === -1 || afterSep < bestPos) {
        bestPos = afterSep;
      }
    }
  }
  return bestPos;
}

/**
 * Find the most-right occurrence of the highest-priority separator in [windowStart, windowEnd).
 * Returns the index after the separator (end of chunk), or -1 if none found.
 */
function findRightBoundary(
  text: string,
  windowStart: number,
  windowEnd: number,
  separators: string[],
): number {
  for (const sep of separators) {
    // Find the most-right occurrence of this separator
    let bestIdx = -1;
    let searchFrom = windowStart;
    while (searchFrom < windowEnd) {
      const idx = text.indexOf(sep, searchFrom);
      if (idx < 0 || idx + sep.length > windowEnd) break;
      bestIdx = idx;
      searchFrom = idx + 1;
    }
    if (bestIdx >= 0) {
      return bestIdx + sep.length;
    }
  }
  return -1;
}

function splitStringOnChunks(input: SplitStringOnChunksInput): Chunk[] {
  const { text, chunkLen, overlap, separators } = input;
  const chunks: Chunk[] = [];
  let pos = input.start;
  while (pos < text.length) {
    let chunkStart: number;
    let mid: number;
    // Left boundary: find overlap
    if (overlap > 0 && pos > 0) {
      const windowStart = Math.max(0, pos - overlap);
      const leftBound = findLeftBoundary(text, windowStart, pos, separators);
      if (leftBound >= 0) {
        chunkStart = leftBound;
        mid = pos;
      } else {
        chunkStart = pos;
        mid = pos;
      }
    } else {
      chunkStart = pos;
      mid = pos;
    }
    // Right boundary: use remaining budget
    const overlapLen = mid - chunkStart;
    const budget = chunkLen - overlapLen;
    const maxEnd = Math.min(text.length, mid + budget);
    let chunkEnd: number;
    if (maxEnd >= text.length) {
      // Remaining text fits in budget
      chunkEnd = text.length;
    } else {
      // Try to find a separator for clean break
      const rightBound = findRightBoundary(text, mid, maxEnd, separators);
      if (rightBound >= 0) {
        chunkEnd = rightBound;
      } else {
        // Split mid-word
        chunkEnd = maxEnd;
      }
    }
    chunks.push({
      start: chunkStart,
      mid,
      end: chunkEnd,
      text: text.slice(chunkStart, chunkEnd),
    });
    // Advance position to end of new content
    pos = chunkEnd;
  }
  return chunks;
}

export = splitStringOnChunks;
import splitStringOnChunks = require("./splitStringOnChunks");
import test = require("node:test");
import assert = require("node:assert/strict");

const { describe, it } = test;

// 2.1 Basic chunking: no overlap, single separator, full text coverage
describe("basic chunking", () => {
  it("splits text on spaces with no overlap", () => {
    const chunks = splitStringOnChunks({
      text: "hello world foo bar",
      chunkLen: 10,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    assert.ok(chunks.length > 0);
    for (const c of chunks) {
      assert.equal(c.text, c.text.length > 0 ? c.text : "");
      assert.equal(c.mid, c.start); // no overlap
    }
  });

  it("covers full text with no gaps", () => {
    const text = "the quick brown fox jumps over the lazy dog";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 15,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    const reconstructed = chunks.map((c) => text.slice(c.mid, c.end)).join("");
    assert.equal(reconstructed, text);
  });

  it("chunk.text matches text.slice(start, end)", () => {
    const text = "hello world foo bar baz";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 12,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    for (const c of chunks) {
      assert.equal(c.text, text.slice(c.start, c.end));
    }
  });
});

// 2.2 Left overlap tests
describe("left overlap", () => {
  it("finds most-left separator to maximize overlap", () => {
    // text: "aaa.bbb.ccc.ddd", start=12, overlap=8
    // overlap window: [4, 12) = "bbb.ccc."
    // most-left separator "." is at index 3 - but that's before window start (4)
    // In window [4,12): "." at index 7, "." at index 11
    // most-left in window after sep: index 7+1=8
    const text = "aaa.bbb.ccc.ddd";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 20,
      overlap: 8,
      start: 12,
      separators: ["."],
    });
    assert.equal(chunks.length, 1);
    assert.equal(chunks[0]!.start, 8); // after "." at index 7
    assert.equal(chunks[0]!.mid, 12);
  });

  it("excludes separator from chunk text on left boundary", () => {
    const text = "hello world test";
    // start=11 ("test"), overlap=6, window=[5,11)=" world"
    // " " at index 5, after sep = 6
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 20,
      overlap: 6,
      start: 11,
      separators: [" "],
    });
    assert.equal(chunks[0]!.start, 6); // after space at 5
    assert.ok(!chunks[0]!.text.startsWith(" ")); // separator excluded
  });

  it("falls back to mid=start when no separator in overlap window", () => {
    const text = "abcdefghijklmnop";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 10,
      overlap: 5,
      start: 8,
      separators: [" "],
    });
    assert.equal(chunks[0]!.start, 8);
    assert.equal(chunks[0]!.mid, 8);
  });
});

// 2.3 Right boundary tests
describe("right boundary", () => {
  it("uses highest-priority separator", () => {
    // text: "hello world. foo bar", chunkLen=18, separators=[". ", " "]
    // window [0, 18) = "hello world. foo b"
    // ". " at index 11 → end at 13
    // " " at index 16 ("b") - more right but lower priority
    // Should pick ". " because it's higher priority
    const text = "hello world. foo bar";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 18,
      overlap: 0,
      start: 0,
      separators: [". ", " "],
    });
    assert.equal(chunks[0]!.end, 13); // after ". "
    assert.equal(chunks[0]!.text, "hello world. ");
  });

  it("falls back to lower-priority separator when higher not found", () => {
    const text = "hello world foo";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 13,
      overlap: 0,
      start: 0,
      separators: [". ", " "],
    });
    // no ". " found, falls back to " "
    // " " most-right in [0,13) = "hello world f" → " " at 11
    assert.equal(chunks[0]!.end, 12); // after " " at 11
  });

  it("includes separator in chunk text", () => {
    const text = "hello world foo";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 10,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    // First chunk should end after a space, including the space
    assert.ok(chunks[0]!.text.endsWith(" "));
  });

  it("splits mid-word when no separator found", () => {
    const text = "abcdefghijklmnop";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 6,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    assert.equal(chunks[0]!.end, 6);
    assert.equal(chunks[0]!.text, "abcdef");
  });
});

// 2.4 Edge cases
describe("edge cases", () => {
  it("empty text returns empty array", () => {
    const chunks = splitStringOnChunks({
      text: "",
      chunkLen: 10,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    assert.deepEqual(chunks, []);
  });

  it("start at end of text returns empty array", () => {
    const chunks = splitStringOnChunks({
      text: "hello",
      chunkLen: 10,
      overlap: 0,
      start: 5,
      separators: [" "],
    });
    assert.deepEqual(chunks, []);
  });

  it("text shorter than chunkLen returns single chunk", () => {
    const chunks = splitStringOnChunks({
      text: "hi",
      chunkLen: 100,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    assert.equal(chunks.length, 1);
    assert.equal(chunks[0]!.text, "hi");
  });

  it("handles multi-character separators", () => {
    const text = "part1---part2---part3";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 10,
      overlap: 0,
      start: 0,
      separators: ["---"],
    });
    assert.equal(chunks[0]!.text, "part1---");
    assert.equal(chunks[0]!.end, 8);
  });
});

// 2.5 Budget accounting
describe("budget accounting", () => {
  it("overlap consumes part of chunkLen budget", () => {
    const text = "aaa bbb ccc ddd eee";
    // start=8 ("ccc ddd eee"), overlap=5, chunkLen=12
    // overlap window [3,8) = " bbb ", " " at 3 → afterSep=4
    // overlapLen = 8-4 = 4, budget = 12-4 = 8
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 12,
      overlap: 5,
      start: 8,
      separators: [" "],
    });
    assert.ok(chunks[0]!.end - chunks[0]!.start <= 12);
  });

  it("chunk length never exceeds chunkLen", () => {
    const text = "the quick brown fox jumps over the lazy dog";
    const chunkLen = 15;
    const chunks = splitStringOnChunks({
      text,
      chunkLen,
      overlap: 5,
      start: 0,
      separators: [" "],
    });
    for (const c of chunks) {
      assert.ok(
        c.end - c.start <= chunkLen,
        `Chunk [${c.start},${c.end}) length ${c.end - c.start} exceeds chunkLen ${chunkLen}`,
      );
    }
  });

  it("full coverage with overlap: non-overlap portions reconstruct text", () => {
    const text = "the quick brown fox jumps over the lazy dog";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 15,
      overlap: 5,
      start: 0,
      separators: [" "],
    });
    const reconstructed = chunks.map((c) => text.slice(c.mid, c.end)).join("");
    assert.equal(reconstructed, text);
  });
});
- Failure. Some of the bugs cannot be fixed
Let's try handing every decision over to OpenSpec: we only state the task and stay out of the design process.
Claude correctly figured out that embeddings need both overlap and prioritized separators. Although the separators, for example, cannot be configured.
However, Claude once again made dumb architectural decisions: the overlap cuts words in half, which breaks the context for the LLM. There are also major bugs: at certain overlap values something wrecks the whole output. Claude is simply lazy and writes no concrete tests for the overlap; it only checks that the overlap exists.
I ran the cycle a couple more times:
/opsx:propose fix bugs and make thing better
/opsx:apply
Some of the bugs went away. But both the left and the right chunk boundaries can still easily cut words. That means the basic functionality is not delivered.
/opsx:propose Write a TypeScript function that will split text on chunks for embeddings. Test the function with tests and property-based testing.
/opsx:apply
The overlap can cut a word in half. There are also major bugs:
text="Hello world. something long" chunkSize=16 overlap=6
The result is a plain explosion:
"Hello world. "
"llo world. "
"lo world. "
"o world. "
" world. "
"world. "
"orld. "
"rld. "
"ld. "
"d. "
". something long"
We run it a couple more times:
/opsx:propose fix bugs and make thing better
/opsx:apply
/opsx:propose fix bugs and make thing better
/opsx:apply
The original bug is fixed, and now:
text="Hello world. something long" chunkSize=16 overlap=6
"Hello world. "
"orld. something "
"thing long"
The overlap cuts words, but that is by design. In fact the right boundary also cuts words when there are \n\n separators. There are most likely some more bugs, but at this point I could not be bothered to look for them.
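The claim that a boundary cuts a word can be turned into a concrete check instead of eyeballing the output. A minimal sketch, assuming a hypothetical helper splitsWord that treats only ASCII letters and digits as word characters (an assumption made for brevity; real text would need a broader definition). The offsets are recovered by searching the text for the chunks printed above.

```typescript
// Hypothetical helper (an assumption for illustration): true when position
// `pos` falls strictly inside a word, i.e. the characters on both sides of
// the cut are ASCII letters or digits.
function splitsWord(text: string, pos: number): boolean {
  if (pos <= 0 || pos >= text.length) return false;
  const isWordChar = (ch: string): boolean => /[A-Za-z0-9]/.test(ch);
  return isWordChar(text.charAt(pos - 1)) && isWordChar(text.charAt(pos));
}

const exampleText = "Hello world. something long";

// Offsets at which the second and third chunks of the output above begin.
const secondChunkStart = exampleText.indexOf("orld. something ");
const thirdChunkStart = exampleText.indexOf("thing long");

const cutsSecond = splitsWord(exampleText, secondChunkStart); // cut is "w|orld"
const cutsThird = splitsWord(exampleText, thirdChunkStart); // cut is "some|thing"
const cutsAfterSentence = splitsWord(exampleText, 13); // cut follows ". "
```

Both chunk starts are flagged as word cuts, while the cut after ". " is not. A check like this could be asserted for every chunk boundary in a test, rather than only checking that some overlap exists.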
export interface ChunkOptions {
  /** Maximum chunk size in characters */
  chunkSize: number;
  /** Number of overlapping characters between consecutive chunks (default: 0) */
  overlap?: number;
}

/**
 * Find the last paragraph boundary (\n\n) in text up to maxPos.
 * Returns the index after the boundary, or -1 if none found.
 */
function findParagraphBoundary(text: string, maxPos: number): number {
  const idx = text.lastIndexOf("\n\n", maxPos - 1);
  if (idx <= 0) return -1;
  return idx + 2; // position after the double newline
}

/**
 * Find the last sentence boundary (.!? followed by whitespace) in text up to maxPos.
 * Returns the index after the whitespace, or -1 if none found.
 */
function findSentenceBoundary(text: string, maxPos: number): number {
  for (let i = maxPos - 1; i >= 1; i--) {
    const ch = text[i];
    if (ch === " " || ch === "\n" || ch === "\t") {
      const prev = text[i - 1];
      if (prev === "." || prev === "!" || prev === "?") {
        return i + 1; // position after the whitespace
      }
    }
  }
  return -1;
}

/**
 * Find the last word boundary (space) in text up to maxPos.
 * Returns the index after the space, or -1 if none found.
 */
function findWordBoundary(text: string, maxPos: number): number {
  for (let i = maxPos; i >= 1; i--) {
    if (text[i - 1] === " ") {
      return i;
    }
  }
  return -1;
}

/**
 * Split text into chunks suitable for embedding models.
 *
 * Chunks are boundary-aware: the algorithm prefers splitting at paragraph
 * or sentence boundaries. Falls back to a hard split at chunkSize when
 * no boundary is found.
 *
 * @param text - The input text to split
 * @param options - Chunking configuration
 * @returns Array of text chunks
 */
export function splitTextForEmbeddings(
  text: string,
  options: ChunkOptions,
): string[] {
  const { chunkSize, overlap = 0 } = options;
  if (chunkSize < 1) {
    throw new Error("chunkSize must be at least 1");
  }
  if (overlap < 0) {
    throw new Error("overlap must not be negative");
  }
  if (overlap >= chunkSize) {
    throw new Error("overlap must be less than chunkSize");
  }
  if (text.length === 0) {
    return [];
  }
  if (text.length <= chunkSize) {
    return [text];
  }
  const chunks: string[] = [];
  let pos = 0;
  while (pos < text.length) {
    const remaining = text.length - pos;
    // If remaining text fits in one chunk, take it all
    if (remaining <= chunkSize) {
      chunks.push(text.slice(pos));
      break;
    }
    // Try to find the best boundary within the chunk window
    const window = text.slice(pos, pos + chunkSize);
    let splitAt: number;
    // 1. Try paragraph boundary (only if chunk would be large enough for overlap)
    const paraBoundary = findParagraphBoundary(window, chunkSize);
    if (paraBoundary > overlap) {
      splitAt = paraBoundary;
    } else {
      // 2. Try sentence boundary (only if chunk would be large enough for overlap)
      const sentBoundary = findSentenceBoundary(window, chunkSize);
      if (sentBoundary > overlap) {
        splitAt = sentBoundary;
      } else {
        // 3. Try word boundary (only if chunk would be large enough for overlap)
        const wordBoundary = findWordBoundary(window, chunkSize);
        if (wordBoundary > overlap) {
          splitAt = wordBoundary;
        } else {
          // 4. Hard split
          splitAt = chunkSize;
        }
      }
    }
    chunks.push(window.slice(0, splitAt));
    // Advance position, accounting for overlap
    pos += splitAt - overlap;
  }
  return chunks;
}
import { describe, it, expect } from "vitest";
import * as fc from "fast-check";
import { splitTextForEmbeddings } from "./chunk-text.js";

// ---------------------------------------------------------------------------
// Unit Tests
// ---------------------------------------------------------------------------
describe("splitTextForEmbeddings", () => {
  describe("basic behavior", () => {
    it("returns empty array for empty text", () => {
      expect(splitTextForEmbeddings("", { chunkSize: 100 })).toEqual([]);
    });

    it("splits long text into chunks not exceeding chunkSize", () => {
      const text = "a".repeat(1000);
      const result = splitTextForEmbeddings(text, { chunkSize: 200 });
      for (const chunk of result) {
        expect(chunk.length).toBeLessThanOrEqual(200);
      }
      expect(result.length).toBeGreaterThan(1);
    });
  });

  describe("boundary-aware splitting", () => {
    it("splits at paragraph boundary", () => {
      const text = "First paragraph content.\n\nSecond paragraph that is long enough to matter for splitting purposes.";
      const result = splitTextForEmbeddings(text, { chunkSize: 40 });
      expect(result[0]).toBe("First paragraph content.\n\n");
    });

    it("splits at sentence boundary when no paragraph boundary", () => {
      const text = "First sentence here. Second sentence that continues for a while and is quite long.";
      const result = splitTextForEmbeddings(text, { chunkSize: 30 });
      expect(result[0]).toBe("First sentence here. ");
    });

    it("hard splits when no boundary found", () => {
      const text = "abcdefghijklmnopqrstuvwxyz";
      const result = splitTextForEmbeddings(text, { chunkSize: 10 });
      expect(result[0]).toBe("abcdefghij");
      expect(result[0].length).toBe(10);
    });

    it("splits at word boundary when no sentence boundary", () => {
      const text = "hello world foobar baz qux";
      const result = splitTextForEmbeddings(text, { chunkSize: 12 });
      expect(result[0]).toBe("hello world ");
    });

    it("reconstructs text with word-boundary splitting and overlap", () => {
      const text = "alpha beta gamma delta epsilon zeta eta theta";
      const overlap = 5;
      const result = splitTextForEmbeddings(text, { chunkSize: 15, overlap });
      for (const chunk of result) {
        expect(chunk.length).toBeLessThanOrEqual(15);
      }
      for (let i = 0; i < result.length - 1; i++) {
        const tail = result[i].slice(-overlap);
        const head = result[i + 1].slice(0, overlap);
        expect(tail).toBe(head);
      }
      let reconstructed = result[0];
      for (let i = 1; i < result.length; i++) {
        reconstructed += result[i].slice(overlap);
      }
      expect(reconstructed).toBe(text);
    });
  });

  describe("overlap", () => {
    it("overlapping chunks share characters at boundaries", () => {
      const text = "abcdefghijklmnopqrstuvwxyz";
      const result = splitTextForEmbeddings(text, {
        chunkSize: 10,
        overlap: 3,
      });
      for (let i = 0; i < result.length - 1; i++) {
        const tail = result[i].slice(-3);
        const head = result[i + 1].slice(0, 3);
        expect(tail).toBe(head);
      }
    });

    it("reconstructs original text with overlap removal", () => {
      const text = "abcdefghijklmnopqrstuvwxyz";
      const overlap = 3;
      const result = splitTextForEmbeddings(text, {
        chunkSize: 10,
        overlap,
      });
      let reconstructed = result[0];
      for (let i = 1; i < result.length; i++) {
        reconstructed += result[i].slice(overlap);
      }
      expect(reconstructed).toBe(text);
    });

    it("throws when overlap >= chunkSize", () => {
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: 5, overlap: 5 }),
      ).toThrow("overlap must be less than chunkSize");
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: 5, overlap: 10 }),
      ).toThrow("overlap must be less than chunkSize");
    });
  });

  describe("regression: overlap + boundary interaction", () => {
    it("handles overlap larger than early sentence boundary", () => {
      const text = "Done. " + "x".repeat(100);
      const result = splitTextForEmbeddings(text, { chunkSize: 20, overlap: 10 });
      // All chunks must respect size limit
      for (const chunk of result) {
        expect(chunk.length).toBeLessThanOrEqual(20);
      }
      // Overlap must match between consecutive chunks
      for (let i = 0; i < result.length - 1; i++) {
        const tail = result[i].slice(-10);
        const head = result[i + 1].slice(0, 10);
        expect(tail).toBe(head);
      }
      // Reconstruction must produce original text
      let reconstructed = result[0];
      for (let i = 1; i < result.length; i++) {
        reconstructed += result[i].slice(10);
      }
      expect(reconstructed).toBe(text);
    });

    it("reconstructs multi-sentence text with overlap", () => {
      const text = "Hello world. This is a test. Another sentence here. Final part.";
      const overlap = 10;
      const result = splitTextForEmbeddings(text, { chunkSize: 30, overlap });
      let reconstructed = result[0];
      for (let i = 1; i < result.length; i++) {
        reconstructed += result[i].slice(overlap);
      }
      expect(reconstructed).toBe(text);
    });
  });

  describe("validation", () => {
    it("throws when chunkSize is 0", () => {
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: 0 }),
      ).toThrow("chunkSize must be at least 1");
    });

    it("throws when chunkSize is negative", () => {
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: -1 }),
      ).toThrow("chunkSize must be at least 1");
    });

    it("throws when overlap is negative", () => {
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: 10, overlap: -1 }),
      ).toThrow("overlap must not be negative");
    });
  });

  describe("roundtrip integrity", () => {
    it("joining chunks without overlap reproduces original text", () => {
      const text =
        "The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs.";
      const result = splitTextForEmbeddings(text, {
        chunkSize: 30,
        overlap: 0,
      });
      expect(result.join("")).toBe(text);
    });
  });
});

// ---------------------------------------------------------------------------
// Property-Based Tests
// ---------------------------------------------------------------------------
describe("splitTextForEmbeddings (property-based)", () => {
  // Arbitrary for valid chunk options: chunkSize >= 1, 0 <= overlap < chunkSize
  const chunkOptionsArb = fc
    .record({
      chunkSize: fc.integer({ min: 1, max: 500 }),
      overlapRatio: fc.double({ min: 0, max: 0.9, noNaN: true }),
    })
    .map(({ chunkSize, overlapRatio }) => ({
      chunkSize,
      overlap: Math.floor(chunkSize * overlapRatio),
    }));

  it("no chunk exceeds chunkSize", () => {
    fc.assert(
      fc.property(fc.string({ minLength: 0, maxLength: 2000 }), chunkOptionsArb, (text, opts) => {
        const chunks = splitTextForEmbeddings(text, opts);
        for (const chunk of chunks) {
          expect(chunk.length).toBeLessThanOrEqual(opts.chunkSize);
        }
      }),
      { numRuns: 200 },
    );
  });

  it("no text is lost when overlap is 0", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 2000 }),
        fc.integer({ min: 1, max: 500 }),
        (text, chunkSize) => {
          const chunks = splitTextForEmbeddings(text, {
            chunkSize,
            overlap: 0,
          });
          expect(chunks.join("")).toBe(text);
        },
      ),
      { numRuns: 200 },
    );
  });

  it("overlap correctness: tail of chunk N matches head of chunk N+1", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 2000 }),
        chunkOptionsArb.filter((o) => o.overlap > 0),
        (text, opts) => {
          const chunks = splitTextForEmbeddings(text, opts);
          for (let i = 0; i < chunks.length - 1; i++) {
            const tail = chunks[i].slice(-opts.overlap);
            const head = chunks[i + 1].slice(0, opts.overlap);
            expect(tail).toBe(head);
          }
        },
      ),
      { numRuns: 200 },
    );
  });

  it("chunk count is within expected bounds", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 2000 }),
        chunkOptionsArb,
        (text, opts) => {
          const chunks = splitTextForEmbeddings(text, opts);
          expect(chunks.length).toBeGreaterThanOrEqual(1);
          const step = opts.chunkSize - opts.overlap;
          const maxChunks = Math.ceil(text.length / Math.max(step, 1)) + 1;
          expect(chunks.length).toBeLessThanOrEqual(maxChunks);
        },
      ),
      { numRuns: 200 },
    );
  });
});