aleksejkozin/how-to-llm-x10.md

How do you write programs with an LLM 10 times faster and have them f***ing work as intended?

LLMs are stunningly smart and stunningly dumb. Talking to an LLM throws me from euphoria to despair and back. It feels like a 10x productivity boost is within reach, but in practice it is 2 a.m. and I am swearing at Claude. This is no way to live. Time to stop, breathe out, and think about how to use an LLM as leverage.

LLM-assisted development splits into two parts:

Micro: a small task that takes minimal context, where the LLM has to follow up to 100 instructions (~95% accuracy for claude 3.7, according to Writing a good CLAUDE.md). In this mode we never hit the "dumb zone", and the LLM can carry out every instruction without degradation
Macro: a pile of existing code and a pile of instructions, all of which has to be sliced into "Micro" contexts

Programming includes many subtasks: design, testing, security, performance, monitoring, maintenance, and so on. Each branch has its own quirks, but I assume the way an LLM works is roughly the same in all of them. This article focuses on design and testing. Adding performance, security, and monitoring to the stack would give higher-quality code on the output side, but I am simply going to skip them.

This article looks at the micro level specifically, as the foundation. If we can get 10x leverage at the micro level, scaling it to macro and burning thousands of dollars/tokens a day is almost trivial.

The Main Question we will try to answer:

How, at the micro level, do you get a program from an LLM
that does what you want, without bugs,
with minimal programmer time?

To answer the Main Question, this article takes one simple algorithm, "text chunking for embeddings", one JS module, and implements it in different ways with Claude. The algorithm is nontrivial: Claude (Opus 4.6, March 2026) did not one-shot it from a simple vibe-code prompt. It fits into the context easily, and Claude is not overloaded with instructions.

The resources we are limited in, and which we will measure:

Programmer time. There is no point in using an LLM if you can write it all yourself in the same time. We want 10x to justify the bill
Context size and instruction count. Both are very limited. The less text in the system, the better everything works

There are several methodologies for working with an LLM that we will look at:

Vibe coding
Planning mode
Spec Driven Development
Frameworks like spec-kit or OpenSpec
Calling it in a loop with a prompt

We will not look at corporate tools that launch hordes of agents for crazy money (I have seen this from Dex Horthy: Research Plan Implement), because we have no money. They are more macro-focused anyway, and are split into specialized micro agents plus deterministic scaffolding. The micro agents still end up writing and reading md/code.

Methodologies can be grouped by how the input is given:

Programmer writes in chat > LLM writes code
Programmer writes in chat > LLM writes md > LLM reads md and writes code
Programmer writes md > LLM reads md and writes code

And by how the result is verified:

Programmer reads code
Programmer reads md

Brew some tea and let's see which methodology and approach work best. It will be fun.

Results

I implemented the text-chunking-for-embeddings algorithm in various ways dozens of times.

The pure failures: vibe coding, and every attempt, by any means, to write specs without the programmer involved. The remaining approaches reached the goal one way or another.

Hand-written specs came out best: fastest, most compact, easiest to understand. OpenSpec is second: fast but bulky. Then planning mode, convenient and inconvenient at once. Then Spec-Kit, super bloated and slow.

How, at the micro level, do you get a program from an LLM
that does what you want, without bugs,
with minimal programmer time?

.md files are the program's new source code. They must live in the repository, and we compile them to code through the LLM
Write .md files by hand only, and keep the LLM out of them. The repository needs a strict separation between files a human writes and reads and files the LLM writes; do not mix them
An LLM cannot design software, but it writes code from spec files very well. The program gets written 6 times faster than by hand, with decent product quality
Describe the test cases yourself in the .md files. The LLM cheats and cannot test itself
The programmer must validate the code the LLM compiled. Validating the test cases is the main route for validating the code. It is many times faster than reading the program's source
Keep .md files as small and as information-dense as possible, and load only the ones actually needed in the context

There are three main findings:

1. Without a human, an LLM cannot design a program even slightly above trivial

It can google or ask the operator, but it cannot do this on its own. "An LLM cannot think".

I tried a very simple program: "split text into chunks for later embedding". The LLM correctly understood that separators, a max chunk length, and overlap were needed. But it did not understand why they were needed. It split words in the middle, or produced chunks without overlap even though overlap was configured. The result was awful. Asking an LLM to design software is fundamentally wrong; it is rowing against the current

Good examples of bad architecture:

Loop chat
OpenSpec without a human

2. LLMs are very good at translating text from one domain to another

Like a Fourier transform, only for text. A Fourier transform is not intelligent but extremely useful: a task that is painful in one domain may be trivial in another.

You can go from English to JS; from JS to a discovered CVE vulnerability; from a CVE to an exploit program.

Given a short description of the algorithm, a few types, and a few tests, the LLM very reliably produces code that satisfies all the requirements and contains NO bugs.

The source in .md files is 3 times smaller than in .ts files. The compression is striking, to say nothing of how fast the programmer can write it: overall, development with an LLM was 6 times faster purely because you do not have to spell out every detail of the algorithm

A good example of an ideal translation:

SDD chat / SDD program

3. LLM code must be validated the same way as any other code: through tests. But an LLM cannot write its own tests

An LLM writes tests so that they are easier for it to pass, which is both funny and sad.

A few hand-crafted cases plus PBT (property-based testing) give a far better result than 1000 lines of LLM-written slop tests.

Tests should be written so that they are easy to read yet still check the gnarly logic. And you do need to read them: they are the main proof that the work was done correctly

Examples of bad tests:

Loop tests: the code has bugs, but the tests are picked so that they stay green
Spec-kit tests: a lot of text, super painful to validate
OpenSpec tests: despite looking thorough, they do not test the overlap text
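One way to cover the overlap gap named in the last item is a small checker that looks at the overlap region of every chunk. This is only a sketch: the `Chunk` shape (`start`, `mid`, `end`, `text`) is borrowed from the hand-written implementation in the appendix, while `checkOverlap` and its rule "a letter or digit on both sides of `start` means a cut word" are assumptions made for illustration.

```typescript
// Chunk shape as in the hand-written implementation:
// [start, mid) is the overlap, [mid, end) is the new text.
type Chunk = { start: number; mid: number; end: number; text: string };

// Hypothetical helper: returns readable problems; an empty array means OK.
function checkOverlap(sourceText: string, chunks: Chunk[], maxOverlap: number): string[] {
  const problems: string[] = [];
  // Assumption: "word characters" are Latin/Cyrillic letters and digits.
  const isWordChar = (ch: string | undefined): boolean =>
    ch !== undefined && /[A-Za-z0-9\u0400-\u04FF]/.test(ch);

  chunks.forEach((chunk, i) => {
    const overlapText = sourceText.slice(chunk.start, chunk.mid);
    const prev = i > 0 ? chunks[i - 1] : undefined;

    if (overlapText.length > maxOverlap)
      problems.push(`chunk ${i}: overlap is longer than ${maxOverlap}`);
    if (!chunk.text.startsWith(overlapText))
      problems.push(`chunk ${i}: text does not begin with its overlap`);
    if (prev && chunk.mid !== prev.end)
      problems.push(`chunk ${i}: does not continue where chunk ${i - 1} ended`);
    // A non-empty overlap must not begin in the middle of a word.
    if (
      chunk.start < chunk.mid &&
      isWordChar(sourceText[chunk.start - 1]) &&
      isWordChar(sourceText[chunk.start])
    )
      problems.push(`chunk ${i}: overlap starts in the middle of a word`);
  });
  return problems;
}

// Usage on the chunks expected by the hand-written base test case.
const sampleSource = "skip some. Hello world. Some long text";
const goodChunks: Chunk[] = [
  { start: 5, mid: 11, end: 24, text: "some. Hello world. " },
  { start: 17, mid: 24, end: 38, text: "world. Some long text" },
];
console.log(checkOverlap(sampleSource, goodChunks, 9)); // []
```

The point of returning strings rather than asserting is that the output stays readable at a glance, which is what the hand-verification step needs.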

The difference between Thinking and Translating

Without any doubt, an LLM is a Chinese room.

The contrast between "thinking" and "translating between domains" is striking. It is why Claude sometimes seems dumb and sometimes seems like a genius. It is why Claude must not be allowed to write .md files.

Visually, both "thinking" and "translating" are just text, some prompt for the LLM. But in one case you do not know what needs to be done, and in the other you know but it is a pain to type out.

If the LLM does not know the right answer, it will not come up with it. It might guess by accident, but our domains are so large that a lucky guess is unrealistic.

Of course, you can send the LLM off to google or to ask someone else (other LLMs, lol), but on its own it will not add 2 + 2 correctly unless it has seen 2 + 2 = 4 millions of times.

Real leverage appears when you describe the task imprecisely, with plenty of gaps and holes, but with obvious shapes that fit those holes.

For example, give the LLM a couple of types, one hand-picked test case, and some property for PBT, so that all these ideas intersect in a multidimensional space and pin down a solution that is clear to the LLM.

And this is where our friend Claude shines. However much I train vim and touch typing, I will never be able to produce code 10x faster.

These holes, gaps, and assumptions are what create the leverage. More holes, more leverage.

In JS or C you have to describe every step of the algorithm, every bracket: a huge amount of detail. But if you describe the problem in the thinking domain of .md, outlining the shape of the solution through a clever intersection of ideas, the details are not needed at all.

The LLM translates the .md into the code domain and restores all the brackets and dots. I write a program 6 times faster in .md than in TS, and I have not even practiced writing short .md files yet. A cursory look showed that a significant part of the file can be deleted and Claude will still produce the same correct solution.

A small .md, written by a human in completely illiterate broken English but full of interestingly intersecting ideas, can unambiguously define a hefty working algorithm in code + tests + monitoring + performance + security.

At the same time, a huge, pretty, ornate wall of LLM text will lead to an insanely dumb bug. That is exactly why you should never let an LLM write specs and .md files.

The LLM will NOT write what it does not know, because it cannot think. But it will write what everyone already knows, and will clog its own context and the programmer's with meaningless information. A HUGE amount of meaningless, nicely formatted information.

For me, the LLM produced specs 4 times larger than what I wrote myself, and I have not practiced writing small specs yet; the figure could easily reach 10x.

You can also set the LLM to do "research": google, download source code, read it, and produce a condensed report. Remember that this is not real human research: any novelty in the report comes from search and indexing, not from a thought process.

The LLM will read a pile of text and can find a needle in a huge haystack of information (a guy from Anthropic bragged that he found a vulnerability in the Linux kernel). What the LLM cannot do is draw conclusions.

So there is no point in stuffing the report itself into the .md. Read it, draw the conclusions yourself, and write in broken English "chunks never break words if can!".

The developer's skill becomes: how do you write a small, lazy md that is easy to verify? Extracting the essence of an algorithm is an art in itself. What worked well for me:

Describe part of the shape of the input and output data. Think about making the algorithm easy to test; the inputs and outputs should help with that
An interesting hand-written test case
Interesting PBT properties
The tests should be verifiable by eye over a beer: few of them, visually compact, dense, written by you
You can and should compile the md several times from a clean context and watch what problems the LLM runs into. It would be neat to set up a watch script and recompile on every file change

You will be surprised how little you can describe and how much you get.
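The watch-script idea from the list above could look roughly like this under Node.js. It is a sketch, not something the experiments used: the prompt wording, the spec file name, and the 500 ms debounce are assumptions, and it relies on the `claude` CLI's print mode (`-p`) running one prompt non-interactively. Each spawn is a new process, so every recompile starts from a clean context.

```typescript
import { watch } from "node:fs";
import { spawn } from "node:child_process";

// Hypothetical prompt; the wording is an assumption, not the author's.
function buildClaudeArgs(specPath: string): string[] {
  return [
    "-p", // print mode: run a single prompt and exit
    `Read ${specPath} and regenerate the implementation and tests from it.`,
    "--dangerously-skip-permissions",
  ];
}

// Collapse bursts of file events; editors often save in several steps.
function debounce(fn: () => void, ms: number): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return () => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(fn, ms);
  };
}

// Recompile the spec on every change, one fresh claude process per run.
function watchSpec(specPath: string): void {
  const recompile = debounce(() => {
    const child = spawn("claude", buildClaudeArgs(specPath), { stdio: "inherit" });
    child.on("error", (err) => console.error(`could not start claude: ${err.message}`));
  }, 500);
  watch(specPath, recompile);
}

// watchSpec("./splitStringOnChunks.md"); // hypothetical spec file name
```

Runs are not serialized here: a save during a long compile starts a second process, which a real setup would probably want to prevent.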

Development flow

Most popular LLM tools (planning mode, spec-kit, OpenSpec, Dex Horthy's Research Plan Implement) follow this flow:

Programmer writes in chat
LLM writes md
Programmer verifies md
LLM reads md and writes code and tests
Programmer verifies code

Which, to me, is wildly inefficient.

Keep in mind, though, that these are also Macro tools, aimed at development that does not fit into a single context. But at their core they have exactly the same flow at the Micro level too.

The LLM writes hefty md files that in fact do nothing but look useful. The tests with Spec-Kit and OpenSpec clearly confirmed this. And still both the programmer and the LLM have to clog their context with these md files.

The same happens with LLM tests: they are numerous, monotonous, and ineffective.

The flow I propose:

Programmer writes md, with the tests in it
LLM writes code
Programmer verifies tests

We simply remove an extra step: instead of chatting, we put the specs straight into md. Division of labor: the human thinks, the LLM writes. It is more convenient for everyone: the human has less to read, less to write, and an easier time verifying. The LLM does not have to think and has less to read.

Verifying the program code itself is hard, because algorithms, unlike tests, are tricky and large. But you can skim it for some surface-level insight or for anything truly wild.

Methodology

On a mac mini over ssh/tmux

Create separate users who cannot see each other's files
Install a separate Claude for each user
The Claudes are tied to one account, because I have no money for 10 Claudes. Here we assume the Claudes do not talk to each other over the internet or give each other hints
Run Claude as claude --dangerously-skip-permissions
Implement the same algorithm
Compare the implementations with each other

We measure:

Time spent by the programmer
Context size
Errors

Notes:

We will not think about how to manage context. The algorithm is small and fits easily
For reliability I ran each case several times and made sure it behaves consistently. The article is long enough as it is, so I will not attach everything

Algorithm

For RAG I needed embeddings of chat logs. A pile of chats has to be cut into roughly equal pieces and fed to an LLM to get vectors for search. But if you cut naively by string length, the phrase

Why did the prod server crash yesterday

may be cut into "crash ye" and "sterday", which the LLM will make poor sense of. So we need to cut not only by length but also by separators: line breaks, spaces, punctuation
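The naive cut is easy to reproduce. A minimal sketch, where `splitByLength` and the piece size of 8 are illustrative choices rather than anything from the experiments:

```typescript
// Cut text into fixed-size pieces with no regard for word boundaries.
function splitByLength(text: string, size: number): string[] {
  if (size <= 0) throw new Error("size must be positive");
  const pieces: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    pieces.push(text.slice(i, i + size));
  }
  return pieces;
}

console.log(splitByLength("Why did the prod server crash yesterday", 8));
// [ 'Why did ', 'the prod', ' server ', 'crash ye', 'sterday' ]
```

The last two pieces split "yesterday" in the middle, which is exactly the kind of chunk that embeds badly.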

But if we do that, the chat log:

- the time has come
- will you restart the prod server?
- ok, fine

gets cut into "- will you restart the prod server?" and "- ok, fine"

"ok, fine" carries little information without context, so chunks must not be cut back to back. A small overlap is needed. Ideally we should get chunks like:

Chunk 1

- the time has come
- will you restart the prod server?

Chunk 2

- will you restart the prod server?
- ok, fine

Now the chunks have context and search works better.
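The overlap idea can be shown on whole lines before bringing in characters and separators. This is a deliberately simplified sketch: `chunkLinesWithOverlap` is a hypothetical helper that counts lines, whereas the actual algorithm described next works on character budgets.

```typescript
// Group lines into chunks of `linesPerChunk`, repeating the last
// `overlapLines` lines of each chunk at the start of the next one.
function chunkLinesWithOverlap(
  lines: string[],
  linesPerChunk: number,
  overlapLines: number,
): string[][] {
  if (linesPerChunk <= 0) throw new Error("linesPerChunk must be positive");
  if (overlapLines < 0 || overlapLines >= linesPerChunk)
    throw new Error("overlapLines must be in [0, linesPerChunk)");
  if (lines.length === 0) return [];

  const step = linesPerChunk - overlapLines; // how far each chunk advances
  const chunks: string[][] = [];
  let i = 0;
  do {
    chunks.push(lines.slice(i, i + linesPerChunk));
    i += step;
  } while (i + overlapLines < lines.length); // stop once nothing new is left
  return chunks;
}

const chatLog = [
  "- the time has come",
  "- will you restart the prod server?",
  "- ok, fine",
];
console.log(chunkLinesWithOverlap(chatLog, 2, 1));
// [ [ '- the time has come', '- will you restart the prod server?' ],
//   [ '- will you restart the prod server?', '- ok, fine' ] ]
```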

Algorithm details

The left boundary of the overlap is the leftmost separator, and it can never cut words. The separator itself is not included in the overlap
The right boundary of the text is the rightmost occurrence of the highest-priority separator that is present; it may cut words (if there are no separators). The text includes the separator
A max chunk length can be set
A max overlap length can be set
Whenever possible we first try to add overlap to the chunk, and spend the remaining budget on moving forward
The position in the text where chunking starts can be set, so that the first chunk can have overlap
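The two boundary rules can be written as a pair of small functions. This sketch follows the same logic as the hand-written implementation in the appendix, but the names `overlapStart` and `rightBoundary` and the split into two helpers are assumptions made for illustration.

```typescript
// Left boundary: start right after the leftmost separator found within
// `overlap` characters before `mid`. No separator means no overlap, so a
// word is never cut on the left.
function overlapStart(text: string, mid: number, overlap: number, separators: string[]): number {
  const from = mid - overlap;
  if (from <= 0) return 0; // the window reaches the start of the text
  const region = text.slice(from, mid);
  let start = mid;
  for (const sep of separators) {
    const idx = region.indexOf(sep);
    if (idx >= 0) start = Math.min(start, from + idx + sep.length);
  }
  return start;
}

// Right boundary: go as far as the budget allows, then back off to the last
// occurrence of the first (highest-priority) separator that is present.
function rightBoundary(text: string, mid: number, budget: number, separators: string[]): number {
  const hardEnd = Math.min(mid + budget, text.length);
  if (hardEnd === text.length) return hardEnd;
  const region = text.slice(mid, hardEnd);
  for (const sep of separators) {
    const pos = region.lastIndexOf(sep);
    if (pos >= 0) return mid + pos + sep.length; // separator stays in the chunk
  }
  return hardEnd; // no separator at all: allowed to cut a word
}

// First chunk of the base test case: start=11, chunkLen=30, overlap=9.
const sampleText = "skip some. Hello world. Some long text";
const sampleSeps = [". ", ".", " "];
const firstStart = overlapStart(sampleText, 11, 9, sampleSeps); // 5
const firstEnd = rightBoundary(sampleText, 11, 30 - (11 - firstStart), sampleSeps); // 24
console.log(JSON.stringify(sampleText.slice(firstStart, firstEnd))); // "some. Hello world. "
```

Note the order: the overlap is resolved first, and only the budget left after it (`30 - 6 = 24` here) is spent on moving forward.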

Further research

While it is now clear how to produce code with an LLM at the micro level, it is not yet clear how to scale this to the macro level, or how to add SDD to an existing mature project.

It would also be interesting to figure out how to build an LLM agent and give it RAG so that non-programmers can send complex queries against company data. The main problem there is keeping the LLM from lying, since ad-hoc queries have no automated tests. Perhaps each query needs a way to verify it attached, but that complicates the queries.

Conclusions

The advanced programmer is disappointed in the LLM: it solved the problem, but did so by duplicating code. Tragedy of the century. A C compiler inlines and duplicates functions all day long; that is not a bug but a feature.

Duplication is harmful in the thinking domain, because mutations become fragile and the system gets harder to understand. But in the raw-code domain it can even improve performance. The advanced programmer is butthurt, and I understand him.

When I write C, I do not know how it will be translated into machine code under the hood. I have a rough idea of the algorithms, and I also know that everything is covered by brutal tests verified by people. C can be trusted.

You can treat an LLM the same way. I write the algorithms and architecture in .md, and they are converted to code. The code is covered by brutal tests. I also write the tests myself in .md. In that case the LLM's code can be trusted. In that special case you can stop worrying about understanding every line of code.

This is where PBT and other advanced testing methods come to the fore. Maybe a 100% coverage report. Definitely protocol verification as in TS. Maybe guided fuzz testing. Performance tests. You can watch for memory leaks. You can feed in production logs and build complex self-repair pipelines.

There is always a huge temptation to ask the LLM to write the md files and get 100x, but that will simply nuke the system. You need a clear understanding of the LLM's duties and the programmer's.

With division of labor and an understanding of its role, an LLM gives 10x leverage. Programmers are needed more than ever.

Appendix: experiments

Approach #0: Writing the code ourselves

Time 1:30
Success, 0 errors
No context size, no LLM

We drink invigorating Chinese tea and write the code by hand. As always, the task looked like 10 minutes but reality stretched it to 1:30. Time estimation in development is unreliable.

Both unit and property-based tests were written. I am fairly confident the program works as expected in 99.9% of cases. If there are bugs, they are obscure ones, for example the overlap can never reach its maximum because the separator is dropped.

The code quality is premium, but development is extremely tedious. You have to specify every detail and type in all the code, which is slow and dreary.

Important: how we will validate the other experiments

The tests written here will be used to check the other algorithms. We will never show them to Claude; we use them purely for validation.

Manual program


import assert from 'assert'

export type Chunk = {
  start: number
  mid: number
  end: number
  text: string
}

export const splitStringOnChunks = ({
  text,
  start: textStart = 0,
  separators,
  chunkLen,
  overlap = 0,
}: {
  text: string
  start?: number
  separators: string[]
  chunkLen: number
  overlap?: number
}): Chunk[] => {
  assert(chunkLen > 0)
  assert(textStart >= 0)
  assert(separators.length > 0)
  assert(separators.every(Boolean))
  assert(overlap >= 0)
  assert(overlap < chunkLen)

  const chunks: Chunk[] = []
  if (!text) return chunks

  let mid = textStart

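  while (mid < text.length) {
    // Left boundary: look back up to `overlap` chars and start right after the
    // leftmost separator found, so the overlap never begins mid-word.
    // If the window reaches the start of the text, the overlap starts at 0.
    let start = 0
    const overlapFrom = mid - overlap
    if (overlapFrom > 0) {
      start = mid
      const potOverlap = text.slice(overlapFrom, mid)
      for (const sep of separators) {
        let idx = potOverlap.indexOf(sep)
        if (idx < 0) continue
        idx += overlapFrom + sep.length
        start = idx < start ? idx : start
      }
    }

    // Right boundary: spend what is left of chunkLen after the overlap, then
    // back off to the last occurrence of the highest-priority separator present.
    // With no separator in range the chunk is cut at the budget, mid-word if needed.
    let end = Math.min(mid + chunkLen - (mid - start), text.length)
    if (end < text.length) {
      const potText = text.slice(mid, end)
      for (const sep of separators) {
        const pos = potText.lastIndexOf(sep)
        if (pos >= 0) {
          end = mid + pos + sep.length
          break
        }
      }
    }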

    chunks.push({
      start,
      mid,
      end,
      text: text.slice(start, end),
    })

    assert(end > mid)

    mid = end
  }

  return chunks
}

Manual tests


import fc from 'fast-check'
import { it, expect } from 'vitest'
import { splitStringOnChunks } from './splitStringOnChunks'

it('splitStringOnChunks: base cases', () => {
  expect(
    splitStringOnChunks({
      text: 'skip some. Hello world. Some long text',
      start: 11,
      chunkLen: 30,
      overlap: 9,
      separators: ['. ', '.', ' '],
    }).map((x) => x.text),
  ).toStrictEqual(['some. Hello world. ', 'world. Some long text'])

  expect(
    splitStringOnChunks({
      text: 'skip some. Hello world. Some long text',
      start: 11,
      chunkLen: 30,
      overlap: 9,
      separators: [' ', '.', '. '],
    }).map((x) => x.text),
  ).toStrictEqual(['some. Hello world. Some long ', 'long text'])
})

const paramsGen = fc.record({
  text: fc.string({ minLength: 0, maxLength: 10 }),
  start: fc.integer({ min: 0 }),
  chunkLen: fc.integer({ min: 0 }),
  overlap: fc.integer({ min: 0 }),
  separators: fc.array(fc.string({ minLength: 1, maxLength: 2 }), { minLength: 1, maxLength: 2 }),
})

it('splitStringOnChunks: pbt', () => {
  fc.assert(
    fc.property(fc.context(), paramsGen, (ctx, params) => {
      fc.pre(params.overlap <= params.chunkLen - 1)

      const chunks = splitStringOnChunks(params)
      ctx.log(JSON.stringify(chunks))

      // deterministic
      expect(chunks).toStrictEqual(splitStringOnChunks(params))

      // a nonexistent separator has no effect
      if (!params.text.includes('*'))
        expect(chunks).toStrictEqual(
          splitStringOnChunks({ ...params, separators: ['*', ...params.separators] }),
        )

      const effectiveText = params.text.slice(params.start)
      if (effectiveText) {
        // if there is text there is output
        expect(chunks.length).toBeGreaterThan(0)
      } else {
        // no output if there is nothing to chunk
        expect(chunks.length).toStrictEqual(0)
        return
      }

      // can recover the original text
      expect(chunks.map((x) => params.text.slice(x.mid, x.end)).join('')).toStrictEqual(
        effectiveText,
      )

      // chunks are sane
      chunks.forEach((chunk) => {
        expect(chunk.start).toBeGreaterThanOrEqual(0)
        expect(chunk.start).toBeLessThanOrEqual(chunk.mid)
        expect(chunk.mid).toBeLessThan(chunk.end)

        expect(params.text.slice(chunk.start, chunk.end)).toStrictEqual(chunk.text)
        expect(chunk.text.length).toBeLessThanOrEqual(params.chunkLen)
        expect(chunk.mid - chunk.start).toBeLessThanOrEqual(params.overlap)

        // a chunk that ends early should end with a separator
        if (chunk.end < params.text.length && chunk.text.length < params.chunkLen)
          expect(params.separators.some((x) => chunk.text.endsWith(x))).toBeTruthy()
      })
    }),
    {
      numRuns: 10_000,
    },
  )
})

Approach #1: Vibe coding: trying to one-shot the problem

Failure. There are errors that cannot be fixed

We test the dumbest option. In vibe-coding mode we describe the algorithm to Claude and move iteratively, checking by hand how the program works.

Key traits of vibe coding:

We think little
We do not let Claude write tests or automated checks and run them
We check the program ourselves

Vibe coding has two key problems:

Because there are no proper tests, Claude cannot diagnose and fix errors. Claude tries to run things itself, but it invents trivial cases that pass easily and in fact do not catch the problem. The program fails on the secret test cases from 'Writing the code ourselves'.

Also, Claude did not get all the requirements out of me up front. Looking at the program's behavior, I could not tell that the requirements were not met. That is a big problem too.

You will not get far on vibe coding.

Vibe coding chat


Initial request for the algorithm

I want to create a function `splitStringOnChunks()` in `./src/splitStringOnChunks.ts`
It split incoming text on chunks with overlap support for the future vector embeddings
It should support overlap, should allow to specify start position in text, shuld support chunkLen and overlap length
Should support separators. They should be ordered in desc of importance
Left boundary should just take the most left separator, if none, then no overlap
Right boundary should take the most important right separator, if none, then allowed to split words

It wrote the code

run it with interesting parameters and output to ./out.json

Hard to tell whether it works

too big strings. make small strings and without new lines, this will be easy for me to check

Cosmetic fixes

The left overlap boundary should not include separators

Found the first bug

I think there is a bug in this chunk and we should break on more important separator ". "
      {
      "index": 2,
      "start": 26,
      "end": 43,
      "length": 17,
      "text": "ggg hhh iii. jjj "
    },

Fixed, but I am not convinced; the examples are too simple

seems good. but check for bugs. and give some other examples of execution, I'm not sure

I see no bugs from the outside; I read the cases and they look good

are we done?

Everything looked fine from the outside, but when we ran the tests from 'Writing the code ourselves', errors surfaced

I wounder how the algo works regard budget for the right side. Do we initially try to match overlap and then use unspent budget to
match right side? Or does it work other way around?

Yes, we simply misunderstood each other. This requirement was never stated, but I had it in mind. The examples shown did not include this case, so we missed it. A screw-up.

Yes, I want other way around, I want initially try to match overlap, and then try to match right boudnary

It says it fixed it, but in fact the same error remains.

are you sure it works?

No, it is not sure; it tweaked something in the code. This is where the tedious grind with Claude begins: you try to talk it into fixing one obvious problem and it stalls. It does not fix it.

show me the example when we first try to match left overlap and the right part and it clearly seen that we used some budget first on the left overlap and the right one didn't consume as much as it could

And in reality everything works fine in the example it gave, while my secret example fails. The bug is there, but Claude flatly cannot see it, because it picked an example that does not expose it. So it sort of solved the problem without solving it.

This is the end of vibe coding; from here I have to think and show Claude examples I came up with myself. Wrapping up. I also want to show it the secret example, maybe that helps.

Run code on this example and tell me what you think
  
  splitStringOnChunks({
      text: 'skip some. Hello world. Some long text',
      start: 11,
      chunkLen: 30,
      overlap: 9,
      separators: ['. ', '.', ' '],
    }).map((x) => x.text),

It says it looked and everything is fine. I say: look, damn it, why does the first chunk have no overlap???

"Good catch. The first chunk skips overlap because of..." Ah, of course, my example helped it and everything works now. But that is no longer vibe coding.

Vibe coding program


  
export interface SplitStringOnChunksOptions {
  text: string;
  chunkLen: number;
  overlap?: number;
  start?: number;
  separators?: string[];
}

export interface Chunk {
  text: string;
  start: number;
  mid: number;
  end: number;
}

export function splitStringOnChunks(options: SplitStringOnChunksOptions): Chunk[] {
  const {
    text,
    chunkLen,
    overlap = 0,
    start = 0,
    separators = ["\n\n", "\n", ". ", " "],
  } = options;

  if (chunkLen <= 0) throw new Error("chunkLen must be positive");
  if (overlap < 0) throw new Error("overlap must be non-negative");
  if (overlap >= chunkLen) throw new Error("overlap must be less than chunkLen");
  if (start < 0 || start >= text.length) return [];

  const chunks: Chunk[] = [];
  let mid = start;

  while (mid < text.length) {
    // Step 1: Spend overlap budget — look back from mid to find left boundary
    let chunkStart = mid;
    if (overlap > 0 && chunks.length > 0) {
      const overlapStart = Math.max(0, mid - overlap);
      const overlapRegion = text.substring(overlapStart, mid);

      // Find the leftmost separator in the overlap region
      let leftmost = -1;
      for (const sep of separators) {
        const idx = overlapRegion.indexOf(sep);
        if (idx !== -1) {
          const candidate = overlapStart + idx;
          if (leftmost === -1 || candidate < leftmost) {
            leftmost = candidate;
          }
        }
      }

      if (leftmost !== -1) {
        chunkStart = leftmost + separators.find(s => text.startsWith(s, leftmost))!.length;
      }
    }

    const overlapUsed = mid - chunkStart;
    const forwardBudget = chunkLen - overlapUsed;

    // Step 2: Spend remaining budget forward — find right boundary
    const rawEnd = Math.min(mid + forwardBudget, text.length);
    let end = rawEnd;
    if (rawEnd < text.length) {
      // Look for the most important separator in the back half of the forward region
      let found = false;
      for (const sep of separators) {
        const searchStart = Math.max(mid, rawEnd - Math.floor(chunkLen / 2));
        const region = text.substring(searchStart, rawEnd);
        const lastIdx = region.lastIndexOf(sep);
        if (lastIdx !== -1) {
          end = searchStart + lastIdx + sep.length;
          found = true;
          break;
        }
      }
      if (!found) {
        end = rawEnd;
      }
    }

    chunks.push({ text: text.substring(chunkStart, end), start: chunkStart, mid, end });

    if (end >= text.length) break;
    mid = end;
  }

  return chunks;
}

Approach #1.1: Loop: make the program better, make the program better, make the program better

Failure. There are errors that cannot be fixed

We try working on the program in a loop of /clear + instruction. We apply the same mutation many times from a clean context and see what comes out.

We ask the LLM to take on all the intellectual work, all the high-level architectural decisions. Let Claude write our function from scratch without specs.

This approach gives a better result than vibe coding. But there are still subtle bugs that cannot be fixed. More importantly, the chosen algorithm is awful. Claude must not be given architectural decisions; it botches them.

Loop chat


Write a TypeScript function that will split text on chunks for embeddings. Test the function with tests.

Claude correctly understood that embeddings need both overlap and prioritized separators. It sketched a function and tests. Visually it all looks as civilized as can be. In practice it is complete trash:

"hello world lol internet" maxChunkSize=6 overlap=3 
turns into the chunks

"hello"
"lo"
"o"
"world"
"ld"
"d lol"
"ol"
"l"
"internet"

Something is wrong with the overlap. To be safe, let's repeat the loop a couple of times.

/clear

There is a function here, improve it
Write a TypeScript function that will split text on chunks for embeddings. Test the function with tests

We repeat 3 more times. The program grows with every iteration. After the first iteration we had 90 lines of program and 100 lines of tests. After the fourth: 420 lines of program and 690 lines of tests.

And the tokens are not burning in vain. The funny thing is that the iterative approach FIXED bugs. The program improves with iterations, whereas I expected it to degrade. However, there is a harder class of bugs that still does not get fixed no matter how much I iterate:

It is subtle; even explaining this bug is a pain.

If a chunk has room for overlap, we try to take some characters to the left of the chunk and add them to it
Suppose there are 2 separator points there, and the leftmost one would push the chunk over its size limit
While the second one would not, and could be used for the overlap
Then we skip the overlap, when we should simply have taken a smaller overlap

Example: in the first case the overlap is LARGER than in the second, and that BLOCKS the overlap:

maxChunkSize=12 overlap=6 text=hello a a worldasdf

hello a a
worldasdf

maxChunkSize=12 overlap=3 text=hello a a worldasdf

hello a a
a worldasdf
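The behavior that was expected here, falling back to a smaller overlap instead of dropping it, can be sketched as follows. `fitOverlap` is a hypothetical helper, it treats a single space as the only separator, and it is not taken from Claude's program or from the hand-written one.

```typescript
// Given a segment [segStart, segEnd), return where the chunk should start so
// that it carries the largest overlap that (a) lies within `overlap`
// characters before the segment, (b) begins at a word start, and
// (c) keeps the whole chunk within maxChunkSize.
function fitOverlap(
  text: string,
  segStart: number,
  segEnd: number,
  maxChunkSize: number,
  overlap: number,
): number {
  const from = Math.max(0, segStart - overlap);
  // Scan left to right: the first candidate that fits is the largest overlap.
  for (let i = from; i < segStart; i++) {
    const atWordStart = (i === 0 || text[i - 1] === " ") && text[i] !== " ";
    const fits = segEnd - i <= maxChunkSize;
    if (atWordStart && fits) return i;
  }
  return segStart; // nothing fits: no overlap
}

const sampleLine = "hello a a worldasdf";
// The segment "worldasdf" occupies [10, 19).
console.log(sampleLine.slice(fitOverlap(sampleLine, 10, 19, 12, 6))); // a worldasdf
console.log(sampleLine.slice(fitOverlap(sampleLine, 10, 19, 12, 3))); // a worldasdf
```

With overlap=6 the leftmost candidate ("a a worldasdf", 13 characters) does not fit into 12, so the scan moves on to the next word start instead of giving up.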

We try to shrink the program and add PBT. Several iterations

There is a function here, improve it
Write a TypeScript function that will split text on chunks for embeddings
Make the function and tests more compact. There is so much code, I think lots of stuff is excessive and redundant
Remove redundant stuff

После обжимания: 200 строк программы и 117 строк теста. Однако программа не стала менее тупой:

Баги сохраняются
Сама фундаментальная идея оверлапа неверна

При выбранном алгоритме (даже с волшебным исправлением багов) огромная часть чанков будет тупо скипать оверлап. Фактически алгоритм работает на треть мощности. То есть нельзя сделать оверлап 20% и все ваши чанки будут с ним, это получилось совсем не так:

Если при создании чанка у него осталась часть бюджета от maxChunkSize
То эта часть может быть потрачена на оверлап в пределах "overlap" бюджета.

Это безумный дизайн. Очень тупое решение выбрано Клодом, и от него итерациями не уйти, он в нем застрял. Полный провал. Фундаментально Клод не умеет мыслить, есть опасение, что как ни промптить, Клод не сможет придумывать себе верные спеки.

Loop: program

Details

export interface ChunkOptions {
  maxChunkSize?: number;
  minChunkSize?: number;
  overlap?: number;
  separators?: string[];
  sizeOf?: (text: string) => number;
}

export interface Chunk {
  text: string;
  start: number;
  end: number;
}

interface Span {
  start: number;
  end: number;
}

const DEFAULTS: Required<ChunkOptions> = {
  maxChunkSize: 1000,
  minChunkSize: 100,
  overlap: 200,
  separators: ["\n# ", "\n## ", "\n### ", "\n\n", "\n", ". ", "? ", "! ", " "],
  sizeOf: (s: string) => s.length,
};

function measure(text: string, start: number, end: number, sizeOf: (s: string) => number): number {
  return sizeOf(text.slice(start, end));
}

export function chunkTextWithMeta(text: string, options?: ChunkOptions): Chunk[] {
  const { maxChunkSize, minChunkSize, overlap, separators, sizeOf } = { ...DEFAULTS, ...options };

  if (maxChunkSize <= 0) throw new Error("maxChunkSize must be positive");
  if (minChunkSize < 0) throw new Error("minChunkSize must be non-negative");
  if (overlap < 0) throw new Error("overlap must be non-negative");
  if (overlap >= maxChunkSize) throw new Error("overlap must be less than maxChunkSize");

  const trimmedFull = trimSpan(text, 0, text.length);
  if (!trimmedFull.text) return [];

  if (measure(text, trimmedFull.start, trimmedFull.end, sizeOf) <= maxChunkSize) {
    return [trimmedFull];
  }

  const segments = recursiveSplit(text, 0, text.length, maxChunkSize, separators, sizeOf);

  const trimmed = segments
    .map((s) => trimSpan(text, s.start, s.end))
    .filter((s) => s.text.length > 0);

  const chunks = mergeSmall(text, trimmed, minChunkSize, maxChunkSize, sizeOf);

  if (overlap === 0 || chunks.length <= 1) return chunks;

  return applyOverlap(text, chunks, overlap, maxChunkSize, separators, sizeOf);
}

export function chunkText(text: string, options?: ChunkOptions): string[] {
  return chunkTextWithMeta(text, options).map((c) => c.text);
}

function recursiveSplit(
  text: string, start: number, end: number,
  maxChunkSize: number, separators: string[], sizeOf: (s: string) => number,
): Span[] {
  if (measure(text, start, end, sizeOf) <= maxChunkSize) return [{ start, end }];

  for (let sepIdx = 0; sepIdx < separators.length; sepIdx++) {
    const parts = splitOn(text, start, end, separators[sepIdx]);
    if (parts.length <= 1) continue;

    // Greedily merge parts into segments that fit
    const segments: Span[] = [];
    let segStart = parts[0].start, segEnd = parts[0].end;
    for (let i = 1; i < parts.length; i++) {
      if (measure(text, segStart, parts[i].end, sizeOf) <= maxChunkSize) {
        segEnd = parts[i].end;
      } else {
        segments.push({ start: segStart, end: segEnd });
        segStart = parts[i].start;
        segEnd = parts[i].end;
      }
    }
    segments.push({ start: segStart, end: segEnd });

    const finer = separators.slice(sepIdx + 1);
    const result: Span[] = [];
    for (const seg of segments) {
      if (measure(text, seg.start, seg.end, sizeOf) > maxChunkSize) {
        result.push(...(finer.length > 0
          ? recursiveSplit(text, seg.start, seg.end, maxChunkSize, finer, sizeOf)
          : hardSplit(text, seg.start, seg.end, maxChunkSize, sizeOf)));
      } else {
        result.push(seg);
      }
    }
    return result;
  }

  return hardSplit(text, start, end, maxChunkSize, sizeOf);
}

function splitOn(text: string, start: number, end: number, sep: string): Span[] {
  const parts: Span[] = [];
  let pos = start;
  while (pos < end) {
    const idx = text.indexOf(sep, pos);
    if (idx === -1 || idx >= end) { parts.push({ start: pos, end }); break; }
    parts.push({ start: pos, end: idx + sep.length });
    pos = idx + sep.length;
  }
  return parts;
}

function hardSplit(
  text: string, start: number, end: number,
  maxChunkSize: number, sizeOf: (s: string) => number,
): Span[] {
  const result: Span[] = [];
  let pos = start;

  while (pos < end) {
    if (measure(text, pos, end, sizeOf) <= maxChunkSize) { result.push({ start: pos, end }); break; }
    let lo = pos + 1, hi = end;
    while (lo < hi) {
      const mid = lo + Math.floor((hi - lo + 1) / 2);
      if (measure(text, pos, mid, sizeOf) <= maxChunkSize) lo = mid; else hi = mid - 1;
    }
    result.push({ start: pos, end: lo });
    pos = lo;
  }
  return result;
}

function trimSpan(text: string, start: number, end: number): { text: string; start: number; end: number } {
  const raw = text.slice(start, end);
  const trimmed = raw.trim();
  if (!trimmed) return { text: "", start, end: start };
  const lead = raw.length - raw.trimStart().length;
  const trail = raw.length - raw.trimEnd().length;
  return { text: trimmed, start: start + lead, end: end - trail };
}

function mergeSmall(
  text: string,
  segments: { text: string; start: number; end: number }[],
  minChunkSize: number, maxChunkSize: number, sizeOf: (s: string) => number,
): { text: string; start: number; end: number }[] {
  if (segments.length <= 1 || minChunkSize === 0) return segments;

  const result: { text: string; start: number; end: number }[] = [];
  for (const seg of segments) {
    if (sizeOf(seg.text) >= minChunkSize || result.length === 0) {
      result.push({ ...seg });
      continue;
    }
    const prev = result[result.length - 1];
    const merged = trimSpan(text, prev.start, seg.end);
    if (measure(text, merged.start, merged.end, sizeOf) <= maxChunkSize) {
      result[result.length - 1] = merged;
    } else {
      result.push({ ...seg });
    }
  }
  return result;
}

function applyOverlap(
  text: string, chunks: Chunk[], overlap: number,
  maxChunkSize: number, separators: string[], sizeOf: (s: string) => number,
): Chunk[] {
  const result: Chunk[] = [chunks[0]];

  for (let i = 1; i < chunks.length; i++) {
    const chunk = chunks[i];
    const rawStart = Math.max(chunk.start - overlap, chunks[i - 1].start);

    const snapped = snapToSep(text, rawStart, chunk.start, separators);
    const trimmed = trimSpan(text, snapped, chunk.end);
    if (!trimmed.text) continue;

    result.push(measure(text, trimmed.start, trimmed.end, sizeOf) > maxChunkSize ? chunk : trimmed);
  }
  return result;
}

function snapToSep(text: string, pos: number, bound: number, separators: string[]): number {
  const region = text.slice(pos, bound);
  for (const sep of separators) {
    const idx = region.indexOf(sep);
    if (idx !== -1) return pos + idx + sep.length;
  }
  return pos;
}

Loop: tests

Details

import { describe, it, expect } from "vitest";
import { chunkText, chunkTextWithMeta } from "./chunk-text.js";

describe("chunkText", () => {
  it("returns empty for empty/whitespace input", () => {
    expect(chunkText("")).toEqual([]);
    expect(chunkText("   \n\n   ")).toEqual([]);
  });

  it("returns single chunk when text fits", () => {
    expect(chunkText("Short text.", { maxChunkSize: 500, overlap: 0 })).toEqual(["Short text."]);
  });


  it("splits on sentence boundaries when paragraphs are too large", () => {
    const text = "First sentence. Second sentence. Third sentence.";
    expect(chunkText(text, { maxChunkSize: 25, overlap: 0, minChunkSize: 0 })).toEqual([
      "First sentence.",
      "Second sentence.",
      "Third sentence.",
    ]);
  });

  it("falls through to finer separators for oversized chunks", () => {
    const text =
      "Short paragraph.\n\n" +
      "This is a much longer paragraph with multiple sentences. " +
      "It needs to be split on sentence boundaries. " +
      "Because paragraph splitting alone is not enough.\n\n" +
      "Final paragraph.";
    const result = chunkText(text, { maxChunkSize: 60, overlap: 0, minChunkSize: 0 });
    result.forEach((c) => expect(c.length).toBeLessThanOrEqual(60));
    expect(result.some((c) => c.includes("Short paragraph"))).toBe(true);
    expect(result.some((c) => c.includes("Final paragraph"))).toBe(true);
  });

  it("hard-cuts when no separators match", () => {
    expect(
      chunkText("abcdefghijklmnopqrstuvwxyz", { maxChunkSize: 10, overlap: 0, minChunkSize: 0, separators: [] }),
    ).toEqual(["abcdefghij", "klmnopqrst", "uvwxyz"]);
  });

  it("respects custom separators", () => {
    expect(
      chunkText("part1|part2|part3", { maxChunkSize: 10, overlap: 0, minChunkSize: 0, separators: ["|"] }),
    ).toEqual(["part1|", "part2|", "part3"]);
  });

  it("merges small fragments into neighbors", () => {
    expect(chunkText("Hello world. Bye", { maxChunkSize: 18, overlap: 0, minChunkSize: 5 })).toEqual([
      "Hello world. Bye",
    ]);
  });

  it("splits on markdown headers", () => {
    const text = "# Intro\n\nIntro text.\n\n## Section One\n\nContent one.\n\n## Section Two\n\nContent two.";
    const result = chunkText(text, { maxChunkSize: 40, overlap: 0, minChunkSize: 0 });
    expect(result.some((c) => c.includes("Intro"))).toBe(true);
    expect(result.some((c) => c.includes("Section One"))).toBe(true);
    expect(result.some((c) => c.includes("Section Two"))).toBe(true);
  });

  it("throws on invalid options", () => {
    expect(() => chunkText("hello", { maxChunkSize: 0 })).toThrow("maxChunkSize must be positive");
    expect(() => chunkText("hello", { overlap: -1 })).toThrow("overlap must be non-negative");
    expect(() => chunkText("hello", { minChunkSize: -1 })).toThrow("minChunkSize must be non-negative");
    expect(() => chunkText("hello", { maxChunkSize: 10, overlap: 10 })).toThrow("overlap must be less than");
  });
});

describe("overlap", () => {
  it("consecutive chunks share overlapping content", () => {
    const text =
      "Alpha bravo charlie. Delta echo foxtrot. Golf hotel india. " +
      "Juliet kilo lima. Mike november oscar.";
    const result = chunkTextWithMeta(text, { maxChunkSize: 50, overlap: 20, minChunkSize: 0 });
    expect(result.some((c, i) => i > 0 && c.start < result[i - 1].end)).toBe(true);
  });

  it("snaps overlap to separator boundaries", () => {
    const text = "Sentence one here. Sentence two here. Sentence three here.";
    const result = chunkTextWithMeta(text, { maxChunkSize: 25, overlap: 15, minChunkSize: 0 });
    for (const chunk of result) {
      expect(chunk.text).toMatch(/^[A-Z]/);
      expect(text.slice(chunk.start, chunk.end)).toBe(chunk.text);
    }
  });
});

describe("custom sizeOf", () => {
  const wordCount = (s: string): number => {
    const t = s.trim();
    return t.length === 0 ? 0 : t.split(/\s+/).length;
  };

  it("chunks by word count", () => {
    const text = "one two three four five six seven eight nine ten";
    const result = chunkText(text, { maxChunkSize: 4, overlap: 0, minChunkSize: 0, sizeOf: wordCount });
    result.forEach((c) => expect(wordCount(c)).toBeLessThanOrEqual(4));
    expect(result.join(" ")).toContain("ten");
  });

  it("hard-splits with custom sizeOf", () => {
    const result = chunkText("aaa bbb ccc ddd eee fff", {
      maxChunkSize: 2, overlap: 0, minChunkSize: 0, sizeOf: wordCount, separators: [],
    });
    result.forEach((c) => expect(wordCount(c)).toBeLessThanOrEqual(2));
  });
});

    
import { describe, it, expect } from "vitest";
import fc from "fast-check";
import { chunkText, chunkTextWithMeta } from "./chunk-text.js";

const validOptions = () =>
  fc
    .record({
      maxChunkSize: fc.integer({ min: 1, max: 500 }),
      overlapFraction: fc.double({ min: 0, max: 0.9, noNaN: true }),
      minChunkFraction: fc.double({ min: 0, max: 0.9, noNaN: true }),
    })
    .map(({ maxChunkSize, overlapFraction, minChunkFraction }) => ({
      maxChunkSize,
      overlap: Math.floor(maxChunkSize * overlapFraction),
      minChunkSize: Math.floor(maxChunkSize * minChunkFraction),
    }));

describe("property-based tests", () => {
  it("never produces chunks exceeding maxChunkSize", () => {
    fc.assert(
      fc.property(fc.string({ minLength: 1, maxLength: 2000 }), validOptions(), (text, opts) => {
        for (const chunk of chunkText(text, { ...opts, minChunkSize: 0 })) {
          expect(chunk.length).toBeLessThanOrEqual(opts.maxChunkSize);
        }
      }),
      { numRuns: 300 },
    );
  });

  it("preserves all non-whitespace content", () => {
    fc.assert(
      fc.property(fc.string({ minLength: 1, maxLength: 1000 }), (text) => {
        const chunks = chunkText(text, { maxChunkSize: 50, overlap: 0, minChunkSize: 0 });
        expect(chunks.join("").replace(/\s+/g, "")).toBe(text.replace(/\s+/g, ""));
      }),
      { numRuns: 300 },
    );
  });

  it("chunk metadata offsets match chunk text", () => {
    fc.assert(
      fc.property(fc.string({ minLength: 1, maxLength: 1000 }), validOptions(), (text, opts) => {
        for (const chunk of chunkTextWithMeta(text, opts)) {
          expect(text.slice(chunk.start, chunk.end)).toBe(chunk.text);
        }
      }),
      { numRuns: 300 },
    );
  });

  it("overlap=0 produces non-overlapping spans", () => {
    fc.assert(
      fc.property(fc.string({ minLength: 1, maxLength: 1000 }), (text) => {
        const chunks = chunkTextWithMeta(text, { maxChunkSize: 50, overlap: 0, minChunkSize: 0 });
        for (let i = 1; i < chunks.length; i++) {
          expect(chunks[i].start).toBeGreaterThanOrEqual(chunks[i - 1].end);
        }
      }),
      { numRuns: 200 },
    );
  });

  it("all chunks are trimmed", () => {
    fc.assert(
      fc.property(fc.string({ minLength: 1, maxLength: 1000 }), validOptions(), (text, opts) => {
        for (const chunk of chunkTextWithMeta(text, opts)) {
          expect(chunk.text).toBe(chunk.text.trim());
        }
      }),
      { numRuns: 200 },
    );
  });

  it("custom sizeOf is respected", () => {
    const wordCount = (s: string): number => {
      const trimmed = s.trim();
      return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length;
    };
    fc.assert(
      fc.property(
        fc.array(fc.constantFrom("alpha", "bravo", "charlie", "delta", "echo", "fox"), {
          minLength: 5,
          maxLength: 100,
        }),
        fc.integer({ min: 2, max: 10 }),
        (words, maxTokens) => {
          for (const chunk of chunkText(words.join(" "), { maxChunkSize: maxTokens, overlap: 0, minChunkSize: 0, sizeOf: wordCount })) {
            expect(wordCount(chunk)).toBeLessThanOrEqual(maxTokens);
          }
        },
      ),
      { numRuns: 200 },
    );
  });

});

Method #2 – Planning mode: let Claude question us

Time 0:40
Not a one-shot. There were errors, but they were fixed with my help
255 lines of tests were produced in total

With vibe coding we had two problems:

Claude did not find out all my requirements up front. And from the outside I had no way to diagnose this; on the examples everything looked fine
The program had bugs. Automated tests are absolutely necessary

Planning mode attacks the first problem: it lets Claude ask us questions and clarify the details that are not obvious to it. As for the second problem, we will simply try to have Claude propose a solution itself and come up with an approach that yields bug-free code. In the end it proposed tests and wrote them itself.

Claude honestly asked me for details and clarified the requirements before development started, thanks to it for that. However, a bug slipped into the code. And Claude flatly refused to see it as a bug until I rubbed its nose in it. This is a general problem, the same as in vibe coding.

When Claude invents test cases on its own, it slacks off. It invents cases that are easy for it to pass, not cases that make sure the program works correctly. And I have to wade through this mountain of tedium, reading pointless checks of the same thing, only to find a dumb mistake in the end.

Planning has another huge problem: during an iterative discussion of requirements Claude may randomly rewrite parts of the requirements in unexpected places. In practice you have to reread the requirements in full after every question from Claude, which is incredibly annoying.

Planning mode chat

Details

The initial request. We made sure planning mode was on. Let's try asking for best practices.

Write ./src/splitStringOnChunks/splitStringOnChunks.ts
It should export splitStringOnChunks() that splits text on chunks for embeddings
splitStringOnChunks accepts
    chunkLen - max chunk length
    overlap - overlap len
    start - position from where start chunking 
    separators - list of separators in desc of importance order

Use the best practice to produce reliable code that I will be able to confirm as valid

Then Claude and I talk back and forth for a while and discuss the approach. Claude proposes an algorithm that is not the one I like. We discuss it. It proposes tests. Everything looks fine. I agree to the plan.

180 lines of tests, readable. Visually everything works great, so let's check it against our secret tests from 'Writing the code ourselves'.

It fails on our tests, which is bad news. I do not want to give it our test cases, that would be cheating. I want it to find the error itself.

Can you check the original requirements and double check that they hold

I found a bug. On line 47, overlapWindowStart = currentPos - overlap can go negative (or before start) when currentPos is small relative to overlap. For example, if the first chunk is short and overlap is large:

OK, so we can just ask it this in a loop several times and wait until its solution settles. The problem here is that I cannot reset the context and start over: I would lose our requirements, because they are not written down anywhere. I can only keep asking it in a loop.

It says everything is fine. I run our tests and they fail.

Let me ask directly.

I think the overlap logic is wrong in cases, where we should first give chunkLen budget to overlap and then use the rest for chunk body. Can you check it with a test?

It wrote a lot of complicated tests that I cannot follow. I ask it to come up with one concrete case

no, I mean make a concrete test where you combine start/overlap and concrete strings in an interesting 1 case. I want to check an easy to observe case

I can see the error, but Claude just cannot see it. If start and overlap are both set, the first chunk will have no overlap

I think something is wrong with this case
Are you sure it is ok?

      const text = "SKIP|aa|bb|cc|dd|ee|ff";
      const result = splitStringOnChunks({
        text,
        chunkLen: 9,
        overlap: 6,
        start: 5,
        separators: ["|"],
      });

      assert.deepEqual(result, [
        { start: 5,  mid: 5,  end: 14, text: "aa|bb|cc|" },
        { start: 11, mid: 14, end: 20, text: "cc|dd|ee|" },
        { start: 17, mid: 20, end: 22, text: "ee|ff" },
      ]);

It did not get it. The problem resembles the one from vibe coding: the first chunk has no overlap.

why the first chunk has no overlap?

You're right — that's the bug. When start > 0, the first chunk should still attempt overlap into the text before start. There's text there ("SKIP|") that could provide context.

The fix is simple: remove the isFirstChunk guard. When start=0, the overlap window would naturally be empty (text[0..0]), producing no overlap anyway.

We could stop here. But let's run it through our tests once more. And the 'Writing the code ourselves' tests passed!

Planning mode program

Well, this one is really big. I bet it can be compressed in one iteration of refactoring or performance optimization; I have seen that happen before.

Details

export type Chunk = {
  start: number;
  mid: number;
  end: number;
  text: string;
};

export interface SplitStringOnChunksOptions {
  text: string;
  chunkLen: number;
  overlap: number;
  start?: number;
  separators?: string[];
}

const DEFAULT_SEPARATORS = ["\n\n", "\n", ". ", " "];

export function splitStringOnChunks(options: SplitStringOnChunksOptions): Chunk[] {
  const {
    text,
    chunkLen,
    overlap,
    start = 0,
    separators = DEFAULT_SEPARATORS,
  } = options;

  if (chunkLen <= 0) throw new Error("chunkLen must be greater than 0");
  if (overlap < 0) throw new Error("overlap must be non-negative");
  if (overlap >= chunkLen) throw new Error("overlap must be less than chunkLen");
  if (start < 0) throw new Error("start must be non-negative");

  if (start >= text.length) return [];

  const chunks: Chunk[] = [];
  let currentPos = start;

  while (currentPos < text.length) {
    let chunkStart: number;
    let overlapUsed: number;

    if (overlap === 0 || currentPos === 0) {
      chunkStart = currentPos;
      overlapUsed = 0;
    } else {
      // Step 1: resolve overlap — find leftmost separator in overlap window
      const overlapWindowStart = Math.max(0, currentPos - overlap);
      const overlapWindow = text.slice(overlapWindowStart, currentPos);

      let leftmostPos = -1;
      let leftmostSepLen = 0;

      for (const sep of separators) {
        const idx = overlapWindow.indexOf(sep);
        if (idx !== -1 && (leftmostPos === -1 || idx < leftmostPos)) {
          leftmostPos = idx;
          leftmostSepLen = sep.length;
        }
      }

      if (leftmostPos !== -1) {
        // Start after the separator
        chunkStart = overlapWindowStart + leftmostPos + leftmostSepLen;
        overlapUsed = currentPos - chunkStart;
      } else {
        // No separator found — no overlap
        chunkStart = currentPos;
        overlapUsed = 0;
      }
    }

    const mid = currentPos;

    // Step 2: resolve right boundary
    const remainingBudget = chunkLen - overlapUsed;
    const rightLimit = chunkStart + overlapUsed + remainingBudget; // = chunkStart + chunkLen

    if (rightLimit >= text.length) {
      // Remaining text fits in budget
      const chunkText = text.slice(chunkStart);
      if (chunkText.trim().length > 0) {
        chunks.push({ start: chunkStart, mid, end: text.length, text: chunkText });
      }
      break;
    }

    // Search backward from rightLimit for a separator (priority order)
    const searchWindow = text.slice(currentPos, rightLimit);
    let splitOffset = -1;
    let splitSepLen = 0;

    for (const sep of separators) {
      const idx = searchWindow.lastIndexOf(sep);
      if (idx !== -1) {
        splitOffset = idx;
        splitSepLen = sep.length;
        break; // first separator in list = highest priority
      }
    }

    if (splitOffset !== -1) {
      const absoluteSplitEnd = currentPos + splitOffset + splitSepLen;
      const chunkText = text.slice(chunkStart, absoluteSplitEnd);
      if (chunkText.trim().length > 0) {
        chunks.push({ start: chunkStart, mid, end: absoluteSplitEnd, text: chunkText });
      }
      currentPos = absoluteSplitEnd;
    } else {
      // Hard cut
      const chunkText = text.slice(chunkStart, rightLimit);
      if (chunkText.trim().length > 0) {
        chunks.push({ start: chunkStart, mid, end: rightLimit, text: chunkText });
      }
      currentPos = rightLimit;
    }

  }

  return chunks;
}

Planning mode tests

Details

  
import { describe, it } from "node:test";
import assert from "node:assert/strict";
import { splitStringOnChunks, type Chunk } from "./splitStringOnChunks.js";

function texts(chunks: Chunk[]): string[] {
  return chunks.map((c) => c.text);
}

describe("splitStringOnChunks", () => {
  describe("input validation", () => {
    it("throws if chunkLen <= 0", () => {
      assert.throws(
        () => splitStringOnChunks({ text: "hello", chunkLen: 0, overlap: 0 }),
        /chunkLen must be greater than 0/
      );
    });

    it("throws if overlap < 0", () => {
      assert.throws(
        () => splitStringOnChunks({ text: "hello", chunkLen: 10, overlap: -1 }),
        /overlap must be non-negative/
      );
    });

    it("throws if overlap >= chunkLen", () => {
      assert.throws(
        () => splitStringOnChunks({ text: "hello", chunkLen: 10, overlap: 10 }),
        /overlap must be less than chunkLen/
      );
    });

    it("throws if start < 0", () => {
      assert.throws(
        () => splitStringOnChunks({ text: "hello", chunkLen: 10, overlap: 0, start: -1 }),
        /start must be non-negative/
      );
    });
  });

  describe("edge cases", () => {
    it("returns empty array for empty string", () => {
      assert.deepEqual(
        splitStringOnChunks({ text: "", chunkLen: 10, overlap: 0 }),
        []
      );
    });

    it("returns empty array if start >= text length", () => {
      assert.deepEqual(
        splitStringOnChunks({ text: "hello", chunkLen: 10, overlap: 0, start: 100 }),
        []
      );
    });

    it("returns single chunk when text fits in chunkLen", () => {
      const result = splitStringOnChunks({ text: "hello world", chunkLen: 20, overlap: 0 });
      assert.deepEqual(result, [
        { start: 0, mid: 0, end: 11, text: "hello world" },
      ]);
    });
  });

  describe("basic splitting (no overlap)", () => {
    it("splits on space separator", () => {
      const result = splitStringOnChunks({
        text: "aaa bbb ccc ddd",
        chunkLen: 8,
        overlap: 0,
        separators: [" "],
      });
      assert.deepEqual(texts(result), ["aaa bbb ", "ccc ddd"]);
      assert.deepEqual(result[0], { start: 0, mid: 0, end: 8, text: "aaa bbb " });
      assert.deepEqual(result[1], { start: 8, mid: 8, end: 15, text: "ccc ddd" });
    });

    it("hard-cuts when no separator found", () => {
      const result = splitStringOnChunks({
        text: "abcdefghijklmnop",
        chunkLen: 5,
        overlap: 0,
        separators: [" "],
      });
      assert.deepEqual(texts(result), ["abcde", "fghij", "klmno", "p"]);
    });
  });

  describe("separator priority on right boundary", () => {
    it("prefers higher-priority separator", () => {
      const text = "hello world\nfoo bar baz";
      const result = splitStringOnChunks({
        text,
        chunkLen: 15,
        overlap: 0,
        separators: ["\n", " "],
      });
      assert.deepEqual(texts(result), ["hello world\n", "foo bar baz"]);
    });

    it("falls back to lower-priority separator when higher not found", () => {
      const text = "hello world foo bar";
      const result = splitStringOnChunks({
        text,
        chunkLen: 10,
        overlap: 0,
        separators: ["\n", " "],
      });
      assert.deepEqual(texts(result), ["hello ", "world foo ", "bar"]);
    });
  });

  describe("overlap", () => {
    it("snaps overlap to leftmost separator in overlap window", () => {
      const result = splitStringOnChunks({
        text: "aaa bbb ccc ddd eee",
        chunkLen: 8,
        overlap: 5,
        separators: [" "],
      });
      assert.deepEqual(texts(result), ["aaa bbb ", "bbb ccc ", "ccc ddd ", "ddd eee"]);
      // Second chunk: start=4 (overlap), mid=8 (new content), end=12
      assert.deepEqual(result[1], { start: 4, mid: 8, end: 12, text: "bbb ccc " });
    });

    it("handles overlap larger than currentPos without going negative", () => {
      // First chunk ends at pos 3 ("ab "), then overlap=4 > currentPos=3
      const result = splitStringOnChunks({
        text: "ab cd ef gh",
        chunkLen: 5,
        overlap: 4,
        separators: [" "],
      });
      assert.deepEqual(texts(result), ["ab ", "cd ", "cd ef", "ef gh"]);
      // Verify text integrity
      for (const chunk of result) {
        assert.equal(chunk.text, "ab cd ef gh".slice(chunk.start, chunk.end));
      }
    });

    it("skips overlap when no separator in overlap window", () => {
      const result = splitStringOnChunks({
        text: "abcdefghijklmnop",
        chunkLen: 8,
        overlap: 3,
        separators: [" "],
      });
      assert.deepEqual(texts(result), ["abcdefgh", "ijklmnop"]);
      // No overlap → mid equals start
      assert.equal(result[1].start, 8);
      assert.equal(result[1].mid, 8);
    });

    it("chunk.text matches text.slice(start, end)", () => {
      const text = "aaa bbb ccc ddd eee";
      const result = splitStringOnChunks({
        text,
        chunkLen: 8,
        overlap: 5,
        separators: [" "],
      });
      for (const chunk of result) {
        assert.equal(chunk.text, text.slice(chunk.start, chunk.end));
      }
    });
  });

  describe("overlap + start combined", () => {
    it("overlap consumes from chunkLen budget, rest goes to body", () => {
      //
      // text = "SKIP|aa|bb|cc|dd|ee|ff"
      //         0123456789012345678901
      //              ^start=5
      //
      // chunkLen=9, overlap=6, separators=["|"]
      //
      // Chunk 1 (first, no overlap):
      //   search text[5..14] = "aa|bb|cc|"
      //   last "|" at idx 8 → end=14
      //   → "aa|bb|cc|"  start=5 mid=5 end=14
      //
      // Chunk 2 (currentPos=14):
      //   overlap window text[8..14] = "bb|cc|"
      //   leftmost "|" at idx 2 → chunkStart=8+2+1=11, overlapUsed=3
      //   remaining=9-3=6, search text[14..20] = "dd|ee|"
      //   last "|" at idx 5 → end=20
      //   → "cc|dd|ee|"  start=11 mid=14 end=20
      //     overlap="cc|" (11..14), body="dd|ee|" (14..20)
      //
      // Chunk 3 (currentPos=20):
      //   overlap window text[14..20] = "dd|ee|"
      //   leftmost "|" at idx 2 → chunkStart=17, overlapUsed=3
      //   remaining=6, rightLimit=17+9=26 >= 22 → last chunk
      //   → "ee|ff"  start=17 mid=20 end=22
      //     overlap="ee|" (17..20), body="ff" (20..22)
      //
      const text = "SKIP|aa|bb|cc|dd|ee|ff";
      const result = splitStringOnChunks({
        text,
        chunkLen: 9,
        overlap: 6,
        start: 5,
        separators: ["|"],
      });

      // First chunk: overlap window text[max(0,5-6)..5] = "SKIP|"
      //   leftmost "|" at idx 4 → chunkStart=0+4+1=5, overlapUsed=0
      //   (separator is right at boundary, so no actual overlap text)
      //   remaining=9, search text[5..14] = "aa|bb|cc|" → end=14
      assert.deepEqual(result, [
        { start: 5,  mid: 5,  end: 14, text: "aa|bb|cc|" },
        { start: 11, mid: 14, end: 20, text: "cc|dd|ee|" },
        { start: 17, mid: 20, end: 22, text: "ee|ff" },
      ]);
    });
  });

  describe("start offset", () => {
    it("begins chunking from the start position", () => {
      const result = splitStringOnChunks({
        text: "skip this aaa bbb ccc",
        chunkLen: 8,
        overlap: 0,
        start: 10,
        separators: [" "],
      });
      assert.deepEqual(texts(result), ["aaa bbb ", "ccc"]);
      assert.equal(result[0].start, 10);
      assert.equal(result[0].mid, 10);
    });
  });

  describe("multi-char separators", () => {
    it("splits on paragraph boundaries", () => {
      const text = "para one\n\npara two\n\npara three";
      const result = splitStringOnChunks({
        text,
        chunkLen: 15,
        overlap: 0,
        separators: ["\n\n", " "],
      });
      assert.deepEqual(texts(result), ["para one\n\n", "para two\n\n", "para three"]);
    });
  });

  describe("default separators", () => {
    it("uses default separators when none provided", () => {
      const text = "Hello world. This is a test.\n\nNew paragraph here.";
      const result = splitStringOnChunks({
        text,
        chunkLen: 30,
        overlap: 0,
      });
      assert.deepEqual(texts(result), ["Hello world. This is a test.\n\n", "New paragraph here."]);
    });
  });
});

Способ №3 – Spec Driven Development: сами пишем спеки, клод только кодит

Время 0:20
Ошибок 0, ваншот
Размер md 73 строки, тестов на 133

Планнинг мод работает, и быстрее ручного ввода, однако страдает из-за проблем:

Нельзя ресетнуть контекст и начать заново. Обсуждения спеков хранятся вне гита. Однако их можно туда перетащить.
После каждой итерации требования нужно перечитывать. Клод может рандомно их поменять. Если бы они были в гите, то хотя бы был диф. Но сюрприз - диф может быть, а требования по факту не поменялись, просто переформулировались. Короче, геморрой.
Если Клод пишет себе тесты, то он смертельно халявит. Производит множество одинаковых тест кейсов, которые все не покрывают, но при этом их приходится читать

We attack all of these problems:

We write the specs ourselves, keep them in git, and do not let Claude change them. That makes the specs predictable. We simply compile the specs into a program. If we do not like the result, we extend the specs and reset the context. We also ask Claude to point out any errors or contradictions it ran into during implementation.
We write the examples in the specs ourselves, and they become the unit and PBT tests. Claude gets no chance to cut corners: our cases are few, but they are serious.

The result is awesome. It took me 20 minutes to write src/splitStringOnChunks/spec.md, but Claude one-shotted the algorithm from it. I also tried deleting part of the specs and it still one-shotted it, so it seems you can describe the problem quite loosely and it will fill in the gaps. The main thing is that the test cases and the specs are written by a human.

SDD chat


compile src/splitStringOnChunks/spec.md into code
run the code
output any errors or contradictions you found

for testing use vitest

for pbt use fast-check
try to put all invaritans into single domain run

./src/splitStringOnChunks/spec.md

create './splitStringOnChunks.ts' that will export splitStringOnChunks() function
this function splits input text on chunks for embeddings, support overlap and separators
inputs
    text - the original text
    start - from what position start chunking. this allows for the first chunk having overlap
    chunkLen - the soft boundary of max length of a chunk
    overlap - the soft boundary of lenfth of left overlap
    separators - ordered array of non empty string separators. ordered in desc of importance

outputs an array of Chunk[] each Chunk consists of
    start - index in the original text from where the chunk starting
    mid - from where new data starts and overlap ends. So overlap is [start, mid) and new data is [mid, end)
    end - where chunk ends
    text - the full chunk text eg [start, end)

[start, end) should be no more than "chunkLen"
[start, mid) should be no more than "overlap"

the right boundary should be the most important the most right separator, so separator priority is important
if there are no separators then we allow hard cut

the left boundary should be the most left any separator, so we ignore separator priority
if there are no separator then we make start === mid, eg, skipping overlap for this case

beginning and the text and the end of the text are the most important separators

first we try to match left boundary with overlap, then we use the rest of chunkLen budget to match the right boundary
eg, we try to add overlap to all chunks

don't include separators for the left boundary
but include separators for the right boundary

## Testing

### Case 1
text "some hello. String. that we want"
start 12
chunkLen 22
overlap 10
separators [". ", " "]

result chunks
"hello. String. "
"String. that we want"

### Case 2
text "some hello. String. that we want"
start 12
chunkLen 22
overlap 10
separators [" ", ". "]

result chunks
"hello. String. that "
"that we want"

### Property-Based Testing
run pbt on the domain
    text: length [0, 10]
    start: [0, 11]
    chunkLen: [1, 11]
    overlap: [0, 11]
    separators: [1, 2] separators of strings of length [1, 2]

calling splitStringOnChunks() with the same arguments produces the same resutl
if you join all [mid, end) you will get the original string respecting the string "start" param
empty input text produces no chunks
non emty text produces at least 1 chunk
for each chunk
    start <= mid
    mid < end
    text is [start, end)
    if there is no overlap start === mid

To which we get the response:

Errors/Contradictions Found in the Spec

  1. start >= text.length with non-empty text: The PBT domain allows start ∈ [0, 11] and text.length ∈ [0, 10], so start can exceed text.length. The invariant "non-empty text produces at least 1 chunk" conflicts with this — when start >= text.length, there's nothing left to chunk. I resolved this by only asserting "at least 1 chunk" when start < text.length.
  
  2. "soft boundary" vs hard constraints: The spec calls chunkLen and overlap "soft boundaries", but the PBT invariants assert [start, end) ≤ chunkLen and [start, mid) ≤ overlap as hard constraints. These can conflict when overlap >= chunkLen (no room for new data). I handle this by shrinking the overlap to ensure forward progress, which may violate the "soft" overlap boundary while preserving the chunkLen constraint.
  
  3. "if there is no overlap start === mid": The spec doesn't clarify whether "no overlap" means the overlap parameter is 0, or that the chunk happened to get no overlap (e.g., no separator found in the overlap region). Both cases result in start === mid in my implementation, but the invariant's intent is ambiguous. I tested the overlap === 0 case explicitly.

SDD program


export interface Chunk {
  start: number
  mid: number
  end: number
  text: string
}

export function splitStringOnChunks({
  text,
  start,
  overlap,
  chunkLen,
  separators,
}: {
  text: string
  start: number
  chunkLen: number
  overlap: number
  separators: string[]
}): Chunk[] {
  if (text.length === 0 || start >= text.length) return []

  const chunks: Chunk[] = []
  let currentMid = start

  while (currentMid < text.length) {
    // === Left boundary (overlap) ===
    const searchStart = Math.max(0, currentMid - overlap)
    let chunkStart = currentMid // default: no overlap

    if (searchStart < currentMid) {
      if (searchStart === 0) {
        // Beginning of text is the most important separator
        chunkStart = 0
      } else {
        // Find leftmost any separator in [searchStart, currentMid)
        let leftmostPos = -1
        let leftmostEnd = -1

        for (const sep of separators) {
          for (let p = searchStart; p + sep.length <= currentMid; p++) {
            if (text.substring(p, p + sep.length) === sep) {
              if (leftmostPos === -1 || p < leftmostPos) {
                leftmostPos = p
                leftmostEnd = p + sep.length
              }
              break // only need leftmost occurrence for this separator
            }
          }
        }

        if (leftmostPos >= 0) {
          chunkStart = leftmostEnd // don't include separator
        }
        // else: no separator found, chunkStart stays at currentMid (skip overlap)
      }
    }

    // Ensure there's room for at least 1 char of new data
    if (chunkStart + chunkLen <= currentMid) {
      chunkStart = currentMid + 1 - chunkLen
      if (chunkStart < 0) chunkStart = 0
    }

    // === Right boundary ===
    const maxEnd = Math.min(chunkStart + chunkLen, text.length)
    let chunkEnd: number

    // End of text is the most important separator
    if (text.length <= chunkStart + chunkLen) {
      chunkEnd = text.length
    } else {
      chunkEnd = maxEnd // default: hard cut

      // Try each separator in priority order
      for (const sep of separators) {
        let found = false
        // Find rightmost occurrence starting at or after currentMid
        for (let p = maxEnd - sep.length; p >= currentMid; p--) {
          if (text.substring(p, p + sep.length) === sep) {
            chunkEnd = p + sep.length
            found = true
            break
          }
        }
        if (found) break
      }
    }

    // Ensure forward progress
    if (chunkEnd <= currentMid) {
      chunkEnd = Math.min(currentMid + 1, text.length)
    }

    chunks.push({
      start: chunkStart,
      mid: currentMid,
      end: chunkEnd,
      text: text.substring(chunkStart, chunkEnd),
    })

    currentMid = chunkEnd
  }

  return chunks
}
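
The left-boundary rule from the spec (take the leftmost occurrence of any separator, ignore priority, exclude the separator from the overlap) can also be looked at in isolation. Below is a minimal sketch, assuming a hypothetical helper `findLeftBoundary` that is not part of Claude's output; it uses `indexOf` instead of the character-by-character scan, and on a tie it keeps the separator listed first, which is an assumption.

```typescript
// Hypothetical helper, not part of the listing above: returns the index where
// a chunk should start, given where its new data begins (`mid`).
function findLeftBoundary(
  text: string,
  mid: number,
  overlap: number,
  separators: string[],
): number {
  const windowStart = Math.max(0, mid - overlap);
  // The beginning of the text counts as the strongest separator.
  if (windowStart === 0) return 0;

  let bestPos = -1;
  let bestEnd = mid; // default: no separator found, so no overlap
  for (const sep of separators) {
    // First occurrence at or after windowStart; it must end at or before mid.
    const pos = text.indexOf(sep, windowStart);
    if (pos === -1 || pos + sep.length > mid) continue;
    // Priority is ignored: only the position matters.
    // Assumption: on a tie the separator listed first wins.
    if (bestPos === -1 || pos < bestPos) {
      bestPos = pos;
      bestEnd = pos + sep.length; // the separator itself is excluded from the overlap
    }
  }
  return bestEnd;
}

const sample = 'some hello. String. that we want';
console.log(findLeftBoundary(sample, 12, 10, ['. ', ' '])); // 5
console.log(findLeftBoundary(sample, 20, 10, ['. ', ' '])); // 12
```

With the inputs of Case 1 this gives the chunk starts 5 and 12, the same values the tests for that case expect.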

SDD tests


import { describe, it, expect } from 'vitest';
import fc from 'fast-check';
import { splitStringOnChunks } from './splitStringOnChunks';

describe('splitStringOnChunks', () => {
  describe('Case 1', () => {
    it('splits with separators [". ", " "]', () => {
      const chunks = splitStringOnChunks({
        text: 'some hello. String. that we want',
        start: 12,
        chunkLen: 22,
        overlap: 10,
        separators: ['. ', ' '],
      });

      expect(chunks.map((c) => c.text)).toEqual([
        'hello. String. ',
        'String. that we want',
      ]);

      // Verify chunk structure
      expect(chunks[0]).toEqual({
        start: 5,
        mid: 12,
        end: 20,
        text: 'hello. String. ',
      });
      expect(chunks[1]).toEqual({
        start: 12,
        mid: 20,
        end: 32,
        text: 'String. that we want',
      });
    });
  });

  describe('Case 2', () => {
    it('splits with separators [" ", ". "]', () => {
      const chunks = splitStringOnChunks({
        text: 'some hello. String. that we want',
        start: 12,
        chunkLen: 22,
        overlap: 10,
        separators: [' ', '. '],
      });

      expect(chunks.map((c) => c.text)).toEqual([
        'hello. String. that ',
        'that we want',
      ]);

      expect(chunks[0]).toEqual({
        start: 5,
        mid: 12,
        end: 25,
        text: 'hello. String. that ',
      });
      expect(chunks[1]).toEqual({
        start: 20,
        mid: 25,
        end: 32,
        text: 'that we want',
      });
    });
  });

  describe('Property-Based Testing', () => {
    const textArb = fc.string({ minLength: 0, maxLength: 10 });
    const startArb = fc.integer({ min: 0, max: 11 });
    const chunkLenArb = fc.integer({ min: 1, max: 11 });
    const overlapArb = fc.integer({ min: 0, max: 11 });
    const separatorArb = fc.array(
      fc.string({ minLength: 1, maxLength: 2 }),
      { minLength: 1, maxLength: 2 },
    );

    const arbs = fc.tuple(textArb, startArb, chunkLenArb, overlapArb, separatorArb);

    it('all invariants hold', () => {
      fc.assert(
        fc.property(arbs, ([text, start, chunkLen, overlap, separators]) => {
          const params = { text, start, chunkLen, overlap, separators };
          const chunks = splitStringOnChunks(params);

          // Deterministic: same args produce same result
          const chunks2 = splitStringOnChunks(params);
          expect(chunks).toEqual(chunks2);

          // Empty text produces no chunks
          if (text.length === 0) {
            expect(chunks).toHaveLength(0);
            return;
          }

          // Non-empty text with valid start produces at least 1 chunk
          if (text.length > 0 && start < text.length) {
            expect(chunks.length).toBeGreaterThanOrEqual(1);
          }

          // Joining [mid, end) reproduces the original text from start
          if (start < text.length) {
            const joined = chunks.map((c) => text.substring(c.mid, c.end)).join('');
            expect(joined).toBe(text.substring(start));
          }

          // Per-chunk invariants
          for (const chunk of chunks) {
            // start <= mid
            expect(chunk.start).toBeLessThanOrEqual(chunk.mid);

            // mid < end
            expect(chunk.mid).toBeLessThan(chunk.end);

            // text is [start, end)
            expect(chunk.text).toBe(text.substring(chunk.start, chunk.end));

            // chunk length constraint: [start, end) <= chunkLen
            expect(chunk.end - chunk.start).toBeLessThanOrEqual(chunkLen);

            // overlap constraint: [start, mid) <= overlap
            expect(chunk.mid - chunk.start).toBeLessThanOrEqual(overlap);

            // if overlap param is 0, start === mid
            if (overlap === 0) {
              expect(chunk.start).toBe(chunk.mid);
            }
          }
        }),
        { numRuns: 10000 },
      );
    });
  });
});
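
The invariants asserted inside the property test can also be pulled into a standalone checker, which makes a failing counterexample easier to read. This is only a sketch: `violatedInvariants` and the invariant labels are hypothetical names, and treating `start >= text.length` as "no chunks expected" follows Claude's resolution of contradiction 1 rather than anything stated in spec.md.

```typescript
interface Chunk {
  start: number;
  mid: number;
  end: number;
  text: string;
}

// Hypothetical helper: returns the names of the violated invariants;
// an empty array means the result is consistent with the spec.
function violatedInvariants(
  text: string,
  start: number,
  chunkLen: number,
  overlap: number,
  chunks: Chunk[],
): string[] {
  const failed: string[] = [];
  // Assumption: nothing is left to chunk when start is at or past the end.
  const expectsChunks = text.length > 0 && start < text.length;
  if (!expectsChunks && chunks.length > 0) failed.push('no chunks expected');
  if (expectsChunks && chunks.length === 0) failed.push('at least 1 chunk');

  // Joining the new-data ranges [mid, end) must reproduce the text from `start`.
  const joined = chunks.map((c) => text.substring(c.mid, c.end)).join('');
  if (expectsChunks && joined !== text.substring(start)) failed.push('coverage');

  for (const c of chunks) {
    if (!(c.start <= c.mid)) failed.push('start <= mid');
    if (!(c.mid < c.end)) failed.push('mid < end');
    if (c.text !== text.substring(c.start, c.end)) failed.push('text is [start, end)');
    if (c.end - c.start > chunkLen) failed.push('chunkLen');
    if (c.mid - c.start > overlap) failed.push('overlap');
  }
  return failed;
}

const sample = 'some hello. String. that we want';
const case1: Chunk[] = [
  { start: 5, mid: 12, end: 20, text: 'hello. String. ' },
  { start: 12, mid: 20, end: 32, text: 'String. that we want' },
];
console.log(violatedInvariants(sample, 12, 22, 10, case1)); // [] (no violations)
```

Inside a fast-check property the same function could be called once per generated input and compared with an empty array.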

Method #4 - Spec-Kit: a huge framework

Time 0:30
One-shot, no errors
Size: md 292 lines, tests 643 lines

Clearly, SDD delivers. Next it has to scale to large projects that do not fit in the context and consist of a bunch of evolving features. For that I use, for example, frameworks like spec-kit. For a small task it is overkill, but it is still interesting to see how it works.

Spec-kit one-shotted the algorithm and produced the most readable code, respect. No wonder, though, given how many prompts were stuffed into it. We ran 'Method #3' on raw Claude Code, so of course the code quality was lower.

A fun detail is that spec-kit wrote its own tests. There are a hell of a lot of them, of course, and you have to wade through them. But the point is that it wrote them itself, and there were no bugs.

The main problem with spec-kit is the enormous amount of text you have to read. In our custom SDD ('Method #3') the specs took about 70 lines; in spec-kit they take about 300. And those 300 do not include another 200 lines of sliced tasks, which I assume are instructions purely for the LLM and do not need to be read. The second problem is that it is slow and burns tokens. The tasks took 20 minutes to execute.

You could say: oh, this is just overkill for such a small piece of functionality; if the functionality were large, the overhead would be smaller relative to the useful output. I do not buy that. I think the overhead will grow in proportion to the size of the task: there will be a lot to read and a long wait, which will seriously hurt the quality of the work.

And then there is the length of a single iteration: 30 minutes of my work + 30 minutes of the agent's work = an hour per iteration. I would like fast iterations, but the framework is heavy.

Spec-Kit chat


Following the official tutorial

Fill the constitution with the bare minimum requirements for a nodejs typescript backend

/speckit.specify I need you to build a splitStringOnChunks() function that will split incoming text on chunks with
  overlap support for my embeddings. Input should be a single object with:
        text - original text
        chunkLen - max len of a chunk, it is recomended len, but chunks could be smaller
        overlap - max len of an overlap, it is recomended len, but overlap could be smaller
        separators - a list of non empty string separators that are sorted in desc of priority
  Output is an array of chunks, each of them has
        start - index in the original text where the chunk starts
        mid - index where the new text data starts and overlap ends. [start, mid) - is the overlap
        end - where the chunk ends
        text - the chunk text [start, end)

This produced a huge specification file that takes a long time to read. At least the file is in git. While reading it, though, I found places where the LLM got confused, namely our long-suffering case of overlap with a non-zero start. The specs need changes.

Feature Specification: String Chunk Splitter with Overlap

Feature Branch: 001-string-chunk-splitter
Created: 2026-03-26
Status: Draft
Input: User description: "Build a splitStringOnChunks() function that splits text into chunks with overlap support for embeddings"

User Scenarios & Testing (mandatory)

User Story 1 - Basic Text Chunking (Priority: P1)

A developer passes a long text string to splitStringOnChunks() with a desired chunk length and a list of separators. The function splits the text into chunks that respect separator boundaries, each chunk no longer than chunkLen. The first chunk has no overlap (start === mid), and subsequent chunks include overlap from the previous chunk.

Why this priority: This is the core functionality — without basic chunking, no other behavior matters.

Independent Test: Can be tested by passing a multi-paragraph text with chunkLen=100, overlap=20, separators=["\n\n", "\n", ". ", " "] and verifying the returned chunks cover the entire text with correct start, mid, end, and text fields.

Acceptance Scenarios:

Given a text of 500 characters, chunkLen=100, overlap=20, separators=["\n\n", "\n", " "], When splitStringOnChunks() is called, Then it returns an array of chunks where each chunk's text.length <= chunkLen, overlap regions are <= overlap characters, and every character in the original text is covered.
Given any output chunks, When inspecting indices, Then chunk.text === text.slice(chunk.start, chunk.end) for every chunk, chunk.mid >= chunk.start, and the first chunk has start === mid === 0.
Given consecutive chunks, When examining their boundaries, Then chunks[i+1].start < chunks[i].end (overlap exists) and chunks[i+1].mid === chunks[i].end (new content starts where previous chunk ended).

User Story 2 - Separator Priority Splitting (Priority: P2)

A developer provides an ordered list of separators (e.g., ["\n\n", "\n", ". ", " "]). The function prefers splitting at higher-priority separators first and only falls back to lower-priority ones when the higher-priority ones do not produce a valid split within the chunk length.

Why this priority: Separator-aware splitting is critical for embedding quality — splitting mid-sentence degrades retrieval accuracy.

Independent Test: Can be tested by providing text with multiple separator types and verifying that chunks split at paragraph boundaries before sentence boundaries before word boundaries.

Acceptance Scenarios:

Given text with paragraph breaks and chunkLen large enough to fit a paragraph, When splitting, Then chunks split at "\n\n" boundaries rather than at "\n" or " ".
Given a paragraph longer than chunkLen, When no paragraph separator fits, Then the function falls back to the next separator in the list (e.g., "\n", then ". ", then " ").

User Story 3 - Edge Case Handling (Priority: P3)

A developer passes edge-case inputs: empty text, text shorter than chunkLen, text with no matching separators, or very small chunkLen/overlap values. The function handles these gracefully without errors.

Why this priority: Robustness ensures the function is reliable in production pipelines without requiring callers to pre-validate input.

Independent Test: Can be tested by passing empty strings, single-character texts, and texts with no separator matches, verifying correct output or empty arrays.

Acceptance Scenarios:

Given an empty string as text, When splitStringOnChunks() is called, Then it returns an empty array.
Given text shorter than chunkLen, When splitting, Then a single chunk is returned with start=0, mid=0, end=text.length, text=originalText.
Given text with no matching separators and chunkLen smaller than text.length, When splitting, Then the function splits at hard chunkLen boundaries as a last resort.

Edge Cases

What happens when overlap >= chunkLen? The overlap MUST be clamped to be strictly less than chunkLen to guarantee forward progress.
What happens when a separator is longer than the remaining text? It is skipped.
What happens when chunkLen is 0 or negative? Return an empty array.
What happens when separators is an empty array? Fall back to hard splitting at chunkLen boundaries.

Requirements (mandatory)

Functional Requirements

FR-001: The function MUST accept a single object parameter with fields: text (string), chunkLen (number), overlap (number), and separators (string array).
FR-002: The function MUST return an array of chunk objects, each with start (number), mid (number), end (number), and text (string).
FR-003: For every chunk, text MUST equal originalText.slice(start, end).
FR-004: The first chunk MUST have start === mid === 0.
FR-005: For consecutive chunks [i] and [i+1]: chunks[i+1].mid MUST equal chunks[i].end (no gaps in new content).
FR-006: Each chunk's length (end - start) MUST NOT exceed chunkLen.
FR-007: Each chunk's overlap length (mid - start) MUST NOT exceed overlap.
FR-008: The function MUST prefer splitting at higher-priority separators (earlier in the array) over lower-priority ones.
FR-009: When no separator produces a valid split within chunkLen, the function MUST hard-split at the chunkLen boundary.
FR-010: The union of all chunks' new content ranges [mid, end) MUST cover the entire original text with no gaps.
FR-011: Overlap for chunk [i+1] MUST start from a separator boundary found within the overlap window of chunk [i], preferring higher-priority separators.

Key Entities

ChunkInput: The input configuration object containing text, chunkLen, overlap, and separators.
Chunk: The output object representing a single chunk with positional indices (start, mid, end) and the extracted text.

Success Criteria (mandatory)

Measurable Outcomes

SC-001: Every character in the original text appears in at least one chunk's new content range [mid, end).
SC-002: No chunk exceeds the specified chunkLen in total length.
SC-003: Overlap regions respect the overlap maximum length.
SC-004: When separators are available within the chunk window, splits occur at separator boundaries at least 90% of the time (hard splits are rare).
SC-005: The function processes a 1MB text input in under 1 second.

Assumptions

The function is a pure, synchronous utility — no I/O, no async operations.
Separators are non-empty strings as stated in the input contract.
chunkLen and overlap are positive integers; invalid values result in an empty array.
The function is deterministic — same input always produces the same output.
Multi-byte characters (UTF-16 surrogate pairs) are not specially handled; indices refer to JavaScript string indices (UTF-16 code units).

/speckit.specify "FR-004" is a wrong spec. For non zero "start" if "overlap" configured there might be overlap for the first chunk

/speckit.specify I want left boundary to ignore separators order. Just try to pick the most left one. If no found - make chunk start
= mid. I also do not want to include separators into the overlap. For the right boundary include separators and for them try find the
high priority one, if no found - allowed to hard split

/speckit.specify I want my chunks to have overlaps. So first try to match left boundary, only then use unspent chunkLen budget on the
 right boundary

At this point I am happy with the specs. I ask it to ask me more questions, in case we forgot something.

/speckit.clarify

The next step is to tell it our architecture. Spec-kit draws a strong line between implementation details and feature specs.

/speckit.plan use TypeScript, nodejs, vitest, fast-check

The command produced a plan file. Now we have several .md files, and we read everything that is new.

specs/001-string-chunk-splitter/plan.md

Implementation Plan: String Chunk Splitter with Overlap

Branch: 001-string-chunk-splitter | Date: 2026-03-26 | Spec: spec.md
Input: Feature specification from /specs/001-string-chunk-splitter/spec.md

Summary

Implement a splitStringOnChunks() utility function that splits text into overlapping chunks for embedding pipelines. The function uses asymmetric boundary rules: left boundary (overlap) is resolved first by picking the leftmost separator regardless of priority, then the right boundary uses the remaining chunkLen budget with priority-ordered separator search. Pure synchronous TypeScript function with no dependencies beyond the runtime.

Technical Context

Language/Version: TypeScript 6.x with strict: true (Node.js LTS)
Primary Dependencies: None for production; vitest + fast-check for testing
Storage: N/A
Testing: vitest (unit/property tests), fast-check (property-based testing)
Target Platform: Node.js (LTS)
Project Type: Library (single exported utility function)
Performance Goals: Process 1MB text input in under 1 second
Constraints: Pure synchronous function, no external dependencies, UTF-16 code unit indices
Scale/Scope: Single function with types, minimal surface area

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

| Principle | Status | Notes |
| --- | --- | --- |
| I. Strict TypeScript | PASS | `strict: true` already configured in tsconfig.json. All types will be explicit. No `any` usage. |
| II. Simplicity First | PASS | Single exported function + 2 type definitions. No abstractions, no classes, no patterns beyond a loop. |
| III. Explicit Error Handling | PASS | Invalid inputs return empty array (documented in spec). No thrown exceptions for expected edge cases. |
| Technology Stack | PASS | TypeScript, tsc build, `src/` → `dist/`, no new production dependencies. |
| Development Workflow | PASS | Code in `src/`, build via `tsc`, no `console.log`. |

Gate result: ALL PASS — no violations, no complexity tracking needed.

Project Structure

Documentation (this feature)

specs/001-string-chunk-splitter/
├── plan.md
├── research.md
├── data-model.md
├── quickstart.md
├── contracts/
│   └── splitStringOnChunks.md
└── tasks.md

Source Code (repository root)

src/
├── index.ts                    # Re-exports public API
└── splitStringOnChunks.ts      # Function implementation + types

tests/
├── splitStringOnChunks.test.ts # Unit tests (vitest)
└── splitStringOnChunks.prop.ts # Property-based tests (fast-check)

Structure Decision: Single project layout. The feature is one function file plus types. Tests live in a top-level tests/ directory. No models/services/cli subdirectories needed — that would violate Simplicity First.

specs/001-string-chunk-splitter/data-model.md

Data Model: String Chunk Splitter

Entities

ChunkInput

The input configuration object for splitStringOnChunks().

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | Yes | — | The original text to split |
| chunkLen | number | Yes | — | Maximum total length of a chunk (overlap + new content) |
| overlap | number | Yes | — | Maximum length of overlap region |
| separators | string[] | Yes | — | Non-empty strings, sorted by descending priority |
| start | number | No | 0 | Offset where new content begins in the text |

Validation rules:

chunkLen must be > 0, otherwise return empty array
overlap must be >= 0; clamped to chunkLen - 1 if >= chunkLen
separators entries must be non-empty strings
start must be >= 0 and < text.length (if text is non-empty)

Chunk

A single chunk in the output array.

| Field | Type | Description |
| --- | --- | --- |
| start | number | Index in original text where the chunk begins (overlap start) |
| mid | number | Index where new content begins (overlap ends). `[start, mid)` is overlap |
| end | number | Index where the chunk ends (exclusive) |
| text | string | The chunk text: `originalText.slice(start, end)` |

Invariants:

0 <= start <= mid <= end <= text.length
end - start <= chunkLen
mid - start <= overlap
chunk.text === originalText.slice(chunk.start, chunk.end)

Relationships

ChunkInput → produces → Chunk[]
For consecutive chunks [i] and [i+1]: chunks[i+1].mid === chunks[i].end
First chunk: mid === input.start
Last chunk: end === text.length (or end of processable text)

State Transitions

N/A — this is a pure stateless function. No lifecycle or state management.
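
The validation rules in data-model.md above could be applied in a small normalization step before chunking. The following is a sketch only, not the code spec-kit generated: `normalizeInput` is a hypothetical name, and returning `null` for a negative overlap, an out-of-range start or an empty separator is an assumption, since the generated document does not say what a violation of those rules should produce.

```typescript
interface ChunkInput {
  text: string;
  chunkLen: number;
  overlap: number;
  separators: string[];
  start?: number;
}

type NormalizedInput = Required<ChunkInput>;

// Hypothetical helper: returns null when the caller should get an empty array.
function normalizeInput(input: ChunkInput): NormalizedInput | null {
  const { text, chunkLen, separators } = input;
  const start = input.start ?? 0; // default taken from the data model table

  if (text.length === 0) return null;
  if (chunkLen <= 0) return null;
  // Assumption: the rules do not say what a violation of these three means,
  // so this sketch treats them as invalid input too.
  if (input.overlap < 0) return null;
  if (start < 0 || start >= text.length) return null;
  if (separators.some((s) => s.length === 0)) return null;

  // overlap >= chunkLen would leave no room for new data, so clamp it.
  const overlap = Math.min(input.overlap, chunkLen - 1);
  return { text, chunkLen, overlap, separators, start };
}

console.log(normalizeInput({ text: 'abc', chunkLen: 5, overlap: 9, separators: [' '] }));
// overlap is clamped to 4, start defaults to 0
```

Clamping here, rather than inside the chunking loop, keeps the forward-progress guarantee in one place.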

specs/001-string-chunk-splitter/research.md

Research: String Chunk Splitter with Overlap

Testing Framework: vitest + fast-check

Decision: Use vitest as test runner with fast-check for property-based testing.

Rationale: vitest is fast, TypeScript-native, and has built-in assertion support. fast-check provides property-based testing which is ideal for this function — the invariants (no gaps, chunk length limits, overlap constraints) are natural properties to verify over randomized inputs.

Alternatives considered:

Jest: Heavier, slower TypeScript support, no advantage for this use case.
Node test runner: Built-in but less ergonomic for property-based testing integration.

Algorithm Approach: Linear Scan

Decision: Single-pass linear scan from start (or 0) to end of text, building one chunk at a time.

Rationale: The function processes text left-to-right. For each chunk:

Resolve left boundary by scanning the overlap window [mid - overlap, mid) for the leftmost separator occurrence.
Compute remaining budget: chunkLen - (mid - start).
Resolve right boundary by searching [mid, mid + remainingBudget) for rightmost occurrence of highest-priority separator.
Emit chunk, advance mid to end of the new content.

This is O(n * s * k) worst case where n = text length, s = number of separators, k = max separator length. For typical use (short separators, reasonable chunk sizes), this is effectively O(n).

Alternatives considered:

Pre-indexing all separator positions: Higher memory, more complex code, marginal benefit for typical inputs.
Regex-based splitting: Less control over boundary selection strategy, harder to implement asymmetric rules.

Separator Search Strategy

Decision: For right boundary, iterate separators by priority and use lastIndexOf semantics within the window. For left boundary, scan all separators in the overlap window and pick the one at the earliest position (longest match on tie).

Rationale: lastIndexOf within a substring gives the rightmost occurrence naturally. For the left boundary, a single scan collecting all separator positions is straightforward.

Alternatives considered:

Using a single combined regex: Would lose per-separator priority control and complicate the longest-match-on-tie rule.

No Production Dependencies

Decision: Zero production dependencies. Only dev dependencies: vitest, fast-check, existing typescript and @types/node.

Rationale: Constitution mandates minimal dependencies. This is a pure string manipulation function — no external libraries needed.
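
The right-boundary strategy described in research.md above (walk the separators in priority order, take the rightmost match inside the window via `lastIndexOf`) can be sketched as follows. This is not the code spec-kit generated; `findRightBoundary` and its parameters are hypothetical names, and `maxEnd` is assumed to be the chunk start plus `chunkLen`.

```typescript
// Hypothetical helper: picks the end of a chunk inside the window [mid, maxEnd).
function findRightBoundary(
  text: string,
  mid: number,
  maxEnd: number,
  separators: string[],
): number {
  // The end of the text counts as the strongest separator.
  if (maxEnd >= text.length) return text.length;

  for (const sep of separators) {
    const lastStart = maxEnd - sep.length; // latest position where sep still fits
    if (lastStart < mid) continue;
    // lastIndexOf searches backwards from lastStart, giving the rightmost match.
    const pos = text.lastIndexOf(sep, lastStart);
    if (pos >= mid) return pos + sep.length; // the separator stays in the chunk
  }
  return maxEnd; // no separator in the window: hard cut
}

const sample = 'some hello. String. that we want';
console.log(findRightBoundary(sample, 12, 27, ['. ', ' '])); // 20
console.log(findRightBoundary(sample, 12, 27, [' ', '. '])); // 25
```

The two calls differ only in separator order, which is the same contrast as between Case 1 and Case 2 in the hand-written spec: the first separator that matches anywhere in the window wins, even if a lower-priority one sits further to the right.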

specs/001-string-chunk-splitter/spec.md

Feature Specification: String Chunk Splitter with Overlap

Feature Branch: 001-string-chunk-splitter
Created: 2026-03-26
Status: Draft
Input: User description: "Build a splitStringOnChunks() function that splits text into chunks with overlap support for embeddings"

User Scenarios & Testing (mandatory)

User Story 1 - Basic Text Chunking (Priority: P1)

A developer passes a long text string to splitStringOnChunks() with a desired chunk length and a list of separators. The function splits the text into chunks that respect separator boundaries, each chunk no longer than chunkLen. Subsequent chunks include overlap from the previous chunk. When a non-zero start offset is provided with overlap > 0, even the first chunk can include overlap text from before the starting position.

Why this priority: This is the core functionality — without basic chunking, no other behavior matters.

Independent Test: Can be tested by passing a multi-paragraph text with chunkLen=100, overlap=20, separators=["\n\n", "\n", ". ", " "] and verifying the returned chunks cover the entire text with correct start, mid, end, and text fields.

Acceptance Scenarios:

Given a text of 500 characters, chunkLen=100, overlap=20, separators=["\n\n", "\n", " "], When splitStringOnChunks() is called, Then it returns an array of chunks where each chunk's text.length <= chunkLen, overlap regions are <= overlap characters, and every character in the original text is covered.
Given any output chunks, When inspecting indices, Then chunk.text === text.slice(chunk.start, chunk.end) for every chunk, and chunk.mid >= chunk.start. When start=0, the first chunk has start === mid === 0. When start > 0 and overlap > 0, the first chunk has mid === start and start < mid (overlap from text before the starting offset).
Given consecutive chunks, When examining their boundaries, Then chunks[i+1].mid === chunks[i].end (new content starts where previous chunk ended). If a separator was found in the overlap window, chunks[i+1].start < chunks[i+1].mid (overlap exists). If no separator was found, chunks[i+1].start === chunks[i+1].mid (no overlap).

User Story 2 - Asymmetric Boundary Rules (Priority: P2)

A developer provides an ordered list of separators. The right boundary (chunk end) respects separator priority — preferring higher-priority separators and falling back to lower-priority ones, with hard split as last resort. The left boundary (overlap start) ignores priority and picks the leftmost separator in the overlap window to maximize overlap. Separators are excluded from overlap text but included in chunk text at the right boundary.

Why this priority: Asymmetric boundary handling maximizes embedding context (wide overlaps) while preserving semantic coherence at chunk boundaries (priority-aware right splits).

Independent Test: Can be tested by providing text with multiple separator types and verifying that right boundaries prefer high-priority separators, left boundaries pick the leftmost separator regardless of type, and separator text is not part of the overlap.

Acceptance Scenarios:

Given text with paragraph breaks and chunkLen large enough to fit a paragraph, When splitting at the right boundary, Then chunks split at "\n\n" boundaries rather than at "\n" or " ".
Given a paragraph longer than chunkLen, When no paragraph separator fits at the right boundary, Then the function falls back to the next separator in the list (e.g., "\n", then ". ", then " ").
Given an overlap window containing both "\n" and " " separators, When determining the left boundary, Then the function picks whichever separator occurrence is leftmost (earliest position), regardless of priority order.
Given an overlap window with no separator occurrences, When determining the left boundary, Then the chunk has start === mid (no overlap for this chunk).
Given a separator found at the left boundary, When the overlap is constructed, Then the overlap text starts immediately after the separator — the separator itself is NOT included in [start, mid).
Given chunkLen=100 and an overlap of 30 characters is resolved for a chunk, When determining the right boundary, Then the right boundary search window is 70 characters (100 - 30), not 100. The overlap budget is spent first, and the right boundary gets the remainder.
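
The budget rule in the last scenario is plain arithmetic. A minimal sketch, assuming a hypothetical helper named `remainingBudget` (the name and the sample indices are illustrative, not part of the spec):

```typescript
// Hypothetical helper: the overlap is resolved first,
// and the right boundary only gets what is left of chunkLen.
function remainingBudget(chunkLen: number, start: number, mid: number): number {
  const overlapUsed = mid - start; // characters already spent on overlap
  return chunkLen - overlapUsed;
}

// chunkLen = 100, overlap resolved to 30 characters (start = 170, mid = 200)
const budget = remainingBudget(100, 170, 200);
console.log(budget); // 70

// The right boundary is then searched inside [mid, mid + budget)
console.log(`[200, ${200 + budget})`); // [200, 270)
```

The consequence is that a wide overlap shrinks the amount of new content per chunk, which is the trade-off the spec accepts.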

User Story 3 - Edge Case Handling (Priority: P3)

A developer passes edge-case inputs: empty text, text shorter than chunkLen, text with no matching separators, or very small chunkLen/overlap values. The function handles these gracefully without errors.

Why this priority: Robustness ensures the function is reliable in production pipelines without requiring callers to pre-validate input.

Independent Test: Can be tested by passing empty strings, single-character texts, and texts with no separator matches, verifying correct output or empty arrays.

Acceptance Scenarios:

Given an empty string as text, When splitStringOnChunks() is called, Then it returns an empty array.
Given text shorter than chunkLen, When splitting, Then a single chunk is returned with start=0, mid=0, end=text.length, text=originalText.
Given text with no matching separators and chunkLen smaller than text.length, When splitting, Then the function splits at hard chunkLen boundaries as a last resort.

Edge Cases

What happens when overlap >= chunkLen? The overlap MUST be clamped to be strictly less than chunkLen to guarantee forward progress.
What happens when a separator is longer than the remaining text? It is skipped.
What happens when chunkLen is 0 or negative? Return an empty array.
What happens when separators is an empty array? Fall back to hard splitting at chunkLen boundaries.
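
These edge-case rules can be sketched as a small normalization step. This is only an illustration under stated assumptions: `normalizeParams` and `NormalizedParams` are hypothetical names, and negative overlap values are deliberately left out of scope because the spec does not define them.

```typescript
interface NormalizedParams {
  chunkLen: number;
  overlap: number;
}

// Hypothetical helper: returns null when the caller should return [].
function normalizeParams(chunkLen: number, overlap: number): NormalizedParams | null {
  // chunkLen is 0 or negative: nothing sensible can be produced
  if (chunkLen <= 0) return null;

  // overlap >= chunkLen is clamped to chunkLen - 1, so every chunk
  // still contributes at least one character of new content
  const clamped = Math.min(overlap, chunkLen - 1);
  return { chunkLen, overlap: clamped };
}

console.log(normalizeParams(10, 25)); // overlap clamped to 9
console.log(normalizeParams(10, 3));  // overlap stays 3
console.log(normalizeParams(0, 5));   // null
```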

Requirements (mandatory)

Functional Requirements

FR-001: The function MUST accept a single object parameter with fields: text (string), chunkLen (number), overlap (number), separators (string array), and optionally start (number, defaults to 0) indicating the offset in the text where new content begins.
FR-002: The function MUST return an array of chunk objects, each with start (number), mid (number), end (number), and text (string).
FR-003: For every chunk, text MUST equal originalText.slice(start, end).
FR-004: The first chunk's mid MUST equal the input start offset. When start === 0, the first chunk has start === mid === 0 (no overlap). When start > 0 and overlap > 0, the first chunk's start MAY be less than mid, providing overlap context from text before the starting offset. The overlap length (mid - chunk.start) MUST NOT exceed overlap.
FR-005: For consecutive chunks [i] and [i+1]: chunks[i+1].mid MUST equal chunks[i].end (no gaps in new content).
FR-006: Each chunk's length (end - start) MUST NOT exceed chunkLen.
FR-007: Each chunk's overlap length (mid - start) MUST NOT exceed overlap.
FR-008: Boundary resolution order: The left boundary (overlap) MUST be resolved first. The right boundary MUST then use only the unspent chunkLen budget: remainingBudget = chunkLen - (mid - start). This prioritizes overlap presence over maximizing new content per chunk.
FR-012: Right boundary (chunk end): The function MUST iterate separators from highest to lowest priority. For each separator, find the rightmost occurrence within the remaining budget window [mid, mid + remainingBudget). If found, split there (separator text included in the chunk, before the split point). If not found, try the next separator. When no separator produces a valid split within the remaining budget, the function MUST hard-split at mid + remainingBudget.
FR-009: The union of all chunks' new content ranges [mid, end) MUST cover the entire original text with no gaps.
FR-010: Left boundary (overlap start): When determining where overlap begins for a chunk, the function MUST ignore separator priority order and instead pick the leftmost separator occurrence found within the overlap window. When multiple separators start at the same position, the longest matching separator MUST be chosen (overlap begins after the full separator, avoiding partial separator artifacts). If no separator is found in the overlap window, the chunk MUST have start === mid (no overlap).
FR-011: Separator text MUST NOT be included in the overlap. The overlap [start, mid) MUST begin immediately after the separator occurrence used to determine the left boundary.
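
The asymmetry between FR-010 and FR-012 is easier to see on a concrete window. The sketch below is an illustration, not the generated implementation: `leftmostAny`, `rightmostByPriority`, and `Match` are hypothetical names, and both functions work on an already extracted window string rather than on offsets into the full text.

```typescript
interface Match {
  index: number;
  sep: string;
}

// Left-boundary rule: earliest occurrence of any separator;
// on a tie at the same index the longest separator wins.
function leftmostAny(windowText: string, separators: string[]): Match | null {
  let best: Match | null = null;
  for (const sep of separators) {
    if (sep.length === 0) continue;
    const index = windowText.indexOf(sep);
    if (index === -1) continue;
    if (
      best === null ||
      index < best.index ||
      (index === best.index && sep.length > best.sep.length)
    ) {
      best = { index, sep };
    }
  }
  return best;
}

// Right-boundary rule: walk separators in priority order and take the
// rightmost occurrence of the first separator that occurs at all.
function rightmostByPriority(windowText: string, separators: string[]): Match | null {
  for (const sep of separators) {
    if (sep.length === 0) continue;
    const index = windowText.lastIndexOf(sep);
    if (index !== -1) return { index, sep };
  }
  return null;
}

const seps = ["\n\n", "\n", " "];
const sample = "aa bb\ncc\n\ndd ee";

// Left rule ignores priority: the space at index 2 is the earliest separator.
console.log(leftmostAny(sample, seps));
// Right rule honors priority: "\n\n" at index 8 wins over the later space at index 12.
console.log(rightmostByPriority(sample, seps));
```

On the same window the two rules pick different separators, which is exactly the behavior User Story 2 asks for.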

Key Entities

ChunkInput: The input configuration object containing text, chunkLen, overlap, separators, and optionally start (offset where new content begins, defaults to 0).
Chunk: The output object representing a single chunk with positional indices (start, mid, end) and the extracted text.

Success Criteria (mandatory)

Measurable Outcomes

SC-001: Every character in the original text appears in at least one chunk's new content range [mid, end).
SC-002: No chunk exceeds the specified chunkLen in total length.
SC-003: Overlap regions respect the overlap maximum length.
SC-004: When separators are available within the chunk window, splits occur at separator boundaries at least 90% of the time (hard splits are rare).
SC-005: The function processes a 1MB text input in under 1 second.

Clarifications

Session 2026-03-26

Q: Right boundary separator search strategy — rightmost or leftmost occurrence per priority level? → A: Rightmost occurrence per priority level (maximizes chunk size while respecting separator priority).
Q: Left boundary — when multiple separators start at the same position, pick longest or shortest? → A: Longest match (avoids partial separator artifacts in overlap text).

Assumptions

The function is a pure, synchronous utility — no I/O, no async operations.
Separators are non-empty strings as stated in the input contract.
chunkLen is a positive integer and overlap is a non-negative integer; a non-positive chunkLen results in an empty array, and overlap >= chunkLen is clamped (see Edge Cases).
The function is deterministic — same input always produces the same output.
Multi-byte characters (UTF-16 surrogate pairs) are not specially handled; indices refer to JavaScript string indices (UTF-16 code units).
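
The last assumption has a visible consequence: a hard split can land in the middle of a surrogate pair. A small sketch of that effect, where `endsWithLoneHighSurrogate` is a hypothetical helper written only for this illustration:

```typescript
// True when the string ends with a high surrogate that has lost its pair.
function endsWithLoneHighSurrogate(s: string): boolean {
  const code = s.charCodeAt(s.length - 1); // NaN for an empty string
  return code >= 0xd800 && code <= 0xdbff;
}

const text = "ab😀cd";
console.log(text.length); // 6: the emoji occupies two UTF-16 code units

// A hard split at index 3 cuts the emoji in half.
console.log(endsWithLoneHighSurrogate(text.slice(0, 3))); // true

// A split at index 4 keeps the pair intact.
console.log(endsWithLoneHighSurrogate(text.slice(0, 4))); // false
```

Whether this matters depends on the embedding pipeline; the spec simply states that the function does not guard against it.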

Now we slice all the specs into tasks

/speckit.tasks break this down

specs/001-string-chunk-splitter/tasks.md

Tasks: String Chunk Splitter with Overlap

Input: Design documents from /specs/001-string-chunk-splitter/

Prerequisites: plan.md (required), spec.md (required for user stories), research.md, data-model.md, contracts/

Tests: Included — vitest for unit tests, fast-check for property-based tests (per plan.md tech stack).

Organization: Tasks are grouped by user story to enable independent implementation and testing of each story.

Format: `[ID] [P?] [Story] Description`

[P]: Can run in parallel (different files, no dependencies)
[Story]: Which user story this task belongs to (e.g., US1, US2, US3)
Include exact file paths in descriptions

Path Conventions

Single project: src/, tests/ at repository root

Phase 1: Setup (Shared Infrastructure)

Purpose: Project initialization, dev dependencies, and type definitions

T001 Install vitest and fast-check as dev dependencies via npm install -D vitest fast-check
T002 Add vitest configuration to package.json: add "test": "vitest run" and "test:watch": "vitest" scripts
T003 Define ChunkInput and Chunk interfaces and export splitStringOnChunks stub (returns []) in src/splitStringOnChunks.ts per contracts/splitStringOnChunks.md
T004 Re-export public API from src/index.ts: export splitStringOnChunks, ChunkInput, and Chunk from ./splitStringOnChunks.js
T005 Create empty test files: tests/splitStringOnChunks.test.ts (unit) and tests/splitStringOnChunks.prop.ts (property-based)

Checkpoint: npm run build succeeds. npm test runs (tests may be empty/skipped). Types are importable.

Phase 2: Foundational (Blocking Prerequisites)

Purpose: Input validation logic that all user stories depend on

⚠️ CRITICAL: No user story work can begin until this phase is complete

T006 Implement input validation in src/splitStringOnChunks.ts: return [] for empty text, chunkLen <= 0, or start >= text.length. Clamp overlap to chunkLen - 1 if overlap >= chunkLen. Default start to 0 if not provided.

Checkpoint: Foundation ready — stub function validates inputs and returns [] for invalid cases. User story implementation can now begin.

Phase 3: User Story 1 — Basic Text Chunking (Priority: P1) 🎯 MVP

Goal: Core chunking loop that splits text into chunks with overlap, producing correct start, mid, end, text fields. Uses hard-split only (no separator logic yet).

Independent Test: Pass a multi-line text with chunkLen=100, overlap=20, separators=[] and verify chunks cover the entire text with correct indices and no gaps.

Tests for User Story 1 ⚠️

NOTE: Write these tests FIRST, ensure they FAIL before implementation

T007 [P] [US1] Write unit tests in tests/splitStringOnChunks.test.ts for basic chunking: text coverage (FR-009), index correctness (FR-003), first chunk mid === start (FR-004), continuity chunks[i+1].mid === chunks[i].end (FR-005), chunk size limit (FR-006), overlap size limit (FR-007). Use separators: [] (hard-split only).
T008 [P] [US1] Write unit tests in tests/splitStringOnChunks.test.ts for start offset: when start > 0 with overlap > 0, verify first chunk has mid === start and overlap from text before start. When start === 0, verify start === mid === 0.

Implementation for User Story 1

T009 [US1] Implement core chunking loop in src/splitStringOnChunks.ts: iterate from start to end of text, for each chunk resolve left boundary (hard: mid - overlap clamped to 0), set end = start + chunkLen clamped to text.length, emit { start, mid, end, text: text.slice(start, end) }, advance mid = end. No separator logic yet — hard splits only.
T010 [US1] Verify all T007 and T008 tests pass via npm test

Checkpoint: Function splits text into chunks with correct indices and overlap via hard boundaries. All US1 tests pass.
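
The hard-split loop that T009 describes can be sketched as follows. This is an illustration of the task text only, assuming inputs were already validated as in T006 (chunkLen > 0, 0 <= overlap < chunkLen); `hardSplit` and `HardChunk` are hypothetical names. Note that the final program shown later in this article resolves overlap only at separators, so it behaves differently when separators is empty.

```typescript
interface HardChunk {
  start: number;
  mid: number;
  end: number;
  text: string;
}

// Hard boundaries only: overlap is taken as a fixed look-back from mid.
function hardSplit(text: string, chunkLen: number, overlap: number, from = 0): HardChunk[] {
  const chunks: HardChunk[] = [];
  let mid = from;
  while (mid < text.length) {
    const start = Math.max(0, mid - overlap);             // left boundary, clamped to 0
    const end = Math.min(start + chunkLen, text.length);  // right boundary, clamped to text end
    chunks.push({ start, mid, end, text: text.slice(start, end) });
    mid = end; // new content of the next chunk starts where this one ended
  }
  return chunks;
}

const result = hardSplit("abcdefghij", 4, 1);
console.log(result.map((c) => `${c.start}/${c.mid}/${c.end}:${c.text}`));
// 0/0/4:abcd, 3/4/7:defg, 6/7/10:ghij
```

Because overlap < chunkLen, `end` is always greater than `mid`, which is what guarantees forward progress.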

Phase 4: User Story 2 — Asymmetric Boundary Rules (Priority: P2)

Goal: Add separator-aware splitting with asymmetric rules: left boundary picks leftmost separator (priority-agnostic, separator excluded from overlap), right boundary picks rightmost occurrence of highest-priority separator within remaining budget.

Independent Test: Pass text with mixed separators (["\n\n", "\n", ". ", " "]) and verify right boundaries prefer high-priority separators, left boundaries pick leftmost separator regardless of type, separator text excluded from overlap, and budget is shared (left first, right gets remainder).

Tests for User Story 2 ⚠️

NOTE: Write these tests FIRST, ensure they FAIL before implementation

T011 [P] [US2] Write unit tests in tests/splitStringOnChunks.test.ts for right boundary: highest-priority separator preferred (FR-012), rightmost occurrence within budget, fallback to lower-priority separators, hard-split when no separator found.
T012 [P] [US2] Write unit tests in tests/splitStringOnChunks.test.ts for left boundary: leftmost separator picked regardless of priority (FR-010), longest match on tie, separator text excluded from overlap (FR-011), start === mid when no separator in overlap window.
T013 [P] [US2] Write unit tests in tests/splitStringOnChunks.test.ts for budget sharing: left boundary resolved first, right boundary uses remainingBudget = chunkLen - (mid - start) (FR-008). Test with chunkLen=100, overlap of 30 chars → right window is 70 chars.

Implementation for User Story 2

T014 [US2] Implement left boundary resolver in src/splitStringOnChunks.ts: scan overlap window [mid - overlap, mid) for all separator occurrences, pick leftmost position, on tie pick longest separator, set start to position after separator end. If no separator found, start = mid.
T015 [US2] Implement right boundary resolver in src/splitStringOnChunks.ts: compute remainingBudget = chunkLen - (mid - start), iterate separators by priority, for each find rightmost occurrence in [mid, mid + remainingBudget) via lastIndexOf semantics, split at first separator found. Hard-split at mid + remainingBudget if none found.
T016 [US2] Integrate boundary resolvers into the core loop: replace hard-split logic with left-then-right resolution per FR-008. Ensure end is set after the separator text (included in chunk).
T017 [US2] Verify all T011, T012, T013 tests pass via npm test

Checkpoint: Function uses asymmetric boundary rules. All US1 and US2 tests pass.

Phase 5: User Story 3 — Edge Case Handling (Priority: P3)

Goal: Handle all edge cases gracefully: empty text, short text, no matching separators, overlap >= chunkLen, chunkLen <= 0, empty separators array, separators longer than remaining text.

Independent Test: Pass empty strings, single-character texts, texts with no matching separators, and extreme parameter values. Verify correct output or empty arrays without errors.

Tests for User Story 3 ⚠️

NOTE: Write these tests FIRST, ensure they FAIL before implementation

T018 [P] [US3] Write unit tests in tests/splitStringOnChunks.test.ts for edge cases: empty text → [], chunkLen <= 0 → [], overlap >= chunkLen → clamped, text shorter than chunkLen → single chunk, empty separators array → hard split, separator longer than remaining text → skipped.

Implementation for User Story 3

T019 [US3] Review and harden edge case handling in src/splitStringOnChunks.ts: ensure separator search skips separators longer than the search window, empty separators array falls back to hard split, and all validation from T006 is complete. Fix any failing edge case tests.
T020 [US3] Verify all T018 tests pass via npm test

Checkpoint: All edge cases handled. All US1, US2, and US3 tests pass.

Phase 6: Polish & Cross-Cutting Concerns

Purpose: Property-based tests, performance validation, build verification

T021 [P] Write property-based tests in tests/splitStringOnChunks.prop.ts using fast-check: generate random text, chunkLen, overlap, separators. Assert invariants: coverage (FR-009), continuity (FR-005), size limit (FR-006), overlap limit (FR-007), text consistency (FR-003), forward progress (ordering).
T022 [P] Write performance test in tests/splitStringOnChunks.test.ts: generate 1MB text string, verify splitStringOnChunks completes in under 1 second (SC-005).
T023 Verify npm run build produces dist/splitStringOnChunks.js and dist/index.js with zero errors
T024 Run quickstart.md validation: execute the usage example from specs/001-string-chunk-splitter/quickstart.md and verify output is correct

Dependencies & Execution Order

Phase Dependencies

Setup (Phase 1): No dependencies — can start immediately
Foundational (Phase 2): Depends on Setup completion — BLOCKS all user stories
User Story 1 (Phase 3): Depends on Foundational. Core loop — no separator logic.
User Story 2 (Phase 4): Depends on US1 completion (refines the same function with separator logic)
User Story 3 (Phase 5): Depends on US2 completion (hardens existing implementation)
Polish (Phase 6): Depends on all user stories being complete

Within Each User Story

Tests MUST be written and FAIL before implementation
Implementation tasks are sequential (they modify the same file)
Story complete before moving to next priority

Parallel Opportunities

T007, T008 can run in parallel (different test sections, same file but independent describe blocks)
T011, T012, T013 can run in parallel (different test sections)
T021, T022 can run in parallel (different test files/sections)
T003, T005 can run in parallel (different files)

Parallel Example: User Story 2

# Launch all test tasks for User Story 2 together:
Task: "T011 - Right boundary tests in tests/splitStringOnChunks.test.ts"
Task: "T012 - Left boundary tests in tests/splitStringOnChunks.test.ts"
Task: "T013 - Budget sharing tests in tests/splitStringOnChunks.test.ts"

# Then implement sequentially (same file):
Task: "T014 - Left boundary resolver"
Task: "T015 - Right boundary resolver"
Task: "T016 - Integrate into core loop"

Implementation Strategy

MVP First (User Story 1 Only)

Complete Phase 1: Setup
Complete Phase 2: Foundational (input validation)
Complete Phase 3: User Story 1 (hard-split core loop)
STOP and VALIDATE: All US1 tests pass, function works with separators: []
Demo: function correctly chunks text with overlap via hard boundaries

Incremental Delivery

Complete Setup + Foundational → Foundation ready
Add User Story 1 → Test → MVP: basic chunking works
Add User Story 2 → Test → Separator-aware splitting works
Add User Story 3 → Test → All edge cases handled
Polish → Property tests + performance validated

Notes

[P] tasks = different files or independent test sections, no dependencies
[Story] label maps task to specific user story for traceability
All user stories modify the same source file (src/splitStringOnChunks.ts), so stories are sequential (US1 → US2 → US3)
Tests for each story can be written in parallel within the story
Commit after each completed phase
Stop at any checkpoint to validate independently

And only now do we start coding

/speckit.implement

We wait 20 minutes

Spec-Kit program

export interface ChunkInput {
  text: string;
  chunkLen: number;
  overlap: number;
  separators: string[];
  start?: number;
}

export interface Chunk {
  start: number;
  mid: number;
  end: number;
  text: string;
}

export function splitStringOnChunks(input: ChunkInput): Chunk[] {
  const { text, chunkLen, separators } = input;
  const start = input.start ?? 0;

  if (text.length === 0 || chunkLen <= 0 || start >= text.length) {
    return [];
  }

  const overlap = Math.min(input.overlap, chunkLen - 1);

  const chunks: Chunk[] = [];
  let mid = start;

  while (mid < text.length) {
    // Resolve left boundary (overlap start)
    const overlapStart = Math.max(0, mid - overlap);
    const chunkStart = resolveLeftBoundary(text, overlapStart, mid, separators);

    // Resolve right boundary using remaining budget
    const remainingBudget = chunkLen - (mid - chunkStart);
    const chunkEnd = resolveRightBoundary(text, mid, remainingBudget, separators);

    chunks.push({
      start: chunkStart,
      mid,
      end: chunkEnd,
      text: text.slice(chunkStart, chunkEnd),
    });

    mid = chunkEnd;
  }

  return chunks;
}

function resolveLeftBoundary(
  text: string,
  windowStart: number,
  mid: number,
  separators: string[],
): number {
  if (separators.length === 0 || windowStart >= mid) {
    return mid;
  }

  // Find the leftmost separator occurrence in [windowStart, mid)
  // On tie (same position), pick the longest separator
  let bestPos = -1;
  let bestLen = 0;

  for (const sep of separators) {
    if (sep.length === 0) continue;

    // Scan for all occurrences of this separator in the window
    let searchFrom = windowStart;
    while (searchFrom < mid) {
      const idx = text.indexOf(sep, searchFrom);
      if (idx === -1 || idx >= mid) break;

      if (bestPos === -1 || idx < bestPos || (idx === bestPos && sep.length > bestLen)) {
        bestPos = idx;
        bestLen = sep.length;
      }

      // Only need the first (leftmost) occurrence of this separator
      break;
    }
  }

  if (bestPos === -1) {
    // No separator found — no overlap
    return mid;
  }

  // Overlap starts after the separator (separator excluded from overlap)
  return Math.min(bestPos + bestLen, mid);
}

function resolveRightBoundary(
  text: string,
  mid: number,
  budget: number,
  separators: string[],
): number {
  const windowEnd = Math.min(mid + budget, text.length);

  if (windowEnd >= text.length) {
    return text.length;
  }

  if (separators.length === 0) {
    return windowEnd;
  }

  // Iterate separators by priority (highest first)
  // For each, find the rightmost occurrence in [mid, windowEnd)
  for (const sep of separators) {
    if (sep.length === 0) continue;

    // Find rightmost occurrence: search backwards from windowEnd
    const searchWindow = text.slice(mid, windowEnd);
    const lastIdx = searchWindow.lastIndexOf(sep);

    if (lastIdx !== -1) {
      // Split after the separator (separator included in chunk)
      return mid + lastIdx + sep.length;
    }
  }

  // No separator found — hard split
  return windowEnd;
}

Spec-Kit tests

import { describe, it, expect } from "vitest";
import * as fc from "fast-check";
import { splitStringOnChunks } from "../src/splitStringOnChunks.js";

describe("splitStringOnChunks - property-based tests", () => {
  const separatorsArb = fc.array(
    fc.constantFrom(" ", "\n", "\n\n", ". ", ", ", "; ", "  "),
    { minLength: 0, maxLength: 4 },
  );

  it("coverage: union of [mid, end) covers entire text from start", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 500 }),
        fc.integer({ min: 5, max: 200 }),
        fc.integer({ min: 0, max: 100 }),
        separatorsArb,
        (text, chunkLen, overlap, separators) => {
          const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
          if (chunks.length === 0) return;

          expect(chunks[0]!.mid).toBe(0);
          expect(chunks[chunks.length - 1]!.end).toBe(text.length);
        },
      ),
      { numRuns: 200 },
    );
  });

  it("continuity: chunks[i+1].mid === chunks[i].end", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 500 }),
        fc.integer({ min: 5, max: 200 }),
        fc.integer({ min: 0, max: 100 }),
        separatorsArb,
        (text, chunkLen, overlap, separators) => {
          const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
          for (let i = 0; i < chunks.length - 1; i++) {
            expect(chunks[i + 1]!.mid).toBe(chunks[i]!.end);
          }
        },
      ),
      { numRuns: 200 },
    );
  });

  it("size limit: chunk.end - chunk.start <= chunkLen", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 500 }),
        fc.integer({ min: 5, max: 200 }),
        fc.integer({ min: 0, max: 100 }),
        separatorsArb,
        (text, chunkLen, overlap, separators) => {
          const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
          for (const chunk of chunks) {
            expect(chunk.end - chunk.start).toBeLessThanOrEqual(chunkLen);
          }
        },
      ),
      { numRuns: 200 },
    );
  });

  it("overlap limit: chunk.mid - chunk.start <= clamped overlap", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 500 }),
        fc.integer({ min: 5, max: 200 }),
        fc.integer({ min: 0, max: 100 }),
        separatorsArb,
        (text, chunkLen, overlap, separators) => {
          const clampedOverlap = Math.min(overlap, chunkLen - 1);
          const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
          for (const chunk of chunks) {
            expect(chunk.mid - chunk.start).toBeLessThanOrEqual(clampedOverlap);
          }
        },
      ),
      { numRuns: 200 },
    );
  });

  it("text consistency: chunk.text === text.slice(chunk.start, chunk.end)", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 500 }),
        fc.integer({ min: 5, max: 200 }),
        fc.integer({ min: 0, max: 100 }),
        separatorsArb,
        (text, chunkLen, overlap, separators) => {
          const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
          for (const chunk of chunks) {
            expect(chunk.text).toBe(text.slice(chunk.start, chunk.end));
          }
        },
      ),
      { numRuns: 200 },
    );
  });

  it("forward progress: chunks[i].mid < chunks[i+1].mid", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 500 }),
        fc.integer({ min: 5, max: 200 }),
        fc.integer({ min: 0, max: 100 }),
        separatorsArb,
        (text, chunkLen, overlap, separators) => {
          const chunks = splitStringOnChunks({ text, chunkLen, overlap, separators });
          for (let i = 0; i < chunks.length - 1; i++) {
            expect(chunks[i]!.mid).toBeLessThan(chunks[i + 1]!.mid);
          }
        },
      ),
      { numRuns: 200 },
    );
  });
});

// Unit tests (a separate test file from the property-based tests above)
import { describe, it, expect } from "vitest";
import { splitStringOnChunks, type Chunk } from "../src/splitStringOnChunks.js";

// Helper to verify invariants on any chunk result
function assertChunkInvariants(
  chunks: Chunk[],
  text: string,
  chunkLen: number,
  overlap: number,
  inputStart: number,
): void {
  if (chunks.length === 0) return;

  // FR-004: First chunk mid === input start
  expect(chunks[0]!.mid).toBe(inputStart);

  for (let i = 0; i < chunks.length; i++) {
    const chunk = chunks[i]!;

    // FR-003: text consistency
    expect(chunk.text).toBe(text.slice(chunk.start, chunk.end));

    // FR-006: chunk size limit
    expect(chunk.end - chunk.start).toBeLessThanOrEqual(chunkLen);

    // FR-007: overlap size limit
    expect(chunk.mid - chunk.start).toBeLessThanOrEqual(overlap);

    // Basic ordering
    expect(chunk.start).toBeLessThanOrEqual(chunk.mid);
    expect(chunk.mid).toBeLessThanOrEqual(chunk.end);
  }

  // FR-005: continuity
  for (let i = 0; i < chunks.length - 1; i++) {
    expect(chunks[i + 1]!.mid).toBe(chunks[i]!.end);
  }

  // FR-009: coverage — union of [mid, end) covers [inputStart, text.length)
  expect(chunks[0]!.mid).toBe(inputStart);
  expect(chunks[chunks.length - 1]!.end).toBe(text.length);
}

describe("splitStringOnChunks", () => {
  // =====================
  // US1: Basic Text Chunking
  // =====================

  describe("US1: basic chunking (hard-split, no separators)", () => {
    it("splits text into chunks covering the entire input", () => {
      const text = "a".repeat(500);
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 100,
        overlap: 20,
        separators: [],
      });

      assertChunkInvariants(chunks, text, 100, 20, 0);
      expect(chunks.length).toBeGreaterThan(1);
    });

    it("returns single chunk when text is shorter than chunkLen", () => {
      const text = "hello world";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 100,
        overlap: 10,
        separators: [],
      });

      expect(chunks).toHaveLength(1);
      expect(chunks[0]).toEqual({
        start: 0,
        mid: 0,
        end: text.length,
        text,
      });
    });

    it("first chunk has start === mid === 0 when start is 0", () => {
      const text = "a".repeat(200);
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 50,
        overlap: 10,
        separators: [],
      });

      expect(chunks[0]!.start).toBe(0);
      expect(chunks[0]!.mid).toBe(0);
    });

    it("maintains continuity: chunks[i+1].mid === chunks[i].end", () => {
      const text = "a".repeat(300);
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 80,
        overlap: 15,
        separators: [],
      });

      for (let i = 0; i < chunks.length - 1; i++) {
        expect(chunks[i + 1]!.mid).toBe(chunks[i]!.end);
      }
    });

    it("no chunk exceeds chunkLen", () => {
      const text = "a".repeat(500);
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 60,
        overlap: 10,
        separators: [],
      });

      for (const chunk of chunks) {
        expect(chunk.end - chunk.start).toBeLessThanOrEqual(60);
      }
    });

    it("overlap does not exceed overlap param", () => {
      const text = "a".repeat(500);
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 60,
        overlap: 10,
        separators: [],
      });

      for (const chunk of chunks) {
        expect(chunk.mid - chunk.start).toBeLessThanOrEqual(10);
      }
    });
  });

  describe("US1: start offset", () => {
    it("first chunk has mid === start when start > 0", () => {
      const text = "a".repeat(200);
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 50,
        overlap: 10,
        separators: [],
        start: 50,
      });

      expect(chunks[0]!.mid).toBe(50);
    });

    it("first chunk has overlap when start > 0 and overlap > 0 with separators", () => {
      // Use text with separators so overlap can be resolved
      const text = "aaa bbb ccc ddd eee fff ggg hhh iii jjj kkk lll mmm nnn";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 20,
        overlap: 10,
        separators: [" "],
        start: 20,
      });

      expect(chunks[0]!.mid).toBe(20);
      expect(chunks[0]!.start).toBeLessThan(chunks[0]!.mid);
    });

    it("first chunk has no overlap when start > 0 but no separators in window", () => {
      const text = "a".repeat(200);
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 50,
        overlap: 10,
        separators: [],
        start: 50,
      });

      // No separators → no overlap per FR-010
      expect(chunks[0]!.start).toBe(chunks[0]!.mid);
    });

    it("covers text from start to end", () => {
      const text = "a".repeat(200);
      const startOffset = 50;
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 50,
        overlap: 10,
        separators: [],
        start: startOffset,
      });

      assertChunkInvariants(chunks, text, 50, 10, startOffset);
    });
  });

  // =====================
  // US2: Asymmetric Boundary Rules
  // =====================

  describe("US2: right boundary — separator priority", () => {
    it("prefers highest-priority separator at right boundary", () => {
      // Text with both \n\n and spaces. chunkLen big enough to reach \n\n
      const text = "Hello world\n\nSecond paragraph here";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 30,
        overlap: 5,
        separators: ["\n\n", " "],
      });

      // First chunk should end right after \n\n (indices 11-12), i.e. at index 13, since it's highest priority
      expect(chunks[0]!.end).toBe(13); // "Hello world\n\n"
    });

    it("picks rightmost occurrence of highest-priority separator", () => {
      const text = "aaa bbb ccc ddd eee";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 19,
        overlap: 0,
        separators: [" "],
      });

      // chunkLen (19) equals text.length (19), so the whole text fits in one chunk
      expect(chunks).toHaveLength(1);
      expect(chunks[0]!.end).toBe(19);
    });

    it("falls back to lower-priority separator when higher not found", () => {
      const text = "Hello world this is text";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 15,
        overlap: 0,
        separators: ["\n\n", " "],
      });

      // No \n\n in text, should fall back to " "
      // First chunk: budget=15, rightmost " " in [0,15) is at pos 11 ("Hello world ")
      expect(chunks[0]!.end).toBe(12); // "Hello world " — end after the space
    });

    it("hard-splits when no separator found in budget window", () => {
      const text = "abcdefghijklmnopqrstuvwxyz";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 10,
        overlap: 0,
        separators: ["\n"],
      });

      // No \n in text, should hard-split at 10
      expect(chunks[0]!.end).toBe(10);
    });
  });

  describe("US2: left boundary — leftmost separator", () => {
    it("picks leftmost separator regardless of priority", () => {
      // Overlap window should contain both " " and "\n"
      // " " at earlier position should win even though "\n" is higher priority
      const text = "aaa bbb\nccc ddd eee fff ggg";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 15,
        overlap: 10,
        separators: ["\n", " "],
      });

      // Second chunk: mid should be at chunks[0].end
      // Overlap window looks back from mid. The leftmost separator in the window wins.
      if (chunks.length > 1) {
        const chunk1 = chunks[1]!;
        // The overlap should start after the leftmost separator in the window
        expect(chunk1.start).toBeLessThan(chunk1.mid);
      }
    });

    it("picks longest match when separators start at same position", () => {
      // Construct text where \n\n and \n overlap at the same position
      // and only one occurrence exists in the overlap window
      const text = "aaaa\n\nbbbbbbbbbb cccccccccc";
      //             01234 56789...
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 20,
        overlap: 10,
        separators: ["\n\n", "\n"],
      });

      // Second chunk: mid = first chunk's end
      // Overlap window looks back from mid. The \n\n at index 4 should be found.
      // Since \n\n (len 2) and \n (len 1) both start at index 4,
      // \n\n wins → overlap starts at index 6 (after \n\n), not index 5 (after \n)
      if (chunks.length > 1) {
        const chunk1 = chunks[1]!;
        // Overlap starts after the \n\n separator
        expect(chunk1.start).toBe(6); // position after \n\n
      }
    });

    it("separator excluded from overlap text (FR-011)", () => {
      const text = "hello world foo bar baz qux";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 15,
        overlap: 10,
        separators: [" "],
      });

      for (let i = 1; i < chunks.length; i++) {
        const chunk = chunks[i]!;
        if (chunk.start < chunk.mid) {
          // Overlap text should not start with the separator
          const overlapText = text.slice(chunk.start, chunk.mid);
          expect(overlapText[0]).not.toBe(" ");
        }
      }
    });

    it("no overlap when no separator in overlap window", () => {
      const text = "abcdefghijklmnopqrstuvwxyz";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 10,
        overlap: 5,
        separators: ["\n"],
      });

      // No \n anywhere, so no overlap for any chunk
      for (const chunk of chunks) {
        expect(chunk.start).toBe(chunk.mid);
      }
    });
  });

  describe("US2: budget sharing", () => {
    it("right boundary uses remaining budget after overlap", () => {
      // Build text where overlap will consume some of the chunkLen budget
      const text = "aaa bbb ccc ddd eee fff ggg hhh iii jjj";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 20,
        overlap: 10,
        separators: [" "],
      });

      for (const chunk of chunks) {
        // Total chunk length must not exceed chunkLen
        expect(chunk.end - chunk.start).toBeLessThanOrEqual(20);
      }

      // Verify all invariants hold with separators
      assertChunkInvariants(chunks, text, 20, 10, 0);
    });

    it("chunkLen=100 with 30-char overlap leaves 70 for right boundary", () => {
      // Create text with a separator at position 30 so overlap resolves to ~30 chars
      const prefix = "a".repeat(29) + " "; // 30 chars, separator at 29
      const body = "b".repeat(100); // plenty of text
      const text = prefix + body;
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 100,
        overlap: 35,
        separators: [" "],
        start: 30,
      });

      // First chunk: mid=30, overlap window [0,30), leftmost " " at 29
      // start = 30 (after the space at 29), so overlap = 0 chars
      // Actually the space is at index 29, overlap starts after it = 30 = mid
      // So start === mid (separator right at boundary edge)
      // Let's just verify invariants
      assertChunkInvariants(chunks, text, 100, 35, 30);
    });
  });

  // =====================
  // US3: Edge Case Handling
  // =====================

  describe("US3: edge cases", () => {
    it("returns empty array for empty text", () => {
      expect(splitStringOnChunks({
        text: "",
        chunkLen: 10,
        overlap: 5,
        separators: [" "],
      })).toEqual([]);
    });

    it("returns empty array for chunkLen <= 0", () => {
      expect(splitStringOnChunks({
        text: "hello",
        chunkLen: 0,
        overlap: 0,
        separators: [],
      })).toEqual([]);

      expect(splitStringOnChunks({
        text: "hello",
        chunkLen: -5,
        overlap: 0,
        separators: [],
      })).toEqual([]);
    });

    it("clamps overlap when overlap >= chunkLen", () => {
      const text = "hello world foo bar";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 10,
        overlap: 100, // way larger than chunkLen
        separators: [" "],
      });

      // Should not hang or crash — overlap clamped to chunkLen - 1
      expect(chunks.length).toBeGreaterThan(0);
      for (const chunk of chunks) {
        expect(chunk.end - chunk.start).toBeLessThanOrEqual(10);
        expect(chunk.mid - chunk.start).toBeLessThanOrEqual(9); // clamped to chunkLen - 1
      }
    });

    it("returns single chunk when text is shorter than chunkLen", () => {
      const text = "hi";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 100,
        overlap: 10,
        separators: [" "],
      });

      expect(chunks).toHaveLength(1);
      expect(chunks[0]).toEqual({
        start: 0,
        mid: 0,
        end: 2,
        text: "hi",
      });
    });

    it("falls back to hard split with empty separators array", () => {
      const text = "abcdefghij" + "klmnopqrst";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 10,
        overlap: 5,
        separators: [],
      });

      // With no separators, should hard-split and have no overlaps
      expect(chunks.length).toBeGreaterThan(1);
      for (const chunk of chunks) {
        expect(chunk.start).toBe(chunk.mid); // no overlap without separators
      }
    });

    it("skips separator longer than remaining text in search window", () => {
      const text = "abcde";
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 3,
        overlap: 0,
        separators: ["abcdef"], // longer than text
      });

      // Should not crash, should hard-split
      expect(chunks.length).toBeGreaterThan(0);
      assertChunkInvariants(chunks, text, 3, 0, 0);
    });

    it("returns empty array when start >= text.length", () => {
      expect(splitStringOnChunks({
        text: "hello",
        chunkLen: 10,
        overlap: 0,
        separators: [],
        start: 5,
      })).toEqual([]);

      expect(splitStringOnChunks({
        text: "hello",
        chunkLen: 10,
        overlap: 0,
        separators: [],
        start: 100,
      })).toEqual([]);
    });
  });

  // =====================
  // Performance
  // =====================

  describe("performance", () => {
    it("processes 1MB text in under 1 second (SC-005)", () => {
      // Build 1MB text with spaces every ~6 chars
      const segment = "abcde ";
      const repeats = Math.ceil(1_000_000 / segment.length);
      const text = segment.repeat(repeats);

      const startTime = performance.now();
      const chunks = splitStringOnChunks({
        text,
        chunkLen: 1000,
        overlap: 200,
        separators: ["\n\n", "\n", ". ", " "],
      });
      const elapsed = performance.now() - startTime;

      expect(elapsed).toBeLessThan(1000);
      expect(chunks.length).toBeGreaterThan(0);
    });
  });
});

Method #5 – OpenSpec: a slightly smaller framework

Time: 0:20
One-shot, no errors
Markdown: 197 lines; tests: 261 lines

OpenSpec one-shotted the algorithm very quickly, much faster than speckit, and the code is decent. Either I've gotten better at explaining specs to LLMs over dozens of experiments, or it really is a good framework.

OpenSpec suffers from the same problem as the other frameworks: heavy overhead. The feature takes 197 lines of md files versus 70 in the hand-written specs, and all 197 lines have to be read and maintained. That would be fine if every line were useful, but the information density is low and there is a lot of repetition; it is simply an inconvenient experience.

The tests exist, but they suffer from the same low information density: they check the same thing many times in the same way. If I didn't have the master tests from 'Writing the code ourselves', I couldn't say with confidence that everything works as intended.
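One way to raise the information density of such tests is to state the contract once as a set of invariants and run that single check over many inputs. Below is a minimal sketch, assuming the chunk shape used throughout this article (start, mid, end, text). The helper name findInvariantViolations and the exact rule set are hypothetical; this is not the master test suite.

```typescript
interface Chunk {
  start: number;
  mid: number;
  end: number;
  text: string;
}

// Hypothetical helper: collects human-readable violations instead of throwing,
// so a single call reports every broken rule at once.
function findInvariantViolations(
  chunks: Chunk[],
  text: string,
  chunkLen: number,
  overlap: number,
  start = 0,
): string[] {
  const problems: string[] = [];
  let expectedMid = start; // new content must continue where the previous chunk ended

  chunks.forEach((c, i) => {
    if (c.mid !== expectedMid) problems.push(`#${i}: mid ${c.mid}, expected ${expectedMid}`);
    if (c.start > c.mid) problems.push(`#${i}: start is after mid`);
    if (c.end <= c.mid) problems.push(`#${i}: no new content`);
    if (c.mid - c.start > overlap) problems.push(`#${i}: overlap longer than ${overlap}`);
    if (c.end - c.start > chunkLen) problems.push(`#${i}: chunk longer than ${chunkLen}`);
    if (c.text !== text.slice(c.start, c.end)) problems.push(`#${i}: text does not match slice`);
    expectedMid = c.end;
  });

  // A non-empty result must cover the text up to its very end.
  if (chunks.length > 0 && expectedMid !== text.length) {
    problems.push(`coverage stops at ${expectedMid}, text length is ${text.length}`);
  }
  return problems;
}

const sample = "aaa bbb ccc";
const good: Chunk[] = [
  { start: 0, mid: 0, end: 8, text: "aaa bbb " },
  { start: 4, mid: 8, end: 11, text: "bbb ccc" },
];
console.log(JSON.stringify(findInvariantViolations(good, sample, 8, 4))); // []
```

A check like this can be called from every test case, or from a property-based test, so each individual test only needs to add the one concrete expectation it is really about.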

OpenSpec chat

Details

Initial prompt

/opsx:propose we are building splitStringOnChunks() TypeScript nodejs function that will cut incoming text on chunks for embeddings
encoding. Supports left overlap and separators.
As an input accepts a single object with:
        text - original text
        chunkLen - max len of a chunk
        overlap - max len of an overlap
        start - from what index start chunking text. this way the first chunk can have an overlap
        separators - an array of non empty string separators in desc of importance order
As an output there should be an array of chunks, each chunk has:
        start - from what index in the original text the chunk starts
        mid - from where the new chunk text starts and overlap ends, [start, mid) - is the overlap
        end - where the chunk ends
        text - the full chunk text [start, end)

first we try to find left boundary and match the most left separator without any priority, just try to find the biggest overlap
if no separator found we skip overlap for this chunk mid = start

then we use unspent chunkLen budget and try to match the right boundary
try to find the most priority the most right separator
if no separator found we may split the text midword

for the left boundary I don't want include separatos in the chunk
for the right boundary I want to include separtors in the chunk

This produced several files:

openspec/changes/split-string-on-chunks/proposal.md

Why

We need a text chunking utility for preparing text for embeddings encoding. The function must split text into chunks of bounded size with configurable overlap to preserve context across chunk boundaries, and use separator-aware splitting to produce cleaner chunk boundaries.

What Changes

Add a new splitStringOnChunks() TypeScript function that:
- Accepts a configuration object with text, chunkLen, overlap, start, and separators
- Returns an array of chunk descriptors with start, mid, end, and text fields
- Supports left overlap: finds the largest overlap by matching the most-left separator (no priority), excludes separator from chunk text
- Supports right boundary: uses remaining chunkLen budget, matches the most-right occurrence of the highest-priority separator, includes separator in chunk text
- Falls back to mid-word splitting when no separator matches

Capabilities

New Capabilities

text-chunking: Core text chunking logic with overlap and separator-aware boundary detection

Modified Capabilities

Impact

New source file under src/ for the chunking function
New test file for the chunking function
No external dependencies required — pure string manipulation

openspec/changes/split-string-on-chunks/design.md

Context

We need a utility function to split text into chunks suitable for embeddings encoding. The function operates on raw strings with index-based positioning, enabling callers to control chunking from any starting point and carry overlap from previous chunks. No external dependencies are needed — this is pure string manipulation.

Goals / Non-Goals

Goals:

Provide a single, stateless function splitStringOnChunks() that chunks text with overlap
Support separator-aware boundary detection with configurable separator priority
Left boundary: find largest possible overlap by matching the most-left separator (no priority ordering), exclude separator from chunk text
Right boundary: use remaining chunkLen budget, prefer the most-right occurrence of the highest-priority separator, include separator in chunk text
Gracefully handle edge cases: no separators found, text shorter than chunkLen, empty text

Non-Goals:

Streaming or async chunking
Token-based length (we operate on character length only)
Recursive/hierarchical chunking strategies
Encoding-aware splitting (e.g., UTF-16 surrogate pairs)

Decisions

Single function, single object input/output

The function takes one config object and returns an array of chunk descriptors. This keeps the API surface minimal and easy to extend without breaking changes.

Left boundary: most-left separator, no priority

For the overlap region [start - overlap, start), we search for any separator occurrence and pick the one closest to the beginning of the overlap window. This maximizes overlap size. The separator itself is excluded from the chunk — the chunk starts after the separator. If no separator is found, mid = start (no overlap for this chunk).

Alternative considered: Priority-based left boundary — rejected because maximizing overlap is more valuable than separator importance on the left side.

Right boundary: highest-priority, most-right separator

For the right boundary, we consume the remaining chunkLen budget after the overlap. Within that window, we find the most-right occurrence of the highest-priority separator. The separator is included in the chunk text. If no separator is found, we split mid-word at the budget limit.

Alternative considered: Most-right of any separator — rejected because right boundary quality matters more for semantic coherence, so separator priority should be respected.

Index-based chunk descriptors

Each chunk returns start, mid, end, and text. The mid field separates overlap from new content: [start, mid) is overlap, [mid, end) is new. This lets callers distinguish overlap from fresh content without re-parsing.

Risks / Trade-offs

Multi-character separators: Searching for all separators at every position has O(text × separators × maxSepLen) cost → acceptable for typical chunk sizes (a few thousand chars) and small separator lists.
Mid-word splits: When no separator matches the right boundary, we split mid-word. This is a deliberate trade-off: we guarantee bounded chunk size over clean boundaries.
Overlap skip: If no separator is found in the overlap window, the chunk has no overlap (mid = start). Callers should be aware that overlap is best-effort.

openspec/changes/split-string-on-chunks/tasks.md

1. Core Implementation

1.1 Create src/splitStringOnChunks.ts with the function signature, input type, and output type
1.2 Implement left boundary overlap detection: search overlap window for most-left separator, exclude separator from chunk, fall back to mid=start
1.3 Implement right boundary detection: use remaining chunkLen budget, find most-right occurrence of highest-priority separator, include separator, fall back to mid-word split
1.4 Implement the main chunking loop: iterate from start to end of text, producing chunk descriptors and advancing position

2. Testing

2.1 Add tests for basic chunking: no overlap, single separator, full text coverage
2.2 Add tests for left overlap: separator found (most-left), separator excluded from chunk, no separator falls back to mid=start
2.3 Add tests for right boundary: priority-based separator selection, fallback to lower-priority, mid-word split when no separator
2.4 Add tests for edge cases: empty text, start at end, text shorter than chunkLen, multi-character separators
2.5 Add tests for budget accounting: overlap consumes budget, chunk length never exceeds chunkLen

openspec/changes/split-string-on-chunks/specs/text-chunking/spec.md

ADDED Requirements

Requirement: Function signature and input validation

The splitStringOnChunks() function SHALL accept a single object with the following properties:

text (string): the original text to chunk
chunkLen (number): maximum length of a chunk
overlap (number): maximum length of overlap
start (number): index in the original text from which to begin chunking
separators (string[]): array of non-empty string separators in descending order of importance

The function SHALL return an array of chunk objects.

Scenario: Valid input produces chunks

WHEN called with { text: "hello world foo bar", chunkLen: 10, overlap: 0, start: 0, separators: [" "] }
THEN the function returns an array of chunk objects covering the entire text

Scenario: Empty text returns empty array

WHEN called with { text: "", chunkLen: 10, overlap: 0, start: 0, separators: [" "] }
THEN the function returns an empty array

Scenario: Start at end of text returns empty array

WHEN called with { text: "hello", chunkLen: 10, overlap: 0, start: 5, separators: [" "] }
THEN the function returns an empty array

Requirement: Chunk descriptor format

Each chunk object SHALL have the following properties:

start (number): index in the original text where the chunk starts
mid (number): index where the new content starts (overlap is [start, mid))
end (number): index where the chunk ends (exclusive)
text (string): the full chunk text equal to text.slice(start, end)

Scenario: Chunk text matches slice

WHEN a chunk is returned with start: 2, end: 8
THEN chunk.text SHALL equal originalText.slice(2, 8)

Scenario: Mid equals start when no overlap

WHEN a chunk has no overlap (first chunk with start: 0 and overlap: 0)
THEN chunk.mid SHALL equal chunk.start

Requirement: Left boundary overlap detection

When overlap > 0 and start > 0, the function SHALL search for the left boundary in the region [start - overlap, start). It SHALL find the most-left occurrence of any separator (regardless of priority) to maximize overlap size. The separator itself SHALL NOT be included in the chunk — the chunk starts after the separator.

Scenario: Left boundary matches most-left separator

WHEN text is "aaa.bbb.ccc.ddd", start is 12, overlap is 8, separators are ["."]
THEN the chunk's start SHALL be 4 (after the first . found in the overlap window [4, 12)), giving maximum overlap

Scenario: Left boundary excludes the separator

WHEN text is "hello world test", start is 11, overlap is 6, separators are [" "]
THEN the chunk starts after the space separator (the space is not part of the chunk text)

Scenario: No separator found in overlap window

WHEN no separator is found in the overlap region
THEN mid SHALL equal start and the chunk has no overlap (the chunk starts at start)

Requirement: Right boundary detection with separator priority

After determining the left boundary, the function SHALL use the remaining chunkLen budget to find the right boundary. It SHALL search for the most-right occurrence of the highest-priority separator within the budget window. The separator SHALL be included in the chunk text.

Scenario: Right boundary uses highest-priority separator

WHEN text is "hello world. foo bar", start is 0, chunkLen is 18, separators are [". ", " "]
THEN the chunk ends after ". " (the higher-priority separator), even though " " appears further right

Scenario: Right boundary falls back to lower-priority separator

WHEN the highest-priority separator is not found in the right boundary window
THEN the function SHALL try the next separator in priority order

Scenario: Right boundary includes separator in chunk

WHEN a separator is matched at the right boundary
THEN the chunk text SHALL include the separator characters

Scenario: No separator found splits mid-word

WHEN no separator is found within the chunkLen budget
THEN the chunk SHALL end at exactly start + chunkLen (or end of text), splitting mid-word if necessary

Requirement: Full text coverage

The function SHALL produce chunks that together cover the entire text from start to the end. Each chunk's mid value SHALL equal the previous chunk's end value (or start for the first chunk when no overlap is found), ensuring no text is skipped.

Scenario: Consecutive chunks cover full text

WHEN text is "the quick brown fox jumps over the lazy dog" with chunkLen 15, overlap 5, start 0, separators [" "]
THEN concatenating the non-overlap portions text.slice(chunk.mid, chunk.end) of all chunks SHALL produce the original text from start onward

Scenario: Last chunk may be shorter than chunkLen

WHEN the remaining text is shorter than chunkLen
THEN the last chunk SHALL contain all remaining text without padding

Requirement: ChunkLen budget accounting

The total length of each chunk (end - start) SHALL NOT exceed chunkLen. The overlap consumes part of the budget, and the right boundary uses the remainder.

Scenario: Overlap consumes part of budget

WHEN overlap is found spanning 4 characters
THEN the right boundary search window is limited to chunkLen - 4 characters from mid

Scenario: Chunk length never exceeds chunkLen

WHEN any chunk is produced
THEN chunk.end - chunk.start SHALL be less than or equal to chunkLen

I read it and everything looks fine, so let's compile the spec into code:

/opsx:apply

It works faster than speckit: the code was written in under 5 minutes. We check the implementation against our secret tests. Everything is green, a one-shot.

OpenSpec program

Details

  
interface SplitStringOnChunksInput {
  text: string;
  chunkLen: number;
  overlap: number;
  start: number;
  separators: string[];
}

interface Chunk {
  start: number;
  mid: number;
  end: number;
  text: string;
}

/**
 * Find the most-left occurrence of any separator in the window [windowStart, windowEnd).
 * Returns the index right after the separator, or -1 if none found.
 */
function findLeftBoundary(
  text: string,
  windowStart: number,
  windowEnd: number,
  separators: string[],
): number {
  let bestPos = -1; // position after separator

  for (const sep of separators) {
    const idx = text.indexOf(sep, windowStart);
    if (idx >= 0 && idx + sep.length <= windowEnd) {
      const afterSep = idx + sep.length;
      if (bestPos === -1 || afterSep < bestPos) {
        bestPos = afterSep;
      }
    }
  }

  return bestPos;
}

/**
 * Find the most-right occurrence of the highest-priority separator in [windowStart, windowEnd).
 * Returns the index after the separator (end of chunk), or -1 if none found.
 */
function findRightBoundary(
  text: string,
  windowStart: number,
  windowEnd: number,
  separators: string[],
): number {
  for (const sep of separators) {
    // Find the most-right occurrence of this separator
    let bestIdx = -1;
    let searchFrom = windowStart;
    while (searchFrom < windowEnd) {
      const idx = text.indexOf(sep, searchFrom);
      if (idx < 0 || idx + sep.length > windowEnd) break;
      bestIdx = idx;
      searchFrom = idx + 1;
    }
    if (bestIdx >= 0) {
      return bestIdx + sep.length;
    }
  }

  return -1;
}

function splitStringOnChunks(input: SplitStringOnChunksInput): Chunk[] {
  const { text, chunkLen, overlap, separators } = input;
  const chunks: Chunk[] = [];
  let pos = input.start;

  while (pos < text.length) {
    let chunkStart: number;
    let mid: number;

    // Left boundary: find overlap
    if (overlap > 0 && pos > 0) {
      const windowStart = Math.max(0, pos - overlap);
      const leftBound = findLeftBoundary(text, windowStart, pos, separators);
      if (leftBound >= 0) {
        chunkStart = leftBound;
        mid = pos;
      } else {
        chunkStart = pos;
        mid = pos;
      }
    } else {
      chunkStart = pos;
      mid = pos;
    }

    // Right boundary: use remaining budget
    const overlapLen = mid - chunkStart;
    const budget = chunkLen - overlapLen;
    const maxEnd = Math.min(text.length, mid + budget);

    let chunkEnd: number;
    if (maxEnd >= text.length) {
      // Remaining text fits in budget
      chunkEnd = text.length;
    } else {
      // Try to find a separator for clean break
      const rightBound = findRightBoundary(text, mid, maxEnd, separators);
      if (rightBound >= 0) {
        chunkEnd = rightBound;
      } else {
        // Split mid-word
        chunkEnd = maxEnd;
      }
    }

    chunks.push({
      start: chunkStart,
      mid,
      end: chunkEnd,
      text: text.slice(chunkStart, chunkEnd),
    });

    // Advance position to end of new content
    pos = chunkEnd;
  }

  return chunks;
}

export = splitStringOnChunks;

OpenSpec tests

Details

  
import splitStringOnChunks = require("./splitStringOnChunks");
import test = require("node:test");
import assert = require("node:assert/strict");
const { describe, it } = test;

// 2.1 Basic chunking: no overlap, single separator, full text coverage
describe("basic chunking", () => {
  it("splits text on spaces with no overlap", () => {
    const chunks = splitStringOnChunks({
      text: "hello world foo bar",
      chunkLen: 10,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    assert.ok(chunks.length > 0);
    for (const c of chunks) {
      assert.equal(c.text, c.text.length > 0 ? c.text : "");
      assert.equal(c.mid, c.start); // no overlap
    }
  });

  it("covers full text with no gaps", () => {
    const text = "the quick brown fox jumps over the lazy dog";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 15,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    const reconstructed = chunks.map((c) => text.slice(c.mid, c.end)).join("");
    assert.equal(reconstructed, text);
  });

  it("chunk.text matches text.slice(start, end)", () => {
    const text = "hello world foo bar baz";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 12,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    for (const c of chunks) {
      assert.equal(c.text, text.slice(c.start, c.end));
    }
  });
});

// 2.2 Left overlap tests
describe("left overlap", () => {
  it("finds most-left separator to maximize overlap", () => {
    // text: "aaa.bbb.ccc.ddd", start=12, overlap=8
    // overlap window: [4, 12) = "bbb.ccc."
    // most-left separator "." is at index 3 — but that's before window start (4)
    // In window [4,12): "." at index 7, "." at index 11
    // most-left in window after sep: index 7+1=8
    const text = "aaa.bbb.ccc.ddd";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 20,
      overlap: 8,
      start: 12,
      separators: ["."],
    });
    assert.equal(chunks.length, 1);
    assert.equal(chunks[0]!.start, 8); // after "." at index 7
    assert.equal(chunks[0]!.mid, 12);
  });

  it("excludes separator from chunk text on left boundary", () => {
    const text = "hello world test";
    // start=11 ("test"), overlap=6, window=[5,11)=" world"
    // " " at index 5, after sep = 6
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 20,
      overlap: 6,
      start: 11,
      separators: [" "],
    });
    assert.equal(chunks[0]!.start, 6); // after space at 5
    assert.ok(!chunks[0]!.text.startsWith(" ")); // separator excluded
  });

  it("falls back to mid=start when no separator in overlap window", () => {
    const text = "abcdefghijklmnop";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 10,
      overlap: 5,
      start: 8,
      separators: [" "],
    });
    assert.equal(chunks[0]!.start, 8);
    assert.equal(chunks[0]!.mid, 8);
  });
});

// 2.3 Right boundary tests
describe("right boundary", () => {
  it("uses highest-priority separator", () => {
    // text: "hello world. foo bar", chunkLen=18, separators=[". ", " "]
    // window [0, 18) = "hello world. foo b"
    // ". " at index 11 → end at 13
    // " " at index 16 ("b") — more right but lower priority
    // Should pick ". " because it's higher priority
    const text = "hello world. foo bar";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 18,
      overlap: 0,
      start: 0,
      separators: [". ", " "],
    });
    assert.equal(chunks[0]!.end, 13); // after ". "
    assert.equal(chunks[0]!.text, "hello world. ");
  });

  it("falls back to lower-priority separator when higher not found", () => {
    const text = "hello world foo";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 13,
      overlap: 0,
      start: 0,
      separators: [". ", " "],
    });
    // no ". " found, falls back to " "
    // " " most-right in [0,13) = "hello world f" → " " at 11
    assert.equal(chunks[0]!.end, 12); // after " " at 11
  });

  it("includes separator in chunk text", () => {
    const text = "hello world foo";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 10,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    // First chunk should end after a space, including the space
    assert.ok(chunks[0]!.text.endsWith(" "));
  });

  it("splits mid-word when no separator found", () => {
    const text = "abcdefghijklmnop";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 6,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    assert.equal(chunks[0]!.end, 6);
    assert.equal(chunks[0]!.text, "abcdef");
  });
});

// 2.4 Edge cases
describe("edge cases", () => {
  it("empty text returns empty array", () => {
    const chunks = splitStringOnChunks({
      text: "",
      chunkLen: 10,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    assert.deepEqual(chunks, []);
  });

  it("start at end of text returns empty array", () => {
    const chunks = splitStringOnChunks({
      text: "hello",
      chunkLen: 10,
      overlap: 0,
      start: 5,
      separators: [" "],
    });
    assert.deepEqual(chunks, []);
  });

  it("text shorter than chunkLen returns single chunk", () => {
    const chunks = splitStringOnChunks({
      text: "hi",
      chunkLen: 100,
      overlap: 0,
      start: 0,
      separators: [" "],
    });
    assert.equal(chunks.length, 1);
    assert.equal(chunks[0]!.text, "hi");
  });

  it("handles multi-character separators", () => {
    const text = "part1---part2---part3";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 10,
      overlap: 0,
      start: 0,
      separators: ["---"],
    });
    assert.equal(chunks[0]!.text, "part1---");
    assert.equal(chunks[0]!.end, 8);
  });
});

// 2.5 Budget accounting
describe("budget accounting", () => {
  it("overlap consumes part of chunkLen budget", () => {
    const text = "aaa bbb ccc ddd eee";
    // start=8 ("ccc ddd eee"), overlap=5, chunkLen=12
    // overlap window [3,8) = " bbb ", " " at 3 → afterSep=4
    // overlapLen = 8-4 = 4, budget = 12-4 = 8
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 12,
      overlap: 5,
      start: 8,
      separators: [" "],
    });
    assert.ok(chunks[0]!.end - chunks[0]!.start <= 12);
  });

  it("chunk length never exceeds chunkLen", () => {
    const text = "the quick brown fox jumps over the lazy dog";
    const chunkLen = 15;
    const chunks = splitStringOnChunks({
      text,
      chunkLen,
      overlap: 5,
      start: 0,
      separators: [" "],
    });
    for (const c of chunks) {
      assert.ok(
        c.end - c.start <= chunkLen,
        `Chunk [${c.start},${c.end}) length ${c.end - c.start} exceeds chunkLen ${chunkLen}`,
      );
    }
  });

  it("full coverage with overlap: non-overlap portions reconstruct text", () => {
    const text = "the quick brown fox jumps over the lazy dog";
    const chunks = splitStringOnChunks({
      text,
      chunkLen: 15,
      overlap: 5,
      start: 0,
      separators: [" "],
    });
    const reconstructed = chunks.map((c) => text.slice(c.mid, c.end)).join("");
    assert.equal(reconstructed, text);
  });
});

Method #5.1 – OpenSpec writes 100% of its own specs

Failure. There are bugs that could not be fixed

Let's try handing every decision over to OpenSpec: we only state the task and do not interfere with the design process.

Claude correctly figured out that embeddings need both overlap and prioritized separators. However, the separators cannot be configured, for example.

Still, Claude made the same dumb architectural decisions: the overlap cuts words in half, which breaks the context for the LLM. There are also major bugs: at certain overlap values something blows the whole output to pieces. Claude simply gets lazy and writes no concrete tests for the overlap; it only checks that an overlap exists.

I ran this cycle a couple more times:

/opsx:propose fix bugs and make thing better
/opsx:apply

Some of the bugs went away. But both the left and the right chunk boundaries can still easily cut words, which means the basic functionality is not solved.
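Whether a boundary cuts a word can be checked mechanically rather than by eye. Below is a minimal sketch, assuming that a "word" is a run of ASCII letters and digits (a deliberate simplification) and that the boundary positions are known as indices into the original text. The helper name cutsWord is hypothetical.

```typescript
// Hypothetical helper: true when index i falls strictly inside a word,
// i.e. the characters on both sides of the boundary are word characters.
// Assumption: a word is a run of ASCII letters and digits.
function cutsWord(text: string, i: number): boolean {
  const isWordChar = (ch: string | undefined): boolean =>
    ch !== undefined && /[A-Za-z0-9]/.test(ch);
  return i > 0 && i < text.length && isWordChar(text[i - 1]) && isWordChar(text[i]);
}

const sampleText = "Hello world. something long";
console.log(cutsWord(sampleText, 7)); // true: the boundary falls between "w" and "orld"
console.log(cutsWord(sampleText, 13)); // false: the boundary comes right after ". "
```

Asserting `!cutsWord(text, chunk.start)` and `!cutsWord(text, chunk.end)` for every chunk is the kind of concrete test that is missing here. It is only valid for inputs where a separator is actually available inside the window, since a hard split is otherwise allowed.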

Вайбкод чат

/opsx:propose Write a TypeScript function that will split text on chunks for embeddings. Test the function with tests and property-based testing.

/opsx:apply

The overlap can cut a word in half. There are also major bugs:

text="Hello world. something long" chunkSize=16 overlap=6
The result is a plain explosion:
   "Hello world. "
   "llo world. "
   "lo world. "
   "o world. "
   " world. "
   "world. "
   "orld. "
   "rld. "
   "ld. "
   "d. "
   ". something long"

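A blow-up like the one above is easy to catch with one concrete check. The sketch below is not part of the generated tests; `countRedundantChunks` is a hypothetical helper that counts chunks which are only a suffix of the previous chunk and so add no new text. It assumes the input is not highly repetitive: on text like `"aaaa"` two legitimate chunks can look like suffixes of each other, and an offset-based check would be more robust there.

```typescript
// Hypothetical check for illustration; not part of the generated test suite.
// A chunk is counted as redundant when the previous chunk already ends with it,
// i.e. the splitter moved its start forward without moving its end.
function countRedundantChunks(chunks: string[]): number {
  let redundant = 0;
  for (let i = 1; i < chunks.length; i++) {
    if (chunks[i - 1]!.endsWith(chunks[i]!)) redundant++;
  }
  return redundant;
}

// The output listed above for chunkSize=16, overlap=6.
const blowUp = [
  "Hello world. ",
  "llo world. ",
  "lo world. ",
  "o world. ",
  " world. ",
  "world. ",
  "orld. ",
  "rld. ",
  "ld. ",
  "d. ",
  ". something long",
];

console.log(countRedundantChunks(blowUp)); // 9
```

A test asserting that this count is 0 would have failed on the first version, whereas a test that only compares the tail of one chunk with the head of the next can still pass.
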
We do it a couple more times:

/opsx:propose fix bugs and make thing better
/opsx:apply

/opsx:propose fix bugs and make thing better
/opsx:apply

The original bug is fixed, and now we get:

text="Hello world. something long" chunkSize=16 overlap=6
   "Hello world. "
   "orld. something "
   "thing long"

The overlap cuts words, but that is by design. The right boundary cuts words too when there are \n\n separators. There are most likely more bugs, but at this point I can't be bothered to look for them.
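The word cuts in the output above can be found mechanically. The sketch below is not taken from the generated module or its tests; `cutsWord`, `chunkOffsets` and `findWordCuts` are hypothetical helper names. Offsets are recovered by searching for each chunk in the text, which assumes every chunk is a verbatim substring and that chunks appear in order.

```typescript
// Hypothetical helpers for illustration; not part of the generated module.
type Span = { start: number; end: number };

// True when a cut at `index` lands inside a word, i.e. the characters
// on both sides of the cut are non-whitespace.
function cutsWord(text: string, index: number): boolean {
  if (index <= 0 || index >= text.length) return false;
  return /\S/.test(text.charAt(index - 1)) && /\S/.test(text.charAt(index));
}

// Recovers [start, end) offsets for string chunks. Assumes each chunk is a
// verbatim substring of `text` that starts after the previous chunk's start.
function chunkOffsets(text: string, chunks: string[]): Span[] {
  const spans: Span[] = [];
  let from = 0;
  for (const chunk of chunks) {
    const start = text.indexOf(chunk, from);
    if (start === -1) throw new Error(`chunk not found: ${JSON.stringify(chunk)}`);
    spans.push({ start, end: start + chunk.length });
    from = start + 1;
  }
  return spans;
}

// Lists every chunk boundary that cuts a word.
function findWordCuts(text: string, chunks: string[]): string[] {
  const problems: string[] = [];
  chunkOffsets(text, chunks).forEach((span, i) => {
    if (cutsWord(text, span.start)) problems.push(`chunk ${i}: start ${span.start}`);
    if (cutsWord(text, span.end)) problems.push(`chunk ${i}: end ${span.end}`);
  });
  return problems;
}

const sampleText = "Hello world. something long";
const fixedChunks = ["Hello world. ", "orld. something ", "thing long"];
console.log(findWordCuts(sampleText, fixedChunks));
// reports "chunk 1: start 7" and "chunk 2: start 17"
```

For this input only the left boundaries are flagged: the second chunk starts inside "world" and the third inside "something", while every right boundary falls after a space.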

'OpenSpec writes 100% of its own specs' program

export interface ChunkOptions {
  /** Maximum chunk size in characters */
  chunkSize: number;
  /** Number of overlapping characters between consecutive chunks (default: 0) */
  overlap?: number;
}

/**
 * Find the last paragraph boundary (\n\n) in text up to maxPos.
 * Returns the index after the boundary, or -1 if none found.
 */
function findParagraphBoundary(text: string, maxPos: number): number {
  const idx = text.lastIndexOf("\n\n", maxPos - 1);
  if (idx <= 0) return -1;
  return idx + 2; // position after the double newline
}

/**
 * Find the last sentence boundary (.!? followed by whitespace) in text up to maxPos.
 * Returns the index after the whitespace, or -1 if none found.
 */
function findSentenceBoundary(text: string, maxPos: number): number {
  for (let i = maxPos - 1; i >= 1; i--) {
    const ch = text[i];
    if (ch === " " || ch === "\n" || ch === "\t") {
      const prev = text[i - 1];
      if (prev === "." || prev === "!" || prev === "?") {
        return i + 1; // position after the whitespace
      }
    }
  }
  return -1;
}

/**
 * Find the last word boundary (space) in text up to maxPos.
 * Returns the index after the space, or -1 if none found.
 */
function findWordBoundary(text: string, maxPos: number): number {
  for (let i = maxPos; i >= 1; i--) {
    if (text[i - 1] === " ") {
      return i;
    }
  }
  return -1;
}

/**
 * Split text into chunks suitable for embedding models.
 *
 * Chunks are boundary-aware: the algorithm prefers splitting at paragraph
 * or sentence boundaries. Falls back to a hard split at chunkSize when
 * no boundary is found.
 *
 * @param text - The input text to split
 * @param options - Chunking configuration
 * @returns Array of text chunks
 */
export function splitTextForEmbeddings(
  text: string,
  options: ChunkOptions,
): string[] {
  const { chunkSize, overlap = 0 } = options;

  if (chunkSize < 1) {
    throw new Error("chunkSize must be at least 1");
  }

  if (overlap < 0) {
    throw new Error("overlap must not be negative");
  }

  if (overlap >= chunkSize) {
    throw new Error("overlap must be less than chunkSize");
  }

  if (text.length === 0) {
    return [];
  }

  if (text.length <= chunkSize) {
    return [text];
  }

  const chunks: string[] = [];
  let pos = 0;

  while (pos < text.length) {
    const remaining = text.length - pos;

    // If remaining text fits in one chunk, take it all
    if (remaining <= chunkSize) {
      chunks.push(text.slice(pos));
      break;
    }

    // Try to find the best boundary within the chunk window
    const window = text.slice(pos, pos + chunkSize);
    let splitAt: number;

    // 1. Try paragraph boundary (only if chunk would be large enough for overlap)
    const paraBoundary = findParagraphBoundary(window, chunkSize);
    if (paraBoundary > overlap) {
      splitAt = paraBoundary;
    } else {
      // 2. Try sentence boundary (only if chunk would be large enough for overlap)
      const sentBoundary = findSentenceBoundary(window, chunkSize);
      if (sentBoundary > overlap) {
        splitAt = sentBoundary;
      } else {
        // 3. Try word boundary (only if chunk would be large enough for overlap)
        const wordBoundary = findWordBoundary(window, chunkSize);
        if (wordBoundary > overlap) {
          splitAt = wordBoundary;
        } else {
          // 4. Hard split
          splitAt = chunkSize;
        }
      }
    }

    chunks.push(window.slice(0, splitAt));

    // Advance position, accounting for overlap
    pos += splitAt - overlap;
  }

  return chunks;
}

'OpenSpec writes 100% of its own specs' tests

import { describe, it, expect } from "vitest";
import * as fc from "fast-check";
import { splitTextForEmbeddings } from "./chunk-text.js";

// ---------------------------------------------------------------------------
// Unit Tests
// ---------------------------------------------------------------------------

describe("splitTextForEmbeddings", () => {
  describe("basic behavior", () => {
    it("returns empty array for empty text", () => {
      expect(splitTextForEmbeddings("", { chunkSize: 100 })).toEqual([]);
    });

    it("splits long text into chunks not exceeding chunkSize", () => {
      const text = "a".repeat(1000);
      const result = splitTextForEmbeddings(text, { chunkSize: 200 });
      for (const chunk of result) {
        expect(chunk.length).toBeLessThanOrEqual(200);
      }
      expect(result.length).toBeGreaterThan(1);
    });
  });

  describe("boundary-aware splitting", () => {
    it("splits at paragraph boundary", () => {
      const text = "First paragraph content.\n\nSecond paragraph that is long enough to matter for splitting purposes.";
      const result = splitTextForEmbeddings(text, { chunkSize: 40 });
      expect(result[0]).toBe("First paragraph content.\n\n");
    });

    it("splits at sentence boundary when no paragraph boundary", () => {
      const text = "First sentence here. Second sentence that continues for a while and is quite long.";
      const result = splitTextForEmbeddings(text, { chunkSize: 30 });
      expect(result[0]).toBe("First sentence here. ");
    });

    it("hard splits when no boundary found", () => {
      const text = "abcdefghijklmnopqrstuvwxyz";
      const result = splitTextForEmbeddings(text, { chunkSize: 10 });
      expect(result[0]).toBe("abcdefghij");
      expect(result[0].length).toBe(10);
    });

    it("splits at word boundary when no sentence boundary", () => {
      const text = "hello world foobar baz qux";
      const result = splitTextForEmbeddings(text, { chunkSize: 12 });
      expect(result[0]).toBe("hello world ");
    });

    it("reconstructs text with word-boundary splitting and overlap", () => {
      const text = "alpha beta gamma delta epsilon zeta eta theta";
      const overlap = 5;
      const result = splitTextForEmbeddings(text, { chunkSize: 15, overlap });
      for (const chunk of result) {
        expect(chunk.length).toBeLessThanOrEqual(15);
      }
      for (let i = 0; i < result.length - 1; i++) {
        const tail = result[i].slice(-overlap);
        const head = result[i + 1].slice(0, overlap);
        expect(tail).toBe(head);
      }
      let reconstructed = result[0];
      for (let i = 1; i < result.length; i++) {
        reconstructed += result[i].slice(overlap);
      }
      expect(reconstructed).toBe(text);
    });
  });

  describe("overlap", () => {
    it("overlapping chunks share characters at boundaries", () => {
      const text = "abcdefghijklmnopqrstuvwxyz";
      const result = splitTextForEmbeddings(text, {
        chunkSize: 10,
        overlap: 3,
      });
      for (let i = 0; i < result.length - 1; i++) {
        const tail = result[i].slice(-3);
        const head = result[i + 1].slice(0, 3);
        expect(tail).toBe(head);
      }
    });

    it("reconstructs original text with overlap removal", () => {
      const text = "abcdefghijklmnopqrstuvwxyz";
      const overlap = 3;
      const result = splitTextForEmbeddings(text, {
        chunkSize: 10,
        overlap,
      });
      let reconstructed = result[0];
      for (let i = 1; i < result.length; i++) {
        reconstructed += result[i].slice(overlap);
      }
      expect(reconstructed).toBe(text);
    });

    it("throws when overlap >= chunkSize", () => {
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: 5, overlap: 5 }),
      ).toThrow("overlap must be less than chunkSize");

      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: 5, overlap: 10 }),
      ).toThrow("overlap must be less than chunkSize");
    });
  });

  describe("regression: overlap + boundary interaction", () => {
    it("handles overlap larger than early sentence boundary", () => {
      const text = "Done. " + "x".repeat(100);
      const result = splitTextForEmbeddings(text, { chunkSize: 20, overlap: 10 });

      // All chunks must respect size limit
      for (const chunk of result) {
        expect(chunk.length).toBeLessThanOrEqual(20);
      }

      // Overlap must match between consecutive chunks
      for (let i = 0; i < result.length - 1; i++) {
        const tail = result[i].slice(-10);
        const head = result[i + 1].slice(0, 10);
        expect(tail).toBe(head);
      }

      // Reconstruction must produce original text
      let reconstructed = result[0];
      for (let i = 1; i < result.length; i++) {
        reconstructed += result[i].slice(10);
      }
      expect(reconstructed).toBe(text);
    });

    it("reconstructs multi-sentence text with overlap", () => {
      const text = "Hello world. This is a test. Another sentence here. Final part.";
      const overlap = 10;
      const result = splitTextForEmbeddings(text, { chunkSize: 30, overlap });

      let reconstructed = result[0];
      for (let i = 1; i < result.length; i++) {
        reconstructed += result[i].slice(overlap);
      }
      expect(reconstructed).toBe(text);
    });
  });

  describe("validation", () => {
    it("throws when chunkSize is 0", () => {
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: 0 }),
      ).toThrow("chunkSize must be at least 1");
    });

    it("throws when chunkSize is negative", () => {
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: -1 }),
      ).toThrow("chunkSize must be at least 1");
    });

    it("throws when overlap is negative", () => {
      expect(() =>
        splitTextForEmbeddings("hello", { chunkSize: 10, overlap: -1 }),
      ).toThrow("overlap must not be negative");
    });
  });

  describe("roundtrip integrity", () => {
    it("joining chunks without overlap reproduces original text", () => {
      const text =
        "The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs.";
      const result = splitTextForEmbeddings(text, {
        chunkSize: 30,
        overlap: 0,
      });
      expect(result.join("")).toBe(text);
    });
  });
});

// ---------------------------------------------------------------------------
// Property-Based Tests
// ---------------------------------------------------------------------------

describe("splitTextForEmbeddings (property-based)", () => {
  // Arbitrary for valid chunk options: chunkSize >= 1, 0 <= overlap < chunkSize
  const chunkOptionsArb = fc
    .record({
      chunkSize: fc.integer({ min: 1, max: 500 }),
      overlapRatio: fc.double({ min: 0, max: 0.9, noNaN: true }),
    })
    .map(({ chunkSize, overlapRatio }) => ({
      chunkSize,
      overlap: Math.floor(chunkSize * overlapRatio),
    }));

  it("no chunk exceeds chunkSize", () => {
    fc.assert(
      fc.property(fc.string({ minLength: 0, maxLength: 2000 }), chunkOptionsArb, (text, opts) => {
        const chunks = splitTextForEmbeddings(text, opts);
        for (const chunk of chunks) {
          expect(chunk.length).toBeLessThanOrEqual(opts.chunkSize);
        }
      }),
      { numRuns: 200 },
    );
  });

  it("no text is lost when overlap is 0", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 2000 }),
        fc.integer({ min: 1, max: 500 }),
        (text, chunkSize) => {
          const chunks = splitTextForEmbeddings(text, {
            chunkSize,
            overlap: 0,
          });
          expect(chunks.join("")).toBe(text);
        },
      ),
      { numRuns: 200 },
    );
  });

  it("overlap correctness: tail of chunk N matches head of chunk N+1", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 2000 }),
        chunkOptionsArb.filter((o) => o.overlap > 0),
        (text, opts) => {
          const chunks = splitTextForEmbeddings(text, opts);
          for (let i = 0; i < chunks.length - 1; i++) {
            const tail = chunks[i].slice(-opts.overlap);
            const head = chunks[i + 1].slice(0, opts.overlap);
            expect(tail).toBe(head);
          }
        },
      ),
      { numRuns: 200 },
    );
  });

  it("chunk count is within expected bounds", () => {
    fc.assert(
      fc.property(
        fc.string({ minLength: 1, maxLength: 2000 }),
        chunkOptionsArb,
        (text, opts) => {
          const chunks = splitTextForEmbeddings(text, opts);
          expect(chunks.length).toBeGreaterThanOrEqual(1);
          const step = opts.chunkSize - opts.overlap;
          const maxChunks = Math.ceil(text.length / Math.max(step, 1)) + 1;
          expect(chunks.length).toBeLessThanOrEqual(maxChunks);
        },
      ),
      { numRuns: 200 },
    );
  });
});