A proposal for compact, inspectable instrument patches built by fitting layered synthesis models to real recordings
The idea here is not just to analyze harmonics, fit attack, decay, sustain, and release curves, or interpolate between a few measured notes. Those are useful tools, but they all assume that the structure of the synthesizer is mostly fixed ahead of time.
The more interesting direction is to let the system discover both the parameter values and the extra structure it needs in order to explain the sound well.
That is what I mean by backtuned instrument synthesis.
You begin with a structured instrument model that is intentionally simple. It might contain a tonal layer, an excitation layer, and a resonance layer. You fit that model to recordings of a real instrument. Then you compare the rendered result with the source recording and study what remains unexplained. If the mismatch can be resolved by adjusting the existing parameters, the system does that. If the mismatch is systematic and repeatable, the system is allowed to add a new adjustment layer from a controlled library of possible expansions.
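To make that concrete, here is a minimal sketch of what such a layered model could look like in code. Everything in it is illustrative: the layer classes and their parameters are assumptions, not a defined format, and a real tonal layer would carry pitch- and velocity-dependent laws rather than fixed arrays.

```python
# Minimal sketch of a layered instrument model. All names are illustrative.
import numpy as np

class TonalLayer:
    """Sum of exponentially decaying partials; parameters get fit to recordings."""
    def __init__(self, f0, amps, decays):
        self.f0 = f0
        self.amps = np.asarray(amps, dtype=float)
        self.decays = np.asarray(decays, dtype=float)

    def render(self, t):
        n = np.arange(1, len(self.amps) + 1)[:, None]   # partial numbers 1..K
        return np.sum(self.amps[:, None]
                      * np.exp(-self.decays[:, None] * t)
                      * np.sin(2 * np.pi * self.f0 * n * t), axis=0)

class Patch:
    """An ordered stack of layers; the output is simply their summed renders."""
    def __init__(self, layers):
        self.layers = layers

    def render(self, t):
        return sum(layer.render(t) for layer in self.layers)

# Usage: a three-partial toy "instrument" rendered for one second at 44.1 kHz.
t = np.arange(0, 1.0, 1 / 44100)
audio = Patch([TonalLayer(220.0, [1.0, 0.5, 0.25], [3.0, 4.0, 6.0])]).render(t)
```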
This means the synthesizer is not a fixed object with a fixed number of knobs. It is a growing structured model that can become more descriptive when the evidence says it should.
The end result is not supposed to be a black box. It should compile into a compact patch that engineers and sound designers can inspect from the top down.
There is already a visible path toward this idea in recent synthesis work, and one of the nicest public examples of the earlier part of that journey is Sebastian Lague's video "Coding Adventure: Synthesizing Musical Instruments," where he shows how measured harmonic content, interpolation across the keyboard, and time-varying behavior can already produce surprisingly convincing results.
Video that sparked this note: https://www.youtube.com/watch?v=rRnOtKlg4jA
Comment reference: https://www.youtube.com/watch?v=rRnOtKlg4jA&lc=UgxInJSGgkC35BOXUct4AaABAg
What makes that line of thought exciting is that it shows a real instrument can already be approached through analysis and compact synthesis instead of brute-force sampling. The natural next step is to stop assuming that the final synthesis graph is already known.
The system should be able to say: "The current model explains a lot, but not enough. The remaining error has structure. Therefore I need one more layer."
A useful patch should not look like an opaque pile of hundreds of numbers. It should look like a readable stack of rules.
At the top level, an engineer should be able to open the patch and immediately see something like this in conceptual form:
This instrument has a tonal core defined by a small set of pitch- and velocity-dependent laws. It has a transient correction layer for the onset. It has a resonance layer for body coloration. It has one deeper residual layer for the part the earlier layers could not explain. It has a small number of control surfaces that shape temperament, dynamics, brightness, and response.
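As a sketch of what that top-level view might look like in serialized form, here is an illustrative summary expressed as plain data. The field names and parameter counts are invented for illustration; nothing here is a defined patch format.

```python
# Hypothetical top-level view of a fitted patch. All names and numbers
# are illustrative placeholders, not a real format.
patch_summary = {
    "tonal_core":      {"law": "pitch/velocity-dependent partial laws", "params": 64},
    "transient_layer": {"scope": "onset correction, first ~50 ms",      "params": 12},
    "resonance_layer": {"modes": 8, "role": "body coloration",          "params": 24},
    "residual_layer":  {"role": "stable leftover structure",            "params": 40},
    "controls":        ["temperament", "dynamics", "brightness", "response"],
}
```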
That top layer matters a lot. It lets a technically minded person understand how the patch is structured without first having to descend into every detail. It gives an overview of the instrument as a hierarchy of approximation.
From there, anyone who wants to go deeper can follow the rabbit hole downward. They can open the tonal layer and inspect how overtone balance changes with pitch. They can inspect how decay laws differ in the low and high ranges. They can examine the transient layer and see what kind of correction was added to explain the onset. They can inspect the resonance layer and see which modes, filters, or response curves were introduced. They can inspect the residual layer and determine whether it is modeling something musically meaningful or just compensating for what the higher layers failed to capture.
This is one of the main reasons the system should stay structured. It should produce not only sound, but understanding.
The patch should be thought of as an explanation of the instrument, organized by depth.
Near the surface are the broad rules that matter most musically. These are the things a developer or sound designer might want to edit first: overall harmonic balance, note response, range-dependent color, dynamic behavior, transient sharpness, resonance strength.
Below that are increasingly specific correction layers. These layers exist because the simpler model could not fully match reality. They may capture onset noise, body ringing, range-specific compensation, or other time-varying details that become necessary for perceptual realism.
Deeper still, the patch may contain narrow adjustment layers that exist only because the residual analysis repeatedly found a stable mismatch that deserved its own mechanism.
The important thing is that these layers should be visible as layers. They should not disappear into a giant flattened parameter soup.
This is where the analogy with NEAT becomes useful.
NEAT stands for NeuroEvolution of Augmenting Topologies. Its key idea is that you do not only optimize the weights inside a neural network. You can also grow the network's structure over time when the current topology is too limited.
That same spirit applies here.
A backtuned synthesis system should not only optimize a large fixed parameter set. It should also be allowed to add new structured adjustment layers when the existing synthesis graph cannot explain the data well enough. A new layer might be a transient shaper, a compact modulation law, a note-range correction field, a small resonator bank, or a lightweight residual module. The exact addition matters less than the principle: complexity is introduced only when the evidence supports it.
This should not be random mutation. It should be disciplined structural growth.
The system should have a library of allowed expansions, each with clear semantics and a complexity cost. Whenever the residual after fitting shows stable unexplained structure, candidate expansions can be proposed and scored. If one of them yields a meaningful perceptual improvement at an acceptable complexity cost, it becomes part of the patch.
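A sketch of that proposal-and-scoring step follows, under stated assumptions: fit, render, and perceptual_error are hypothetical helpers, with_layer and param_count are assumed patch methods, and the linear complexity penalty is just one possible scoring rule.

```python
# Disciplined structural growth: propose expansions, score them against a
# complexity cost, and keep at most the best one. Helper functions are assumed.
def try_expansions(patch, recording, library, penalty=1e-3):
    base = perceptual_error(render(patch), recording)
    best, best_score = None, 0.0
    for make_layer in library:                 # each entry builds one candidate layer
        candidate = patch.with_layer(make_layer())
        fit(candidate, recording)              # refit all parameters jointly
        err = perceptual_error(render(candidate), recording)
        score = (base - err) - penalty * candidate.param_count()
        if score > best_score:                 # improvement must outweigh its cost
            best, best_score = candidate, score
    return best                                # None means no expansion earned its place
```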
This gives the project a very different character from both classic manual synthesis and black-box waveform generation. It becomes a compact instrument compiler that can grow deeper layers only when those layers have earned their place.
The piano is a useful example because it clearly exposes the limitations of simple interpolation. The lower and upper ranges do not just differ in pitch. They differ in overtone balance, inharmonicity, decay behavior, excitation, and resonance. A reasonable first model can explain a lot of the piano's tonal core, but a convincing result quickly forces you to acknowledge hammer behavior, body response, coupling, and structured leftover content.
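One concrete piece of piano physics a tonal layer has to absorb is inharmonicity: stiff strings stretch the partials sharp of exact integer multiples. The standard stiff-string law, f_n = n·f0·sqrt(1 + B·n²) with stiffness coefficient B, is well established in the acoustics literature (background knowledge here, not something from the video):

```python
# Inharmonic partial frequencies for a stiff string: f_n = n*f0*sqrt(1 + B*n^2).
import numpy as np

def inharmonic_partials(f0, B, n_partials):
    n = np.arange(1, n_partials + 1)
    return n * f0 * np.sqrt(1.0 + B * n**2)

# Low piano strings have a noticeably larger B, so their upper partials
# drift further sharp; a fitted tonal layer should expose B per note range.
print(inharmonic_partials(27.5, 4e-4, 8))
```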
That makes piano a great demonstration instrument for layered fitting.
The flute is useful for a different reason. It is not only about which harmonics are present, but how the harmonic structure changes while the note is being sustained. The upper partials can evolve, the breath component can shift, and the onset can differ strongly from the stable body of the note. A good backtuned system should therefore be able to fit evolving spectral behavior and, when necessary, add a time-varying correction layer that captures how the timbre develops instead of forcing it into one frozen harmonic profile.
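A minimal way to measure that evolving behavior is to track each harmonic's magnitude frame by frame with a short-time Fourier transform. The sketch below is numpy-only and assumes a mono signal with a known, stable fundamental; a real system would need pitch tracking and care near the Nyquist limit.

```python
# Track per-harmonic amplitude envelopes over time with a plain STFT.
import numpy as np

def harmonic_envelopes(signal, sr, f0, n_harmonics=8, frame=2048, hop=512):
    window = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    bins = [int(np.argmin(np.abs(freqs - k * f0)))
            for k in range(1, n_harmonics + 1)]
    envs = []
    for start in range(0, len(signal) - frame, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        envs.append([spectrum[b] for b in bins])
    return np.asarray(envs)   # shape (frames, harmonics): each row is one instant
```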
Together, those examples show why this should not be just a better additive synthesizer. Additive synthesis can be a strong foundation, but real instruments behave like layered systems. The patch should therefore also be layered.
A useful implementation should work like an engineer, not like a lottery.
It should first fit the obvious tonal structure. Then it should subtract that reconstruction from the original recording and examine the residual. Some residual will be random. Some will come from the recording chain. Some will be consistent and repeatable. That stable repeatable part is the interesting part. It means the current model is missing something real.
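One way to separate the repeatable part from the noise, assuming several aligned takes of the same note are available, is to compare residual spectra across takes: bins that stay strong with low variance are structure the model is missing. In the sketch below, render is the same hypothetical helper as above, and the two-sigma threshold is an arbitrary illustrative choice.

```python
# Keep only residual energy that is consistent across takes. Illustrative only.
import numpy as np

def stable_residual_spectrum(takes, patch, frame=2048):
    window = np.hanning(frame)
    spectra = []
    for take in takes:                       # takes: aligned recordings of one note
        residual = take[:frame] - render(patch)[:frame]
        spectra.append(np.abs(np.fft.rfft(residual * window)))
    spectra = np.asarray(spectra)
    mean, std = spectra.mean(axis=0), spectra.std(axis=0)
    return mean * (mean > 2.0 * std)         # consistent bins survive, noise zeros out
```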
At that point the system should ask a focused question: what is the smallest additional layer that would explain this remaining error well?
Maybe the answer is a short onset-noise model. Maybe it is a body-resonance layer. Maybe the overtone behavior in one pitch region needs its own corrective law. Maybe one part of the note needs a modulation surface that the global envelope model cannot express cleanly.
After adding the new layer, the system refits and evaluates again. If the improvement is real and perceptually meaningful, the layer stays. If not, it is rejected. The fitting process therefore alternates between parameter estimation and controlled structural growth.
That is the core loop.
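Put together, the loop might look like the sketch below, reusing the hypothetical try_expansions from earlier; fit and residual_has_structure are likewise assumptions standing in for real parameter estimation and residual analysis.

```python
# The core loop: alternate parameter fitting with controlled structural growth.
def backtune(patch, recording, library, max_layers=6):
    fit(patch, recording)                         # estimate parameters first
    while len(patch.layers) < max_layers:
        if not residual_has_structure(patch, recording):
            break                                 # leftover error looks like noise
        grown = try_expansions(patch, recording, library)
        if grown is None:
            break                                 # no candidate earned its cost
        patch = grown                             # keep the layer and continue
    return patch
```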
A lot of modern generative systems can produce compelling output, but they are weak when asked to explain themselves. That is a problem if the intended audience includes engineers, synth programmers, and sound designers.
A backtuned instrument patch should be useful even before playback. Someone should be able to inspect the patch and learn something about how the instrument was approximated. They should be able to see which broad laws carry most of the sound, which deeper layers were added later, and where the model still needed compensation.
This also makes the system more editable. If a developer wants the instrument to respond differently under hard velocity, they should not need to retrain the whole thing blindly. If they want to soften the transient, change the temperament, exaggerate the body resonance, or remap brightness to a controller, the structure should make that possible.
The patch should therefore separate core identity from playable controls.
One especially attractive consequence of a structured patch is that higher-level parameters can be bound to MIDI or other performance controls.
That means the fitted patch does not have to remain a static emulation. It can become a living instrument.
An engineer or musician could expose selected controls for temperament, brightness, transient hardness, resonance depth, dynamic scaling, response curvature, or timbral drift. A few well-chosen mappings could radically extend the usefulness of the patch. You could keep the instrument close to its analyzed identity, or deliberately push it away from the source while still staying within the same learned structural language.
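As a sketch of such a binding, assuming a patch object that exposes named macro controls (the set_macro method and the macro names are invented for illustration):

```python
# Map MIDI CC numbers to hypothetical patch macros with value ranges.
cc_map = {
    1:  ("brightness",         0.0, 1.0),   # mod wheel -> spectral tilt
    71: ("resonance_depth",    0.0, 2.0),
    74: ("transient_hardness", 0.0, 1.0),
}

def apply_cc(patch, cc_number, cc_value):
    """Call from a MIDI input handler; cc_value is the standard 0..127 range."""
    if cc_number in cc_map:
        name, lo, hi = cc_map[cc_number]
        patch.set_macro(name, lo + (hi - lo) * cc_value / 127.0)
```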
This is where the project becomes more than emulation.
Once the patch is structured, and once its layers and control surfaces are visible, it becomes possible to navigate between faithful reconstruction and creative divergence. A fitted piano patch could be gently altered into a brighter or colder or more unstable variant. A flute-derived patch could be given a dynamic response that no real flute has. Multiple fitted instruments could perhaps even share or exchange certain layers.
So the same machinery that begins as approximation could become a path toward novel instruments.
A 1 to 2 MB patch of this kind would be exciting not only because it is small, but because it would be small for the right reason. It would not simply be a compressed sample dump. It would be a compact layered description of how the instrument behaves.
That is a much more interesting form of compression.
If successful, it could offer a middle ground that many existing tools miss: far more realism than a traditional hand-built preset, far more compactness and flexibility than a giant multisample library, and far more inspectability than a black-box neural audio model.
The first version should stay disciplined.
Start with one instrument family. Start with dry recordings. Start with a small set of allowed layer types. Fit the tonal core first. Then allow the system to add only a few classes of deeper adjustment layers, each with a clear explanation and an explicit complexity penalty.
If the same kinds of deeper layers keep appearing for the same instrument family, that becomes evidence that the framework is discovering a reusable grammar rather than merely chasing noise.
That would already be a meaningful result.
What makes this idea exciting is that it does not force a choice between elegant but oversimplified synthesis and brute-force realism. It tries to begin with a clean, understandable model, then lets the instrument itself demand the extra complexity it needs.
In other words:
Build a synthesizer that can fit itself to a real instrument, show you clearly how the approximation is layered, and grow new structured adjustment layers only when the sound proves they are needed.
That feels like a direction worth exploring.
This note was directly inspired by Sebastian Lague's work on instrument synthesis and especially by the video linked above. Credit should go where it is due. The point here is not to restate that work, but to suggest what the next step might look like if the synthesis model itself were allowed to grow in a disciplined, inspectable way while being fitted against real recordings.