This document provides some guidelines for developing plugins using the JUCE framework.
There are several things that we want to prioritize as part of the development process. Each item will be explained in more detail below.
The general idea behind "real-time" programming is that a "correct" computation must not only produce the correct output, but must also be completed within a given amount of time. As such, real-time processes are typically defined by a "deadline", after which any result produced by the program is deemed "incorrect" regardless of the result itself.
An audio deadline can be determined by the block size and the sample rate. For example, if your program is asked to process a buffer of 512 samples at 48 kHz, then the deadline is 512 / 48k = ~10.667 milliseconds. However, note that when you are writing a plugin, you need to share that time with the plugin host, as well as any other plugins that may be running concurrently.
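The deadline arithmetic above can be sketched as a small helper (the function name is mine, not from JUCE):

```cpp
#include <cassert>

// Computes the real-time deadline (in milliseconds) for one audio callback:
// the time it takes to play back `blockSize` samples at `sampleRate`.
constexpr double deadlineMilliseconds (int blockSize, double sampleRate)
{
    return 1000.0 * static_cast<double> (blockSize) / sampleRate;
}

// 512 samples at 48 kHz -> ~10.667 ms, matching the example above.
static_assert (deadlineMilliseconds (512, 48000.0) > 10.66
            && deadlineMilliseconds (512, 48000.0) < 10.67);
```

Remember that in a plugin context, your process gets only a fraction of this budget.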
What happens if your real-time program misses a deadline? In audio this usually results in a "buffer under-run", which typically causes a "click" or "glitch" in the output audio. This outcome should be avoided at all costs, especially in systems that may be used by musicians in a live setting.
Thinking about real-time programming in terms of deadlines means that we primarily care about "worst-case" performance. If your real-time deadline is 10 milliseconds, and your plugin can process most buffers in 0.01 milliseconds but takes 1 second every 10th buffer, your "average-case" performance is pretty good, but you're still going to have a buffer under-run once every 10 buffers.
The most important rule of real-time programming is "Don't do anything that could take an unbounded amount of time".
The most obvious example of "something that could take an unbounded amount of time" is acquiring a mutex, as is often done when synchronizing work between multiple threads. There are two reasons why this is not real-time safe:
- You have no guarantee how long the other thread is going to hold the mutex, and therefore no guarantee when the audio thread will be able to acquire it.
- The other thread is likely running at a lower priority than the audio thread, leading to a priority inversion.
Other examples of things that are not real-time safe include:
- Allocating/de-allocating memory (using standard heap memory allocation techniques like `malloc` or `new`)
- Reading from / writing to a file
- Logging
- Algorithms with poor worst-case complexity
To better understand this issue, I recommend that every audio programmer read Ross Bencina's article Real-time audio programming 101: time waits for nothing.
Here are a few tips for making sure that your audio code is real-time safe:
- Thread synchronization:
- Don't use locks! Pretty much ever. There's almost always a better choice.
- If your data is small: use `std::atomic`. (Make sure to check `std::atomic<T>::is_always_lock_free`.)
- If your data is larger and you need to know every time it's updated, use a lock-free queue. I typically use `moodycamel::ReaderWriterQueue` (which is bundled with my `chowdsp_utils` library), but depending on your needs a different type of queue or implementation might work better.
- If your data is larger and you only need to know the "latest" value on the audio thread, check out `chowdsp::RealtimeLatestObject`.
- If all else fails and you really have no choice other than to use a lock, be sure to use a locking mechanism that is real-time safe, such as a "spin lock" (see, e.g. `juce::SpinLock`).
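For the small-data case, a minimal sketch of sharing a single parameter value between the GUI thread and the audio thread with `std::atomic` (the class and method names here are illustrative, not from any particular library):

```cpp
#include <atomic>
#include <cassert>

// Shares one float between a GUI (writer) thread and the audio (reader) thread.
struct GainParameter
{
    // Relaxed ordering is fine here: we only need the latest value,
    // not any ordering relative to other memory operations.
    void setFromGui (float newGain)       { gain.store (newGain, std::memory_order_relaxed); }
    float readOnAudioThread() const       { return gain.load (std::memory_order_relaxed); }

    std::atomic<float> gain { 1.0f };
};

// std::atomic<float> is lock-free on mainstream desktop platforms,
// but it's worth verifying for your target:
static_assert (std::atomic<float>::is_always_lock_free);
```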
- Memory Allocation
- Never call `malloc`, `free`, `new`, `delete`, etc. on the audio thread.
- Try to use data structures with sizes that are known at compile time (e.g. `std::array`).
- Do your best to avoid data structures that allocate memory, such as `std::vector`. In places where you must use them, be sure to pre-allocate the memory used by the data structure (e.g. `std::vector::reserve`). However, you should almost never need to use these data structures, because...
- You should allocate dynamic memory using memory arenas whenever possible. This will be discussed more later on, but memory arenas are usually the most efficient strategy for working with dynamic memory in a way that is real-time safe.
- I/O
- If your audio process needs some data from a file, it should either be "pre-loaded" before the audio processing starts, or it should be loaded on a background thread and passed to the audio thread using a lock-free mechanism as described above.
- If your audio process needs to generate some data that will be saved to a file, or otherwise output by the program (e.g. to `stdout`), break the data into fixed-size chunks, and pass each chunk through a lock-free queue. Make sure that the background thread is always pulling from the queue, otherwise the queue will fill up!
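The queue in question can be as simple as a single-producer/single-consumer ring buffer. Here's a bare-bones sketch of the idea; in practice I'd reach for a proven implementation like `moodycamel::ReaderWriterQueue` rather than rolling my own:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Minimal single-producer/single-consumer ring buffer. One slot is kept
// empty to distinguish "full" from "empty", so Capacity - 1 slots are usable.
template <typename T, size_t Capacity>
struct SpscQueue
{
    // Called from the producer (audio) thread. Never blocks or allocates;
    // returns false if the queue is full.
    bool push (const T& item)
    {
        const auto w = writeIndex.load (std::memory_order_relaxed);
        const auto next = (w + 1) % Capacity;
        if (next == readIndex.load (std::memory_order_acquire))
            return false; // full -- the consumer isn't keeping up!
        buffer[w] = item;
        writeIndex.store (next, std::memory_order_release);
        return true;
    }

    // Called from the consumer (background) thread.
    std::optional<T> pop()
    {
        const auto r = readIndex.load (std::memory_order_relaxed);
        if (r == writeIndex.load (std::memory_order_acquire))
            return std::nullopt; // empty
        T item = buffer[r];
        readIndex.store ((r + 1) % Capacity, std::memory_order_release);
        return item;
    }

    std::array<T, Capacity> buffer {};
    std::atomic<size_t> readIndex { 0 };
    std::atomic<size_t> writeIndex { 0 };
};
```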
- JUCE
- Avoid JUCE's `AudioProcessorValueTreeState` and `ParameterListener` mechanisms. Along with being pretty cumbersome to use (in my opinion), these mechanisms are not real-time safe. `chowdsp_utils` provides replacements in the form of `chowdsp::PluginState` and `chowdsp::SliderAttachment` (and other attachments).
clang-20 shipped with a new "Realtime Sanitizer". The idea is that the sanitizer can catch real-time safety violations at run-time. When I tried it out around a year ago, I ran into some issues with false positives, but maybe those issues have been fixed by now?
For more information on real-time safe programming, check out the following resources:
- Again, Ross Bencina's Real-time audio programming 101: time waits for nothing
- Dave Rowland and Fabian Renn-Giles ADC Talk on real-time safety
- Daniel Anderson's talk on wait-free programming
- Moodycamel's blog on lock-free queues
- My blog on wait-free programming
- Timur Doumler's article on real-time safe spin locks
- Chris Apple and Daniel Trevelyan's talk on real-time sanitizers
While real-time safety is primarily focused on worst-case performance, and on avoiding mistakes that will destroy your worst-case performance, average-case performance is also important. To that end, it's imperative that we keep performance in mind as we write our audio code.
If you go on Stack Overflow, or ask ChatGPT how to optimize audio code, you'll usually be told to reduce the number of divide operations, or use lookup tables for complicated math operations, and so on. While this advice is sometimes helpful, there are often bigger and better optimization opportunities to think about.
The most obvious way to optimize your code is to simply reduce the number of operations that your code is asking the computer to do. There are a few ways to go about this:
- Pre-computation: if you can pre-compute some information (especially at compile time), that can definitely save some CPU cycles. However, be careful about pre-computations that end up using large amounts of memory (we'll talk about this more in a minute).
- Give the compiler as much information as possible (at compile time!) about the computation it's doing. Things like virtual function calls, indirect calls through function pointers, and so on can limit what the compiler is able to assume about your code.
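A simple example of pre-computation (and of reducing divide operations at the same time): pre-compute the reciprocal of the sample rate whenever it changes, instead of dividing by it on every sample. The `Oscillator` struct here is a hypothetical sketch, not a JUCE class:

```cpp
#include <cassert>
#include <cmath>

// Instead of dividing by the sample rate for every sample...
float phaseIncrementSlow (float frequency, float sampleRate)
{
    return frequency / sampleRate;
}

// ...pre-compute the reciprocal once, in the (non-real-time) prepare call:
struct Oscillator
{
    void prepare (float sampleRate) { oneOverSampleRate = 1.0f / sampleRate; }

    // Per-sample work is now a multiply rather than a divide.
    float phaseIncrement (float frequency) const { return frequency * oneOverSampleRate; }

    float oneOverSampleRate = 1.0f / 48000.0f;
};
```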
Probably the best tool for getting to know your compiler, its flags, and its capabilities is Matt Godbolt's Compiler Explorer.
When code runs slowly on a modern CPU, the most likely cause is cache misses. When your program needs to do some work, it needs two things:
- The data it's operating on
- The CPU instructions for which operations to perform
Those things need to come from somewhere! In the ideal scenario, all your data is in the CPU's registers, and all the instructions are in the execution pipeline... then everything can just go. But what if your data is not in registers? What if the instructions that your program needs are not already in the pipeline? Then the CPU has to wait while the data or instructions are fetched. Okay, but fetched from where? Modern CPUs typically have 2-3 layers of cache, each growing progressively bigger and slower, followed by main memory (RAM). Fetching from a Level 1 (L1) cache is usually very fast (~5 cycles), while fetching from main memory is usually very slow (~400 cycles). So if we want to get the best performance out of our code, we want to make sure that we're using our caches as effectively as possible.
For the most part, your CPU's execution pipeline knows what instructions are going to come next, so it can "pre-fetch" the instructions that it's going to be running. However, there are a few things that our code can do that could make it difficult for the pre-fetcher to do its job.
- Branching: While `if` statements (and similar conditional logic) are a fundamental part of programming, we need to be very careful about how we use them in our audio code. When the execution pipeline reaches a conditional statement, how does it decide which "branch" of the conditional statement to fetch? Most CPUs use a "branch predictor" that makes an educated guess about which branch the program is going to take. For predictable branching patterns (like a `for` loop), the branch predictor can usually guess right most of the time, but for highly variable branches, the branch predictor has a hard time doing its job, often leading to a lot of instruction cache misses.
- Dynamic Dispatch: Most dynamic dispatch methods can also cause instruction cache misses. This includes things like function pointers, virtual function calls, and so on. Again, the fundamental idea is that the pre-fetcher doesn't know what instructions are at the other end of the dynamic dispatch, so it can't do its job, leading to an instruction cache miss. Worse, most dynamic dispatch mechanisms make it impossible for the compiler to know what's happening on the other side of the dispatch, thereby precluding optimization opportunities like function inlining.
In small doses, branching and dynamic dispatch are okay, and can sometimes even help performance (by selecting code paths that require less computation). But it's best to avoid over-using them, and definitely avoid using them at the per-sample level whenever possible.
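As a concrete illustration, here's a per-sample branch that can be rewritten in a form that compilers typically lower to branchless min/max instructions (function names are mine):

```cpp
#include <algorithm>
#include <cassert>

// A hard-clipper written with data-dependent branches per sample...
float hardClipBranchy (float x, float limit)
{
    if (x > limit)
        return limit;
    if (x < -limit)
        return -limit;
    return x;
}

// ...versus the same operation via std::clamp, which compilers usually
// compile down to branchless min/max instructions.
float hardClipBranchless (float x, float limit)
{
    return std::clamp (x, -limit, limit);
}
```

For audio signals near the clipping threshold, the branchy version's branches are highly unpredictable, which is exactly the case where the branchless form tends to win.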
There are generally two ways to avoid data cache misses:
- Operate on a small amount of data, so that data can stay in cache throughout the entire operation.
- Operate on data in a predictable manner so that the CPU pre-fetcher can load it into cache before you need it.
First, I think it's useful to separate our audio code's memory into two parts:
- Persistent Memory: this is memory that will need to persist between audio callbacks.
- Scratch Memory: this is memory that we only need during a single audio callback.
For the persistent memory, we just want to make sure that all of the data is allocated into a "linear" block of memory (to help out the pre-fetcher). The easiest way to do that is by allocating everything on the stack. If we do need to allocate anything dynamically, we should do so using an "arena allocator".
The arena is even more useful for scratch memory. As long as we know the proper size for the arena ahead of time, we can dynamically allocate any data we need into the arena. Further, a lot of scratch memory can be re-used by different algorithms within the program, meaning we can get by with a pretty small memory arena that should fit entirely in cache.
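A minimal bump-allocator sketch shows why arenas are real-time safe: the backing memory is allocated once up front, allocations in the audio callback are just a pointer bump, and the whole arena is reset at the end of the callback. (This is an illustration, not chowdsp's actual arena implementation.)

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Arena
{
    // Allocate the backing storage once, before audio processing starts.
    explicit Arena (size_t sizeInBytes) : storage (sizeInBytes) {}

    // Real-time safe: no system calls, just aligned pointer arithmetic.
    void* allocate (size_t numBytes, size_t alignment = alignof (std::max_align_t))
    {
        const auto aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned + numBytes > storage.size())
            return nullptr; // out of space -- size the arena correctly up front!
        offset = aligned + numBytes;
        return storage.data() + aligned;
    }

    // Call at the end of the audio callback to reclaim all scratch memory.
    void reset() { offset = 0; }

    std::vector<std::byte> storage;
    size_t offset = 0;
};
```

Note that `reset()` doesn't free anything; it just rewinds the bump pointer, which is what makes re-using the same scratch memory across algorithms essentially free.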
Along with getting the most out of the CPU's memory caches, we also want to make the most out of other CPU features, including SIMD and instruction pipelining.
Most modern CPUs support 128-bit SIMD registers, wide enough for 4 single-precision floating-point numbers or 2 double-precision floating-point numbers. Newer x64 CPUs also support 256-bit and 512-bit wide SIMD registers. There's a few ways to make use of SIMD instructions (in order of complexity):
- Vector operations: Use functions like `juce::FloatVectorOperations::add()` in place of a `for` loop.
- Write algorithms that operate on vectorized types (for example, `chowdsp::StateVariableFilter<xsimd::batch<float>>`).
- Write DSP kernels with native SIMD implementations.
All three of these strategies have good use-cases. I think option 3 typically lends itself to the best performance, but it's also the most work, and the least portable.
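To illustrate the programming model behind option 2, here's a toy 4-wide "batch" type. Real libraries like xsimd wrap actual SIMD intrinsics; this plain-array version just shows how the same templated DSP code can run on either scalars or vectors:

```cpp
#include <array>
#include <cassert>

// Toy 4-lane vector type (illustrative; not a real SIMD wrapper).
struct Batch4
{
    std::array<float, 4> v {};

    friend Batch4 operator+ (Batch4 a, Batch4 b)
    {
        Batch4 r;
        for (int i = 0; i < 4; ++i)
            r.v[i] = a.v[i] + b.v[i]; // compilers can vectorize this loop
        return r;
    }

    friend Batch4 operator* (Batch4 a, Batch4 b)
    {
        Batch4 r;
        for (int i = 0; i < 4; ++i)
            r.v[i] = a.v[i] * b.v[i];
        return r;
    }
};

// The same templated DSP code now works for float or Batch4
// (or, with a real library, xsimd::batch<float>).
template <typename SampleType>
SampleType applyGainAndOffset (SampleType x, SampleType gain, SampleType offset)
{
    return x * gain + offset;
}
```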
Modern CPUs have multiple execution units, meaning they can perform multiple operations at a time. However, in order to accomplish this, there cannot be any data dependencies between those operations. For example, if the output of one operation becomes the input to the next operation, then the CPU can't pipeline that operation since it's still waiting on the result of the previous one. With that in mind, it can sometimes be more optimal to choose an algorithm that requires more operations, if those operations can be executed in parallel (see, e.g. Estrin's Scheme).
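Polynomial evaluation is the classic example. Horner's method uses the fewest operations, but each step depends on the previous one; Estrin's scheme does one extra multiply in exchange for independent sub-expressions that the CPU can execute in parallel:

```cpp
#include <cassert>
#include <cmath>

// Horner's method: minimal operation count, but a serial dependency chain --
// each multiply-add must wait for the previous result.
float cubicHorner (float x, float a0, float a1, float a2, float a3)
{
    return a0 + x * (a1 + x * (a2 + x * a3));
}

// Estrin's scheme: one extra multiply (x * x), but (a0 + a1*x) and
// (a2 + a3*x) are independent, so the CPU can pipeline them.
float cubicEstrin (float x, float a0, float a1, float a2, float a3)
{
    const float x2 = x * x;
    return (a0 + a1 * x) + x2 * (a2 + a3 * x);
}
```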
In audio code, we often have to compute complicated math functions, like sin(x), log(x), tanh(x) etc. Performing these computations at full-precision can be computationally expensive, and for our purposes we often don't require full-precision.
Older DSP programmers might suggest using a lookup table approximation, which was a very useful optimization technique on older CPUs and DSP chips. However, large lookup tables often lead to data cache misses and lookup table approximations typically can't be evaluated in parallel using SIMD operations, so lookup table performance is often not as good on more modern hardware.
I typically implement my own math approximations using polynomial fits, and various bit hacks where appropriate. This GitHub repository contains a library of math approximations that I've implemented, along with derivations of those approximations.
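As a flavor of what these approximations look like, here's the well-known Padé-type approximant tanh(x) ≈ x(27 + x²)/(27 + 9x²). It's branch-free and SIMD-friendly, and accurate to within roughly 0.02 for |x| ≤ 1 (it diverges for large |x|). To be clear, this is a generic textbook example, not the specific approximations from the repository mentioned above:

```cpp
#include <cassert>
#include <cmath>

// Cheap tanh approximation: x * (27 + x^2) / (27 + 9 x^2).
// Only intended for small inputs (|x| <= ~1); clamp or saturate beyond that.
float tanhApprox (float x)
{
    const float x2 = x * x;
    return x * (27.0f + x2) / (27.0f + 9.0f * x2);
}
```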
- Nic Barker's talk on Data-Oriented Design
- Ryan Fleury's blog about memory arenas
- My talk on memory arenas
- Matt Godbolt's talk on CPUs
- A great article on approximating the `sin()` function to 5 ULP
As in many creative programming endeavours, fast iteration is paramount to developing a good plugin. Slow iteration limits the number of ideas that the developer can try, makes it more difficult and time-consuming to fix bugs, and has negative effects on the developer's physical and mental health.
When I started developing plugins, my debug cycle was ~3-5 minutes. Re-compiling my plugin usually took 1-2 minutes, then I had to open my DAW (another 1-2 minutes), and finally I could load and test the plugin. Since then I've been working a lot on improving my iteration speed when developing plugins.
Unfortunately, compilation speed is limited by the available compilers and linkers for C++, which aren't particularly fast in my opinion (in my experience, Rust tooling is typically slower, while C tooling is usually a little bit faster). However, there's still a lot we can do to improve compilation speed.
The idea of a "unity build" (also called a "jumbo build") is that you combine all of the code in your program into a single translation unit, which is then compiled into a single object file. This greatly improves compilation speed (by reducing the time spent parsing redundant header files) as well as linking speed.
The way I've been setting up unity builds for my plugins is by writing all the code for the plugin in header files (preferably a single header file). Then the header files are all included in a single implementation file that is then compiled. Along with creating a simple unity build workflow, this approach also makes it easy to include the header file(s) in testing code and so on.
Clang allows you to profile your compilation using the `-ftime-trace` compilation flag.
The best debugging cycle that I've settled on relies on debugging via Bitwig Studio. Bitwig is very clever in that it runs its audio engine in a separate process from the application. This means that you can stop and re-start the audio engine without needing to restart the application.
So my debugging cycle nowadays is:
- Compile the plugin
- Install the plugin
- Terminate the Bitwig audio engine
- Re-start the Bitwig audio engine
Depending on how long the plugin compilation takes, this whole cycle usually takes just a few seconds!
If you want to take it even further, BaconPaul has a setup that launches the Bitwig Audio Engine directly from your debugger.
Developing plugins is hard. Plugins are complicated along several dimensions. They need to support a variety of (sometimes quirky) host behaviours. They need to operate across at least two threads, one real-time (the audio thread), and one quasi-real-time (the GUI thread). They need to implement audio DSP algorithms that can be complicated and hard to debug.
The worst thing a plugin developer can do is add unnecessary complexity to their code. Don't implement complicated class hierarchies or design patterns, unless you've already tried at least 3 simpler solutions first. Almost everything worth doing can be done with simple structs and functions. If you are successfully able to avoid adding unnecessary complexity, then when the necessary complexity naturally arises, it won't overwhelm you, or your code.