A signed scientific-range float packed into four normalized u8 channels — the same layout as an RGBA color. No dedicated bit fields. No bitwise operations in decode. Every channel is a plain real number in [0, 1].
All four inputs are raw u8 bytes divided by 255 before use. The formula operates purely on those normalized scalars.
The high half of R's range is negative, the low half positive. In normalized form: threshold at 0.5.
A base-256 positional number. R contributes the high 7 bits of the fractional part, G the middle 8, B the low 8 — 23 bits of significand total, identical in width to IEEE 754 single precision.
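As a rough illustration, the positional weighting can be checked in a few lines of Python. The weights `C0`, `C1`, `C2` are the ones listed in this document's constants table; the helper names `mantissa_from_scalars` and `mantissa_from_bytes` are made up for this sketch.

```python
# Channel weights as listed in this document's constants table.
C0 = 255 / 127        # R (low 7 bits), normalized
C1 = 255 / 32512      # G; 32512 = 127 * 256
C2 = 255 / 8323072    # B; 8323072 = 127 * 65536


def mantissa_from_scalars(r_man: float, g: float, b: float) -> float:
    """Mantissa as the decoder computes it, from normalized channels."""
    return 1.0 + r_man * C0 + g * C1 + b * C2


def mantissa_from_bytes(r_low: int, g_byte: int, b_byte: int) -> float:
    """Same value written as a 23-bit base-256 integer (hypothetical helper)."""
    m_int = r_low * 65536 + g_byte * 256 + b_byte
    return 1.0 + m_int / 8323072


# The two views agree for any positive-sign byte triple (R < 128).
for r_low, g_byte, b_byte in [(0, 0, 0), (0, 103, 22), (1, 188, 35), (127, 255, 255)]:
    x = mantissa_from_scalars(r_low / 255, g_byte / 255, b_byte / 255)
    y = mantissa_from_bytes(r_low, g_byte, b_byte)
    assert abs(x - y) < 1e-12
```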
In normalized scalar arithmetic, stripping the sign half from R is a subtraction: r_man = r - 0.5 if r ≥ 0.5, else r_man = r. No masking needed.
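A minimal sketch of the sign split, assuming the caller has already divided the byte by 255. The function name `split_sign` is illustrative, not part of the format.

```python
def split_sign(r: float) -> tuple[float, float]:
    """Return (sign, r_man) for a normalized R channel in [0, 1]."""
    negative = r >= 0.5
    sign = -1.0 if negative else 1.0
    r_man = r - 0.5 if negative else r   # plain subtraction, no masking
    return sign, r_man


for byte in (0, 127, 128, 255):
    sign, r_man = split_sign(byte / 255)
    print(byte, sign, round(r_man, 5))
```

Note that R = 128 leaves a small residue, r_man = 128/255 − 0.5 ≈ 0.00196, rather than exactly zero. This is consistent with the −1.0 row of the sample table decoding to −1.00415.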
The quadratic mapping is the key design decision. States cluster near E = 0 (values close to 1) and thin out toward the ±6 extremes.
Exponent state density by A value:
| A | E | 10^E | Step to next (decades) |
|---|---|---|---|
| 0 | −6.0000 | 0.000001 | 0.093749 |
| 32 | −3.3662 | 0.000430 | 0.070127 |
| 64 | −1.4883 | 0.032489 | 0.046505 |
| 96 | −0.3662 | 0.430300 | 0.022884 |
| 112 | −0.0887 | 0.815317 | 0.011073 |
| 124 | −0.0045 | 0.989643 | 0.002215 |
| 127 | −0.0001 | 0.999788 | 0.000185 |
| 128 | +0.0001 | 1.000212 | 0.000738 |
| 131 | +0.0045 | 1.010465 | 0.002953 |
| 143 | +0.0887 | 1.226517 | 0.011811 |
| 191 | +1.4883 | 30.779 | 0.047243 |
| 223 | +3.3662 | 2323.7 | 0.070865 |
| 255 | +6.0000 | 1000000 | 0.094487 |
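Near A = 127/128 adjacent states are less than 0.001 decades apart; at A = 0 and A = 255 the step grows to about 0.094 decades.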
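The E column of the table can be regenerated with a short sketch. It assumes the signed-quadratic mapping used by the decoders in this document (t = 2·(A/255 − 0.5), E = sign(t)·6·t²); the function name `exponent` is chosen for illustration.

```python
import math


def exponent(a_byte: int) -> float:
    """Map the A byte to a decimal exponent E in [-6, 6]."""
    t = (a_byte / 255 - 0.5) * 2          # [0, 255] -> [-1, 1]
    return math.copysign(t * t * 6, t)    # signed quadratic


for a in (0, 32, 64, 127, 128, 191, 255):
    e = exponent(a)
    print(f"A={a:3d}  E={e:+.4f}  10^E={10 ** e:.6g}")
```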
The encoder inverts the formula: given a value, find (R, G, B, A).
Step 1 — sign. Trivial: pos = value > 0.
Step 2 — exponent. Compute the real (fractional) log₁₀ of |value| and invert the quadratic to find A.
Using the isqrt collapse: sqrt(|E|/6) * 127.5 = sqrt(|E| * K) where K = 127.5²/6 = 2709.375. One multiply absorbed into the constant.
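A small sketch of the inversion, assuming round-to-nearest on A as in the reference snippets. `a_from_decades` is a hypothetical helper name; it takes the real log₁₀ of |value| and returns the A byte.

```python
import math

K = 127.5 ** 2 / 6   # 2709.375


def a_from_decades(e: float) -> int:
    """Invert the quadratic exponent mapping: real log10 -> A byte."""
    e = max(-6.0, min(6.0, e))
    sign = 1.0 if e >= 0 else -1.0
    a = int(127.5 + sign * math.sqrt(abs(e) * K) + 0.5)
    return max(0, min(255, a))


# The collapsed form matches the two-step form sqrt(|E|/6) * 127.5.
for e in (0.1, 0.4971, 2.0, 5.5):
    assert abs(math.sqrt(e / 6) * 127.5 - math.sqrt(e * K)) < 1e-9

print(a_from_decades(math.log10(3.14159)))   # A byte for 3.14159
```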
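Why real log₁₀, not integer floor? If you snap E to an integer decade first and then compute M = absv / 10^E_floor, a value like 3.14159 gives M = 3.14, which is outside [1, 2), and the clamp silently discards the excess. Using the real log₁₀ keeps M within one exponent step of 1. M can still fall slightly below 1 when A rounds up; the clamp then pins it to 1, which is why rows such as 2.71828 → (0,0,0,162) in the sample table carry an all-zero mantissa.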
Step 3 — mantissa. After A is quantized, recompute the stored E and extract the mantissa residual:
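A minimal sketch of this step, assuming the A-to-E mapping and the `MS` scale used by the reference snippets in this document. The names `stored_exponent` and `mantissa_int` are chosen for illustration only.

```python
import math

MS = 2097151.0   # mantissa scale used by the reference encoders


def stored_exponent(a_byte: int) -> float:
    """Exponent actually stored once A has been quantized."""
    t = (a_byte / 255 - 0.5) * 2
    return math.copysign(t * t * 6, t)


def mantissa_int(value: float, a_byte: int) -> int:
    """Integer mantissa residual left after dividing out 10^E_stored."""
    m = abs(value) / 10.0 ** stored_exponent(a_byte)
    m = max(1.0, min(1.9999999, m))       # clamp into [1, 2)
    return int((m - 1.0) * MS + 0.5)


print(mantissa_int(3.14159, 164))   # non-zero residual
print(mantissa_int(2.71828, 162))   # M < 1 here, so the clamp yields 0
```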
Step 4 — decompose m_int into channels using subtract-chain arithmetic (no modulo cascade):
rr = m_int / 65536
rem = m_int - rr * 65536
gg = rem / 256
bb = rem - gg * 256
R = rr % 128 + (0 if positive else 128)
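The same decomposition as runnable Python, where `/` in the steps above is read as integer division. `split_channels` and `join_channels` are hypothetical helper names; the second one only exists to check the round trip.

```python
def split_channels(m_int: int, positive: bool) -> tuple[int, int, int]:
    """Split an integer mantissa into (R, G, B) bytes, folding the sign into R."""
    rr = m_int // 65536
    rem = m_int - rr * 65536
    gg = rem // 256
    bb = rem - gg * 256
    r = rr % 128 + (0 if positive else 128)
    return r, gg, bb


def join_channels(r: int, g: int, b: int) -> int:
    """Inverse of split_channels for the mantissa part (sign half removed)."""
    return (r % 128) * 65536 + g * 256 + b


print(split_channels(26390, True))    # → (0, 103, 22)
print(split_channels(26390, False))   # → (128, 103, 22)
```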
Honestly: not great. This is a scientific order-of-magnitude format, not a computation format. Expect 1–2 significant decimal digits across most of the range. Good enough for sensor readouts, color-mapped scalar fields, or coarse telemetry. Not suitable for accumulation, iterative arithmetic, or anything requiring more than ~2 sig figs.
| Scale | Mean rel. err | Max rel. err | Abs step | ±10^x | Sig digits |
|---|---|---|---|---|---|
| 1e−6 | 4.68% | 11.68% | 7.87e−9 | −9 | 1.3 |
| 1e−5 | 4.83% | 10.60% | 7.71e−8 | −8 | 1.3 |
| 1e−4 | 3.76% | 9.48% | 7.34e−7 | −7 | 1.4 |
| 1e−3 | 3.49% | 8.16% | 7.47e−6 | −6 | 1.5 |
| 1e−2 | 2.67% | 6.72% | 7.99e−5 | −5 | 1.6 |
| 1e−1 | 1.86% | 4.76% | 7.57e−4 | −4 | 1.7 |
| 1e+0 | 0.55% | 1.98% | 7.88e−3 | −3 | 2.3 |
| 1e+1 | 1.97% | 5.19% | 8.19e−2 | −2 | 1.7 |
| 1e+2 | 3.03% | 7.02% | 7.76e−1 | −1 | 1.5 |
| 1e+3 | 3.56% | 8.45% | 8.30e+0 | +0 | 1.4 |
| 1e+4 | 3.66% | 9.67% | 8.45e+1 | +1 | 1.4 |
| 1e+5 | 4.24% | 10.99% | 8.05e+2 | +2 | 1.4 |
| 1e+6 | 11.66% | 21.30% | 7.87e+3 | +3 | 0.9 |
The ±10^x column is the absolute error exponent: at scale S the absolute error is bounded by roughly ±10^(floor(log₁₀(S)) − 3). It tracks scale exactly — the format behaves like floating-point, not fixed-point.
Best precision is near scale 1 (values between ~0.5 and ~2), where the quadratic exponent mapping is densest. The 1e+6 boundary degrades because A=255 is a hard edge with no finer quantization available.
| Value | RGBA | Decoded | Rel. err | ±10^x |
|---|---|---|---|---|
| `1e-6` | `(0,0,0,0)` | `1e-06` | 0.000% | exact |
| `0.0001` | `(2,84,217,23)` | `9.49e-05` | 5.08% | −6 |
| `0.001` | `(1,188,35,37)` | `0.000962` | 3.85% | −5 |
| `0.1` | `(1,76,239,75)` | `0.09708` | 2.92% | −3 |
| `1.0` | `(0,0,0,128)` | `1.00021` | 0.021% | −4 |
| `3.14159` | `(0,103,22,164)` | `3.11239` | 0.93% | −2 |
| `2.71828` | `(0,0,0,162)` | `2.74984` | 1.16% | −2 |
| `10.0` | `(0,0,0,180)` | `10.4064` | 4.06% | −1 |
| `100.0` | `(0,115,175,201)` | `98.958` | 1.04% | +0 |
| `10000.0` | `(0,0,0,232)` | `10728.6` | 7.29% | +2 |
| `1e6` | `(0,0,0,255)` | `1e+06` | 0.000% | exact |
| `-1.0` | `(128,0,0,128)` | `-1.00415` | 0.42% | −3 |
| `-1e-6` | `(128,0,0,0)` | `-1.004e-06` | 0.39% | −9 |
| `-1e6` | `(128,0,0,255)` | `-1.004e+06` | 0.39% | +3 |
| `123456.789` | `(0,0,0,245)` | `124662` | 0.98% | +3 |
| `-4.2e-5` | `(128,0,0,19)` | `-4.54e-05` | 8.01% | −6 |
| `0.5` | `(0,0,0,99)` | `0.501427` | 0.29% | −3 |
Exact hits at 1e±6 because those values pin A to its endpoint states (0 or 255), where no rounding error is possible.
All four channels must remain valid RGBA u8 values normalized to [0, 1]. A dedicated sign bit would require bit manipulation on decode, breaking the "normalized scalar only" ABI. The R-MSB approach keeps decode as pure scalar arithmetic — the threshold comparison r >= 0.5 is identical to R >= 128 but works on any normalized float regardless of byte order or source format.
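The equivalence of the two threshold tests can be checked exhaustively. The sketch below also tries a 16-bit source as an example of "any source format"; that second case is an assumption added for illustration, not something the format itself defines.

```python
def negative_from_scalar(x: float) -> bool:
    """Sign test on a normalized channel value."""
    return x >= 0.5


# 8-bit source: r >= 0.5 agrees with R >= 128 for every byte.
assert all(negative_from_scalar(R / 255) == (R >= 128) for R in range(256))

# 16-bit source (illustrative): the same scalar test splits the range in half.
assert all(negative_from_scalar(v / 65535) == (v >= 32768) for v in range(65536))
```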
All snippets implement the same ABI:
- `decode`: takes four `f32` ∈ [0, 1] (caller normalizes `u8 / 255`), returns `f32`
- `encode`: takes `f32`, returns four `u8` values
Constants used everywhere:
| Name | Value | Meaning |
|---|---|---|
| `L` | 3.321928… | log₂(10) |
| `IL` | 0.301029… | 1/log₂(10) = log₁₀(2) |
| `K` | 2709.375 | 127.5²/6, isqrt scale |
| `C0` | 2.007874… | 255/127 |
| `C1` | 0.007843… | 255/32512 |
| `C2` | 0.000031… | 255/8323072 |
| `MS` | 2097151.0 | 2²¹−1 |
```python
import math

L = math.log2(10); IL = 1/L; K = 2709.375
C0 = 255/127; C1 = 255/32512; C2 = 255/8323072; MS = 2097151.0

def decode(r: float, g: float, b: float, a: float) -> float:
    s = -1. if r >= .5 else 1.
    rm = r - (.5 if r >= .5 else 0.)
    t = (a - .5) * 2.
    E = math.copysign(t*t*6., t)
    return s * (1. + rm*C0 + g*C1 + b*C2) * 2.**(E*L)

def encode(v: float) -> tuple[int, int, int, int]:
    if v == 0.: return (0, 0, 0, 127)
    p = v > 0.; a = abs(v)
    er = max(-6., min(6., math.log2(a)*IL))
    A = max(0, min(255, int(127.5 + (1 if er >= 0 else -1)*math.sqrt(abs(er)*K) + .5)))
    ts = (A/255. - .5)*2.; Es = math.copysign(ts*ts*6., ts)
    M = max(1., min(1.9999999, a * 2.**(-Es*L)))
    mi = int((M-1.)*MS + .5)
    rr = mi//65536; rem = mi - rr*65536; gg = rem//256
    return (rr%128 + (0 if p else 128), gg, rem - gg*256, A)

# usage
rgba = encode(3.14159)                  # → (0, 103, 22, 164)
val = decode(*[x/255 for x in rgba])    # → 3.11239
```

```c
#include <math.h>
#include <stdint.h>

#define L  3.32192809488736234787f  /* log2(10)      */
#define IL 0.30102999566398119521f  /* 1/log2(10)    */
#define K  2709.375f                /* 127.5^2/6     */
#define C0 2.00787401574803149606f  /* 255/127       */
#define C1 0.00784325787401574803f  /* 255/32512     */
#define C2 0.00003063772607037402f  /* 255/8323072   */
#define MS 2097151.0f               /* 2^21 - 1      */

typedef struct { uint8_t r, g, b, a; } RGBA;

static inline float rgba_decode(float r, float g, float b, float a) {
    float s  = (r >= 0.5f) ? -1.f : 1.f;
    float rm = r - ((r >= 0.5f) ? 0.5f : 0.f);
    float t  = (a - 0.5f) * 2.f;
    float E  = copysignf(t*t*6.f, t);
    return s * (1.f + rm*C0 + g*C1 + b*C2) * exp2f(E * L);
}

static inline RGBA rgba_encode(float v) {
    if (v == 0.f) return (RGBA){0, 0, 0, 127};
    int pos = v > 0.f;
    float a = fabsf(v);
    float er = fmaxf(-6.f, fminf(6.f, log2f(a) * IL));
    int A = (int)fmaxf(0.f, fminf(255.f,
        127.5f + (er >= 0.f ? 1.f : -1.f) * sqrtf(fabsf(er)*K) + .5f));
    float ts = (A/255.f - .5f)*2.f;
    float Es = copysignf(ts*ts*6.f, ts);
    float M  = fmaxf(1.f, fminf(1.9999999f, a * exp2f(-Es * L)));
    int mi = (int)(((M-1.f)*MS) + .5f);
    int rr = mi / 65536, rem = mi - rr*65536, gg = rem / 256;
    return (RGBA){
        (uint8_t)(rr%128 + (pos ? 0 : 128)),
        (uint8_t)gg,
        (uint8_t)(rem - gg*256),
        (uint8_t)A
    };
}

/* usage */
/*
RGBA px = rgba_encode(3.14159f);
float v = rgba_decode(px.r/255.f, px.g/255.f, px.b/255.f, px.a/255.f);
*/
```

```java
public final class RgbaFloat {
    private static final float L  = (float)(Math.log(10) / Math.log(2));
    private static final float IL = 1.f / L;
    private static final float K  = 2709.375f;
    private static final float C0 = 255f / 127f;
    private static final float C1 = 255f / 32512f;
    private static final float C2 = 255f / 8323072f;
    private static final float MS = 2097151f;

    private RgbaFloat() {}

    /** Four normalized floats [0,1] → scalar value. */
    public static float decode(float r, float g, float b, float a) {
        float s  = (r >= 0.5f) ? -1f : 1f;
        float rm = r - ((r >= 0.5f) ? 0.5f : 0f);
        float t  = (a - 0.5f) * 2f;
        float E  = Math.signum(t) * t * t * 6f;
        return s * (1f + rm*C0 + g*C1 + b*C2) * (float)Math.pow(2f, E * L);
    }

    /** Scalar value → int[4] with R,G,B,A in [0,255]. */
    public static int[] encode(float v) {
        if (v == 0f) return new int[]{0, 0, 0, 127};
        boolean pos = v > 0f;
        float a = Math.abs(v);
        float er = Math.max(-6f, Math.min(6f,
            (float)(Math.log(a) / Math.log(2)) * IL));
        int A = Math.max(0, Math.min(255,
            (int)(127.5f + (er >= 0f ? 1f : -1f)
                * (float)Math.sqrt(Math.abs(er) * K) + .5f)));
        float ts = (A / 255f - .5f) * 2f;
        float Es = Math.signum(ts) * ts * ts * 6f;
        float M  = Math.max(1f, Math.min(1.9999999f,
            a * (float)Math.pow(2f, -Es * L)));
        int mi = (int)((M - 1f) * MS + .5f);
        int rr = mi / 65536, rem = mi - rr*65536, gg = rem / 256;
        return new int[]{
            rr % 128 + (pos ? 0 : 128),
            gg,
            rem - gg * 256,
            A
        };
    }

    // usage:
    // int[] px = RgbaFloat.encode(3.14159f);
    // float v = RgbaFloat.decode(px[0]/255f, px[1]/255f, px[2]/255f, px[3]/255f);
}
```

Strict math ABI: only exp2, log2, sqrt, sign, abs, select, clamp. No log, no pow, no copysign. All GPU-friendly scalar arithmetic.

```wgsl
const L :f32= 3.321928094887362;  // log2(10)
const IL:f32= 0.301029995663981;  // 1/log2(10)
const K :f32= 2709.375;           // 127.5²/6, isqrt scale
const C0:f32= 2.007874015748031;  // 255/127
const C1:f32= 0.007843257874016;  // 255/32512
const C2:f32= 0.000030637726070;  // 255/8323072
const MS:f32= 2097151.0;

// ch.xyzw = r,g,b,a normalized to [0,1]
fn decode(ch: vec4<f32>) -> f32 {
    let s  = select(1., -1., ch.x >= .5);
    let rm = ch.x - select(0., .5, ch.x >= .5);
    let t  = (ch.w - .5) * 2.;
    return s * (1. + rm*C0 + ch.y*C1 + ch.z*C2)
             * exp2(sign(t) * t*t * 6. * L);
}

fn encode(v: f32) -> vec4<u32> {
    if v == 0. { return vec4<u32>(0,0,0,127); }
    let p  = v > 0.;
    let a  = abs(v);
    let er = clamp(log2(a)*IL, -6., 6.);
    // isqrt collapse: sqrt(|E|/6)*127.5 = sqrt(|E|*K)
    let A  = u32(clamp(i32(127.5 + select(-1.,1.,er>=0.) * sqrt(abs(er)*K) + .5), 0, 255));
    let ts = (f32(A)/255. - .5)*2.;
    let Es = sign(ts)*ts*ts*6.;
    let M  = clamp(a * exp2(-Es*L), 1., 1.9999999);
    let mi = u32(clamp(i32((M-1.)*MS+.5), 0, i32(MS)));
    // arithmetic decompose, no bit ops
    let rr = mi/65536u; let rem = mi - rr*65536u; let gg = rem/256u;
    return vec4<u32>(rr%128u + select(0u,128u,!p), gg, rem-gg*256u, A);
}
```

GLSL 4.5 / ES 3.0. Uses exp2, log2, sqrt, sign, abs, clamp, and the ternary operator (as select). No bit ops in decode.

```glsl
#define L  3.321928094887362
#define IL 0.301029995663981
#define K  2709.375
#define C0 2.007874015748031
#define C1 0.007843257874016
#define C2 0.000030637726070
#define MS 2097151.0

// Decode: pass in texture sample directly (already [0,1])
float rgba_decode(vec4 ch) {
    float s  = ch.x >= 0.5 ? -1.0 : 1.0;
    float rm = ch.x - (ch.x >= 0.5 ? 0.5 : 0.0);
    float t  = (ch.w - 0.5) * 2.0;
    return s * (1.0 + rm*C0 + ch.y*C1 + ch.z*C2)
             * exp2(sign(t) * t*t * 6.0 * L);
}

// Encode: returns uvec4 with components in [0,255]
uvec4 rgba_encode(float v) {
    if (v == 0.0) return uvec4(0, 0, 0, 127);
    bool p = v > 0.0;
    float a = abs(v);
    float er = clamp(log2(a)*IL, -6.0, 6.0);
    uint A = uint(clamp(int(127.5 + (er>=0.0 ? 1.0:-1.0)*sqrt(abs(er)*K) + 0.5), 0, 255));
    float ts = (float(A)/255.0 - 0.5)*2.0;
    float Es = sign(ts)*ts*ts*6.0;
    float M  = clamp(a * exp2(-Es*L), 1.0, 1.9999999);
    uint mi = uint(clamp(int((M-1.0)*MS+0.5), 0, int(MS)));
    uint rr = mi/65536u, rem = mi - rr*65536u, gg = rem/256u;
    return uvec4(rr%128u + (p ? 0u : 128u), gg, rem-gg*256u, A);
}

// Decode from a texture (RGBA8 sampled as unorm [0,1]):
//   float val = rgba_decode(texture(uScalarTex, uv));
//
// Encode to a pixel write (RGBA32UI rendertarget):
//   fragColor = rgba_encode(scalar_value);
```

Zero is unrepresentable. The format has no zero state. encode(0) returns (0,0,0,127), which decodes to +0.999788 (the state just below 1.0), not to a value near zero. Applications that need to distinguish zero should reserve a sentinel (e.g. all-zero RGBA as a special case handled outside the formula).
Range is hard-clamped. Values outside ±[10⁻⁶, 2·10⁶] will silently saturate at the nearest endpoint. No overflow flag, no infinity, no NaN.
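The saturation can be observed directly. The sketch below restates the Python reference encoder from this document (same constants, spelled out for self-containment) and feeds it values far outside the range. It only shows that out-of-range inputs collapse onto a single code with A pinned at 0 or 255; it makes no claim about what that code decodes to.

```python
import math

L = math.log2(10); IL = 1 / L; K = 2709.375; MS = 2097151.0


def encode(v: float) -> tuple[int, int, int, int]:
    """Encoder restated from the Python reference snippet."""
    if v == 0.0:
        return (0, 0, 0, 127)
    positive = v > 0.0
    a = abs(v)
    er = max(-6.0, min(6.0, math.log2(a) * IL))           # clamp decades to [-6, 6]
    A = max(0, min(255, int(127.5 + (1 if er >= 0 else -1)
                            * math.sqrt(abs(er) * K) + 0.5)))
    ts = (A / 255.0 - 0.5) * 2.0
    Es = math.copysign(ts * ts * 6.0, ts)
    M = max(1.0, min(1.9999999, a * 2.0 ** (-Es * L)))    # clamp mantissa
    mi = int((M - 1.0) * MS + 0.5)
    rr = mi // 65536; rem = mi - rr * 65536; gg = rem // 256
    return (rr % 128 + (0 if positive else 128), gg, rem - gg * 256, A)


print([encode(v) for v in (1e-9, 1e-12)])   # both collapse to one code, A = 0
print([encode(v) for v in (1e9, 1e12)])     # both collapse to one code, A = 255
```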
Not suitable for arithmetic. Decoded values should be treated as read-only measurements. Adding or multiplying two decoded values and re-encoding will accumulate encoding error with each round-trip.
Texture store use case. The primary motivation is storing a signed scalar field in an RGBA8 texture — common in scientific visualization (temperature, pressure, elevation, signed distance fields). The format requires no custom texture formats, works on any GPU, and decodes in a single texture fetch plus ~10 ALU operations.
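As a CPU-side sketch of that use case, a scalar field can be packed into a flat RGBA8 byte buffer and read back. The `encode`/`decode` pair is restated from the Python reference snippet; `pack_field` and `unpack_field` are hypothetical helpers, and a real pipeline would upload the buffer through whatever texture API it uses.

```python
import math

L = math.log2(10); IL = 1 / L; K = 2709.375
C0 = 255 / 127; C1 = 255 / 32512; C2 = 255 / 8323072; MS = 2097151.0


def decode(r: float, g: float, b: float, a: float) -> float:
    s = -1.0 if r >= 0.5 else 1.0
    rm = r - (0.5 if r >= 0.5 else 0.0)
    t = (a - 0.5) * 2.0
    E = math.copysign(t * t * 6.0, t)
    return s * (1.0 + rm * C0 + g * C1 + b * C2) * 2.0 ** (E * L)


def encode(v: float) -> tuple[int, int, int, int]:
    if v == 0.0:
        return (0, 0, 0, 127)
    positive = v > 0.0
    a = abs(v)
    er = max(-6.0, min(6.0, math.log2(a) * IL))
    A = max(0, min(255, int(127.5 + (1 if er >= 0 else -1)
                            * math.sqrt(abs(er) * K) + 0.5)))
    ts = (A / 255.0 - 0.5) * 2.0
    Es = math.copysign(ts * ts * 6.0, ts)
    M = max(1.0, min(1.9999999, a * 2.0 ** (-Es * L)))
    mi = int((M - 1.0) * MS + 0.5)
    rr = mi // 65536; rem = mi - rr * 65536; gg = rem // 256
    return (rr % 128 + (0 if positive else 128), gg, rem - gg * 256, A)


def pack_field(values) -> bytes:
    """Pack scalars into a flat RGBA8 buffer, 4 bytes per texel."""
    out = bytearray()
    for v in values:
        out.extend(encode(v))
    return bytes(out)


def unpack_field(buf: bytes) -> list[float]:
    """Decode a flat RGBA8 buffer, normalizing each byte by 255 first."""
    return [decode(*(c / 255 for c in buf[i:i + 4])) for i in range(0, len(buf), 4)]


field = [1.0, -1.0, 3.14159, 100.0]
texels = pack_field(field)
print(len(texels), unpack_field(texels))
```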