@hUwUtao
Created March 19, 2026 17:17
What if numbers were colours? This sloppy paper encodes a floating-point scalar as an RGBA8 pixel!

RGBA u8 Scientific Float

A signed scientific-range float packed into four normalized u8 channels — the same layout as an RGBA color. No dedicated bit fields. No bitwise operations in decode. Every channel is a plain real number in [0, 1].


The formula

$$\boxed{V = \text{sgn}(R) \times M(R,G,B) \times 10^{E(A)}}$$

All four inputs are raw u8 bytes divided by 255 before use. The implementations operate purely on those normalized scalars; the formulas below are written in terms of the raw byte values R, G, B, A for readability.

Sign — R MSB

$$\text{sgn}(R) = \begin{cases} -1 & R \geq 128 \\ +1 & R < 128 \end{cases}$$

The high half of R's range is negative, the low half positive. In normalized form: threshold at 0.5.

Mantissa — R[6:0], G, B

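$$M(R,G,B) = 1 + \frac{R \bmod 128}{127} + \frac{G}{127 \times 256} + \frac{B}{127 \times 65536}$$

A base-256 positional number. R contributes the high 7 bits of the fractional part, G the middle 8, B the low 8: 23 bits of significand in total, identical in width to IEEE 754 single precision. Because the divisor is 127 rather than 128, the decoder's range is [1, 2.008) rather than exactly [1, 2); the encoder clamps M below 2.

In normalized scalar arithmetic, stripping the sign half from R is a subtraction: r_man = r - 0.5 if r ≥ 0.5, else r_man = r. No masking needed. The subtraction is not exactly the mask, though: the sign half is 128/255 ≈ 0.502 in normalized units, so subtracting 0.5 leaves a residual of 0.5/255, which the mantissa scale turns into 0.5/127 ≈ 0.0039. Negative values therefore decode slightly larger in magnitude than their positive counterparts (−1.0 round-trips to −1.00415 where +1.0 gives 1.00021).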
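The gap between masking the byte and subtracting 0.5 in normalized space can be checked with a small Python sketch. The helper names `mantissa_masked` and `mantissa_normalized` are made up for this illustration and are not part of the reference implementations.

```python
def mantissa_masked(R: int, G: int, B: int) -> float:
    """Mantissa as written in the formula: strip the sign with R mod 128."""
    return 1.0 + (R % 128) / 127 + G / (127 * 256) + B / (127 * 65536)


def mantissa_normalized(R: int, G: int, B: int) -> float:
    """Mantissa as the decoders compute it: normalize first, then subtract 0.5."""
    r, g, b = R / 255, G / 255, B / 255
    rm = r - 0.5 if r >= 0.5 else r
    return 1.0 + rm * (255 / 127) + g * (255 / 32512) + b * (255 / 8323072)


# The normalized threshold and the byte threshold pick the same sign for every byte.
assert all((R / 255 >= 0.5) == (R >= 128) for R in range(256))

# Positive values (R < 128): both forms agree up to float rounding.
print(mantissa_masked(0, 103, 22), mantissa_normalized(0, 103, 22))

# Negative values (R >= 128): the normalized form carries an extra 0.5/127.
print(mantissa_normalized(128, 0, 0) - mantissa_masked(128, 0, 0), 0.5 / 127)
```

If the offset matters for an application, one option (an assumption, not something the format specifies) is to subtract 128/255 instead of 0.5 in the decoder.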

Exponent — A

$$t = (A/255 - 0.5) \times 2 \quad \in [-1,\ +1]$$

$$E(A) = \text{sign}(t) \times t^2 \times 6 \quad \in [-6,\ +6]$$

$$10^{E(A)} \quad \in [10^{-6},\ 10^{+6}]$$

The quadratic mapping is the key design decision. States cluster near $t = 0$ (i.e. $A \approx 127$–$128$, $E \approx 0$), and spread out toward the extremes. This gives high relative precision for values near 1 and graceful degradation at large magnitudes.

Exponent state density by A value:

| A | E | 10^E | Step to next (decades) |
|---|---|---|---|
| 0 | −6.0000 | 0.000001 | 0.093749 |
| 32 | −3.3662 | 0.000430 | 0.070127 |
| 64 | −1.4883 | 0.032489 | 0.046505 |
| 96 | −0.3662 | 0.430300 | 0.022884 |
| 112 | −0.0887 | 0.815317 | 0.011073 |
| 124 | −0.0045 | 0.989643 | 0.002215 |
| 127 | −0.0001 | 0.999788 | 0.000185 |
| 128 | +0.0001 | 1.000212 | 0.000738 |
| 131 | +0.0045 | 1.010465 | 0.002953 |
| 143 | +0.0887 | 1.226517 | 0.011811 |
| 191 | +1.4883 | 30.779 | 0.047243 |
| 223 | +3.3662 | 2323.7 | 0.070865 |
| 255 | +6.0000 | 1000000 | 0.094487 |

Near $A = 127$: step ≈ 0.000185 decades (0.043% multiplicative). At $A = 0$ or $255$: step ≈ 0.094 decades (24% multiplicative) — the worst case.
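The density figures above follow directly from the two exponent formulas. As a minimal sketch (the function names `exponent` and `step` are chosen here for illustration), the mapping and the gap between neighbouring states can be tabulated like this:

```python
import math


def exponent(A: int) -> float:
    """E(A) = sign(t) * t^2 * 6 with t = (A/255 - 0.5) * 2."""
    t = (A / 255 - 0.5) * 2
    return math.copysign(t * t * 6, t)


def step(A: int) -> float:
    """Gap in decades between state A and state A + 1."""
    return exponent(A + 1) - exponent(A)


for A in (0, 64, 127, 128, 191, 254):
    E = exponent(A)
    print(f"A={A:3d}  E={E:+.4f}  10^E={10 ** E:.6g}  step={step(A):.6f}")
```

The smallest gap sits between A = 127 and A = 128, and the gaps grow linearly in |t| toward both ends, which is the clustering described above.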


Encode

The encoder inverts the formula: given a value, find (R, G, B, A).

Step 1 — sign. Trivial: pos = value > 0.

Step 2 — exponent. Compute the real (fractional) log₁₀ of |value| and invert the quadratic to find A.

Using the isqrt collapse: sqrt(|E|/6) * 127.5 = sqrt(|E| * K) where K = 127.5²/6 = 2709.375. One multiply absorbed into the constant.

$$A = \text{round}\!\left(127.5 + \text{sign}(E) \times \sqrt{|E| \times K}\right)$$

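Why real log₁₀, not integer floor? If you snap E to an integer decade first, then compute M = absv / 10^E_floor, values like 3.14159 give M = 3.14, which is outside [1, 2), and the clamp silently drops the high bit. Using the real log₁₀ keeps the residual close to 1. It does not guarantee M ≥ 1, however: A is rounded to the nearest state, so the stored exponent can overshoot log₁₀|V|, in which case M falls just below 1 and the clamp in step 3 raises it to 1. That is why 10.0 encodes as (0,0,0,180) and decodes to 10.4064.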
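Step 2 can be tried in isolation. The sketch below is one way to write the inversion, using `math.log10` directly instead of the `log2(a) * IL` form in the reference code; the function name `exponent_byte` is hypothetical.

```python
import math

K = 127.5 ** 2 / 6  # 2709.375, the constant from the isqrt collapse


def exponent_byte(value: float) -> int:
    """Quantize log10(|value|) to the A channel. value must be non-zero."""
    E = max(-6.0, min(6.0, math.log10(abs(value))))
    sign = 1.0 if E >= 0 else -1.0
    # int(x + 0.5) rounds half up, matching the reference implementations
    A = int(127.5 + sign * math.sqrt(abs(E) * K) + 0.5)
    return max(0, min(255, A))


for v in (1e-6, 0.5, 1.0, 3.14159, 10.0, 1e6):
    print(v, exponent_byte(v))
```

The values chosen are rows from the roundtrip table further down, so the A column there can be used as a cross-check.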

Step 3 — mantissa. After A is quantized, recompute the stored E and extract the mantissa residual:

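$$M = \text{clamp}\!\left(\frac{|V|}{10^{E_\text{stored}}},\ 1,\ 1.9999999\right)$$

$$m_\text{int} = \text{round}((M - 1) \times \text{MS})$$

Here MS is the mantissa scale. A full 23-bit significand calls for MS = 2²³ − 1 = 8388607. The reference implementations below use MS = 2097151, which is 2²¹ − 1, so R never exceeds 31 on encode. The decoder divides by 127 × 65536 = 8323072, roughly four times the encoder's scale, so the fractional part of the mantissa comes back compressed by about that factor. The figures in the Precision section are consistent with these constants.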
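To see when the clamp in step 3 actually bites, the residual can be computed before and after clamping. This is a sketch with made-up helper names (`stored_exponent`, `mantissa_residual`); the A values passed in are the ones listed in the roundtrip table.

```python
import math


def stored_exponent(A: int) -> float:
    """Exponent recovered from the quantized A channel."""
    t = (A / 255 - 0.5) * 2
    return math.copysign(t * t * 6, t)


def mantissa_residual(value: float, A: int) -> tuple[float, float]:
    """Return (raw, clamped) mantissa of |value| relative to 10^E(A)."""
    raw = abs(value) / 10 ** stored_exponent(A)
    return raw, max(1.0, min(1.9999999, raw))


print(mantissa_residual(3.14159, 164))  # raw is a little above 1, clamp is a no-op
print(mantissa_residual(10.0, 180))     # raw is below 1, clamp raises it to 1.0
```

When the raw residual is below 1 the mantissa channels end up all zero and the whole error comes from the exponent quantization.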

Step 4 — decompose m_int into channels using subtract-chain arithmetic (no modulo cascade):

rr  = m_int // 65536            # integer division: high bits, destined for R
rem = m_int - rr * 65536
gg  = rem // 256                # middle 8 bits
bb  = rem - gg * 256            # low 8 bits
R   = rr % 128 + (0 if positive else 128)
G   = gg
B   = bb
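The decomposition is easy to verify by recombining the channels. A minimal sketch, with `split_mantissa` and `join_mantissa` as illustrative names:

```python
def split_mantissa(m_int: int, positive: bool) -> tuple[int, int, int]:
    """Subtract-chain split of m_int into (R, G, B), sign folded into R."""
    rr = m_int // 65536
    rem = m_int - rr * 65536
    gg = rem // 256
    bb = rem - gg * 256
    R = rr % 128 + (0 if positive else 128)
    return R, gg, bb


def join_mantissa(R: int, G: int, B: int) -> int:
    """Inverse of split_mantissa, ignoring the sign half of R."""
    return (R % 128) * 65536 + G * 256 + B


R, G, B = split_mantissa(26390, positive=True)
print((R, G, B), join_mantissa(R, G, B))
```

The value 26390 is the mantissa integer behind the (0, 103, 22, 164) encoding of 3.14159, since 103 × 256 + 22 = 26390.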

Precision

Honestly: not great. This is a scientific order-of-magnitude format, not a computation format. Expect 1–2 significant decimal digits across most of the range. Good enough for sensor readouts, color-mapped scalar fields, or coarse telemetry. Not suitable for accumulation, iterative arithmetic, or anything requiring more than ~2 sig figs.

By scale (400 samples per decade, ±)

| Scale | Mean rel. err | Max rel. err | Abs step | ±10^x | Sig digits |
|---|---|---|---|---|---|
| 1e−6 | 4.68% | 11.68% | 7.87e−9 | −9 | 1.3 |
| 1e−5 | 4.83% | 10.60% | 7.71e−8 | −8 | 1.3 |
| 1e−4 | 3.76% | 9.48% | 7.34e−7 | −7 | 1.4 |
| 1e−3 | 3.49% | 8.16% | 7.47e−6 | −6 | 1.5 |
| 1e−2 | 2.67% | 6.72% | 7.99e−5 | −5 | 1.6 |
| 1e−1 | 1.86% | 4.76% | 7.57e−4 | −4 | 1.7 |
| 1e+0 | 0.55% | 1.98% | 7.88e−3 | −3 | 2.3 |
| 1e+1 | 1.97% | 5.19% | 8.19e−2 | −2 | 1.7 |
| 1e+2 | 3.03% | 7.02% | 7.76e−1 | −1 | 1.5 |
| 1e+3 | 3.56% | 8.45% | 8.30e+0 | +0 | 1.4 |
| 1e+4 | 3.66% | 9.67% | 8.45e+1 | +1 | 1.4 |
| 1e+5 | 4.24% | 10.99% | 8.05e+2 | +2 | 1.4 |
| 1e+6 | 11.66% | 21.30% | 7.87e+3 | +3 | 0.9 |

The ±10^x column is the absolute error exponent: at scale S the absolute error is bounded by roughly ±10^(floor(log₁₀(S)) − 3). It tracks scale exactly — the format behaves like floating-point, not fixed-point.

Best precision is near scale 1 (values between ~0.5 and ~2), where the quadratic exponent mapping is densest. The 1e+6 boundary degrades because A=255 is a hard edge with no finer quantization available.

Roundtrip samples

| Value | RGBA | Decoded | Rel. err | ±10^x |
|---|---|---|---|---|
| 1e-6 | (0,0,0,0) | 1e-06 | 0.000% | exact |
| 0.0001 | (2,84,217,23) | 9.49e-05 | 5.08% | −6 |
| 0.001 | (1,188,35,37) | 0.000962 | 3.85% | −5 |
| 0.1 | (1,76,239,75) | 0.09708 | 2.92% | −3 |
| 1.0 | (0,0,0,128) | 1.00021 | 0.021% | −4 |
| 3.14159 | (0,103,22,164) | 3.11239 | 0.93% | −2 |
| 2.71828 | (0,0,0,162) | 2.74984 | 1.16% | −2 |
| 10.0 | (0,0,0,180) | 10.4064 | 4.06% | −1 |
| 100.0 | (0,115,175,201) | 98.958 | 1.04% | +0 |
| 10000.0 | (0,0,0,232) | 10728.6 | 7.29% | +2 |
| 1e6 | (0,0,0,255) | 1e+06 | 0.000% | exact |
| -1.0 | (128,0,0,128) | -1.00415 | 0.42% | −3 |
| -1e-6 | (128,0,0,0) | -1.004e-06 | 0.39% | −9 |
| -1e6 | (128,0,0,255) | -1.004e+06 | 0.39% | +3 |
| 123456.789 | (0,0,0,245) | 124662 | 0.98% | +3 |
| -4.2e-5 | (128,0,0,19) | -4.54e-05 | 8.01% | −6 |
| 0.5 | (0,0,0,99) | 0.501427 | 0.29% | −3 |

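Exact hits at +1e±6 because those values pin A to its endpoint states (0 or 255), where E is exactly −6 or +6 and the mantissa residual is exactly 1. The negative endpoints are not exact: the −1e±6 rows show a 0.39% error, which comes from subtracting 0.5 rather than 128/255 when the sign half is stripped from R.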
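A few of these rows can be reproduced with a short script. The sketch below restates the Python reference implementation from the Implementations section so that it runs on its own; only the `roundtrip` helper is new, and its name is made up for this example.

```python
import math

L = math.log2(10); IL = 1 / L; K = 2709.375
C0 = 255 / 127; C1 = 255 / 32512; C2 = 255 / 8323072; MS = 2097151.0


def decode(r, g, b, a):
    s = -1.0 if r >= 0.5 else 1.0
    rm = r - (0.5 if r >= 0.5 else 0.0)
    t = (a - 0.5) * 2.0
    E = math.copysign(t * t * 6.0, t)
    return s * (1.0 + rm * C0 + g * C1 + b * C2) * 2.0 ** (E * L)


def encode(v):
    if v == 0.0:
        return (0, 0, 0, 127)
    a = abs(v)
    er = max(-6.0, min(6.0, math.log2(a) * IL))
    A = max(0, min(255, int(127.5 + (1 if er >= 0 else -1) * math.sqrt(abs(er) * K) + 0.5)))
    ts = (A / 255.0 - 0.5) * 2.0
    Es = math.copysign(ts * ts * 6.0, ts)
    M = max(1.0, min(1.9999999, a * 2.0 ** (-Es * L)))
    mi = int((M - 1.0) * MS + 0.5)
    rr = mi // 65536; rem = mi - rr * 65536; gg = rem // 256
    return (rr % 128 + (0 if v > 0 else 128), gg, rem - gg * 256, A)


def roundtrip(v):
    """Return (rgba, decoded, relative error) for one non-zero value."""
    rgba = encode(v)
    d = decode(*(c / 255 for c in rgba))
    return rgba, d, abs(d - v) / abs(v)


for v in (1.0, 3.14159, 10.0, -1.0, 0.5):
    rgba, d, err = roundtrip(v)
    print(f"{v:>10g}  {rgba}  {d:.6g}  {100 * err:.3f}%")
```

Looping `roundtrip` over a log-spaced grid is one way to approximate the per-scale table above; the sampling scheme behind that table is not stated, so exact agreement should not be expected.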


Why not a dedicated sign bit?

All four channels must remain valid RGBA u8 values normalized to [0, 1]. A dedicated sign bit would require bit manipulation on decode, breaking the "normalized scalar only" ABI. The R-MSB approach keeps decode as pure scalar arithmetic — the threshold comparison r >= 0.5 is identical to R >= 128 but works on any normalized float regardless of byte order or source format.


Implementations

All snippets implement the same ABI:

  • decode: takes four f32 ∈ [0, 1] (caller normalizes u8 / 255), returns f32
  • encode: takes f32, returns four u8 values

Constants used everywhere:

| Name | Value | Meaning |
|---|---|---|
| L | 3.321928… | log₂(10) |
| IL | 0.301029… | 1/log₂(10) = log₁₀(2) |
| K | 2709.375 | 127.5²/6 (isqrt scale) |
| C0 | 2.007874… | 255/127 |
| C1 | 0.007843… | 255/32512 |
| C2 | 0.000031… | 255/8323072 |
| MS | 2097151.0 | 2²¹−1 |

Python

import math

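L = math.log2(10)        # log2(10): turns a base-10 exponent into a base-2 one
IL = 1 / L               # log10(2)
K = 2709.375             # 127.5**2 / 6, folds the /6 and *127.5 into one sqrt
C0 = 255 / 127           # weight of normalized R in the mantissa
C1 = 255 / 32512         # weight of normalized G (32512 = 127 * 256)
C2 = 255 / 8323072       # weight of normalized B (8323072 = 127 * 65536)
MS = 2097151.0           # encoder mantissa scale, 2**21 - 1

def decode(r: float, g: float, b: float, a: float) -> float:
    """Four channels normalized to [0, 1] -> scalar value."""
    s  = -1. if r >= .5 else 1.             # sign lives in the upper half of R
    rm = r - (.5 if r >= .5 else 0.)        # strip the sign half
    t  = (a - .5) * 2.                      # exponent parameter in [-1, 1]
    E  = math.copysign(t*t*6., t)           # signed quadratic, E in [-6, 6]
    return s * (1. + rm*C0 + g*C1 + b*C2) * 2.**(E*L)

def encode(v: float) -> tuple[int, int, int, int]:
    """Scalar value -> (R, G, B, A) bytes."""
    if v == 0.:
        return (0, 0, 0, 127)               # no zero state; see Notes and limitations
    p = v > 0.
    a = abs(v)
    er = max(-6., min(6., math.log2(a)*IL)) # real log10(|v|), clamped to the range
    A  = max(0, min(255, int(127.5 + (1 if er >= 0 else -1)*math.sqrt(abs(er)*K) + .5)))
    ts = (A/255. - .5)*2.
    Es = math.copysign(ts*ts*6., ts)        # exponent actually stored after rounding A
    M  = max(1., min(1.9999999, a * 2.**(-Es*L)))
    mi = int((M-1.)*MS + .5)                # mantissa residual as an integer
    rr = mi // 65536
    rem = mi - rr*65536
    gg = rem // 256
    return (rr % 128 + (0 if p else 128), gg, rem - gg*256, A)

# usage
rgba = encode(3.14159)                      # → (0, 103, 22, 164)
val  = decode(*[x/255 for x in rgba])       # → 3.11239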

C

#include <math.h>
#include <stdint.h>

#define L   3.32192809488736234787f  /* log2(10)   */
#define IL  0.30102999566398119521f  /* 1/log2(10) */
#define K   2709.375f               /* 127.5^2/6  */
#define C0  2.00787401574803149606f  /* 255/127    */
#define C1  0.0078432578740157f      /* 255/32512  */
#define C2  0.000030637726070374f    /* 255/8323072*/
#define MS  2097151.0f              /* 2^21 - 1   */

typedef struct { uint8_t r, g, b, a; } RGBA;

static inline float rgba_decode(float r, float g, float b, float a) {
    float s  = (r >= 0.5f) ? -1.f : 1.f;
    float rm = r - ((r >= 0.5f) ? 0.5f : 0.f);
    float t  = (a - 0.5f) * 2.f;
    float E  = copysignf(t*t*6.f, t);
    return s * (1.f + rm*C0 + g*C1 + b*C2) * exp2f(E * L);
}

static inline RGBA rgba_encode(float v) {
    if (v == 0.f) return (RGBA){0, 0, 0, 127};
    int   pos = v > 0.f;
    float a   = fabsf(v);
    float er  = fmaxf(-6.f, fminf(6.f, log2f(a) * IL));
    int   A   = (int)fmaxf(0.f, fminf(255.f,
                    127.5f + (er >= 0.f ? 1.f : -1.f) * sqrtf(fabsf(er)*K) + .5f));
    float ts  = (A/255.f - .5f)*2.f;
    float Es  = copysignf(ts*ts*6.f, ts);
    float M   = fmaxf(1.f, fminf(1.9999999f, a * exp2f(-Es * L)));
    int   mi  = (int)(((M-1.f)*MS) + .5f);
    int   rr  = mi / 65536, rem = mi - rr*65536, gg = rem / 256;
    return (RGBA){
        (uint8_t)(rr%128 + (pos ? 0 : 128)),
        (uint8_t)gg,
        (uint8_t)(rem - gg*256),
        (uint8_t)A
    };
}

/* usage */
/*
    RGBA px = rgba_encode(3.14159f);
    float v = rgba_decode(px.r/255.f, px.g/255.f, px.b/255.f, px.a/255.f);
*/

Java

public final class RgbaFloat {

    private static final float L   = (float)(Math.log(10) / Math.log(2));
    private static final float IL  = 1.f / L;
    private static final float K   = 2709.375f;
    private static final float C0  = 255f / 127f;
    private static final float C1  = 255f / 32512f;
    private static final float C2  = 255f / 8323072f;
    private static final float MS  = 2097151f;

    private RgbaFloat() {}

    /** Four normalized floats [0,1] → scalar value. */
    public static float decode(float r, float g, float b, float a) {
        float s  = (r >= 0.5f) ? -1f : 1f;
        float rm = r - ((r >= 0.5f) ? 0.5f : 0f);
        float t  = (a - 0.5f) * 2f;
        float E  = Math.signum(t) * t * t * 6f;
        return s * (1f + rm*C0 + g*C1 + b*C2) * (float)Math.pow(2f, E * L);
    }

    /** Scalar value → int[4] with R,G,B,A in [0,255]. */
    public static int[] encode(float v) {
        if (v == 0f) return new int[]{0, 0, 0, 127};
        boolean pos = v > 0f;
        float   a   = Math.abs(v);
        float   er  = Math.max(-6f, Math.min(6f,
                          (float)(Math.log(a) / Math.log(2)) * IL));
        int     A   = Math.max(0, Math.min(255,
                          (int)(127.5f + (er >= 0f ? 1f : -1f)
                                * (float)Math.sqrt(Math.abs(er) * K) + .5f)));
        float   ts  = (A / 255f - .5f) * 2f;
        float   Es  = Math.signum(ts) * ts * ts * 6f;
        float   M   = Math.max(1f, Math.min(1.9999999f,
                          a * (float)Math.pow(2f, -Es * L)));
        int     mi  = (int)((M - 1f) * MS + .5f);
        int     rr  = mi / 65536, rem = mi - rr*65536, gg = rem / 256;
        return new int[]{
            rr % 128 + (pos ? 0 : 128),
            gg,
            rem - gg * 256,
            A
        };
    }

    // usage:
    //   int[]  px = RgbaFloat.encode(3.14159f);
    //   float  v  = RgbaFloat.decode(px[0]/255f, px[1]/255f, px[2]/255f, px[3]/255f);
}

WGSL

Strict math ABI: only exp2, log2, sqrt, sign, abs, select, clamp. No log, no pow, no copysign. All GPU-friendly scalar arithmetic.

const L :f32= 3.321928094887362;   // log2(10)
const IL:f32= 0.301029995663981;   // 1/log2(10)
const K :f32= 2709.375;            // 127.5²/6  — isqrt scale
const C0:f32= 2.007874015748031;
const C1:f32= 0.007843257874016;   // 255/32512
const C2:f32= 0.000030637726070;   // 255/8323072
const MS:f32= 2097151.0;

// ch.xyzw = r,g,b,a normalized to [0,1]
fn decode(ch: vec4<f32>) -> f32 {
    let s  = select(1., -1., ch.x >= .5);
    let rm = ch.x - select(0., .5, ch.x >= .5);
    let t  = (ch.w - .5) * 2.;
    return s * (1. + rm*C0 + ch.y*C1 + ch.z*C2)
             * exp2(sign(t) * t*t * 6. * L);
}

fn encode(v: f32) -> vec4<u32> {
    if v == 0. { return vec4<u32>(0,0,0,127); }
    let p  = v > 0.;
    let a  = abs(v);
    let er = clamp(log2(a)*IL, -6., 6.);
    // isqrt collapse: sqrt(|E|/6)*127.5 = sqrt(|E|*K)
    let A  = u32(clamp(i32(127.5 + select(-1.,1.,er>=0.) * sqrt(abs(er)*K) + .5), 0, 255));
    let ts = (f32(A)/255. - .5)*2.;
    let Es = sign(ts)*ts*ts*6.;
    let M  = clamp(a * exp2(-Es*L), 1., 1.9999999);
    let mi = u32(clamp(i32((M-1.)*MS+.5), 0, i32(MS)));
    // arithmetic decompose — no bit ops
    let rr = mi/65536u; let rem = mi - rr*65536u; let gg = rem/256u;
    return vec4<u32>(rr%128u + select(0u,128u,!p), gg, rem-gg*256u, A);
}

GLSL

GLSL 4.5 / ES 3.0. Uses exp2, log2, sqrt, sign, abs, clamp, and the ternary operator as select. No bit ops in decode.

#define L   3.321928094887362
#define IL  0.301029995663981
#define K   2709.375
#define C0  2.007874015748031
#define C1  0.007843257874016
#define C2  0.000030637726070
#define MS  2097151.0

// Decode: pass in texture sample directly (already [0,1])
float rgba_decode(vec4 ch) {
    float s  = ch.x >= 0.5 ? -1.0 : 1.0;
    float rm = ch.x - (ch.x >= 0.5 ? 0.5 : 0.0);
    float t  = (ch.w - 0.5) * 2.0;
    return s * (1.0 + rm*C0 + ch.y*C1 + ch.z*C2)
             * exp2(sign(t) * t*t * 6.0 * L);
}

// Encode: returns uvec4 with components in [0,255]
uvec4 rgba_encode(float v) {
    if (v == 0.0) return uvec4(0, 0, 0, 127);
    bool  p  = v > 0.0;
    float a  = abs(v);
    float er = clamp(log2(a)*IL, -6.0, 6.0);
    uint  A  = uint(clamp(int(127.5 + (er>=0.0 ? 1.0:-1.0)*sqrt(abs(er)*K) + 0.5), 0, 255));
    float ts = (float(A)/255.0 - 0.5)*2.0;
    float Es = sign(ts)*ts*ts*6.0;
    float M  = clamp(a * exp2(-Es*L), 1.0, 1.9999999);
    uint  mi = uint(clamp(int((M-1.0)*MS+0.5), 0, int(MS)));
    uint  rr = mi/65536u, rem = mi - rr*65536u, gg = rem/256u;
    return uvec4(rr%128u + (p ? 0u : 128u), gg, rem-gg*256u, A);
}

// Decode from a texture (RGBA8 sampled as unorm [0,1]):
//   float val = rgba_decode(texture(uScalarTex, uv));
//
// Encode to a pixel write (RGBA32UI rendertarget):
//   fragColor = rgba_encode(scalar_value);

Notes and limitations

Zero is unrepresentable. The format has no zero state. encode(0) returns (0,0,0,127), which decodes to +0.999788: a placeholder, not an approximation of zero (the smallest representable magnitude is 10⁻⁶, at (0,0,0,0)). Applications that need to distinguish zero should reserve a sentinel (e.g. all-zero RGBA as a special case handled outside the formula).
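One possible shape for such a sentinel, sketched in Python. Everything here is an assumption layered on top of the format: the wrapper names are hypothetical, all-zero RGBA is taken as the reserved pixel, and because (0,0,0,0) normally means +10⁻⁶ that one value is bumped to its neighbour (0,0,1,0). The `encode` and `decode` arguments stand for whichever implementation is in use.

```python
from typing import Callable, Tuple

Pixel = Tuple[int, int, int, int]
ZERO_SENTINEL: Pixel = (0, 0, 0, 0)  # assumption: this pixel is reserved for 0.0


def encode_with_zero(v: float, encode: Callable[[float], Pixel]) -> Pixel:
    """Encode v, mapping exact zero to the reserved pixel."""
    if v == 0.0:
        return ZERO_SENTINEL
    px = tuple(encode(v))
    # Keep the smallest positive value distinguishable from the sentinel.
    return (0, 0, 1, 0) if px == ZERO_SENTINEL else px


def decode_with_zero(px: Pixel,
                     decode: Callable[[float, float, float, float], float]) -> float:
    """Decode px, returning 0.0 for the reserved pixel."""
    if tuple(px) == ZERO_SENTINEL:
        return 0.0
    return decode(*(c / 255 for c in px))
```

The check happens on the raw bytes before normalization, so the formula itself stays untouched; in a shader the same test would be a comparison of the sampled vec4 against zero.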

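Range is hard-clamped. Values outside ±[10⁻⁶, 2·10⁶] will silently saturate at the nearest endpoint. No overflow flag, no infinity, no NaN. With the reference constants, anything at or above 2·10⁶ encodes to (31,255,255,255), which decodes to about 1.25·10⁶ rather than 2·10⁶.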

Not suitable for arithmetic. Decoded values should be treated as read-only measurements. Adding or multiplying two decoded values and re-encoding will accumulate encoding error with each round-trip.

Texture store use case. The primary motivation is storing a signed scalar field in an RGBA8 texture — common in scientific visualization (temperature, pressure, elevation, signed distance fields). The format requires no custom texture formats, works on any GPU, and decodes in a single texture fetch plus ~10 ALU operations.
