hUwUtao/README.md

RGBA u8 Scientific Float

A signed scientific-range float packed into four normalized u8 channels — the same layout as an RGBA color. No dedicated bit fields. No bitwise operations in decode. Every channel is a plain real number in [0, 1].

The formula

$$\boxed{V = \text{sgn}(R) \times M(R,G,B) \times 10^{E(A)}}$$

All four inputs are raw u8 bytes divided by 255 before use. The formula operates purely on those normalized scalars.

Sign — R MSB

$$\text{sgn}(R) = \begin{cases} -1 & R \geq 128 \\ +1 & R < 128 \end{cases}$$

The high half of R's range is negative, the low half positive. In normalized form: threshold at 0.5.
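
The equivalence of the byte threshold and the normalized threshold can be checked exhaustively. This is a minimal Python sketch; the helper name `sgn_normalized` is illustrative and not part of the format.

```python
def sgn_normalized(r: float) -> float:
    """Sign from the normalized R channel: the upper half of the range is negative."""
    return -1.0 if r >= 0.5 else 1.0

# The normalized test r >= 0.5 agrees with the byte test R >= 128 for every byte.
mismatches = [R for R in range(256)
              if (sgn_normalized(R / 255) < 0) != (R >= 128)]
print(mismatches)  # []
```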

Mantissa — R[6:0], G, B

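$$M(R,G,B) = 1 + \frac{R \mathbin{\&} 127}{127} + \frac{G}{127 \times 256} + \frac{B}{127 \times 65536} \in [1,\ 2.008)$$

A base-256 positional number. R contributes the high 7 bits of the fractional part, G the middle 8, B the low 8: 23 bits of significand total, identical in width to IEEE 754 single precision. Because the divisor is 127 rather than 128, the fraction tops out at 8388607/8323072 ≈ 1.0079, so the largest decodable mantissa sits slightly above 2.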

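In normalized scalar arithmetic, stripping the sign half from R is a subtraction: r_man = r - 0.5 if r ≥ 0.5, else r_man = r. No masking needed. The subtraction is not an exact stand-in for the mask, though: R = 128 normalizes to 128/255 ≈ 0.50196, so subtracting 0.5 leaves a residual of 0.5/255 that adds 0.5/127 ≈ 0.0039 to the mantissa of every negative value. Subtracting 128/255 instead would match the masked formula exactly.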
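
To make the mantissa arithmetic concrete, the sketch below evaluates it twice: once from raw bytes with the masked formula, and once from normalized channels with the subtraction. The function names are illustrative assumptions. It also prints the size of the offset that subtracting 0.5 (rather than 128/255) leaves on a negative value.

```python
def mantissa_bytes(R: int, G: int, B: int) -> float:
    """Mantissa from raw bytes, following the masked formula."""
    return 1 + (R & 127) / 127 + G / (127 * 256) + B / (127 * 65536)

def mantissa_normalized(r: float, g: float, b: float) -> float:
    """Same quantity from channels in [0, 1], stripping the sign half by subtraction."""
    r_man = r - 0.5 if r >= 0.5 else r
    return 1 + r_man * (255 / 127) + g * (255 / 32512) + b * (255 / 8323072)

pos = mantissa_normalized(0 / 255, 103 / 255, 22 / 255)    # positive value, R = 0
neg = mantissa_normalized(128 / 255, 103 / 255, 22 / 255)  # same mantissa bytes, sign set
print(round(pos, 6))        # 1.003171
print(round(neg - pos, 6))  # 0.003937, i.e. 0.5/127
```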

Exponent — A

$$t = (A/255 - 0.5) \times 2 \quad \in [-1,\ +1]$$

$$E(A) = \text{sign}(t) \times t^2 \times 6 \quad \in [-6,\ +6]$$

$$10^{E(A)} \quad \in [10^{-6},\ 10^{+6}]$$

The quadratic mapping is the key design decision. States cluster near $t = 0$ (i.e. $A \approx 127$–$128$, $E \approx 0$), and spread out toward the extremes. This gives high relative precision for values near 1 and graceful degradation at large magnitudes.

Exponent state density by A value:

A	E	10^E	Step to next (decades)
0	−6.0000	0.000001	0.093749
32	−3.3662	0.000430	0.070127
64	−1.4883	0.032489	0.046505
96	−0.3662	0.430300	0.022884
112	−0.0887	0.815317	0.011073
124	−0.0045	0.989643	0.002215
127	−0.0001	0.999788	0.000185
128	+0.0001	1.000212	0.000738
131	+0.0045	1.010465	0.002953
143	+0.0887	1.226517	0.011811
191	+1.4883	30.779	0.047243
223	+3.3662	2323.7	0.070865
255	+6.0000	1000000	0.094487

Near $A = 127$: step ≈ 0.000185 decades (0.043% multiplicative). At $A = 0$ or $255$: step ≈ 0.094 decades (24% multiplicative) — the worst case.
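
The rows of the table follow directly from the two exponent formulas. A minimal sketch that regenerates a few of them (the function name `exponent` is an assumption made for illustration):

```python
def exponent(A: int) -> float:
    """E(A): signed quadratic map from the alpha byte to a base-10 exponent."""
    t = (A / 255 - 0.5) * 2
    return (1 if t >= 0 else -1) * t * t * 6

for A in (0, 64, 127, 128, 191, 255):
    E = exponent(A)
    # For A = 255 there is no next state; the step is extrapolated to A = 256.
    step = exponent(A + 1) - E
    print(f"{A:3d}  {E:+.4f}  {10 ** E:.6g}  {step:.6f}")
```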

Encode

The encoder inverts the formula: given a value, find (R, G, B, A).

Step 1 — sign. Trivial: pos = value > 0.

Step 2 — exponent. Compute the real (fractional) log₁₀ of |value| and invert the quadratic to find A.

Using the isqrt collapse: sqrt(|E|/6) * 127.5 = sqrt(|E| * K) where K = 127.5²/6 = 2709.375. One multiply absorbed into the constant.

$$A = \text{round}\!\left(127.5 + \text{sign}(E) \times \sqrt{|E| \times K}\right)$$
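
A small sketch of this inversion, assuming round-half-up via `int(x + 0.5)` as the reference implementations do; the function name `quantize_exponent` is hypothetical.

```python
import math

K = 127.5 ** 2 / 6   # 2709.375

def quantize_exponent(value: float) -> int:
    """Alpha byte for |value|: invert the quadratic map, then round in A."""
    E = max(-6.0, min(6.0, math.log10(abs(value))))
    A = int(127.5 + math.copysign(math.sqrt(abs(E) * K), E) + 0.5)
    return max(0, min(255, A))

print(quantize_exponent(100.0))  # 201
print(quantize_exponent(0.5))    # 99
print(quantize_exponent(1e6))    # 255
```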

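Why real log₁₀, not integer floor? If you snap E to an integer decade first, then compute M = absv / 10^E_floor, values like 3.14159 give M = 3.14, outside [1, 2), and the clamp silently discards the excess. Using the real log₁₀ keeps the residual close to 1. It can still fall slightly below 1 when A rounds up to a stored exponent larger than the true one; the clamp in Step 3 then pins M to 1, which is why several entries in the roundtrip table have R = G = B = 0.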
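
The difference is easy to see numerically. In this sketch, A = 164 is the alpha byte that the roundtrip table lists for 3.14159; the variable names are illustrative.

```python
import math

value = 3.14159

# Integer decade: the residual can land anywhere in [1, 10).
m_floor = value / 10 ** math.floor(math.log10(value))

# Quantized real exponent: the residual stays close to 1.
t = (164 / 255 - 0.5) * 2
m_real = value / 10 ** (t * t * 6)

print(m_floor, m_real)  # m_floor is about 3.14 (outside [1, 2)), m_real about 1.013 (inside)
```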

Step 3 — mantissa. After A is quantized, recompute the stored E and extract the mantissa residual:

$$M = \text{clamp}\!\left(\frac{|V|}{10^{E_\text{stored}}},\ 1,\ 1.9999999\right)$$

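$$m_\text{int} = \text{round}((M - 1) \times (2^{21} - 1))$$

The scale 2²¹ − 1 = 2097151 is the constant MS used by every implementation below (2²³ − 1 would be 8388607). Decode weights the same integer by 1/(127 × 65536) = 1/8323072, about four times smaller than 1/MS, so the decoded mantissa fraction is roughly a quarter of M − 1. The roundtrip table shows the effect: 3.14159 is encoded with M ≈ 1.0126 but decodes with a mantissa of about 1.0032.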
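
A sketch of Step 3 in isolation. The helper names are hypothetical, and the default scale is the MS constant taken from the reference implementations, which is an assumption about which scale is intended.

```python
def stored_exponent(A: int) -> float:
    """Exponent actually represented by the quantized alpha byte."""
    t = (A / 255 - 0.5) * 2
    return (1 if t >= 0 else -1) * t * t * 6

def mantissa_int(value: float, A: int, scale: float = 2097151.0) -> int:
    """Residual mantissa of |value| against the exponent stored in A, as an integer."""
    M = abs(value) / 10 ** stored_exponent(A)
    M = max(1.0, min(1.9999999, M))      # clamp into the representable range
    return int((M - 1) * scale + 0.5)    # round half up

# A = 162 overshoots 2.71828, so M < 1 and the clamp pins it to 1.
print(mantissa_int(2.71828, 162))  # 0
print(mantissa_int(3.14159, 164))  # a small positive integer; the residual is about 0.0126
```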

Step 4 — decompose m_int into channels using subtract-chain arithmetic (no modulo cascade):

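rr  = m_int // 65536          # high part (integer division)
rem = m_int - rr * 65536      # what is left after removing the high part
gg  = rem // 256              # middle byte
bb  = rem - gg * 256          # low byte
R   = rr % 128 + (0 if positive else 128)   # keep 7 bits, then add the sign half
G, B = gg, bb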
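
The same decomposition as a runnable function, with two worked inputs. The name `split_mantissa` is an illustrative assumption.

```python
def split_mantissa(m_int: int, positive: bool) -> tuple[int, int, int]:
    """Split m_int into (R, G, B) using integer divide, multiply and subtract."""
    rr = m_int // 65536
    rem = m_int - rr * 65536
    gg = rem // 256
    bb = rem - gg * 256
    return (rr % 128 + (0 if positive else 128), gg, bb)

print(split_mantissa(26390, True))      # (0, 103, 22)
print(split_mantissa(2097151, False))   # (159, 255, 255)
```

The result matches what shifts and masks would produce, which is the point of the subtract-chain: it needs no bit operations.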

Precision

Honestly: not great. This is a scientific order-of-magnitude format, not a computation format. Expect 1–2 significant decimal digits across most of the range. Good enough for sensor readouts, color-mapped scalar fields, or coarse telemetry. Not suitable for accumulation, iterative arithmetic, or anything requiring more than ~2 sig figs.

By scale (400 samples per decade, ±)

Scale	Mean rel. err	Max rel. err	Abs step	±10^x	Sig digits
1e−6	4.68%	11.68%	7.87e−9	−9	1.3
1e−5	4.83%	10.60%	7.71e−8	−8	1.3
1e−4	3.76%	9.48%	7.34e−7	−7	1.4
1e−3	3.49%	8.16%	7.47e−6	−6	1.5
1e−2	2.67%	6.72%	7.99e−5	−5	1.6
1e−1	1.86%	4.76%	7.57e−4	−4	1.7
1e+0	0.55%	1.98%	7.88e−3	−3	2.3
1e+1	1.97%	5.19%	8.19e−2	−2	1.7
1e+2	3.03%	7.02%	7.76e−1	−1	1.5
1e+3	3.56%	8.45%	8.30e+0	+0	1.4
1e+4	3.66%	9.67%	8.45e+1	+1	1.4
1e+5	4.24%	10.99%	8.05e+2	+2	1.4
1e+6	11.66%	21.30%	7.87e+3	+3	0.9

The ±10^x column is the absolute error exponent: at scale S the absolute error is bounded by roughly ±10^(floor(log₁₀(S)) − 3). It tracks scale exactly — the format behaves like floating-point, not fixed-point.

Best precision is near scale 1 (values between ~0.5 and ~2), where the quadratic exponent mapping is densest. The 1e+6 boundary degrades because A=255 is a hard edge with no finer quantization available.
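
Per-decade error can be approximated with a short measurement loop. The sketch below restates the Python reference codec from the Implementations section so that it runs on its own. The sampling scheme (400 log-spaced positive values per decade) and the helper name `mean_rel_err` are assumptions, so the output need not reproduce the table digit for digit.

```python
import math

L = math.log2(10); IL = 1 / L; K = 2709.375
C0 = 255 / 127; C1 = 255 / 32512; C2 = 255 / 8323072; MS = 2097151.0

def decode(r: float, g: float, b: float, a: float) -> float:
    s = -1.0 if r >= 0.5 else 1.0
    rm = r - (0.5 if r >= 0.5 else 0.0)
    t = (a - 0.5) * 2.0
    E = math.copysign(t * t * 6.0, t)
    return s * (1.0 + rm * C0 + g * C1 + b * C2) * 2.0 ** (E * L)

def encode(v: float) -> tuple[int, int, int, int]:
    if v == 0.0:
        return (0, 0, 0, 127)
    p = v > 0.0
    a = abs(v)
    er = max(-6.0, min(6.0, math.log2(a) * IL))
    A = max(0, min(255, int(127.5 + (1 if er >= 0 else -1) * math.sqrt(abs(er) * K) + 0.5)))
    ts = (A / 255.0 - 0.5) * 2.0
    Es = math.copysign(ts * ts * 6.0, ts)
    M = max(1.0, min(1.9999999, a * 2.0 ** (-Es * L)))
    mi = int((M - 1.0) * MS + 0.5)
    rr = mi // 65536
    rem = mi - rr * 65536
    gg = rem // 256
    return (rr % 128 + (0 if p else 128), gg, rem - gg * 256, A)

def mean_rel_err(decade: int, n: int = 400) -> float:
    """Mean relative roundtrip error over n log-spaced values in [10^d, 10^(d+1))."""
    total = 0.0
    for i in range(n):
        v = 10.0 ** (decade + i / n)
        R, G, B, A = encode(v)
        total += abs(decode(R / 255, G / 255, B / 255, A / 255) - v) / v
    return total / n

for d in (-6, -3, 0, 3, 5):
    print(d, f"{100 * mean_rel_err(d):.2f}%")
```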

Roundtrip samples

Value	RGBA	Decoded	Rel. err	±10^x
`1e-6`	`(0,0,0,0)`	`1e-06`	0.000%	exact
`0.0001`	`(2,84,217,23)`	`9.49e-05`	5.08%	−6
`0.001`	`(1,188,35,37)`	`0.000962`	3.85%	−5
`0.1`	`(1,76,239,75)`	`0.09708`	2.92%	−3
`1.0`	`(0,0,0,128)`	`1.00021`	0.021%	−4
`3.14159`	`(0,103,22,164)`	`3.11239`	0.93%	−2
`2.71828`	`(0,0,0,162)`	`2.74984`	1.16%	−2
`10.0`	`(0,0,0,180)`	`10.4064`	4.06%	−1
`100.0`	`(0,115,175,201)`	`98.958`	1.04%	+0
`10000.0`	`(0,0,0,232)`	`10728.6`	7.29%	+2
`1e6`	`(0,0,0,255)`	`1e+06`	0.000%	exact
`-1.0`	`(128,0,0,128)`	`-1.00415`	0.42%	−3
`-1e-6`	`(128,0,0,0)`	`-1.004e-06`	0.39%	−9
`-1e6`	`(128,0,0,255)`	`-1.004e+06`	0.39%	+3
`123456.789`	`(0,0,0,245)`	`124662`	0.98%	+3
`-4.2e-5`	`(128,0,0,19)`	`-4.54e-05`	8.01%	−6
`0.5`	`(0,0,0,99)`	`0.501427`	0.29%	−3

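Exact hits at +1e±6 because those values pin A to its endpoint states (0 or 255) with a mantissa of exactly 1, so no rounding error enters. The negative counterparts decode 0.39% high in magnitude: R = 128 normalizes to slightly more than 0.5, and the leftover after subtracting 0.5 adds 0.5/127 to the mantissa.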

Why not a dedicated sign bit?

All four channels must remain valid RGBA u8 values normalized to [0, 1]. A dedicated sign bit would require bit manipulation on decode, breaking the "normalized scalar only" ABI. The R-MSB approach keeps decode as pure scalar arithmetic — the threshold comparison r >= 0.5 is identical to R >= 128 but works on any normalized float regardless of byte order or source format.

Implementations

All snippets implement the same ABI:

decode: takes four f32 ∈ [0, 1] (caller normalizes u8 / 255), returns f32
encode: takes f32, returns four u8 values

Constants used everywhere:

Name	Value	Meaning
`L`	3.321928…	log₂(10)
`IL`	0.301029…	1/log₂(10) = log₁₀(2)
`K`	2709.375	127.5²/6 — isqrt scale
`C0`	2.007874…	255/127
`C1`	0.007843…	255/32512
`C2`	0.000031…	255/8323072
`MS`	2097151.0	2²¹−1

Python

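import math

L  = math.log2(10)       # log2(10)
IL = 1 / L               # log10(2)
K  = 2709.375            # 127.5**2 / 6
C0 = 255 / 127
C1 = 255 / 32512
C2 = 255 / 8323072
MS = 2097151.0           # 2**21 - 1

def decode(r: float, g: float, b: float, a: float) -> float:
    """Four normalized channels in [0, 1] -> scalar value."""
    s  = -1.0 if r >= 0.5 else 1.0               # sign: upper half of R is negative
    rm = r - (0.5 if r >= 0.5 else 0.0)          # strip the sign half
    t  = (a - 0.5) * 2.0                         # exponent parameter in [-1, +1]
    E  = math.copysign(t * t * 6.0, t)           # signed quadratic exponent
    return s * (1.0 + rm * C0 + g * C1 + b * C2) * 2.0 ** (E * L)

def encode(v: float) -> tuple[int, int, int, int]:
    """Scalar value -> (R, G, B, A) bytes."""
    if v == 0.0:
        return (0, 0, 0, 127)                    # zero has no state of its own
    p = v > 0.0
    a = abs(v)
    er = max(-6.0, min(6.0, math.log2(a) * IL))  # real log10, clamped to the range
    A  = max(0, min(255, int(127.5 + (1 if er >= 0 else -1) * math.sqrt(abs(er) * K) + 0.5)))
    ts = (A / 255.0 - 0.5) * 2.0                 # exponent actually stored in A
    Es = math.copysign(ts * ts * 6.0, ts)
    M  = max(1.0, min(1.9999999, a * 2.0 ** (-Es * L)))
    mi = int((M - 1.0) * MS + 0.5)               # mantissa residual as an integer
    rr = mi // 65536
    rem = mi - rr * 65536
    gg = rem // 256
    return (rr % 128 + (0 if p else 128), gg, rem - gg * 256, A)

# usage
rgba = encode(3.14159)                    # → (0, 103, 22, 164)
val  = decode(*[x / 255 for x in rgba])   # → 3.11239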

C

#include <math.h>
#include <stdint.h>

#define L   3.32192809488736234787f  /* log2(10)   */
#define IL  0.30102999566398119521f  /* 1/log2(10) */
#define K   2709.375f               /* 127.5^2/6  */
#define C0  2.00787401574803149606f  /* 255/127    */
#define C1  0.00784325787401574803f  /* 255/32512  */
#define C2  0.00003063772607037402f  /* 255/8323072*/
#define MS  2097151.0f              /* 2^21 - 1   */

typedef struct { uint8_t r, g, b, a; } RGBA;

static inline float rgba_decode(float r, float g, float b, float a) {
    float s  = (r >= 0.5f) ? -1.f : 1.f;
    float rm = r - ((r >= 0.5f) ? 0.5f : 0.f);
    float t  = (a - 0.5f) * 2.f;
    float E  = copysignf(t*t*6.f, t);
    return s * (1.f + rm*C0 + g*C1 + b*C2) * exp2f(E * L);
}

static inline RGBA rgba_encode(float v) {
    if (v == 0.f) return (RGBA){0, 0, 0, 127};
    int   pos = v > 0.f;
    float a   = fabsf(v);
    float er  = fmaxf(-6.f, fminf(6.f, log2f(a) * IL));
    int   A   = (int)fmaxf(0.f, fminf(255.f,
                    127.5f + (er >= 0.f ? 1.f : -1.f) * sqrtf(fabsf(er)*K) + .5f));
    float ts  = (A/255.f - .5f)*2.f;
    float Es  = copysignf(ts*ts*6.f, ts);
    float M   = fmaxf(1.f, fminf(1.9999999f, a * exp2f(-Es * L)));
    int   mi  = (int)(((M-1.f)*MS) + .5f);
    int   rr  = mi / 65536, rem = mi - rr*65536, gg = rem / 256;
    return (RGBA){
        (uint8_t)(rr%128 + (pos ? 0 : 128)),
        (uint8_t)gg,
        (uint8_t)(rem - gg*256),
        (uint8_t)A
    };
}

/* usage */
/*
    RGBA px = rgba_encode(3.14159f);
    float v = rgba_decode(px.r/255.f, px.g/255.f, px.b/255.f, px.a/255.f);
*/

Java

public final class RgbaFloat {

    private static final float L   = (float)(Math.log(10) / Math.log(2));
    private static final float IL  = 1.f / L;
    private static final float K   = 2709.375f;
    private static final float C0  = 255f / 127f;
    private static final float C1  = 255f / 32512f;
    private static final float C2  = 255f / 8323072f;
    private static final float MS  = 2097151f;

    private RgbaFloat() {}

    /** Four normalized floats [0,1] → scalar value. */
    public static float decode(float r, float g, float b, float a) {
        float s  = (r >= 0.5f) ? -1f : 1f;
        float rm = r - ((r >= 0.5f) ? 0.5f : 0f);
        float t  = (a - 0.5f) * 2f;
        float E  = Math.signum(t) * t * t * 6f;
        return s * (1f + rm*C0 + g*C1 + b*C2) * (float)Math.pow(2f, E * L);
    }

    /** Scalar value → int[4] with R,G,B,A in [0,255]. */
    public static int[] encode(float v) {
        if (v == 0f) return new int[]{0, 0, 0, 127};
        boolean pos = v > 0f;
        float   a   = Math.abs(v);
        float   er  = Math.max(-6f, Math.min(6f,
                          (float)(Math.log(a) / Math.log(2)) * IL));
        int     A   = Math.max(0, Math.min(255,
                          (int)(127.5f + (er >= 0f ? 1f : -1f)
                                * (float)Math.sqrt(Math.abs(er) * K) + .5f)));
        float   ts  = (A / 255f - .5f) * 2f;
        float   Es  = Math.signum(ts) * ts * ts * 6f;
        float   M   = Math.max(1f, Math.min(1.9999999f,
                          a * (float)Math.pow(2f, -Es * L)));
        int     mi  = (int)((M - 1f) * MS + .5f);
        int     rr  = mi / 65536, rem = mi - rr*65536, gg = rem / 256;
        return new int[]{
            rr % 128 + (pos ? 0 : 128),
            gg,
            rem - gg * 256,
            A
        };
    }

    // usage:
    //   int[]  px = RgbaFloat.encode(3.14159f);
    //   float  v  = RgbaFloat.decode(px[0]/255f, px[1]/255f, px[2]/255f, px[3]/255f);
}

WGSL

Strict math ABI: only exp2, log2, sqrt, sign, abs, select, clamp. No log, no pow, no copysign. All GPU-friendly scalar arithmetic.

const L :f32= 3.321928094887362;   // log2(10)
const IL:f32= 0.301029995663981;   // 1/log2(10)
const K :f32= 2709.375;            // 127.5²/6  — isqrt scale
const C0:f32= 2.007874015748031;
const C1:f32= 0.007843257874016;   // 255/32512
const C2:f32= 0.000030637726070;   // 255/8323072
const MS:f32= 2097151.0;

// ch.xyzw = r,g,b,a normalized to [0,1]
fn decode(ch: vec4<f32>) -> f32 {
    let s  = select(1., -1., ch.x >= .5);
    let rm = ch.x - select(0., .5, ch.x >= .5);
    let t  = (ch.w - .5) * 2.;
    return s * (1. + rm*C0 + ch.y*C1 + ch.z*C2)
             * exp2(sign(t) * t*t * 6. * L);
}

fn encode(v: f32) -> vec4<u32> {
    if v == 0. { return vec4<u32>(0,0,0,127); }
    let p  = v > 0.;
    let a  = abs(v);
    let er = clamp(log2(a)*IL, -6., 6.);
    // isqrt collapse: sqrt(|E|/6)*127.5 = sqrt(|E|*K)
    let A  = u32(clamp(i32(127.5 + select(-1.,1.,er>=0.) * sqrt(abs(er)*K) + .5), 0, 255));
    let ts = (f32(A)/255. - .5)*2.;
    let Es = sign(ts)*ts*ts*6.;
    let M  = clamp(a * exp2(-Es*L), 1., 1.9999999);
    let mi = u32(clamp(i32((M-1.)*MS+.5), 0, i32(MS)));
    // arithmetic decompose — no bit ops
    let rr = mi/65536u; let rem = mi - rr*65536u; let gg = rem/256u;
    return vec4<u32>(rr%128u + select(0u,128u,!p), gg, rem-gg*256u, A);
}

GLSL

GLSL 4.5 / ES 3.0. Uses exp2, log2, sqrt, sign, abs, and the ternary operator in place of select. No bit ops in decode.

#define L   3.321928094887362
#define IL  0.301029995663981
#define K   2709.375
#define C0  2.007874015748031
#define C1  0.007843257874016
#define C2  0.000030637726070
#define MS  2097151.0

// Decode: pass in texture sample directly (already [0,1])
float rgba_decode(vec4 ch) {
    float s  = ch.x >= 0.5 ? -1.0 : 1.0;
    float rm = ch.x - (ch.x >= 0.5 ? 0.5 : 0.0);
    float t  = (ch.w - 0.5) * 2.0;
    return s * (1.0 + rm*C0 + ch.y*C1 + ch.z*C2)
             * exp2(sign(t) * t*t * 6.0 * L);
}

// Encode: returns uvec4 with components in [0,255]
uvec4 rgba_encode(float v) {
    if (v == 0.0) return uvec4(0, 0, 0, 127);
    bool  p  = v > 0.0;
    float a  = abs(v);
    float er = clamp(log2(a)*IL, -6.0, 6.0);
    uint  A  = uint(clamp(int(127.5 + (er>=0.0 ? 1.0:-1.0)*sqrt(abs(er)*K) + 0.5), 0, 255));
    float ts = (float(A)/255.0 - 0.5)*2.0;
    float Es = sign(ts)*ts*ts*6.0;
    float M  = clamp(a * exp2(-Es*L), 1.0, 1.9999999);
    uint  mi = uint(clamp(int((M-1.0)*MS+0.5), 0, int(MS)));
    uint  rr = mi/65536u, rem = mi - rr*65536u, gg = rem/256u;
    return uvec4(rr%128u + (p ? 0u : 128u), gg, rem-gg*256u, A);
}

// Decode from a texture (RGBA8 sampled as unorm [0,1]):
//   float val = rgba_decode(texture(uScalarTex, uv));
//
// Encode to a pixel write (RGBA32UI rendertarget):
//   fragColor = rgba_encode(scalar_value);

Notes and limitations

Zero is unrepresentable. The format has no zero state. encode(0) returns (0,0,0,127), which decodes to +0.999788: a placeholder close to 1, not the representable value closest to zero (that is (0,0,0,0), about 10⁻⁶). Applications that need to distinguish zero should reserve a sentinel (e.g. all-zero RGBA as a special case handled outside the formula, which costs the +10⁻⁶ state).
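
One way to layer such a sentinel on top of the codec is sketched below. Everything here is an assumption for illustration: the wrapper names, the choice of all-zero RGBA as the sentinel, and the decision to nudge a genuine (0,0,0,0) result to the neighbouring state (0,0,1,0). The codec functions are passed in as arguments so the wrappers stay independent of any particular implementation.

```python
from typing import Callable, Tuple

RGBA = Tuple[int, int, int, int]
ZERO_SENTINEL: RGBA = (0, 0, 0, 0)   # assumption: give up the +1e-6 state to mean "zero"

def encode_with_zero(v: float, encode: Callable[[float], RGBA]) -> RGBA:
    """Encode v, reserving the all-zero pixel for exact zero."""
    if v == 0.0:
        return ZERO_SENTINEL
    px = encode(v)
    # A genuine (0,0,0,0) result is moved to its neighbour so it is not read as zero.
    return (0, 0, 1, 0) if px == ZERO_SENTINEL else px

def decode_with_zero(px: RGBA,
                     decode: Callable[[float, float, float, float], float]) -> float:
    """Decode a pixel, mapping the sentinel back to exact zero."""
    if px == ZERO_SENTINEL:
        return 0.0
    return decode(*(c / 255 for c in px))
```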

Range is hard-clamped. Values outside ±[10⁻⁶, 2·10⁶] will silently saturate at the nearest endpoint. No overflow flag, no infinity, no NaN.
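
The decodable extremes follow from the formula itself. The sketch below evaluates the decode arithmetic on the two extreme positive byte patterns; `decode_bytes` is a hypothetical convenience wrapper that normalizes the bytes before applying the formula. It shows what decode can represent, not what a given encoder will emit.

```python
def decode_bytes(R: int, G: int, B: int, A: int) -> float:
    """Decode raw bytes with the normalized-scalar formula (sign threshold at 0.5)."""
    r, g, b, a = R / 255, G / 255, B / 255, A / 255
    s = -1.0 if r >= 0.5 else 1.0
    rm = r - (0.5 if r >= 0.5 else 0.0)
    t = (a - 0.5) * 2.0
    E = (1.0 if t >= 0 else -1.0) * t * t * 6.0
    M = 1.0 + rm * 255 / 127 + g * 255 / 32512 + b * 255 / 8323072
    return s * M * 10.0 ** E

print(decode_bytes(0, 0, 0, 0))          # smallest positive magnitude, about 1e-6
print(decode_bytes(127, 255, 255, 255))  # largest positive value, about 2.008e6
```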

Not suitable for arithmetic. Decoded values should be treated as read-only measurements. Adding or multiplying two decoded values and re-encoding will accumulate encoding error with each round-trip.

Texture store use case. The primary motivation is storing a signed scalar field in an RGBA8 texture — common in scientific visualization (temperature, pressure, elevation, signed distance fields). The format requires no custom texture formats, works on any GPU, and decodes in a single texture fetch plus ~10 ALU operations.