Tiago Chaves laurelkeys

Lessons from Hash Table Merging

Merging two hash maps seems like an O(N) operation. However, while merging millions of keys, I unexpectedly ran into a >10x performance degradation. This post explores why some of the most popular libraries fall into this trap and how to fix it. The source code is available here.
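For reference, the kind of merge being discussed can be sketched as below. This is only a minimal baseline in Rust using the standard library's `HashMap`; the function name `merge_counts` and the choice to sum values for duplicate keys are assumptions made for illustration, and nothing here reproduces or explains the slowdown described above.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Merge `src` into `dst`, summing the values of keys present in both.
/// Hypothetical helper for illustration; not taken from the post's source code.
pub fn merge_counts<K: Eq + Hash>(dst: &mut HashMap<K, u64>, src: HashMap<K, u64>) {
    // Reserve for the worst case (no shared keys) so that resizing
    // does not happen in the middle of the merge loop.
    dst.reserve(src.len());
    for (key, value) in src {
        // One lookup per key: insert 0 if absent, then accumulate.
        *dst.entry(key).or_insert(0) += value;
    }
}
```

On paper this is one expected-constant-time insertion per key of `src`, which is where the O(N) expectation comes from.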

The setup
Merging hash tables may be slow

Article:

https://lotusspring.substack.com

Disclaimer!

This code was written without the intention of being publicly shared. Not much effort was put into beautification or anything like that; it is one big file that does it all! Some effort is required on your part to make this compile.

Python Disclaimer!

I heavily dislike Python and consider the code wasteful slop. I have very little Python experience, so there are likely much better ways of writing the Python portion. Exercise caution!

What you NEED to Know Before Touching a Video File

Hanging out in subtitling and video re-editing communities, I see my fair share of novice video editors and video encoders, and see plenty of them make the classic beginner mistakes when it comes to working with videos. A man can only read "Use Handbrake to convert your mkv to an mp4 :)" so many times before losing it, so I am writing this article to channel the resulting psychic damage into something productive.

If you are new to working with videos (or, let's face it, even if you aren't), please read through this guide to avoid making mistakes that can cost you lots of time, computing power, storage space, or video quality.

Writing to Compressed Textures

In general it's not possible to use a block-compressed texture as a render target or as a compute shader output. Instead you have to either alias the block-compressed texture with an uncompressed texture where each texel corresponds to a block, or output the compressed blocks to an uncompressed texture buffer and then copy them from that intermediate memory location to the final compressed texture.

Each of the graphics APIs exposes this functionality in a different way. This document explains the options available under the following APIs:

Direct3D
Vulkan
Metal
OpenGL
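
Whichever API is used, the aliasing option relies on the same arithmetic: the uncompressed texture has one texel per compressed block. A minimal sketch of that arithmetic follows, assuming 4x4-texel blocks of 8 or 16 bytes as in the BCn formats; the function names are hypothetical and no graphics API is called.

```rust
/// Texels per block edge. Assumption: BCn-style 4x4 blocks.
const BLOCK_DIM: u32 = 4;

/// Extent of the uncompressed texture that aliases a block-compressed
/// texture of `width` x `height` texels, one texel per block.
pub fn alias_extent(width: u32, height: u32) -> (u32, u32) {
    // Round up: a partially covered block still occupies a whole block.
    let blocks_x = (width + BLOCK_DIM - 1) / BLOCK_DIM;
    let blocks_y = (height + BLOCK_DIM - 1) / BLOCK_DIM;
    (blocks_x, blocks_y)
}

/// Bytes needed for one mip level, given the block size in bytes
/// (8 for BC1/BC4, 16 for BC2/BC3/BC5/BC6H/BC7).
pub fn compressed_size_bytes(width: u32, height: u32, block_bytes: u32) -> u64 {
    let (blocks_x, blocks_y) = alias_extent(width, height);
    u64::from(blocks_x) * u64::from(blocks_y) * u64::from(block_bytes)
}
```

For example, a 256x256 BC1 texture would be aliased by a 64x64 texture whose texels are 8 bytes wide (typically a two-channel 32-bit unsigned integer format), and 16-byte blocks would pair with a four-channel 32-bit format.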

Every atomic object has a timeline (TL) of writes:
- A write is either a store or a read-modify-write (RMW); an RMW reads the latest write and pushes a new one.
- A write is tagged either Relaxed, Release, or SeqCst.
- A read observes some write on the timeline:
  - On the same thread, future reads can't go backwards on the timeline.
  - A read is either tagged Relaxed, Acquire, or SeqCst.
  - RMWs can also be tagged Acquire (or AcqRel). If so, the Acquire refers to the "read" portion of "RMW".
Each thread has its own view of the world:

The write timelines are shared, but each thread could be reading at a different point on them.
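
The rules above map directly onto Rust's `std::sync::atomic` orderings. The sketch below is an illustration rather than part of the original notes: `publish_and_read` pairs a Release write with an Acquire read, and `count_with_rmw` relies on the fact that an RMW always reads the latest write on the timeline. Both function names are made up for this example.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Message passing: the writer pushes a Relaxed write to `data`, then a
/// Release write to `ready`. Once the reader's Acquire read observes that
/// Release write, its view of `data` includes the earlier write.
pub fn publish_and_read() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let writer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed);
            ready.store(true, Ordering::Release);
        })
    };

    // Spin until the Acquire read observes the Release write.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed);
    writer.join().unwrap();
    seen
}

/// RMWs read the latest write and push a new one, so no increment is
/// lost even when every operation is tagged Relaxed.
pub fn count_with_rmw(threads: u32, per_thread: u32) -> u32 {
    let counter = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

If the store to `ready` and the load from it were both Relaxed, the reader could see `ready == true` while still reading an older point on the timeline of `data`.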

Header

Some bold text before an inline function: $y = x^2$

NOTE: The gist preview is completely whacked. Click on raw for the source.

What is "Work Expansion"

In a GPU-driven renderer, "work expansion" is a commonly occurring problem. "Work Expansion" means that a single item of work spawns N following work items. Typically one work item will be executed by one shader thread/invocation.

An example of work expansion is GPU-driven meshlet culling following mesh culling. In this example a "work item" is culling a mesh, and each mesh cull work item spawns N meshlet cull work items.

There are many diverse cases of this problem and many solutions. Some are trivial to solve, for example, when N (how many work items are spawned) is fixed.
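
When N varies per item, one common approach (not necessarily the one this write-up settles on) is an exclusive prefix sum over the spawn counts followed by a binary search per destination work item. The sketch below runs on the CPU in Rust purely to show the indexing; the names `expansion_offsets` and `locate` are hypothetical, and in a real renderer this logic would live in shader code.

```rust
/// Exclusive prefix sum of the spawn counts. `offsets[i]` is the index of
/// the first destination work item spawned by source item `i`; the final
/// element is the total number of destination work items.
pub fn expansion_offsets(counts: &[u32]) -> Vec<u32> {
    let mut offsets = Vec::with_capacity(counts.len() + 1);
    let mut total = 0u32;
    offsets.push(total);
    for &count in counts {
        total += count;
        offsets.push(total);
    }
    offsets
}

/// Map a destination work item index back to (source item, local index).
/// Returns `None` when `dst` is past the last spawned work item.
pub fn locate(offsets: &[u32], dst: u32) -> Option<(usize, u32)> {
    let total = *offsets.last()?;
    if dst >= total {
        return None;
    }
    // Binary search for the first offset greater than `dst`; the source
    // item is the one just before it. Items that spawn zero work items
    // are skipped naturally because their range is empty.
    let src = offsets.partition_point(|&offset| offset <= dst) - 1;
    Some((src, dst - offsets[src]))
}
```

With spawn counts `[2, 0, 3]` the offsets are `[0, 2, 2, 5]`, so destination item 2 maps to source item 2 with local index 0. In the fixed-N case mentioned above none of this is needed, since the source item is `dst / N` and the local index is `dst % N`.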

	// Implements "Recursive Implementation of the Gaussian Filter Using Truncated Cosine Functions" by Charalampidis [2016].
	// https://discovery.researcher.life/article/recursive-implementation-of-the-gaussian-filter-using-truncated-cosine-functions/dcf24675f5eb30dba93c5205cdae3c40
	// This code is based on:
	// https://github.com/cloudinary/ssimulacra2/blob/main/src/lib/jxl/gauss_blur.cc
	// Copyright (c) the JPEG XL Project Authors. All rights reserved.

	struct RecursiveGaussian {
	RecursiveGaussian(float sigma);

	float mul_in[3];

	const State = struct {
	clowns: StringHashMap(Clown) = .empty,

	const Clown = struct {
	scariness: f32,
	funniness: f32,
	};

	fn deinit(state: *State, gpa: Allocator) void {
	var it = state.clowns.iterator();

	// run with `RUSTFLAGS='-C target-cpu=native' cargo +nightly bench`

	#![feature(test)]

	fn main() {
	let mut a = [0u32; 65536];
	a[1] = 42;
	println!("{}", scalar_max(&a));
	println!("{}", avx2_max(&a));
	}