@newpolaris
Forked from badboy/inthash.md
Last active February 17, 2017 04:46
Original link: http://www.concentric.net/~Ttwang/tech/inthash.htm
Taken from: http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm
Reformatted using pandoc

Integer Hash Function

Thomas Wang, Jan 1997

last update Mar 2007

Version 3.1


Abstract

An integer hash function accepts an integer hash key and returns an integer hash result with uniform distribution. This article discusses the construction of integer hash functions.

Introduction

The hash table is an important data structure, and every elementary data-structures textbook covers its algorithms. All too often, however, the hash function itself is treated as an afterthought. As a result, examples abound of systems where a poor choice of hash function resulted in inferior performance.

The integer hash function is the most basic form of hash function: it transforms an integer hash key into an integer hash result. The distribution of a hash function should be uniform, which implies that when the hash result is used to calculate a hash bucket address, all buckets are equally likely to be picked. In addition, similar hash keys should hash to very different hash results; ideally, a single-bit change in the hash key should influence all bits of the hash result.

Hash Function Construction Principles

A good mixing function must be reversible. A hash function has the form h(x) -> y. If the input word size and the output word size are identical, and in addition the operations in h() are reversible, then the following properties hold.

  1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
  2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.

The case of h(x1) == y1 and h(x2) == y1 is called a collision. Using only reversible operations in a hash function makes collisions impossible: there is a one-to-one mapping between the input and the output of the mixing function.

Besides reversibility, the operations must use a chain of computations to achieve avalanche. Avalanche means that a single bit of difference in the input will make about half of the output bits differ. At each point in the chain, a new result is obtained by a computation involving earlier results.

For example, the operation a = a + b is reversible if we know the value of 'b' and the after value of 'a': the before value of 'a' is obtained by subtracting the value of 'b' from the after value of 'a'.
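To make the reversibility argument concrete, here is a small Java sketch (my own illustration, not part of the original article) that inverts two common mixing steps: the addition step above and the xor-shift step used by the hash functions later in this article.

```java
public class Reversible {
    // Forward steps: an addition and a right xor-shift.
    static int addStep(int a, int b) { return a + b; }
    static int xorShiftStep(int x, int s) { return x ^ (x >>> s); }

    // Subtraction undoes the addition.
    static int invAddStep(int a, int b) { return a - b; }

    // The top s bits of y already equal the top s bits of x, and each
    // pass recovers s more bits, so a few iterations restore all of x.
    static int invXorShiftStep(int y, int s) {
        int x = y;
        for (int i = 0; i < 32; i += s) {
            x = y ^ (x >>> s);
        }
        return x;
    }

    public static void main(String[] args) {
        int x = 0xDEADBEEF, b = 12345;
        if (invAddStep(addStep(x, b), b) != x) throw new AssertionError();
        if (invXorShiftStep(xorShiftStep(x, 12), 12) != x) throw new AssertionError();
        System.out.println("both mixing steps inverted");
    }
}
```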

Knuth's Multiplicative Method

In Knuth's "The Art of Computer Programming", section 6.4, multiplicative hashing is introduced as a way to write a hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.

Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of keys to hash results with no overlap. This method works pretty well if the keys have small values, but produces bad hash results if the keys vary mostly in the upper bits: as in all multiplications, variations in the upper digits do not influence the lower digits of the product.
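A minimal Java sketch of the method (method and class names are my own). Java's int multiplication wraps modulo 2^32, matching the unsigned arithmetic described above; one common way to use the product for a table of 2^p buckets is to keep its top p bits:

```java
public class KnuthMult {
    // Multiply by the golden-ratio constant and keep the top p bits
    // (valid for 1 <= p <= 32; Java int multiplication wraps mod 2^32).
    static int hash(int key, int p) {
        return (key * (int) 2654435761L) >>> (32 - p);
    }

    public static void main(String[] args) {
        System.out.println(hash(123456, 14));
    }
}
```

Keeping the top bits rather than the bottom ones matters, because the upper digits of the product carry the mixed-in information.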

Robert Jenkins' 96 bit Mix Function

Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.

All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.

The following source is the mixing part of the hash function.

int mix(int a, int b, int c)
{
  a=a-b;  a=a-c;  a=a^(c >>> 13);
  b=b-c;  b=b-a;  b=b^(a << 8);
  c=c-a;  c=c-b;  c=c^(b >>> 13);
  a=a-b;  a=a-c;  a=a^(c >>> 12);
  b=b-c;  b=b-a;  b=b^(a << 16);
  c=c-a;  c=c-b;  c=c^(b >>> 5);
  a=a-b;  a=a-c;  a=a^(c >>> 3);
  b=b-c;  b=b-a;  b=b^(a << 10);
  c=c-a;  c=c-b;  c=c^(b >>> 15);
  return c;
}

Variable 'c' contains the input key; when the mixing is complete, it also contains the hash result. Variables 'a' and 'b' contain initialized random bits. Notice that the total internal state is 96 bits, much larger than the final output of 32 bits, and that the sequence of subtractions rolls through variables 'a' to 'c' three times. Each row acts on one variable, mixing in information from the other two, followed by a shift operation.

Subtraction is similar to multiplication in that changes in the upper bits of the key do not influence the lower bits of the result. The nine bit-shift operations in Robert Jenkins' mixing algorithm shift the key right by 61 bits in total and left by 34 bits in total. As the calculation is chained, each exclusive-or doubles the number of states: there are at least 2^9 different combined versions of the original key, shifted by various amounts. That is why a single bit change in the key can influence widely separated bits in the hash result.

The uniform distribution of the hash function can be seen from the nature of the subtraction operation. Look at a single-bit subtraction between a key bit and a random bit. If the random bit is 0, the key bit remains unchanged; if the random bit is 1, the key bit is flipped (with a borrow into the next position when the key bit is 0). Subtracting the random bits will cause about half of the key bits to be flipped, so even if the key is not uniform, subtracting the random bits results in a uniform distribution.

32 bit Mix Functions

Based on an original suggestion of Robert Jenkins' in 1997, I have done some research on a version of the integer hash function. This is my latest version as of January 2007. The specific values of the bit shifts were obtained by running the accompanying search program.

public int hash32shift(int key)
{
  key = ~key + (key << 15); // key = (key << 15) - key - 1;
  key = key ^ (key >>> 12);
  key = key + (key << 2);
  key = key ^ (key >>> 4);
  key = key * 2057; // key = (key + (key << 3)) + (key << 11);
  key = key ^ (key >>> 16);
  return key;
}

(~x) + y is equivalent to y - x - 1 in two's complement representation.

By taking advantage of native instructions such as 'add complement' and 'shift & add', the above hash function runs in 11 machine cycles on HP 9000 workstations.

Having more rounds will strengthen the hash function by making the result look more random, but performance will slow down accordingly. Simulation seems to prefer small shift amounts for inner rounds and large shift amounts for outer rounds.

Robert Jenkins' 32 bit integer hash function

uint32_t hash( uint32_t a)
{
   a = (a+0x7ed55d16) + (a<<12);
   a = (a^0xc761c23c) ^ (a>>19);
   a = (a+0x165667b1) + (a<<5);
   a = (a+0xd3a2646c) ^ (a<<9);
   a = (a+0xfd7046c5) + (a<<3);
   a = (a^0xb55a4f09) ^ (a>>16);
   return a;
}

This version of the integer hash function uses operations with integer constants to help produce a hash value. I suspect the actual values of the magic constants are not very important; even 16-bit constants may still work pretty well.

These magic constants open up the construction of perfect integer hash functions: a test program can vary the magic constants until a set of perfect hashes is found.

Using Multiplication for Hashing

Using multiplication requires a mechanism to transport changes from high bit positions to low bit positions. Bit reversal is best, but slow to implement; a viable alternative is left shifts.

Using multiplication presents a dilemma. Certain machine platforms support integer multiplication in hardware and can complete one in 4 cycles or less, but on other platforms an integer multiplication can take 8 or more cycles. Integer hash functions implemented with bit shifts, on the other hand, perform equally well on all platforms.

A compromise is to multiply the key by a 'sparse' bit pattern, which on machines without a fast integer multiplier can be replaced with a 'shift & add' sequence. An example is to multiply the key by (4096 + 8 + 1), with the equivalent expression (key + (key << 3)) + (key << 12).

On most machines, a bit shift of 3 bits or less followed by an addition can be performed in one cycle. For example, the Pentium's 'lea' instruction can be used to good effect to compute a 'shift & add' in one cycle.
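As a sanity check (mine, not from the article), the sparse-multiplier identity can be verified directly in Java, where int arithmetic wraps modulo 2^32 just like the unsigned arithmetic above:

```java
public class SparseMult {
    static int byMultiply(int key) { return key * 4105; }  // 4096 + 8 + 1
    static int byShiftAdd(int key) { return (key + (key << 3)) + (key << 12); }

    public static void main(String[] args) {
        for (int k : new int[] { 0, 1, -1, 123456, 0xDEADBEEF }) {
            if (byMultiply(k) != byShiftAdd(k)) throw new AssertionError();
        }
        System.out.println("shift & add matches the multiply");
    }
}
```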

Function hash32shiftmult() uses a combination of bit shifts and integer multiplication to hash the input key.

public int hash32shiftmult(int key)
{
  int c2=0x27d4eb2d; // a prime or an odd constant
  key = (key ^ 61) ^ (key >>> 16);
  key = key + (key << 3);
  key = key ^ (key >>> 4);
  key = key * c2;
  key = key ^ (key >>> 15);
  return key;
}

64 bit Mix Functions

public long hash64shift(long key)
{
  key = (~key) + (key << 21); // key = (key << 21) - key - 1;
  key = key ^ (key >>> 24);
  key = (key + (key << 3)) + (key << 8); // key * 265
  key = key ^ (key >>> 14);
  key = (key + (key << 2)) + (key << 4); // key * 21
  key = key ^ (key >>> 28);
  key = key + (key << 31);
  return key;
}

The longer width of 64 bits requires more mixing than the 32-bit version.

64 bit to 32 bit Hash Functions

One use for this kind of hash function is hashing a 64-bit virtual address to a hash table index. Because the output of the hash function is narrower than the input, the result is no longer one-to-one.

Another usage is to hash two 32 bit integers into one hash value.

public int hash6432shift(long key)
{
  key = (~key) + (key << 18); // key = (key << 18) - key - 1;
  key = key ^ (key >>> 31);
  key = key * 21; // key = (key + (key << 2)) + (key << 4);
  key = key ^ (key >>> 11);
  key = key + (key << 6);
  key = key ^ (key >>> 22);
  return (int) key;
}
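To hash two 32-bit integers, the pair must first be packed into one 64-bit key. A Java sketch (the hashPair helper is my own; hash6432shift is copied from above):

```java
public class PairHash {
    // hash6432shift as given in the article.
    static int hash6432shift(long key) {
        key = (~key) + (key << 18);
        key = key ^ (key >>> 31);
        key = key * 21;
        key = key ^ (key >>> 11);
        key = key + (key << 6);
        key = key ^ (key >>> 22);
        return (int) key;
    }

    // Pack two 32-bit values into one 64-bit key; the mask keeps the
    // low word from sign-extending into the high word.
    static int hashPair(int hi, int lo) {
        long key = ((long) hi << 32) | (lo & 0xFFFFFFFFL);
        return hash6432shift(key);
    }

    public static void main(String[] args) {
        System.out.println(hashPair(1, 2));
        System.out.println(hashPair(2, 1));  // argument order changes the packed key
    }
}
```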

Parallel Operations

If a CPU can dispatch multiple instructions in the same clock cycle, one can consider adding more parallelism in the formula.

For example, in the following formula the two shift operations can be performed in parallel. On platforms with multiple ALUs but a single shifter unit, this idea offers no speed increase.

key ^= (key << 17) | (key >>> 16);

For 32 bit word operations, only certain pairs of shift amounts will be reversible. The valid pairs include: (17,16) (16,17) (14,19) (19,14) (13,20) (20,13) (10,23) (23,10) (8,25) (25,8)

Multiplication can also be computed in parallel. Any multiplication by an odd number is reversible.

key += (key << 3) + (key << 9); // key = key * (1 + 8 + 512)
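The reversibility of multiplication by an odd constant can be demonstrated with its multiplicative inverse modulo 2^32, which every odd number has. The Newton-iteration trick below is my own illustration, not from the article:

```java
public class OddInverse {
    // Newton iteration: each step doubles the number of correct low
    // bits of the inverse, so five steps are enough for 32 bits.
    static int inverse(int odd) {
        int x = odd;  // already correct to 3 bits, since odd^2 == 1 (mod 8)
        for (int i = 0; i < 5; i++) {
            x *= 2 - odd * x;
        }
        return x;
    }

    public static void main(String[] args) {
        int c = 521;  // 1 + 8 + 512, the multiplier above
        int inv = inverse(c);
        int key = 0xCAFEBABE;
        if ((key * c) * inv != key) throw new AssertionError();
        System.out.println("multiplication by " + c + " inverted");
    }
}
```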

On certain machines, bit rotation can be performed in one cycle. Any rotation by an odd number of bits can be made reversible when an exclusive-or is applied to the un-rotated key with one particular bit set to 1 or 0.

key = (key | 64) ^ ((key >>> 15) | (key << 17));

However, on certain machine and compiler combinations this code sequence can run as slowly as 4 cycles. At 2 cycles it is a win, at 3 cycles a tie, and at more than 3 cycles a loss.

Pseudo Random Usages

There have been queries about whether these mix functions can be used for pseudo-random purposes. Although the output does look random to the naked eye, the official recommendation is to use a real pseudo-random number generator instead, such as the Mersenne Twister.

The hash functions listed in this article were only tested for hashing purpose. No tests of randomness were performed.

Test Program

This is a test program for testing choices of shift amounts with regard to the resulting avalanche property. The program detects whether a certain output bit position shows both changes and no changes in response to a single-bit flip of the input. Promising candidates are further tested to verify that the chance of a bit flip is sufficiently close to 50% for all input and output bit pairs.

The test program prints out the name of the algorithm under test, followed by the list of shift amounts that pass the avalanche test.
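The search program itself is not reproduced in this copy, but the avalanche property it screens for can be sketched in Java (the measurement loop is my own; hash32shift is copied from above): flip one input bit at a time and count how many output bits change.

```java
public class AvalancheTest {
    // hash32shift as given in the article.
    static int hash32shift(int key) {
        key = ~key + (key << 15);
        key = key ^ (key >>> 12);
        key = key + (key << 2);
        key = key ^ (key >>> 4);
        key = key * 2057;
        key = key ^ (key >>> 16);
        return key;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(1);
        int trials = 10000;
        long flipped = 0;
        for (int t = 0; t < trials; t++) {
            int key = rnd.nextInt();
            int bit = 1 << rnd.nextInt(32);
            // Count output bits that change when one input bit flips.
            flipped += Integer.bitCount(hash32shift(key) ^ hash32shift(key ^ bit));
        }
        // A perfect avalanche averages 16 of 32 output bits flipped.
        System.out.println("average output bits flipped: " + (double) flipped / trials);
    }
}
```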

Power of 2 Hash Table Size

Programmers use hash table sizes that are powers of 2 because the address calculation can be performed very quickly. The integer hash function can be used to post-condition the output of a marginal-quality hash function before the final address calculation is done.

addr = inthash(marginal_hash_value) & (tablesize - 1);

Using the inlined version of the integer hash function is faster than doing a remaindering operation with a prime number! An integer remainder operation may take 18 cycles or longer to complete, depending on the machine architecture.
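For a power-of-2 table size the mask and the unsigned remainder agree exactly, which is why the cheap '&' can stand in for '%'. A small Java check (names mine):

```java
public class MaskVsMod {
    public static void main(String[] args) {
        int tablesize = 1024;  // power of 2
        for (int h : new int[] { 0, 1, 12345, -1, 0xDEADBEEF }) {
            int byMask = h & (tablesize - 1);  // cheap address calculation
            int byMod = Integer.remainderUnsigned(h, tablesize);
            if (byMask != byMod) throw new AssertionError();
        }
        System.out.println("mask equals unsigned remainder");
    }
}
```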

Conclusions

In this article, we have examined a number of hash function construction algorithms. Knuth's multiplicative method is the simplest, but has some known defects. Robert Jenkins' 96-bit mix function can be used as an integer hash function, but is more suitable for hashing long keys. A dedicated hash function is best suited for hashing a single integer.

We have also presented an application of the integer hash function to improve the quality of a hash value.


Send feedback to wang@cup.hp.com

Comments from original post:

@dbaarda commented on Aug 5, 2016

For power-of-2 hash table sizes, doing "& (tablesize - 1)" is equivalent to "mod tablesize". For tablesize = 2^N this uses only the bottom N bits, which throws away all the information in the higher-order bits.

If instead you use "mod (tablesize - 1)", this does end-around carries of the higher-order bits into the addr, so you use all the bits at the tiny cost of not using one table entry. If your marginal hash is not too bad, this might be good enough. This can be efficiently implemented for:

tablesize = 1 << N;
mask = tablesize - 1;

Using the following:

addr = marginal_hash_value;
while addr > mask 
   addr = (addr >> N) + (addr & mask);

Note this can give addr = mask as a result, but for non-zero marginal_hash_values it will never give a zero result. So it's not exactly the same as "mod mask", but close enough. To make it exactly the same you could add this at the end:

if addr >= mask
  addr -= mask;
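The folding trick can be made runnable as follows (a Java sketch with names of my own; for mask = 2^N - 1, the fold plus the final fix-up agrees with an unsigned "mod mask"):

```java
public class Fold {
    // Reduce value mod (2^N - 1) by repeatedly folding the high bits
    // back into the low N bits, then fixing up the addr == mask case.
    static int foldMod(int value, int N) {
        int mask = (1 << N) - 1;
        long addr = value & 0xFFFFFFFFL;  // treat the input as unsigned
        while (addr > mask) {
            addr = (addr >> N) + (addr & mask);
        }
        if (addr >= mask) addr -= mask;   // map the mask value to 0
        return (int) addr;
    }

    public static void main(String[] args) {
        int N = 10, mask = (1 << N) - 1;  // tablesize 1024, mod 1023
        for (int v : new int[] { 0, 1, 1023, 1024, 123456789, -1 }) {
            if (foldMod(v, N) != Integer.remainderUnsigned(v, mask)) {
                throw new AssertionError();
            }
        }
        System.out.println("fold matches mod " + mask);
    }
}
```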

@velavokr velavokr commented on Dec 6, 2016 A nice followup with avalanche analysis http://burtleburtle.net/bob/hash/integer.html

POD type hash

http://stackoverflow.com/questions/35826416/why-no-default-hash-for-c-pod-structs

return h1 ^ (h2 << 1); // or use boost::hash_combine

or recast to string

const std::string str(
    reinterpret_cast<const std::string::value_type*>( &a ), sizeof(A) );
return std::hash<std::string>()( str );

Usage of boost::hash_combine:

friend std::size_t hash_value(point const& p)
{
    std::size_t seed = 0;
    boost::hash_combine(seed, p.x);
    boost::hash_combine(seed, p.y);

    return seed;
}

Why boost::hash_combine uses this equation:

http://stackoverflow.com/questions/35985960/c-why-is-boosthash-combine-the-best-way-to-combine-hash-values

seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);

Simple hash algorithm

Introduction to Algorithms Thomas H. Cormen https://goo.gl/3LZpnO

Algorithm Robert Sedgewick http://algs4.cs.princeton.edu/34hash/

Java hashMap algorithm and prime number

http://stackoverflow.com/questions/5152015/why-setting-hashtables-length-to-a-prime-number-is-a-good-practice

@David Rodríguez - dribeas: As a real-life example, Java HashTable was initially implemented using prime (or almost-prime) sizes, but from Java 1.4 on the design was changed to use a power-of-two number of buckets, with a second fast hash function applied to the result of the initial hash. An interesting article commenting on that change (http://www.javaspecialists.eu/archive/Issue054.html): integer division and modulo operations are very slow.

JDK 1.3

// Rehash (size increment 1, 3, 7, 15, 31, 63, ...) 
int newCapacity = oldCapacity * 2 + 1;
Entry newMap[] = new Entry[newCapacity];

// put
int hash = key.hashCode();
int index = (hash & 0x7FFFFFFF) % tab.length;

But the recommended m is a prime not too close to a power of two. (Introduction to Algorithms, Cormen)

JDK 8u40 (http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/HashMap.java)

default load factor (.75) offers a good tradeoff between time and space cost

Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.) So we apply a transform that spreads the impact of higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (so don't benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.

static final int hash(Object key) {
   int h;
   return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

putMapEntries

// initial capacity 16; tableSizeFor finds the next power of two.
if (t > threshold)
	threshold = tableSizeFor(t);

C++ impl

libstdc++-api-4.5 4, 8, char* https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.5/a00877_source.html

const char* __cptr = reinterpret_cast<const char*>(__ptr);
for (; __clength; --__clength)
  {
    __hash ^= static_cast<size_t>(*__cptr++);
    __hash *= static_cast<size_t>(16777619UL);
  }

gcc-4.6.3 https://gcc.gnu.org/onlinedocs/gcc-4.6.3/libstdc++/api/a00886_source.html http://stackoverflow.com/questions/19411742/what-is-the-default-hash-function-used-in-c-stdunordered-map

struct _Hash_impl
{
  static size_t
  hash(const void* __ptr, size_t __clength,
       size_t __seed = static_cast<size_t>(0xc70f6907UL))
  { return _Hash_bytes(__ptr, __clength, __seed); }

  template<typename _Tp>
    static size_t
    hash(const _Tp& __val)
    { return hash(&__val, sizeof(__val)); }

  template<typename _Tp>
    static size_t
    __hash_combine(const _Tp& __val, size_t __hash)
    { return hash(&__val, sizeof(__val), __hash); }
};

Links

How did Knuth derive A?: http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a

http://www.azillionmonkeys.com/qed/hash.html https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/ http://burtleburtle.net/bob/hash/index.html http://burtleburtle.net/bob/hash/doobs.html


newpolaris commented Jan 21, 2017

Java Object.hashCode() algorithm

@nkukhar: The native hashCode implementation depends on the JVM. By default in HotSpot it returns a random number; you can check it in the source code (function get_next_hash)
http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/share/vm/runtime/synchronizer.cpp
http://stackoverflow.com/questions/10963345/does-java-internally-use-hashing-technique-to-manage-all-objects/10963376/
http://blogs.tedneward.com/post/objecthashcode-implementation/

Basic types override it:
http://stackoverflow.com/questions/11890805/hashcode-of-an-int

Taken from the Integer.class source code:

public int hashCode() {
    return value;
}

Long.class source code:
(int)(value ^ (value >>> 32))


newpolaris commented Feb 17, 2017

Simple hash algorithm

Introduction to Algorithms Thomas H. Cormen: https://goo.gl/3LZpnO

Algorithm Robert Sedgewick: http://algs4.cs.princeton.edu/34hash/

Cormen:

A good hash function satisfies the assumption that each key is equally likely to hash to any of the m slots.
In practice this condition is hard to verify, because the probability distribution from which the keys are drawn is rarely known and the keys may not be drawn independently.

Two heuristics: division and multiplication.
Universal hashing: randomize the choice of hash function for good performance.

Division heuristic:
h(k) = k mod m
=> if m = 2^p, the hash is just the lowest p bits of k, which is poor
=> instead, choose a prime not too close to an exact power of 2

Multiplication method:

  1. Extract the fractional part of kA (0 < A < 1)
  2. Multiply by m and take the floor:
    h(k) = floor(m * (kA mod 1))
    An advantage is that the value of m is not critical, so for an easy implementation we choose
    m = 2^p.
    If the machine word is w bits and k fits in one word, then
    kA·2^w = r1·2^w + r0,
    and the p-bit hash value is the p most significant bits of r0.

Knuth suggests that A close to (sqrt(5)-1)/2 generally works well.

With w = 32, p = 14, k = 123456, m = 16384 = 2^14:

Choosing A of the form s/2^32, with the fraction close to (sqrt(5)-1)/2, gives
A = 2654435769/2^32, so that
ks = 76300·2^32 + 17612864,
and the 14 most significant bits of r0 give h(k) = 67.

http://stackoverflow.com/a/41537995/1890382

#include <cassert>
#include <cstdint>

std::uint32_t knuth(int x, int p) {
    assert(p >= 0 && p <= 32);

    const std::uint32_t knuth = 2654435769;
    const std::uint32_t y = x;
    return (y * knuth) >> (32 - p);
}
