* Re: [PATCH] dcache: better name hash function
[not found] <9986527.24561256620662709.JavaMail.root@tahiti.vyatta.com>
@ 2009-10-27 5:19 ` Stephen Hemminger
2009-10-27 5:24 ` David Miller
2009-10-27 6:07 ` [PATCH] dcache: better name hash function Eric Dumazet
0 siblings, 2 replies; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-27 5:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
One of the root causes of slowness in network usage
was my original choice of a power of 2 for the hash size, to avoid
a mod operation. It turns out that if the size is not a power of 2,
the original algorithm works fairly well.
On a slow CPU, with 10 million entries and a hash size of 211:
Algorithm Time Ratio Max StdDev
string10 1.271871 1.00 47397 0.01
djb2 1.406322 1.00 47452 0.12
SuperFastHash 1.422348 1.00 48400 1.99
string_hash31 1.424079 1.00 47437 0.08
jhash_string 1.459232 1.00 47954 1.01
sdbm 1.499209 1.00 47499 0.22
fnv32 1.539341 1.00 47728 0.75
full_name_hash 1.556792 1.00 47412 0.04
string_hash17 1.719039 1.00 47413 0.05
pjw 1.827365 1.00 47441 0.09
elf 2.033545 1.00 47441 0.09
fnv64 2.199533 1.00 47666 0.53
crc 5.705784 1.00 47913 0.95
md5_string 10.308376 1.00 47946 1.00
fletcher 1.418866 1.01 53189 18.65
adler32 2.842117 1.01 53255 18.79
kr_hash 1.175678 6.43 468517 507.44
xor 1.114692 11.02 583189 688.96
lastchar 0.795316 21.10 1000000 976.02
How important is saving the one division, versus getting better
distribution?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 5:19 ` [PATCH] dcache: better name hash function Stephen Hemminger
@ 2009-10-27 5:24 ` David Miller
2009-10-27 17:22 ` [PATCH] net: fold network name hash Stephen Hemminger
2009-10-27 6:07 ` [PATCH] dcache: better name hash function Eric Dumazet
1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2009-10-27 5:24 UTC (permalink / raw)
To: stephen.hemminger
Cc: eric.dumazet, akpm, torvalds, opurdila, netdev, linux-kernel,
viro
From: Stephen Hemminger <stephen.hemminger@vyatta.com>
Date: Mon, 26 Oct 2009 22:19:44 -0700 (PDT)
> How important is saving the one division, versus getting better
> distribution?
80 cpu cycles or more on some processors. Cheaper to use
jenkins with a power-of-2 sized hash.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 5:19 ` [PATCH] dcache: better name hash function Stephen Hemminger
2009-10-27 5:24 ` David Miller
@ 2009-10-27 6:07 ` Eric Dumazet
2009-10-27 6:50 ` Eric Dumazet
1 sibling, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2009-10-27 6:07 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
Stephen Hemminger wrote:
> One of the root causes of slowness in network usage
> was my original choice of a power of 2 for the hash size, to avoid
> a mod operation. It turns out that if the size is not a power of 2,
> the original algorithm works fairly well.
Interesting, but I suspect all users have power of 2 tables :(
>
> On slow cpu; with 10million entries and 211 hash size
>
>
>
> How important is saving the one division, versus getting better
> distribution?
unsigned int fold1(unsigned hash)
{
	return hash % 211;
}
The compiler uses a reciprocal divide because 211 is a constant.
You could also try the following, which contains only one multiply,
and check whether the hash distribution properties are still OK:
unsigned int fold2(unsigned hash)
{
	return ((unsigned long long)hash * 211) >> 32;
}
fold1:
	movl 4(%esp), %ecx
	movl $-1689489505, %edx
	movl %ecx, %eax
	mull %edx
	shrl $7, %edx
	imull $211, %edx, %edx
	subl %edx, %ecx
	movl %ecx, %eax
	ret
fold2:
	movl $211, %eax
	mull 4(%esp)
	movl %edx, %eax
	ret
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 6:07 ` [PATCH] dcache: better name hash function Eric Dumazet
@ 2009-10-27 6:50 ` Eric Dumazet
2009-10-27 7:29 ` Eric Dumazet
0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2009-10-27 6:50 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
Eric Dumazet wrote:
> unsigned int fold2(unsigned hash)
> {
> return ((unsigned long long)hash * 211) >> 32;
> }
>
I tried this reciprocal thing with the 511 and 1023 values and got, on a PIII 550 MHz with gcc-3.3.2:
# ./hashtest 100000 511
jhash_string 0.033123 1.01 234 1.06
fnv32 0.033911 1.02 254 1.38
# ./hashtest 1000000 511
jhash_string 0.331155 1.00 2109 1.10
fnv32 0.359346 1.00 2151 1.65
# ./hashtest 10000000 511
jhash_string 3.383340 1.00 19985 1.03
fnv32 3.849359 1.00 20198 1.53
# ./hashtest 100000 1023
jhash_string 0.033123 1.03 134 1.01
fnv32 0.034260 1.03 142 1.32
# ./hashtest 1000000 1023
jhash_string 0.332329 1.00 1075 1.06
fnv32 0.422035 1.00 1121 1.59
# ./hashtest 10000000 1023
jhash_string 3.417559 1.00 10107 1.01
fnv32 3.747563 1.00 10223 1.35
A 511 value on 64-bit, and 1023 on 32-bit arches, is nice because
hashsz * sizeof(pointer) <= 4096, wasting space for one pointer only.
Conclusion: jhash and a 511/1023 hash size for netdevices;
no divides, only one multiply for the fold.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 6:50 ` Eric Dumazet
@ 2009-10-27 7:29 ` Eric Dumazet
2009-10-27 17:07 ` Stephen Hemminger
0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2009-10-27 7:29 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
Eric Dumazet wrote:
>
>
> 511 value on 64bit, and 1023 on 32bit arches are nice because
> hashsz * sizeof(pointer) <= 4096, wasting space for one pointer only.
>
> Conclusion : jhash and 511/1023 hashsize for netdevices,
> no divides, only one multiply for the fold.
Just forget about 511 & 1023, as powers of two work too.
-> 512 & 1024 + jhash
Guess what, David already said this :)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 7:29 ` Eric Dumazet
@ 2009-10-27 17:07 ` Stephen Hemminger
2009-10-27 17:32 ` Linus Torvalds
[not found] ` <4AE72B91.7040700@gmail.com>
0 siblings, 2 replies; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-27 17:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Andrew Morton, Linus Torvalds,
Octavian Purdila, netdev, linux-kernel, Al Viro
On Tue, 27 Oct 2009 08:29:51 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Eric Dumazet wrote:
> >
> >
> > 511 value on 64bit, and 1023 on 32bit arches are nice because
> > hashsz * sizeof(pointer) <= 4096, wasting space for one pointer only.
> >
> > Conclusion : jhash and 511/1023 hashsize for netdevices,
> > no divides, only one multiply for the fold.
>
> Just forget about 511 & 1023, as power of two works too.
>
> -> 512 & 1024 + jhash
>
> Guess what, David already said this :)
Rather than wasting space or doing an expensive modulus, just folding
the higher bits back with XOR redistributes the bits better.
On a fast machine (Nehalem):
100000000 Iterations
256 Slots (order 8)
Algorithm Time Ratio Max StdDev
string10 2.505290 1.00 390628 0.00
xor 2.521329 1.00 392120 2.14
SuperFastHash 2.781745 1.00 397027 4.43
fnv32 2.847892 1.00 392139 0.98
djb2 2.886342 1.00 390827 0.12
string_hash31 2.900980 1.00 391001 0.20
string_hash17 2.938708 1.00 391122 0.20
full_name_hash 3.080886 1.00 390860 0.10
jhash_string 3.092161 1.00 392775 1.08
fnv64 5.340740 1.00 392854 0.88
kr_hash 2.395757 7.30 4379091 1568.25
On a slow machine (CULV):
100000000 Iterations
256 Slots (order 8)
Algorithm Time Ratio Max StdDev
string10 10.807174 1.00 390628 0.00
SuperFastHash 11.397303 1.00 397027 4.43
xor 11.660968 1.00 392120 2.14
djb2 11.674707 1.00 390827 0.12
jhash_string 11.997104 1.00 392775 1.08
fnv32 12.289086 1.00 392139 0.98
string_hash17 12.863864 1.00 391122 0.20
full_name_hash 13.249483 1.00 390860 0.10
string_hash31 13.668270 1.00 391001 0.20
fnv64 39.808964 1.00 392854 0.88
kr_hash 10.316305 7.30 4379091 1568.25
So Eric's string10 is fastest for the special case of fooNNN-style names,
but it probably isn't best for general strings. The original function
is >20% slower, which is surprising; probably because of the overhead
of 2 shifts and a multiply. jenkins and fnv are both 10% slower.
The following seems to give best results (combination of 16bit trick
and string17).
static unsigned int xor17(const unsigned char *key, unsigned int len)
{
	uint32_t h = 0;
	unsigned int rem;

	rem = len & 1;
	len >>= 1;
	while (len--) {
		/* get_unaligned16() loads a possibly-unaligned 16-bit value */
		h = ((h << 4) + h) ^ get_unaligned16(key);
		key += sizeof(uint16_t);
	}
	if (rem)
		h = ((h << 4) + h) ^ *key;
	return h;
}
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH] net: fold network name hash
2009-10-27 5:24 ` David Miller
@ 2009-10-27 17:22 ` Stephen Hemminger
2009-10-27 18:02 ` Octavian Purdila
2009-10-27 22:04 ` [PATCH] net: fold network name hash (v2) Stephen Hemminger
0 siblings, 2 replies; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-27 17:22 UTC (permalink / raw)
To: David Miller
Cc: netdev, linux-kernel, eric.dumazet, akpm, torvalds, opurdila,
viro
The full_name_hash() does not produce a value that is evenly distributed
over the lower 8 bits. This causes the name hash to be unbalanced with a
large number of names. A simple fix is to just fold in the higher bits
with XOR.
This is independent of possible improvements to full_name_hash()
in future.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/net/core/dev.c 2009-10-27 09:21:46.127252547 -0700
+++ b/net/core/dev.c 2009-10-27 09:25:14.593313378 -0700
@@ -199,7 +199,11 @@ EXPORT_SYMBOL(dev_base_lock);
 static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
 {
 	unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
-	return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
+
+	hash ^= (hash >> NETDEV_HASHBITS);
+	hash &= NETDEV_HASHENTRIES - 1;
+
+	return &net->dev_name_head[hash];
 }
 
 static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 17:07 ` Stephen Hemminger
@ 2009-10-27 17:32 ` Linus Torvalds
2009-10-27 23:08 ` Stephen Hemminger
[not found] ` <4AE72B91.7040700@gmail.com>
1 sibling, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2009-10-27 17:32 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Eric Dumazet, Stephen Hemminger, Andrew Morton, Octavian Purdila,
netdev, linux-kernel, Al Viro
On Tue, 27 Oct 2009, Stephen Hemminger wrote:
>
> Rather than wasting space or doing an expensive modulus, just folding
> the higher bits back with XOR redistributes the bits better.
Please don't make up any new hash functions without having a better input
set than the one you seem to use.
The 'fnv' function I can believe in, because the whole "multiply by big
prime number" thing to spread out the bits is a very traditional model.
But making up a new hash function based on essentially consecutive names
is absolutely the wrong thing to do. You need a much better corpus of path
component names for testing.
> The following seems to give best results (combination of 16bit trick
> and string17).
.. and these kinds of games are likely to work badly on some
architectures. Don't use 16-bit values, and don't use 'get_unaligned()'.
Both tend to work fine on x86, but likely suck on some other
architectures.
Also remember that the critical hash function needs to check for '/' and
'\0' while at it, which is one reason why it does things byte-at-a-time.
If you try to be smart, you'd need to be smart about the end condition
too.
The loop to optimize is _not_ based on 'name+len', it is this code:
	this.name = name;
	c = *(const unsigned char *)name;

	hash = init_name_hash();
	do {
		name++;
		hash = partial_name_hash(c, hash);
		c = *(const unsigned char *)name;
	} while (c && (c != '/'));
	this.len = name - (const char *) this.name;
	this.hash = end_name_hash(hash);
(which depends on us having already removed all slashes at the head, and
knowing that the string is not zero-sized)
So doing things multiple bytes at a time is certainly still possible, but
you would always have to find the slashes/NUL's in there first. Doing that
efficiently and portably is not trivial - especially since a lot of
critical path components are short.
(Remember: there may be just a few 'bin' directory names, but if you do
performance analysis, 'bin' as a path component is probably hashed a lot
more than 'five_slutty_bimbos_and_a_donkey.jpg'. So the relative weighting
of importance of the filename should probably include the frequency it
shows up in pathname lookup)
Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
[not found] ` <4AE72B91.7040700@gmail.com>
@ 2009-10-27 17:35 ` Stephen Hemminger
0 siblings, 0 replies; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-27 17:35 UTC (permalink / raw)
To: Eric Dumazet, David Miller; +Cc: netdev, linux-kernel
On Tue, 27 Oct 2009 18:19:13 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Stephen Hemminger wrote:
>
> > So Eric's string10 is fastest for the special case of fooNNN-style names,
> > but it probably isn't best for general strings. The original function
> > is >20% slower, which is surprising; probably because of the overhead
> > of 2 shifts and a multiply. jenkins and fnv are both 10% slower.
> >
>
>
> jhash() is faster when strings are longer, being able to process 12 bytes per loop.
>
But jhash is not amenable to use in namei (with partial_name_hash).
name_hash is rarely done on long strings; the average length of a filename
is fairly short (probably leftover Unix legacy). On my system, the average
path component length in /usr is 13 characters, so jhash has
no big benefit here.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] net: fold network name hash
2009-10-27 17:22 ` [PATCH] net: fold network name hash Stephen Hemminger
@ 2009-10-27 18:02 ` Octavian Purdila
2009-10-27 22:04 ` [PATCH] net: fold network name hash (v2) Stephen Hemminger
1 sibling, 0 replies; 19+ messages in thread
From: Octavian Purdila @ 2009-10-27 18:02 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, netdev, linux-kernel, eric.dumazet, akpm, torvalds,
viro
On Tuesday 27 October 2009 19:22:51 you wrote:
> The full_name_hash() does not produce a value that is evenly distributed
> over the lower 8 bits. This causes the name hash to be unbalanced with a
> large number of names. A simple fix is to just fold in the higher bits
> with XOR.
>
> This is independent of possible improvements to full_name_hash()
> in future.
>
I can confirm that the distribution looks good now for our most common cases.
Thanks,
tavi
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH] net: fold network name hash (v2)
2009-10-27 17:22 ` [PATCH] net: fold network name hash Stephen Hemminger
2009-10-27 18:02 ` Octavian Purdila
@ 2009-10-27 22:04 ` Stephen Hemminger
2009-10-28 6:07 ` Eric Dumazet
1 sibling, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-27 22:04 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, netdev, linux-kernel, eric.dumazet, akpm, torvalds,
opurdila, viro
The full_name_hash() does not produce a value that is evenly distributed
over the lower 8 bits. This causes the name hash to be unbalanced with a
large number of names. There is a standard function to fold in the upper
bits, so use that.
This is independent of possible improvements to full_name_hash()
in future.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/net/core/dev.c 2009-10-27 14:54:21.922563076 -0700
+++ b/net/core/dev.c 2009-10-27 15:04:16.733813459 -0700
@@ -86,6 +86,7 @@
 #include <linux/socket.h>
 #include <linux/sockios.h>
 #include <linux/errno.h>
+#include <linux/hash.h>
 #include <linux/interrupt.h>
 #include <linux/if_ether.h>
 #include <linux/netdevice.h>
@@ -199,7 +200,7 @@ EXPORT_SYMBOL(dev_base_lock);
 static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
 {
 	unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
-	return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
+	return &net->dev_name_head[hash_long(hash, NETDEV_HASHBITS)];
 }
 
 static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 17:32 ` Linus Torvalds
@ 2009-10-27 23:08 ` Stephen Hemminger
2009-10-27 23:41 ` Linus Torvalds
0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-27 23:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Stephen Hemminger, Andrew Morton, Octavian Purdila,
netdev, linux-kernel, Al Viro
On Tue, 27 Oct 2009 10:32:44 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 27 Oct 2009, Stephen Hemminger wrote:
> >
> > Rather than wasting space or doing an expensive modulus, just folding
> > the higher bits back with XOR redistributes the bits better.
>
> Please don't make up any new hash functions without having a better input
> set than the one you seem to use.
>
> The 'fnv' function I can believe in, because the whole "multiply by big
> prime number" thing to spread out the bits is a very traditional model.
> But making up a new hash function based on essentially consecutive names
> is absolutely the wrong thing to do. You need a much better corpus of path
> component names for testing.
>
> > The following seems to give best results (combination of 16bit trick
> > and string17).
>
> .. and these kinds of games are likely to work badly on some
> architectures. Don't use 16-bit values, and don't use 'get_unaligned()'.
> Both tend to work fine on x86, but likely suck on some other
> architectures.
>
> Also remember that the critical hash function needs to check for '/' and
> '\0' while at it, which is one reason why it does things byte-at-a-time.
> If you try to be smart, you'd need to be smart about the end condition
> too.
>
> The loop to optimize is _not_ based on 'name+len', it is this code:
>
> this.name = name;
> c = *(const unsigned char *)name;
>
> hash = init_name_hash();
> do {
> name++;
> hash = partial_name_hash(c, hash);
> c = *(const unsigned char *)name;
> } while (c && (c != '/'));
> this.len = name - (const char *) this.name;
> this.hash = end_name_hash(hash);
>
> (which depends on us having already removed all slashed at the head, and
> knowing that the string is not zero-sized)
>
> So doing things multiple bytes at a time is certainly still possible, but
> you would always have to find the slashes/NUL's in there first. Doing that
> efficiently and portably is not trivial - especially since a lot of
> critical path components are short.
>
> (Remember: there may be just a few 'bin' directory names, but if you do
> performance analysis, 'bin' as a path component is probably hashed a lot
> more than 'five_slutty_bimbos_and_a_donkey.jpg'. So the relative weighting
> of importance of the filename should probably include the frequency it
> shows up in pathname lookup)
>
> Linus
Going back to basics: run tests across different input sets,
dropping the slow ones like crc, md5, ...
Not using jhash because it doesn't have a good character-at-a-time
interface.
Also, the folding algorithm used matters. Since the kernel
already uses hash_long() to fold back to N bits, all the
tests were rerun with that.
Test run across names in /usr
Algorithm Time Ratio Max StdDev
kr_hash 2.481275 1.21 4363 358.98
string10 2.834562 1.15 4168 303.66
fnv32 2.887600 1.18 4317 332.38
string_hash31 3.655745 1.16 4258 314.33
string_hash17 3.816443 1.16 4177 311.28
djb2 3.883914 1.18 4269 331.75
full_name_hash 4.067633 1.16 4282 312.29
pjw 6.517316 1.17 4184 316.69
sdbm 6.945385 1.17 4447 324.32
elf 7.402180 1.17 4184 316.69
And in /home (mail directories and git)
Algorithm Time Ratio Max StdDev
kr_hash 2.765015 1.44 7175 701.99
string10 3.136947 1.19 7092 469.73
fnv32 3.162626 1.19 6986 458.48
string_hash31 3.832031 1.19 7053 463.29
string_hash17 4.136220 1.19 7023 469.30
djb2 4.241706 1.23 7537 512.02
full_name_hash 4.437741 1.19 7000 467.19
pjw 6.758093 1.20 6970 476.03
sdbm 7.239758 1.22 7526 494.32
elf 7.446356 1.20 6970 476.03
And with names like pppXXX
Algorithm Time Ratio Max StdDev
kr_hash 0.849656 9.26 5520 1121.79
fnv32 1.004682 1.01 453 23.29
string10 1.004729 1.00 395 2.08
string_hash31 1.108335 1.00 409 5.14
string_hash17 1.231257 1.00 410 8.10
djb2 1.238314 1.01 435 29.88
full_name_hash 1.320822 1.00 422 11.07
elf 1.994794 1.15 716 151.19
pjw 2.063958 1.15 716 151.19
sdbm 2.070033 1.00 408 8.11
* The new test has a big table, so more cache effects.
* The existing full_name_hash distributes okay if folded correctly.
* fnv32 and string10 are slightly faster.
More data (on /usr) from older slower machines:
IA-64
Algorithm Time Ratio Max StdDev
kr_hash 1.676064 1.17 664 63.81
string_hash17 1.773553 1.12 616 54.40
djb2 2.103359 1.12 598 54.71
string10 2.103959 1.13 698 56.80
string_hash31 2.108254 1.13 602 55.51
full_name_hash 3.237209 1.13 614 56.74
sdbm 3.279243 1.12 611 54.78
pjw 3.314135 1.13 639 56.74
elf 3.821029 1.13 639 56.74
fnv32 5.619829 1.16 865 62.51
Slow ULV 1Ghz laptop:
Algorithm Time Ratio Max StdDev
kr_hash 5.754460 1.19 2017 194.64
string10 6.698358 1.15 1638 171.29
sdbm 8.134431 1.15 1652 170.65
djb2 8.231058 1.17 1659 184.44
string_hash31 8.447873 1.15 1633 172.13
fnv32 8.552569 1.18 2170 189.61
string_hash17 9.226992 1.16 1616 175.01
full_name_hash 10.555072 1.15 1703 170.45
pjw 16.193485 1.17 1642 181.45
elf 19.770414 1.17 1642 181.45
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 23:08 ` Stephen Hemminger
@ 2009-10-27 23:41 ` Linus Torvalds
2009-10-28 0:10 ` Stephen Hemminger
0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2009-10-27 23:41 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Eric Dumazet, Stephen Hemminger, Andrew Morton, Octavian Purdila,
netdev, linux-kernel, Al Viro
On Tue, 27 Oct 2009, Stephen Hemminger wrote:
>
> Going back to basics. Run tests across different input sets.
> Dropping off the slow ones like crc, md5, ...
> Not using jhash because it doesn't have good character at a time
> interface.
>
> Also, the folding algorithm used matters. Since the kernel
> already uses hash_long() to fold back to N bits, all the
> tests were rerun with that.
Yeah, the 'hash_long()' folding matters for anything that doesn't multiply
by some big number to spread the bits out, because otherwise the bits
from the last character hashed will always be in the low bits.
That explains why our current hash looked so bad with your previous code.
From your numbers, I think we can dismiss 'kr_hash()' as having horrible
behavior with names like pppXXX (and that isn't just a special case: it's
also noticeably worse for your /home directory case, which means that the
bad behavior shows up in practice too, not just in some special cases).
'elf' and 'pjw' don't have quite the same bad case, but the stddev for the
pppXXX cases are still clearly worse than the other alternatives. They
also seem to always be slower than what we already have.
The 'fnv32' algorithm gets fairly good behavior, but seems bad on Itanium.
Looks like it depends on a fast multiplication unit, and even your
"slow" ULV chip seems to be a Core2 one, so all your x86 targets have
that. And our current name hash still actually seems to do better in all
cases (maybe I missed some case) even if fnv32 is slightly faster on x86.
From your list, 'string10' seems to get consistently good results and is at
or near the top of performance too. It seems to be the one that
consistently beats 'full_name_hash()' both in performance and in behavior
(string_hash17/31 come close, but aren't as clearly better performing).
But I haven't actually seen the hashes. Maybe there's something that makes
string10 bad?
Regardless, one thing your new numbers do say is that our current hash
really isn't that bad.
Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-27 23:41 ` Linus Torvalds
@ 2009-10-28 0:10 ` Stephen Hemminger
2009-10-28 0:58 ` Linus Torvalds
0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-28 0:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Stephen Hemminger, Andrew Morton, Octavian Purdila,
netdev, linux-kernel, Al Viro
[-- Attachment #1: Type: text/plain, Size: 2345 bytes --]
On Tue, 27 Oct 2009 16:41:52 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 27 Oct 2009, Stephen Hemminger wrote:
> >
> > Going back to basics. Run tests across different input sets.
> > Dropping off the slow ones like crc, md5, ...
> > Not using jhash because it doesn't have good character at a time
> > interface.
> >
> > Also, the folding algorithm used matters. Since the kernel
> > already uses hash_long() to fold back to N bits, all the
> > tests were rerun with that.
>
> Yeah, the 'hash_long()' folding matters for anything that doesn't multiply
> by some big number to spread the bits out, because otherwise the bits
> from the last character hashed will always be in the low bits.
>
> That explains why our current hash looked so bad with your previous code.
>
> From your numbers, I think we can dismiss 'kr_hash()' as having horrible
> behavior with names like pppXXX (and that isn't just a special case: it's
> also noticeably worse for your /home directory case, which means that the
> bad behavior shows up in practice too, not just in some special cases).
>
> 'elf' and 'pjw' don't have quite the same bad case, but the stddev for the
> pppXXX cases are still clearly worse than the other alternatives. They
> also seem to always be slower than what we already have.
>
> The 'fnv32' algorithm gets fairly good behavior, but seems bad on Itanium.
> Looks like it depends on a fast multiplication unit, and even your
> "slow" ULV chip seems to be a Core2 one, so all your x86 targets have
> that. And our current name hash still actually seems to do better in all
> cases (maybe I missed some case) even if fnv32 is slightly faster on x86.
>
> From your list 'string10' seems to get consistently good results and is at
> or near the top of performance too. It seems to be the one that
> consistently beats 'full_name_hash()' both in performance and in behavior
> (string_hash17/31 come close, but aren't as clearly better performing).
>
> But I haven't actually seen the hashes. Maybe there's something that makes
> string10 bad?
>
> Regardless, one thing your new numbers do say is that our current hash
> really isn't that bad.
>
> Linus
Agreed. Here is the reduced version of the program.
To run:
find /home -printf '%f\n' 2>/dev/null | ./htest -n 100
[-- Attachment #2: htest.c --]
[-- Type: text/x-c++src, Size: 5219 bytes --]
#include <stdio.h>
#include <stdint.h>
#include <malloc.h>
#include <sys/types.h>
#include <sys/time.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <getopt.h>

#define init_name_hash() 0

/* partial hash update function. Assume roughly 4 bits per character */
static inline unsigned long
partial_name_hash(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/* Compute the hash for a name string. */
static unsigned int
full_name_hash(const unsigned char *name, unsigned int len)
{
	unsigned long hash = 0;

	while (len--)
		hash = partial_name_hash(*name++, hash);
	return hash;
}

static unsigned int
djb2(const unsigned char *str, unsigned int len)
{
	unsigned long hash = 5381;

	while (len--)
		hash = ((hash << 5) + hash) + *str++; /* hash * 33 + c */
	return hash;
}

static unsigned int
string_hash31(const unsigned char *name, unsigned int len)
{
	unsigned hash = 0;
	int i;

	for (i = 0; i < len; ++i)
		hash = 31 * hash + name[i];
	return hash;
}

static unsigned int
string_hash17(const unsigned char *name, unsigned int len)
{
	unsigned hash = 0;
	int i;

	for (i = 0; i < len; ++i)
		hash = 17 * hash + name[i];
	return hash;
}

static unsigned int
string10(const unsigned char *name, unsigned int len)
{
	unsigned hash = 0;
	int i;

	for (i = 0; i < len; ++i)
		hash = hash * 10 + name[i] - '0';
	return hash;
}

static const uint32_t FNV_32_PRIME = 16777619u;
static const uint32_t FNV1_32_INIT = 2166136261u;

static unsigned int fnv32(const unsigned char *key, unsigned int len)
{
	uint32_t hval = FNV1_32_INIT;
	unsigned int i;

	for (i = 0; i < len; i++) {
		hval ^= key[i];
#if defined(NO_FNV_GCC_OPTIMIZATION)
		hval *= FNV_32_PRIME;
#else
		hval += (hval<<1) + (hval<<4) + (hval<<7)
			+ (hval<<8) + (hval<<24);
#endif
	}
	return hval;
}

static unsigned order = 8;
static unsigned iterations = 10000;
static char **names;
static unsigned num_names;

static void read_names(void)
{
	char line[1024];
	unsigned int n = 0;

	/* Read input to create name set */
	while (fgets(line, sizeof line, stdin) != NULL) {
		line[strcspn(line, "\n")] = '\0'; /* strip newline so it isn't hashed */
		names = realloc(names, (n + 1) * sizeof(char *));
		names[n] = malloc(strlen(line) + 1);
		strcpy(names[n], line);
		++n;
	}
	num_names = n;
}

/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
#define GOLDEN_RATIO_PRIME_32 0x9e370001UL
/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL

/* Unrolled version of hash_long() */
static inline unsigned int fold(unsigned long val, unsigned int bits)
{
	if (sizeof(val) == 8) {
		uint64_t hash = val;
		/* Sigh, gcc can't optimise this alone like it does for 32 bits. */
		uint64_t n = hash;
		n <<= 18;
		hash -= n;
		n <<= 33;
		hash -= n;
		n <<= 3;
		hash += n;
		n <<= 3;
		hash -= n;
		n <<= 4;
		hash += n;
		n <<= 2;
		hash += n;
		/* High bits are more random, so use them. */
		return hash >> (64 - bits);
	} else {
		/* On some cpus multiply is faster, on others gcc will do shifts */
		uint32_t hash = val * GOLDEN_RATIO_PRIME_32;
		/* High bits are more random, so use them. */
		return hash >> (32 - bits);
	}
}

static void measure(const char *name,
		    unsigned int (*hash)(const unsigned char *, unsigned int))
{
	unsigned int i;
	struct timeval t0, t1;
	unsigned int hashsz = 1 << order;
	unsigned int dist[hashsz];
	unsigned long long probes = 0;
	double t;
	unsigned long m = 0;
	const double avg = (double) num_names / hashsz; /* avoid integer truncation */
	double ideal = hashsz * (avg * (1 + avg)) / 2;
	double std = 0;

	memset(dist, 0, sizeof(unsigned int) * hashsz);

	gettimeofday(&t0, NULL);
	for (i = 0; i < num_names; i++) {
		unsigned char *name = (unsigned char *) names[i];
		size_t len = strlen(names[i]);
		unsigned int h = 0, j;

		for (j = 0; j < iterations; j++)
			h = hash(name, len);

		/* fold in extra bits */
		++dist[fold(h, order)];
	}
	gettimeofday(&t1, NULL);

	t = (double) (t1.tv_sec - t0.tv_sec);
	t += (double) (t1.tv_usec - t0.tv_usec) / 1000000.;

	for (i = 0; i < hashsz; i++) {
		unsigned int n = dist[i];
		double delta = (n - avg);

		if (n > m) m = n;	/* longest chain */
		std += delta * delta;	/* compute standard deviation */
		/* number of probes used when accessing
		   is same as sum of arithmetic series */
		probes += ((uint64_t) n * (1 + n)) / 2;
	}

	printf("%-20s %f", name, t);
	printf(" %8.2f %9lu %6.2f\n",
	       (double) probes / ideal, m, sqrt(std / hashsz));
}

#define MEASURE(func) measure(#func, &func)

int main(int argc, char **argv)
{
	int f;

	while ((f = getopt(argc, argv, "h:n:")) != -1)
		switch (f) {
		case 'n':
			iterations = strtoul(optarg, NULL, 0);
			break;
		case 'h':
			order = strtoul(optarg, NULL, 0);
			break;
		default:
			fprintf(stderr,
				"Usage: %s -h order -n iterations\n",
				argv[0]);
			return 1;
		}

	read_names();

	printf("Algorithm            Time       Ratio       Max StdDev\n");
	MEASURE(full_name_hash);
	MEASURE(djb2);
	MEASURE(string10);
	MEASURE(string_hash17);
	MEASURE(string_hash31);
	MEASURE(fnv32);
	return 0;
}
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] dcache: better name hash function
2009-10-28 0:10 ` Stephen Hemminger
@ 2009-10-28 0:58 ` Linus Torvalds
2009-10-28 1:56 ` Stephen Hemminger
0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2009-10-28 0:58 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Eric Dumazet, Stephen Hemminger, Andrew Morton, Octavian Purdila,
netdev, linux-kernel, Al Viro
On Tue, 27 Oct 2009, Stephen Hemminger wrote:
>
> Agreed. Here is the reduced version of the program.
> To run:
> find /home -printf '%f\n' 2>/dev/null | ./htest -n 100
The timings are very sensitive to random I$ layout at least on Nehalem.
The reason seems to be that the inner loop is _so_ tight that just
depending on exactly where the loop ends up, you can get subtle
interactions with the loop cache.
Look here:
[torvalds@nehalem ~]$ find /home -printf '%f\n' 2>/dev/null | ./htest -n 100
Algorithm Time Ratio Max StdDev
full_name_hash 1.141899 1.03 4868 263.37
djb2 0.980200 1.03 4835 266.05
string10 0.909175 1.03 4850 262.67
string10a 0.673915 1.03 4850 262.67
string10b 0.909374 1.03 4850 262.67
string_hash17 0.966050 1.03 4805 263.68
string_hash31 1.008544 1.03 4807 259.37
fnv32 0.774806 1.03 4817 259.17
what do you think the difference between 'string10', 'string10a' and
'string10b' is?
None. None what-so-ever. The source code is identical, and gcc generates
identical assembly language. Yet those timings are extremely stable for
me, and 'string10b' is 25% faster than the identical string10 and
string10a functions.
The only difference? 'string10a' starts aligned to just 16 bytes, but that
in turn happens to mean that the tight inner loop ends up aligned on a
128-byte boundary. And being cacheline aligned just there seems to matter
for some subtle micro-architectural reason.
The reason I noticed this is that I wondered what small modifications to
'string10' would do for performance, and noticed that even _without_ the
small modifications, performance fluctuated.
Lesson? Microbenchmarks like this can be dangerous and misleading. That's
_especially_ true if the loop ends up being just tight enough that it can
fit in some trace cache or similar. In real life, the name hash is
performance-critical, but at the same time almost certainly won't be run
in a tight enough loop that you'd ever notice things like that.
Linus
* Re: [PATCH] dcache: better name hash function
2009-10-28 0:58 ` Linus Torvalds
@ 2009-10-28 1:56 ` Stephen Hemminger
0 siblings, 0 replies; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-28 1:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Stephen Hemminger, Andrew Morton, Octavian Purdila,
netdev, linux-kernel, Al Viro
On Tue, 27 Oct 2009 17:58:53 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 27 Oct 2009, Stephen Hemminger wrote:
> >
> > Agreed. Here is the reduced version of the program.
> > To run:
> > find /home -printf '%f\n' 2>/dev/null | ./htest -n 100
>
> The timings are very sensitive to random I$ layout at least on Nehalem.
> The reason seems to be that the inner loop is _so_ tight that just
> depending on exactly where the loop ends up, you can get subtle
> interactions with the loop cache.
>
> Look here:
>
> [torvalds@nehalem ~]$ find /home -printf '%f\n' 2>/dev/null | ./htest -n 100
> Algorithm Time Ratio Max StdDev
> full_name_hash 1.141899 1.03 4868 263.37
> djb2 0.980200 1.03 4835 266.05
> string10 0.909175 1.03 4850 262.67
> string10a 0.673915 1.03 4850 262.67
> string10b 0.909374 1.03 4850 262.67
> string_hash17 0.966050 1.03 4805 263.68
> string_hash31 1.008544 1.03 4807 259.37
> fnv32 0.774806 1.03 4817 259.17
>
> what do you think the difference between 'string10', 'string10a' and
> 'string10b' is?
>
> None. None what-so-ever. The source code is identical, and gcc generates
> identical assembly language. Yet those timings are extremely stable for
> me, and 'string10b' is 25% faster than the identical string10 and
> string10a functions.
>
> The only difference? 'string10a' starts aligned to just 16 bytes, but that
> in turn happens to mean that the tight inner loop ends up aligned on a
> 128-byte boundary. And being cacheline aligned just there seems to matter
> for some subtle micro-architectural reason.
>
> The reason I noticed this is that I wondered what small modifications to
> 'string10' would do for performance, and noticed that even _without_ the
> small modifications, performance fluctuated.
>
> Lesson? Microbenchmarks like this can be dangerous and misleading. That's
> _especially_ true if the loop ends up being just tight enough that it can
> fit in some trace cache or similar. In real life, the name hash is
> performance-critical, but at the same time almost certainly won't be run
> in a tight enough loop that you'd ever notice things like that.
>
> Linus
Thanks. I wasn't putting a huge amount of stock in the micro benchmark;
I was more interested in how the distribution worked out (which is CPU
independent) than in the time. As long as all users of name hashing
fold properly, there isn't a lot of reason to change.
--
* Re: [PATCH] net: fold network name hash (v2)
2009-10-27 22:04 ` [PATCH] net: fold network name hash (v2) Stephen Hemminger
@ 2009-10-28 6:07 ` Eric Dumazet
2009-10-28 9:28 ` David Miller
2009-10-28 15:57 ` Stephen Hemminger
0 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2009-10-28 6:07 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, netdev, linux-kernel, akpm, torvalds, opurdila,
viro
Stephen Hemminger wrote:
> The full_name_hash does not produce a value that is evenly distributed
> over the lower 8 bits. This causes name hash to be unbalanced with large
> number of names. There is a standard function to fold in upper bits
> so use that.
>
> This is independent of possible improvements to full_name_hash()
> in future.
> static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
> {
> unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
> - return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
> + return &net->dev_name_head[hash_long(hash, NETDEV_HASHBITS)];
> }
>
> static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
full_name_hash() returns an "unsigned int", which is guaranteed to be 32 bits.
You should therefore use hash_32(hash, NETDEV_HASHBITS), not hash_long(),
which maps to hash_64() on 64-bit arches; that is slower and certainly no
better with a 32-bit input.
/* Compute the hash for a name string. */
static inline unsigned int
full_name_hash(const unsigned char *name, unsigned int len)
{
	unsigned long hash = init_name_hash();

	while (len--)
		hash = partial_name_hash(*name++, hash);
	return end_name_hash(hash);
}

static inline u32 hash_32(u32 val, unsigned int bits)
{
	/* On some cpus multiply is faster, on others gcc will do shifts */
	u32 hash = val * GOLDEN_RATIO_PRIME_32;

	/* High bits are more random, so use them. */
	return hash >> (32 - bits);
}

static inline u64 hash_64(u64 val, unsigned int bits)
{
	u64 hash = val;

	/* Sigh, gcc can't optimise this alone like it does for 32 bits. */
	u64 n = hash;
	n <<= 18;
	hash -= n;
	n <<= 33;
	hash -= n;
	n <<= 3;
	hash += n;
	n <<= 3;
	hash -= n;
	n <<= 4;
	hash += n;
	n <<= 2;
	hash += n;

	/* High bits are more random, so use them. */
	return hash >> (64 - bits);
}
* Re: [PATCH] net: fold network name hash (v2)
2009-10-28 6:07 ` Eric Dumazet
@ 2009-10-28 9:28 ` David Miller
2009-10-28 15:57 ` Stephen Hemminger
1 sibling, 0 replies; 19+ messages in thread
From: David Miller @ 2009-10-28 9:28 UTC (permalink / raw)
To: eric.dumazet
Cc: shemminger, netdev, linux-kernel, akpm, torvalds, opurdila, viro
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 28 Oct 2009 07:07:10 +0100
> You should therefore use hash_32(hash, NETDEV_HASHBITS),
> not hash_long() that maps to hash_64() on 64 bit arches, which is
> slower and certainly not any better with a 32bits input.
Agreed.
* Re: [PATCH] net: fold network name hash (v2)
2009-10-28 6:07 ` Eric Dumazet
2009-10-28 9:28 ` David Miller
@ 2009-10-28 15:57 ` Stephen Hemminger
1 sibling, 0 replies; 19+ messages in thread
From: Stephen Hemminger @ 2009-10-28 15:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, netdev, linux-kernel, akpm, torvalds, opurdila,
viro
On Wed, 28 Oct 2009 07:07:10 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Stephen Hemminger wrote:
> > The full_name_hash does not produce a value that is evenly distributed
> > over the lower 8 bits. This causes name hash to be unbalanced with large
> > number of names. There is a standard function to fold in upper bits
> > so use that.
> >
> > This is independent of possible improvements to full_name_hash()
> > in future.
>
> > static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
> > {
> > unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
> > - return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
> > + return &net->dev_name_head[hash_long(hash, NETDEV_HASHBITS)];
> > }
> >
> > static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
>
> full_name_hash() returns an "unsigned int", which is guaranteed to be 32 bits.
>
> You should therefore use hash_32(hash, NETDEV_HASHBITS), not hash_long(),
> which maps to hash_64() on 64-bit arches; that is slower and certainly no
> better with a 32-bit input.
OK, I was following precedent. Only a couple of places use hash_32();
most use hash_long().
Using the upper bits does give better distribution.
With 100,000 network names:
Time Ratio Max StdDev
hash_32 0.002123 1.00 422 11.07
hash_64 0.002927 1.00 400 3.97
The time field is pretty meaningless for such a small sample.
end of thread, other threads:[~2009-10-28 15:57 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <9986527.24561256620662709.JavaMail.root@tahiti.vyatta.com>
2009-10-27 5:19 ` [PATCH] dcache: better name hash function Stephen Hemminger
2009-10-27 5:24 ` David Miller
2009-10-27 17:22 ` [PATCH] net: fold network name hash Stephen Hemminger
2009-10-27 18:02 ` Octavian Purdila
2009-10-27 22:04 ` [PATCH] net: fold network name hash (v2) Stephen Hemminger
2009-10-28 6:07 ` Eric Dumazet
2009-10-28 9:28 ` David Miller
2009-10-28 15:57 ` Stephen Hemminger
2009-10-27 6:07 ` [PATCH] dcache: better name hash function Eric Dumazet
2009-10-27 6:50 ` Eric Dumazet
2009-10-27 7:29 ` Eric Dumazet
2009-10-27 17:07 ` Stephen Hemminger
2009-10-27 17:32 ` Linus Torvalds
2009-10-27 23:08 ` Stephen Hemminger
2009-10-27 23:41 ` Linus Torvalds
2009-10-28 0:10 ` Stephen Hemminger
2009-10-28 0:58 ` Linus Torvalds
2009-10-28 1:56 ` Stephen Hemminger
[not found] ` <4AE72B91.7040700@gmail.com>
2009-10-27 17:35 ` Stephen Hemminger
This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).