Common hash table implementation

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Common hash table implementation
@ 2001-07-18  0:57 Brian J. Watson
  2001-07-18  1:34 ` Larry McVoy
  2001-07-18  9:48 ` Richard Guenther
  0 siblings, 2 replies; 16+ messages in thread
From: Brian J. Watson @ 2001-07-18  0:57 UTC (permalink / raw)
  To: Linux Kernel, schoebel

A couple of days ago, I was thinking about a common hash table
implementation, ala include/linux/list.h. Then I came across
include/linux/ghash.h, and thought that someone's already done it.
After that I noticed the copyright line said 1997, and a quick check
in cscope showed that nobody's including it.

Does anyone know if this file is worth studying and working with? I
have to wonder if nobody's using it after four years.

Does anyone see a problem with a common hash table implementation?
I've implemented a few hash tables from scratch for our clustering
work, and it's starting to get a little old. Something easy to use
like list.h would be a lot nicer.

-- 
Brian Watson             | "The common people of England... so 
Linux Kernel Developer   |  jealous of their liberty, but like the 
Open SSI Clustering Lab  |  common people of most other countries 
Compaq Computer Corp     |  never rightly considering wherein it 
Los Angeles, CA          |  consists..."
                         |      -Adam Smith, Wealth of Nations, 1776

mailto:Brian.J.Watson@compaq.com
http://opensource.compaq.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-18  0:57 Common hash table implementation Brian J. Watson
@ 2001-07-18  1:34 ` Larry McVoy
  2001-07-18 13:46   ` Daniel Phillips
  2001-07-22  2:23   ` Rusty Russell
  2001-07-18  9:48 ` Richard Guenther
  1 sibling, 2 replies; 16+ messages in thread
From: Larry McVoy @ 2001-07-18  1:34 UTC (permalink / raw)
  To: Brian J. Watson; +Cc: Linux Kernel, schoebel

We've got a fairly nice hash table interface in BitKeeper that we'd be 
happy to provide under the GPL.  I've always thought it would be cool
to have it in the kernel, we use it everywhere.

http://bitmover.com:8888//home/bk/bugfixes/src/src/mdbm

will let you browse it.  The general interface is gdbm() like and there
are both file backed and memory backed versions.  It was designed to be
useful in small and large configs, you can get a hash into 128 bytes if
I recall correctly.

On Tue, Jul 17, 2001 at 05:57:25PM -0700, Brian J. Watson wrote:
> A couple of days ago, I was thinking about a common hash table
> implementation, ala include/linux/list.h. Then I came across
> include/linux/ghash.h, and thought that someone's already done it.
> After that I noticed the copyright line said 1997, and a quick check
> in cscope showed that nobody's including it.
> 
> Does anyone know if this file is worth studying and working with? I
> have to wonder if nobody's using it after four years.
> 
> Does anyone see a problem with a common hash table implementation?
> I've implemented a few hash tables from scratch for our clustering
> work, and it's starting to get a little old. Something easy to use
> like list.h would be a lot nicer.
> 
> -- 
> Brian Watson             | "The common people of England... so 
> Linux Kernel Developer   |  jealous of their liberty, but like the 
> Open SSI Clustering Lab  |  common people of most other countries 
> Compaq Computer Corp     |  never rightly considering wherein it 
> Los Angeles, CA          |  consists..."
>                          |      -Adam Smith, Wealth of Nations, 1776
> 
> mailto:Brian.J.Watson@compaq.com
> http://opensource.compaq.com/
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-18  1:34 ` Larry McVoy
@ 2001-07-18 13:46   ` Daniel Phillips
  2001-07-21  0:24     ` Brian J. Watson
  2001-07-22  2:23   ` Rusty Russell
  1 sibling, 1 reply; 16+ messages in thread
From: Daniel Phillips @ 2001-07-18 13:46 UTC (permalink / raw)
  To: Larry McVoy, Brian J. Watson; +Cc: Linux Kernel, schoebel

On Wednesday 18 July 2001 03:34, Larry McVoy wrote:
> We've got a fairly nice hash table interface in BitKeeper that we'd
> be happy to provide under the GPL.  I've always thought it would be
> cool to have it in the kernel, we use it everywhere.
>
> http://bitmover.com:8888//home/bk/bugfixes/src/src/mdbm

Oh goodie, lots of new hash functions to test :-)  I'll pass the
interesting ones on to the guys with the serious hash-testing equipment.

I think the original poster was thinking more along the lines of a
generic insertion, deletion and lookup interface, which we are now
doing in an almost-generic way in a few places.  Once place that is
distinctly un-generic is the buffer hash, for no good reason that I
can see.  This would be a good starting point for a demonstration.

--
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-18 13:46   ` Daniel Phillips
@ 2001-07-21  0:24     ` Brian J. Watson
  2001-07-21 20:25       ` Daniel Phillips
  0 siblings, 1 reply; 16+ messages in thread
From: Brian J. Watson @ 2001-07-21  0:24 UTC (permalink / raw)
  To: Daniel Phillips, Larry McVoy; +Cc: Linux Kernel

Daniel Phillips wrote:
> 
> On Wednesday 18 July 2001 03:34, Larry McVoy wrote:
> > We've got a fairly nice hash table interface in BitKeeper that we'd
> > be happy to provide under the GPL.  I've always thought it would be
> > cool to have it in the kernel, we use it everywhere.
> >
> > http://bitmover.com:8888//home/bk/bugfixes/src/src/mdbm

Thanks, Larry. Your hashing functions are much more sophisticated than
the simple modulo operator I've been using for hashing by inode
number.


> I think the original poster was thinking more along the lines of a
> generic insertion, deletion and lookup interface, which we are now
> doing in an almost-generic way in a few places.  Once place that is
> distinctly un-generic is the buffer hash, for no good reason that I
> can see.  This would be a good starting point for a demonstration.
> 

Daniel's correct. I'm hashing function agnostic, but would like some
common code to simplify the management of a hash table.

Richard Guenther sent the following link to his own common hashing
code, which makes nice use of pseudo-templates:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/~checkout~/glame/glame/src/include/hash.h?rev=1.5&content-type=text/plain

A few things I would consider changing are:

  - ditching the pprev pointer
  - encapsulating the next pointer inside a struct hash_head_##FOOBAR
  - stripping out the hard-coded hashing function, and allowing the
    user to provide their own

All the backslashes offend my aesthetic sensibility, but the
preprocessor provides no alternative. ;)


-- 
Brian Watson                 | "The common people of England... so 
Linux Kernel Developer       |  jealous of their liberty, but like the 
Open SSI Clustering Project  |  common people of most other countries 
Compaq Computer Corp         |  never rightly considering wherein it 
Los Angeles, CA              |  consists..."
                             |      -Adam Smith, Wealth of Nations,
1776

mailto:Brian.J.Watson@compaq.com
http://opensource.compaq.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-21  0:24     ` Brian J. Watson
@ 2001-07-21 20:25       ` Daniel Phillips
  2001-07-22 10:18         ` Richard Guenther
                           ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Daniel Phillips @ 2001-07-21 20:25 UTC (permalink / raw)
  To: Brian J. Watson, Larry McVoy; +Cc: Linux Kernel

On Saturday 21 July 2001 02:24, Brian J. Watson wrote:
> Daniel Phillips wrote:
> > On Wednesday 18 July 2001 03:34, Larry McVoy wrote:
> > > We've got a fairly nice hash table interface in BitKeeper that
> > > we'd be happy to provide under the GPL.  I've always thought it
> > > would be cool to have it in the kernel, we use it everywhere.
> > >
> > > http://bitmover.com:8888//home/bk/bugfixes/src/src/mdbm
>
> Thanks, Larry. Your hashing functions are much more sophisticated
> than the simple modulo operator I've been using for hashing by inode
> number.

Yes, I tested almost all of them to see how well they worked my
directory index application.  There are really only two criterea:

  1) How random is the hash
  2) How efficient is it

My testing was hardly what you would call rigorous.  Basically, what I 
do is hash a lot of very unrandom strings and see how uniform the 
resulting hash bucket distribution is.  The *only* function from 
Larry's set that did well on the randomness side is the linear 
congruential hash - it did nearly as well as my dx_hack_hash.

Surprisingly, at least to me, the CRC32 turned in an extremely variable 
performance.  With a small number of buckets (say 100) it did ok, but 
with a larger numbers it showed a very lumpy distribution.  Yes, this 
is way too imprecise a way of describing what happened and I should 
take a closer look at it.  I don't have the mathematical background to 
be really sure about this, but I suspect CRC32 isn't optimized at all 
for randomness - it's optimized for detecting bit errors and has good 
properties with respect to neighbouring bits, properties that are no 
use at all to a randomizing funciton.  Anyway, I wasn't all that 
unhappy to see CRC32 turn in a poor performance for two reasons: a) the 
1K xor table would represent a 25% increase of the indexing code and b) 
hashing through the table eats an extra 1K of precious L1 cache.

The linear congruential hash from Larry's set and my dx_hack_hash share 
a common characteristic: they both munge each character against a 
pseudorandom sequence.  In Larry's hash it's a linear congruential 
sequence, and in my case it's a feedback shift register.  In addition,
I use a multiply to spread the effect of each character over a broader 
range of bits.

Larry's hash doesn't do this and you can see right away that strings 
that vary only in the last character aren't going to be distributed 
very randomly.  It might work a little better with the hashing step 
spelled this way:

-	((h) = 0x63c63cd9*(h) + 0x9c39c33d + (c))
+	((h) = 0x63c63cd9*(h + (c)) + 0x9c39c33d)

I haven't tried this, but I will.

There are people out there who know a lot more about analyzing hash 
functions than I do, and I have their names somewhere in my mailbox.  
I'll go look them up soon and submit for proper testing the whole batch 
of functions that have been suggested to me over the last few months.  
By the way, in case you haven't already deduced this, this stuff is 
really time consuming.

[...]
> Richard Guenther sent the following link to his own common hashing
> code, which makes nice use of pseudo-templates:
>
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/~checkout~/glame/glame
>/src/include/hash.h?rev=1.5&content-type=text/plain
>
> A few things I would consider changing are:
>
>   - ditching the pprev pointer

Yes, you want to use the generic list macros there.

>   - encapsulating the next pointer inside a struct hash_head_##FOOBAR

I think the generic list macros give you that for free.

>   - stripping out the hard-coded hashing function, and allowing the
>     user to provide their own

Naturally.  And trying to reduce the size of the macros.  It's not that 
easy to get stuff that has dozens of lines ending with "\" into the 
kernel.  You might have better luck just generalizing a few short sets 
of common operations used in hashes, and showing examples of how you'd 
use them to rewrite some of the existing hash code.  Obviously, the 
new, improved approach has to be no less efficient than the current way 
of doing things.

> All the backslashes offend my aesthetic sensibility, but the
> preprocessor provides no alternative. ;)

It's hard to argue against using inlines there.  It's true that there 
are a lot of generalizations you just can't do with inlines, but so 
what?  What matters is how efficient the generated code is and to a 
lesser extent, how readable the source is.  You could make that source 
quite a bit more readable with a few *small* macros and some inline 
functions.  Suggestion: express the bucket probe as an inline, compute 
the hash outside and pass it in.  Then you can wrap the whole thing up 
in a really short macro.

--
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-21 20:25       ` Daniel Phillips
@ 2001-07-22 10:18         ` Richard Guenther
  2001-07-23 14:36           ` Daniel Phillips
  2001-07-22 16:37         ` Larry McVoy
  2001-07-22 23:34         ` Eyal Lebedinsky
  2 siblings, 1 reply; 16+ messages in thread
From: Richard Guenther @ 2001-07-22 10:18 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Brian J. Watson, Larry McVoy, Linux Kernel

On Sat, 21 Jul 2001, Daniel Phillips wrote:

> On Saturday 21 July 2001 02:24, Brian J. Watson wrote:
> > Daniel Phillips wrote:
> > Richard Guenther sent the following link to his own common hashing
> > code, which makes nice use of pseudo-templates:
> >
> > http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/~checkout~/glame/glame
> >/src/include/hash.h?rev=1.5&content-type=text/plain
> >
> > A few things I would consider changing are:
> >
> >   - ditching the pprev pointer
> 
> Yes, you want to use the generic list macros there.

You get one-pointer size hash table entries and generic deletion from it.
 
> >   - encapsulating the next pointer inside a struct hash_head_##FOOBAR
> 
> I think the generic list macros give you that for free.

Umm, if you use such, you get lists.h style type-casting stuff which
doesnt have a nice interface as

my_type *hash_find_my()

instead you'd get

hash_dead_my *hash_find_my()

which you'll have to cast with something like the list_entry() macro.

> >   - stripping out the hard-coded hashing function, and allowing the
> >     user to provide their own

Ok, if the two expressions are not generic enough.

> Naturally.  And trying to reduce the size of the macros.  It's not that 
> easy to get stuff that has dozens of lines ending with "\" into the 
> kernel.  You might have better luck just generalizing a few short sets 
> of common operations used in hashes, and showing examples of how you'd 
> use them to rewrite some of the existing hash code.  Obviously, the 
> new, improved approach has to be no less efficient than the current way 
> of doing things.

All those \s are to encapsulate the whole thing into the template-style
macro - not that I like this, but I cannot see an alternative.

> > All the backslashes offend my aesthetic sensibility, but the
> > preprocessor provides no alternative. ;)
> 
> It's hard to argue against using inlines there.  It's true that there 
> are a lot of generalizations you just can't do with inlines, but so 
> what?  What matters is how efficient the generated code is and to a 
> lesser extent, how readable the source is.  You could make that source 
> quite a bit more readable with a few *small* macros and some inline 
> functions.  Suggestion: express the bucket probe as an inline, compute 
> the hash outside and pass it in.  Then you can wrap the whole thing up 
> in a really short macro.

Ok, with an approach like the list.h one (struct hash_head) you'd get
there, but of course without automatic type conversion (and safety).
Of course, if you can tidy up the macro without changing to use a
hash_head structure, I'd be glad to see how :)

Richard.

--
Richard Guenther <richard.guenther@uni-tuebingen.de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
The GLAME Project: http://www.glame.de/


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-22 10:18         ` Richard Guenther
@ 2001-07-23 14:36           ` Daniel Phillips
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2001-07-23 14:36 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Brian J. Watson, Larry McVoy, Linux Kernel

On Sunday 22 July 2001 12:18, Richard Guenther wrote:
> On Sat, 21 Jul 2001, Daniel Phillips wrote:
> > On Saturday 21 July 2001 02:24, Brian J. Watson wrote:
> > > Daniel Phillips wrote:
> > > Richard Guenther sent the following link to his own common
> > > hashing code, which makes nice use of pseudo-templates:
> > >
> > > http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/~checkout~/glame/g
> > >lame /src/include/hash.h?rev=1.5&content-type=text/plain
> > >
> > > A few things I would consider changing are:
> > >
> > >   - ditching the pprev pointer
> >
> > Yes, you want to use the generic list macros there.
>
> You get one-pointer size hash table entries and generic deletion from
> it.

Having thought about it a little more, I don't see why the symmetric 
double link (i.e., like the page hash, not like the buffer hash) 
couldn't be used with single-pointer tables, using similar tests for 
NULL in insertion and deletion.

> > >   - encapsulating the next pointer inside a struct
> > > hash_head_##FOOBAR
> >
> > I think the generic list macros give you that for free.
>
> Umm, if you use such, you get lists.h style type-casting stuff which
> doesnt have a nice interface as

You could wrap the final product in inlines with internal typecasts.  
As you pointed out, C just doesn't have templates.

> > Naturally.  And trying to reduce the size of the macros.  It's not
> > that easy to get stuff that has dozens of lines ending with "\"
> > into the kernel.  You might have better luck just generalizing a
> > few short sets of common operations used in hashes, and showing
> > examples of how you'd use them to rewrite some of the existing hash
> > code.  Obviously, the new, improved approach has to be no less
> > efficient than the current way of doing things.
>
> All those \s are to encapsulate the whole thing into the
> template-style macro - not that I like this, but I cannot see an
> alternative.

An obvious alternative is to leave things the way they are.  I've 
noticed that this is a stance Linus typically adopts for improvements 
that aren't clearly better in every way ;-)

--
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-21 20:25       ` Daniel Phillips
  2001-07-22 10:18         ` Richard Guenther
@ 2001-07-22 16:37         ` Larry McVoy
  2001-07-23 14:24           ` Daniel Phillips
  2001-07-22 23:34         ` Eyal Lebedinsky
  2 siblings, 1 reply; 16+ messages in thread
From: Larry McVoy @ 2001-07-22 16:37 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Brian J. Watson, Larry McVoy, Linux Kernel

On Sat, Jul 21, 2001 at 10:25:51PM +0200, Daniel Phillips wrote:
>   1) How random is the hash
>   2) How efficient is it

The hash is not the only part to consider for performance.  The rest of the
code is important as well.  The code I pointed you to has been really carefully
tuned for performance.  And it can be made to be MP safe, SGI did that and
managed to get 455,000 random fetches/second on an 8 way R4400 (each of
these is about the same as the original Pentium at 150Mhz).
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-22 16:37         ` Larry McVoy
@ 2001-07-23 14:24           ` Daniel Phillips
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2001-07-23 14:24 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Brian J. Watson, Larry McVoy, Linux Kernel

On Sunday 22 July 2001 18:37, Larry McVoy wrote:
> On Sat, Jul 21, 2001 at 10:25:51PM +0200, Daniel Phillips wrote:
> >   1) How random is the hash
> >   2) How efficient is it
>
> The hash is not the only part to consider for performance.  The rest
> of the code is important as well.  The code I pointed you to has been
> really carefully tuned for performance.

Yes, I can see that.  The linear congruential hash will be faster than 
the CRC32 on most modern machines, where we have fast multiplies vs 
multi-cycle table access.

If it's true that the CRC32 is actually less random as well, I'd 
consider dropping the others and just going with the linear 
congruential hash.

> And it can be made to be MP
> safe, SGI did that and managed to get 455,000 random fetches/second
> on an 8 way R4400 (each of these is about the same as the original
> Pentium at 150Mhz).

Did I mention that your linear congruential hash rated among the best 
of all I've tested?  It's possible it might be further improved along 
the lines I suggested.  I'll try this pretty soon.

--
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-21 20:25       ` Daniel Phillips
  2001-07-22 10:18         ` Richard Guenther
  2001-07-22 16:37         ` Larry McVoy
@ 2001-07-22 23:34         ` Eyal Lebedinsky
  2001-07-24 12:57           ` Daniel Phillips
  2 siblings, 1 reply; 16+ messages in thread
From: Eyal Lebedinsky @ 2001-07-22 23:34 UTC (permalink / raw)
  To: Linux Kernel

Daniel Phillips wrote:
> Yes, I tested almost all of them to see how well they worked my
> directory index application.  There are really only two criterea:
> 
>   1) How random is the hash
>   2) How efficient is it
> 
> My testing was hardly what you would call rigorous.  Basically, what I
> do is hash a lot of very unrandom strings and see how uniform the

Actually, to measure the randomness you need to measure the randomness
of
the output in the face of non-random input. Most well constructed hash
functions perform well when the strings are random, however real world
data (e.g. directory contntent) is not random at all.

Efficiency should measure both space and time resources. If it should
work in a multithreaded situation then another level of complexity is
added.

--
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.anu.edu.au/eyal/>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-22 23:34         ` Eyal Lebedinsky
@ 2001-07-24 12:57           ` Daniel Phillips
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2001-07-24 12:57 UTC (permalink / raw)
  To: Eyal Lebedinsky, Linux Kernel

On Monday 23 July 2001 01:34, Eyal Lebedinsky wrote:
> Daniel Phillips wrote:
> > Yes, I tested almost all of them to see how well they worked my
> > directory index application.  There are really only two criterea:
> >
> >   1) How random is the hash
> >   2) How efficient is it
> >
> > My testing was hardly what you would call rigorous.  Basically,
> > what I do is hash a lot of very unrandom strings and see how
> > uniform the
>
> Actually, to measure the randomness you need to measure the
> randomness of the output in the face of non-random input.

This is exactly what I do.

> Most well constructed
> hash functions perform well when the strings are random, however real
> world data (e.g. directory contntent) is not random at all.

I think you meant to say there, "even many poorly constructed hash
functions perform well when..."

> Efficiency should measure both space and time resources. If it should
> work in a multithreaded situation then another level of complexity is
> added.

Sure, I could have added "how big is it".  For me, that's just 
another kind of efficiency.  Writing the code so it's reentrant is 
just good practice.  There is no excuse whatsoever for not doing
that for something simple like a hash function, even if you
yourself never expect to run two copies concurrently.

--
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-18  1:34 ` Larry McVoy
  2001-07-18 13:46   ` Daniel Phillips
@ 2001-07-22  2:23   ` Rusty Russell
  2001-07-24 12:28     ` Daniel Phillips
  1 sibling, 1 reply; 16+ messages in thread
From: Rusty Russell @ 2001-07-22  2:23 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Linux Kernel

In message <20010717183410.S29668@work.bitmover.com> you write:
> We've got a fairly nice hash table interface in BitKeeper that we'd be 
> happy to provide under the GPL.  I've always thought it would be cool
> to have it in the kernel, we use it everywhere.
> 
> http://bitmover.com:8888//home/bk/bugfixes/src/src/mdbm

Hmmm.... cf. tdb on sourceforge.  Although having code to be mmap or
read/write backed is overkill in the kernel.

Interestingly, there's an unused, undocumented hash table interface in
include/linux/ghash.h.

Cheers,
Rusty.
--
Premature optmztion is rt of all evl. --DK

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-22  2:23   ` Rusty Russell
@ 2001-07-24 12:28     ` Daniel Phillips
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2001-07-24 12:28 UTC (permalink / raw)
  To: Rusty Russell, Larry McVoy; +Cc: Linux Kernel

On Sunday 22 July 2001 04:23, Rusty Russell wrote:
> Interestingly, there's an unused, undocumented hash table interface
> in include/linux/ghash.h.

Yikes:

#define DEF_HASH(LINKAGE,NAME,HASHSIZE,TYPE,PTRS,KEYTYPE,KEY,KEYCMP,KEYEQ,HASHFN)\

--
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-18  0:57 Common hash table implementation Brian J. Watson
  2001-07-18  1:34 ` Larry McVoy
@ 2001-07-18  9:48 ` Richard Guenther
  1 sibling, 0 replies; 16+ messages in thread
From: Richard Guenther @ 2001-07-18  9:48 UTC (permalink / raw)
  To: Brian J. Watson; +Cc: Linux Kernel, schoebel

On Tue, 17 Jul 2001, Brian J. Watson wrote:

> A couple of days ago, I was thinking about a common hash table
> implementation, ala include/linux/list.h. Then I came across
> include/linux/ghash.h, and thought that someone's already done it.
> After that I noticed the copyright line said 1997, and a quick check
> in cscope showed that nobody's including it.
> 
> Does anyone know if this file is worth studying and working with? I
> have to wonder if nobody's using it after four years.
> 
> Does anyone see a problem with a common hash table implementation?
> I've implemented a few hash tables from scratch for our clustering
> work, and it's starting to get a little old. Something easy to use
> like list.h would be a lot nicer.

You may look at

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/~checkout~/glame/glame/src/include/hash.h?rev=1.5&content-type=text/plain

which is essentially a automatic generator of code for static hash
tables like those from linux/mm/

Richard.

--
Richard Guenther <richard.guenther@uni-tuebingen.de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
The GLAME Project: http://www.glame.de/


^ permalink raw reply	[flat|nested] 16+ messages in thread

[parent not found: <oupitgqjxoi.fsf@pigdrop.muc.suse.de>]

* Re: Common hash table implementation
       [not found] <oupitgqjxoi.fsf@pigdrop.muc.suse.de>
@ 2001-07-20 22:32 ` Brian J. Watson
  2001-07-21 22:57   ` Daniel Phillips
  0 siblings, 1 reply; 16+ messages in thread
From: Brian J. Watson @ 2001-07-20 22:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel

Andi Kleen wrote:
> It's a "fuzzy hash", which is a bit different from the normal hash and
> probably not always appropiate.
> 
> It was at one point used in the dcache but then later ripped out again
> when the data structures changed.
> 

Ahh. I'm not familiar with fuzzy hashes, but I suspected they might
not be what I was interested in.


> I would like to see a generic hash abstraction in the spirit of list.h
> Especially if it would NOT use normal list_heads as open hash table
> heads but instead single pointers for the head [currently some important
> 
> hash tables like the inode hash are twice as big as needed because
> each bucket is two pointers instead of one]
> 

I agree. Hash tables such as inode_hashtable and dentry_hashtable are
half as efficient under stress as they would otherwise be, because
they use an array of list_heads.

OTOH, I have no objections to using list_heads in other applications
where a singly-linked list is all that's needed. Common code is a Good
Thing. I'm just commenting specifically on hash table implementations,
which tend to be used for really _big_ data structures.


-- 
Brian Watson                 | "The common people of England... so 
Linux Kernel Developer       |  jealous of their liberty, but like the 
Open SSI Clustering Project  |  common people of most other countries 
Compaq Computer Corp         |  never rightly considering wherein it 
Los Angeles, CA              |  consists..."
                             |      -Adam Smith, Wealth of Nations,
1776

mailto:Brian.J.Watson@compaq.com
http://opensource.compaq.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Common hash table implementation
  2001-07-20 22:32 ` Brian J. Watson
@ 2001-07-21 22:57   ` Daniel Phillips
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2001-07-21 22:57 UTC (permalink / raw)
  To: Brian J. Watson, Andi Kleen; +Cc: Linux Kernel

On Saturday 21 July 2001 00:32, Brian J. Watson wrote:
> Andi Kleen wrote:
> > hash tables like the inode hash are twice as big as needed because
> > each bucket is two pointers instead of one]
>
> I agree. Hash tables such as inode_hashtable and dentry_hashtable are
> half as efficient under stress as they would otherwise be, because
> they use an array of list_heads.

Whoops, yes.  The buffer_head pprev strategy gives both the 
double linked list and a table vector of single links, at the expense 
of a few extra tests.

For a one-size-fits-all hash strategy the question is, double-linked of 
singled-linked?  The advantage of a double linked list for a hash is 
quick, generic deletion.  Using the pprev strategy the disadvantage of 
double links in the table vector can be eliminated.  The main advantage 
of the single link is space savings both in the table and in the 
structures.  The disadvantage is having to do an extra bucket search on 
deletion, though this is partially offset by fewer pointer operations 
to perform insertion or deletion.

It's hard to say which is fastest.  It might turn out that the single 
linked strategy is actually faster, it really depends on typical depth 
of the buckets.  A double-linked list deletion needs to follow two 
pointers and set two links, the single-linked list only one of each.  
There's a similar difference for insertion.  So the extra overhead of 
insertion and deletion can be set off against say, three or four 
iterations of the bucket search loop.  If buckets average six to eight 
elements deep, it's a tie.

A more subtle cost of the double-link approach is the slight extra 
cache pressure due to the enlarged objects.  This cost is incurred 
continously as the objects are used, it might very well add up to more 
than the cost of the final list traversal for the delete in the 
single-linked case.

How about implementing both strategies, using your generic interface to 
make them look the same?  And then seeing which is fastest in practice.

--
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2001-07-24 13:57 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-07-18  0:57 Common hash table implementation Brian J. Watson
2001-07-18  1:34 ` Larry McVoy
2001-07-18 13:46   ` Daniel Phillips
2001-07-21  0:24     ` Brian J. Watson
2001-07-21 20:25       ` Daniel Phillips
2001-07-22 10:18         ` Richard Guenther
2001-07-23 14:36           ` Daniel Phillips
2001-07-22 16:37         ` Larry McVoy
2001-07-23 14:24           ` Daniel Phillips
2001-07-22 23:34         ` Eyal Lebedinsky
2001-07-24 12:57           ` Daniel Phillips
2001-07-22  2:23   ` Rusty Russell
2001-07-24 12:28     ` Daniel Phillips
2001-07-18  9:48 ` Richard Guenther
     [not found] <oupitgqjxoi.fsf@pigdrop.muc.suse.de>
2001-07-20 22:32 ` Brian J. Watson
2001-07-21 22:57   ` Daniel Phillips

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox