public inbox for linux-kernel@vger.kernel.org
* 2.4.19pre2aa1
@ 2002-03-07  8:21 Andrea Arcangeli
  2002-03-07 10:49 ` 2.4.19pre2aa1 William Lee Irwin III
  2002-03-07 11:34 ` 2.4.19pre2aa1 Christoph Hellwig
  0 siblings, 2 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-07  8:21 UTC (permalink / raw)
  To: linux-kernel

URL:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre2aa1.gz
	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre2aa1/

The standalone vm-29 patch against 2.4.19pre2 can be downloaded from here:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.19pre2/vm-29

Only in 2.4.19pre2aa1: 00_TCDRAIN-1

	drain/flush pty fixes from Michael Schroeder.

Only in 2.4.19pre1aa1: 00_VM_IO-2
Only in 2.4.19pre2aa1: 00_VM_IO-3
Only in 2.4.19pre1aa1: 00_block-highmem-all-18b-4.gz
Only in 2.4.19pre2aa1: 00_block-highmem-all-18b-5.gz
Only in 2.4.19pre1aa1: 00_flush_icache_range-2
Only in 2.4.19pre2aa1: 00_flush_icache_range-3
Only in 2.4.19pre1aa1: 00_module-gfp-5
Only in 2.4.19pre2aa1: 00_module-gfp-6
Only in 2.4.19pre1aa1: 00_nfs-2.4.17-pathconf-1
Only in 2.4.19pre2aa1: 00_nfs-2.4.17-pathconf-2
Only in 2.4.19pre1aa1: 00_lvm-1.0.2-1.gz
Only in 2.4.19pre2aa1: 00_lvm-1.0.2-2.gz
Only in 2.4.19pre1aa1: 00_silent-stack-overflow-14
Only in 2.4.19pre2aa1: 00_silent-stack-overflow-15
Only in 2.4.19pre1aa1: 20_numa-mm-1
Only in 2.4.19pre2aa1: 20_numa-mm-2
Only in 2.4.19pre1aa1: 10_compiler.h-2
Only in 2.4.19pre2aa1: 10_compiler.h-3

	Rediffed.

Only in 2.4.19pre2aa1: 00_alpha-lseek-1

	Export the right interface for lseek/sys_lseek.

Only in 2.4.19pre2aa1: 00_ffs_compile_failure-1

	Fix compilation failure with gcc 3.1 due to a kernel bug.
	From Andi Kleen and Jan Hubicka.

Only in 2.4.19pre2aa1: 00_gcc-3_1-compile-1

	Fix compilation failure with gcc 3.1.

Only in 2.4.19pre2aa1: 00_kiobuf_init-1

	Optimize kiobuf initialization. From Intel and Hubert Mantel.

Only in 2.4.19pre2aa1: 00_nfs-fix_create-1

	New nfs fix from Trond.

Only in 2.4.19pre2aa1: 00_proc-sig-race-fix-1

	Fix /proc signal SMP race from Chris Mason.

Only in 2.4.19pre1aa1: 00_rwsem-fair-26
Only in 2.4.19pre1aa1: 00_rwsem-fair-26-recursive-7
Only in 2.4.19pre2aa1: 00_rwsem-fair-27
Only in 2.4.19pre2aa1: 00_rwsem-fair-27-recursive-8

	Fixed a 64-bit bug (s/int/long/) in down_read. Thanks to
	Jeff Wiedemeier and Andreas Schwab for providing
	testcases for this bug.

Only in 2.4.19pre2aa1: 00_sd-cleanup-1

	Rest of the sd cleanups from Pete Zaitcev.

Only in 2.4.19pre2aa1: 00_shmdt-retval-1

	Fix shmdt retval from Andreas Schwab.


Only in 2.4.19pre1aa1: 10_nfs-o_direct-3
Only in 2.4.19pre2aa1: 10_nfs-o_direct-4

	New version from Trond.

Only in 2.4.19pre1aa1: 10_no-virtual-2

	Obsoleted by mainline.

Only in 2.4.19pre1aa1: 10_vm-28
Only in 2.4.19pre2aa1: 10_vm-29

	Make sure to free memclass-related metadata for pages out of the
	memclass (bh headers, in short).  kmem_cache_shrink is now run in
	the max_mapped path as well. Added other sysctls and minor tuning.

	Changed the wait_table stuff, first of all have it per-node (no point
	in per-zone), then let it scale more with the ram in the machine (the
	amount of ram used for the wait table is ridiculously small and it
	mostly depends on the amount of I/O, not on the amount of ram, so
	set up a low limit instead of a high limit). The hashfn I wasn't
	very comfortable with. This one is the same as pagemap.h's, and it's
	not that huge on the 64bit archs.  If anything it had to be a mul.
	This hashfn ensures physically consecutive pages are contiguous on
	the cacheline, and once the pages are randomized they provide just
	enough randomization; we just need to protect against bootup, when
	it's more likely the pages are physically consecutive.
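As a sketch, the sizing policy described above (scale with the machine's RAM, but with a lower bound rather than an upper one) could look like the following; the per-bucket chunk size and the floor are invented numbers for illustration, not the values in vm-29:

```c
#include <assert.h>

/* Hypothetical sizing rule: one bucket per 4096 pages of node RAM,
 * rounded up to a power of two, never below a 256-bucket floor.
 * Both constants are illustrative, not taken from the patch. */
static unsigned long wait_table_size(unsigned long pages_in_node)
{
	unsigned long size = 1;

	while (size < pages_in_node / 4096)
		size <<= 1;
	if (size < 256)		/* low limit instead of a high limit */
		size = 256;
	return size;
}
```

With numbers like these the table stays tiny even on a 16G box, which is the argument for erring on the large side.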

Only in 2.4.19pre1aa1: 20_pte-highmem-14
Only in 2.4.19pre2aa1: 20_pte-highmem-15

	Initialize the kmap functionality early during boot, right
	after the mem init, so pte_alloc and in turn ioremap can be used
	in the driver init functions.

Only in 2.4.19pre1aa1: 50_uml-patch-2.4.17-12.gz
Only in 2.4.19pre2aa1: 50_uml-patch-2.4.18-2.gz

	New revision from Jeff at user-mode-linux.sourceforge.net.

Only in 2.4.19pre1aa1: 70_xfs-2.gz
Only in 2.4.19pre2aa1: 70_xfs-4.gz

	Rediffed.

Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-07  8:21 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-07 10:49 ` William Lee Irwin III
  2002-03-07 11:27   ` 2.4.19pre2aa1 Daniel Phillips
  2002-03-07 17:03   ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-07 11:34 ` 2.4.19pre2aa1 Christoph Hellwig
  1 sibling, 2 replies; 53+ messages in thread
From: William Lee Irwin III @ 2002-03-07 10:49 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, riel, hch, phillips

On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> 	Changed the wait_table stuff, first of all have it per-node (no point
> 	in per-zone),

Yes there is. It's called locality of reference. Zones exhibit distinct
usage patterns and affinities; hence, they should have distinct tables.
It is unwise to allow pressure on one zone (i.e. many waiters there) to
interfere with (by creating collisions) efficient wakeups for other zones.


On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
>	then let it scale more with the ram in the machine (the
> 	amount of ram used for the wait table is ridiculously small and it
> 	mostly depends on the amount of I/O, not on the amount of ram, so
> 	set up a low limit instead of a high limit).

The wait_table does not need to scale with the RAM on the machine
	It needs to scale with the number of waiters.

4096 threads blocked on I/O is already approaching or exceeding the
scalability limits of other core kernel subsystems.


On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
>                                                      The hashfn I wasn't
> 	very comfortable with.

I am less than impressed by this unscientific rationale.


On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
>                               This one is the same as pagemap.h's, and it's
> 	not that huge on the 64bit archs.  If anything it had to be a mul.

It was a mul. It was explicitly commented that the mul was expanded to a
shift/add implementation to accommodate 64-bit architectures, and the
motivation for the particular constant to multiply by was documented in
several posts of mine. It didn't "have to be", it was, and was commented.

This is not mere idle speculation and unfounded micro-optimization.
This is a direct consequence of reports of poor performance due to the
cost of multiplication on several 64-bit architectures, which was
resolved by the expansion of the multiplication to shifts and adds
(the multiplier was designed from inception to support expansion to shifts
and adds; it was discovered the compiler could not do this automatically).
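The shift/add expansion being described can be sketched as follows; the 32-bit golden-ratio multiplier 0x9e370001 (a prime near 2^32/phi) and its decomposition 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 are recalled from the hash code of that era, so treat the exact constants as illustrative rather than as a quote from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit golden-ratio multiplier, chosen so the multiplication
 * decomposes into a handful of shifts and adds:
 * 0x9e370001 = 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
#define GOLDEN_RATIO_PRIME 0x9e370001u

/* Multiplicative hash: one mul, keep the top `bits` bits. */
static uint32_t hash_mul(uint32_t val, unsigned bits)
{
	return (uint32_t)(val * GOLDEN_RATIO_PRIME) >> (32 - bits);
}

/* The same hash with the mul hand-expanded into shifts and adds,
 * the form used where the compiler would not do the expansion. */
static uint32_t hash_shift_add(uint32_t val, unsigned bits)
{
	uint32_t hash = val;	/* the "+ 1" term */
	hash += val << 31;	/* + 2^31 */
	hash += val << 29;	/* + 2^29 */
	hash -= val << 25;	/* - 2^25 */
	hash += val << 22;	/* + 2^22 */
	hash -= val << 19;	/* - 2^19 */
	hash -= val << 16;	/* - 2^16 */
	return hash >> (32 - bits);
}
```

Both functions produce identical buckets for every input; the hand expansion exists only because the multiply was reported slow on several 64-bit CPUs of the time.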

And for Pete's sake that pagemap hash function is ridiculously dirty code;
please, regardless of whether you insist on being inefficient for the
sake of some sort of vendetta, or deny the facts upon which the prior
implementation was based, or just refuse to cooperate until the bitter
end, please, please, clean that hash up instead of copying and pasting it.


On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> 	This hashfn ensures physically consecutive pages are contiguous on
> 	the cacheline, and once the pages are randomized they provide just
> 	enough randomization; we just need to protect against bootup, when
> 	it's more likely the pages are physically consecutive.

Is this some kind of joke? You honestly believe some kind of homebrew
strategy to create collisions is remotely worthwhile? Do you realize
even for one instant that the cost of collisions here is *not* just a
list traversal, but also the spurious wakeup of blocked threads? You've
gone off the deep end in an attempt to be contrary; this is even more
than demonstrably inefficient, it's so blatant that inspection reveals it.

Designing your hash function to create collisions is voodoo, pure and
simple. "Just enough randomization" is also voodoo. The pagemap.h hash
was one of the specific hash functions in the kernel producing
measurably horrific hash table distributions on large memory machines,
and my efforts to find a better hash function for it contributed
directly to the design of the hash functions presented in the hashed
waitqueue code. The (rather long) lengths I've gone to to prevent
collisions are justified by the low impact on the implementation (the
choice of multiplier) and the extremely high cost of collisions in this
scheme, which comes from the wake-all semantics of the hash table buckets.

This reversion to a demonstrably poor hashing scheme in the face of
very high collision penalties is laughable.


Bill


* Re: 2.4.19pre2aa1
  2002-03-07 10:49 ` 2.4.19pre2aa1 William Lee Irwin III
@ 2002-03-07 11:27   ` Daniel Phillips
  2002-03-07 11:47     ` 2.4.19pre2aa1 William Lee Irwin III
  2002-03-07 17:03   ` 2.4.19pre2aa1 Andrea Arcangeli
  1 sibling, 1 reply; 53+ messages in thread
From: Daniel Phillips @ 2002-03-07 11:27 UTC (permalink / raw)
  To: William Lee Irwin III, Andrea Arcangeli; +Cc: linux-kernel, riel, hch

On March 7, 2002 11:49 am, William Lee Irwin III wrote:
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> >	then let it scale more with the ram in the machine (the
> > 	amount of ram used for the wait table is ridiculously small and it
> > 	mostly depends on the amount of I/O, not on the amount of ram, so
> > 	set up a low limit instead of a high limit).
> 
> The wait_table does not need to scale with the RAM on the machine
> 	It needs to scale with the number of waiters.
>
> 4096 threads blocked on I/O is already approaching or exceeding the
> scalability limits of other core kernel subsystems.

I think he meant that with larger ram it's easy to justify making the wait
table a little looser, to gain a tiny little bit of extra performance.

That seems perfectly reasonable to me.  Oh, and there's also the observation
that machines with larger ram tend to be more heavily loaded with processes,
just because one can.

--
Daniel


* Re: 2.4.19pre2aa1
  2002-03-07  8:21 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-07 10:49 ` 2.4.19pre2aa1 William Lee Irwin III
@ 2002-03-07 11:34 ` Christoph Hellwig
  1 sibling, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2002-03-07 11:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

In article <20020307092119.A25470@dualathlon.random> you wrote:
> Only in 2.4.19pre1aa1: 00_lvm-1.0.2-1.gz
> Only in 2.4.19pre2aa1: 00_lvm-1.0.2-2.gz

Is there a specific reason you reverted to LVM 1.0.2 from the in-tree
LVM 1.0.3?



* Re: 2.4.19pre2aa1
  2002-03-07 11:47     ` 2.4.19pre2aa1 William Lee Irwin III
@ 2002-03-07 11:46       ` Daniel Phillips
  0 siblings, 0 replies; 53+ messages in thread
From: Daniel Phillips @ 2002-03-07 11:46 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrea Arcangeli, linux-kernel, riel, hch

On March 7, 2002 12:47 pm, William Lee Irwin III wrote:
> On March 7, 2002 11:49 am, William Lee Irwin III wrote:
> >> 4096 threads blocked on I/O is already approaching or exceeding the
> >> scalability limits of other core kernel subsystems.
> 
> On Thu, Mar 07, 2002 at 12:27:41PM +0100, Daniel Phillips wrote:
> > I think he meant that with larger ram it's easy to justify making the wait
> > table a little looser, to gain a tiny little bit of extra performance.
> > That seems perfectly reasonable to me.  Oh, and there's also the observation
> > that machines with larger ram tend to be more heavily loaded with processes,
> > just because one can.
> 
> And these processes will not be waiting on pages as often, either, as
> pages will be more plentiful.
> 
> From the reports I've seen, typical numbers of waiters on pages, even
> on large systems, are as much as an order of magnitude fewer in number
> than 4096. It would seem applications need to wait on pages less with
> the increased memory size.

Uh Yup.

-- 
Daniel


* Re: 2.4.19pre2aa1
  2002-03-07 11:27   ` 2.4.19pre2aa1 Daniel Phillips
@ 2002-03-07 11:47     ` William Lee Irwin III
  2002-03-07 11:46       ` 2.4.19pre2aa1 Daniel Phillips
  0 siblings, 1 reply; 53+ messages in thread
From: William Lee Irwin III @ 2002-03-07 11:47 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, linux-kernel, riel, hch

On March 7, 2002 11:49 am, William Lee Irwin III wrote:
>> 4096 threads blocked on I/O is already approaching or exceeding the
>> scalability limits of other core kernel subsystems.

On Thu, Mar 07, 2002 at 12:27:41PM +0100, Daniel Phillips wrote:
> I think he meant that with larger ram it's easy to justify making the wait
> table a little looser, to gain a tiny little bit of extra performance.
> That seems perfectly reasonable to me.  Oh, and there's also the observation
> that machines with larger ram tend to be more heavily loaded with processes,
> just because one can.


And these processes will not be waiting on pages as often, either, as
pages will be more plentiful.

From the reports I've seen, typical numbers of waiters on pages, even
on large systems, are as much as an order of magnitude fewer in number
than 4096. It would seem applications need to wait on pages less with
the increased memory size.

Cheers,
Bill


* Re: 2.4.19pre2aa1
  2002-03-07 10:49 ` 2.4.19pre2aa1 William Lee Irwin III
  2002-03-07 11:27   ` 2.4.19pre2aa1 Daniel Phillips
@ 2002-03-07 17:03   ` Andrea Arcangeli
  2002-03-07 20:18     ` 2.4.19pre2aa1 William Lee Irwin III
  1 sibling, 1 reply; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-07 17:03 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, riel, hch, phillips

On Thu, Mar 07, 2002 at 02:49:42AM -0800, William Lee Irwin III wrote:
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> > 	Changed the wait_table stuff, first of all have it per-node (no point
> > 	in per-zone),
> 
> Yes there is. It's called locality of reference. Zones exhibit distinct
> usage patterns and affinities; hence, they should have distinct tables.

There is no different usage pattern: the pages you're waiting on are all
in the pagecache, which spreads across the whole ram of the machine on
any arch. That's just wasted ram. I didn't make it global because of the
numa locality of the local node's memory (so it uses the whole bandwidth
better, and with reads there's a better chance the cpu issuing the wait
is on the local node too).

> It is unwise to allow pressure on one zone (i.e. many waiters there) to
> interfere with (by creating collisions) efficient wakeups for other zones.
> 
> 
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> >	then let it scale more with the ram in the machine (the
> > 	amount of ram used for the wait table is ridiculously small and it
> > 	mostly depends on the amount of I/O, not on the amount of ram, so
> > 	set up a low limit instead of a high limit).
> 
> The wait_table does not need to scale with the RAM on the machine
> 	It needs to scale with the number of waiters.

16G machines tend to have many more cpus and waiters than a 32mbyte
machine. With a 32mbyte machine you'll run out of kernel stack first :)

> 4096 threads blocked on I/O is already approaching or exceeding the
> scalability limits of other core kernel subsystems.

It's not only a matter of the 4096 limit: the ram wastage is so
ridiculously low that we definitely want to decrease the probability of
collisions, since 16k on those 16G boxes is nothing, especially knowing
we'll likely have several thousand threads blocking on I/O on those
machines (no async IO in 2.4). Furthermore we don't know whether some
future arch will deliver more cpu/bus/ram performance and scale better,
so I wouldn't impose such a limit given the tiny ram utilization on a
huge box.

> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> >                                                      The hashfn I wasn't
> > 	very comfortable with.
> 
> I am less than impressed by this unscientific rationale.
> 
> 
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> >                               This one is the same as pagemap.h's, and it's
> > 	not that huge on the 64bit archs.  If anything it had to be a mul.
> 
> It was a mul. It was explicitly commented that the mul was expanded to a
> shift/add implementation to accommodate 64-bit architectures, and the
> motivation for the particular constant to multiply by was documented in
> several posts of mine. It didn't "have to be", it was, and was commented.
> 
> This is not mere idle speculation and unfounded micro-optimization.
> This is a direct consequence of reports of poor performance due to the
> cost of multiplication on several 64-bit architectures, which was

I think that happened to be true 10 years ago. I don't know of any
recent machine where a dozen incremental shifts are faster than a
single plain mul. Am I wrong?

> resolved by the expansion of the multiplication to shifts and adds
> (the multiplier was designed from inception to support expansion to shifts
> and adds; it was discovered the compiler could not do this automatically).
> 
> And for Pete's sake that pagemap hash function is ridiculously dirty code;

The pagemap.h hash ensures that successive indexes fall into the same
hash cacheline. That is an important property. The sizeof(struct inode)
trickery also spares the compiler from dividing the inode pointer by
the size of the inode, allowing it to optimize the division into a
right shift; Linus was so kind as to explain it to me years ago when I
really couldn't see the point of such trickery. A shift is faster than
a div (note: this has nothing to do with the dozen shifts versus the mul).

> please, regardless of whether you insist on being inefficient for the
> sake of some sort of vendetta, or deny the facts upon which the prior

What vendetta, sorry :) ? If I change something it's because I feel it's
faster/cleaner done that way; I will never change anything unless I have
a technical argument for it.

> implementation was based, or just refuse to cooperate until the bitter
> end, please, please, clean that hash up instead of copying and pasting it.
> 
> 
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> > 	This hashfn ensures physically consecutive pages are contiguous on
> > 	the cacheline, and once the pages are randomized they provide just
> > 	enough randomization; we just need to protect against bootup, when
> > 	it's more likely the pages are physically consecutive.
> 
> Is this some kind of joke? You honestly believe some kind of homebrew
> strategy to create collisions is remotely worthwhile? Do you realize
> even for one instant that the cost of collisions here is *not* just a
> list traversal, but also the spurious wakeup of blocked threads? You've

I know what happens during collisions; do you really think I designed my
hashfn to make more collisions than yours?

> gone off the deep end in an attempt to be contrary; this is even more
> than demonstrably inefficient, it's so blatant inspection reveals it.

Please prove it in practice.

I repeat: the pages during production are truly random, so neither my
hashfn nor yours can make any difference; they cannot predict the
future. So IMHO you're totally wrong.

I think you're confused: sometimes there's a pattern in the hash input,
but here there's no pattern at all, except at the very early boot stage
(or because somebody freed some multipage allocation previously), where
pages have a higher probability of being physically consecutive, and my
hashfn in fact takes care of distributing that case well, unlike yours.
A mul cannot predict the future.

So please, go ahead: change my code to use your hashfn, monitor the
number of collisions with my hashfn and with yours, and if you can
measure that my hashfn generates 50% more collisions than yours (you
know, some random variation is very possible since the input is totally
random; 50% is an arbitrary number, it depends on the standard deviation
of the collisions) I will be glad to apologise for my mistake and to
revert to your version immediately. All I'm asking for is a real world
measurement, in particular over the long run: just run cerberus for 10
hours or so, and measure the max and mean bucket population at regular
intervals with a tasklet or something like that. Take it as a
"comparison", just to see the difference. If your hashfn is so much
better you will have no worry about showing the 50% improvement in real
life. I will also try to do it locally here if time permits, so if you
write a patch please pass it over so we don't duplicate efforts.

For the other points I think you shouldn't really complain (neither at
runtime nor in code style; please see how clean it is with the
wait_table_t thing): I made a definite improvement to your code. The
only non-obvious part is the hashfn, but I really cannot see yours
beating mine given the totally random input; in fact it could be the
other way around, because if anything there's the probability that the
pages are physically consecutive, and I take care of that case fine.

Take my changes as a suggestion to try it out, definitely not as a
vendetta. Instead of just endlessly spreading words on l-k, I made the
changes in practice, so now it is very, very easy to make the
comparison.

Don't be offended if I changed your code. I very much appreciate your
and Rik's changes; they're good, and I only meant to improve them.

If my mistake about the hashfn is so huge, I hope it will be compensated
by the other two improvements at least :).

thanks,

Andrea


* Re: 2.4.19pre2aa1
  2002-03-07 17:03   ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-07 20:18     ` William Lee Irwin III
  2002-03-07 20:38       ` 2.4.19pre2aa1 Richard B. Johnson
  2002-03-08  0:11       ` 2.4.19pre2aa1 Andrea Arcangeli
  0 siblings, 2 replies; 53+ messages in thread
From: William Lee Irwin III @ 2002-03-07 20:18 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, riel, hch, phillips

On Thu, Mar 07, 2002 at 06:03:00PM +0100, Andrea Arcangeli wrote:
> For the other points I think you shouldn't really complain (neither at
> runtime nor in code style; please see how clean it is with the
> wait_table_t thing): I made a definite improvement to your code. The
> only non-obvious part is the hashfn, but I really cannot see yours
> beating mine given the totally random input; in fact it could be the
> other way around, because if anything there's the probability that the
> pages are physically consecutive, and I take care of that case fine.


I don't know whose definition of clean code this is:

+static inline wait_queue_head_t * wait_table_hashfn(struct page * page, wait_table_t * wait_table)
+{
+#define i (((unsigned long) page)/(sizeof(struct page) & ~ (sizeof(struct page) - 1)))
+#define s(x) ((x)+((x)>>wait_table->shift))
+	return wait_table->head + (s(i) & (wait_table->size-1));
+#undef i
+#undef s
+}


I'm not sure I want to find out.


Bill


* Re: 2.4.19pre2aa1
  2002-03-07 20:18     ` 2.4.19pre2aa1 William Lee Irwin III
@ 2002-03-07 20:38       ` Richard B. Johnson
  2002-03-08  0:22         ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-08  0:11       ` 2.4.19pre2aa1 Andrea Arcangeli
  1 sibling, 1 reply; 53+ messages in thread
From: Richard B. Johnson @ 2002-03-07 20:38 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrea Arcangeli, linux-kernel, riel, hch, phillips

On Thu, 7 Mar 2002, William Lee Irwin III wrote:

> On Thu, Mar 07, 2002 at 06:03:00PM +0100, Andrea Arcangeli wrote:
> > For the other points I think you shouldn't really complain (neither at
> > runtime nor in code style; please see how clean it is with the
> > wait_table_t thing): I made a definite improvement to your code. The
> > only non-obvious part is the hashfn, but I really cannot see yours
> > beating mine given the totally random input; in fact it could be the
> > other way around, because if anything there's the probability that the
> > pages are physically consecutive, and I take care of that case fine.
> 
> 
> I don't know whose definition of clean code this is:
> 
> +static inline wait_queue_head_t * wait_table_hashfn(struct page * page, wait_table_t * wait_table)
> +{
> +#define i (((unsigned long) page)/(sizeof(struct page) & ~ (sizeof(struct page) - 1)))
> +#define s(x) ((x)+((x)>>wait_table->shift))
> +	return wait_table->head + (s(i) & (wait_table->size-1));
> +#undef i
> +#undef s
> +}
> 
> 
> I'm not sure I want to find out.
> 
> 
> Bill

Methinks there is entirely too much white-space in the code. It
is almost readable *;). It probably could be fixed up to slip through
the compiler with no possibility of human interpretation, but that would
take a bit more work ;^)

The choice of temporaries makes me think of a former President
who, when confronted with a lie, said: "It depends upon what 'is' is!"
Keep up the good work!

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (799.53 BogoMips).

	Bill Gates? Who?



* Re: 2.4.19pre2aa1
  2002-03-07 20:18     ` 2.4.19pre2aa1 William Lee Irwin III
  2002-03-07 20:38       ` 2.4.19pre2aa1 Richard B. Johnson
@ 2002-03-08  0:11       ` Andrea Arcangeli
  1 sibling, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-08  0:11 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, riel, hch, phillips

On Thu, Mar 07, 2002 at 12:18:19PM -0800, William Lee Irwin III wrote:
> On Thu, Mar 07, 2002 at 06:03:00PM +0100, Andrea Arcangeli wrote:
> > For the other points I think you shouldn't really complain (neither at
> > runtime nor in code style; please see how clean it is with the
> > wait_table_t thing): I made a definite improvement to your code. The
> > only non-obvious part is the hashfn, but I really cannot see yours
> > beating mine given the totally random input; in fact it could be the
> > other way around, because if anything there's the probability that the
> > pages are physically consecutive, and I take care of that case fine.
> 
> 
> I don't know whose definition of clean code this is:
> 
> +static inline wait_queue_head_t * wait_table_hashfn(struct page * page, wait_table_t * wait_table)
> +{
> +#define i (((unsigned long) page)/(sizeof(struct page) & ~ (sizeof(struct page) - 1)))
> +#define s(x) ((x)+((x)>>wait_table->shift))
> +	return wait_table->head + (s(i) & (wait_table->size-1));
> +#undef i
> +#undef s
> +}
> 
> 
> I'm not sure I want to find out.

The above is again the hash function; the hashfn code doesn't need to
be nice, but the API around wait_table_hashfn does. See the above
wait_table_t typedef.

During some further auditing I also noticed that you introduced a
certain unused wake_up_page. That's buggy: if you use it you'll
deadlock. Also it would be cleaner if __lock_page didn't use the
exclusive waitqueue, so that in turn you could keep using wake_up for
unlock_page. Once you share the waitqueue nothing can be wake-one any
longer; this is probably the worst drawback of the wait_table
memory-saving patch. In fact I was considering solving the collisions
with additional memory, rather than having to drop the wake-one
behaviour when many threads are working on the same chunk of a file,
which your design requires. Quite frankly I don't think this was an
urgent thing to change in 2.4 (it only saves some memory, and even if
64G boxes will now boot with CONFIG_1G, the lowmem will be way too
unbalanced to be good for general purpose).

Andrea


* Re: 2.4.19pre2aa1
  2002-03-07 20:38       ` 2.4.19pre2aa1 Richard B. Johnson
@ 2002-03-08  0:22         ` Andrea Arcangeli
  2002-03-08  0:26           ` 2.4.19pre2aa1 Rik van Riel
  0 siblings, 1 reply; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-08  0:22 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: William Lee Irwin III, linux-kernel, riel, hch, phillips

On Thu, Mar 07, 2002 at 03:38:09PM -0500, Richard B. Johnson wrote:
> On Thu, 7 Mar 2002, William Lee Irwin III wrote:
> 
> > On Thu, Mar 07, 2002 at 06:03:00PM +0100, Andrea Arcangeli wrote:
> > > For the other points I think you shouldn't really complain (neither at
> > > runtime nor in code style; please see how clean it is with the
> > > wait_table_t thing): I made a definite improvement to your code. The
> > > only non-obvious part is the hashfn, but I really cannot see yours
> > > beating mine given the totally random input; in fact it could be the
> > > other way around, because if anything there's the probability that the
> > > pages are physically consecutive, and I take care of that case fine.
> > 
> > 
> > I don't know whose definition of clean code this is:
> > 
> > +static inline wait_queue_head_t * wait_table_hashfn(struct page * page, wait_table_t * wait_table)
> > +{
> > +#define i (((unsigned long) page)/(sizeof(struct page) & ~ (sizeof(struct page) - 1)))
> > +#define s(x) ((x)+((x)>>wait_table->shift))
> > +	return wait_table->head + (s(i) & (wait_table->size-1));
> > +#undef i
> > +#undef s
> > +}
> > 
> > 
> > I'm not sure I want to find out.
> > 
> > 
> > Bill
> 
> Methinks there is entirely too much white-space in the code. It
> is almost readable *;). It probably could be fixed up to slip through
> the compiler with no possibility of human interpretation, but that would
> take a bit more work ;^)

Well, to me it is very readable, but I'm certainly biased, I admit.

You have to think of it this way:

-	"i" is the random, unpredictable input
-	"i" is generated by dropping the non-random bits of "page"
		with a cute trick that drops those bits with
		a shift right by an immediate argument; it's not a
		division, so it's faster: it could have been
		"page/sizeof(struct page)" but that would have been slower
-	s(i) takes into account not just the last "wait_table->shift" bits
	of the random information in "i", but also the next "shift"
	bits, so it's more random even if some of those lower or higher
	bits generate patterns for whatever reason
-	finally you mask the resulting random information s(i) with
	size-1, so you fit into the wait_table->head memory.

The result is that it exploits most of the random input, and physically
consecutive pages are likely to use the same cacheline in case there is
some pattern. That looks like a good algorithm to me.
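Stripped of the #define trickery, and with stand-in types (the real wait_table_t and struct page layouts are in the patch, not reproduced here), the computation described above is roughly:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types: field layout and sizes are illustrative only. */
struct page { unsigned long flags; void *mapping; unsigned long index; };

struct wait_table {
	unsigned long size;	/* number of buckets, a power of two */
	unsigned long shift;	/* log2(size) */
};

/* Largest power of two dividing sizeof(struct page): dividing by it
 * is a plain right shift by an immediate, not a real division. */
#define PAGE_DIVISOR (sizeof(struct page) & ~(sizeof(struct page) - 1))

static unsigned long wait_table_bucket(const struct page *page,
				       const struct wait_table *wt)
{
	/* "i": strip the constant low-order bits of the pointer. */
	unsigned long i = (unsigned long)page / PAGE_DIVISOR;

	/* "s(i)": fold the next `shift` bits down so patterns in the
	 * higher bits still contribute to the bucket choice. */
	unsigned long s = i + (i >> wt->shift);

	/* Mask to the table size; physically consecutive pages land
	 * in adjacent buckets, hence on shared cachelines. */
	return s & (wt->size - 1);
}
```

With this layout, pages that are adjacent in memory hash to adjacent buckets, which is the cacheline-contiguity property the list above is arguing for.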

Andrea


* Re: 2.4.19pre2aa1
  2002-03-08  0:22         ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-08  0:26           ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2002-03-08  0:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Richard B. Johnson, William Lee Irwin III, linux-kernel, hch,
	phillips

On Fri, 8 Mar 2002, Andrea Arcangeli wrote:

> The result is that it exploits most of the random input, and physically
> consecutive pages are likely to use the same cacheline in case there is
> some pattern. That looks like a good algorithm to me.

Worthy of documentation, even.

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/



* 2.4.19pre2aa1
@ 2002-03-08  2:40 rwhron
  0 siblings, 0 replies; 53+ messages in thread
From: rwhron @ 2002-03-08  2:40 UTC (permalink / raw)
  To: linux-kernel

Extended changelog at:
http://home.earthlink.net/~rwhron/kernel/andrea/2.4.19pre2aa1.html
-- 
Randy Hron



* Re: 2.4.19pre2aa1
@ 2002-03-12  4:19 wli
  2002-03-12  5:31 ` 2.4.19pre2aa1 wli
  2002-03-12  6:06 ` 2.4.19pre2aa1 Andrea Arcangeli
  0 siblings, 2 replies; 53+ messages in thread
From: wli @ 2002-03-12  4:19 UTC (permalink / raw)
  To: andrea; +Cc: Richard B. Johnson, linux-kernel, riel, hch, phillips

On Fri, Mar 08, 2002 at 01:22:39AM +0100, Andrea Arcangeli wrote:
> -	"i" is the random not predictable input
> -	"i" is generated by dropping the non random bits of the "page"
> 		with a cute trick that drops the bit with
> 		a shiftright with an immediate argument, it's not a
> 		division so it's faster it could been
> 		"page/sizeof(struct page)" but it would been slower
> -	s(i) takes into account not just the last "wait_table->shift" bytes
> 	of the random information in "i", but also the next "shift"
> 	bytes, so it's more random even if some of those lower or higher
> 	bits is generating patterns for whatever reason
> -	finally you mask out the resulting random information s(i) with
> 	size-1, so you fit into the wait_table->head memory.
> The result is that it exploits most of the random input, and physically
> consecutive pages are liekly to use the same cacheline in case there is
> some pattern. That looks a good algorithm to me.

I'm sorry, this reasoning is too flawed to address in detail. I'm a
programmer, not a schoolteacher. I'll drop a hint: "randomness" is not
the right word to use. The way to make a meaningful statement is a
statistical measurement of how close the resulting bucket distributions
are to uniform, as uniform distributions across buckets minimize collisions.
I suggest the usual chi^2 and Kolmogorov-Smirnov (and variants) tests.

The results of these tests are simple: they will compute the
probability that the difference between the measured distribution and a
true uniform distribution could not have occurred by chance (i.e.
whether its difference from the uniform distribution is statistically
significant). With these things in hand you should be able to make more
meaningful assertions about your hashing algorithms.


Cheers,
Bill

P.S.:	The quality (or lack thereof) of this hash function is already
	visible in histograms of the pagecache bucket distribution on
	larger systems. Statistical measurements of its quality reflect
	this very clearly as well. Someone needs to move the stable out
	of the Sahara and closer to a lake.

	http://www.samba.org/~anton/linux/pagecache/pagecache_before.png

	is a histogram of the pagecache hash function's bucket distribution
	on an SMP ppc64 machine after some benchmark run.

	http://www.samba.org/~anton/linux/pagecache/pagecache_after.png

	has a histogram of a Fibonacci hashing hash function's bucket
	distribution on the same machine after an identical benchmark run.

	(There is more to come from the hash table analysis front.)


* Re: 2.4.19pre2aa1
  2002-03-12  4:19 2.4.19pre2aa1 wli
@ 2002-03-12  5:31 ` wli
  2002-03-12  6:36   ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-12  6:06 ` 2.4.19pre2aa1 Andrea Arcangeli
  1 sibling, 1 reply; 53+ messages in thread
From: wli @ 2002-03-12  5:31 UTC (permalink / raw)
  To: wli, andrea, Richard B. Johnson, linux-kernel, riel, hch,
	phillips

On Tue, Mar 12, 2002 at 04:19:58AM +0000, wli@parcelfarce.linux.theplanet.co.uk wrote:
> 	http://www.samba.org/~anton/linux/pagecache/pagecache_before.png
> 
> 	is a histogram of the pagecache hash function's bucket distribution
> 	on an SMP ppc64 machine after some benchmark run.
> 
> 	http://www.samba.org/~anton/linux/pagecache/pagecache_after.png
> 
> 	has a histogram of a Fibonacci hashing hash function's bucket
> 	distribution on the same machine after an identical benchmark run.

akpm just pointed out to me that these histograms are not quite the best
comparisons, as the tables differ in size. I'll get something webbable
soon with head-to-head comparisons. OTOH the general nature of things
should be clear and the behavior of that hash function visible.


Cheers,
Bill


* Re: 2.4.19pre2aa1
  2002-03-12  4:19 2.4.19pre2aa1 wli
  2002-03-12  5:31 ` 2.4.19pre2aa1 wli
@ 2002-03-12  6:06 ` Andrea Arcangeli
  2002-03-12 10:46   ` 2.4.19pre2aa1 Rik van Riel
  2002-03-12 11:29   ` 2.4.19pre2aa1 wli
  1 sibling, 2 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-12  6:06 UTC (permalink / raw)
  To: wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 04:19:58AM +0000, wli@parcelfarce.linux.theplanet.co.uk wrote:
> On Fri, Mar 08, 2002 at 01:22:39AM +0100, Andrea Arcangeli wrote:
> > -	"i" is the random not predictable input
> > -	"i" is generated by dropping the non random bits of the "page"
> > 		with a cute trick that drops the bit with
> > 		a shiftright with an immediate argument, it's not a
> > 		division so it's faster it could been
> > 		"page/sizeof(struct page)" but it would been slower
> > -	s(i) takes into account not just the last "wait_table->shift" bytes
> > 	of the random information in "i", but also the next "shift"
> > 	bytes, so it's more random even if some of those lower or higher
> > 	bits is generating patterns for whatever reason
> > -	finally you mask out the resulting random information s(i) with
> > 	size-1, so you fit into the wait_table->head memory.
> > The result is that it exploits most of the random input, and physically
> > consecutive pages are liekly to use the same cacheline in case there is
> > some pattern. That looks a good algorithm to me.
> 
> I'm sorry, this reasoning is too flawed to address in detail. I'm a

Well, that's the same reasoning behind the hashfn that everybody using
linux computes at every read syscall. I didn't invent it, I only told
you the reasoning behind it.

btw, some years back Chuck Lever also did very interesting, extensive
research on hashing, and he evaluated the multiplicative hash method
as well.

The critical thing that arose from Chuck's extensive research on the
hashes was the dynamic sizing of the hashes at boot; the hashfns
weren't changed. If a huge performance optimization had been there for
the taking, and the current hashfns were flawed, how do you explain
that the hashfns are still unchanged? Because Linus dropped the emails
with the patches and they weren't resent? Then why didn't somebody
pick them up in a different benchmark tree?

That said, I agree that for the pagecache hash there could be
interesting patterns in input. If the inodes are allocated at nearly the
same time after boot, and you work on large files, then you'll likely
generate lots of collisions that could be avoided by the multiplicative
hash. So it's not optimal for handling big files with only a few inodes.

But note that this email thread is about the wait_table hash, not the
pagecache hash; they're two completely different beasts, even if the
reasoning behind them is the same.

> programmer, not a schoolteacher. I'll drop a hint: "randomness" is not

I studied the multiplicative hashes (I recall the div method) in
Introduction to Algorithms, and what I think about them is that those
hashfns were designed to work well in the presence of non-random
patterns in input. One example of a hashfn in linux that doesn't get
significantly random input is the buffer cache hash.  That's indexed
by the offset and the blkdev; the blkdev is just fixed, no matter if
it's a pointer or a major/minor. The offset will definitely have huge
patterns that can lead to bad distribution (just like the "offset" in
the pagecache hash). I recall Chuck noticed that, and in fact the
hashfn of the buffercache is the only one that I'm sure has been
replaced because it was bad.

See what it has become now:

#define _hashfn(dev,block)	\
	((((dev)<<(bh_hash_shift - 6)) ^ ((dev)<<(bh_hash_shift - 9))) ^ \
	 (((block)<<(bh_hash_shift - 6)) ^ ((block) >> 13) ^ \
	  ((block) << (bh_hash_shift - 12))))

For the pagecache hashfn similar collisions can happen with big files
and only a few inodes allocated right after boot. The current pagecache
hash is optimal for small files, like a simultaneous kernel compile +
kde compile + gcc compile etc.

Also note that one thing the math doesn't contemplate is that it may be
faster to handle more collisions on a hot L2/L1 cache than to have no
collisions but constantly run at the speed of the DRAM, and the current
hashes definitely care a lot about this. I only see you looking at the
distribution on paper and not at the access pattern to memory and the
cache effects.

But anyway, we're not talking about the pagecache hash here; the
randomness of the input of the wait_table isn't remotely comparable to
the randomness of the input of the pagecache hash. After some load the
free pages in the freelists will become completely random, and if
there's any bias, the probability is that the pages are contiguous (for
example when a kernel stack gets released and two pages of pagecache
get allocated while reading a file, and then we wait on both, and we
want both waitqueues in the same cacheline). It's not like the
pagecache, where you can easily generate collisions artificially by
working on big files with two inodes allocated right after boot.  Here
the pages are the random input and we can't ask for a better input to
a hashfn.

If you assume a pure random input, I don't see how you can distribute
it better with a simple mul assembler instruction. That's my whole
point. Now if intuition doesn't work here, and you need a complex
mathematical demonstration to prove it, please post a pointer to a book
or publication and I will study it; I definitely want to understand how
the hell a "mul" could distribute a pure random input better, because
intuition tells me it's not possible.

> the right word to use. The way to make a meaningful statement is a
> statistical measurement of how close the resulting bucket distributions
> are to uniform, as uniform distributions across buckets minimize collisions.
> I suggest the usual chi^2 and Kolmogorov-Smirnov (and variants) tests.
> 
> The results of these tests are simple: they will compute the
> probability that the difference between the measured distribution and a
> true uniform distribution could not have occurred by chance (i.e.
> whether its difference from the uniform distribution is statistically
> significant). With these things in hand you should be able to make more
> meaningful assertions about your hashing algorithms.
> 
> 
> Cheers,
> Bill
> 
> P.S.:	The quality (or lack thereof) of this hash function is already
> 	visible in histograms of the pagecache bucket distribution on
> 	larger systems. Statistical measurements of its quality reflect
> 	this very clearly as well. Someone needs to move the stable out
> 	of the Sahara and closer to a lake.
> 
> 	http://www.samba.org/~anton/linux/pagecache/pagecache_before.png
> 
> 	is a histogram of the pagecache hash function's bucket distribution
> 	on an SMP ppc64 machine after some benchmark run.
> 
> 	http://www.samba.org/~anton/linux/pagecache/pagecache_after.png
> 
> 	has a histogram of a Fibonacci hashing hash function's bucket
> 	distribution on the same machine after an identical benchmark run.
> 
> 	(There is more to come from the hash table analysis front.)

from the graphs it seems the "before" had 250000 buckets, and the
"after" had 8 million buckets. Is that right? Or were all the other
8 million - 250000 buckets unused in the "before" test? If so, then you
were definitely not monitoring a misc load with small files, but of
course for such a kind of workload the Fibonacci hashing could very
well be better than the current pagecache hash, I completely agree. The
"after" graph also shows the hash is sized wrong; that's because we're
not using the bootmem allocator. Dave has a patch for that (but it's
still not highmem-aware). btw, it's amazing that the hash is almost as
fast as the radix tree despite the fact that the hash can currently be
at most 2M in size; on a >=4G box you really need it bigger or you'll
be guaranteed to walk a bigger depth than the radix tree.

Andrea


* Re: 2.4.19pre2aa1
  2002-03-12  5:31 ` 2.4.19pre2aa1 wli
@ 2002-03-12  6:36   ` Andrea Arcangeli
  0 siblings, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-12  6:36 UTC (permalink / raw)
  To: wli, wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 05:31:52AM +0000, wli@holomorphy.com wrote:
> On Tue, Mar 12, 2002 at 04:19:58AM +0000, wli@parcelfarce.linux.theplanet.co.uk wrote:
> > 	http://www.samba.org/~anton/linux/pagecache/pagecache_before.png
> > 
> > 	is a histogram of the pagecache hash function's bucket distribution
> > 	on an SMP ppc64 machine after some benchmark run.
> > 
> > 	http://www.samba.org/~anton/linux/pagecache/pagecache_after.png
> > 
> > 	has a histogram of a Fibonacci hashing hash function's bucket
> > 	distribution on the same machine after an identical benchmark run.
> 
> akpm just pointed out to me that these histograms are not quite the best
> comparisons, as the tables differ in size. I'll get something webbable

yes, I also noticed it immediately: 250000 buckets vs 8 million
buckets... Not only that, now I also notice that there seems to be a
different number of entries in the two hashes.

> soon with head-to-head comparisons. OTOH the general nature of things
> should be clear and the behavior of that hash function visible.

I won't be really surprised if you can beat the pagecache hash with big
files. Fibonacci/mul may very well be better there. I'm not sure you
can beat it on small files though, and still one should always take
into account the cache effects; monitoring the hash distribution
isn't the end of the story.

I would be mostly interested to see a comparison for the hashfn of the
wait_table too, since that is the thing we were discussing here.

Andrea


* Re: 2.4.19pre2aa1
  2002-03-12  6:06 ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-12 10:46   ` Rik van Riel
  2002-03-12 11:47     ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-12 11:29   ` 2.4.19pre2aa1 wli
  1 sibling, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2002-03-12 10:46 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: wli, Richard B. Johnson, linux-kernel, hch, phillips

On Tue, 12 Mar 2002, Andrea Arcangeli wrote:

> If you assume a pure random input, I don't see how can you distribute it
> better with a simple mul assembler instruction. That's my whole point.

Since you've just given two examples of common non-random
workloads yourself, I don't see how you can reasonably
expect a pure random input.

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/



* Re: 2.4.19pre2aa1
  2002-03-12  6:06 ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-12 10:46   ` 2.4.19pre2aa1 Rik van Riel
@ 2002-03-12 11:29   ` wli
  2002-03-12 12:56     ` 2.4.19pre2aa1 Andrea Arcangeli
  1 sibling, 1 reply; 53+ messages in thread
From: wli @ 2002-03-12 11:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Fri, Mar 08, 2002 at 01:22:39AM +0100, Andrea Arcangeli wrote:
>>> The result is that it exploits most of the random input, and physically
>>> consecutive pages are likely to use the same cacheline in case there is
>>> some pattern. That looks like a good algorithm to me.

On Tue, Mar 12, 2002 at 04:19:58AM +0000, wli@parcelfarce.linux.theplanet.co.uk wrote:
>> I'm sorry, this reasoning is too flawed to address in detail. I'm a

On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> Well, that's the same reasoning behind the hashfn that everybody using
> linux computes at every read syscall. I didn't invent it, I only told
> you the reasoning behind it.

I maintain that this reasoning is fallacious and that the results are
suboptimal. The argument is from authority; the results are ones I have
measured.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> btw, some years back Chuck Lever also did very interesting, extensive
> research on hashing, and he evaluated the multiplicative hash method
> as well.

I cited Chuck Lever in my comments.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> The critical thing that arose from Chuck's extensive research on the
> hashes was the dynamic sizing of the hashes at boot; the hashfns
> weren't changed. If a huge performance optimization had been there for
> the taking, and the current hashfns were flawed, how do you explain
> that the hashfns are still unchanged? Because Linus dropped the emails
> with the patches and they weren't resent? Then why didn't somebody
> pick them up in a different benchmark tree?

The pagecache hash function and the inode cache hash functions don't
pass statistical tests, and the performance improvements from using
hash functions that do pass are demonstrable.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> That said, I agree that for the pagecache hash there could be
> interesting patterns in input. If the inodes are allocated at nearly the
> same time after boot, and you work on large files, then you'll likely
> generate lots of collisions that could be avoided by the multiplicative
> hash. So it's not optimal for handling big files with only a few inodes.

-ESPECULATION

Q(chi^2|nu), the probability that a uniform distribution would give
rise to such measured distributions by chance, is vanishingly small for
the pagecache hash functions. I have measured this in various scenarios,
and am attempting to gather enough information to compute statistical
correlations between benchmarks and statistical tests for the
uniformity of the distribution across buckets in addition to statistical
properties of the bucket distributions themselves.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> But note that this email thread is about the wait_table hash, not the
> pagecache hash, they're two completly different beasts, even if the
> reasoning behind them is the same one.

Not really. The pagecache hash function proved itself so poor in the
pagecache setting it was not a worthwhile candidate.


On Tue, Mar 12, 2002 at 04:19:58AM +0000, wli@parcelfarce.linux.theplanet.co.uk wrote:
>> programmer, not a schoolteacher. I'll drop a hint: "randomness" is not

On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> I studied the multiplicative hashes (I recall the div method) in
> Introduction to Algorithms, and what I think about them is that those
> hashfns were designed to work well in the presence of non-random
> patterns in input. One example of a hashfn in linux that doesn't get
> significantly random input is the buffer cache hash.  That's indexed
> by the offset and the blkdev; the blkdev is just fixed, no matter if
> it's a pointer or a major/minor. The offset will definitely have huge
> patterns that can lead to bad distribution (just like the "offset" in
> the pagecache hash). I recall Chuck noticed that, and in fact the
> hashfn of the buffercache is the only one that I'm sure has been
> replaced because it was bad.

CLR's mention of multiplicative hashing is little more than a citation
of Knuth with a brief computation of a bucket index (using floating
point arithmetic!) for an example. The div method is in the section
immediately preceding it and is described as slot = key % tablesize.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> See what it has become now:
> #define _hashfn(dev,block)	\
> 	((((dev)<<(bh_hash_shift - 6)) ^ ((dev)<<(bh_hash_shift - 9))) ^ \
> 	 (((block)<<(bh_hash_shift - 6)) ^ ((block) >> 13) ^ \
> 	  ((block) << (bh_hash_shift - 12))))

I've actually measured the distributions of entries across buckets
resulting from this hash function and it has impressive statistical
properties. Fibonacci hashing falls flat on its face for the bcache,
which seems to be at variance with your sequential input assertion,
as that appears to be one of Fibonacci hashing's specialties (behold
ext2 inode number allocation).


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> For the pagecache hashfn similar collisions can happen with big files
> and only a few inodes allocated right after boot. The current pagecache
> hash is optimal for small files, like a simultaneous kernel compile +
> kde compile + gcc compile etc.

-ESPECULATION

The pagecache hash function has never passed a single statistical test
that I've put it through. Q(chi^2|nu) < 10**-200 etc. is typical.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> Also note that one thing the math doesn't contemplate is that it may be
> faster to handle more collisions on a hot L2/L1 cache than to have no
> collisions but constantly run at the speed of the DRAM, and the current
> hashes definitely care a lot about this. I only see you looking at the
> distribution on paper and not at the access pattern to memory and the
> cache effects.

I'll humor you for a moment but point out that L2/L1 cache misses are
considerably cheaper than scheduling storms. Furthermore, the kind of
proof you'd need to establish this sort of assertion for ordinary bucket
disciplines appears to involve reading profiling registers to count
cache misses, or perhaps tracing bus accesses with logic analyzers, or
simulation logging, none of which I suspect has been done.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> But anyway, we're not talking about the pagecache hash here; the
> randomness of the input of the wait_table isn't remotely comparable to
> the randomness of the input of the pagecache hash. After some load the
> free pages in the freelists will become completely random, and if
> there's any bias, the probability is that the pages are contiguous (for
> example when a kernel stack gets released and two pages of pagecache
> get allocated while reading a file, and then we wait on both, and we
> want both waitqueues in the same cacheline). It's not like the
> pagecache, where you can easily generate collisions artificially by
> working on big files with two inodes allocated right after boot.  Here
> the pages are the random input and we can't ask for a better input to
> a hashfn.

This is pure speculation. Would you care demonstrate these phenomena by
means of tracing methods, e.g. simulator traces or instrumentation code?

My personal opinion on these sorts of assertions is that they are
highly workload dependent and involve interactions among the policies
of the slab and physical allocators, usage patterns, architectural
formatting of virtual addresses, and physical and virtual memory
layouts, which in my mind are complex enough, and vary enough, that
they are best modeled with statistical and empirical methods.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> If you assume a pure random input, I don't see how you can distribute
> it better with a simple mul assembler instruction. That's my whole
> point. Now if intuition doesn't work here, and you need a complex
> mathematical demonstration to prove it, please post a pointer to a book
> or publication and I will study it; I definitely want to understand how
> the hell a "mul" could distribute a pure random input better, because
> intuition tells me it's not possible.

A random variable could have a distribution concentrated at a single
point (which would defeat any hashing scheme). You need to revise your
terminology at the very least.

I'd also love to hear what sort of distribution you believe your random
input variable to have, and measurements to compare it against.

For specific reasons why multiplication by a magic constant (related to
the golden ratio) should have particularly interesting scattering
properties, Knuth cites Vera Turán Sós, Acta Math. Acad. Sci. Hung. 8
(1957), 461-471; Ann. Univ. Sci. Budapest, Eötvös Sect. Math 1 (1958),
127-134, and gives further details in exercises 8 and 9 in section 6.4,
TAOCP vol. 2 (2nd ed.). There are also some relevant lemmas around for
the connoisseur.

I don't take this to mean that Fibonacci hashing is gospel.
In my mind the advantages are specifically:
(1) the search process for good hash functions may be guided by sieving
(2) there is a comprehensible motivation behind the hashing scheme
(3) the hash functions actually pass the statistical tests I run
(4) they provide demonstrable benefits for the tables I use them in

Now, the proofs as they are carried out appear to involve sequential
natural numbers, yet one should notice that this is *not* a finite
process. Hence, it makes an assertion regarding all of the natural
numbers simultaneously, not sequential insertions.


On Tue, Mar 12, 2002 at 07:06:45AM +0100, Andrea Arcangeli wrote:
> from the graphs it seems the "before" had 250000 buckets, and the
> "after" had 8 million buckets. Is that right? Or were all the other
> 8 million - 250000 buckets unused in the "before" test? If so, then you
> were definitely not monitoring a misc load with small files, but of
> course for such a kind of workload the Fibonacci hashing could very
> well be better than the current pagecache hash, I completely agree. The
> "after" graph also shows the hash is sized wrong; that's because we're
> not using the bootmem allocator. Dave has a patch for that (but it's
> still not highmem-aware). btw, it's amazing that the hash is almost as
> fast as the radix tree despite the fact that the hash can currently be
> at most 2M in size; on a >=4G box you really need it bigger or you'll
> be guaranteed to walk a bigger depth than the radix tree.

What you're noticing here is that what I actually pointed you at was
"before" and "after" for pagecache allocation in bootmem. I don't
actually know where anton's comparisons of my Fibonacci hashing
functions against that are, and I probably won't hear until hacker
hours resume in .au.

I will either sort out the results I have on the waitqueues or rerun
tests at my leisure.

Note I am only trying to help you avoid shooting yourself in the foot.


Cheers,
Bill


* Re: 2.4.19pre2aa1
  2002-03-12 10:46   ` 2.4.19pre2aa1 Rik van Riel
@ 2002-03-12 11:47     ` Andrea Arcangeli
  2002-03-12 11:48       ` 2.4.19pre2aa1 wli
  2002-03-12 12:21       ` 2.4.19pre2aa1 Daniel Phillips
  0 siblings, 2 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-12 11:47 UTC (permalink / raw)
  To: Rik van Riel; +Cc: wli, Richard B. Johnson, linux-kernel, hch, phillips

On Tue, Mar 12, 2002 at 07:46:51AM -0300, Rik van Riel wrote:
> On Tue, 12 Mar 2002, Andrea Arcangeli wrote:
> 
> > If you assume a pure random input, I don't see how can you distribute it
> > better with a simple mul assembler instruction. That's my whole point.
> 
> Since you've just given two examples of common non-random
> workloads yourself, I don't see how you can reasonably
> expect a pure random input.

It is not pure random of course; it can't be pure random, there will
always be some influence, for example from the I/O patterns, in
particular right after boot. But the randomness of the input is still
excellent; it's not an inode pointer that could have very little
change.  I made the example to say that in the unlikely case it
happens, that's great: we'll have the guarantee that those two pages
won't collide in the same waitqueue.

In fact, above I said "if you _assume_ pure random input". That's what
I'm interested in: the case where you _assume_ pure random input (in
practice there will always be some influence).

Bill said that "randomness" is not the right word to use; I instead
think it's the key. If the input is pure random, the mul will make no
difference to the distribution. And the closer we are to pure random,
as in the wait_table hash, the less the mul will help and the more
important it will be to just get the two contiguous pages into the
same cacheline and nothing else.

For the pagecache hash the thing is different: as said, with the
pagecache some pathological workload can really generate lots of
collisions reproducibly. Big patterns can happen there, for example if
the inodes happen to be allocated in a row right after boot and then
some massive I/O starts on similar patterns. In developer workloads
with lots of small files and VM activity, though, the pagecache hash
should be just fine.

Andrea


* Re: 2.4.19pre2aa1
  2002-03-12 11:47     ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-12 11:48       ` wli
  2002-03-12 12:21       ` 2.4.19pre2aa1 Daniel Phillips
  1 sibling, 0 replies; 53+ messages in thread
From: wli @ 2002-03-12 11:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, wli, Richard B. Johnson, linux-kernel, hch,
	phillips

On Tue, Mar 12, 2002 at 12:47:28PM +0100, Andrea Arcangeli wrote:
> Bill said that "randomness" is not the right word to use; I instead
> think it's the key. If the input is pure random, the mul will make no
> difference to the distribution. And the closer we are to pure random,
> as in the wait_table hash, the less the mul will help and the more
> important it will be to just get the two contiguous pages into the
> same cacheline and nothing else.

Aha! It's random! Is the distribution:

(1) Poisson
(2) singular
(3) binomial
(4) hypergeometric

or what?


Cheers,
Bill


* Re: 2.4.19pre2aa1
  2002-03-12 11:47     ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-12 11:48       ` 2.4.19pre2aa1 wli
@ 2002-03-12 12:21       ` Daniel Phillips
  2002-03-12 14:25         ` 2.4.19pre2aa1 Andrea Arcangeli
  1 sibling, 1 reply; 53+ messages in thread
From: Daniel Phillips @ 2002-03-12 12:21 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel
  Cc: wli, Richard B. Johnson, linux-kernel, hch, phillips

On March 12, 2002 12:47 pm, Andrea Arcangeli wrote:
> If the input is pure random, the mul will make no difference to
> the distribution. And the closer we are to pure random, as in the
> wait_table hash, the less the mul will help and the more important
> it will be to just get the two contiguous pages into the same
> cacheline and nothing else.

You're ignoring the possibility (probability) of corner cases.  I'm not
sure why you're beating away on this; Bill has done a fine job of coming
up with hashes that are both satisfactory for the common case and have
sound reasons for being resistant to corner cases.

-- 
Daniel


* Re: 2.4.19pre2aa1
  2002-03-12 11:29   ` 2.4.19pre2aa1 wli
@ 2002-03-12 12:56     ` Andrea Arcangeli
  2002-03-12 13:20       ` 2.4.19pre2aa1 Rik van Riel
                         ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-12 12:56 UTC (permalink / raw)
  To: wli, wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 11:29:00AM +0000, wli@holomorphy.com wrote:
> For specific reasons why multiplication by a magic constant (related to
> the golden ratio) should have particularly interesting scattering
> properties, Knuth cites Vera Turán Sós, Acta Math. Acad. Sci. Hung. 8

I know about the scattering properties of that number (which has been
measured empirically, if I remember well what I read). I was asking for
something else: whether this magic number can also scatter a random
input better, or not. My answer is no. And I believe the nearer the
input to the hashfn is to pure random, the less the mul can scatter it
better than the input itself, and it will just lose cache locality for
consecutive pages. So the nearer you are, the better off you are
avoiding the mul and taking full advantage of the randomness of the
input, rather than keeping on assuming the input has patterns.

I mean, start reading from /dev/random and see how the distribution goes
with and without mul, it will be the same I think.

> I will either sort out the results I have on the waitqueues or rerun
> tests at my leisure.

I'm waiting for it, if you've monitor patches I can run something too.
I'd like to count the number of collisions over time in particular.

> Note I am only trying to help you avoid shooting yourself in the foot.

If I've got rescheduling problems you're likely to have them too, I'm
quite sure; if anything, the fact that I keep the hashtable large enough
will make me less bitten by the collisions. The only certain way to
avoid risking regressions would be to back out the wait_table part that
was merged in mainline 19pre2. The page_zone thing cannot generate any
regression, for instance (same is true for page_address); the wait_table
part is a gray area instead. It's just a heuristic like all the hashes,
and you can always have a corner case bite hard; it's just that the
probability of such a corner case happening has to be small enough to be
acceptable. But you can always be unlucky, no matter whether you mul or
not: you cannot predict which pages people will wait on next from
userspace.

Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 12:56     ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-12 13:20       ` Rik van Riel
  2002-03-12 13:33       ` 2.4.19pre2aa1 Richard B. Johnson
  2002-03-12 14:14       ` 2.4.19pre2aa1 wli
  2 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2002-03-12 13:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: wli, wli, Richard B. Johnson, linux-kernel, hch, phillips

On Tue, 12 Mar 2002, Andrea Arcangeli wrote:

> I know about the scattering properties of some number (that has been
> measured empirically, if I remember well what I read). I was asking for
> something else: whether this magical number can scatter a random input
> better as well, or not. My answer is no.

> I mean, start reading from /dev/random and see how the distribution goes
> with and without mul, it will be the same I think.

I think it's time we introduce a CS theory equivalent
of Godwin's Law.

You've just outknuthed yourself ;)

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 12:56     ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-12 13:20       ` 2.4.19pre2aa1 Rik van Riel
@ 2002-03-12 13:33       ` Richard B. Johnson
  2002-03-12 14:17         ` 2.4.19pre2aa1 wli
  2002-03-13  2:18         ` 2.4.19pre2aa1 wli
  2002-03-12 14:14       ` 2.4.19pre2aa1 wli
  2 siblings, 2 replies; 53+ messages in thread
From: Richard B. Johnson @ 2002-03-12 13:33 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: wli, wli, linux-kernel, riel, hch, phillips

On Tue, 12 Mar 2002, Andrea Arcangeli wrote:

> On Tue, Mar 12, 2002 at 11:29:00AM +0000, wli@holomorphy.com wrote:
> > For specific reasons why multiplication by a magic constant (related to
> > the golden ratio) should have particularly interesting scattering
> > properties, Knuth cites Vera Turán Sós, Acta Math. Acad. Sci. Hung. 8
> 
> I know about the scattering properties of some number (that has been
> measured empirically, if I remember well what I read). I was asking for
> something else: whether this magical number can scatter a random input
> better as well, or not. My answer is no. And I believe the nearer the
> input to the hashfn is to random, the less the mul can scatter better
> than the input itself, and it will just lose cache locality for
> consecutive pages. So the nearer you are, the better off you are
> avoiding the mul and taking full advantage of the randomness of the
> input, rather than keeping on assuming the input has patterns.
> 
> I mean, start reading from /dev/random and see how the distribution goes
> with and without mul, it will be the same I think.
> 
> > I will either sort out the results I have on the waitqueues or rerun
> > tests at my leisure.
> 
> I'm waiting for it, if you've monitor patches I can run something too.
> I'd like to count the number of collisions over time in particular.
> 
> > Note I am only trying to help you avoid shooting yourself in the foot.
> 
> If I've got rescheduling problems you're likely to have them too, I'm
> quite sure; if anything, the fact that I keep the hashtable large enough
> will make me less bitten by the collisions. The only certain way to
> avoid risking regressions would be to back out the wait_table part that
> was merged in mainline 19pre2. The page_zone thing cannot generate any
> regression, for instance (same is true for page_address); the wait_table
> part is a gray area instead. It's just a heuristic like all the hashes,
> and you can always have a corner case bite hard; it's just that the
> probability of such a corner case happening has to be small enough to be
> acceptable. But you can always be unlucky, no matter whether you mul or
> not: you cannot predict which pages people will wait on next from
> userspace.
> 
> Andrea



This is a simple random number generator. It takes a pointer to your
own private long, somewhere in your code, and returns a long random
number with a period of 0xfffd4011. I ran a program for about a
year, trying to find a magic number that will produce a longer
period.

You could add a ldiv and return the modulus to set hash-table limits.
ANDs are not good because, in principle, you could get many numbers
in which all the low bits are zero.


The advantage of this simple code is that it is fast. The disadvantages
are, of course, that it's not portable and that a rotation of a binary
number is not a mathematical function that lends itself to rigorous
analysis.


#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
#
#   File rnd.S            Created 04-FEB-2000        Richard B. Johnson
#                                                    rjohnson@analogic.com
#
#   Simple random number generator.
#
#

MAGIC = 0xd0047077
INTPTR = 0x08
.section	.text
.global		rnd
.type		rnd,@function
.align	0x04
rnd:
	pushl	%ebx
	movl	INTPTR(%esp), %ebx
	movl	(%ebx), %eax
	rorl	$3, %eax
	addl	$MAGIC, %eax
	movl	%eax, (%ebx)
	popl	%ebx
	ret
.end
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
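For readers without an i386 assembler handy, a C rendering of the routine above might look like the following. This is a sketch transcribed by hand from the assembly, not Dick's code:

```c
#include <assert.h>
#include <stdint.h>

#define MAGIC 0xd0047077u

/* C rendering of the i386 routine above: rotate the caller's seed
 * right by 3 bits, add the magic constant, store it back, return it. */
static uint32_t rnd(uint32_t *seed)
{
    uint32_t x = *seed;

    x = (x >> 3) | (x << 29);   /* rorl $3, %eax */
    x += MAGIC;                 /* addl $MAGIC, %eax */
    *seed = x;
    return x;
}
```

From a zero seed the first value is simply MAGIC, since rotating zero is a no-op.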


Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

	Bill Gates? Who?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 12:56     ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-12 13:20       ` 2.4.19pre2aa1 Rik van Riel
  2002-03-12 13:33       ` 2.4.19pre2aa1 Richard B. Johnson
@ 2002-03-12 14:14       ` wli
  2002-03-12 15:04         ` 2.4.19pre2aa1 Andrea Arcangeli
  2 siblings, 1 reply; 53+ messages in thread
From: wli @ 2002-03-12 14:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 11:29:00AM +0000, wli@holomorphy.com wrote:
>> For specific reasons why multiplication by a magic constant (related to
>> the golden ratio) should have particularly interesting scattering
>> properties, Knuth cites Vera Turán Sós, Acta Math. Acad. Sci. Hung. 8

On Tue, Mar 12, 2002 at 01:56:05PM +0100, Andrea Arcangeli wrote:
> I know about the scattering properties of some number (that has been
> measured empirically, if I remember well what I read). I was asking for
> something else: whether this magical number can scatter a random input
> better as well, or not. My answer is no. And I believe the nearer the
> input to the hashfn is to random, the less the mul can scatter better
> than the input itself, and it will just lose cache locality for
> consecutive pages. So the nearer you are, the better off you are
> avoiding the mul and taking full advantage of the randomness of the
> input, rather than keeping on assuming the input has patterns.
> I mean, start reading from /dev/random and see how the distribution goes
> with and without mul, it will be the same I think.

The assumption here is then that the input distribution is uniform.
This is, in fact, the input distribution Fibonacci hashing was designed
for (i.e. a uniform distribution on the natural numbers, which arises
quite naturally from considering all of \mathbb{N} as input).
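For concreteness, the Fibonacci hashing under discussion can be sketched as follows; the constant is the widely used prime near 2^32/phi, and the function name and power-of-two-table assumption are illustrative, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of 32-bit Fibonacci (multiplicative) hashing: multiply by a
 * constant close to 2^32/phi and keep the top `bits` bits, for a table
 * of 2^bits buckets. */
#define GOLDEN_RATIO_32 2654435761u

static inline unsigned fib_hash(uint32_t key, unsigned bits)
{
    return (unsigned)((key * GOLDEN_RATIO_32) >> (32 - bits));
}
```

Taking the high bits is what makes the multiply matter: consecutive keys land in widely separated buckets, which is exactly the cache-locality trade-off being debated in this thread.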


On Tue, Mar 12, 2002 at 11:29:00AM +0000, wli@holomorphy.com wrote:
>> I will either sort out the results I have on the waitqueues or rerun
>> tests at my leisure.

On Tue, Mar 12, 2002 at 01:56:05PM +0100, Andrea Arcangeli wrote:
> I'm waiting for it, if you've monitor patches I can run something too.
> I'd like to count the number of collisions over time in particular.

This is a fairly crude statistic but one that I'm collecting. The
hashtable profiling is something I've been treating as a userspace
issue with the state dumps done by mmapping /dev/kmem, and in fact,
I did not write that myself, that is due to Anton Blanchard and Rusty
Russell, who initiated this area of study, and have made the more
important contributions to it. In this instance, you can see as well
as I that the references date to the 50's and 60's, and that accurate
measurement is the truly difficult aspect of the problem. Formulae to
grind statistics into single numbers are nothing but following
directions, and these have been precisely the portions of the effort
for which I've been responsible, aside from perhaps some of the sieving
routines, with which I had assistance from Krystof and Fare from OPN,
and these are not especially significant in terms of effort either.
Anton and Rusty have been the ones with the insight to identify the
problematic areas and also to find methods of collecting data.


On Tue, Mar 12, 2002 at 11:29:00AM +0000, wli@holomorphy.com wrote:
>> Note I am only trying to help you avoid shooting yourself in the foot.

On Tue, Mar 12, 2002 at 01:56:05PM +0100, Andrea Arcangeli wrote:
> If I've got rescheduling problems you're likely to have them too, I'm
> quite sure; if anything, the fact that I keep the hashtable large enough
> will make me less bitten by the collisions. The only certain way to
> avoid risking regressions would be to back out the wait_table part that
> was merged in mainline 19pre2. The page_zone thing cannot generate any
> regression, for instance (same is true for page_address); the wait_table
> part is a gray area instead. It's just a heuristic like all the hashes,
> and you can always have a corner case bite hard; it's just that the
> probability of such a corner case happening has to be small enough to be
> acceptable. But you can always be unlucky, no matter whether you mul or
> not: you cannot predict which pages people will wait on next from
> userspace.

I have some small reservations about these assertions.
(1) the scheduling storms are direct results of collisions in the hash
	tables, which are direct consequences of the hash function quality.
	OTOH increasing table size is one method of collision reduction,
	albeit one I would like extremely strict control over, and one
	that should not truly be considered until the load is high
	(e.g. 0.75 or thereabouts)
(2) page_address() has been seen to cause small regressions on
	architectures where the additional memory reference in comparison
	to a __va((page - mem_map) << PAGE_SHIFT) implementation is
	measurable. In my mind, several generic variants are required for
	optimality and I will either personally implement them or endorse
	any suitably clean implementations of them (clean is the hard part).
(3) Hashing is not mere heuristics. A hash function is a sufficiently
	small portion of an implementation of a table lookup scheme
	that the search may be guided by statistical measurements of
	both the effectiveness of the hash function in distributing
	keys across buckets and net profiling results. The fact I
	restricted the space I searched to Fibonacci hashing is
	irrelevant; it was merely a tractable search subspace.
(4) The notion that a hashing scheme may be defeated by a sufficiently
	unlucky input distribution is irrelevant. This has been known
	(and I suppose feared) since its invention in the early 1950's.
	Hash tables are nondeterministic data structures by design, and
	their metrics of performance are either expected or amortized
	running time, not worst case, which is always that of linear search.
(5) Last, but not least, the /dev/random argument is bogus. Mass
	readings of pseudorandom number generators do not make for	
	necessarily uniformly distributed trials or samples. By and
	large the methods by which pseudorandom number generators are
	judged (and rightly so) are various kinds of statistical relations
	between some (fixed for the duration of the trial) number of
	successively generated numbers. For instance, groups of N (where
	N is small and fixed for the duration of the trial) consecutively
	generated numbers should not lie within the same hyperplane for
	some applications. These have no relation to the properties
	which would cause a distribution of a set of numbers which is
	measured as a snapshot in time to pass a test for uniformity.


Otherwise we appear to be largely in agreement.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 13:33       ` 2.4.19pre2aa1 Richard B. Johnson
@ 2002-03-12 14:17         ` wli
  2002-03-12 14:30           ` 2.4.19pre2aa1 Richard B. Johnson
  2002-03-13  2:18         ` 2.4.19pre2aa1 wli
  1 sibling, 1 reply; 53+ messages in thread
From: wli @ 2002-03-12 14:17 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: Andrea Arcangeli, wli, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 08:33:23AM -0500, Richard B. Johnson wrote:
> This is a simple random number generator. It takes a pointer to your
> own private long, somewhere in your code, and returns a long random
> number with a period of 0xfffd4011. I ran a program for about a
> year, trying to find a magic number that will produce a longer
> period.
> 
> You could add a ldiv and return the modulus to set hash-table limits.
> ANDs are not good because, in principle, you could get many numbers
> in which all the low bits are zero.
> 
> 
> The advantage of this simple code is that it is fast. The disadvantages
> are, of course, that it's not portable and that a rotation of a binary
> number is not a mathematical function that lends itself to rigorous
> analysis.

Would you mind explaining what the point of this is? AFAICT this is
meaningless noise inspired by the words "/dev/random".


Bill

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 12:21       ` 2.4.19pre2aa1 Daniel Phillips
@ 2002-03-12 14:25         ` Andrea Arcangeli
  2002-03-12 14:32           ` 2.4.19pre2aa1 Daniel Phillips
  2002-03-15 17:20           ` 2.4.19pre2aa1 Horst von Brand
  0 siblings, 2 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-12 14:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, wli, Richard B. Johnson, linux-kernel, hch

On Tue, Mar 12, 2002 at 01:21:35PM +0100, Daniel Phillips wrote:
> On March 12, 2002 12:47 pm, Andrea Arcangeli wrote:
> > If it's pure random, mul will make no difference to
> > the distribution. And the closer we are to pure random, like in the
> > wait_table hash, the less the mul will help and the more important it
> > will be to just get the two contiguous pages into the same cacheline
> > and nothing else.
> 
> You're ignoring the possibility (probability) of corner cases.  I'm not

The corner cases cannot go away with a mul. If what you care about is
corner cases, you shouldn't have dropped page->wait.

I changed the hashfn to make it better, IMHO; Bill says it is
suboptimal, but I would prefer to see that happen under real load too
before returning to the mul. Counting the number of collisions per
second under load should be a good enough measurement; nominal
performance would be the other variable, but I doubt there's anything
measurable in the real-time differences.

If I'm generating visibly more collisions, I will be very glad to return
to the mul hashfn of course.

AFAIK my current hashfn has never been tested before on this kind of
random input, that of the wait_table pages.

Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 14:17         ` 2.4.19pre2aa1 wli
@ 2002-03-12 14:30           ` Richard B. Johnson
  0 siblings, 0 replies; 53+ messages in thread
From: Richard B. Johnson @ 2002-03-12 14:30 UTC (permalink / raw)
  To: wli; +Cc: Andrea Arcangeli, wli, linux-kernel, riel, hch, phillips

On Tue, 12 Mar 2002 wli@holomorphy.com wrote:

> On Tue, Mar 12, 2002 at 08:33:23AM -0500, Richard B. Johnson wrote:
> > This is a simple random number generator. It takes a pointer to your
> > own private long, somewhere in your code, and returns a long random
> > number with a period of 0xfffd4011. I ran a program for about a
> > year, trying to find a magic number that will produce a longer
> > period.
> > 
> > You could add a ldiv and return the modulus to set hash-table limits.
> > ANDs are not good because, in principle, you could get many numbers
> > in which all the low bits are zero.
> > 
> > 
> > The advantage of this simple code is that it is fast. The disadvantages
> > are, of course, that it's not portable and that a rotation of a binary
> > number is not a mathematical function that lends itself to rigorous
> > analysis.
> 
> Would you mind explaining what the point of this is? AFAICT this is
> meaningless noise inspired by the words "/dev/random".
> 
> 
> Bill
> 
Really?


Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 14:25         ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-12 14:32           ` Daniel Phillips
  2002-03-15 17:20           ` 2.4.19pre2aa1 Horst von Brand
  1 sibling, 0 replies; 53+ messages in thread
From: Daniel Phillips @ 2002-03-12 14:32 UTC (permalink / raw)
  To: Andrea Arcangeli, Daniel Phillips
  Cc: Rik van Riel, wli, Richard B. Johnson, linux-kernel, hch

On March 12, 2002 03:25 pm, Andrea Arcangeli wrote:
> On Tue, Mar 12, 2002 at 01:21:35PM +0100, Daniel Phillips wrote:
> > On March 12, 2002 12:47 pm, Andrea Arcangeli wrote:
> > > If it's pure random, mul will make no difference to
> > > the distribution. And the closer we are to pure random, like in the
> > > wait_table hash, the less the mul will help and the more important it
> > > will be to just get the two contiguous pages into the same cacheline
> > > and nothing else.
> > 
> > You're ignoring the possibility (probability) of corner cases.  I'm not
> 
> The corner cases cannot go away with a mul.

Oh but they do[1].  If there's a major point you're missing it has to be that.

[1] Not entirely, of course, as Bill already pointed out clearly enough.
'Go away' always means 'get less common' with respect to hash functions;
that's the best you can do.

-- 
Daniel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 14:14       ` 2.4.19pre2aa1 wli
@ 2002-03-12 15:04         ` Andrea Arcangeli
  2002-03-12 23:31           ` 2.4.19pre2aa1 wli
  0 siblings, 1 reply; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-12 15:04 UTC (permalink / raw)
  To: wli, wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 02:14:39PM +0000, wli@holomorphy.com wrote:
> The assumption here is then that the input distribution is uniform.

Yes.

> On Tue, Mar 12, 2002 at 11:29:00AM +0000, wli@holomorphy.com wrote:
> >> I will either sort out the results I have on the waitqueues or rerun
> >> tests at my leisure.
> 
> On Tue, Mar 12, 2002 at 01:56:05PM +0100, Andrea Arcangeli wrote:
> > I'm waiting for it, if you've monitor patches I can run something too.
> > I'd like to count the number of collisions over time in particular.
> 
> This is a fairly crude statistic but one that I'm collecting. The
> hashtable profiling is something I've been treating as a userspace
> issue with the state dumps done by mmapping /dev/kmem, and in fact,

I'm not sure if the collisions will be frequent enough to show up in
kmem. I think collisions in this hash should never happen except in very
unlikely cases (one collision every few seconds would be ok for
example). So I guess we'd better use a synchronous method to count the
number of collisions with proper kernel code for that. 
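The synchronous counting suggested here could look roughly like the following userspace model. Every name and the table geometry are hypothetical illustrations, not the 2.4 wait_table code; page id 0 is reserved as a "free bucket" sentinel:

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)

/* Hypothetical model: bump a counter whenever a bucket is already
 * occupied by a waiter on a *different* page. */
static uint32_t bucket_owner[TABLE_SIZE];
static unsigned long collision_count;

static unsigned hash_page(uint32_t page_id)
{
    return (unsigned)((page_id * 2654435761u) >> (32 - TABLE_BITS));
}

static void note_wait(uint32_t page_id)
{
    unsigned h = hash_page(page_id);

    if (bucket_owner[h] && bucket_owner[h] != page_id)
        collision_count++;      /* two distinct pages share a wait queue */
    bucket_owner[h] = page_id;
}
```

In the kernel the equivalent increment would sit on the wait_table lookup path and be read back through /proc or a similar interface.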

> I did not write that myself, that is due to Anton Blanchard and Rusty
> Russell, who initiated this area of study, and have made the more
> important contributions to it. In this instance, you can see as well
> as I that the references date to the 50's and 60's, and that accurate
> measurement is the truly difficult aspect of the problem. Formulae to
> grind statistics into single numbers are nothing but following
> directions, and these have been precisely the portions of the effort
> for which I've been responsible, aside from perhaps some of the sieving
> routines, with which I had assistance from Krystof and Fare from OPN,
> and these are not especially significant in terms of effort either.
> Anton and Rusty have been the ones with the insight to identify the
> problematic areas and also to find methods of collecting data.
> 
> 
> On Tue, Mar 12, 2002 at 11:29:00AM +0000, wli@holomorphy.com wrote:
> >> Note I am only trying to help you avoid shooting yourself in the foot.
> 
> On Tue, Mar 12, 2002 at 01:56:05PM +0100, Andrea Arcangeli wrote:
> > If I've got rescheduling problems you're likely to have them too, I'm
> > quite sure; if anything, the fact that I keep the hashtable large enough
> > will make me less bitten by the collisions. The only certain way to
> > avoid risking regressions would be to back out the wait_table part that
> > was merged in mainline 19pre2. The page_zone thing cannot generate any
> > regression, for instance (same is true for page_address); the wait_table
> > part is a gray area instead. It's just a heuristic like all the hashes,
> > and you can always have a corner case bite hard; it's just that the
> > probability of such a corner case happening has to be small enough to be
> > acceptable. But you can always be unlucky, no matter whether you mul or
> > not: you cannot predict which pages people will wait on next from
> > userspace.
> 
> I have some small reservations about these assertions.
> (1) the scheduling storms are direct results of collisions in the hash
> 	tables, which are direct consequences of the hash function quality.

You mean in _the_ hash table (the wait_table hash table). Collisions in
all the other hash tables in the kernel don't lead to scheduling storms,
of course.

> 	OTOH increasing table size is one method of collision reduction,
> 	albeit one I would like extremely strict control over, and one
> 	that should not truly be considered until the load is high
> 	(e.g. 0.75 or thereabouts)

Yes; however, strict control over 52k on a 256G machine looked a bit
excessive, when spending a minor amount of additional RAM would greatly
reduce the probability of collisions. I'm all for spending RAM _if_ it
pays off significantly.

> (2) page_address() has been seen to cause small regressions on
> 	architectures where the additional memory reference in comparison

That's a minor global regression, but a linear one without corner
cases, and the additional RAM available to userspace may make more of a
difference as well, depending on the workload and the size of the RAM.

> 	to a __va((page - mem_map) << PAGE_SHIFT) implementation is
> 	measurable. In my mind, several generic variants are required for
> 	optimality and I will either personally implement them or endorse
> 	any suitably clean implementations of them (clean is the hard part).
> (3) Hashing is not mere heuristics. A hash function is a sufficiently
> 	small portion of an implementation of a table lookup scheme
> 	that the search may be guided by statistical measurements of
> 	both the effectiveness of the hash function in distributing
> 	keys across buckets and net profiling results. The fact I
> 	restricted the space I searched to Fibonacci hashing is
> 	irrelevant; it was merely a tractable search subspace.
> (4) The notion that a hashing scheme may be defeated by a sufficiently
> 	unlucky input distribution is irrelevant. This has been known
> 	(and I suppose feared) since its invention in the early 1950's.
> 	Hash tables are nondeterministic data structures by design, and
> 	their metrics of performance are either expected or amortized
> 	running time, not worst case, which is always that of linear search.
> (5) Last, but not least, the /dev/random argument is bogus. Mass
> 	readings of pseudorandom number generators do not make for	
> 	necessarily uniformly distributed trials or samples. By and
> 	large the methods by which pseudorandom number generators are
> 	judged (and rightly so) are various kinds of statistical relations
> 	between some (fixed for the duration of the trial) number of
> 	successively generated numbers. For instance, groups of N (where
> 	N is small and fixed for the duration of the trial) consecutively
> 	generated numbers should not lie within the same hyperplane for
> 	some applications. These have no relation to the properties
> 	which would cause a distribution of a set of numbers which is
> 	measured as a snapshot in time to pass a test for uniformity.

/dev/random is not a pseudorandom number generator. /dev/urandom is.

> 
> 
> Otherwise we appear to be largely in agreement.

Yes.

Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 15:04         ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-12 23:31           ` wli
  2002-03-13  0:09             ` 2.4.19pre2aa1 Andrew Morton
  0 siblings, 1 reply; 53+ messages in thread
From: wli @ 2002-03-12 23:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 02:14:39PM +0000, wli@holomorphy.com wrote:
>> The assumption here is then that the input distribution is uniform.

On Tue, Mar 12, 2002 at 04:04:30PM +0100, Andrea Arcangeli wrote:
> Yes.

This is a much stronger assertion than randomness alone. Random
variables may have distributions concentrated on low-dimensional
surfaces (or even singleton points!), which is a case that should be
excluded. Various other distributions are probably not what you want
either.

To be clear, I am not operating under the assumption that the input
distribution is uniform (or anything in particular).


At some point in the past, I wrote:
>> This is a fairly crude statistic but one that I'm collecting. The
>> hashtable profiling is something I've been treating as a userspace
>> issue with the state dumps done by mmapping /dev/kmem, and in fact,

On Tue, Mar 12, 2002 at 04:04:30PM +0100, Andrea Arcangeli wrote:
> I'm not sure if the collisions will be frequent enough to show up in
> kmem. I think collisions in this hash should never happen except in very
> unlikely cases (one collision every few seconds would be ok for
> example). So I guess we'd better use a synchronous method to count the
> number of collisions with proper kernel code for that. 

That is pretty easy to do.


At some point in the past, I wrote:
>> I have some small reservations about these assertions.
>> (1) the scheduling storms are direct results of collisions in the hash
>> 	tables, which are direct consequences of the hash function quality.

On Tue, Mar 12, 2002 at 04:04:30PM +0100, Andrea Arcangeli wrote:
> You mean in _the_ hash table (the wait_table hash table). Collisions in
> all the other hash tables in the kernel don't lead to scheduling storms,
> of course.

There is more than one wait_table, one per node or per zone.


At some point in the past, I wrote:
>> 	OTOH increasing table size is one method of collision reduction,
>> 	albeit one I would like extremely strict control over, and one
>> 	that should not truly be considered until the load is high
>> 	(e.g. 0.75 or thereabouts)

On Tue, Mar 12, 2002 at 04:04:30PM +0100, Andrea Arcangeli wrote:
> Yes; however, strict control over 52k on a 256G machine looked a bit
> excessive, when spending a minor amount of additional RAM would greatly
> reduce the probability of collisions. I'm all for spending RAM _if_ it
> pays off significantly.

Also, there are already too many boot-time allocations proportional to
memory. At the very least consider setting a hard constant upper bound
with some small constant of proportionality to PID_MAX.
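The bound suggested here might be sketched as follows. The scaling rule and every constant are purely illustrative assumptions for the sketch, not what any kernel actually does:

```c
#include <assert.h>

#define PID_MAX 32768ul   /* illustrative value */

/* Sketch: scale the wait table with zone size, but clamp it between a
 * small floor and a hard ceiling proportional to PID_MAX, since the
 * number of simultaneous waiters is bounded by the number of tasks. */
static unsigned long wait_table_entries(unsigned long pages_in_zone)
{
    unsigned long size = pages_in_zone >> 10;   /* ~1 entry per 1024 pages */
    unsigned long cap  = 4ul * PID_MAX;

    if (size < 256)
        size = 256;         /* the lower bound discussed below */
    if (size > cap)
        size = cap;
    return size;
}
```

A real implementation would also round the result to a power of two so the hash can index with a mask.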

If I may wander (further?) off-topic and stand on the soapbox for a moment:
I personally believe this is serious enough that various actions should
be taken so that we don't die a death by a thousand "every mickey mouse
hash table in the kernel allocating 0.5% of RAM at boot time" cuts.
Note that we are already in such dire straits that sufficiently large
physical memory sizes on 36-bit physically-addressed 32-bit machines
alone will experience consumption of the entire kernel virtual address
space by various boot-time allocations. This is a bug.

The lower bound of 256 looks intriguing; if it's needed then it should
be done, but if it is needed it begins to raise the question of whether
this fragment of physical memory is worthwhile or should be ignored. On
i386 this would be a fragment of size 1MB or smaller packaged into a
pgdat or zone. There is also an open question as to how much overhead
introducing nodes with tiny amounts of memory like this creates. I
think there may be a reason why they're commonly ignored. I suspect the
real issue here may be that contention for memory doesn't decrease with
the shrinking size of a region of physical memory either.


At some point in the past, I wrote:
>> (2) page_address() has been seen to cause small regressions on
>> 	architectures where the additional memory reference in comparison

On Tue, Mar 12, 2002 at 04:04:30PM +0100, Andrea Arcangeli wrote:
> That's a minor global regression, but a linear one without corner
> cases, and the additional RAM available to userspace may make more of a
> difference as well, depending on the workload and the size of the RAM.

True, and (aside from how I would phrase it) this is my rationale for
not taking immediate action. For architectures where the interaction of
the address calculation arithmetic, arithmetic feature sets, and
numerical properties of structure sizes and so on is particularly
unfortunate around this code (I believe SPARC is one example and there
may be others), something will have to be done whether it's using a
different computation or just using the memory.


At some point in the past, I wrote:
>> (5) Last, but not least, the /dev/random argument is bogus. Mass
>> 	readings of pseudorandom number generators do not make for	
>> 	necessarily uniformly distributed trials or samples. By and
>> 	large the methods by which pseudorandom number generators are
>> 	judged (and rightly so) are various kinds of statistical relations
>> 	between some (fixed for the duration of the trial) number of
>> 	successively generated numbers. For instance, groups of N (where
>> 	N is small and fixed for the duration of the trial) consecutively
>> 	generated numbers should not lie within the same hyperplane for
>> 	some applications. These have no relation to the properties
>> 	which would cause a distribution of a set of numbers which is
>> 	measured as a snapshot in time to pass a test for uniformity.

On Tue, Mar 12, 2002 at 04:04:30PM +0100, Andrea Arcangeli wrote:
> /dev/random is not a pseudorandom number generator. /dev/urandom is.

Right right. Random number generation is a different problem regardless.


At some point in the past, I wrote:
>> Otherwise we appear to be largely in agreement.

On Tue, Mar 12, 2002 at 04:04:30PM +0100, Andrea Arcangeli wrote:
> Yes.

We're getting down to details too minor to bother with.

Enough for now? Shall we meet with benchmarks next time?


Cheers,
Bill

P.S.:	Note that I do maintain my code. If you do have demonstrable
	improvements or even cleanups I will review them and endorse
	them if they pass my review. These changes did not. Also,
	these changes to the hashing scheme were not separated out
	from the rest of the VM patch, so the usual "break this up
	into a separate patch please" applies.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 23:31           ` 2.4.19pre2aa1 wli
@ 2002-03-13  0:09             ` Andrew Morton
  2002-03-13  1:06               ` 2.4.19pre2aa1 wli
                                 ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Andrew Morton @ 2002-03-13  0:09 UTC (permalink / raw)
  To: wli
  Cc: Andrea Arcangeli, wli, Richard B. Johnson, linux-kernel, riel,
	hch, phillips

wli@holomorphy.com wrote:
> 
> Also, these changes to the hashing scheme were not separated out
> from the rest of the VM patch, so the usual "break this up
> into a separate patch please" applies.

FYI, I am doing that at present.  It looks like Andrea's 10_vm-30
patch will end up as twenty or thirty separate patches.  I won't
be testing every darn patch individually - I'll batch them up into
maybe four groups for testing and merging.

Andrea introduced some subtly changed buffer locking rules, and
this causes ext3 to deadlock under heavy load.  Until we sort
this out, I'm afraid that the -aa VM is not suitable for production
use with ext3.

ext2 is OK and I *assume* it's OK with reiserfs.  The problem occurs
when a filesystem performs:

	lock_buffer(dirty_bh);
	allocate_something(GFP_NOFS);

without having locked the buffer's page.  sync_page_buffers()
can perform a wait_on_buffer() against dirty_bh.  (I think.
I'm not quite up-to-speed with the new buffer state bits yet).


-

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  0:09             ` 2.4.19pre2aa1 Andrew Morton
@ 2002-03-13  1:06               ` wli
  2002-03-13  1:24                 ` 2.4.19pre2aa1 Andrew Morton
  2002-03-13  7:30               ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-13  8:12               ` 2.4.19pre2aa1 Daniel Phillips
  2 siblings, 1 reply; 53+ messages in thread
From: wli @ 2002-03-13  1:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, wli, Richard B. Johnson, linux-kernel, riel,
	hch, phillips

wli@holomorphy.com wrote:
>> Also, these changes to the hashing scheme were not separated out
>> from the rest of the VM patch, so the usual "break this up
>> into a separate patch please" applies.

On Tue, Mar 12, 2002 at 04:09:22PM -0800, Andrew Morton wrote:
> FYI, I am doing that at present.  It looks like Andrea's 10_vm-30
> patch will end up as twenty or thirty separate patches.  I won't
> be testing every darn patch individually - I'll batch them up into
> maybe four groups for testing and merging.

There are some okay parts to aa's hashing changes, but we need to
work together to reduce the patch to those and drop the rest,
and also address some style issues. Some of the constants are
adjustable (though it's not clear how they should be adjusted) and the
wait_table_t ADT is fine. The hash function change is not and moving
the table from per-zone to per-node is questionable, as its effects are
not well-understood.

Perhaps understandably so, I'd like to take a hands-on role in
guiding this patch into a form suitable for the mainline kernel.


On Tue, Mar 12, 2002 at 04:09:22PM -0800, Andrew Morton wrote:
> Andrea introduced some subtly changed buffer locking rules, and
> this causes ext3 to deadlock under heavy load.  Until we sort
> this out, I'm afraid that the -aa VM is not suitable for production
> use with ext3.
> ext2 is OK and I *assume* it's OK with reiserfs.  The problem occurs
> when a filesystem performs:
> 	lock_buffer(dirty_bh);
> 	allocate_something(GFP_NOFS);
> without having locked the buffer's page.  sync_page_buffers()
> can perform a wait_on_buffer() against dirty_bh.  (I think.
> I'm not quite up-to-speed with the new buffer state bits yet).

This looks like a change of invariants that could generate a fair
amount of audit work. Ugh...


Cheers,
Bill

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  1:06               ` 2.4.19pre2aa1 wli
@ 2002-03-13  1:24                 ` Andrew Morton
  2002-03-13  7:37                   ` 2.4.19pre2aa1 Daniel Phillips
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2002-03-13  1:24 UTC (permalink / raw)
  To: wli
  Cc: Andrea Arcangeli, wli, Richard B. Johnson, linux-kernel, riel,
	hch, phillips

wli@holomorphy.com wrote:
> 
> Perhaps understandably so, I'd like to take a hands-on role in
> guiding this patch into a form suitable for the mainline kernel.

The hashing changes will become the topmost (ie: last-applied) diff 
for that reason...


> This looks like a change of invariants that could generate a fair
> amount of audit work. Ugh...

In a better world, the VM wouldn't know anything at all about
these "buffer" thingies.

-

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 13:33       ` 2.4.19pre2aa1 Richard B. Johnson
  2002-03-12 14:17         ` 2.4.19pre2aa1 wli
@ 2002-03-13  2:18         ` wli
  2002-03-13 19:06           ` 2.4.19pre2aa1 Richard B. Johnson
  1 sibling, 1 reply; 53+ messages in thread
From: wli @ 2002-03-13  2:18 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: Andrea Arcangeli, wli, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 08:33:23AM -0500, Richard B. Johnson wrote:
> This is a simple random number generator. It takes a pointer to your
> own private long, somewhere in your code, and returns a long random
> number with a period of 0xfffd4011. I ran a program for about a
> year, trying to find a magic number that will produce a longer
> period.

Random number generators are not hash functions. Where is the relevance?
I think you're generating a stream of uniform deviates to test whether
a given hash function performs well with random input.

I'm far more concerned about how the hash function behaves on real input
data and I've been measuring that from the start, and as far as I'm
concerned this microbenchmark is fairly useless. If the benchmark
defeats the hash function, it doesn't demonstrate that the function
behaves poorly on the input distribution I'm interested in. If it
doesn't defeat it, it doesn't demonstrate that the function behaves
well on the input distribution
I'm interested in. If you want to generate useful random numbers for
simulating workloads for hash table benchmarking you should attempt to
measure and simulate the distribution of hash keys from the hash tables
you're interested in and use statistical tests to verify that you're
actually generating numbers with similar distributions. One easy way
to get those might be, say, dumping all the elements of a hash table
by traversing all the collision chains and sampling the keys' fields.


On Tue, Mar 12, 2002 at 08:33:23AM -0500, Richard B. Johnson wrote:
> You could add a ldiv and return the modulus to set hash-table limits.
> ANDs are not good because, in principle, you could get many numbers
> in which all the low bits are zero.

It sounds like you're generating a stream of random deviates and then
hashing them by mapping to their least residues mod the hash table size
with perhaps a division thrown in.  With a long period you're going to
get quite impressive statistics, not that it means anything. =)


On Tue, Mar 12, 2002 at 08:33:23AM -0500, Richard B. Johnson wrote:
> The advantage of this simple code is it works quickly. The disadvantages
> are, of course, it's not portable and a rotation of a binary number
> is not a mathematical function lending itself to rigorous analysis.

Rigorous mathematical analysis does not require analyticity =) or
number-theoretic tractability. There are things called statistical
analysis and asymptotic analysis, which go great together and are
quite wonderful in combination with empirical testing.


On Tue, Mar 12, 2002 at 08:33:23AM -0500, Richard B. Johnson wrote
	(and I added comments to):

> MAGIC = 0xd0047077 
> INTPTR = 0x08
> .section	.text
> .global		rnd
> .type		rnd,@function
> .align	0x04
> rnd:
> 	pushl	%ebx			# save %ebx
> 	movl	INTPTR(%esp), %ebx	# load argument from stack to %ebx
> 	movl	(%ebx), %eax		# deref arg, loading to %eax
> 	rorl	$3, %eax		# roll %eax right by 3
> 	addl	$MAGIC, %eax		# add $MAGIC to %eax
> 	movl	%eax, (%ebx)		# save result
> 	popl	%ebx			# restore %ebx
>         ret
> .end

So you've invented the following function:

f(n) = 2**29 * (n % 8) + floor(n/8) + 0xd0047077    (mod 2**32)

I wonder what several tests will look like for this, but you should
really test your own random number generator.
I have my own patches to work on.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  0:09             ` 2.4.19pre2aa1 Andrew Morton
  2002-03-13  1:06               ` 2.4.19pre2aa1 wli
@ 2002-03-13  7:30               ` Andrea Arcangeli
  2002-03-13  7:55                 ` 2.4.19pre2aa1 Andrew Morton
  2002-03-13 10:57                 ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-13  8:12               ` 2.4.19pre2aa1 Daniel Phillips
  2 siblings, 2 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-13  7:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: wli, wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

[-- Attachment #1: Type: text/plain, Size: 6214 bytes --]

On Tue, Mar 12, 2002 at 04:09:22PM -0800, Andrew Morton wrote:
> wli@holomorphy.com wrote:
> > 
> > Also, these changes to the hashing scheme were not separated out
> > from the rest of the VM patch, so the usual "break this up
> > into a separate patch please" applies.
> 
> FYI, I am doing that at present.  It looks like Andrea's 10_vm-30
> patch will end up as twenty or thirty separate patches.  I won't
> be testing every darn patch individually - I'll batch them up into
> maybe four groups for testing and merging.
> 
> Andrea introduced some subtly changed buffer locking rules, and
> this causes ext3 to deadlock under heavy load.  Until we sort
> this out, I'm afraid that the -aa VM is not suitable for production
> use with ext3.
> 
> ext2 is OK and I *assume* it's OK with reiserfs.  The problem occurs
> when a filesystem performs:
> 
> 	lock_buffer(dirty_bh);
> 	allocate_something(GFP_NOFS);
> 
> without having locked the buffer's page.  sync_page_buffers()
> can perform a wait_on_buffer() against dirty_bh.  (I think.
> I'm not quite up-to-speed with the new buffer state bits yet).

The deadlock I fixed with BH_Pending_IO was only related to
write_some_buffers. What you found now is an additional deadlock that
has been fixed by further changes in mainline, alternative to
BH_Pending_IO but less restrictive.

The reason I kept using BH_Pending_IO is that without it the VM cannot
throttle correctly on locked buffers. With BH_Launder the VM is only
able to throttle on buffers that the VM submitted by itself (and
incidentally it doesn't deadlock for you). So an excessive amount of
locked buffers submitted by bdflush/kupdate would leave the VM unable
to throttle properly. This matters most on lowmem machines, where the
amount of locked memory can be substantial thanks to the fairly large
I/O queues (especially with more than one disk).

You can see the attached emails for historical reference.

Now, I think the best fix is to just drop BH_Pending_IO and to add a
BH_Launder similar to mainline's, but used in a completely different
manner. The way mainline uses it is too permissive for those locked
buffers just submitted to the I/O layer.

In short, I will set BH_Launder in submit_bh instead of
sync_page_buffers; this way I'll be sure that if BH_Launder is set I
can safely wait on the locked buffer, because it's visible to the I/O
layer and unplugging the queue will be enough to get it unlocked
eventually. BH_Launder will be cleared in unlock_buffer as usual. That
is the _right_ usage of BH_Launder; I simply wasn't doing that because
I hadn't noticed this further deadlock yet, I was only aware of the
write_some_buffers vs GFP_NOHIGHIO deadlock so far. This new usage is
the complement of my BH_Pending_IO: now I can wait if BH_Launder is
set, while previously I couldn't wait if BH_Pending_IO was set. The
difference is that BH_Pending_IO wasn't covering all the cases, while
BH_Launder used in my manner really covers all the cases (the
BH_Launder in mainline doesn't cover all the cases either, but there
the only downside is that it won't be able to write-throttle, while I
could deadlock).

What do you think about this patch against 2.4.19pre3aa1?

Unless somebody finds anything wrong, I will release a new -aa tree and
vm-32 with it applied. I think it's the best approach to cure the
deadlock and to still be able to throttle on the locked buffers.

diff -urN ref/drivers/block/ll_rw_blk.c launder/drivers/block/ll_rw_blk.c
--- ref/drivers/block/ll_rw_blk.c	Wed Mar 13 08:24:07 2002
+++ launder/drivers/block/ll_rw_blk.c	Wed Mar 13 08:12:40 2002
@@ -1025,6 +1025,7 @@
 		BUG();
 
 	set_bit(BH_Req, &bh->b_state);
+	set_bit(BH_Launder, &bh->b_state);
 
 	/*
 	 * First step, 'identity mapping' - RAID or LVM might
diff -urN ref/fs/buffer.c launder/fs/buffer.c
--- ref/fs/buffer.c	Wed Mar 13 08:24:07 2002
+++ launder/fs/buffer.c	Wed Mar 13 08:26:02 2002
@@ -171,6 +171,7 @@
 		brelse(bh);
 		if (ret == 0) {
 			clear_bit(BH_Wait_IO, &bh->b_state);
+			clear_bit(BH_Launder, &bh->b_state);
 			smp_mb__after_clear_bit();
 			if (waitqueue_active(&bh->b_wait)) {
 				wake_up(&bh->b_wait);
@@ -184,6 +185,7 @@
 {
 	clear_bit(BH_Wait_IO, &bh->b_state);
 	clear_bit(BH_Lock, &bh->b_state);
+	clear_bit(BH_Launder, &bh->b_state);
 	smp_mb__after_clear_bit();
 	if (waitqueue_active(&bh->b_wait))
 		wake_up(&bh->b_wait);
@@ -238,7 +240,6 @@
 	do {
 		struct buffer_head * bh = *array++;
 		bh->b_end_io = end_buffer_io_sync;
-		clear_bit(BH_Pending_IO, &bh->b_state);
 		submit_bh(WRITE, bh);
 	} while (--count);
 }
@@ -278,7 +279,6 @@
 		if (atomic_set_buffer_clean(bh)) {
 			__refile_buffer(bh);
 			get_bh(bh);
-			set_bit(BH_Pending_IO, &bh->b_state);
 			array[count++] = bh;
 			if (count < NRSYNC)
 				continue;
@@ -2704,13 +2704,12 @@
 			continue;
 		}
 
-		if (unlikely(buffer_pending_IO(bh))) {
-			tryagain = 0;
-			continue;
-		}
-
 		/* Second time through we start actively writing out.. */
 		if (test_and_set_bit(BH_Lock, &bh->b_state)) {
+			if (unlikely(!buffer_launder(bh))) {
+				tryagain = 0;
+				continue;
+			}
 			wait_on_buffer(bh);
 			tryagain = 1;
 			continue;
diff -urN ref/include/linux/fs.h launder/include/linux/fs.h
--- ref/include/linux/fs.h	Wed Mar 13 08:24:11 2002
+++ launder/include/linux/fs.h	Wed Mar 13 08:23:59 2002
@@ -232,7 +232,7 @@
 	BH_New,		/* 1 if the buffer is new and not yet written out */
 	BH_Async,	/* 1 if the buffer is under end_buffer_io_async I/O */
 	BH_Wait_IO,	/* 1 if we should write out this buffer */
-	BH_Pending_IO,	/* 1 if the buffer is locked but not in the I/O queue yet */
+	BH_Launder,	/* 1 if we can throttle on this buffer */
 	BH_JBD,		/* 1 if it has an attached journal_head */
 	BH_Delay,	/* 1 if the buffer is delayed allocate */
 
@@ -295,7 +295,7 @@
 #define buffer_mapped(bh)	__buffer_state(bh,Mapped)
 #define buffer_new(bh)		__buffer_state(bh,New)
 #define buffer_async(bh)	__buffer_state(bh,Async)
-#define buffer_pending_IO(bh)	__buffer_state(bh,Pending_IO)
+#define buffer_launder(bh)	__buffer_state(bh,Launder)
 #define buffer_delay(bh)	__buffer_state(bh,Delay)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)

Andrea

[-- Attachment #2: Type: message/rfc822, Size: 4788 bytes --]

From: Andrea Arcangeli <andrea@suse.de>
To: Robert Macaulay <robert_macaulay@dell.com>
Cc: Rik van Riel <riel@conectiva.com.br>, Craig Kulesa <ckulesa@as.arizona.edu>, linux-kernel@vger.kernel.org, Bob Matthews <bmatthews@redhat.com>, Marcelo Tosatti <marcelo@conectiva.com.br>, Linus Torvalds <torvalds@transmeta.com>
Subject: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)]
Date: Fri, 28 Sep 2001 00:13:21 +0200
Message-ID: <20010928001321.L14277@athlon.random>

On Wed, Sep 26, 2001 at 08:36:51PM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 26, 2001 at 01:17:29PM -0500, Robert Macaulay wrote:
> > We've tried the 2.4.10 with vmtweaks2 on out machine with 8GB RAM. It was 
> > looking good for a while, until it just stopped. Here is what was 
> > happening on the machine.
> > 
> > I was ftping files into the box at a rate of about 8MB/sec. This continued 
> > until all the RAM was in the  cache column. This was earlier in the 
> > included vmstat output. The I started a dd if=/dev/sde of=/dev/null in a 
> > new window.
> > 
> > All was looking good until it just stopped. I captured the vmstat below. 
> > vmstat continued running for about 1 minute, then it died too. What other 
> > info can I provide?
> 
> the best/first info in this case would be sysrq+T along with the system.map.

Ok, both your trace and Bob's trace show the problem clearly. thanks
to both for the helpful feedback btw.

The deadlock happens because of a collision between write_some_buffers()
and the GFP_NOHIGHIO logic. The deadlock was not introduced in the vm
rewrite but it was introduced with the nohighio logic.

The problem is that we are locking a couple of buffers, and later - after
they're all locked - we start writing them via write_locked_buffers.

The deadlock happens in the middle of write_locked_buffers when we hit
a highmem buffer: while allocating with GFP_NOHIGHIO we end up doing
sync_page_buffers on any page that isn't highmem, but such a page may
incidentally hold one of the other buffers in the array that we
previously locked in write_some_buffers but that aren't in the I/O
queue yet (so we'll wait forever, since they depend on us to be
written).

Robert just confirmed that dropping the NOHIGHIO logic fixes the
problem.

So the fix is either:

1) to drop the NOHIGHIO logic like my test patch did
2) or to keep track of what buffers we must not wait while releasing
   ram

I'll try approach 2) in the untested patch below (the nohighio logic
makes sense so I'd prefer not to drop it). Robert and Bob, can you
give it a spin on the highmem boxes and check if it helps?

I suggest to test it on top of 2.4.10+vm-tweaks-2.

--- 2.4.10aa2/fs/buffer.c.~1~	Wed Sep 26 18:45:29 2001
+++ 2.4.10aa2/fs/buffer.c	Fri Sep 28 00:04:44 2001
@@ -194,6 +194,7 @@
 		struct buffer_head * bh = *array++;
 		bh->b_end_io = end_buffer_io_sync;
 		submit_bh(WRITE, bh);
+		clear_bit(BH_Pending_IO, &bh->b_state);
 	} while (--count);
 }
 
@@ -225,6 +226,7 @@
 		if (atomic_set_buffer_clean(bh)) {
 			__refile_buffer(bh);
 			get_bh(bh);
+			set_bit(BH_Pending_IO, &bh->b_state);
 			array[count++] = bh;
 			if (count < NRSYNC)
 				continue;
@@ -2519,7 +2521,9 @@
 	int tryagain = 1;
 
 	do {
-		if (buffer_dirty(p) || buffer_locked(p)) {
+		if (unlikely(buffer_pending_IO(p)))
+			tryagain = 0;
+		else if (buffer_dirty(p) || buffer_locked(p)) {
 			if (test_and_set_bit(BH_Wait_IO, &p->b_state)) {
 				if (buffer_dirty(p)) {
 					ll_rw_block(WRITE, 1, &p);
--- 2.4.10aa2/include/linux/fs.h.~1~	Wed Sep 26 18:51:25 2001
+++ 2.4.10aa2/include/linux/fs.h	Fri Sep 28 00:01:54 2001
@@ -217,6 +217,7 @@
 	BH_New,		/* 1 if the buffer is new and not yet written out */
 	BH_Async,	/* 1 if the buffer is under end_buffer_io_async I/O */
 	BH_Wait_IO,	/* 1 if we should throttle on this buffer */
+	BH_Pending_IO,	/* 1 if the buffer is locked but not in the I/O queue yet */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -277,6 +278,7 @@
 #define buffer_mapped(bh)	__buffer_state(bh,Mapped)
 #define buffer_new(bh)		__buffer_state(bh,New)
 #define buffer_async(bh)	__buffer_state(bh,Async)
+#define buffer_pending_IO(bh)	__buffer_state(bh,Pending_IO)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  1:24                 ` 2.4.19pre2aa1 Andrew Morton
@ 2002-03-13  7:37                   ` Daniel Phillips
  0 siblings, 0 replies; 53+ messages in thread
From: Daniel Phillips @ 2002-03-13  7:37 UTC (permalink / raw)
  To: Andrew Morton, wli
  Cc: Andrea Arcangeli, wli, Richard B. Johnson, linux-kernel, riel,
	hch, phillips

On March 13, 2002 02:24 am, Andrew Morton wrote:
> wli@holomorphy.com wrote:
> > This looks like a change of invariants that could generate a fair
> > amount of audit work. Ugh...
> 
> In a better world, the VM wouldn't know anything at all about
> these "buffer" thingies.

;-)

-- 
Daniel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  7:30               ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-13  7:55                 ` Andrew Morton
  2002-03-13  8:06                   ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-13 10:57                 ` 2.4.19pre2aa1 Andrea Arcangeli
  1 sibling, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2002-03-13  7:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: wli, wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

Andrea Arcangeli wrote:
> 
> ...
> What do you think about this patch against 2.4.19pre3aa1?

Well it won't apply on 10_vm-30, but that's OK...

So BH_Launder is set when we start I/O and is cleared on
I/O completion.   That sounds fine - clearly it is always
safe to throttle on these buffers.

Thanks - I'll stress-test it tomorrow.
 
-

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  7:55                 ` 2.4.19pre2aa1 Andrew Morton
@ 2002-03-13  8:06                   ` Andrea Arcangeli
  0 siblings, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-13  8:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: wli, wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Tue, Mar 12, 2002 at 11:55:27PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > ...
> > What do you think about this patch against 2.4.19pre3aa1?
> 
> Well it won't apply on 10_vm-30, but that's OK...

Yes, I will make a new vm-31 shortly anyways.

> So BH_Launder is set when we start I/O and is cleared on
> I/O completion.   That sounds fine - clearly it is always
> safe to throttle on these buffers.
> 
> Thanks - I'll stress-test it tomorrow.

Fine, thanks again!

Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  0:09             ` 2.4.19pre2aa1 Andrew Morton
  2002-03-13  1:06               ` 2.4.19pre2aa1 wli
  2002-03-13  7:30               ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-13  8:12               ` Daniel Phillips
  2 siblings, 0 replies; 53+ messages in thread
From: Daniel Phillips @ 2002-03-13  8:12 UTC (permalink / raw)
  To: Andrew Morton, wli
  Cc: Andrea Arcangeli, wli, Richard B. Johnson, linux-kernel, riel,
	hch, phillips

On March 13, 2002 01:09 am, Andrew Morton wrote:
> Andrea introduced some subtly changed buffer locking rules, and
> this causes ext3 to deadlock under heavy load.  Until we sort
> this out, I'm afraid that the -aa VM is not suitable for production
> use with ext3.
> 
> ext2 is OK and I *assume* it's OK with reiserfs.  The problem occurs
> when a filesystem performs:
> 
> 	lock_buffer(dirty_bh);
> 	allocate_something(GFP_NOFS);
> 
> without having locked the buffer's page.  sync_page_buffers()
> can perform a wait_on_buffer() against dirty_bh.  (I think.
> I'm not quite up-to-speed with the new buffer state bits yet).

Note that this is a perfect example of the tangle-of-states problem we
discussed on IRC.  This problem comes up because of the unholy way in
which pages and buffers are tangled together (page->buffer->page...).
The solution imho is to get rid of buffer_heads and divide their
current functionality between struct page and struct bio.

Cleaning up the state mess is just one of the reasons for doing it, of
course, but it's not all that unimportant.

-- 
Daniel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  7:30               ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-13  7:55                 ` 2.4.19pre2aa1 Andrew Morton
@ 2002-03-13 10:57                 ` Andrea Arcangeli
  2002-03-13 13:51                   ` 2.4.19pre2aa1 Rik van Riel
  1 sibling, 1 reply; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: wli, wli, Richard B. Johnson, linux-kernel, riel, hch, phillips

On Wed, Mar 13, 2002 at 08:30:55AM +0100, Andrea Arcangeli wrote:
>  {
>  	clear_bit(BH_Wait_IO, &bh->b_state);
>  	clear_bit(BH_Lock, &bh->b_state);
> +	clear_bit(BH_Launder, &bh->b_state);
>  	smp_mb__after_clear_bit();
>  	if (waitqueue_active(&bh->b_wait))

actually, while refining the patch and integrating it, I audited it
some more carefully and found the above was wrong: we must ensure
nobody is able to acquire BH_Lock before BH_Launder is clear, so we
must also enforce ordering at the CPU level. This is the correct
version.

	clear_bit(BH_Wait_IO, &bh->b_state);
	clear_bit(BH_Launder, &bh->b_state);
	/* nobody must acquire BH_Lock again before BH_Launder is clear */
	smp_mb__after_clear_bit();
	clear_bit(BH_Lock, &bh->b_state);

The race would have been nearly impossible to trigger during stress
testing; you'd need BH_Lock + GFP_NOFS + alloc_pages + shrink_cache +
the interesting page near the end of the lru + try_to_free_buffers +
sync_page_buffers + wait_on_buffer all in a few cycles (irq handlers
could trigger it :). But with the huge userbase I can't exclude that
somebody could have been really hurt by it, and anyway it would still
be a common-code bug even if it could not be reproduced with currently
available hardware.

Note that the above very same race can happen in mainline too on
architectures where a clear_bit doesn't imply a strong CPU barrier.
So on paper your same kind of deadlock could happen on mainline as
well, but there it is reduced to an SMP race and it would affect only
alpha, ppc and s390. So I'm glad my deadlock has been useful to fix
another SMP race condition at least :)

Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13 10:57                 ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-13 13:51                   ` Rik van Riel
  2002-03-13 14:03                     ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-13 19:19                     ` 2.4.19pre2aa1 Andrew Morton
  0 siblings, 2 replies; 53+ messages in thread
From: Rik van Riel @ 2002-03-13 13:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, wli, wli, Richard B. Johnson, linux-kernel, hch,
	phillips

On Wed, 13 Mar 2002, Andrea Arcangeli wrote:
> On Wed, Mar 13, 2002 at 08:30:55AM +0100, Andrea Arcangeli wrote:
> >  {
> >  	clear_bit(BH_Wait_IO, &bh->b_state);
> >  	clear_bit(BH_Lock, &bh->b_state);
> > +	clear_bit(BH_Launder, &bh->b_state);
> >  	smp_mb__after_clear_bit();
> >  	if (waitqueue_active(&bh->b_wait))
>
> actually, while refining the patch and integrating it, I audited it some
> more carefully and the above was wrong,

It's complex.

Would there be a way to simplify the thing so the author
of the code can at least get it right and there's a chance
of other people understanding it too ? ;)

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13 13:51                   ` 2.4.19pre2aa1 Rik van Riel
@ 2002-03-13 14:03                     ` Andrea Arcangeli
  2002-03-13 19:19                     ` 2.4.19pre2aa1 Andrew Morton
  1 sibling, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-13 14:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, wli, wli, Richard B. Johnson, linux-kernel, hch,
	phillips

On Wed, Mar 13, 2002 at 10:51:26AM -0300, Rik van Riel wrote:
> On Wed, 13 Mar 2002, Andrea Arcangeli wrote:
> > On Wed, Mar 13, 2002 at 08:30:55AM +0100, Andrea Arcangeli wrote:
> > >  {
> > >  	clear_bit(BH_Wait_IO, &bh->b_state);
> > >  	clear_bit(BH_Lock, &bh->b_state);
> > > +	clear_bit(BH_Launder, &bh->b_state);
> > >  	smp_mb__after_clear_bit();
> > >  	if (waitqueue_active(&bh->b_wait))
> >
> > actually, while refining the patch and integrating it, I audited it some
> > more carefully and the above was wrong,
> 
> It's complex.

The SMP kernel is complex, and preempt+SMP is even more complex. If
you find a design solution that is simpler or/and more efficient for
identifying which locked buffers have just been submitted for I/O, let
me know ASAP; I can't think of a better/simpler one, and the locking
rules here IMHO are very simple, nothing remotely comparable to other
parts of the kernel. In fact I think it is the simplicity of this fix
that renders it so obviously right, and why Andrew as well could reply
to me this morning with an agreement that this is the right fix.

It is as simple as:

	when a locked buffer is visible to the I/O layer BH_Launder is set

This means that before unlocking we must clear BH_Launder, mb() on
alpha, and then clear BH_Lock, so no reader can see BH_Launder set on
an unlocked buffer and then risk a deadlock.

I think it is very simple and clean. If you want to know something way
more complex than that just ask or alternatively grep for:

	grep preempt 2.5.7-pre1/kernel/sched.c 

Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13  2:18         ` 2.4.19pre2aa1 wli
@ 2002-03-13 19:06           ` Richard B. Johnson
  2002-03-13 22:10             ` 2.4.19pre2aa1 wli
  0 siblings, 1 reply; 53+ messages in thread
From: Richard B. Johnson @ 2002-03-13 19:06 UTC (permalink / raw)
  To: wli; +Cc: Andrea Arcangeli, wli, linux-kernel, riel, hch, phillips

On Wed, 13 Mar 2002 wli@holomorphy.com wrote:

[SNIPPED..]
You might want to look at www.eece.unm.edu/faculty/heileman/hash/hash.html
rather than assuming everyone is uninformed. Source-code is provided
for several hashing functions as well as source-code for tests. This
is a relatively old reference although it addresses both the chaos
and fractal methods discussed here, plus chaotic probe strategies
in address hashing.

A fast random number generator is essential to produce meaningful
tests within a reasonable period of time. It is also used within one
of the hashing functions to 'guess' at the direction of displacement
when an insertion slot is not immediately located, as well as the
displacement mechanism for several chaotic methods discussed.

Using your own hash-function as a template for tests of the same
hash-function, as you propose, is unlikely to provide meaningful
results.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13 13:51                   ` 2.4.19pre2aa1 Rik van Riel
  2002-03-13 14:03                     ` 2.4.19pre2aa1 Andrea Arcangeli
@ 2002-03-13 19:19                     ` Andrew Morton
  1 sibling, 0 replies; 53+ messages in thread
From: Andrew Morton @ 2002-03-13 19:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, wli, wli, Richard B. Johnson, linux-kernel, hch,
	phillips

Rik van Riel wrote:
> 
> On Wed, 13 Mar 2002, Andrea Arcangeli wrote:
> > On Wed, Mar 13, 2002 at 08:30:55AM +0100, Andrea Arcangeli wrote:
> > >  {
> > >     clear_bit(BH_Wait_IO, &bh->b_state);
> > >     clear_bit(BH_Lock, &bh->b_state);
> > > +   clear_bit(BH_Launder, &bh->b_state);
> > >     smp_mb__after_clear_bit();
> > >     if (waitqueue_active(&bh->b_wait))
> >
> > actually, while refining the patch and integrating it, I audited it some
> > more carefully and the above was wrong,
> 
> It's complex.
> 
> Would there be a way to simplify the thing so the author
> of the code can at least get it right and there's a chance
> of other people understanding it too ? ;)

I'll be documenting the BH state bits, and sync_page_buffers(),
when it settles down.

> ...
> <insert bitkeeper endorsement here>

You should lubricate it first.

-

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13 19:06           ` 2.4.19pre2aa1 Richard B. Johnson
@ 2002-03-13 22:10             ` wli
  2002-03-14 12:18               ` 2.4.19pre2aa1 Richard B. Johnson
  0 siblings, 1 reply; 53+ messages in thread
From: wli @ 2002-03-13 22:10 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: Andrea Arcangeli, wli, linux-kernel, riel, hch, phillips

On Wed, Mar 13, 2002 at 02:06:48PM -0500, Richard B. Johnson wrote:
> [SNIPPED..]

Nice move. No one will have the foggiest idea without hunting for
my prior message whether your comments on what I said are accurate.


On Wed, Mar 13, 2002 at 02:06:48PM -0500, Richard B. Johnson wrote:
> You might want to look at www.eece.unm.edu/faculty/heileman/hash/hash.html
> rather than assuming everyone is uninformed. Source-code is provided
> for several hashing functions as well as source-code for tests. This
> is a relatively old reference although it addresses both the chaos
> and fractal methods discussed here, plus chaotic probe strategies
> in address hashing.

You've presented a paper that attempts to establish a connection
between chaos theory and hashing by open addressing. I'm not convinced
chaos theory is all that, but some of their measurement techniques
appear useful, and I'll probably try using them, despite the fact Linux
appears to favor hashing by separate chaining.

This URL has not convinced me you are informed, and I don't really want
to be convinced whether you're informed or not. Your contributions to
these threads have been far less than enlightening or helpful thus far.


On Wed, Mar 13, 2002 at 02:06:48PM -0500, Richard B. Johnson wrote:
> A fast random number generator is essential to produce meaningful
> tests within a reasonable period of time. It is also used within one
> of the hashing functions to 'guess' at the direction of displacement
> when an insertion slot is not immediately located, as well as the
> displacement mechanism for several chaotic methods discussed.

I don't buy this, and the reason why is that hashing is obviously
sensitive to the input distribution. To defeat any hash function,
you need only produce a distribution concentrated on a set hashing
to the same number. If your distribution is literally uniform, then
you'll never do better (by one measure) than least residue mod hash
table size. You need to produce a realistic set of test keys. The key
distributions you actually want to hash will be neither of the above.
(Though one should verify it doesn't do poorly on uniform input, it's
little more than a sanity check.)


On Wed, Mar 13, 2002 at 02:06:48PM -0500, Richard B. Johnson wrote:
> Using your own hash-function as a template for tests of the same
> hash-function, as you propose, is unlikely to provide meaningful
> results.

I'm not proposing that. I have no idea why you think I am.

If I may summarize, if you're going to use random number generation
to simulate test inputs, verify the distribution of the test inputs
you generate actually resembles the thing you're trying to simulate.

It's far more productive to get samples of kernel virtual addresses
of the objects whose addresses you're hashing, or inode numbers from
real filesystems, or filenames, or whatever, than to just feed it a
stream of numbers spewed by some random number generator no one's
heard of, and that's not going to change in the least. For a Monte
Carlo simulation of its usage in the kernel, you're going to need to
actually figure out how to simulate those things' distributions.


Hmm, I wouldn't be surprised to get a bunch of "YHBT" messages after
this but the chaos theory paper does have some useful stuff for
analyzing hash table performance. Distributions' entropies and Lyapunov
exponents of hash functions should be easy enough to compute.


Bill

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-13 22:10             ` 2.4.19pre2aa1 wli
@ 2002-03-14 12:18               ` Richard B. Johnson
  2002-03-14 12:47                 ` 2.4.19pre2aa1 Daniel Phillips
  2002-03-14 13:59                 ` 2.4.19pre2aa1 Rik van Riel
  0 siblings, 2 replies; 53+ messages in thread
From: Richard B. Johnson @ 2002-03-14 12:18 UTC (permalink / raw)
  To: wli; +Cc: Andrea Arcangeli, wli, linux-kernel, riel, hch, phillips

On Wed, 13 Mar 2002 wli@holomorphy.com wrote:

> On Wed, Mar 13, 2002 at 02:06:48PM -0500, Richard B. Johnson wrote:
> > [SNIPPED..]
> 
> Nice move. No one will have the foggiest idea without hunting for
> my prior message whether your comments on what I said are accurate.
> 
> 
> On Wed, Mar 13, 2002 at 02:06:48PM -0500, Richard B. Johnson wrote:
> > You might want to look at www.eece.unm.edu/faculty/heileman/hash/hash.html
> > rather than assuming everyone is uninformed. Source-code is provided
> > for several hashing functions as well as source-code for tests. This
> > is a relatively old reference although it addresses both the chaos
> > and fractal methods discussed here, plus chaotic probe strategies
> > in address hashing.
> 
> You've presented a paper that attempts to establish a connection
> between chaos theory and hashing by open addressing. I'm not convinced
> chaos theory is all that, but some of their measurement techniques
> appear useful, and I'll probably try using them, despite the fact Linux
> appears to favor hashing by separate chaining.
> 
> This URL has not convinced me you are informed, and I don't really want
> to be convinced whether you're informed or not. Your contributions to
> these threads have been far less than enlightening or helpful thus far.
> 
[SNIPPED rest of diatribe...]

Listen you incompetent amoeba. I attempted to tell you in a nice
way that your apparent mastery of technical mumbo-jumbo will be
detected by many who have actual knowledge and expertise in the
area in which you pretend to be competent.

That said, have a nice day.


Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-14 12:18               ` 2.4.19pre2aa1 Richard B. Johnson
@ 2002-03-14 12:47                 ` Daniel Phillips
  2002-03-14 13:59                 ` 2.4.19pre2aa1 Rik van Riel
  1 sibling, 0 replies; 53+ messages in thread
From: Daniel Phillips @ 2002-03-14 12:47 UTC (permalink / raw)
  To: root, wli; +Cc: Andrea Arcangeli, wli, linux-kernel, riel, hch, phillips

On March 14, 2002 01:18 pm, Richard B. Johnson wrote:
> On Wed, 13 Mar 2002 wli@holomorphy.com wrote:
> > [big swinging math stuff]
>
> Listen you incompetent amoeba. I attempted to tell you in a nice
> way that your apparent mastery of technical mumbo-jumbo will be
> detected by many who have actual knowledge and expertise in the
> area in which you pretend to be competent.

You were way wide of the mark, though admittedly Bill reacted more 
aggressively than necessary; it's what happens when you use steroids to build 
up your math muscles ;-)

Bill knows what a pseudorandom generator is, and how to use it for testing 
hash functions; so do I.  I don't really like the one you showed, though I 
appreciate the fact it's a couple of assembly instructions and has a decent 
period.  Did you run a spectral test[1] on it?  I'd be surprised if the 
results are pretty, though pleasantly surprised.  I have one myself sitting 
around somewhere that's 2 or 3 instructions long, based on an LSR, which does 
have some analyzable properties.  Though for serious testing I wouldn't use 
it - I'd crack my Numerical Recipes in C or use urandom, taking it on faith 
that the coder was duly diligent.  A few cycles saved evaluating the hash 
just isn't worth it if you then have to wonder if patterns in your generator 
are showing through to your test results.

You missed a lot if you didn't notice the quality of Bill's work on the hash.
I 100% agree with his approach[2].  The fact he managed to satisfy davem 
should tell you a lot - we now have nice, short multiplicative hashes to use 
that get evaluated as fast shift-adds on sparc.  These hashes have *provably* 
good behavior.  Thanks Bill.

[1] see Knuth
[2] well, I sort of put him up to it...

-- 
Daniel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-14 12:18               ` 2.4.19pre2aa1 Richard B. Johnson
  2002-03-14 12:47                 ` 2.4.19pre2aa1 Daniel Phillips
@ 2002-03-14 13:59                 ` Rik van Riel
  2002-03-14 14:02                   ` 2.4.19pre2aa1 Daniel Phillips
  1 sibling, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2002-03-14 13:59 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: wli, Andrea Arcangeli, wli, linux-kernel, hch, phillips

On Thu, 14 Mar 2002, Richard B. Johnson wrote:

> [SNIPPED rest of diatribe...]
>
> Listen you incompetent amoeba.

Likewise.  Let's go over this again:

1) the idea is to have a hashing function that
   performs well for the page cache and/or the
   hashed wait queues

2) input for these two hash functions is not at
   all guaranteed to be random or uniformly spread

3) this means we need a hash function that performs
   well on very much non-random inputs

Now, where does your random number generator fit in this
scenario ?

regards,

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-14 13:59                 ` 2.4.19pre2aa1 Rik van Riel
@ 2002-03-14 14:02                   ` Daniel Phillips
  0 siblings, 0 replies; 53+ messages in thread
From: Daniel Phillips @ 2002-03-14 14:02 UTC (permalink / raw)
  To: Rik van Riel, Richard B. Johnson
  Cc: wli, Andrea Arcangeli, wli, linux-kernel, hch, phillips

On March 14, 2002 02:59 pm, Rik van Riel wrote:
> Now, where does your random number generator fit in this
> scenario ?

He was offering it up as a means of testing the hash, a well-intentioned 
thought.  What he didn't seem to realize is that it's like telling Linus how 
to read an oops.

-- 
Daniel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-15 17:20           ` 2.4.19pre2aa1 Horst von Brand
@ 2002-03-15 16:43             ` Andrea Arcangeli
  0 siblings, 0 replies; 53+ messages in thread
From: Andrea Arcangeli @ 2002-03-15 16:43 UTC (permalink / raw)
  To: Horst von Brand; +Cc: linux-kernel, hch

On Fri, Mar 15, 2002 at 01:20:29PM -0400, Horst von Brand wrote:
> Andrea Arcangeli <andrea@suse.de> said:
> 
> [...]
> 
> AFAIK my current hashfn has never been tested previously on this kind
> of random input of the wait_table pages.
> 
> If the input is really random, anything will do. I.e., just chopping off a
> few not guaranteed-zero bits (probably better low-end) and using that would
> be enough.

indeed, that's basically what I'm doing there (plus a "+ >>" just to use
more of the random bits).

Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: 2.4.19pre2aa1
  2002-03-12 14:25         ` 2.4.19pre2aa1 Andrea Arcangeli
  2002-03-12 14:32           ` 2.4.19pre2aa1 Daniel Phillips
@ 2002-03-15 17:20           ` Horst von Brand
  2002-03-15 16:43             ` 2.4.19pre2aa1 Andrea Arcangeli
  1 sibling, 1 reply; 53+ messages in thread
From: Horst von Brand @ 2002-03-15 17:20 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, hch

Andrea Arcangeli <andrea@suse.de> said:

[...]

> AFAIK my current hashfn has never been tested previously on this kind
> of random input of the wait_table pages.

If the input is really random, anything will do. I.e., just chopping off a
few not guaranteed-zero bits (probably better low-end) and using that would
be enough.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2002-03-15 16:43 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-03-08  2:40 2.4.19pre2aa1 rwhron
  -- strict thread matches above, loose matches on Subject: below --
2002-03-12  4:19 2.4.19pre2aa1 wli
2002-03-12  5:31 ` 2.4.19pre2aa1 wli
2002-03-12  6:36   ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-12  6:06 ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-12 10:46   ` 2.4.19pre2aa1 Rik van Riel
2002-03-12 11:47     ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-12 11:48       ` 2.4.19pre2aa1 wli
2002-03-12 12:21       ` 2.4.19pre2aa1 Daniel Phillips
2002-03-12 14:25         ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-12 14:32           ` 2.4.19pre2aa1 Daniel Phillips
2002-03-15 17:20           ` 2.4.19pre2aa1 Horst von Brand
2002-03-15 16:43             ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-12 11:29   ` 2.4.19pre2aa1 wli
2002-03-12 12:56     ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-12 13:20       ` 2.4.19pre2aa1 Rik van Riel
2002-03-12 13:33       ` 2.4.19pre2aa1 Richard B. Johnson
2002-03-12 14:17         ` 2.4.19pre2aa1 wli
2002-03-12 14:30           ` 2.4.19pre2aa1 Richard B. Johnson
2002-03-13  2:18         ` 2.4.19pre2aa1 wli
2002-03-13 19:06           ` 2.4.19pre2aa1 Richard B. Johnson
2002-03-13 22:10             ` 2.4.19pre2aa1 wli
2002-03-14 12:18               ` 2.4.19pre2aa1 Richard B. Johnson
2002-03-14 12:47                 ` 2.4.19pre2aa1 Daniel Phillips
2002-03-14 13:59                 ` 2.4.19pre2aa1 Rik van Riel
2002-03-14 14:02                   ` 2.4.19pre2aa1 Daniel Phillips
2002-03-12 14:14       ` 2.4.19pre2aa1 wli
2002-03-12 15:04         ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-12 23:31           ` 2.4.19pre2aa1 wli
2002-03-13  0:09             ` 2.4.19pre2aa1 Andrew Morton
2002-03-13  1:06               ` 2.4.19pre2aa1 wli
2002-03-13  1:24                 ` 2.4.19pre2aa1 Andrew Morton
2002-03-13  7:37                   ` 2.4.19pre2aa1 Daniel Phillips
2002-03-13  7:30               ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-13  7:55                 ` 2.4.19pre2aa1 Andrew Morton
2002-03-13  8:06                   ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-13 10:57                 ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-13 13:51                   ` 2.4.19pre2aa1 Rik van Riel
2002-03-13 14:03                     ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-13 19:19                     ` 2.4.19pre2aa1 Andrew Morton
2002-03-13  8:12               ` 2.4.19pre2aa1 Daniel Phillips
2002-03-07  8:21 2.4.19pre2aa1 Andrea Arcangeli
2002-03-07 10:49 ` 2.4.19pre2aa1 William Lee Irwin III
2002-03-07 11:27   ` 2.4.19pre2aa1 Daniel Phillips
2002-03-07 11:47     ` 2.4.19pre2aa1 William Lee Irwin III
2002-03-07 11:46       ` 2.4.19pre2aa1 Daniel Phillips
2002-03-07 17:03   ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-07 20:18     ` 2.4.19pre2aa1 William Lee Irwin III
2002-03-07 20:38       ` 2.4.19pre2aa1 Richard B. Johnson
2002-03-08  0:22         ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-08  0:26           ` 2.4.19pre2aa1 Rik van Riel
2002-03-08  0:11       ` 2.4.19pre2aa1 Andrea Arcangeli
2002-03-07 11:34 ` 2.4.19pre2aa1 Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox