Re: [PATCH] Radix-tree pagecache for 2.5

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] Radix-tree pagecache for 2.5
@ 2002-02-06 21:34 rwhron
  2002-02-06 21:37 ` Rik van Riel
  0 siblings, 1 reply; 87+ messages in thread
From: rwhron @ 2002-02-06 21:34 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel

On Wed, Feb 06, 2002 at 09:44:33AM -0200, Rik van Riel wrote:
> Once you get over 'dbench 16' or so the whole thing basically
> becomes an excercise in how well the system can trigger task
> starvation in get_request_wait.

It's neat you've identified that bottleneck.  

dbench 192 also appears to trigger more swapin/swapout than 
usual with rmap based kernels about 12-15 minutes into the test;
and it remains unusually high for the duration of the run.  
(not a huge amount of swapping, but vmstat 60 shows double digit
numbers, rather than the more typical 0 with occasional single
digits "spikes").  dbench 64 doesn't trigger this behavior on 
my test box.

I want diversity in the workloads.  bonnie++ does a single
thread, and tiobench is doing 1, 2, 4, and 8 threads.  dbench
fits in well at the other end of the spectrum.

The newest bench added to the lineup is OSDB on postgresql.
If everything executes properly, 2.5.3-dj3 (which includes
radix-tree) will win the first timer award for OSDB. :)

-- 
Randy Hron

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-06 21:34 [PATCH] Radix-tree pagecache for 2.5 rwhron
@ 2002-02-06 21:37 ` Rik van Riel
  2002-02-06 22:06   ` rwhron
                     ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Rik van Riel @ 2002-02-06 21:37 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel

On Wed, 6 Feb 2002 rwhron@earthlink.net wrote:
> On Wed, Feb 06, 2002 at 09:44:33AM -0200, Rik van Riel wrote:
> > Once you get over 'dbench 16' or so the whole thing basically
> > becomes an excercise in how well the system can trigger task
> > starvation in get_request_wait.
>
> It's neat you've identified that bottleneck.

Umm, there's one thing you need to remember about these
high dbench loads though.

They run fastest when you run each of the dbench forks
sequentially and have the others stuck in get_request_wait.

This, of course, is completely unacceptable for real-world
server scenarios, where all users of the server need to be
serviced fairly.

regards,

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-06 21:37 ` Rik van Riel
@ 2002-02-06 22:06   ` rwhron
  2002-02-07 11:32   ` Daniel Phillips
  2002-02-07 11:32   ` Daniel Phillips
  2 siblings, 0 replies; 87+ messages in thread
From: rwhron @ 2002-02-06 22:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: rwhron, linux-kernel

> They run fastest when you run each of the dbench forks
> sequentially and have the others stuck in get_request_wait.

One interesting part of tiotest is the latency measurements.
Latency isn't printed by tiobench.pl though.  I think it's 
valueable information (and wish I had it).

> This, of course, is completely unacceptable for real-world
> server scenarios, where all users of the server need to be
> serviced fairly.

Agreed.  I'm glad kernel hackers focus on latency too. :)

There are _some_ applications where throughput is critical 
though.  I would prefer to measure both throughput and 
latency at the same time, but am not yet clear on how to
deal with the Heisenberg principle.

-- 
Randy Hron

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-06 21:37 ` Rik van Riel
  2002-02-06 22:06   ` rwhron
@ 2002-02-07 11:32   ` Daniel Phillips
  2002-02-07 11:32   ` Daniel Phillips
  2 siblings, 0 replies; 87+ messages in thread
From: Daniel Phillips @ 2002-02-07 11:32 UTC (permalink / raw)
  To: Rik van Riel, rwhron; +Cc: linux-kernel

On February 6, 2002 10:37 pm, Rik van Riel wrote:
> On Wed, 6 Feb 2002 rwhron@earthlink.net wrote:
> > On Wed, Feb 06, 2002 at 09:44:33AM -0200, Rik van Riel wrote:
> > > Once you get over 'dbench 16' or so the whole thing basically
> > > becomes an excercise in how well the system can trigger task
> > > starvation in get_request_wait.
> >
> > It's neat you've identified that bottleneck.
> 
> Umm, there's one thing you need to remember about these
> high dbench loads though.
> 
> They run fastest when you run each of the dbench forks
> sequentially and have the others stuck in get_request_wait.
> 
> This, of course, is completely unacceptable for real-world
> server scenarios, where all users of the server need to be
> serviced fairly.

Right, and as we just discussed on irc, it's a useful effect - only if we 
control it, so that stopped or slowed processes do eventually get forcibly 
elevated in terms of IO priority so they can make progress, after being sat 
upon by more agressive/successful processes long enough.  And we can control 
this, it's just going to take a few months to get basic issues of IO queues, 
RSS accounting, etc. out of the way so we can address it.

The trouble I have with paying a lot of attention to dbench results at this 
point is - we're measuring effects of kernel behaviour that is, at this 
point, uncontrolled and effectively random.  IOW, we're not measuring the 
effects that we're interested in just now.  If we need to know IO throughput, 
we need to use benches that test exactly that, and not other randomly 
interacting effects.  As they said on Laugh-in many moons ago: 'very 
interesting, but useless'.

-- 
Daniel

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-06 21:37 ` Rik van Riel
  2002-02-06 22:06   ` rwhron
  2002-02-07 11:32   ` Daniel Phillips
@ 2002-02-07 11:32   ` Daniel Phillips
  2 siblings, 0 replies; 87+ messages in thread
From: Daniel Phillips @ 2002-02-07 11:32 UTC (permalink / raw)
  To: Rik van Riel, rwhron; +Cc: linux-kernel

On February 6, 2002 10:37 pm, Rik van Riel wrote:
> On Wed, 6 Feb 2002 rwhron@earthlink.net wrote:
> > On Wed, Feb 06, 2002 at 09:44:33AM -0200, Rik van Riel wrote:
> > > Once you get over 'dbench 16' or so the whole thing basically
> > > becomes an excercise in how well the system can trigger task
> > > starvation in get_request_wait.
> >
> > It's neat you've identified that bottleneck.
> 
> Umm, there's one thing you need to remember about these
> high dbench loads though.
> 
> They run fastest when you run each of the dbench forks
> sequentially and have the others stuck in get_request_wait.
> 
> This, of course, is completely unacceptable for real-world
> server scenarios, where all users of the server need to be
> serviced fairly.

Right, and as we just discussed on irc, it's a useful effect - only if we 
control it, so that stopped or slowed processes do eventually get forcibly 
elevated in terms of IO priority so they can make progress, after being sat 
upon by more agressive/successful processes long enough.  And we can control 
this, it's just going to take a few months to get basic issues of IO queues, 
RSS accounting, etc. out of the way so we can address it.

The trouble I have with paying a lot of attention to dbench results at this 
point is - we're measuring effects of kernel behaviour that is, at this 
point, uncontrolled and effectively random.  IOW, we're not measuring the 
effects that we're interested in just now.  If we need to know IO throughput, 
we need to use benches that test exactly that, and not other randomly 
interacting effects.  As they said on Laugh-in many moons ago: 'very 
interesting, but useless'.

-- 
Daniel

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
@ 2002-02-02 19:23 rwhron
  2002-02-03 14:31 ` chris
  2002-02-03 23:33 ` Momchil Velikov
  0 siblings, 2 replies; 87+ messages in thread
From: rwhron @ 2002-02-02 19:23 UTC (permalink / raw)
  To: linux-kernel


Benchmark results on 2.4.17 and 2.4.17 with radix-tree
(ratpagecache) patch.  Identical results removed.  

dbench 192 was the same.  dbench 64 is virtually equal, 
considering dbench normal flucutation.

dbench 64 processes
2.4.17               *************************  12.5  MB/sec
2.4.17rat            ************************  12.1  MB/sec


Unixbench-4.1.0
                                    2.4.17   2.4.17rat
Pipe Throughput                   387881.3    379702.9
Pipe-based Context Switching      105911.3     91653.3
Process Creation                    1180.3      1197.4
System Call Overhead              304463.9    335158.8
Shell Scripts (1 concurrent)         617.4       615.6
C Compiler Throughput                225.7       227.7
Dc: sqrt(2) to 99 decimal places   13988.7     14360.6


LMbench 2.0p2

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
OS               null        open  selct  sig   sig   fork   exec    sh
                 call  stat  clos  TCP    inst  hndl  proc   proc   proc
---------------  ----  ----  ----  -----  ----  ----  -----  -----  -----
Linux 2.4.17     0.43  3.41  6.16   36.8  1.43  3.26    932   3717  13612
Linux 2.4.17rat  0.43  4.56  7.80   55.4  1.45  3.17    901   3580  13239


Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
OS               2p/0K  2p/16K  2p/64K  8p/16K  8p/64K  16p/16K 16p/64K
                 ctxsw  ctxsw   ctxsw   ctxsw   ctxsw   ctxsw   ctxsw
---------------- -----  ------  ------  ------  ------  ------  ------
Linux 2.4.17     1.20   24.07   188.7   56.5    209.0    59.2   223.3
Linux 2.4.17rat  1.21   22.93   187.9   58.0    208.8    61.8   226.4


*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
OS               2p/0K  Pipe    AF     TCP     TCP
                 ctxsw         UNIX            conn
---------------  -----  -----  -----   -----   -----
Linux 2.4.17     2.04  10.17  21.21   65.82   288.2
Linux 2.4.17rat  2.04  10.09  20.46   60.76   289.4


File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
OS                 0K    File     10K    File     Mmap    Prot   Page
                 Create  Delete  Create  Delete  Latency  Fault  Fault
---------------  ------  ------  ------  ------  -------  -----  -----
Linux 2.4.17      123.9   165.6   677.1   242.8    2602   1.076   8.0
Linux 2.4.17rat   128.9   169.7   618.8   256.2    2762   1.051   7.7

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
OS              Pipe   AF    TCP  File   Mmap   Bcopy  Bcopy  Mem   Mem
                      UNIX        reread reread (libc) (hand) read  write
--------------- ----- ----- ----- ------ ------ ------ ------ ----- -----
Linux 2.4.17     61.5  49.2  60.5   61.7  237.4   59.2   60.3 237.3  85.0
Linux 2.4.17rat  64.4  42.8  60.3   61.4  237.5   59.1   60.2 237.4  84.7


Memory latencies in nanoseconds - smaller is better
---------------------------------------------------
OS                 Mhz   L1 $   L2 $    Main mem
-----------------  ----  -----  ------  --------
Linux 2.4.17        501   4.2   188.1    262.2
Linux 2.4.17rat     501   4.2   195.4    262.0


Threaded I/O Bench
Read, Write, and Seeks are MB/sec
CPU Effiency (CPU Eff) = (MB/sec) / CPU% (bigger is better)

 		Num     Seq Read    CPU    Rand Read   CPU   
 		Thr    Rate (CPU%)  Eff   Rate (CPU%)  Eff   
 		---  -------------------  -----------------  
2.4.17            1  13.23  43.2%  30.62  2.74  3.7%  73.88  
2.4.17rat         1  12.60  43.0%  29.30  2.75  3.8%  72.45  

2.4.17            2  11.56  29.9%  38.66  3.04  3.8%  79.89  
2.4.17rat         2  11.07  30.4%  36.41  3.06  4.3%  71.81  

2.4.17            4  11.03  25.9%  42.59  3.19  4.1%  78.26  
2.4.17rat         4  10.62  26.5%  40.08  3.15  4.2%  74.32  

2.4.17            8  10.62  22.8%  46.58  3.29  4.2%  77.99  
2.4.17rat         8  10.23  23.4%  43.72  3.26  4.5%  73.05  

 		Num    Seq Write    CPU    Rand Write   CPU
 		Thr   Rate (CPU%)   Eff    Rate (CPU%)  Eff
 		---  -------------------  -----------------
2.4.17            1  11.08  50.5%  21.94  0.69  1.6%  44.10
2.4.17rat         1   7.77  32.8%  23.69  0.53  1.1%  48.44

2.4.17            2  10.83  48.6%  22.28  0.69  1.5%  45.13
2.4.17rat         2   7.51  32.1%  23.38  0.52  1.1%  45.98

2.4.17            4  10.40  45.9%  22.66  0.68  1.5%  44.70
2.4.17rat         4   7.62  32.8%  23.25  0.53  1.2%  45.21

2.4.17            8  10.17  44.8%  22.70  0.67  1.5%  44.73
2.4.17rat         8   7.55  32.5%  23.24  0.53  1.2%  44.66


bonnie++
Version 1.02a     ---------------------Sequential Output--------------------
                  -----Per Char-----  ------Block-------  -----Rewrite------
Kernel      Size  MB/sec  %CPU   Eff  MB/sec  %CPU   Eff  MB/sec  %CPU   Eff
2.4.17      1024    3.41  98.0  3.48   14.56  65.6 22.19    8.83  51.0 17.32
2.4.17rat   1024    3.36  98.0  3.43   10.64  41.6 25.57    6.79  37.2 18.27

Version 1.02a     -----------Sequential Input-----------   ------Random-----
                  -----Per Char-----  ------Block-------   ------Seeks------
Kernel      Size  MB/sec  %CPU   Eff  MB/sec  %CPU   Eff    /sec  %CPU   Eff
2.4.17      1024    3.98  97.2  4.10   16.39  60.6 27.05     132   2.0  6609
2.4.17rat   1024    3.92  96.0  4.08   15.46  57.0 27.12     126   2.0  6302

                 -------Sequential Create-----------
                 ------Create-----  -----Delete-----
           files  /sec  %CPU   Eff  /sec  %CPU   Eff
2.4.17     16384  4421  96.8  4567  4719  97.4  4845
2.4.17rat  16384  3973  97.6  4070  4531  95.0  4769

                 -------Random Create----------------
                 ------Create-----   -----Delete-----
           files  /sec  %CPU   Eff   /sec  %CPU   Eff
2.4.17     16384  4475  98.0  4566   4124  94.0  4387
2.4.17rat  16384  4203  98.0  4288   4005  94.4  4242


Build times.  Smaller is better - task completed faster.

		perl_build 	kernel_build
2.4.17		1620		1347
2.4.17rat	1629		1444

radix-tree saves 768k on 384M machine.

2.4.17rat  Memory: 385932k/393216k available (885k kernel code, 6900k reserved
2.4.17     Memory: 385164k/393216k available (884k kernel code, 7668k reserved

Results like these on more kernels at:
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html

-- 
Randy Hron


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-02 19:23 rwhron
@ 2002-02-03 14:31 ` chris
  2002-02-03 23:33 ` Momchil Velikov
  1 sibling, 0 replies; 87+ messages in thread
From: chris @ 2002-02-03 14:31 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel


On Sat, 2 Feb 2002 rwhron@earthlink.net wrote:

>  		Num    Seq Write    CPU    Rand Write   CPU
>  		Thr   Rate (CPU%)   Eff    Rate (CPU%)  Eff
>  		---  -------------------  -----------------
> 2.4.17            1  11.08  50.5%  21.94  0.69  1.6%  44.10
> 2.4.17rat         1   7.77  32.8%  23.69  0.53  1.1%  48.44

This is a worrying trend your benching has exposed - all the streaming
write tests have taken a performance hit. Above is tiobench, and bonnie
showed the same trend too.

Chris



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-02 19:23 rwhron
  2002-02-03 14:31 ` chris
@ 2002-02-03 23:33 ` Momchil Velikov
  2002-02-04  3:59   ` rwhron
  2002-02-06  2:04   ` rwhron
  1 sibling, 2 replies; 87+ messages in thread
From: Momchil Velikov @ 2002-02-03 23:33 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel

>>>>> "rwhron" == rwhron  <rwhron@earthlink.net> writes:

rwhron> bonnie++
rwhron> Version 1.02a     ---------------------Sequential Output--------------------
rwhron>                   -----Per Char-----  ------Block-------  -----Rewrite------
rwhron> Kernel      Size  MB/sec  %CPU   Eff  MB/sec  %CPU   Eff  MB/sec  %CPU   Eff
rwhron> 2.4.17      1024    3.41  98.0  3.48   14.56  65.6 22.19    8.83  51.0 17.32
rwhron> 2.4.17rat   1024    3.36  98.0  3.43   10.64  41.6 25.57    6.79  37.2 18.27

rwhron> Version 1.02a     -----------Sequential Input-----------   ------Random-----
rwhron>                   -----Per Char-----  ------Block-------   ------Seeks------
rwhron> Kernel      Size  MB/sec  %CPU   Eff  MB/sec  %CPU   Eff    /sec  %CPU   Eff
rwhron> 2.4.17      1024    3.98  97.2  4.10   16.39  60.6 27.05     132   2.0  6609
rwhron> 2.4.17rat   1024    3.92  96.0  4.08   15.46  57.0 27.12     126   2.0  6302

rwhron>                  -------Sequential Create-----------
rwhron>                  ------Create-----  -----Delete-----
rwhron>            files  /sec  %CPU   Eff  /sec  %CPU   Eff
rwhron> 2.4.17     16384  4421  96.8  4567  4719  97.4  4845
rwhron> 2.4.17rat  16384  3973  97.6  4070  4531  95.0  4769

rwhron>                  -------Random Create----------------
rwhron>                  ------Create-----   -----Delete-----
rwhron>            files  /sec  %CPU   Eff   /sec  %CPU   Eff
rwhron> 2.4.17     16384  4475  98.0  4566   4124  94.0  4387
rwhron> 2.4.17rat  16384  4203  98.0  4288   4005  94.4  4242

Hmm, I've got different results with bonnie++, are you sure you didn't
swap the results :)

Linux 2.5.3-dj1
--------------
Version  1.02b      ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
merlin         496M           22038  13 11768   8           19219   6 183.9   0
merlin         496M           22495  13 11216   6           22390   6 183.9   0
merlin         496M           22292  13 11713   7           19249   6 188.8   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
merlin           10  2015  99 +++++ +++ +++++ +++  2526  99 +++++ +++  8514  99
merlin           10  2354  99 +++++ +++ +++++ +++  2656  99 +++++ +++ 10195 100
merlin           10  2871  99 +++++ +++ +++++ +++  3036  99 +++++ +++ 12012  98

Linux 2.5.3-dj1-radix-pagecache
-------------------------------
Version  1.02b      ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
merlin         496M           22397  13 11248   7           22595   7 178.6   0
merlin         496M           22088  12 11204   7           22591   7 188.1   0
merlin         496M           24767  14 10966   7           22474   7 191.1   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
merlin           10  2485  99 +++++ +++ +++++ +++  2456  99 +++++ +++  9614 100
merlin           10  2653  99 +++++ +++ +++++ +++  2884 100 +++++ +++ 11101 100
merlin           10  1948  99 +++++ +++ +++++ +++  2530  99 +++++ +++  9294  99

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-03 23:33 ` Momchil Velikov
@ 2002-02-04  3:59   ` rwhron
  2002-02-06  2:04   ` rwhron
  1 sibling, 0 replies; 87+ messages in thread
From: rwhron @ 2002-02-04  3:59 UTC (permalink / raw)
  To: Momchil Velikov; +Cc: rwhron, linux-kernel

On Mon, Feb 04, 2002 at 01:33:19AM +0200, Momchil Velikov wrote:
> Hmm, I've got different results with bonnie++, are you sure you didn't
> swap the results :)

I don't think so, but I notice that bonnie++ Sequential and Random Create 
tests flucuate a lot.  These are on 16384 small files in my tests.

                ------Sequential Create------ --------Random Create--------
                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
2.4.17       16  4217  97 +++++ +++  4664  99  4354  99 +++++ +++  4090  95
             16  4423  99 +++++ +++  4266  90  4404  95 +++++ +++  4382 100
             16  4468  97 +++++ +++  4899  99  4522  99 +++++ +++  4235  95
             16  4498  96 +++++ +++  4777 100  4464  99 +++++ +++  3854  90
             16  4503  95 +++++ +++  4990  99  4632  98 +++++ +++  4058  90

2.4.17rat    16  2994  98 +++++ +++  4548  94  2952  99 +++++ +++  3967  92
             16  3055  97 +++++ +++  4705  93  4665  99 +++++ +++  4119  94
             16  4463  96 +++++ +++  4670 100  4452 100 +++++ +++  3775  94
             16  4833  99 +++++ +++  4823 100  4616  97 +++++ +++  4054  92
             16  4521  98 +++++ +++  3907  88  4329  95 +++++ +++  4111 100

The test against a 1GB (large) file vary somewhat too, but not as much as
the tests above.  The results I posted earlier were the averages from 5 runs.

I just uploaded the raw logfiles from my testing (not just 2.4.17 and radix-tree)
to http://home.earthlink.net/~rwhron/kernel  (the tar.bz2 files are the logs).

-- 
Randy Hron


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-03 23:33 ` Momchil Velikov
  2002-02-04  3:59   ` rwhron
@ 2002-02-06  2:04   ` rwhron
  2002-02-06 11:44     ` Rik van Riel
  1 sibling, 1 reply; 87+ messages in thread
From: rwhron @ 2002-02-06  2:04 UTC (permalink / raw)
  To: Momchil Velikov; +Cc: rwhron, linux-kernel

> Hmm, I've got different results with bonnie++, are you sure you didn't
> swap the results :)

I am curious if the small followup patch to 2.5.3-dj1 makes radix-tree
more like 2.4.17-ratpagecache.  (overall, it seems that radix-tree did
better in I/O without the small followup patch).

Benchmarks on 2.5.3-dj1, 2.5.3-dj1 with radix-tree (rat) 
and small followup patch (rat2).

dbench 64 processes
2.5.3-dj1rat         **************************  13.1  MB/sec
2.5.3-dj1            **********************  11.1  MB/sec
2.5.3-dj1rat2        **********************  11.1  MB/sec

dbench 192 processes
2.5.3-dj1rat         *************   6.8  MB/sec
2.5.3-dj1rat2        *************   6.7  MB/sec
2.5.3-dj1            *************   6.6  MB/sec


Unixbench-4.1.0
                                 2.5.3-dj1   2.5.3-dj1rat  2.5.3-dj1rat2
Execl Throughput                    316.7          298.5         324.8 lps 
Pipe Throughput                  307626.1       308399.0      284619.1 lps
Pipe-based Context Switching      95502.3        93398.1       76470.1 lps
Process Creation                   1320.7         1280.5        1398.3 lps 
System Call Overhead             245040.7       239410.7      251300.1 lps
Shell Scripts (1 concurrent)        643.2          639.0         636.1 lpm 
Shell Scripts (8 concurrent)         89.3           88.0          86.9 lpm 
Shell Scripts (16 concurrent)        44.7           44.0          44.0 lpm 
Shell Scripts (32 concurrent)        22.7           22.0          22.0 lpm 
Shell Scripts (64 concurrent)        10.9           10.7          10.7 lpm 
Shell Scripts (96 concurrent)         7.1            7.1           6.9 lpm 
Shell Scripts (128 concurrent)        5.3            5.1           5.1 lpm 
C Compiler Throughput               220.6          225.2         217.1 lpm 
Dc: sqrt(2) to 99 decimal places  15838.5        15465.6       15854.2 lpm 
Recursion Test--Tower of Hanoi    14157.1        14156.8       14156.2 lps 


LMbench-2.0p2 average of 9 samples.
Processor, Processes - times in microseconds - smaller is better
OS                    null        open  selct  sig   sig   fork  exec  sh
                      call  stat  clos  TCP    inst  hndl  proc  proc  proc
--------------------  ----  ----  ----  -----  ----  ----  ----  ----  -----
Linux 2.5.3-dj1       0.60  3.56  5.42   39.0  1.49  3.10   777  3250  12082
Linux 2.5.3-dj1rat    0.60  4.65  6.61   37.5  1.44  3.03   809  3294  12580
Linux 2.5.3-dj1rat2   0.59  4.52  7.82   59.5  1.42  3.23   788  3431  12943


Context switching - times in microseconds - smaller is better
OS                   2p/0K  2p/16K  2p/64K  8p/16K  8p/64K  16p/16K 16p/64K
                     ctxsw  ctxsw   ctxsw   ctxsw   ctxsw   ctxsw   ctxsw
-------------------  -----  ------  ------  ------  ------  ------  ------
Linux 2.5.3-dj1      1.04   18.98   190.4   55.5    207.3    59.0   224.3
Linux 2.5.3-dj1rat   0.43   19.57   183.7   56.6    204.0    58.1   225.0
Linux 2.5.3-dj1rat2  1.02   20.39   184.9   58.0    212.3    59.5   227.1


*Local* Communication latencies in microseconds - smaller is better
OS                     Pipe    AF     TCP      TCP
                              UNIX             conn
-------------------    -----  -----   -----    -----
Linux 2.5.3-dj1       13.57  27.21    76.68    303.6
Linux 2.5.3-dj1rat    10.46  20.51    72.37    292.6
Linux 2.5.3-dj1rat2   11.41  20.04    75.18    304.6


File & VM system latencies in microseconds - smaller is better
OS                     0K    File     10K    File     Mmap    Prot   Page
                     Create  Delete  Create  Delete  Latency  Fault  Fault
-------------------  ------  ------  ------  ------  -------  -----  -----
Linux 2.5.3-dj1       132.3   195.5   709.7   285.9    2821   1.134   6.1
Linux 2.5.3-dj1rat    145.5   198.2   791.6   291.7    2699   1.495   5.6
Linux 2.5.3-dj1rat2   140.1   195.0   723.2   287.4    2792   1.588   5.6


*Local* Communication bandwidths in MB/s - bigger is better
OS                   Pipe  AF   TCP File   Mmap   Bcopy  Bcopy  Mem   Mem
                          UNIX      reread reread (libc) (hand) read  write
-------------------  ---- ---- ---- ------ ------ ------ ------ ----- -----
Linux 2.5.3-dj1      65.7 44.3 42.7   60.3  237.5   59.2  60.4  237.4  85.1
Linux 2.5.3-dj1rat   64.2 45.5 38.4   59.8  237.7   59.5  60.6  237.6  85.7
Linux 2.5.3-dj1rat2  67.0 46.0 45.3   60.1  237.6   59.4  60.5  237.4  85.4


Memory latencies in nanoseconds - smaller is better
OS                    Mhz   L1 $   L2 $    Main mem
-------------------  ----  -----  ------  --------
Linux 2.5.3-dj1       501   4.2   195.7    262.3
Linux 2.5.3-dj1rat    501   4.2   191.1    261.9
Linux 2.5.3-dj1rat2   501   4.2   189.9    262.1


Threaded I/O Bench
File Size is 384MB, Read, Write, and Seeks are MB/sec
CPU Effiency (CPU Eff) = (MB/sec) / CPU% (bigger is better)

Reads		  Num  Seq     Read    CPU  Rand  Read    CPU 
 		  Thr  Rate   (CPU%)   Eff  Rate (CPU%)   Eff 
 		  ---  -------------------  -----------------
2.5.3-dj1           1  12.92  42.7%  30.26  3.21  4.7%  68.63
2.5.3-dj1rat        1  12.99  44.8%  29.00  2.80  3.9%  71.17
2.5.3-dj1rat2       1  12.75  43.2%  29.51  2.75  4.6%  59.29

2.5.3-dj1           2  11.63  32.1%  36.23  3.34  4.4%  75.57
2.5.3-dj1rat        2  11.70  33.3%  35.14  3.00  4.5%  65.93
2.5.3-dj1rat2       2  11.42  32.3%  35.36  2.92  4.9%  59.75

2.5.3-dj1           4  11.26  28.8%  39.10  3.42  4.8%  71.70
2.5.3-dj1rat        4  11.22  30.0%  37.40  3.06  4.7%  65.18
2.5.3-dj1rat2       4  11.14  29.5%  37.76  3.00  4.9%  61.69

2.5.3-dj1           8  11.25  26.8%  41.98  3.61  5.1%  71.02
2.5.3-dj1rat        8  10.96  26.9%  40.74  3.19  4.6%  69.09
2.5.3-dj1rat2       8  10.94  26.9%  40.67  3.09  4.9%  62.91

Writes		  Num  Seq    Write    CPU  Rand  Write   CPU
 		  Thr  Rate   (CPU%)   Eff  Rate  (CPU%)  Eff
 		  ---  -------------------  -----------------
2.5.3-dj1           1  11.38  55.9%  20.36  0.76  2.0%  37.21
2.5.3-dj1rat        1  10.77  54.3%  19.83  0.70  2.0%  35.30
2.5.3-dj1rat2       1   9.01  42.5%  21.21  0.57  1.5%  37.75
                                                             
2.5.3-dj1           2  11.35  56.4%  20.12  0.77  2.0%  37.32
2.5.3-dj1rat        2  10.54  53.2%  19.81  0.70  2.0%  35.69
2.5.3-dj1rat2       2   7.92  37.0%  21.40  0.52  1.4%  38.15
                                                             
2.5.3-dj1           4  11.39  57.3%  19.88  0.78  2.1%  36.65
2.5.3-dj1rat        4  10.54  53.2%  19.81  0.71  2.0%  35.91
2.5.3-dj1rat2       4   7.71  36.1%  21.37  0.51  1.4%  37.56
                                                             
2.5.3-dj1           8  11.28  56.6%  19.93  0.79  2.2%  36.22
2.5.3-dj1rat        8  10.77  54.8%  19.65  0.73  2.0%  36.24
2.5.3-dj1rat2       8   7.93  37.3%  21.27  0.52  1.4%  37.16


bonnie++-1.02a summary
1024 MB file    ---------------------Sequential Output--------------------
                -----Per Char-----  ------Block-------  -----Rewrite------
Kernel          MB/sec  %CPU   Eff  MB/sec  %CPU   Eff  MB/sec  %CPU   Eff
2.5.3-dj1         3.38  98.0  3.44   14.20  67.4 21.06    8.52  51.0 16.71
2.5.3-dj1rat      3.35  98.0  3.42   14.01  62.0 22.60    8.19  46.4 17.66
2.5.3-dj1rat2     3.38  98.0  3.44   13.06  56.2 23.24    7.66  42.6 17.98


                 -----------Sequential Input-----------  ------Random-----
                 -----Per Char-----  ------Block-------  ------Seeks------
Kernel           MB/sec  %CPU   Eff  MB/sec  %CPU   Eff   /sec  %CPU   Eff
2.5.3-dj1          3.96  97.4  4.07   15.26  57.2 26.68    129   2.0  6460
2.5.3-dj1rat       3.94  97.0  4.06   15.21  57.4 26.49    127   1.8  7051
2.5.3-dj1rat2      3.93  97.0  4.05   15.13  57.4 26.36    128   2.0  6415

bonnie++ small files (16384 files)
               --------Sequential Create------------
               ------Create-----    -----Delete-----
                /sec  %CPU   Eff    /sec  %CPU   Eff
2.5.3-dj1       4369  97.4  4485    4176  96.8  4313
2.5.3-dj1rat    4276  98.0  4364    4100  96.0  4271
2.5.3-dj1rat2   4293  97.0  4425    3880  94.4  4110

                ------------Random Create------------
                ------Create-----    -----Delete-----
                 /sec  %CPU   Eff    /sec  %CPU   Eff
 2.5.3-dj1       4334  98.2  4413    3782  96.4  3923
 2.5.3-dj1rat    4238  97.6  4341    3696  95.0  3890
 2.5.3-dj1rat2   4215  97.2  4336    3667  95.0  3860


Time to build/test/run in seconds
version        autoconf    perl    kernel  updatedb updatedb*5
2.5.3-dj1      1152        1636    1320    29       38        
2.5.3-dj1rat   1167        1684    1336    29       39        
2.5.3-dj1rat2  1162        1674    1356    29       37        

Memory Info - radix-tree saved 768k on 384MB machine.
version        available totalmem kern_code reserved
2.5.3-dj1rat     385860k  393216k      931k    6972k
2.5.3-dj1rat2    385860k  393216k      931k    6972k
2.5.3-dj1        385092k  393216k      930k    7740k

Details on system, testing method, and other results at:
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
-- 
Randy Hron


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-06  2:04   ` rwhron
@ 2002-02-06 11:44     ` Rik van Riel
  0 siblings, 0 replies; 87+ messages in thread
From: Rik van Riel @ 2002-02-06 11:44 UTC (permalink / raw)
  To: rwhron; +Cc: Momchil Velikov, linux-kernel

On Tue, 5 Feb 2002 rwhron@earthlink.net wrote:

> I am curious if the small followup patch to 2.5.3-dj1 makes radix-tree
> more like 2.4.17-ratpagecache.  (overall, it seems that radix-tree did
> better in I/O without the small followup patch).

It would be useful if you also did dbench tests with a much
lower amount of dbench processes.

Once you get over 'dbench 16' or so the whole thing basically
becomes an excercise in how well the system can trigger task
starvation in get_request_wait.

I can recommend 'dbench 1'  'dbench 4'  'dbench 16'   ;)

> dbench 64 processes
> 2.5.3-dj1rat         **************************  13.1  MB/sec
> 2.5.3-dj1            **********************  11.1  MB/sec
> 2.5.3-dj1rat2        **********************  11.1  MB/sec
>
> dbench 192 processes
> 2.5.3-dj1rat         *************   6.8  MB/sec
> 2.5.3-dj1rat2        *************   6.7  MB/sec
> 2.5.3-dj1            *************   6.6  MB/sec



Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH] Radix-tree pagecache for 2.5
@ 2002-01-29 15:54 Christoph Hellwig
  2002-01-29 19:27 ` Linus Torvalds
  2002-01-29 23:00 ` William Lee Irwin III
  0 siblings, 2 replies; 87+ messages in thread
From: Christoph Hellwig @ 2002-01-29 15:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: velco

I've ported my hacked up version of Momchil Velikov's radix tree
radix tree pagecache to 2.5.3-pre{5,6}.

The changes over the 2.4.17 version are:

  o use mempool to avoid OOM situation involving radix nodes.
  o remove add_to_page_cache_locked, it was unused in the 2.4.17 patch.
  o unify add_to_page and add_to_page_unique

It gives nice scalability improvements on big machines and drops the
memory usage on small ones (if you consider my 64MB Athlon small :)).

The patch is at

	ftp://ftp.kernel.org:/pub/linux/kernel/people/hch/patches/v2.5/2.5.3-pre5/linux-2.5.3-ratpagecache.patch.gz
	ftp://ftp.kernel.org:/pub/linux/kernel/people/hch/patches/v2.5/2.5.3-pre5/linux-2.5.3-ratpagecache.patch.bz2

or below.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

diff -uNr -Xdontdiff /datenklo/ref/linux-vger/drivers/block/rd.c linux-vger/drivers/block/rd.c
--- /datenklo/ref/linux-vger/drivers/block/rd.c	Sun Jan 13 00:05:09 2002
+++ linux-vger/drivers/block/rd.c	Tue Jan 29 03:05:40 2002
@@ -156,7 +156,6 @@
 
 	do {
 		int count;
-		struct page ** hash;
 		struct page * page;
 		char * src, * dst;
 		int unlock = 0;
@@ -166,8 +165,7 @@
 			count = size;
 		size -= count;
 
-		hash = page_hash(mapping, index);
-		page = __find_get_page(mapping, index, hash);
+		page = find_get_page(mapping, index);
 		if (!page) {
 			page = grab_cache_page(mapping, index);
 			err = -ENOMEM;
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/fs/inode.c linux-vger/fs/inode.c
--- /datenklo/ref/linux-vger/fs/inode.c	Fri Jan 25 13:12:03 2002
+++ linux-vger/fs/inode.c	Tue Jan 29 03:05:40 2002
@@ -144,6 +144,7 @@
 	INIT_LIST_HEAD(&inode->i_devices);
 	sema_init(&inode->i_sem, 1);
 	sema_init(&inode->i_zombie, 1);
+	INIT_RAT_ROOT(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.i_shared_lock);
 }
 
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/fs.h linux-vger/include/linux/fs.h
--- /datenklo/ref/linux-vger/include/linux/fs.h	Fri Jan 25 13:15:19 2002
+++ linux-vger/include/linux/fs.h	Tue Jan 29 03:05:40 2002
@@ -21,6 +21,7 @@
 #include <linux/cache.h>
 #include <linux/stddef.h>
 #include <linux/string.h>
+#include <linux/rat.h>
 
 #include <asm/atomic.h>
 #include <asm/bitops.h>
@@ -378,6 +379,7 @@
 };
 
 struct address_space {
+	struct rat_root		page_tree;	/* radix tree of all pages */
 	struct list_head	clean_pages;	/* list of clean pages */
 	struct list_head	dirty_pages;	/* list of dirty pages */
 	struct list_head	locked_pages;	/* list of locked pages */
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/mm.h linux-vger/include/linux/mm.h
--- /datenklo/ref/linux-vger/include/linux/mm.h	Wed Jan 16 21:49:08 2002
+++ linux-vger/include/linux/mm.h	Tue Jan 29 03:05:40 2002
@@ -149,15 +149,12 @@
 	struct list_head list;		/* ->mapping has some page lists. */
 	struct address_space *mapping;	/* The inode (or ...) we belong to. */
 	unsigned long index;		/* Our offset within mapping. */
-	struct page *next_hash;		/* Next page sharing our hash bucket in
-					   the pagecache hash table. */
 	atomic_t count;			/* Usage count, see below. */
 	unsigned long flags;		/* atomic flags, some possibly
 					   updated asynchronously */
 	struct list_head lru;		/* Pageout list, eg. active_list;
 					   protected by pagemap_lru_lock !! */
 	wait_queue_head_t wait;		/* Page locked?  Stand in line... */
-	struct page **pprev_hash;	/* Complement to *next_hash. */
 	struct buffer_head * buffers;	/* Buffer maps us to a disk block. */
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
@@ -225,9 +222,8 @@
  * using the page->list list_head. These fields are also used for
  * freelist managemet (when page->count==0).
  *
- * There is also a hash table mapping (mapping,index) to the page
- * in memory if present. The lists for this hash table use the fields
- * page->next_hash and page->pprev_hash.
+ * There is also a per-mapping radix tree mapping index to the page
+ * in memory if present. The tree is rooted at mapping->root.  
  *
  * All process pages can do I/O:
  * - inode pages may need to be read from disk,
@@ -461,6 +457,24 @@
 		return 0;
 }
 
+static inline void add_page_to_inode_queue(struct address_space *mapping, struct page * page)
+{
+	struct list_head *head = &mapping->clean_pages;
+
+	mapping->nrpages++;
+	list_add(&page->list, head);
+	page->mapping = mapping;
+}
+
+static inline void remove_page_from_inode_queue(struct page * page)
+{
+	struct address_space * mapping = page->mapping;
+
+	mapping->nrpages--;
+	list_del(&page->list);
+	page->mapping = NULL;
+}
+
 struct zone_t;
 /* filemap.c */
 extern void remove_inode_page(struct page *);
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/pagemap.h linux-vger/include/linux/pagemap.h
--- /datenklo/ref/linux-vger/include/linux/pagemap.h	Mon Nov 12 18:19:18 2001
+++ linux-vger/include/linux/pagemap.h	Tue Jan 29 03:18:57 2002
@@ -41,53 +41,22 @@
  */
 #define page_cache_entry(x)	virt_to_page(x)
 
-extern unsigned int page_hash_bits;
-#define PAGE_HASH_BITS (page_hash_bits)
-#define PAGE_HASH_SIZE (1 << PAGE_HASH_BITS)
-
-extern atomic_t page_cache_size; /* # of pages currently in the hash table */
-extern struct page **page_hash_table;
-
-extern void page_cache_init(unsigned long);
-
-/*
- * We use a power-of-two hash table to avoid a modulus,
- * and get a reasonable hash by knowing roughly how the
- * inode pointer and indexes are distributed (ie, we
- * roughly know which bits are "significant")
- *
- * For the time being it will work for struct address_space too (most of
- * them sitting inside the inodes). We might want to change it later.
- */
-static inline unsigned long _page_hashfn(struct address_space * mapping, unsigned long index)
-{
-#define i (((unsigned long) mapping)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
-#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
-	return s(i+index) & (PAGE_HASH_SIZE-1);
-#undef i
-#undef s
-}
-
-#define page_hash(mapping,index) (page_hash_table+_page_hashfn(mapping,index))
-
-extern struct page * __find_get_page(struct address_space *mapping,
-				unsigned long index, struct page **hash);
-#define find_get_page(mapping, index) \
-	__find_get_page(mapping, index, page_hash(mapping, index))
-extern struct page * __find_lock_page (struct address_space * mapping,
-				unsigned long index, struct page **hash);
+extern atomic_t page_cache_size; /* # of pages currently in the page cache */
+
+extern struct page * find_get_page(struct address_space *mapping,
+				unsigned long index);
+extern struct page * find_lock_page(struct address_space *mapping,
+				unsigned long index);
+extern struct page * find_trylock_page(struct address_space *mapping,
+				unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
 				unsigned long index, unsigned int gfp_mask);
 
+extern int add_to_page_cache(struct page * page, struct address_space *mapping,
+				unsigned long index);
+
 extern void FASTCALL(lock_page(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
-#define find_lock_page(mapping, index) \
-	__find_lock_page(mapping, index, page_hash(mapping, index))
-extern struct page *find_trylock_page(struct address_space *, unsigned long);
-
-extern void add_to_page_cache(struct page * page, struct address_space *mapping, unsigned long index);
-extern void add_to_page_cache_locked(struct page * page, struct address_space *mapping, unsigned long index);
-extern int add_to_page_cache_unique(struct page * page, struct address_space *mapping, unsigned long index, struct page **hash);
 
 extern void ___wait_on_page(struct page *);
 
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/rat.h linux-vger/include/linux/rat.h
--- /datenklo/ref/linux-vger/include/linux/rat.h	Thu Jan  1 01:00:00 1970
+++ linux-vger/include/linux/rat.h	Tue Jan 29 03:05:40 2002
@@ -0,0 +1,26 @@
+#ifndef _LINUX_RAT_H
+#define _LINUX_RAT_H
+
+struct rat_node;
+
+#define RAT_SLOT_RESERVED	((void *)~0UL)
+
+struct rat_root {
+	unsigned int	height;
+	int		gfp_mask;
+	struct rat_node	*rnode;
+};
+
+#define INIT_RAT_ROOT(root, mask)	\
+do {					\
+	(root)->height = 0;		\
+	(root)->gfp_mask = (mask);	\
+	(root)->rnode = NULL;		\
+} while (0)
+
+extern int rat_reserve(struct rat_root *, unsigned long, void ***);
+extern int rat_insert(struct rat_root *, unsigned long, void *);
+extern void *rat_lookup(struct rat_root *, unsigned long);
+extern int rat_delete(struct rat_root *, unsigned long);
+
+#endif /* _LINUX_RAT_H */
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/swap.h linux-vger/include/linux/swap.h
--- /datenklo/ref/linux-vger/include/linux/swap.h	Wed Jan 16 21:49:11 2002
+++ linux-vger/include/linux/swap.h	Tue Jan 29 03:05:40 2002
@@ -96,7 +96,7 @@
 struct task_struct;
 struct vm_area_struct;
 struct sysinfo;
-
+struct address_space;
 struct zone_t;
 
 /* linux/mm/swap.c */
@@ -126,6 +126,9 @@
 extern int add_to_swap_cache(struct page *, swp_entry_t);
 extern void __delete_from_swap_cache(struct page *page);
 extern void delete_from_swap_cache(struct page *page);
+extern int move_to_swap_cache(struct page *page, swp_entry_t entry);
+extern int move_from_swap_cache(struct page *page, unsigned long index,
+		struct address_space *mapping);
 extern void free_page_and_swap_cache(struct page *page);
 extern struct page * lookup_swap_cache(swp_entry_t);
 extern struct page * read_swap_cache_async(swp_entry_t);
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/init/main.c linux-vger/init/main.c
--- /datenklo/ref/linux-vger/init/main.c	Fri Jan 25 13:16:04 2002
+++ linux-vger/init/main.c	Tue Jan 29 03:05:40 2002
@@ -69,6 +69,7 @@
 extern void sbus_init(void);
 extern void sysctl_init(void);
 extern void signals_init(void);
+extern void ratcache_init(void) __init;
 
 extern void free_initmem(void);
 
@@ -377,7 +378,7 @@
 	proc_caches_init();
 	vfs_caches_init(mempages);
 	buffer_init(mempages);
-	page_cache_init(mempages);
+	ratcache_init();
 #if defined(CONFIG_ARCH_S390)
 	ccwcache_init();
 #endif
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/kernel/ksyms.c linux-vger/kernel/ksyms.c
--- /datenklo/ref/linux-vger/kernel/ksyms.c	Fri Jan 25 13:16:04 2002
+++ linux-vger/kernel/ksyms.c	Tue Jan 29 03:05:40 2002
@@ -218,8 +218,6 @@
 EXPORT_SYMBOL(generic_file_mmap);
 EXPORT_SYMBOL(generic_ro_fops);
 EXPORT_SYMBOL(generic_buffer_fdatasync);
-EXPORT_SYMBOL(page_hash_bits);
-EXPORT_SYMBOL(page_hash_table);
 EXPORT_SYMBOL(file_lock_list);
 EXPORT_SYMBOL(locks_init_lock);
 EXPORT_SYMBOL(locks_copy_lock);
@@ -254,8 +252,8 @@
 EXPORT_SYMBOL(__pollwait);
 EXPORT_SYMBOL(poll_freewait);
 EXPORT_SYMBOL(ROOT_DEV);
-EXPORT_SYMBOL(__find_get_page);
-EXPORT_SYMBOL(__find_lock_page);
+EXPORT_SYMBOL(find_get_page);
+EXPORT_SYMBOL(find_lock_page);
 EXPORT_SYMBOL(grab_cache_page);
 EXPORT_SYMBOL(grab_cache_page_nowait);
 EXPORT_SYMBOL(read_cache_page);
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/lib/Makefile linux-vger/lib/Makefile
--- /datenklo/ref/linux-vger/lib/Makefile	Wed Jan 16 21:49:17 2002
+++ linux-vger/lib/Makefile	Tue Jan 29 03:05:40 2002
@@ -8,9 +8,10 @@
 
 L_TARGET := lib.a
 
-export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o crc32.o
+export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o crc32.o rat.o
 
-obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o
+obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o \
+	 bust_spinlocks.o rbtree.o rat.o
 
 obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/lib/rat.c linux-vger/lib/rat.c
--- /datenklo/ref/linux-vger/lib/rat.c	Thu Jan  1 01:00:00 1970
+++ linux-vger/lib/rat.c	Tue Jan 29 03:05:40 2002
@@ -0,0 +1,294 @@
+/*
+ * Copyright (C) 2001 Momchil Velikov
+ * Portions Copyright (C) 2001 Christoph Hellwig
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2, or (at
+ * your option) any later version.
+ * 
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ * 
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+#include <linux/module.h>
+#include <linux/rat.h>
+#include <linux/slab.h>
+
+
+/*
+ * Radix tree node definition.
+ */
+#define RAT_MAP_SHIFT		7
+#define RAT_MAP_SIZE		(1UL << RAT_MAP_SHIFT)
+#define RAT_MAP_MASK		(RAT_MAP_SIZE-1)
+
+struct rat_node {
+	unsigned int	count;
+	void		*slots[RAT_MAP_SIZE];
+};
+
+struct rat_path {
+	struct rat_node *node, **slot;
+};
+
+#define RAT_INDEX_BITS	(8/* CHAR_BIT */ * sizeof(unsigned long))
+
+/*
+ * Radix tree node cache.
+ */
+#define POOL_SIZE	32
+
+static kmem_cache_t *ratnode_cachep;
+static mempool_t *ratnode_pool;
+
+#define ratnode_alloc(root) \
+	mempool_alloc(ratnode_pool, (root)->gfp_mask)
+#define ratnode_free(node) \
+	mempool_free((node), ratnode_pool);
+
+
+/*
+ *	Return the maximum key which can be store into a
+ *	radix tree with height HEIGHT.
+ */
+static inline unsigned long rat_maxindex(unsigned int height)
+{
+	unsigned int tmp = height * RAT_MAP_SHIFT;
+	unsigned long index = (~0UL >> (RAT_INDEX_BITS - tmp - 1)) >> 1;
+
+	if (tmp >= RAT_INDEX_BITS)
+		index = ~0UL;
+	return index;
+}
+
+
+/*
+ *	Extend a radix tree so it can store key @index.
+ */
+static int rat_extend(struct rat_root *root, unsigned long index)
+{
+	struct rat_node *node;
+	unsigned int height;
+
+	/* Figure out what the height should be.  */
+	height = root->height + 1;
+	while (index > rat_maxindex(height))
+		height++;
+
+	if (root->rnode) {
+		do {
+			if (!(node = ratnode_alloc(root)))
+				return -ENOMEM;
+
+			/* Increase the height.  */
+			node->slots[0] = root->rnode;
+			if (root->rnode)
+				node->count = 1;
+			root->rnode = node;
+			root->height++;
+		} while (height > root->height);
+	} else 
+		root->height = height;
+
+	return 0;
+}
+
+
+/**
+ *	rat_reserve    -    reserve space in a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@pslot:		pointer to reserved slot
+ *
+ *	Reserve a slot in a radix tree for the key @index.
+ */
+int rat_reserve(struct rat_root *root, unsigned long index, void ***pslot)
+{
+	struct rat_node *node = NULL, *tmp, **slot;
+	unsigned int height, shift;
+	int error;
+
+	/* Make sure the tree is high enough.  */
+	if (index > rat_maxindex(root->height)) {
+		error = rat_extend(root, index);
+		if (error)
+			return error;
+	}
+    
+	slot = &root->rnode;
+	height = root->height;
+	shift = (height-1) * RAT_MAP_SHIFT;
+
+	while (height > 0) {
+		if (*slot == NULL) {
+			/* Have to add a child node.  */
+			if (!(tmp = ratnode_alloc(root)))
+				return -ENOMEM;
+			*slot = tmp;
+			if (node)
+				node->count++;
+		}
+
+		/* Go a level down.  */
+		node = *slot;
+		slot = (struct rat_node **)
+			(node->slots + ((index >> shift) & RAT_MAP_MASK));
+		shift -= RAT_MAP_SHIFT;
+		height--;
+	}
+
+	if (*slot != NULL)
+		return -EEXIST;
+	if (node)
+		node->count++;
+
+	*pslot = (void **)slot;
+	**pslot = RAT_SLOT_RESERVED;
+	return 0;
+}
+
+EXPORT_SYMBOL(rat_reserve);
+
+
+/**
+ *	rat_insert    -    insert into a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@item:		item to insert
+ *
+ *	Insert an item into the radix tree at position @index.
+ */
+int rat_insert(struct rat_root *root, unsigned long index, void *item)
+{
+	void **slot;
+	int error;
+
+	error = rat_reserve(root, index, &slot);
+	if (!error)
+		*slot = item;
+	return error;
+}
+
+EXPORT_SYMBOL(rat_insert);
+
+
+/**
+ *	rat_lookup    -    perform lookup operation on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup them item at the position @index in the radix tree @root.
+ */
+void *rat_lookup(struct rat_root *root, unsigned long index)
+{
+	unsigned int height, shift;
+	struct rat_node **slot;
+
+	height = root->height;
+	if (index > rat_maxindex(height))
+		return NULL;
+
+	shift = (height-1) * RAT_MAP_SHIFT;
+	slot = &root->rnode;
+
+	while (height > 0) {
+		if (*slot == NULL)
+			return NULL;
+
+		slot = (struct rat_node **)
+			((*slot)->slots + ((index >> shift) & RAT_MAP_MASK));
+		shift -= RAT_MAP_SHIFT;
+		height--;
+	}
+
+	return (void *) *slot;
+}
+
+EXPORT_SYMBOL(rat_lookup);
+
+
+/**
+ *	rat_delete    -    delete an item from a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Remove the item at @index from the radix tree rooted at @root.
+ */
+int rat_delete(struct rat_root *root, unsigned long index)
+{
+	struct rat_path path[RAT_INDEX_BITS/RAT_MAP_SHIFT + 2], *pathp = path;
+	unsigned int height, shift;
+
+	height = root->height;
+	if (index > rat_maxindex(height))
+		return -ENOENT;
+
+	shift = (height-1) * RAT_MAP_SHIFT;
+	pathp->node = NULL;
+	pathp->slot = &root->rnode;
+
+	while (height > 0) {
+		if (*pathp->slot == NULL)
+			return -ENOENT;
+
+		pathp[1].node = *pathp[0].slot;
+		pathp[1].slot = (struct rat_node **)
+		    (pathp[1].node->slots + ((index >> shift) & RAT_MAP_MASK));
+		pathp++;
+		shift -= RAT_MAP_SHIFT;
+		height--;
+	}
+
+	if (*pathp[0].slot == NULL)
+		return -ENOENT;
+
+	*pathp[0].slot = NULL;
+	while (pathp[0].node && --pathp[0].node->count == 0) {
+		pathp--;
+		*pathp[0].slot = NULL;
+		ratnode_free(pathp[1].node);
+	}
+
+	return 0;
+}
+
+EXPORT_SYMBOL(rat_delete);
+
+
+static void ratnode_ctor(void *node, kmem_cache_t *cachep, unsigned long flags)
+{
+	memset(node, 0, sizeof(struct rat_node));
+}
+
+static void *ratnode_pool_alloc(int gfp_mask, void *data)
+{
+	return kmem_cache_alloc(ratnode_cachep, gfp_mask);
+}
+
+static void ratnode_pool_free(void *node, void *data)
+{
+	kmem_cache_free(ratnode_cachep, node);
+}
+
+void __init ratcache_init(void)
+{
+	ratnode_cachep = kmem_cache_create("ratnode", sizeof(struct rat_node),
+			0, SLAB_HWCACHE_ALIGN, ratnode_ctor, NULL);
+	if (!ratnode_cachep)
+		panic ("Failed to create ratnode cache\n");
+	ratnode_pool = mempool_create(POOL_SIZE, ratnode_pool_alloc,
+			ratnode_pool_free, NULL);
+	if (!ratnode_pool)
+		panic ("Failed to create ratnode pool\n");
+}
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/filemap.c linux-vger/mm/filemap.c
--- /datenklo/ref/linux-vger/mm/filemap.c	Fri Jan 25 13:16:04 2002
+++ linux-vger/mm/filemap.c	Tue Jan 29 03:32:41 2002
@@ -44,69 +44,16 @@
  */
 
 atomic_t page_cache_size = ATOMIC_INIT(0);
-unsigned int page_hash_bits;
-struct page **page_hash_table;
 
-spinlock_t pagecache_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
 /*
- * NOTE: to avoid deadlocking you must never acquire the pagemap_lru_lock 
- *	with the pagecache_lock held.
- *
- * Ordering:
- *	swap_lock ->
- *		pagemap_lru_lock ->
- *			pagecache_lock
+ * The deadlock-free ordering of lock acquisition is:
+ *	pagemap_lru_lock ==> mapping_lock
  */
 spinlock_t pagemap_lru_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
 
 #define CLUSTER_PAGES		(1 << page_cluster)
 #define CLUSTER_OFFSET(x)	(((x) >> page_cluster) << page_cluster)
 
-static void FASTCALL(add_page_to_hash_queue(struct page * page, struct page **p));
-static void add_page_to_hash_queue(struct page * page, struct page **p)
-{
-	struct page *next = *p;
-
-	*p = page;
-	page->next_hash = next;
-	page->pprev_hash = p;
-	if (next)
-		next->pprev_hash = &page->next_hash;
-	if (page->buffers)
-		PAGE_BUG(page);
-	atomic_inc(&page_cache_size);
-}
-
-static inline void add_page_to_inode_queue(struct address_space *mapping, struct page * page)
-{
-	struct list_head *head = &mapping->clean_pages;
-
-	mapping->nrpages++;
-	list_add(&page->list, head);
-	page->mapping = mapping;
-}
-
-static inline void remove_page_from_inode_queue(struct page * page)
-{
-	struct address_space * mapping = page->mapping;
-
-	mapping->nrpages--;
-	list_del(&page->list);
-	page->mapping = NULL;
-}
-
-static inline void remove_page_from_hash_queue(struct page * page)
-{
-	struct page *next = page->next_hash;
-	struct page **pprev = page->pprev_hash;
-
-	if (next)
-		next->pprev_hash = pprev;
-	*pprev = next;
-	page->pprev_hash = NULL;
-	atomic_dec(&page_cache_size);
-}
-
 /*
  * Remove a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
@@ -115,18 +62,20 @@
 void __remove_inode_page(struct page *page)
 {
 	if (PageDirty(page)) BUG();
+	rat_delete(&page->mapping->page_tree, page->index);
 	remove_page_from_inode_queue(page);
-	remove_page_from_hash_queue(page);
+	atomic_dec(&page_cache_size);
 }
 
 void remove_inode_page(struct page *page)
 {
+	struct address_space *mapping = page->mapping;
 	if (!PageLocked(page))
 		PAGE_BUG(page);
 
-	spin_lock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
 	__remove_inode_page(page);
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&mapping->i_shared_lock);
 }
 
 static inline int sync_page(struct page *page)
@@ -147,10 +96,10 @@
 		struct address_space *mapping = page->mapping;
 
 		if (mapping) {
-			spin_lock(&pagecache_lock);
+			spin_lock(&mapping->i_shared_lock);
 			list_del(&page->list);
 			list_add(&page->list, &mapping->dirty_pages);
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 
 			if (mapping->host)
 				mark_inode_dirty_pages(mapping->host);
@@ -170,11 +119,12 @@
 {
 	struct list_head *head, *curr;
 	struct page * page;
+	struct address_space *mapping = inode->i_mapping;
 
-	head = &inode->i_mapping->clean_pages;
+	head = &mapping->clean_pages;
 
 	spin_lock(&pagemap_lru_lock);
-	spin_lock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
 	curr = head->next;
 
 	while (curr != head) {
@@ -205,7 +155,7 @@
 		continue;
 	}
 
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&mapping->i_shared_lock);
 	spin_unlock(&pagemap_lru_lock);
 }
 
@@ -244,8 +194,9 @@
 	page_cache_release(page);
 }
 
-static int FASTCALL(truncate_list_pages(struct list_head *, unsigned long, unsigned *));
-static int truncate_list_pages(struct list_head *head, unsigned long start, unsigned *partial)
+static int FASTCALL(truncate_list_pages(struct address_space *, struct list_head *, unsigned long, unsigned *));
+static int truncate_list_pages(struct address_space *mapping,
+	struct list_head *head, unsigned long start, unsigned *partial)
 {
 	struct list_head *curr;
 	struct page * page;
@@ -274,7 +225,7 @@
 				/* Restart on this page */
 				list_add(head, curr);
 
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 			unlocked = 1;
 
  			if (!failed) {
@@ -295,7 +246,7 @@
 				schedule();
 			}
 
-			spin_lock(&pagecache_lock);
+			spin_lock(&mapping->i_shared_lock);
 			goto restart;
 		}
 		curr = curr->prev;
@@ -319,24 +270,28 @@
 	unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
 	int unlocked;
 
-	spin_lock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
 	do {
-		unlocked = truncate_list_pages(&mapping->clean_pages, start, &partial);
-		unlocked |= truncate_list_pages(&mapping->dirty_pages, start, &partial);
-		unlocked |= truncate_list_pages(&mapping->locked_pages, start, &partial);
+		unlocked = truncate_list_pages(mapping, &mapping->clean_pages,
+						start, &partial);
+		unlocked |= truncate_list_pages(mapping, &mapping->dirty_pages,
+						start, &partial);
+		unlocked |= truncate_list_pages(mapping, &mapping->locked_pages,
+						start, &partial);
 	} while (unlocked);
 	/* Traversed all three lists without dropping the lock */
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&mapping->i_shared_lock);
 }
 
-static inline int invalidate_this_page2(struct page * page,
+static inline int invalidate_this_page2(struct address_space *mapping,
+					struct page * page,
 					struct list_head * curr,
 					struct list_head * head)
 {
 	int unlocked = 1;
 
 	/*
-	 * The page is locked and we hold the pagecache_lock as well
+	 * The page is locked and we hold the mapping lock as well
 	 * so both page_count(page) and page->buffers stays constant here.
 	 */
 	if (page_count(page) == 1 + !!page->buffers) {
@@ -345,7 +300,7 @@
 		list_add_tail(head, curr);
 
 		page_cache_get(page);
-		spin_unlock(&pagecache_lock);
+		spin_unlock(&mapping->i_shared_lock);
 		truncate_complete_page(page);
 	} else {
 		if (page->buffers) {
@@ -354,7 +309,7 @@
 			list_add_tail(head, curr);
 
 			page_cache_get(page);
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 			block_invalidate_page(page);
 		} else
 			unlocked = 0;
@@ -366,8 +321,8 @@
 	return unlocked;
 }
 
-static int FASTCALL(invalidate_list_pages2(struct list_head *));
-static int invalidate_list_pages2(struct list_head *head)
+static int FASTCALL(invalidate_list_pages2(struct address_space *mapping, struct list_head *));
+static int invalidate_list_pages2(struct address_space *mapping, struct list_head *head)
 {
 	struct list_head *curr;
 	struct page * page;
@@ -381,7 +336,7 @@
 		if (!TryLockPage(page)) {
 			int __unlocked;
 
-			__unlocked = invalidate_this_page2(page, curr, head);
+			__unlocked = invalidate_this_page2(mapping, page, curr, head);
 			UnlockPage(page);
 			unlocked |= __unlocked;
 			if (!__unlocked) {
@@ -394,7 +349,7 @@
 			list_add(head, curr);
 
 			page_cache_get(page);
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 			unlocked = 1;
 			wait_on_page(page);
 		}
@@ -405,7 +360,7 @@
 			schedule();
 		}
 
-		spin_lock(&pagecache_lock);
+		spin_lock(&mapping->i_shared_lock);
 		goto restart;
 	}
 	return unlocked;
@@ -420,32 +375,13 @@
 {
 	int unlocked;
 
-	spin_lock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
 	do {
-		unlocked = invalidate_list_pages2(&mapping->clean_pages);
-		unlocked |= invalidate_list_pages2(&mapping->dirty_pages);
-		unlocked |= invalidate_list_pages2(&mapping->locked_pages);
+		unlocked = invalidate_list_pages2(mapping, &mapping->clean_pages);
+		unlocked |= invalidate_list_pages2(mapping, &mapping->dirty_pages);
+		unlocked |= invalidate_list_pages2(mapping, &mapping->locked_pages);
 	} while (unlocked);
-	spin_unlock(&pagecache_lock);
-}
-
-static inline struct page * __find_page_nolock(struct address_space *mapping, unsigned long offset, struct page *page)
-{
-	goto inside;
-
-	for (;;) {
-		page = page->next_hash;
-inside:
-		if (!page)
-			goto not_found;
-		if (page->mapping != mapping)
-			continue;
-		if (page->index == offset)
-			break;
-	}
-
-not_found:
-	return page;
+	spin_unlock(&mapping->i_shared_lock);
 }
 
 /*
@@ -483,13 +419,14 @@
 	return error;
 }
 
-static int do_buffer_fdatasync(struct list_head *head, unsigned long start, unsigned long end, int (*fn)(struct page *))
+static int do_buffer_fdatasync(struct address_space *mapping,
+	struct list_head *head, unsigned long start, unsigned long end, int (*fn)(struct page *))
 {
 	struct list_head *curr;
 	struct page *page;
 	int retval = 0;
 
-	spin_lock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
 	curr = head->next;
 	while (curr != head) {
 		page = list_entry(curr, struct page, list);
@@ -502,7 +439,7 @@
 			continue;
 
 		page_cache_get(page);
-		spin_unlock(&pagecache_lock);
+		spin_unlock(&mapping->i_shared_lock);
 		lock_page(page);
 
 		/* The buffers could have been free'd while we waited for the page lock */
@@ -510,11 +447,11 @@
 			retval |= fn(page);
 
 		UnlockPage(page);
-		spin_lock(&pagecache_lock);
+		spin_lock(&mapping->i_shared_lock);
 		curr = page->list.next;
 		page_cache_release(page);
 	}
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&mapping->i_shared_lock);
 
 	return retval;
 }
@@ -525,17 +462,18 @@
  */
 int generic_buffer_fdatasync(struct inode *inode, unsigned long start_idx, unsigned long end_idx)
 {
+	struct address_space *mapping = inode->i_mapping;
 	int retval;
 
 	/* writeout dirty buffers on pages from both clean and dirty lists */
-	retval = do_buffer_fdatasync(&inode->i_mapping->dirty_pages, start_idx, end_idx, writeout_one_page);
-	retval |= do_buffer_fdatasync(&inode->i_mapping->clean_pages, start_idx, end_idx, writeout_one_page);
-	retval |= do_buffer_fdatasync(&inode->i_mapping->locked_pages, start_idx, end_idx, writeout_one_page);
+	retval = do_buffer_fdatasync(mapping, &mapping->dirty_pages, start_idx, end_idx, writeout_one_page);
+	retval |= do_buffer_fdatasync(mapping, &mapping->clean_pages, start_idx, end_idx, writeout_one_page);
+	retval |= do_buffer_fdatasync(mapping, &mapping->locked_pages, start_idx, end_idx, writeout_one_page);
 
 	/* now wait for locked buffers on pages from both clean and dirty lists */
-	retval |= do_buffer_fdatasync(&inode->i_mapping->dirty_pages, start_idx, end_idx, waitfor_one_page);
-	retval |= do_buffer_fdatasync(&inode->i_mapping->clean_pages, start_idx, end_idx, waitfor_one_page);
-	retval |= do_buffer_fdatasync(&inode->i_mapping->locked_pages, start_idx, end_idx, waitfor_one_page);
+	retval |= do_buffer_fdatasync(mapping, &mapping->dirty_pages, start_idx, end_idx, waitfor_one_page);
+	retval |= do_buffer_fdatasync(mapping, &mapping->clean_pages, start_idx, end_idx, waitfor_one_page);
+	retval |= do_buffer_fdatasync(mapping, &mapping->locked_pages, start_idx, end_idx, waitfor_one_page);
 
 	return retval;
 }
@@ -580,7 +518,7 @@
 {
 	int (*writepage)(struct page *) = mapping->a_ops->writepage;
 
-	spin_lock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
 
         while (!list_empty(&mapping->dirty_pages)) {
 		struct page *page = list_entry(mapping->dirty_pages.next, struct page, list);
@@ -592,7 +530,7 @@
 			continue;
 
 		page_cache_get(page);
-		spin_unlock(&pagecache_lock);
+		spin_unlock(&mapping->i_shared_lock);
 
 		lock_page(page);
 
@@ -603,9 +541,9 @@
 			UnlockPage(page);
 
 		page_cache_release(page);
-		spin_lock(&pagecache_lock);
+		spin_lock(&mapping->i_shared_lock);
 	}
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&mapping->i_shared_lock);
 }
 
 /**
@@ -617,7 +555,7 @@
  */
 void filemap_fdatawait(struct address_space * mapping)
 {
-	spin_lock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
 
         while (!list_empty(&mapping->locked_pages)) {
 		struct page *page = list_entry(mapping->locked_pages.next, struct page, list);
@@ -629,83 +567,57 @@
 			continue;
 
 		page_cache_get(page);
-		spin_unlock(&pagecache_lock);
+		spin_unlock(&mapping->i_shared_lock);
 
 		___wait_on_page(page);
 
 		page_cache_release(page);
-		spin_lock(&pagecache_lock);
+		spin_lock(&mapping->i_shared_lock);
 	}
-	spin_unlock(&pagecache_lock);
-}
-
-/*
- * Add a page to the inode page cache.
- *
- * The caller must have locked the page and 
- * set all the page flags correctly..
- */
-void add_to_page_cache_locked(struct page * page, struct address_space *mapping, unsigned long index)
-{
-	if (!PageLocked(page))
-		BUG();
-
-	page->index = index;
-	page_cache_get(page);
-	spin_lock(&pagecache_lock);
-	add_page_to_inode_queue(mapping, page);
-	add_page_to_hash_queue(page, page_hash(mapping, index));
-	spin_unlock(&pagecache_lock);
-
-	lru_cache_add(page);
+	spin_unlock(&mapping->i_shared_lock);
 }
 
 /*
  * This adds a page to the page cache, starting out as locked,
  * owned by us, but unreferenced, not uptodate and with no errors.
  */
-static inline void __add_to_page_cache(struct page * page,
-	struct address_space *mapping, unsigned long offset,
-	struct page **hash)
+static int __add_to_page_cache(struct page * page, struct address_space *mapping,
+				      unsigned long offset)
 {
 	unsigned long flags;
+	int error;
 
-	flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_dirty | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_checked);
-	page->flags = flags | (1 << PG_locked);
 	page_cache_get(page);
+	if ((error = rat_insert(&mapping->page_tree, offset, page)))
+		goto fail;
+	flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error |
+				1 << PG_dirty | 1 << PG_referenced |
+				1 << PG_arch_1 | 1 << PG_checked);
+	page->flags = flags | (1 << PG_locked);
 	page->index = offset;
 	add_page_to_inode_queue(mapping, page);
-	add_page_to_hash_queue(page, hash);
-}
 
-void add_to_page_cache(struct page * page, struct address_space * mapping, unsigned long offset)
-{
-	spin_lock(&pagecache_lock);
-	__add_to_page_cache(page, mapping, offset, page_hash(mapping, offset));
-	spin_unlock(&pagecache_lock);
-	lru_cache_add(page);
+	atomic_inc(&page_cache_size);
+	return 0;
+fail:
+	page_cache_release(page);
+	return error;
 }
 
-int add_to_page_cache_unique(struct page * page,
-	struct address_space *mapping, unsigned long offset,
-	struct page **hash)
+int add_to_page_cache(struct page *page, struct address_space *mapping,
+		      unsigned long offset)
 {
-	int err;
-	struct page *alias;
-
-	spin_lock(&pagecache_lock);
-	alias = __find_page_nolock(mapping, offset, *hash);
-
-	err = 1;
-	if (!alias) {
-		__add_to_page_cache(page,mapping,offset,hash);
-		err = 0;
-	}
+	int error;
 
-	spin_unlock(&pagecache_lock);
-	if (!err)
-		lru_cache_add(page);
-	return err;
+	spin_lock(&mapping->i_shared_lock);
+	if ((error = __add_to_page_cache(page, mapping, offset)))
+		goto fail;
+	spin_unlock(&mapping->i_shared_lock);
+	lru_cache_add(page);
+	return 0;
+fail:
+	spin_unlock(&mapping->i_shared_lock);
+	return -ENOMEM;
 }
 
 /*
@@ -716,12 +628,12 @@
 static int page_cache_read(struct file * file, unsigned long offset)
 {
 	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
-	struct page **hash = page_hash(mapping, offset);
 	struct page *page; 
+	int error;
 
-	spin_lock(&pagecache_lock);
-	page = __find_page_nolock(mapping, offset, *hash);
-	spin_unlock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
+	page = rat_lookup(&mapping->page_tree, offset);
+	spin_unlock(&mapping->i_shared_lock);
 	if (page)
 		return 0;
 
@@ -729,11 +641,18 @@
 	if (!page)
 		return -ENOMEM;
 
-	if (!add_to_page_cache_unique(page, mapping, offset, hash)) {
-		int error = mapping->a_ops->readpage(file, page);
+	while ((error = add_to_page_cache(page, mapping, offset)) == -ENOMEM) {
+		/* Yield for kswapd, and try again */
+		__set_current_state(TASK_RUNNING);
+		yield();
+	}
+
+	if (!error) {
+		error = mapping->a_ops->readpage(file, page);
 		page_cache_release(page);
 		return error;
 	}
+
 	/*
 	 * We arrive here in the unlikely event that someone 
 	 * raced with us and added our page to the cache first.
@@ -837,8 +756,7 @@
  * a rather lightweight function, finding and getting a reference to a
  * hashed page atomically.
  */
-struct page * __find_get_page(struct address_space *mapping,
-			      unsigned long offset, struct page **hash)
+struct page * find_get_page(struct address_space *mapping, unsigned long offset)
 {
 	struct page *page;
 
@@ -846,11 +764,11 @@
 	 * We scan the hash list read-only. Addition to and removal from
 	 * the hash-list needs a held write-lock.
 	 */
-	spin_lock(&pagecache_lock);
-	page = __find_page_nolock(mapping, offset, *hash);
+	spin_lock(&mapping->i_shared_lock);
+	page = rat_lookup(&mapping->page_tree, offset);
 	if (page)
 		page_cache_get(page);
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&mapping->i_shared_lock);
 	return page;
 }
 
@@ -860,15 +778,14 @@
 struct page *find_trylock_page(struct address_space *mapping, unsigned long offset)
 {
 	struct page *page;
-	struct page **hash = page_hash(mapping, offset);
 
-	spin_lock(&pagecache_lock);
-	page = __find_page_nolock(mapping, offset, *hash);
+	spin_lock(&mapping->i_shared_lock);
+	page = rat_lookup(&mapping->page_tree, offset);
 	if (page) {
 		if (TryLockPage(page))
 			page = NULL;
 	}
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&mapping->i_shared_lock);
 	return page;
 }
 
@@ -877,9 +794,9 @@
  * will return with it held (but it may be dropped
  * during blocking operations..
  */
-static struct page * FASTCALL(__find_lock_page_helper(struct address_space *, unsigned long, struct page *));
-static struct page * __find_lock_page_helper(struct address_space *mapping,
-					unsigned long offset, struct page *hash)
+static struct page * FASTCALL(find_lock_page_helper(struct address_space *, unsigned long));
+static struct page * find_lock_page_helper(struct address_space *mapping,
+					unsigned long offset)
 {
 	struct page *page;
 
@@ -888,13 +805,13 @@
 	 * the hash-list needs a held write-lock.
 	 */
 repeat:
-	page = __find_page_nolock(mapping, offset, hash);
+	page = rat_lookup(&mapping->page_tree, offset);
 	if (page) {
 		page_cache_get(page);
 		if (TryLockPage(page)) {
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 			lock_page(page);
-			spin_lock(&pagecache_lock);
+			spin_lock(&mapping->i_shared_lock);
 
 			/* Has the page been re-allocated while we slept? */
 			if (page->mapping != mapping || page->index != offset) {
@@ -911,39 +828,40 @@
  * Same as the above, but lock the page too, verifying that
  * it's still valid once we own it.
  */
-struct page * __find_lock_page (struct address_space *mapping,
-				unsigned long offset, struct page **hash)
+struct page * find_lock_page(struct address_space *mapping, unsigned long offset)
 {
 	struct page *page;
 
-	spin_lock(&pagecache_lock);
-	page = __find_lock_page_helper(mapping, offset, *hash);
-	spin_unlock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
+	page = find_lock_page_helper(mapping, offset);
+	spin_unlock(&mapping->i_shared_lock);
+
 	return page;
 }
 
 /*
  * Same as above, but create the page if required..
  */
-struct page * find_or_create_page(struct address_space *mapping, unsigned long index, unsigned int gfp_mask)
+struct page * find_or_create_page(struct address_space *mapping,
+				  unsigned long index, unsigned int gfp_mask)
 {
 	struct page *page;
-	struct page **hash = page_hash(mapping, index);
 
-	spin_lock(&pagecache_lock);
-	page = __find_lock_page_helper(mapping, index, *hash);
-	spin_unlock(&pagecache_lock);
+	spin_lock(&mapping->i_shared_lock);
+	page = find_lock_page_helper(mapping, index);
+	spin_unlock(&mapping->i_shared_lock);
 	if (!page) {
 		struct page *newpage = alloc_page(gfp_mask);
 		if (newpage) {
-			spin_lock(&pagecache_lock);
-			page = __find_lock_page_helper(mapping, index, *hash);
+			spin_lock(&mapping->i_shared_lock);
+			page = find_lock_page_helper(mapping, index);
 			if (likely(!page)) {
-				page = newpage;
-				__add_to_page_cache(page, mapping, index, hash);
-				newpage = NULL;
+				if (!__add_to_page_cache(newpage, mapping, index)) {
+					page = newpage;
+					newpage = NULL;
+				}
 			}
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 			if (newpage == NULL)
 				lru_cache_add(page);
 			else 
@@ -970,10 +888,9 @@
  */
 struct page *grab_cache_page_nowait(struct address_space *mapping, unsigned long index)
 {
-	struct page *page, **hash;
+	struct page *page;
 
-	hash = page_hash(mapping, index);
-	page = __find_get_page(mapping, index, hash);
+	page = find_get_page(mapping, index);
 
 	if ( page ) {
 		if ( !TryLockPage(page) ) {
@@ -998,7 +915,7 @@
 	if ( unlikely(!page) )
 		return NULL;	/* Failed to allocate a page */
 
-	if ( unlikely(add_to_page_cache_unique(page, mapping, index, hash)) ) {
+	if (unlikely(add_to_page_cache(page, mapping, index)) ) {
 		/* Someone else grabbed the page already. */
 		page_cache_release(page);
 		return NULL;
@@ -1323,7 +1240,7 @@
 	}
 
 	for (;;) {
-		struct page *page, **hash;
+		struct page *page;
 		unsigned long end_index, nr, ret;
 
 		end_index = inode->i_size >> PAGE_CACHE_SHIFT;
@@ -1342,15 +1259,14 @@
 		/*
 		 * Try to find the data in the page cache..
 		 */
-		hash = page_hash(mapping, index);
 
-		spin_lock(&pagecache_lock);
-		page = __find_page_nolock(mapping, index, *hash);
+		spin_lock(&mapping->i_shared_lock);
+		page = rat_lookup(&mapping->page_tree, index);
 		if (!page)
 			goto no_cached_page;
 found_page:
 		page_cache_get(page);
-		spin_unlock(&pagecache_lock);
+		spin_unlock(&mapping->i_shared_lock);
 
 		if (!Page_Uptodate(page))
 			goto page_not_up_to_date;
@@ -1444,7 +1360,7 @@
 		 * We get here with the page cache lock held.
 		 */
 		if (!cached_page) {
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 			cached_page = page_cache_alloc(mapping);
 			if (!cached_page) {
 				desc->error = -ENOMEM;
@@ -1455,8 +1371,8 @@
 			 * Somebody may have added the page while we
 			 * dropped the page cache lock. Check for that.
 			 */
-			spin_lock(&pagecache_lock);
-			page = __find_page_nolock(mapping, index, *hash);
+			spin_lock(&mapping->i_shared_lock);
+			page = rat_lookup(&mapping->page_tree, index);
 			if (page)
 				goto found_page;
 		}
@@ -1464,9 +1380,13 @@
 		/*
 		 * Ok, add the new page to the hash-queues...
 		 */
+		if (__add_to_page_cache(cached_page, mapping, index) == -ENOMEM) {
+			spin_unlock (&mapping->i_shared_lock);
+			desc->error = -ENOMEM;
+			break;
+		}
 		page = cached_page;
-		__add_to_page_cache(page, mapping, index, hash);
-		spin_unlock(&pagecache_lock);
+		spin_unlock(&mapping->i_shared_lock);
 		lru_cache_add(page);		
 		cached_page = NULL;
 
@@ -1866,7 +1786,7 @@
 	struct file *file = area->vm_file;
 	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
 	struct inode *inode = mapping->host;
-	struct page *page, **hash;
+	struct page *page;
 	unsigned long size, pgoff, endoff;
 
 	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
@@ -1888,9 +1808,8 @@
 	/*
 	 * Do we have something in the page cache already?
 	 */
-	hash = page_hash(mapping, pgoff);
 retry_find:
-	page = __find_get_page(mapping, pgoff, hash);
+	page = find_get_page(mapping, pgoff);
 	if (!page)
 		goto no_cached_page;
 
@@ -2575,13 +2494,13 @@
 {
 	unsigned char present = 0;
 	struct address_space * as = vma->vm_file->f_dentry->d_inode->i_mapping;
-	struct page * page, ** hash = page_hash(as, pgoff);
+	struct page * page;
 
-	spin_lock(&pagecache_lock);
-	page = __find_page_nolock(as, pgoff, *hash);
+	spin_lock(&as->i_shared_lock);
+	page = rat_lookup(&as->page_tree, pgoff);
 	if ((page) && (Page_Uptodate(page)))
 		present = 1;
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&as->i_shared_lock);
 
 	return present;
 }
@@ -2724,20 +2643,24 @@
 				int (*filler)(void *,struct page*),
 				void *data)
 {
-	struct page **hash = page_hash(mapping, index);
 	struct page *page, *cached_page = NULL;
 	int err;
 repeat:
-	page = __find_get_page(mapping, index, hash);
+	page = find_get_page(mapping, index);
 	if (!page) {
 		if (!cached_page) {
 			cached_page = page_cache_alloc(mapping);
 			if (!cached_page)
 				return ERR_PTR(-ENOMEM);
 		}
-		page = cached_page;
-		if (add_to_page_cache_unique(page, mapping, index, hash))
+		err = add_to_page_cache(cached_page, mapping, index);
+		if (err == -EEXIST)
 			goto repeat;
+		if (err < 0) {
+			page_cache_release(cached_page);
+			return ERR_PTR(err);
+		}
+		page = cached_page;
 		cached_page = NULL;
 		err = filler(data, page);
 		if (err < 0) {
@@ -2792,19 +2715,23 @@
 static inline struct page * __grab_cache_page(struct address_space *mapping,
 				unsigned long index, struct page **cached_page)
 {
-	struct page *page, **hash = page_hash(mapping, index);
+	int err;
+	struct page *page;
 repeat:
-	page = __find_lock_page(mapping, index, hash);
+	page = find_lock_page(mapping, index);
 	if (!page) {
 		if (!*cached_page) {
 			*cached_page = page_cache_alloc(mapping);
 			if (!*cached_page)
 				return NULL;
 		}
-		page = *cached_page;
-		if (add_to_page_cache_unique(page, mapping, index, hash))
+		err = add_to_page_cache(*cached_page, mapping, index);
+		if (err == -EEXIST)
 			goto repeat;
-		*cached_page = NULL;
+		if (err == 0) {
+			page = *cached_page;
+			*cached_page = NULL;
+		}
 	}
 	return page;
 }
@@ -3064,30 +2991,3 @@
 		status = generic_osync_inode(inode, OSYNC_METADATA);
 	goto out_status;
 }
-
-void __init page_cache_init(unsigned long mempages)
-{
-	unsigned long htable_size, order;
-
-	htable_size = mempages;
-	htable_size *= sizeof(struct page *);
-	for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
-		;
-
-	do {
-		unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);
-
-		page_hash_bits = 0;
-		while((tmp >>= 1UL) != 0UL)
-			page_hash_bits++;
-
-		page_hash_table = (struct page **)
-			__get_free_pages(GFP_ATOMIC, order);
-	} while(page_hash_table == NULL && --order > 0);
-
-	printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n",
-	       (1 << page_hash_bits), order, (PAGE_SIZE << order));
-	if (!page_hash_table)
-		panic("Failed to allocate page hash table\n");
-	memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *));
-}
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/shmem.c linux-vger/mm/shmem.c
--- /datenklo/ref/linux-vger/mm/shmem.c	Fri Jan 25 13:16:06 2002
+++ linux-vger/mm/shmem.c	Tue Jan 29 03:20:18 2002
@@ -366,7 +366,7 @@
 	swp_entry_t *ptr;
 	unsigned long idx;
 	int offset;
-	
+
 	idx = 0;
 	spin_lock (&info->lock);
 	offset = shmem_clear_swp (entry, info->i_direct, SHMEM_NR_DIRECT);
@@ -385,11 +385,8 @@
 	spin_unlock (&info->lock);
 	return 0;
 found:
-	delete_from_swap_cache(page);
-	add_to_page_cache(page, info->vfs_inode.i_mapping, offset + idx);
-	SetPageDirty(page);
-	SetPageUptodate(page);
-	info->swapped--;
+	if (!move_from_swap_cache(page, offset + idx, info->vfs_inode.i_mapping))
+		info->swapped--;
 	spin_unlock(&info->lock);
 	return 1;
 }
@@ -426,6 +423,7 @@
 	struct address_space *mapping;
 	unsigned long index;
 	struct inode *inode;
+	int error;
 
 	if (!PageLocked(page))
 		BUG();
@@ -438,7 +436,6 @@
 	info = SHMEM_I(inode);
 	if (info->locked)
 		return fail_writepage(page);
-getswap:
 	swap = get_swap_page();
 	if (!swap.val)
 		return fail_writepage(page);
@@ -451,21 +448,12 @@
 	if (entry->val)
 		BUG();
 
-	/* Remove it from the page cache */
-	remove_inode_page(page);
-	page_cache_release(page);
-
-	/* Add it to the swap cache */
-	if (add_to_swap_cache(page, swap) != 0) {
-		/*
-		 * Raced with "speculative" read_swap_cache_async.
-		 * Add page back to page cache, unref swap, try again.
-		 */
-		add_to_page_cache_locked(page, mapping, index);
+	error = move_to_swap_cache(page, swap);
+	if (error) {
 		spin_unlock(&info->lock);
 		swap_free(swap);
-		goto getswap;
-	}
+		return fail_writepage(page);
+	} 
 
 	*entry = swap;
 	info->swapped++;
@@ -520,8 +508,6 @@
 	
 	shmem_recalc_inode(inode);
 	if (entry->val) {
-		unsigned long flags;
-
 		/* Look it up and read it in.. */
 		page = find_get_page(&swapper_space, entry->val);
 		if (!page) {
@@ -546,16 +532,15 @@
 			goto repeat;
 		}
 
-		/* We have to this with page locked to prevent races */
+		/* We have to do this with page locked to prevent races */
 		if (TryLockPage(page)) 
 			goto wait_retry;
 
+		if (move_from_swap_cache(page, idx, mapping))
+			goto nomem_retry;
+
 		swap_free(*entry);
 		*entry = (swp_entry_t) {0};
-		delete_from_swap_cache(page);
-		flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_referenced) | (1 << PG_arch_1));
-		page->flags = flags | (1 << PG_dirty);
-		add_to_page_cache_locked(page, mapping, idx);
 		info->swapped--;
 		spin_unlock (&info->lock);
 	} else {
@@ -579,7 +564,11 @@
 			return ERR_PTR(-ENOMEM);
 		clear_highpage(page);
 		inode->i_blocks += BLOCKS_PER_PAGE;
-		add_to_page_cache (page, mapping, idx);
+		while (add_to_page_cache(page, mapping, idx) == -ENOMEM) {
+			/* Yield for kswapd, and try again */
+			__set_current_state(TASK_RUNNING);
+			yield();
+		}
 	}
 
 	/* We have the page */
@@ -594,6 +583,16 @@
 	wait_on_page(page);
 	page_cache_release(page);
 	goto repeat;
+
+nomem_retry:
+	spin_unlock (&info->lock);
+	UnlockPage (page);
+	page_cache_release (page);
+
+	/* Yield for kswapd, and try again */
+	__set_current_state(TASK_RUNNING);
+	yield();
+	goto repeat;
 }
 
 static int shmem_getpage(struct inode * inode, unsigned long idx, struct page **ptr)
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/swap_state.c linux-vger/mm/swap_state.c
--- /datenklo/ref/linux-vger/mm/swap_state.c	Mon Nov 12 18:19:54 2001
+++ linux-vger/mm/swap_state.c	Tue Jan 29 03:30:00 2002
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/smp_lock.h>
+#include <linux/rat.h>
 
 #include <asm/pgtable.h>
 
@@ -37,11 +38,12 @@
 };
 
 struct address_space swapper_space = {
-	LIST_HEAD_INIT(swapper_space.clean_pages),
-	LIST_HEAD_INIT(swapper_space.dirty_pages),
-	LIST_HEAD_INIT(swapper_space.locked_pages),
-	0,				/* nrpages	*/
-	&swap_aops,
+	page_tree:	{0, GFP_ATOMIC, NULL},
+	clean_pages:	LIST_HEAD_INIT(swapper_space.clean_pages),
+	dirty_pages:	LIST_HEAD_INIT(swapper_space.dirty_pages),
+	locked_pages:	LIST_HEAD_INIT(swapper_space.locked_pages),
+	a_ops:		&swap_aops,
+	i_shared_lock:	SPIN_LOCK_UNLOCKED,
 };
 
 #ifdef SWAP_CACHE_INFO
@@ -69,17 +71,20 @@
 
 int add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
+	int error;
+
 	if (page->mapping)
 		BUG();
 	if (!swap_duplicate(entry)) {
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
-	if (add_to_page_cache_unique(page, &swapper_space, entry.val,
-			page_hash(&swapper_space, entry.val)) != 0) {
+
+	error = add_to_page_cache(page, &swapper_space, entry.val);
+	if (error) {
 		swap_free(entry);
 		INC_CACHE_INFO(exist_race);
-		return -EEXIST;
+		return error;
 	}
 	if (!PageLocked(page))
 		BUG();
@@ -121,14 +126,100 @@
 
 	entry.val = page->index;
 
-	spin_lock(&pagecache_lock);
+	spin_lock(&swapper_space.i_shared_lock);
 	__delete_from_swap_cache(page);
-	spin_unlock(&pagecache_lock);
+	spin_unlock(&swapper_space.i_shared_lock);
 
 	swap_free(entry);
 	page_cache_release(page);
 }
 
+int move_to_swap_cache(struct page *page, swp_entry_t entry)
+{
+	struct address_space *mapping = page->mapping;
+	void **pslot;
+	int err;
+
+	if (!mapping)
+		BUG();
+
+	if (!swap_duplicate(entry)) {
+		INC_CACHE_INFO(noent_race);
+		return -ENOENT;
+	}
+
+	spin_lock(&swapper_space.i_shared_lock);
+	spin_lock(&mapping->i_shared_lock);
+
+	err = rat_reserve(&swapper_space.page_tree, entry.val, &pslot);
+	if (!err) {
+		/* Remove it from the page cache */
+		__remove_inode_page(page);
+
+		/* Add it to the swap cache */
+		*pslot = page;
+		page->flags = ((page->flags & ~(1 << PG_uptodate | 1 << PG_error
+						| 1 << PG_dirty  | 1 << PG_referenced
+						| 1 << PG_arch_1 | 1 << PG_checked))
+			       | (1 << PG_locked));
+		page->index = entry.val;
+		add_page_to_inode_queue(&swapper_space, page);
+		atomic_inc(&page_cache_size);
+	}
+
+	spin_unlock(&mapping->i_shared_lock);
+	spin_unlock(&swapper_space.i_shared_lock);
+
+	if (!err) {
+		INC_CACHE_INFO(add_total);
+		return 0;
+	}
+
+	swap_free(entry);
+
+	if (err == -EEXIST)
+		INC_CACHE_INFO(exist_race);
+
+	return err;
+}
+
+int move_from_swap_cache(struct page *page, unsigned long index,
+		struct address_space *mapping)
+{
+	void **pslot;
+	int err;
+
+	if (!PageLocked(page))
+		BUG();
+
+	spin_lock(&swapper_space.i_shared_lock);
+	spin_lock(&mapping->i_shared_lock);
+
+	err = rat_reserve(&mapping->page_tree, index, &pslot);
+	if (!err) {
+		swp_entry_t entry;
+
+		block_flushpage(page, 0);
+		entry.val = page->index;
+		__delete_from_swap_cache(page);
+		swap_free(entry);
+
+		*pslot = page;
+		page->flags = ((page->flags & ~(1 << PG_uptodate | 1 << PG_error
+						| 1 << PG_referenced | 1 << PG_arch_1
+						| 1 << PG_checked))
+			       | (1 << PG_dirty));
+		page->index = index;
+		add_page_to_inode_queue (mapping, page);
+		atomic_inc(&page_cache_size);
+	}
+
+	spin_lock(&mapping->i_shared_lock);
+	spin_lock(&swapper_space.i_shared_lock);
+
+	return err;
+}
+
 /* 
  * Perform a free_page(), also freeing any swap cache associated with
  * this page if it is the last user of the page. Can not do a lock_page,
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/swapfile.c linux-vger/mm/swapfile.c
--- /datenklo/ref/linux-vger/mm/swapfile.c	Fri Jan 25 13:16:06 2002
+++ linux-vger/mm/swapfile.c	Tue Jan 29 03:05:40 2002
@@ -239,10 +239,10 @@
 		/* Is the only swap cache user the cache itself? */
 		if (p->swap_map[SWP_OFFSET(entry)] == 1) {
 			/* Recheck the page count with the pagecache lock held.. */
-			spin_lock(&pagecache_lock);
+			spin_lock(&swapper_space.i_shared_lock);
 			if (page_count(page) - !!page->buffers == 2)
 				retval = 1;
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&swapper_space.i_shared_lock);
 		}
 		swap_info_put(p);
 	}
@@ -307,13 +307,13 @@
 	retval = 0;
 	if (p->swap_map[SWP_OFFSET(entry)] == 1) {
 		/* Recheck the page count with the pagecache lock held.. */
-		spin_lock(&pagecache_lock);
+		spin_lock(&swapper_space.i_shared_lock);
 		if (page_count(page) - !!page->buffers == 2) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		spin_unlock(&pagecache_lock);
+		spin_unlock(&swapper_space.i_shared_lock);
 	}
 	swap_info_put(p);
 
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/vmscan.c linux-vger/mm/vmscan.c
--- /datenklo/ref/linux-vger/mm/vmscan.c	Fri Jan 25 13:16:06 2002
+++ linux-vger/mm/vmscan.c	Tue Jan 29 03:12:59 2002
@@ -137,10 +137,16 @@
 		 * (adding to the page cache will clear the dirty
 		 * and uptodate bits, so we need to do it again)
 		 */
-		if (add_to_swap_cache(page, entry) == 0) {
+		switch (add_to_swap_cache(page, entry)) {
+		case 0:
 			SetPageUptodate(page);
 			set_page_dirty(page);
 			goto set_swap_pte;
+		case -ENOMEM:
+			swap_free(entry);
+			goto preserve;
+		default:
+			break;
 		}
 		/* Raced with "speculative" read_swap_cache_async */
 		swap_free(entry);
@@ -338,6 +344,7 @@
 static int shrink_cache(int nr_pages, zone_t * classzone, unsigned int gfp_mask, int priority)
 {
 	struct list_head * entry;
+	struct address_space *mapping;
 	int max_scan = nr_inactive_pages / priority;
 	int max_mapped = nr_pages << (9 - priority);
 
@@ -392,7 +399,9 @@
 			continue;
 		}
 
-		if (PageDirty(page) && is_page_cache_freeable(page) && page->mapping) {
+		mapping = page->mapping;
+
+		if (PageDirty(page) && is_page_cache_freeable(page) && mapping) {
 			/*
 			 * It is not critical here to write it only if
 			 * the page is unmapped beause any direct writer
@@ -403,7 +412,7 @@
 			 */
 			int (*writepage)(struct page *);
 
-			writepage = page->mapping->a_ops->writepage;
+			writepage = mapping->a_ops->writepage;
 			if ((gfp_mask & __GFP_FS) && writepage) {
 				ClearPageDirty(page);
 				SetPageLaunder(page);
@@ -430,7 +439,7 @@
 			page_cache_get(page);
 
 			if (try_to_release_page(page, gfp_mask)) {
-				if (!page->mapping) {
+				if (!mapping) {
 					/*
 					 * We must not allow an anon page
 					 * with no buffers to be visible on
@@ -467,13 +476,22 @@
 			}
 		}
 
-		spin_lock(&pagecache_lock);
+		/*
+		 * Page is locked, so mapping can't change under our
+		 * feet.
+		 */
+		if (!mapping) {
+			UnlockPage (page);
+			goto page_mapped;
+		}
+
+		spin_lock(&mapping->i_shared_lock);
 
 		/*
 		 * this is the non-racy check for busy page.
 		 */
-		if (!page->mapping || !is_page_cache_freeable(page)) {
-			spin_unlock(&pagecache_lock);
+		if (!is_page_cache_freeable(page)) {
+			spin_unlock(&mapping->i_shared_lock);
 			UnlockPage(page);
 page_mapped:
 			if (--max_mapped >= 0)
@@ -493,7 +511,7 @@
 		 * the page is freeable* so not in use by anybody.
 		 */
 		if (PageDirty(page)) {
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 			UnlockPage(page);
 			continue;
 		}
@@ -501,12 +519,12 @@
 		/* point of no return */
 		if (likely(!PageSwapCache(page))) {
 			__remove_inode_page(page);
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 		} else {
 			swp_entry_t swap;
 			swap.val = page->index;
 			__delete_from_swap_cache(page);
-			spin_unlock(&pagecache_lock);
+			spin_unlock(&mapping->i_shared_lock);
 			swap_free(swap);
 		}
 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 15:54 Christoph Hellwig
@ 2002-01-29 19:27 ` Linus Torvalds
  2002-01-29 21:40   ` David S. Miller
  2002-01-29 23:02   ` Momchil Velikov
  2002-01-29 23:00 ` William Lee Irwin III
  1 sibling, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-01-29 19:27 UTC (permalink / raw)
  To: linux-kernel

In article <20020129165444.A26626@caldera.de>,
Christoph Hellwig  <hch@caldera.de> wrote:
>I've ported my hacked up version of Momchil Velikov's radix tree
>radix tree pagecache to 2.5.3-pre{5,6}.
>
>The changes over the 2.4.17 version are:
>
>  o use mempool to avoid OOM situation involving radix nodes.
>  o remove add_to_page_cache_locked, it was unused in the 2.4.17 patch.
>  o unify add_to_page and add_to_page_unique
>
>It gives nice scalability improvements on big machines and drops the
>memory usage on small ones (if you consider my 64MB Athlon small :)).

Looks good. 

In fact, this looks a _lot_ more palatable than the "scalable page
cache" approach with the hashed locks.

Can you post the numbers on scalability (I can see the locking
improvement, but if you have numbers I'd be even happier) and any
benchmarks you have?

The only real complaint I have is that I'd rather see "radix_root" than
"rat_root". Maybe it is just me, but the latter makes me wonder about
the sex-lives of small furry mammals. Which is not what I really want to
be thinking about.

It looks straigthforward enough, so if you feel it is stable (and
cleaned up), I'd suggest just submitting it for real.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 19:27 ` Linus Torvalds
@ 2002-01-29 21:40   ` David S. Miller
  2002-01-29 22:07     ` Linus Torvalds
  2002-01-29 23:02   ` Momchil Velikov
  1 sibling, 1 reply; 87+ messages in thread
From: David S. Miller @ 2002-01-29 21:40 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

   From: torvalds@transmeta.com (Linus Torvalds)
   Date: Tue, 29 Jan 2002 19:27:43 +0000 (UTC)

   In article <20020129165444.A26626@caldera.de>,
   Christoph Hellwig  <hch@caldera.de> wrote:
   >I've ported my hacked up version of Momchil Velikov's radix tree
   >radix tree pagecache to 2.5.3-pre{5,6}.

   Looks good. 

I like the changes too, but I'd like to see some numbers
as well.

My only concern is that it doesn't handle one particular
case better than the ugly per-hashchain lock version.  When we're
running through a file and the task doing this changes cpus.
In that case we'll get a lock collision and the per-hashchain lock
changes would at least potentially avoid that.

For web serving sizeable files this might matter, but probably
we don't really care.  Probably it doesn't matter and we are limited
to moving one lock over in such an event anyways.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 21:40   ` David S. Miller
@ 2002-01-29 22:07     ` Linus Torvalds
  2002-01-29 23:01       ` Rik van Riel
  0 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2002-01-29 22:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Tue, 29 Jan 2002, David S. Miller wrote:
>
> I like the changes too, but I'd like to see some numbers
> as well.

Absolutely. Even something as simplistic as "lmbench file re-read" changed
by 0.1% or something. I definitely believe in the scalability part (as
long as the different processes don't all touch the same mapping all the
time), so I'm more interested in the "what is the impact of the hash chain
lookup/walk vs the radix tree walk" kinds of numbers.

> My only concern is that it doesn't handle one particular
> case better than the ugly per-hashchain lock version.  When we're
> running through a file and the task doing this changes cpus.
> In that case we'll get a lock collision and the per-hashchain lock
> changes would at least potentially avoid that.

Well, I would put it the other way around: the advantage of the
per-mapping lock (as opposed to the per-hashchain one) is that for common
access patterns where one process walks the whole file, we get added
locality from the per-mapping approach. While the per-hashchain one tends
to take a _lot_ of different locks (module bucket behaviour, of course).

And locality is good - especially as we try to make processes have CPU
affinity anyway. So I'd expect the per-mapping lock to generally show
_nicer_ cache behaviour.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 22:07     ` Linus Torvalds
@ 2002-01-29 23:01       ` Rik van Riel
  2002-01-29 23:32         ` Alan Cox
  0 siblings, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2002-01-29 23:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, linux-kernel

On Tue, 29 Jan 2002, Linus Torvalds wrote:
> On Tue, 29 Jan 2002, David S. Miller wrote:
> >
> > I like the changes too, but I'd like to see some numbers
> > as well.
>
> Absolutely. Even something as simplistic as "lmbench file re-read" changed
> by 0.1% or something. I definitely believe in the scalability part (as
> long as the different processes don't all touch the same mapping all the
> time), so I'm more interested in the "what is the impact of the hash chain
> lookup/walk vs the radix tree walk" kinds of numbers.

There's another nice advantage to the radix tree.

We can let oracle shared memory segments use 4 MB pages,
but still use the normal page cache code to look up the
pages.

With a radix tree there is no overhead in using different
page sizes since we'll just run into them in the tree.

(as opposed to the horrors of trying a hash lookup with
multiple page orders)

regards,

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 23:01       ` Rik van Riel
@ 2002-01-29 23:32         ` Alan Cox
  2002-01-29 23:35           ` Rik van Riel
  2002-01-31 23:44           ` Anton Blanchard
  0 siblings, 2 replies; 87+ messages in thread
From: Alan Cox @ 2002-01-29 23:32 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, David S. Miller, linux-kernel

> We can let oracle shared memory segments use 4 MB pages,
> but still use the normal page cache code to look up the
> pages.

That has some potential big wins beyond oracle. Some of the big number
crunching algorithms also benefit heavily from 4Mb pages even when you
try and minimise tlb misses.

Just remember to read the ppro/early pII errata when starting - there are
some page invalidation funnies. If I remember rightly we have to kill MCE
support on PPro if we do 4Mb pages that may overlap 4K ones

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 23:32         ` Alan Cox
@ 2002-01-29 23:35           ` Rik van Riel
  2002-01-30  3:00             ` Daniel Phillips
  2002-01-31 23:44           ` Anton Blanchard
  1 sibling, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2002-01-29 23:35 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linus Torvalds, David S. Miller, linux-kernel

On Tue, 29 Jan 2002, Alan Cox wrote:

> > We can let oracle shared memory segments use 4 MB pages,
> > but still use the normal page cache code to look up the
> > pages.
>
> That has some potential big wins beyond oracle. Some of the big number
> crunching algorithms also benefit heavily from 4Mb pages even when you
> try and minimise tlb misses.

Note that I'm not sure whether the complexity of using
4 MB pages is worth it or not ... I just like the fact
that the radix tree page cache gives us the opportunity
to easily implement and try it.

I like radix trees for making our design more flexible
and opening doors to possible new functionality.

It could even be a CONFIG option for the embedded folks,
if we can keep the code isolated enough ;)

cheers,

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 23:35           ` Rik van Riel
@ 2002-01-30  3:00             ` Daniel Phillips
  0 siblings, 0 replies; 87+ messages in thread
From: Daniel Phillips @ 2002-01-30  3:00 UTC (permalink / raw)
  To: Rik van Riel, Alan Cox; +Cc: Linus Torvalds, David S. Miller, linux-kernel

On January 30, 2002 12:35 am, Rik van Riel wrote:
> On Tue, 29 Jan 2002, Alan Cox wrote:
> > > We can let oracle shared memory segments use 4 MB pages,
> > > but still use the normal page cache code to look up the
> > > pages.
> >
> > That has some potential big wins beyond oracle. Some of the big number
> > crunching algorithms also benefit heavily from 4Mb pages even when you
> > try and minimise tlb misses.
> 
> Note that I'm not sure whether the complexity of using
> 4 MB pages is worth it or not ... I just like the fact
> that the radix tree page cache gives us the opportunity
> to easily implement and try it.
> 
> I like radix trees for making our design more flexible
> and opening doors to possible new functionality.
> 
> It could even be a CONFIG option for the embedded folks,
> if we can keep the code isolated enough ;)

Making it a CONFIG option would preclude leveraging the advantages of the 
radix tree that fall out from its ordered nature, so I'd vote for all or 
nothing in this case.

I also have some perhaps-twisted plans for this thing, but if this patch 
passes muster on its own merits - that is, in the context we're currently 
using the pcache hash as opposed to new capabilities that can be leveraged - 
that's the ideal situation.  Hence no, or minimal, talking up other 
advantages.

I'm inclined to think that the radix tree has locality advantages for UP as 
well as SMP, under certain types of filesystem loads, and that it is never 
worse.  Well, we shall see about that soon enogh.

-- 
Daniel

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 23:32         ` Alan Cox
  2002-01-29 23:35           ` Rik van Riel
@ 2002-01-31 23:44           ` Anton Blanchard
  2002-02-01  0:34             ` Alan Cox
  1 sibling, 1 reply; 87+ messages in thread
From: Anton Blanchard @ 2002-01-31 23:44 UTC (permalink / raw)
  To: Alan Cox; +Cc: Rik van Riel, Linus Torvalds, David S. Miller, linux-kernel

> That has some potential big wins beyond oracle. Some of the big number
> crunching algorithms also benefit heavily from 4Mb pages even when you
> try and minimise tlb misses.

There are further wins on some architectures, eg POWER has hardware
prefetch streams which terminate at a page boundary. With a 4kB pagesize
the prefetch engine will have to restart every 4kB, so we would want to
use 16MB pages if possible.

How would we allocate large pages? Would there be a boot option to
reserve an area of RAM for large pages only?

Anton

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 23:44           ` Anton Blanchard
@ 2002-02-01  0:34             ` Alan Cox
  2002-02-01 11:04               ` Rik van Riel
  0 siblings, 1 reply; 87+ messages in thread
From: Alan Cox @ 2002-02-01  0:34 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Alan Cox, Rik van Riel, Linus Torvalds, David S. Miller,
	linux-kernel

> the prefetch engine will have to restart every 4kB, so we would want to
> use 16MB pages if possible.
> 
> How would we allocate large pages? Would there be a boot option to
> reserve an area of RAM for large pages only?

If you have an rmap all you have to do is to avoid smearing kernel objects
around lots of 16Mb page sets. If need be you can then get a 16Mb page
back just by shuffling user pages.

It does make the performance analysis much more interesting though.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  0:34             ` Alan Cox
@ 2002-02-01 11:04               ` Rik van Riel
  2002-02-01 11:33                 ` Arjan van de Ven
  2002-02-02 18:57                 ` Richard Henderson
  0 siblings, 2 replies; 87+ messages in thread
From: Rik van Riel @ 2002-02-01 11:04 UTC (permalink / raw)
  To: Alan Cox; +Cc: Anton Blanchard, Linus Torvalds, David S. Miller, linux-kernel

On Fri, 1 Feb 2002, Alan Cox wrote:

> > the prefetch engine will have to restart every 4kB, so we would want to
> > use 16MB pages if possible.
> >
> > How would we allocate large pages? Would there be a boot option to
> > reserve an area of RAM for large pages only?
>
> If you have an rmap all you have to do is to avoid smearing kernel objects
> around lots of 16Mb page sets. If need be you can then get a 16Mb page
> back just by shuffling user pages.
>
> It does make the performance analysis much more interesting though.

Actually, I suspect that for most workloads the amount of
large pages vs. the amount of small pages should be fairly
static.

In that case we can just reclaim an old large page from
the inactive_clean list whenever we want to allocate a new
one.

As for not putting kernel objects everywhere, this comes
naturally with HIGHMEM ;)

regards,

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 11:04               ` Rik van Riel
@ 2002-02-01 11:33                 ` Arjan van de Ven
  2002-02-02 18:57                 ` Richard Henderson
  1 sibling, 0 replies; 87+ messages in thread
From: Arjan van de Ven @ 2002-02-01 11:33 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel

Rik van Riel wrote:
> 
> On Fri, 1 Feb 2002, Alan Cox wrote:
> 
> > > the prefetch engine will have to restart every 4kB, so we would want to
> > > use 16MB pages if possible.
> > >
> > > How would we allocate large pages? Would there be a boot option to
> > > reserve an area of RAM for large pages only?
> >
> > If you have an rmap all you have to do is to avoid smearing kernel objects
> > around lots of 16Mb page sets. If need be you can then get a 16Mb page
> > back just by shuffling user pages.
> >
> > It does make the performance analysis much more interesting though.
> 
> Actually, I suspect that for most workloads the amount of
> large pages vs. the amount of small pages should be fairly
> static.
> 
> In that case we can just reclaim an old large page from
> the inactive_clean list whenever we want to allocate a new
> one.
> 
> As for not putting kernel objects everywhere, this comes
> naturally with HIGHMEM ;)

well except when you start doing pagetables high, as Andrea is doing
(and it makes tons of sense to do that)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 11:04               ` Rik van Riel
  2002-02-01 11:33                 ` Arjan van de Ven
@ 2002-02-02 18:57                 ` Richard Henderson
  2002-02-02 21:15                   ` Rik van Riel
  1 sibling, 1 reply; 87+ messages in thread
From: Richard Henderson @ 2002-02-02 18:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Anton Blanchard, Linus Torvalds, David S. Miller,
	linux-kernel

On Fri, Feb 01, 2002 at 09:04:45AM -0200, Rik van Riel wrote:
> As for not putting kernel objects everywhere, this comes
> naturally with HIGHMEM ;)

Not for 64-bit targets.


r~

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-02 18:57                 ` Richard Henderson
@ 2002-02-02 21:15                   ` Rik van Riel
  0 siblings, 0 replies; 87+ messages in thread
From: Rik van Riel @ 2002-02-02 21:15 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Alan Cox, Anton Blanchard, Linus Torvalds, David S. Miller,
	linux-kernel

On Sat, 2 Feb 2002, Richard Henderson wrote:
> On Fri, Feb 01, 2002 at 09:04:45AM -0200, Rik van Riel wrote:
> > As for not putting kernel objects everywhere, this comes
> > naturally with HIGHMEM ;)
>
> Not for 64-bit targets.

Agreed.  We'll probably want to find something else to fix this
problem ...

(like, allocating kernel area as much contiguously as possible,
leaving space for large freeable areas elsewhere)

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 19:27 ` Linus Torvalds
  2002-01-29 21:40   ` David S. Miller
@ 2002-01-29 23:02   ` Momchil Velikov
  2002-01-29 23:33     ` Linus Torvalds
  2002-01-30  3:02     ` Daniel Phillips
  1 sibling, 2 replies; 87+ messages in thread
From: Momchil Velikov @ 2002-01-29 23:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

Linus> In article <20020129165444.A26626@caldera.de>,
Linus> Christoph Hellwig  <hch@caldera.de> wrote:
>> I've ported my hacked up version of Momchil Velikov's radix tree
>> radix tree pagecache to 2.5.3-pre{5,6}.
>> 
>> The changes over the 2.4.17 version are:
>> 
>> o use mempool to avoid OOM situation involving radix nodes.
>> o remove add_to_page_cache_locked, it was unused in the 2.4.17 patch.
>> o unify add_to_page and add_to_page_unique
>> 
>> It gives nice scalability improvements on big machines and drops the
>> memory usage on small ones (if you consider my 64MB Athlon small :)).

Linus> Can you post the numbers on scalability (I can see the locking
Linus> improvement, but if you have numbers I'd be even happier) and any
Linus> benchmarks you have?

Well, these are dbench numbers from December, it's
2.4.17. Unfortunately, it appears OSDL have trouble with 2.5 currently ...

FWIW, box is 8-way PIII Xeon, 700MHz, 1MB cache, 8G RAM

rat-7 is with 128-way radix tree branch factor, rat-4 is with 16-way.

#Clients        2.4.17     2.4.17-rat-7   2.4.17-rat-4
---------------------------------------------------------
1                81.81        82.70         79.49
2               131.77       133.15        116.32
3               179.74       188.04        184.80
4               221.60       228.70        223.97
5               249.86       252.89        258.77
6               260.56       277.70        265.20
7               285.82       287.47        281.27
8               263.61       258.81        256.29
9               271.06       268.29        261.04
10              261.23       265.82        259.34
11              256.82       260.38        258.35
12              255.55       255.68        252.78
13              252.70       254.02        249.42
14              251.41       253.93        252.21
15              255.27       257.13        262.21
16              156.81       146.69        180.77
17              113.00       103.32        101.14
18               81.06        78.98         86.77
19               76.24        40.09         39.89
20               17.51        17.64         17.53

The results are similar on 4-way OSDL boxen and on the 12- and 16-way
PPC64 runs by Anton Blanchard:

   # clients
   [1 - ncpu]         - linear increase in the throughput, small improvement over the
                        stock kernel, I guess we quickly hit other locks
   [ncpu - 2 * ncpu]  - flat
   [2 * ncpu, +infty) - drops down do zero

Linus> The only real complaint I have is that I'd rather see "radix_root" than
Linus> "rat_root". Maybe it is just me, but the latter makes me wonder about
Linus> the sex-lives of small furry mammals. Which is not what I really want to
Linus> be thinking about.

Done. rat_* -> radix_tree_*

Linus> It looks straigthforward enough, so if you feel it is stable (and
Linus> cleaned up), I'd suggest just submitting it for real.

I'll wait for a day for some more comments (e.g. Ingo) and will submit
it.

Regards,
-velco



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 23:02   ` Momchil Velikov
@ 2002-01-29 23:33     ` Linus Torvalds
  2002-01-29 23:45       ` Christoph Hellwig
  2002-01-30 21:25       ` Momchil Velikov
  2002-01-30  3:02     ` Daniel Phillips
  1 sibling, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-01-29 23:33 UTC (permalink / raw)
  To: Momchil Velikov; +Cc: linux-kernel

On 30 Jan 2002, Momchil Velikov wrote:
>
> rat-7 is with 128-way radix tree branch factor, rat-4 is with 16-way.

Hmm. It appears that the 128-way one is no slower, at least.

Probably not very relevant question: What are the memory usage
implications? I love having that global big page_hash_table gone, but what
are the differences in memory usage between rat-4 and rat-7? In
particular, it _looks_ like the way the radix_node is done, it will
basically always be a factor-of-two+1 words, which sounds like the worst
possible schenario from an allocator standpoint.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 23:33     ` Linus Torvalds
@ 2002-01-29 23:45       ` Christoph Hellwig
  2002-01-30 21:25       ` Momchil Velikov
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2002-01-29 23:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Momchil Velikov

In article <Pine.LNX.4.33.0201291515480.1747-100000@penguin.transmeta.com> you wrote:
> it _looks_ like the way the radix_node is done, it will
> basically always be a factor-of-two+1 words, which sounds like the worst
> possible schenario from an allocator standpoint.

One advantage of the slab allocator is that if works efficiently with
odd object sizes..

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 23:33     ` Linus Torvalds
  2002-01-29 23:45       ` Christoph Hellwig
@ 2002-01-30 21:25       ` Momchil Velikov
  2002-01-30 22:05         ` John Stoffel
  1 sibling, 1 reply; 87+ messages in thread
From: Momchil Velikov @ 2002-01-30 21:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

Linus> On 30 Jan 2002, Momchil Velikov wrote:
>> 
>> rat-7 is with 128-way radix tree branch factor, rat-4 is with 16-way.

Linus> Hmm. It appears that the 128-way one is no slower, at least.

Linus> Probably not very relevant question: What are the memory usage
Linus> implications? I love having that global big page_hash_table gone, but what
Linus> are the differences in memory usage between rat-4 and rat-7? In
Linus> particular, it _looks_ like the way the radix_node is done, it will
Linus> basically always be a factor-of-two+1 words, which sounds like the worst
Linus> possible schenario from an allocator standpoint.

Memory overhead due to allocator overhead is of no concern with the
slab allocator. What matters most is probably the overhead of the
radix tree nodes themselves, compared to the two pointers in struct
page with the hash table approach. rat-4 variant ought to have less
overhead compared to rat-7 at the expense of deeper/higher tree. I
have no figures for the actual memory usage though. For small files it
should be negligible, i.e. one radix tree node, 68 or 516 bytes for
rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes.  The
worst case would be very large file with a few cached pages with
offsets uniformly distributed across the whole file, that is having
deep tree with only one page hanging off each leaf node.

Regards,
-velco

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-30 21:25       ` Momchil Velikov
@ 2002-01-30 22:05         ` John Stoffel
  2002-01-30 22:15           ` Momchil Velikov
  2002-01-30 22:22           ` Christoph Hellwig
  0 siblings, 2 replies; 87+ messages in thread
From: John Stoffel @ 2002-01-30 22:05 UTC (permalink / raw)
  To: Momchil Velikov; +Cc: Linus Torvalds, linux-kernel

Momchil> Memory overhead due to allocator overhead is of no concern with the
Momchil> slab allocator. What matters most is probably the overhead of the
Momchil> radix tree nodes themselves, compared to the two pointers in struct
Momchil> page with the hash table approach. rat-4 variant ought to have less
Momchil> overhead compared to rat-7 at the expense of deeper/higher tree. I
Momchil> have no figures for the actual memory usage though. For small files it
Momchil> should be negligible, i.e. one radix tree node, 68 or 516 bytes for
Momchil> rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes.  The
Momchil> worst case would be very large file with a few cached pages with
Momchil> offsets uniformly distributed across the whole file, that is having
Momchil> deep tree with only one page hanging off each leaf node.

Isn't this a good place to use AVL trees then, since they balance
automatically?  Admittedly, it may be more overhead than we want in
the case where the tree is balanced by default anyway.  

Again, benchmarks would be the good thing to see either way.

John

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-30 22:05         ` John Stoffel
@ 2002-01-30 22:15           ` Momchil Velikov
  2002-01-31  2:33             ` Andrea Arcangeli
  2002-01-31 10:41             ` Josh MacDonald
  2002-01-30 22:22           ` Christoph Hellwig
  1 sibling, 2 replies; 87+ messages in thread
From: Momchil Velikov @ 2002-01-30 22:15 UTC (permalink / raw)
  To: John Stoffel; +Cc: Linus Torvalds, linux-kernel

>>>>> "John" == John Stoffel <stoffel@casc.com> writes:

Momchil> Memory overhead due to allocator overhead is of no concern with the
Momchil> slab allocator. What matters most is probably the overhead of the
Momchil> radix tree nodes themselves, compared to the two pointers in struct
Momchil> page with the hash table approach. rat-4 variant ought to have less
Momchil> overhead compared to rat-7 at the expense of deeper/higher tree. I
Momchil> have no figures for the actual memory usage though. For small files it
Momchil> should be negligible, i.e. one radix tree node, 68 or 516 bytes for
Momchil> rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes.  The
Momchil> worst case would be very large file with a few cached pages with
Momchil> offsets uniformly distributed across the whole file, that is having
Momchil> deep tree with only one page hanging off each leaf node.

John> Isn't this a good place to use AVL trees then, since they balance
John> automatically?  Admittedly, it may be more overhead than we want in
John> the case where the tree is balanced by default anyway.  

The widespread opinion is that binary trees are generally way too deep
compared to radix trees, so searches have larger cache footprint.

John> Again, benchmarks would be the good thing to see either way.

I've posted some with 2.4.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-30 22:15           ` Momchil Velikov
@ 2002-01-31  2:33             ` Andrea Arcangeli
  2002-01-31 13:58               ` Rik van Riel
  2002-01-31 10:41             ` Josh MacDonald
  1 sibling, 1 reply; 87+ messages in thread
From: Andrea Arcangeli @ 2002-01-31  2:33 UTC (permalink / raw)
  To: Momchil Velikov; +Cc: John Stoffel, Linus Torvalds, linux-kernel

On Thu, Jan 31, 2002 at 12:15:09AM +0200, Momchil Velikov wrote:
> >>>>> "John" == John Stoffel <stoffel@casc.com> writes:
> 
> Momchil> Memory overhead due to allocator overhead is of no concern with the
> Momchil> slab allocator. What matters most is probably the overhead of the
> Momchil> radix tree nodes themselves, compared to the two pointers in struct
> Momchil> page with the hash table approach. rat-4 variant ought to have less
> Momchil> overhead compared to rat-7 at the expense of deeper/higher tree. I
> Momchil> have no figures for the actual memory usage though. For small files it
> Momchil> should be negligible, i.e. one radix tree node, 68 or 516 bytes for
> Momchil> rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes.  The
> Momchil> worst case would be very large file with a few cached pages with
> Momchil> offsets uniformly distributed across the whole file, that is having
> Momchil> deep tree with only one page hanging off each leaf node.
> 
> John> Isn't this a good place to use AVL trees then, since they balance
> John> automatically?  Admittedly, it may be more overhead than we want in
> John> the case where the tree is balanced by default anyway.  

rbtree are not too overhead for the rebalance, but the problem of not
using the hashtable is that you can't just pay with ram globally. Of
course you can enlarge the array for each radix node (that will end to
be a waste with an huge number of inodes with only a page in them), but
as the height of the tree increases performance will go down anyways (it
will never be as large as the global hashtable that we can tune
optimally at boot). With the hashtable the ram we pay for is not per inode,
but it's global.

I'm not optimistic it will work (even if it can be better than an rb or
an avl during the lookups because it pays more ram per tree node [and
per-inode], but still nearly not enoguh ram per node with big files to
be fast, and a big waste of ram with lots of inodes with a only 1 page
in them)

So I wouldn't merge it, at least until some math is done for the memory
consumation with 500k inodes with only 1 page in them each, and on the
number of heights/levels that must be walked during the tree lookup,
during a access at offset 10G (or worst case in general [biggest
height]) of an inode with 10G just allocated in pagecache.

> 
> The widespread opinion is that binary trees are generally way too deep
> compared to radix trees, so searches have larger cache footprint.
> 
> John> Again, benchmarks would be the good thing to see either way.
> 
> I've posted some with 2.4.
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31  2:33             ` Andrea Arcangeli
@ 2002-01-31 13:58               ` Rik van Riel
  2002-01-31 14:36                 ` Andrea Arcangeli
  0 siblings, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2002-01-31 13:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Momchil Velikov, John Stoffel, Linus Torvalds, linux-kernel

On Thu, 31 Jan 2002, Andrea Arcangeli wrote:

> So I wouldn't merge it, at least until some math is done for the memory
> consumation with 500k inodes with only 1 page in them each, and on the
> number of heights/levels that must be walked during the tree lookup,
> during a access at offset 10G (or worst case in general [biggest
> height]) of an inode with 10G just allocated in pagecache.

Ummm, I don't see how this worst case is any more realistic
as the worst case for the hash table (where all pages live
in very few hash buckets and we have really deep chains).

People just don't go around caching a single page each for
all of their 10 GB files and even if they _wanted to_ they
couldn't because of the readahead code.

I suspect that for large files we'll always have around
min_readahead logically contiguous pages cached, if not more.

regards,

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 13:58               ` Rik van Riel
@ 2002-01-31 14:36                 ` Andrea Arcangeli
  2002-01-31 15:32                   ` Alan Cox
                                     ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2002-01-31 14:36 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Momchil Velikov, John Stoffel, Linus Torvalds, linux-kernel

On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
> On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> 
> > So I wouldn't merge it, at least until some math is done for the memory
> > consumation with 500k inodes with only 1 page in them each, and on the
> > number of heights/levels that must be walked during the tree lookup,
> > during a access at offset 10G (or worst case in general [biggest
> > height]) of an inode with 10G just allocated in pagecache.
> 
> Ummm, I don't see how this worst case is any more realistic
> as the worst case for the hash table (where all pages live
> in very few hash buckets and we have really deep chains).

Mathematically the hashtable complexity is O(N). But probabilistically
with the tuning we do on the hashtable size, the collisions will be
nearly zero for most buckets for of most workloads. Despite the worst
case is with all the pagecache and swapcache queued in a single linked
list :).

So in short math is wrong about O(N) being bad, hashtable is infact the
only way we can get an effective O(1) by just paying RAM. We pay with
RAM and we get performance back to us.

but with the radix tree (please correct me if I'm wrong) the height will
increase eventually, no matter what (so it won't be an effective O(1)
like the hashtable provides in real life, not the worst case, the common
case). With the hashtable the height won't increase instead.

In short to get the same performance with the radix tree, you'd need to
waste an huge amount of ram per inode, the hashtable is instead global
so we pay only once for all the pages in the system. At least this is my
understanding, I'm not a radix tree guru though, so I may be missing
something.

> 
> People just don't go around caching a single page each for
> all of their 10 GB files and even if they _wanted to_ they
> couldn't because of the readahead code.
> 
> I suspect that for large files we'll always have around
> min_readahead logically contiguous pages cached, if not more.

readahead really doesn't matter at all. consider all the data just in
cache, assume you wrote it and you will never ever need to read it once
again because you've 64G of ram and only 20G of disk.

I/O on large files can and must run as fast as I/O on small files, at
the pagecache level. If the fs doesn't support extents or it's
inefficient with large files that's an enterly different problem, and
the underlying fs doesn't matter any longer once the data is in cache.
pagecache must not slowdown on big files.

Otherwise for an unix fs developer usage (the small files ala dbench)
the rbtree was much nicer data structure than the hash in first place
(and it eats less ram than the radix tree if only one page is queued
etc...).

Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 14:36                 ` Andrea Arcangeli
@ 2002-01-31 15:32                   ` Alan Cox
  2002-01-31 16:39                   ` William Lee Irwin III
  2002-01-31 17:46                   ` Linus Torvalds
  2 siblings, 0 replies; 87+ messages in thread
From: Alan Cox @ 2002-01-31 15:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Momchil Velikov, John Stoffel, Linus Torvalds,
	linux-kernel

> Mathematically the hashtable complexity is O(N). But probabilistically
> with the tuning we do on the hashtable size, the collisions will be
> nearly zero for most buckets for of most workloads. Despite the worst
> case is with all the pagecache and swapcache queued in a single linked
> list :).

Providing it handles the worst case. Some of the hash table inputs appear
to be user controllable so an end user can set out to get worst case 
behaviour 8(

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 14:36                 ` Andrea Arcangeli
  2002-01-31 15:32                   ` Alan Cox
@ 2002-01-31 16:39                   ` William Lee Irwin III
  2002-01-31 17:19                     ` William Lee Irwin III
  2002-01-31 17:21                     ` Andrea Arcangeli
  2002-01-31 17:46                   ` Linus Torvalds
  2 siblings, 2 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-01-31 16:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Momchil Velikov, John Stoffel, Linus Torvalds,
	linux-kernel

On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
>>> So I wouldn't merge it, at least until some math is done for the memory
>>> consumation with 500k inodes with only 1 page in them each, and on the
>>> number of heights/levels that must be walked during the tree lookup,
>>> during a access at offset 10G (or worst case in general [biggest
>>> height]) of an inode with 10G just allocated in pagecache.

Did someone say math? Looks like I popped in just in time.

The radix tree forest worst case space usage for fixed-precision search
keys is where each leaf node of a radix tree is occupied by a unique page,
and furthermore, each radix tree contains a single page (otherwise the
shared root conserves a small amount of space).

key precision = D^B = wordsize (e.g. 2^32 or 2^64)
D = depth
B = branch factor

Each leaf node lies within a chain of D nodes, where all but the root
nodes are of size B words. This is (D-1)*B + 1 words per file, hence
per-page. Variable branch factors don't complicate this significantly:
1 + \sum_{0 \leq k \leq D} B_k words per page.

For a branch factor of 128 on i386, this ends up as 1 + 7 + 7 + 6 = 21
words per file. So for 500K inodes each with one page, 42MB (Douglas
Adams fan?). Offsets of 10GB don't work here. Sounds like either an
interesting patch or a 64-bit machine if they work for you. =)

On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
>> Ummm, I don't see how this worst case is any more realistic
>> as the worst case for the hash table (where all pages live
>> in very few hash buckets and we have really deep chains).

I don't believe it's particularly realistic either. And sorry about the
inaccurate estimates from before. =)

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> Mathematically the hashtable complexity is O(N). But probabilistically
> with the tuning we do on the hashtable size, the collisions will be
> nearly zero for most buckets for of most workloads. Despite the worst
> case is with all the pagecache and swapcache queued in a single linked
> list :).

To avoid its worst case (or just poor distribution across the buckets),
a good hash function is necessary. And I don't believe the measurements
are in favor of the one currently in use. Also, the pointer links for
separate chaining within the objects costs extremely-precious boot-time
allocated memory and that memory is taken away from the system for all
time, where the dynamic allocation at least allows for the possibility
of recovering memory when needed.

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> So in short math is wrong about O(N) being bad, hashtable is infact the
> only way we can get an effective O(1) by just paying RAM. We pay with
> RAM and we get performance back to us.
> but with the radix tree (please correct me if I'm wrong) the height will
> increase eventually, no matter what (so it won't be an effective O(1)
> like the hashtable provides in real life, not the worst case, the common
> case). With the hashtable the height won't increase instead.

The key is of a fixed precision, hence the tree is of a fixed depth.
The radix tree is O(1).

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> In short to get the same performance with the radix tree, you'd need to
> waste an huge amount of ram per inode, the hashtable is instead global
> so we pay only once for all the pages in the system. At least this is my
> understanding, I'm not a radix tree guru though, so I may be missing
> something.

Lock and cache contention introduced by intermixing data from unrelated
objects. We've all seen radix trees before: most page tables are radix
trees.

On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
>> People just don't go around caching a single page each for
>> all of their 10 GB files and even if they _wanted to_ they
>> couldn't because of the readahead code.
>> I suspect that for large files we'll always have around
>> min_readahead logically contiguous pages cached, if not more.

I suspect the worst case could only arise after evictions of the
readahead from the pagecache.

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> readahead really doesn't matter at all. consider all the data just in
> cache, assume you wrote it and you will never ever need to read it once
> again because you've 64G of ram and only 20G of disk.

Good luck booting on a 64GB x86 with excess pointer links in struct page.
Boot-time allocations filling the direct-mapped portion of the kernel
virtual address space -appear- to be a severe problem there, but I've
not got the RAM to empirically verify this quite yet.

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> I/O on large files can and must run as fast as I/O on small files, at
> the pagecache level. If the fs doesn't support extents or it's
> inefficient with large files that's an enterly different problem, and
> the underlying fs doesn't matter any longer once the data is in cache.
> pagecache must not slowdown on big files.

O(1) is O(1). This isn't even average-case or worst case: it's all cases.
In a radix tree using fixed-precision search keys, such as machine words,
exactly the same number of internal nodes are traversed to reach a leaf
for every search key, every time, regardless of how populated or unpopulated
the radix tree is.

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> Otherwise for an unix fs developer usage (the small files ala dbench)
> the rbtree was much nicer data structure than the hash in first place
> (and it eats less ram than the radix tree if only one page is queued
> etc...).

And the pointer links in struct page? Sounds like more RAM to me...
4000 open files (much more realistic than 500K) each with one page
leads to 48000 words of radix tree overhead. 3 words per page of
pointer links and > 16000 pages of RAM and the rbtree eats more, not
less. And 16000 pages is just 64MB on i386.

Cheers,
Bill

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 16:39                   ` William Lee Irwin III
@ 2002-01-31 17:19                     ` William Lee Irwin III
  2002-01-31 17:21                     ` Andrea Arcangeli
  1 sibling, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-01-31 17:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel, Momchil Velikov, John Stoffel,
	Linus Torvalds, linux-kernel

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
> The radix tree forest worst case space usage for fixed-precision search
> keys is where each leaf node of a radix tree is occupied by a unique page,
> and furthermore, each radix tree contains a single page (otherwise the
> shared root conserves a small amount of space).
> key precision = D^B = wordsize (e.g. 2^32 or 2^64)
> D = depth
> B = branch factor
> Each leaf node lies within a chain of D nodes, where all but the root
> nodes are of size B words. This is (D-1)*B + 1 words per file, hence
> per-page. Variable branch factors don't complicate this significantly:
> 1 + \sum_{0 \leq k \leq D} B_k words per page.

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
> For a branch factor of 128 on i386, this ends up as 1 + 7 + 7 + 6 = 21
> words per file. So for 500K inodes each with one page, 42MB (Douglas
> Adams fan?). Offsets of 10GB don't work here. Sounds like either an
> interesting patch or a 64-bit machine if they work for you. =)

As just pointed out to me the minute I did a substitution I went wrong:

A branch factor of 128 leads to 
1 + (7 + 7 + 6)*128 words = 2561 words per file, which is somewhat
more severe. =(

More corrections are welcome.


On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
>> Otherwise for an unix fs developer usage (the small files ala dbench)
>> the rbtree was much nicer data structure than the hash in first place
>> (and it eats less ram than the radix tree if only one page is queued
>> etc...).

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
> And the pointer links in struct page? Sounds like more RAM to me...
> 4000 open files (much more realistic than 500K) each with one page
> leads to 48000 words of radix tree overhead. 3 words per page of
> pointer links and > 16000 pages of RAM and the rbtree eats more, not
> less. And 16000 pages is just 64MB on i386.

This doesn't quite hold up after the the correction above. 4K open
files ends up having 10.5K words or 40MB of overhead.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 16:39                   ` William Lee Irwin III
  2002-01-31 17:19                     ` William Lee Irwin III
@ 2002-01-31 17:21                     ` Andrea Arcangeli
  2002-01-31 17:50                       ` William Lee Irwin III
  1 sibling, 1 reply; 87+ messages in thread
From: Andrea Arcangeli @ 2002-01-31 17:21 UTC (permalink / raw)
  To: William Lee Irwin III, Rik van Riel, Momchil Velikov,
	John Stoffel, Linus Torvalds, linux-kernel

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
> On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> >>> So I wouldn't merge it, at least until some math is done for the memory
> >>> consumation with 500k inodes with only 1 page in them each, and on the
> >>> number of heights/levels that must be walked during the tree lookup,
> >>> during a access at offset 10G (or worst case in general [biggest
> >>> height]) of an inode with 10G just allocated in pagecache.
> 
> Did someone say math? Looks like I popped in just in time.
> 
> The radix tree forest worst case space usage for fixed-precision search
> keys is where each leaf node of a radix tree is occupied by a unique page,
> and furthermore, each radix tree contains a single page (otherwise the
> shared root conserves a small amount of space).
> 
> key precision = D^B = wordsize (e.g. 2^32 or 2^64)
> D = depth
> B = branch factor
> 
> Each leaf node lies within a chain of D nodes, where all but the root
> nodes are of size B words. This is (D-1)*B + 1 words per file, hence
> per-page. Variable branch factors don't complicate this significantly:
> 1 + \sum_{0 \leq k \leq D} B_k words per page.
> 
> For a branch factor of 128 on i386, this ends up as 1 + 7 + 7 + 6 = 21
> words per file. So for 500K inodes each with one page, 42MB (Douglas
> Adams fan?). Offsets of 10GB don't work here. Sounds like either an
> interesting patch or a 64-bit machine if they work for you. =)

What do you mean with offsets of 10GB not working? In any recent
distribution supporting LFS the file size limit is only a constraint of
the filesystem on disk format. You don't need a 64bit arch for that.
(and anyways any change to the pagecahce must work fine for 64bit archs
too)

> On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
> >> Ummm, I don't see how this worst case is any more realistic
> >> as the worst case for the hash table (where all pages live
> >> in very few hash buckets and we have really deep chains).
> 
> I don't believe it's particularly realistic either. And sorry about the
> inaccurate estimates from before. =)
> 
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > Mathematically the hashtable complexity is O(N). But probabilistically
> > with the tuning we do on the hashtable size, the collisions will be
> > nearly zero for most buckets for of most workloads. Despite the worst
> > case is with all the pagecache and swapcache queued in a single linked
> > list :).
> 
> To avoid its worst case (or just poor distribution across the buckets),
> a good hash function is necessary. And I don't believe the measurements
> are in favor of the one currently in use. Also, the pointer links for

the randomization provided by the inode is quite powerful (and it makes
not possible to guess the hash bucket in use from userspace without
privilegies), and the current one make sure to optimize the cacheline
usage during consecutive reads. 

> separate chaining within the objects costs extremely-precious boot-time
> allocated memory and that memory is taken away from the system for all
> time, where the dynamic allocation at least allows for the possibility
> of recovering memory when needed.
> 
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > So in short math is wrong about O(N) being bad, hashtable is infact the
> > only way we can get an effective O(1) by just paying RAM. We pay with
> > RAM and we get performance back to us.
> > but with the radix tree (please correct me if I'm wrong) the height will
> > increase eventually, no matter what (so it won't be an effective O(1)
> > like the hashtable provides in real life, not the worst case, the common
> > case). With the hashtable the height won't increase instead.
> 
> The key is of a fixed precision, hence the tree is of a fixed depth.
> The radix tree is O(1).

what does it mean the tree is of a fixed depth? If the depth is fixed
and you claim the lookup complexity O(1) how can you support terabytes
of pagecache queued into the tree without wasting quite a lot of ram per
inode in the worst case (the worst case is with only a few pages into
each radix tree, just to make sure all the depth gets allocated)? Forget
totally about x86, some box runs linux with 256G of ram (I guess
terabyte is next).

Also the complexity arguments are all about the worst case, O(1) as said
in the earlier email may very well be much slower than O(N) when you get
to numbers. hashtable will provide a common case where there is no
collision in the bucket, and then the lookup only consiste of a pointer
check.

If your radix tree O(1) fixed depth is 10000, you will always have to
walk 10000 pointers before you can finish the lookup and it will be
definitely much slower than O(N).

So keep in mind the math complexity arguments can be very misleading in
real life.

> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > In short to get the same performance with the radix tree, you'd need to
> > waste an huge amount of ram per inode, the hashtable is instead global
> > so we pay only once for all the pages in the system. At least this is my
> > understanding, I'm not a radix tree guru though, so I may be missing
> > something.
> 
> Lock and cache contention introduced by intermixing data from unrelated
> objects. We've all seen radix trees before: most page tables are radix

of course with a per-inode data structure the locking issue while
accessing different inodes goes away but I think nominal performance of
the data structure is more important than scalability issue (also
contention would remain in workloads where all tasks access the same
inode like database). The cacheline part has to be taken into account but
the hashfn is just optimized for that one.

> trees.
> 
> On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
> >> People just don't go around caching a single page each for
> >> all of their 10 GB files and even if they _wanted to_ they
> >> couldn't because of the readahead code.
> >> I suspect that for large files we'll always have around
> >> min_readahead logically contiguous pages cached, if not more.
> 
> I suspect the worst case could only arise after evictions of the
> readahead from the pagecache.
> 
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > readahead really doesn't matter at all. consider all the data just in
> > cache, assume you wrote it and you will never ever need to read it once
> > again because you've 64G of ram and only 20G of disk.
> 
> Good luck booting on a 64GB x86 with excess pointer links in struct page.

x86 doesn't matter. this is common code.

> Boot-time allocations filling the direct-mapped portion of the kernel
> virtual address space -appear- to be a severe problem there, but I've
> not got the RAM to empirically verify this quite yet.
> 
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > I/O on large files can and must run as fast as I/O on small files, at
> > the pagecache level. If the fs doesn't support extents or it's
> > inefficient with large files that's an enterly different problem, and
> > the underlying fs doesn't matter any longer once the data is in cache.
> > pagecache must not slowdown on big files.
> 
> O(1) is O(1). This isn't even average-case or worst case: it's all cases.
> In a radix tree using fixed-precision search keys, such as machine words,
> exactly the same number of internal nodes are traversed to reach a leaf
> for every search key, every time, regardless of how populated or unpopulated
> the radix tree is.

and this mean it will be slower than the hashtable that will reach the
page without walking any "depth" in the common case.

> 
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > Otherwise for an unix fs developer usage (the small files ala dbench)
> > the rbtree was much nicer data structure than the hash in first place
> > (and it eats less ram than the radix tree if only one page is queued
> > etc...).
> 
> And the pointer links in struct page? Sounds like more RAM to me...

yes, that would be a few more bytes per page.... unless you allocate the
node structure dynamically like you seems to be doing in the radix tree
patch for the very same reason of not increasing the struct page I guess.

> 4000 open files (much more realistic than 500K) each with one page

open files doesn't matter. what matters are the number of inodes with
cache in them. 500k is definitely realistic, try to run updatedb with
plenty of ram free and then check /proc/sys/fs/inode-nr.

> leads to 48000 words of radix tree overhead. 3 words per page of
> pointer links and > 16000 pages of RAM and the rbtree eats more, not
> less. And 16000 pages is just 64MB on i386.
> 
> 
> Cheers,
> Bill


Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 17:21                     ` Andrea Arcangeli
@ 2002-01-31 17:50                       ` William Lee Irwin III
  0 siblings, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-01-31 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Momchil Velikov, John Stoffel, Linus Torvalds,
	linux-kernel

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> For a branch factor of 128 on i386, this ends up as 1 + 7 + 7 + 6 = 21
>> words per file. So for 500K inodes each with one page, 42MB (Douglas
>> Adams fan?). Offsets of 10GB don't work here. Sounds like either an
>> interesting patch or a 64-bit machine if they work for you. =)

These numbers are wrong -- see the other reply.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> What do you mean with offsets of 10GB not working? In any recent
> distribution supporting LFS the file size limit is only a constraint of
> the filesystem on disk format. You don't need a 64bit arch for that.
> (and anyways any change to the pagecahce must work fine for 64bit archs
> too)

I stand corrected on that. It appears the extra bits are used for large
files. The depth of the tree as represented in the calculation may need
to go up and so the worst case space usage is even larger than the
2.2-ish 32-bit - PAGE_SHIFT calculation.

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> To avoid its worst case (or just poor distribution across the buckets),
>> a good hash function is necessary. And I don't believe the measurements
>> are in favor of the one currently in use. Also, the pointer links for

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> the randomization provided by the inode is quite powerful (and it makes
> not possible to guess the hash bucket in use from userspace without
> privilegies), and the current one make sure to optimize the cacheline
> usage during consecutive reads. 

chi^2 is nowhere near passing a confidence test for uniformity on the
pagecache and on various machines extremely poor bucket distribution
has been observed (i.e. visibly poor from histograms).

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> The key is of a fixed precision, hence the tree is of a fixed depth.
>> The radix tree is O(1).

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> what does it mean the tree is of a fixed depth? If the depth is fixed
> and you claim the lookup complexity O(1) how can you support terabytes
> of pagecache queued into the tree without wasting quite a lot of ram per
> inode in the worst case (the worst case is with only a few pages into
> each radix tree, just to make sure all the depth gets allocated)? Forget
> totally about x86, some box runs linux with 256G of ram (I guess
> terabyte is next).

The number of levels in the tree is proportional to the number of bits in
the machine word, which is a constant. A double-precision machine word
is of constant size as well. Radix trees can be used on strings, which
is where they would not be of fixed depth.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> Also the complexity arguments are all about the worst case, O(1) as said
> in the earlier email may very well be much slower than O(N) when you get
> to numbers. hashtable will provide a common case where there is no
> collision in the bucket, and then the lookup only consiste of a pointer
> check.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> If your radix tree O(1) fixed depth is 10000, you will always have to
> walk 10000 pointers before you can finish the lookup and it will be
> definitely much slower than O(N).
> So keep in mind the math complexity arguments can be very misleading in
> real life.

I know how to use them. Unfortunately, I am not a lightning calculator,
as seen in the preceding post.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> of course with a per-inode data structure the locking issue while
> accessing different inodes goes away but I think nominal performance of
> the data structure is more important than scalability issue (also
> contention would remain in workloads where all tasks access the same
> inode like database). The cacheline part has to be taken into account but
> the hashfn is just optimized for that one.

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> Good luck booting on a 64GB x86 with excess pointer links in struct page.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> x86 doesn't matter. this is common code.

I wish.

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> O(1) is O(1). This isn't even average-case or worst case: it's all cases.
>> In a radix tree using fixed-precision search keys, such as machine words,
>> exactly the same number of internal nodes are traversed to reach a leaf
>> for every search key, every time, regardless of how populated or unpopulated
>> the radix tree is.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> and this mean it will be slower than the hashtable that will reach the
> page without walking any "depth" in the common case.

Not so; the hash table will walk some number of pointer links which can
be estimated from the load on the table, and this is likely to be
approximately the same as the radix tree from hash table statistics I've
seen.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> yes, that would be a few more bytes per page.... unless you allocate the
> node structure dynamically like you seems to be doing in the radix tree
> patch for the very same reason of not increasing the struct page I guess.

Well, maybe not given the revised numbers. =)

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> 4000 open files (much more realistic than 500K) each with one page

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> open files doesn't matter. what matters are the number of inodes with
> cache in them. 500k is definitely realistic, try to run updatedb with
> plenty of ram free and then check /proc/sys/fs/inode-nr.

I misspoke -- I used an estimate of the number of address_spaces in use
on busy systems I heard elsewhere and quadrupled it (and then said "open
files"). Since it's inodes, 4K is still a bad estimate.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 14:36                 ` Andrea Arcangeli
  2002-01-31 15:32                   ` Alan Cox
  2002-01-31 16:39                   ` William Lee Irwin III
@ 2002-01-31 17:46                   ` Linus Torvalds
  2002-01-31 18:02                     ` Andrea Arcangeli
  2 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2002-01-31 17:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Momchil Velikov, John Stoffel, linux-kernel

On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
>
> but with the radix tree (please correct me if I'm wrong) the height will
> increase eventually, no matter what (so it won't be an effective O(1)
> like the hashtable provides in real life, not the worst case, the common
> case). With the hashtable the height won't increase instead.

No.

The radix tree is basically O(1), because the maximum depth of a 7-bit
radix tree is just 5. The index is only a 32-bit number.

We could, in fact, make all page caches use a fixed-depth tree, which is
clearly O(1). But the radix tree is slightly faster and tends to use less
memory under common loads, so..

Remember: you must NOT ignore the constant part of a "O(x)" equation.
Hashes tend to be effectively O(1) under most loads, but they have cache
costs, and they have scalability costs that a radix tree doesn't have.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 17:46                   ` Linus Torvalds
@ 2002-01-31 18:02                     ` Andrea Arcangeli
  2002-01-31 18:32                       ` Linus Torvalds
  0 siblings, 1 reply; 87+ messages in thread
From: Andrea Arcangeli @ 2002-01-31 18:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Momchil Velikov, John Stoffel, linux-kernel

On Thu, Jan 31, 2002 at 09:46:52AM -0800, Linus Torvalds wrote:
> On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> >
> > but with the radix tree (please correct me if I'm wrong) the height will
> > increase eventually, no matter what (so it won't be an effective O(1)
> > like the hashtable provides in real life, not the worst case, the common
> > case). With the hashtable the height won't increase instead.
> 
> No.
> 
> The radix tree is basically O(1), because the maximum depth of a 7-bit
> radix tree is just 5. The index is only a 32-bit number.

then it will break on archs with more ram than 1<<(32+PAGE_CACHE_SHIFT).

Also there must be some significant memory overhead that can be
triggered with a certain layout of pages, in some configuration it
should take much more ram than the hashtable if I understood well how it
works.

Also its O(1) may be slower than the O(N) of the hashtable in the 99% of
the cases.

> 
> We could, in fact, make all page caches use a fixed-depth tree, which is
> clearly O(1). But the radix tree is slightly faster and tends to use less
> memory under common loads, so..
> 
> Remember: you must NOT ignore the constant part of a "O(x)" equation.
> Hashes tend to be effectively O(1) under most loads, but they have cache
> costs, and they have scalability costs that a radix tree doesn't have.

the scalability cost I obviously agree :) (however on some workload with
all tasks on the same inode, the scalability cost remains the same).

Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 18:02                     ` Andrea Arcangeli
@ 2002-01-31 18:32                       ` Linus Torvalds
  2002-01-31 18:38                         ` Rik van Riel
  2002-01-31 19:14                         ` Andrea Arcangeli
  0 siblings, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-01-31 18:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Momchil Velikov, John Stoffel, linux-kernel


On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> >
> > The radix tree is basically O(1), because the maximum depth of a 7-bit
> > radix tree is just 5. The index is only a 32-bit number.
>
> then it will break on archs with more ram than 1<<(32+PAGE_CACHE_SHIFT).

NO.

The radix tree is an index lookup mechanism.

The index is 32 bits.

That's true regardless of how much RAM you have.

> Also there must be some significant memory overhead that can be
> triggered with a certain layout of pages, in some configuration it
> should take much more ram than the hashtable if I understood well how it
> works.

Considering that the radix tree can _remove_ 8 bytes per "struct page", I
suspect you potentially win more memory than you lose.

		Linus


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 18:32                       ` Linus Torvalds
@ 2002-01-31 18:38                         ` Rik van Riel
  2002-01-31 18:49                           ` Linus Torvalds
  2002-01-31 19:14                         ` Andrea Arcangeli
  1 sibling, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2002-01-31 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Momchil Velikov, John Stoffel, linux-kernel

On Thu, 31 Jan 2002, Linus Torvalds wrote:

> > Also there must be some significant memory overhead that can be
> > triggered with a certain layout of pages, in some configuration it
> > should take much more ram than the hashtable if I understood well how it
> > works.
>
> Considering that the radix tree can _remove_ 8 bytes per "struct
> page", I suspect you potentially win more memory than you lose.

Actually, since the page cache hash table is also 8 bytes
per page, the radix trees effectively remove 16 bytes per
struct page.

Also, Momchil's radix trees are only as deep as needed
for each file, so most files should have very shallow
radix trees.

Combine these two facts with min_readahead and you'll
see that the memory consumption for radix trees should
be pretty decent.

It's still a question whether we'll want to use 128 as
the branch factor or another number ... but I'm sure
somebody will figure that out (and it can be changed
later, it's just one define).

regards,

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 18:38                         ` Rik van Riel
@ 2002-01-31 18:49                           ` Linus Torvalds
  2002-01-31 19:09                             ` Momchil Velikov
                                               ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-01-31 18:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Momchil Velikov, John Stoffel, linux-kernel


On Thu, 31 Jan 2002, Rik van Riel wrote:
>
> It's still a question whether we'll want to use 128 as
> the branch factor or another number ... but I'm sure
> somebody will figure that out (and it can be changed
> later, it's just one define).

Actually, I think the big question is whether somebody is willing to clean
up and fix the "move_from_swap_cache()" issue with block_flushpage.

		Linus


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 18:49                           ` Linus Torvalds
@ 2002-01-31 19:09                             ` Momchil Velikov
  2002-01-31 19:26                             ` Andrew Morton
  2002-01-31 21:12                             ` Momchil Velikov
  2 siblings, 0 replies; 87+ messages in thread
From: Momchil Velikov @ 2002-01-31 19:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Andrea Arcangeli, John Stoffel, linux-kernel

>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

Linus> On Thu, 31 Jan 2002, Rik van Riel wrote:
>> 
>> It's still a question whether we'll want to use 128 as
>> the branch factor or another number ... but I'm sure
>> somebody will figure that out (and it can be changed
>> later, it's just one define).

Linus> Actually, I think the big question is whether somebody is willing to clean
Linus> up and fix the "move_from_swap_cache()" issue with block_flushpage.

Actually, I would be the one, only if I knew what was the issue :)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 18:49                           ` Linus Torvalds
  2002-01-31 19:09                             ` Momchil Velikov
@ 2002-01-31 19:26                             ` Andrew Morton
  2002-01-31 21:12                             ` Momchil Velikov
  2 siblings, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2002-01-31 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andrea Arcangeli, Momchil Velikov, John Stoffel,
	linux-kernel

Linus Torvalds wrote:
> 
> On Thu, 31 Jan 2002, Rik van Riel wrote:
> >
> > It's still a question whether we'll want to use 128 as
> > the branch factor or another number ... but I'm sure
> > somebody will figure that out (and it can be changed
> > later, it's just one define).
> 
> Actually, I think the big question is whether somebody is willing to clean
> up and fix the "move_from_swap_cache()" issue with block_flushpage.
> 

It appears that move_from_swap_cache() is in good company:

1: shmem_unuse_inode() calls delete_from_swap_cache under
   spinlock, but delete_from_swap_cache() calls block_flushpage(),
   which can sleep.

2: shmem_getpage_locked() calls delete_from_swap_cache() calls
   block_flushpage() under info->lock.

3: zap_pte_range holds mm->page_table_lock, and calls
   free_swap_and_cache() calls delete_from_swap_cache() calls
   block_flushpage().

block_flushpage() can only sleep in the lock_buffer() in
discard_buffer().   It so happens that all three callers
are always using block_flushpage() against a locked
swapcache page, and (correct me if I'm wrong), it's
not possible for those buffers to be locked.

So we got lucky.

A short-term fix is to put a BIG FAT COMMENT over block_flushpage.

-

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 18:49                           ` Linus Torvalds
  2002-01-31 19:09                             ` Momchil Velikov
  2002-01-31 19:26                             ` Andrew Morton
@ 2002-01-31 21:12                             ` Momchil Velikov
  2 siblings, 0 replies; 87+ messages in thread
From: Momchil Velikov @ 2002-01-31 21:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Andrea Arcangeli, John Stoffel, linux-kernel

>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

Linus> On Thu, 31 Jan 2002, Rik van Riel wrote:
>> 
>> It's still a question whether we'll want to use 128 as
>> the branch factor or another number ... but I'm sure
>> somebody will figure that out (and it can be changed
>> later, it's just one define).

Linus> Actually, I think the big question is whether somebody is willing to clean
Linus> up and fix the "move_from_swap_cache()" issue with block_flushpage.

Ah, almost forgot it. The patch removes ``next_hash'' and
``pprev_hash'' from ``struct page'', which breaks ARM and sparc64.

Regards,
-velco


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 18:32                       ` Linus Torvalds
  2002-01-31 18:38                         ` Rik van Riel
@ 2002-01-31 19:14                         ` Andrea Arcangeli
  2002-01-31 19:23                           ` Linus Torvalds
  1 sibling, 1 reply; 87+ messages in thread
From: Andrea Arcangeli @ 2002-01-31 19:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Momchil Velikov, John Stoffel, linux-kernel

On Thu, Jan 31, 2002 at 10:32:35AM -0800, Linus Torvalds wrote:
> 
> On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> > >
> > > The radix tree is basically O(1), because the maximum depth of a 7-bit
> > > radix tree is just 5. The index is only a 32-bit number.
> >
> > then it will break on archs with more ram than 1<<(32+PAGE_CACHE_SHIFT).
> 
> NO.
> 
> The radix tree is an index lookup mechanism.
> 
> The index is 32 bits.
> 
> That's true regardless of how much RAM you have.

then there must be some collision handling that raise the complexity to
O(N) like with the hashtable, if the depth is fixed and if 32bits of
index are enough regardless of how many entries are in the tree.

I'm confused by the comments I heard so far, but well I don't want to
bother you further until I have clear how this data structure is layed
out exactly. I mainly wanted to give a warning, to be sure some point is
evalulated properly before integration.

> Considering that the radix tree can _remove_ 8 bytes per "struct page", I
> suspect you potentially win more memory than you lose.

of course if we add kmalloc to the pagecache code we can drop such part
from the page structure with the hashtable too.

Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 19:14                         ` Andrea Arcangeli
@ 2002-01-31 19:23                           ` Linus Torvalds
  2002-01-31 21:34                             ` Ingo Molnar
  0 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2002-01-31 19:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Momchil Velikov, John Stoffel, linux-kernel

On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
>
> then there must be some collision handling that raise the complexity to
> O(N) like with the hashtable, if the depth is fixed and if 32bits of
> index are enough regardless of how many entries are in the tree.

No collisions. Each mapping has its own private tree. And mappings are
virtually indexed by 32 bits. No hashes, no collisions, no nothing.

Think of the page tables. We can have 64GB of memory, and the page tables
will shrink and grow dynamically to match the needs for virtual memory.
The radix tree is no different, except it ends up being a bit more
aggressive about shrinking by virtue of not always using the maximum depth.

(A fixed depth tree is much simpler, and has equivalent memory use for
not-very-dense mappings. But file mappings are 99% dense).

> of course if we add kmalloc to the pagecache code we can drop such part
> from the page structure with the hashtable too.

But you still need the hashtable.

Right now the hashtable is _roughly_ the size of 4 bytes per physical page
in the machine - and it was done that way explicitly to avoid havin gto
walk the chains. That's a LOT of memory.

For example, on my 2GB machine, I have 2MB worth of hash-tables.

In addition, each "struct page" has 8 bytes in it, so we have a total of
12 bytes per page just for the hash chains.

And yes, you could use kmalloc to allocate the hash chain entries. But
we're _guaranteed_ that 12 bytes, and kmalloc overhead might make it
worse.

In short: the radix tree certainly isn't any worse.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 19:23                           ` Linus Torvalds
@ 2002-01-31 21:34                             ` Ingo Molnar
  2002-01-31 23:12                               ` Anton Blanchard
  0 siblings, 1 reply; 87+ messages in thread
From: Ingo Molnar @ 2002-01-31 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Rik van Riel, Momchil Velikov, John Stoffel,
	linux-kernel, Anton Blanchard

On Thu, 31 Jan 2002, Linus Torvalds wrote:

> > then there must be some collision handling that raise the complexity to
> > O(N) like with the hashtable, if the depth is fixed and if 32bits of
> > index are enough regardless of how many entries are in the tree.
>
> No collisions. Each mapping has its own private tree. And mappings are
> virtually indexed by 32 bits. No hashes, no collisions, no nothing.

Yes, it's very nice. Anton Blanchard has benchmarked both patch variants
(tree vs. scalable-hash page buckets) for SMP scalability against the
stock hash, on big RAM, many CPUs boxes, via dbench load. He has found
performance of radix trees vs. scalable hash to be at least equivalent. (i
think Anton has a few links to show the resulting graphs.)

In fact the radix trees showed a slight performance/scalability edge in
some parts of the performance curve. So given the fact that hashes/buckets
were *purely* designed for speed/scalability and not for RAM usage, this
proves that radix trees are superior. Plus the locking is much simpler
than for the hash buckets solution. Which make radix trees a clear winner
IMO.

	Ingo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 21:34                             ` Ingo Molnar
@ 2002-01-31 23:12                               ` Anton Blanchard
  2002-01-31 23:55                                 ` Andrea Arcangeli
                                                   ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Anton Blanchard @ 2002-01-31 23:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrea Arcangeli, Rik van Riel, Momchil Velikov,
	John Stoffel, linux-kernel

Hi Ingo,

> Yes, it's very nice. Anton Blanchard has benchmarked both patch variants
> (tree vs. scalable-hash page buckets) for SMP scalability against the
> stock hash, on big RAM, many CPUs boxes, via dbench load. He has found
> performance of radix trees vs. scalable hash to be at least equivalent. (i
> think Anton has a few links to show the resulting graphs.)

Here are some results on a 12 way machine. (2.4.16-splay is the radix
patch):

http://samba.org/~anton/linux/pagecache_locking/1/summary.png

As you can see both patches give pretty much equal improvements.

The other problem with the current pagecache hash is that it maxes out
at order 9 (due to the get_free_pages limitation) which starts to hurt
at 4GB RAM and above. On a 32GB machine the average hashchain depth
was very high:

http://samba.org/~anton/linux/pagecache/pagecache_before.png

There were a few solutions (from davem and ingo) to allocate a larger
hash but with the radix patch we no longer have to worry about this.

So the radix patch solves 2 problems quite nicely :)

Anton

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 23:12                               ` Anton Blanchard
@ 2002-01-31 23:55                                 ` Andrea Arcangeli
  2002-02-01  0:01                                   ` David S. Miller
  2002-02-01  3:56                                   ` Anton Blanchard
  2002-02-01  6:32                                 ` Momchil Velikov
  2002-02-01  9:04                                 ` Ingo Molnar
  2 siblings, 2 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2002-01-31 23:55 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Momchil Velikov,
	John Stoffel, linux-kernel

On Fri, Feb 01, 2002 at 10:12:42AM +1100, Anton Blanchard wrote:
>  
> Hi Ingo,
> 
> > Yes, it's very nice. Anton Blanchard has benchmarked both patch variants
> > (tree vs. scalable-hash page buckets) for SMP scalability against the
> > stock hash, on big RAM, many CPUs boxes, via dbench load. He has found
> > performance of radix trees vs. scalable hash to be at least equivalent. (i
> > think Anton has a few links to show the resulting graphs.)
> 
> Here are some results on a 12 way machine. (2.4.16-splay is the radix
> patch):
> 
> http://samba.org/~anton/linux/pagecache_locking/1/summary.png
> 
> As you can see both patches give pretty much equal improvements.
> 
> The other problem with the current pagecache hash is that it maxes out
> at order 9 (due to the get_free_pages limitation) which starts to hurt
> at 4GB RAM and above. On a 32GB machine the average hashchain depth
> was very high:
> 
> http://samba.org/~anton/linux/pagecache/pagecache_before.png
> 
> There were a few solutions (from davem and ingo) to allocate a larger
> hash but with the radix patch we no longer have to worry about this.
> 
> So the radix patch solves 2 problems quite nicely :)

all the hashes should be allocated with the bootmem allocator, that
doesn't have the MAX_ORDER limit. Not only the pagecache hash, that is
the only one replaced.

In short, for an optimal comparison between hash and radix tree, we'd
need to fixup the hash allocation with the bootmem allocator first.

Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 23:55                                 ` Andrea Arcangeli
@ 2002-02-01  0:01                                   ` David S. Miller
  2002-02-16 16:20                                     ` Andrea Arcangeli
  2002-02-01  3:56                                   ` Anton Blanchard
  1 sibling, 1 reply; 87+ messages in thread
From: David S. Miller @ 2002-02-01  0:01 UTC (permalink / raw)
  To: andrea; +Cc: anton, mingo, torvalds, riel, velco, stoffel, linux-kernel

   From: Andrea Arcangeli <andrea@suse.de>
   Date: Fri, 1 Feb 2002 00:55:43 +0100
   
   In short, for an optimal comparison between hash and radix tree, we'd
   need to fixup the hash allocation with the bootmem allocator first.

I'm totally convinced the radix stuff is much better, but since you
are not here is the "pagecache hash in bootmem" patch I did ages ago
so Anton can make you happy :-)

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/alpha/mm/init.c linux/arch/alpha/mm/init.c
--- vanilla/linux/arch/alpha/mm/init.c	Thu Sep 20 20:02:03 2001
+++ linux/arch/alpha/mm/init.c	Sat Nov 10 01:49:56 2001
@@ -23,6 +23,7 @@
 #ifdef CONFIG_BLK_DEV_INITRD
 #include <linux/blk.h>
 #endif
+#include <linux/pagemap.h>
 
 #include <asm/system.h>
 #include <asm/uaccess.h>
@@ -360,6 +361,7 @@
 mem_init(void)
 {
 	max_mapnr = num_physpages = max_low_pfn;
+	page_cache_init(count_free_bootmem());
 	totalram_pages += free_all_bootmem();
 	high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/alpha/mm/numa.c linux/arch/alpha/mm/numa.c
--- vanilla/linux/arch/alpha/mm/numa.c	Sun Aug 12 10:38:48 2001
+++ linux/arch/alpha/mm/numa.c	Sat Nov 10 01:52:27 2001
@@ -15,6 +15,7 @@
 #ifdef CONFIG_BLK_DEV_INITRD
 #include <linux/blk.h>
 #endif
+#include <linux/pagemap.h>
 
 #include <asm/hwrpb.h>
 #include <asm/pgalloc.h>
@@ -359,8 +360,13 @@
 	extern char _text, _etext, _data, _edata;
 	extern char __init_begin, __init_end;
 	extern unsigned long totalram_pages;
-	unsigned long nid, i;
+	unsigned long nid, i, num_free_bootmem_pages;
 	mem_map_t * lmem_map;
+
+	num_free_bootmem_pages = 0;
+	for (nid = 0; nid < numnodes; nid++)
+		num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(nid));
+	page_cache_init(num_free_bootmem_pages);
 
 	high_memory = (void *) __va(max_mapnr <<PAGE_SHIFT);
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/arm/mm/init.c linux/arch/arm/mm/init.c
--- vanilla/linux/arch/arm/mm/init.c	Thu Oct 11 09:04:57 2001
+++ linux/arch/arm/mm/init.c	Sat Nov 10 01:52:34 2001
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/bootmem.h>
 #include <linux/blk.h>
+#include <linux/pagemap.h>
 
 #include <asm/segment.h>
 #include <asm/mach-types.h>
@@ -594,6 +595,7 @@
 void __init mem_init(void)
 {
 	unsigned int codepages, datapages, initpages;
+	unsigned long num_free_bootmem_pages;
 	int i, node;
 
 	codepages = &_etext - &_text;
@@ -608,6 +610,11 @@
 	 */
 	if (meminfo.nr_banks != 1)
 		create_memmap_holes(&meminfo);
+
+	num_free_bootmem_pages = 0;
+	for (node = 0; node < numnodes; node++)
+		num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(node));
+	page_cache_init(num_free_bootmem_pages);
 
 	/* this will put all unused low memory onto the freelists */
 	for (node = 0; node < numnodes; node++) {
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/cris/mm/init.c linux/arch/cris/mm/init.c
--- vanilla/linux/arch/cris/mm/init.c	Thu Jul 26 15:10:06 2001
+++ linux/arch/cris/mm/init.c	Sat Nov 10 01:53:10 2001
@@ -95,6 +95,7 @@
 #include <linux/swap.h>
 #include <linux/smp.h>
 #include <linux/bootmem.h>
+#include <linux/pagemap.h>
 
 #include <asm/system.h>
 #include <asm/segment.h>
@@ -366,6 +367,8 @@
 
 	max_mapnr = num_physpages = max_low_pfn - min_low_pfn;
  
+	page_cache_init(count_free_bootmem());
+
 	/* this will put all memory onto the freelists */
         totalram_pages = free_all_bootmem();
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/i386/mm/init.c linux/arch/i386/mm/init.c
--- vanilla/linux/arch/i386/mm/init.c	Sun Nov 18 19:59:22 2001
+++ linux/arch/i386/mm/init.c	Mon Nov 12 00:14:00 2001
@@ -466,6 +466,8 @@
 #endif
 	high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
 
+	page_cache_init(count_free_bootmem());
+
 	/* clear the zero-page */
 	memset(empty_zero_page, 0, PAGE_SIZE);
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/ia64/mm/init.c linux/arch/ia64/mm/init.c
--- vanilla/linux/arch/ia64/mm/init.c	Sun Nov 18 19:59:23 2001
+++ linux/arch/ia64/mm/init.c	Sat Nov 10 01:54:20 2001
@@ -13,6 +13,7 @@
 #include <linux/reboot.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
+#include <linux/pagemap.h>
 
 #include <asm/bitops.h>
 #include <asm/dma.h>
@@ -406,6 +407,8 @@
 
 	max_mapnr = max_low_pfn;
 	high_memory = __va(max_low_pfn * PAGE_SIZE);
+
+	page_cache_init(count_free_bootmem());
 
 	totalram_pages += free_all_bootmem();
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/m68k/mm/init.c linux/arch/m68k/mm/init.c
--- vanilla/linux/arch/m68k/mm/init.c	Thu Sep 20 20:02:03 2001
+++ linux/arch/m68k/mm/init.c	Sat Nov 10 01:54:47 2001
@@ -20,6 +20,7 @@
 #ifdef CONFIG_BLK_DEV_RAM
 #include <linux/blk.h>
 #endif
+#include <linux/pagemap.h>
 
 #include <asm/setup.h>
 #include <asm/uaccess.h>
@@ -135,6 +136,8 @@
 	if (MACH_IS_ATARI)
 		atari_stram_mem_init_hook();
 #endif
+
+	page_cache_init(count_free_bootmem());
 
 	/* this will put all memory onto the freelists */
 	totalram_pages = free_all_bootmem();
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips/mm/init.c linux/arch/mips/mm/init.c
--- vanilla/linux/arch/mips/mm/init.c	Wed Jul  4 11:50:39 2001
+++ linux/arch/mips/mm/init.c	Sat Nov 10 01:55:09 2001
@@ -28,6 +28,7 @@
 #ifdef CONFIG_BLK_DEV_INITRD
 #include <linux/blk.h>
 #endif
+#include <linux/pagemap.h>
 
 #include <asm/bootinfo.h>
 #include <asm/cachectl.h>
@@ -203,6 +204,8 @@
 
 	max_mapnr = num_physpages = max_low_pfn;
 	high_memory = (void *) __va(max_mapnr << PAGE_SHIFT);
+
+	page_cache_init(count_free_bootmem());
 
 	totalram_pages += free_all_bootmem();
 	totalram_pages -= setup_zero_pages();	/* Setup zeroed pages.  */
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips64/mm/init.c linux/arch/mips64/mm/init.c
--- vanilla/linux/arch/mips64/mm/init.c	Wed Jul  4 11:50:39 2001
+++ linux/arch/mips64/mm/init.c	Sat Nov 10 01:55:30 2001
@@ -25,6 +25,7 @@
 #ifdef CONFIG_BLK_DEV_INITRD
 #include <linux/blk.h>
 #endif
+#include <linux/pagemap.h>
 
 #include <asm/bootinfo.h>
 #include <asm/cachectl.h>
@@ -396,6 +397,8 @@
 
 	max_mapnr = num_physpages = max_low_pfn;
 	high_memory = (void *) __va(max_mapnr << PAGE_SHIFT);
+
+	page_cache_init(count_free_bootmem());
 
 	totalram_pages += free_all_bootmem();
 	totalram_pages -= setup_zero_pages();	/* Setup zeroed pages.  */
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips64/sgi-ip27/ip27-memory.c linux/arch/mips64/sgi-ip27/ip27-memory.c
--- vanilla/linux/arch/mips64/sgi-ip27/ip27-memory.c	Sun Sep  9 10:43:02 2001
+++ linux/arch/mips64/sgi-ip27/ip27-memory.c	Sat Nov 10 02:02:33 2001
@@ -15,6 +15,7 @@
 #include <linux/mm.h>
 #include <linux/bootmem.h>
 #include <linux/swap.h>
+#include <linux/pagemap.h>
 
 #include <asm/page.h>
 #include <asm/bootinfo.h>
@@ -277,6 +278,11 @@
 	num_physpages = numpages;	/* memory already sized by szmem */
 	max_mapnr = pagenr;		/* already found during paging_init */
 	high_memory = (void *) __va(max_mapnr << PAGE_SHIFT);
+
+	tmp = 0;
+	for (nid = 0; nid < numnodes; nid++)
+		tmp += count_free_bootmem_node(NODE_DATA(nid));
+	page_cache_init(tmp);
 
 	for (nid = 0; nid < numnodes; nid++) {
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/parisc/mm/init.c linux/arch/parisc/mm/init.c
--- vanilla/linux/arch/parisc/mm/init.c	Tue Dec  5 12:29:39 2000
+++ linux/arch/parisc/mm/init.c	Sat Nov 10 01:57:11 2001
@@ -17,6 +17,7 @@
 #include <linux/pci.h>		/* for hppa_dma_ops and pcxl_dma_ops */
 #include <linux/swap.h>
 #include <linux/unistd.h>
+#include <linux/pagemap.h>
 
 #include <asm/pgalloc.h>
 
@@ -48,6 +49,8 @@
 {
 	max_mapnr = num_physpages = max_low_pfn;
 	high_memory = __va(max_low_pfn * PAGE_SIZE);
+
+	page_cache_init(count_free_bootmem());
 
 	totalram_pages += free_all_bootmem();
 	printk("Memory: %luk available\n", totalram_pages << (PAGE_SHIFT-10));
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/ppc/mm/init.c linux/arch/ppc/mm/init.c
--- vanilla/linux/arch/ppc/mm/init.c	Tue Oct  2 09:12:44 2001
+++ linux/arch/ppc/mm/init.c	Sat Nov 10 01:57:34 2001
@@ -34,6 +34,7 @@
 #ifdef CONFIG_BLK_DEV_INITRD
 #include <linux/blk.h>		/* for initrd_* */
 #endif
+#include <linux/pagemap.h>
 
 #include <asm/pgalloc.h>
 #include <asm/prom.h>
@@ -462,6 +463,8 @@
 
 	high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
 	num_physpages = max_mapnr;	/* RAM is assumed contiguous */
+
+	page_cache_init(count_free_bootmem());
 
 	totalram_pages += free_all_bootmem();
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/s390/mm/init.c linux/arch/s390/mm/init.c
--- vanilla/linux/arch/s390/mm/init.c	Thu Oct 11 09:04:57 2001
+++ linux/arch/s390/mm/init.c	Sat Nov 10 01:57:56 2001
@@ -186,6 +186,8 @@
         /* clear the zero-page */
         memset(empty_zero_page, 0, PAGE_SIZE);
 
+	page_cache_init(count_free_bootmem());
+
 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem();
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/s390x/mm/init.c linux/arch/s390x/mm/init.c
--- vanilla/linux/arch/s390x/mm/init.c	Sun Nov 18 19:59:23 2001
+++ linux/arch/s390x/mm/init.c	Sat Nov 10 01:58:14 2001
@@ -198,6 +198,8 @@
         /* clear the zero-page */
         memset(empty_zero_page, 0, PAGE_SIZE);
 
+        page_cache_init(count_free_bootmem());
+
 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem();
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sh/mm/init.c linux/arch/sh/mm/init.c
--- vanilla/linux/arch/sh/mm/init.c	Mon Oct 15 13:36:48 2001
+++ linux/arch/sh/mm/init.c	Sat Nov 10 01:59:56 2001
@@ -26,6 +26,7 @@
 #endif
 #include <linux/highmem.h>
 #include <linux/bootmem.h>
+#include <linux/pagemap.h>
 
 #include <asm/processor.h>
 #include <asm/system.h>
@@ -139,6 +140,7 @@
 void __init mem_init(void)
 {
 	extern unsigned long empty_zero_page[1024];
+	unsigned long num_free_bootmem_pages;
 	int codesize, reservedpages, datasize, initsize;
 	int tmp;
 
@@ -148,6 +150,12 @@
 	/* clear the zero-page */
 	memset(empty_zero_page, 0, PAGE_SIZE);
 	__flush_wback_region(empty_zero_page, PAGE_SIZE);
+
+	num_free_bootmem_pages = count_free_bootmem_node(NODE_DATA(0));
+#ifdef CONFIG_DISCONTIGMEM
+	num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(1));
+#endif
+	page_cache_init(num_free_bootmem_pages);
 
 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sparc/mm/init.c linux/arch/sparc/mm/init.c
--- vanilla/linux/arch/sparc/mm/init.c	Mon Oct  1 09:19:56 2001
+++ linux/arch/sparc/mm/init.c	Mon Nov 12 19:27:47 2001
@@ -25,6 +25,7 @@
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/bootmem.h>
+#include <linux/pagemap.h>
 
 #include <asm/system.h>
 #include <asm/segment.h>
@@ -434,6 +432,8 @@
 
 	max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT);
 	high_memory = __va(max_low_pfn << PAGE_SHIFT);
+
+	page_cache_init(count_free_bootmem());
 
 #ifdef DEBUG_BOOTMEM
 	prom_printf("mem_init: Calling free_all_bootmem().\n");
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sparc64/mm/init.c linux/arch/sparc64/mm/init.c
--- vanilla/linux/arch/sparc64/mm/init.c	Sun Nov 18 19:59:23 2001
+++ linux/arch/sparc64/mm/init.c	Sat Nov 17 23:51:28 2001
@@ -1583,6 +1583,8 @@
 
 	max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT);
 	high_memory = __va(last_valid_pfn << PAGE_SHIFT);
+
+	page_cache_init(count_free_bootmem());
 
 	num_physpages = free_all_bootmem() - 1;
 
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/include/linux/bootmem.h linux/include/linux/bootmem.h
--- vanilla/linux/include/linux/bootmem.h	Mon Nov  5 12:43:18 2001
+++ linux/include/linux/bootmem.h	Mon Nov 19 10:22:17 2001
@@ -43,11 +43,13 @@
 #define alloc_bootmem_low_pages(x) \
 	__alloc_bootmem((x), PAGE_SIZE, 0)
 extern unsigned long __init free_all_bootmem (void);
+extern unsigned long __init count_free_bootmem (void);
 
 extern unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn);
 extern void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size);
 extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size);
 extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat);
+extern unsigned long __init count_free_bootmem_node (pg_data_t *pgdat);
 extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal);
 #define alloc_bootmem_node(pgdat, x) \
 	__alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/init/main.c linux/init/main.c
--- vanilla/linux/init/main.c	Sun Nov 18 19:59:37 2001
+++ linux/init/main.c	Sat Nov 10 04:58:16 2001
@@ -597,7 +597,6 @@
 	proc_caches_init();
 	vfs_caches_init(mempages);
 	buffer_init(mempages);
-	page_cache_init(mempages);
 #if defined(CONFIG_ARCH_S390)
 	ccwcache_init();
 #endif
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/mm/bootmem.c linux/mm/bootmem.c
--- vanilla/linux/mm/bootmem.c	Tue Sep 18 14:10:43 2001
+++ linux/mm/bootmem.c	Mon Nov 12 20:40:58 2001
@@ -272,6 +279,28 @@
 	return total;
 }
 
+static unsigned long __init count_free_bootmem_core(pg_data_t *pgdat)
+{
+	bootmem_data_t *bdata = pgdat->bdata;
+	unsigned long i, idx, total;
+
+	if (!bdata->node_bootmem_map) BUG();
+
+	total = 0;
+	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
+	for (i = 0; i < idx; i++) {
+		if (!test_bit(i, bdata->node_bootmem_map))
+			total++;
+	}
+
+	/*
+	 * Count the allocator bitmap itself.
+	 */
+	total += ((bdata->node_low_pfn-(bdata->node_boot_start >> PAGE_SHIFT))/8 + PAGE_SIZE-1)/PAGE_SIZE;
+
+	return total;
+}
+
 unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn)
 {
 	return(init_bootmem_core(pgdat, freepfn, startpfn, endpfn));
@@ -292,6 +321,11 @@
 	return(free_all_bootmem_core(pgdat));
 }
 
+unsigned long __init count_free_bootmem_node (pg_data_t *pgdat)
+{
+	return(count_free_bootmem_core(pgdat));
+}
+
 unsigned long __init init_bootmem (unsigned long start, unsigned long pages)
 {
 	max_low_pfn = pages;
@@ -312,6 +346,11 @@
 unsigned long __init free_all_bootmem (void)
 {
 	return(free_all_bootmem_core(&contig_page_data));
+}
+
+unsigned long __init count_free_bootmem (void)
+{
+	return(count_free_bootmem_core(&contig_page_data));
 }
 
 void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/mm/filemap.c linux/mm/filemap.c
--- vanilla/linux/mm/filemap.c	Sun Nov 18 19:59:38 2001
+++ linux/mm/filemap.c	Fri Nov 16 07:31:35 2001
@@ -24,6 +24,7 @@
 #include <linux/mm.h>
 #include <linux/iobuf.h>
 #include <linux/compiler.h>
+#include <linux/bootmem.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -2931,28 +2932,48 @@
 	goto unlock;
 }
 
+/* This is called from the arch specific mem_init routine.
+ * It is done right before free_all_bootmem (or NUMA equivalent).
+ *
+ * The mempages arg is the number of pages free_all_bootmem is
+ * going to liberate, or a close approximation.
+ *
+ * We have to use bootmem because on huge systems (ie. 16GB ram)
+ * get_free_pages cannot give us a large enough allocation.
+ */
 void __init page_cache_init(unsigned long mempages)
 {
-	unsigned long htable_size, order;
+	unsigned long htable_size, real_size;
 
 	htable_size = mempages;
 	htable_size *= sizeof(struct page *);
-	for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
+
+	for (real_size = 1UL; real_size < htable_size; real_size <<= 1UL)
 		;
 
 	do {
-		unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);
+		unsigned long tmp = (real_size / sizeof(struct page *));
+		unsigned long align;
 
 		page_hash_bits = 0;
 		while((tmp >>= 1UL) != 0UL)
 			page_hash_bits++;
+		
+		align = real_size;
+		if (align > (4UL * 1024UL * 1024UL))
+			align = (4UL * 1024UL * 1024UL);
+
+		page_hash_table = __alloc_bootmem(real_size, align,
+						  __pa(MAX_DMA_ADDRESS));
+
+		/* Perhaps the alignment was too strict. */
+		if (page_hash_table == NULL)
+			page_hash_table = alloc_bootmem(real_size);
+	} while (page_hash_table == NULL &&
+		 (real_size >>= 1UL) >= PAGE_SIZE);
 
-		page_hash_table = (struct page **)
-			__get_free_pages(GFP_ATOMIC, order);
-	} while(page_hash_table == NULL && --order > 0);
-
-	printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n",
-	       (1 << page_hash_bits), order, (PAGE_SIZE << order));
+	printk("Page-cache hash table entries: %d (%ld bytes)\n",
+	       (1 << page_hash_bits), real_size);
 	if (!page_hash_table)
 		panic("Failed to allocate page hash table\n");
 	memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *));

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  0:01                                   ` David S. Miller
@ 2002-02-16 16:20                                     ` Andrea Arcangeli
  0 siblings, 0 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2002-02-16 16:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: anton, mingo, torvalds, riel, velco, stoffel, linux-kernel

On Thu, Jan 31, 2002 at 04:01:59PM -0800, David S. Miller wrote:
> --- vanilla/linux/arch/i386/mm/init.c	Sun Nov 18 19:59:22 2001
> +++ linux/arch/i386/mm/init.c	Mon Nov 12 00:14:00 2001
> @@ -466,6 +466,8 @@
>  #endif
>  	high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
>  
> +	page_cache_init(count_free_bootmem());
> +
>  	/* clear the zero-page */

I wanted to merge it, but pagecache will be in highmem as well, and the
above only takes into account the normal classzone (so it can be off by
63G).

Also I'd prefer page_cache_init() to be recalled within
free_all_bootmem/free_all_bootmem_node, so we don't need to change all
archs. to take highmem into account we can use the globals
highstart_pfn/highend_pfn within an #ifdef CONFIG_HIGHMEM.

Are you interested in fixing it? Otherwise let me know. Thanks!

Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 23:55                                 ` Andrea Arcangeli
  2002-02-01  0:01                                   ` David S. Miller
@ 2002-02-01  3:56                                   ` Anton Blanchard
  1 sibling, 0 replies; 87+ messages in thread
From: Anton Blanchard @ 2002-02-01  3:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Momchil Velikov,
	John Stoffel, linux-kernel

 
> all the hashes should be allocated with the bootmem allocator, that
> doesn't have the MAX_ORDER limit. Not only the pagecache hash, that is
> the only one replaced.
> 
> In short, for an optimal comparison between hash and radix tree, we'd
> need to fixup the hash allocation with the bootmem allocator first.

All my results use vmalloc to allocate the hashes so they get sized
correctly.

Don't worry, there is no increased tlb pressure on these machines due
to vmalloc, that cpu doesnt have large page support.

Anton

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 23:12                               ` Anton Blanchard
  2002-01-31 23:55                                 ` Andrea Arcangeli
@ 2002-02-01  6:32                                 ` Momchil Velikov
  2002-02-01 18:38                                   ` Anton Blanchard
  2002-02-01  9:04                                 ` Ingo Molnar
  2 siblings, 1 reply; 87+ messages in thread
From: Momchil Velikov @ 2002-02-01  6:32 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Ingo Molnar, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel

>>>>> "Anton" == Anton Blanchard <anton@samba.org> writes:

Anton> Hi Ingo,

>> Yes, it's very nice. Anton Blanchard has benchmarked both patch variants
>> (tree vs. scalable-hash page buckets) for SMP scalability against the
>> stock hash, on big RAM, many CPUs boxes, via dbench load. He has found
>> performance of radix trees vs. scalable hash to be at least equivalent. (i
>> think Anton has a few links to show the resulting graphs.)

Anton> Here are some results on a 12 way machine. (2.4.16-splay is the radix
Anton> patch):

Anton> http://samba.org/~anton/linux/pagecache_locking/1/summary.png

A correction, "-splay" is the very first variant I posted, which used
splay trees for the page cache.

Anton> As you can see both patches give pretty much equal improvements.

Anton> The other problem with the current pagecache hash is that it maxes out
Anton> at order 9 (due to the get_free_pages limitation) which starts to hurt
Anton> at 4GB RAM and above. On a 32GB machine the average hashchain depth
Anton> was very high:

Anton> http://samba.org/~anton/linux/pagecache/pagecache_before.png

Anton> There were a few solutions (from davem and ingo) to allocate a larger
Anton> hash but with the radix patch we no longer have to worry about this.

Anton> So the radix patch solves 2 problems quite nicely :)

Anton> Anton

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  6:32                                 ` Momchil Velikov
@ 2002-02-01 18:38                                   ` Anton Blanchard
  0 siblings, 0 replies; 87+ messages in thread
From: Anton Blanchard @ 2002-02-01 18:38 UTC (permalink / raw)
  To: Momchil Velikov
  Cc: Ingo Molnar, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel

 
> A correction, "-splay" is the very first variant I posted, which used
> splay trees for the page cache.

Oops now I remember why I called it -splay :) I have a radix comparison
somewhere, I'll fish it out.

Anton

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 23:12                               ` Anton Blanchard
  2002-01-31 23:55                                 ` Andrea Arcangeli
  2002-02-01  6:32                                 ` Momchil Velikov
@ 2002-02-01  9:04                                 ` Ingo Molnar
  2002-02-01  7:59                                   ` Momchil Velikov
                                                     ` (2 more replies)
  2 siblings, 3 replies; 87+ messages in thread
From: Ingo Molnar @ 2002-02-01  9:04 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Linus Torvalds, Andrea Arcangeli, Rik van Riel, Momchil Velikov,
	John Stoffel, linux-kernel

On Fri, 1 Feb 2002, Anton Blanchard wrote:

> There were a few solutions (from davem and ingo) to allocate a larger
> hash but with the radix patch we no longer have to worry about this.

there is one big issue we forgot to consider.

in the case of radix trees it's not only search depth that gets worse with
big files. The thing i'm worried about is the 'big pagecache lock' being
reintroduced again. If eg. a database application puts lots of data into a
single file (multiple gigabytes - why not), then the
mapping->i_shared_lock becomes a 'big pagecache lock' again, causing
serious SMP contention for even the read() case. Benchmarks show that it's
the distribution of locks that matters on big boxes.

dbench hides this issue, because it uses many temporary files, so the
locking overhead is distributed. Would you be willing to run benchmarks
that measure the scalability of reading from one bigger file, from
multiple CPUs?

with hash based locking, the locking overhead is *always* distributed.

with radix trees the locking overhead is distributed only if multiple
files are used. With one big file (or a few big files), the i_shared_lock
will always bounce between CPUs wildly in read() workloads, degrading
scalability just as much as it is degraded with the pagecache_lock now.

	Ingo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  9:04                                 ` Ingo Molnar
@ 2002-02-01  7:59                                   ` Momchil Velikov
  2002-02-01 10:29                                     ` Ingo Molnar
  2002-02-01 14:44                                   ` Andrea Arcangeli
  2002-02-01 14:59                                   ` Momchil Velikov
  2 siblings, 1 reply; 87+ messages in thread
From: Momchil Velikov @ 2002-02-01  7:59 UTC (permalink / raw)
  To: mingo
  Cc: Anton Blanchard, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel

>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes:

Ingo> On Fri, 1 Feb 2002, Anton Blanchard wrote:

>> There were a few solutions (from davem and ingo) to allocate a larger
>> hash but with the radix patch we no longer have to worry about this.

Ingo> there is one big issue we forgot to consider.

Ingo> in the case of radix trees it's not only search depth that gets worse with

Hmm, worse, yes, the same way as page tables get "worse" with larger
address spaces.

Ingo> big files. The thing i'm worried about is the 'big pagecache lock' being
Ingo> reintroduced again. If eg. a database application puts lots of data into a

Yes, though I'd strongly suspect big database engines can/should/do
benefit from doing their application specific caching and indexing,
outperforming whatever cache implementation the OS has.

Ingo> single file (multiple gigabytes - why not), then the
mapping-> i_shared_lock becomes a 'big pagecache lock' again, causing
Ingo> serious SMP contention for even the read() case. Benchmarks show that it's
Ingo> the distribution of locks that matters on big boxes.

So, we can use a read-write spinlock instead ->i_shared_lock, ok ?

Regards,
-velco

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  7:59                                   ` Momchil Velikov
@ 2002-02-01 10:29                                     ` Ingo Molnar
  2002-02-01  9:01                                       ` Momchil Velikov
                                                         ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Ingo Molnar @ 2002-02-01 10:29 UTC (permalink / raw)
  To: Momchil Velikov
  Cc: Anton Blanchard, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel


On 1 Feb 2002, Momchil Velikov wrote:

> Hmm, worse, yes, the same way as page tables get "worse" with larger
> address spaces.

with the difference that for address spaces one of the preferred methods
of operation is read() [or sendfile(), or any other non-mmap() operation],
while for pagetables the hardware helps to get locking-free access to the
mapped contents.

> Ingo> big files. The thing i'm worried about is the 'big pagecache lock' being
> Ingo> reintroduced again. If eg. a database application puts lots of data into a
>
> Yes, though I'd strongly suspect big database engines can/should/do
> benefit from doing their application specific caching and indexing,
> outperforming whatever cache implementation the OS has.

it's not just databases. It's webservers too, serving content via
sendfile() from a single, bigger file. Think streaming media servers,
where the 'movie of the night' sits in a single big binary glob.

> Ingo> single file (multiple gigabytes - why not), then the
> mapping-> i_shared_lock becomes a 'big pagecache lock' again, causing
> Ingo> serious SMP contention for even the read() case. Benchmarks show that it's
> Ingo> the distribution of locks that matters on big boxes.
>
> So, we can use a read-write spinlock instead ->i_shared_lock, ok ?

using read-write locks does not solve the scalability problem: the problem
is the bouncing of the spinlock cacheline from CPU to CPU.

	Ingo


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 10:29                                     ` Ingo Molnar
@ 2002-02-01  9:01                                       ` Momchil Velikov
  2002-02-01  9:10                                         ` David S. Miller
  2002-02-01  9:07                                       ` David S. Miller
  2002-02-01 17:06                                       ` Linus Torvalds
  2 siblings, 1 reply; 87+ messages in thread
From: Momchil Velikov @ 2002-02-01  9:01 UTC (permalink / raw)
  To: mingo
  Cc: Anton Blanchard, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel

>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes:

Ingo> On 1 Feb 2002, Momchil Velikov wrote:
>> So, we can use a read-write spinlock instead ->i_shared_lock, ok ?

Ingo> using read-write locks does not solve the scalability problem: the problem
Ingo> is the bouncing of the spinlock cacheline from CPU to CPU.

Does cache line bounce (shared somewhere -> exclusive elsewhere) cost
more that a simple miss (present nowhere -> exclusive somewhere) ?

Regards,
-velco


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  9:01                                       ` Momchil Velikov
@ 2002-02-01  9:10                                         ` David S. Miller
  0 siblings, 0 replies; 87+ messages in thread
From: David S. Miller @ 2002-02-01  9:10 UTC (permalink / raw)
  To: velco; +Cc: mingo, anton, torvalds, andrea, riel, stoffel, linux-kernel

   From: Momchil Velikov <velco@fadata.bg>
   Date: 01 Feb 2002 11:01:50 +0200

   Does cache line bounce (shared somewhere -> exclusive elsewhere) cost
   more that a simple miss (present nowhere -> exclusive somewhere) ?

They are about equal.  For coherency purposes all cpus have to listen
to all the transactions anyways to see if they have a match in their
L2 caches (and thus must provide the data to the requestor).

Perhaps the exclusive somewhere --> exclusive somewhere else is a bit
more expensive because you eat a write port for the cache line move on
the processor providing the data.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 10:29                                     ` Ingo Molnar
  2002-02-01  9:01                                       ` Momchil Velikov
@ 2002-02-01  9:07                                       ` David S. Miller
  2002-02-01  9:13                                         ` Momchil Velikov
  2002-02-01 17:06                                       ` Linus Torvalds
  2 siblings, 1 reply; 87+ messages in thread
From: David S. Miller @ 2002-02-01  9:07 UTC (permalink / raw)
  To: mingo; +Cc: velco, anton, torvalds, andrea, riel, stoffel, linux-kernel

   From: Ingo Molnar <mingo@elte.hu>
   Date: Fri, 1 Feb 2002 11:29:53 +0100 (CET)
   
   using read-write locks does not solve the scalability problem: the problem
   is the bouncing of the spinlock cacheline from CPU to CPU.

I so much wish more people understood this :(

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  9:07                                       ` David S. Miller
@ 2002-02-01  9:13                                         ` Momchil Velikov
  0 siblings, 0 replies; 87+ messages in thread
From: Momchil Velikov @ 2002-02-01  9:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: mingo, anton, torvalds, andrea, riel, stoffel, linux-kernel

>>>>> "David" == David S Miller <davem@redhat.com> writes:

David>    From: Ingo Molnar <mingo@elte.hu>
David>    Date: Fri, 1 Feb 2002 11:29:53 +0100 (CET)
   
David>    using read-write locks does not solve the scalability problem: the problem
David>    is the bouncing of the spinlock cacheline from CPU to CPU.

David> I so much wish more people understood this :(

Amen. From now on I'll have it on an yellow sticker on my display ;)



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 10:29                                     ` Ingo Molnar
  2002-02-01  9:01                                       ` Momchil Velikov
  2002-02-01  9:07                                       ` David S. Miller
@ 2002-02-01 17:06                                       ` Linus Torvalds
  2002-02-01 18:29                                         ` Jeff Garzik
  2002-02-01 23:49                                         ` Ingo Molnar
  2 siblings, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-02-01 17:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Momchil Velikov, Anton Blanchard, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel

On Fri, 1 Feb 2002, Ingo Molnar wrote:
>
> it's not just databases. It's webservers too, serving content via
> sendfile() from a single, bigger file. Think streaming media servers,
> where the 'movie of the night' sits in a single big binary glob.

Hey guys, be _realistic_.

Don't bother with "oh, this could be bad", when there are absolutely _no_
numbers showing any such badness.

Even databases often use multiple files, and quite frankly, a database
that doesn't use mmap and doesn't try very hard to not cause extra system
calls is going to be bad performance-wise _regardless_ of any page cache
locking.

Radix-trees are cleaner than the alternatives, and all the numbers anybody
has ever shown shows them to be at least equal in performance.

Stop making up things that just are NOT problems.

In web-servers, 99% of the content is small files, and if the file is
cached the expensive parts are all elsewhere. Don't make up "worst case
schenarios" that simply do no exist.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 17:06                                       ` Linus Torvalds
@ 2002-02-01 18:29                                         ` Jeff Garzik
  2002-02-01 18:44                                           ` arjan
  2002-02-05  9:19                                           ` Zdenek Kabelac
  2002-02-01 23:49                                         ` Ingo Molnar
  1 sibling, 2 replies; 87+ messages in thread
From: Jeff Garzik @ 2002-02-01 18:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Momchil Velikov, Anton Blanchard, Andrea Arcangeli,
	Rik van Riel, John Stoffel, linux-kernel

On Fri, Feb 01, 2002 at 09:06:37AM -0800, Linus Torvalds wrote:
> Even databases often use multiple files, and quite frankly, a database
> that doesn't use mmap and doesn't try very hard to not cause extra system
> calls is going to be bad performance-wise _regardless_ of any page cache
> locking.

I've always thought that read(2) and write(2) would in the end wind up
faster than mmap(2)...  Tests in my rewritten cp/rm/mv type utilities
seem to bear this out.

Is mmap(2) only preferred for large files/databases?

	Jeff




^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 18:29                                         ` Jeff Garzik
@ 2002-02-01 18:44                                           ` arjan
  2002-02-01 19:47                                             ` Jeff Garzik
  2002-02-05  9:19                                           ` Zdenek Kabelac
  1 sibling, 1 reply; 87+ messages in thread
From: arjan @ 2002-02-01 18:44 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel

In article <20020201132953.A27508@havoc.gtf.org> you wrote:
> On Fri, Feb 01, 2002 at 09:06:37AM -0800, Linus Torvalds wrote:
>> Even databases often use multiple files, and quite frankly, a database
>> that doesn't use mmap and doesn't try very hard to not cause extra system
>> calls is going to be bad performance-wise _regardless_ of any page cache
>> locking.

> I've always thought that read(2) and write(2) would in the end wind up
> faster than mmap(2)...  Tests in my rewritten cp/rm/mv type utilities
> seem to bear this out.

the biggest reason for this is that we *suck* at readahead for mmap....

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 18:44                                           ` arjan
@ 2002-02-01 19:47                                             ` Jeff Garzik
  2002-02-02 15:39                                               ` Rik van Riel
  0 siblings, 1 reply; 87+ messages in thread
From: Jeff Garzik @ 2002-02-01 19:47 UTC (permalink / raw)
  To: arjan; +Cc: linux-kernel

On Fri, Feb 01, 2002 at 06:44:50PM +0000, arjan@fenrus.demon.nl wrote:
> In article <20020201132953.A27508@havoc.gtf.org> you wrote:
> > On Fri, Feb 01, 2002 at 09:06:37AM -0800, Linus Torvalds wrote:
> >> Even databases often use multiple files, and quite frankly, a database
> >> that doesn't use mmap and doesn't try very hard to not cause extra system
> >> calls is going to be bad performance-wise _regardless_ of any page cache
> >> locking.
> 
> > I've always thought that read(2) and write(2) would in the end wind up
> > faster than mmap(2)...  Tests in my rewritten cp/rm/mv type utilities
> > seem to bear this out.
> 
> the biggest reason for this is that we *suck* at readahead for mmap....

Is there not also fault overhead and similar issues related to mmap(2)
in general, that are not present with read(2)/write(2)?

	Jeff




^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 19:47                                             ` Jeff Garzik
@ 2002-02-02 15:39                                               ` Rik van Riel
  2002-02-05 14:21                                                 ` Pavel Machek
  0 siblings, 1 reply; 87+ messages in thread
From: Rik van Riel @ 2002-02-02 15:39 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: arjan, linux-kernel

On Fri, 1 Feb 2002, Jeff Garzik wrote:

> > the biggest reason for this is that we *suck* at readahead for mmap....
>
> Is there not also fault overhead and similar issues related to mmap(2)
> in general, that are not present with read(2)/write(2)?

If a fault is more expensive than a system call, we're doing
something wrong in the page fault path ;)

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-02 15:39                                               ` Rik van Riel
@ 2002-02-05 14:21                                                 ` Pavel Machek
  2002-02-05 18:45                                                   ` Rik van Riel
  0 siblings, 1 reply; 87+ messages in thread
From: Pavel Machek @ 2002-02-05 14:21 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Jeff Garzik, arjan, linux-kernel

Hi!

> > > the biggest reason for this is that we *suck* at readahead for mmap....
> >
> > Is there not also fault overhead and similar issues related to mmap(2)
> > in general, that are not present with read(2)/write(2)?
> 
> If a fault is more expensive than a system call, we're doing
> something wrong in the page fault path ;)

You can read 128K at a time, but you can't fault 128K...
								Pavel

-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-05 14:21                                                 ` Pavel Machek
@ 2002-02-05 18:45                                                   ` Rik van Riel
  2002-02-05 20:30                                                     ` Eric Dumazet
                                                                       ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Rik van Riel @ 2002-02-05 18:45 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Jeff Garzik, arjan, linux-kernel

On Tue, 5 Feb 2002, Pavel Machek wrote:

> > > > the biggest reason for this is that we *suck* at readahead for mmap....
> > >
> > > Is there not also fault overhead and similar issues related to mmap(2)
> > > in general, that are not present with read(2)/write(2)?
> >
> > If a fault is more expensive than a system call, we're doing
> > something wrong in the page fault path ;)
>
> You can read 128K at a time, but you can't fault 128K...

Why not ?

If the pages are present (read-ahead) and the page table
is present, I see no reason why we couldn't fill in 32
page table entries at once.

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-05 18:45                                                   ` Rik van Riel
@ 2002-02-05 20:30                                                     ` Eric Dumazet
  2002-02-05 20:46                                                     ` Pavel Machek
  2002-02-06  9:07                                                     ` Daniel Phillips
  2 siblings, 0 replies; 87+ messages in thread
From: Eric Dumazet @ 2002-02-05 20:30 UTC (permalink / raw)
  To: Rik van Riel, Pavel Machek; +Cc: Jeff Garzik, arjan, linux-kernel

> On Tue, 5 Feb 2002, Pavel Machek wrote:
>
> > > > > the biggest reason for this is that we *suck* at readahead for
mmap....
> > > >
> > > > Is there not also fault overhead and similar issues related to
mmap(2)
> > > > in general, that are not present with read(2)/write(2)?
> > >
> > > If a fault is more expensive than a system call, we're doing
> > > something wrong in the page fault path ;)
> >
> > You can read 128K at a time, but you can't fault 128K...
>
> Why not ?
>
> If the pages are present (read-ahead) and the page table
> is present, I see no reason why we couldn't fill in 32
> page table entries at once.
>
> Rik

Well, filling 32 page tables entries at once is certainly a big readahead...
for the common cases.

Maybe this high number could be a result of a madavise(..., MADV_SEQUENTIAL
or MAP_WILLNEED)
Solaris does exactly this kind of trick.

Eric



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-05 18:45                                                   ` Rik van Riel
  2002-02-05 20:30                                                     ` Eric Dumazet
@ 2002-02-05 20:46                                                     ` Pavel Machek
  2002-02-06  9:07                                                     ` Daniel Phillips
  2 siblings, 0 replies; 87+ messages in thread
From: Pavel Machek @ 2002-02-05 20:46 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Jeff Garzik, arjan, linux-kernel

Hi!

> > > > > the biggest reason for this is that we *suck* at readahead for mmap....
> > > >
> > > > Is there not also fault overhead and similar issues related to mmap(2)
> > > > in general, that are not present with read(2)/write(2)?
> > >
> > > If a fault is more expensive than a system call, we're doing
> > > something wrong in the page fault path ;)
> >
> > You can read 128K at a time, but you can't fault 128K...
> 
> Why not ?
> 
> If the pages are present (read-ahead) and the page table
> is present, I see no reason why we couldn't fill in 32
> page table entries at once.

Ugh.

Okay, CPU will still have to fill its TLBs (it does not have to in
read case), but that is way easier operation. 

I did not think about this possibility, sorry.
									Pavel
-- 
(about SSSCA) "I don't say this lightly.  However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-05 18:45                                                   ` Rik van Riel
  2002-02-05 20:30                                                     ` Eric Dumazet
  2002-02-05 20:46                                                     ` Pavel Machek
@ 2002-02-06  9:07                                                     ` Daniel Phillips
  2 siblings, 0 replies; 87+ messages in thread
From: Daniel Phillips @ 2002-02-06  9:07 UTC (permalink / raw)
  To: Rik van Riel, Pavel Machek; +Cc: Jeff Garzik, arjan, linux-kernel

On February 5, 2002 07:45 pm, Rik van Riel wrote:
> On Tue, 5 Feb 2002, Pavel Machek wrote:
> > > > > the biggest reason for this is that we *suck* at readahead for
> > > > > mmap....
> > > >
> > > > Is there not also fault overhead and similar issues related to mmap(2)
> > > > in general, that are not present with read(2)/write(2)?
> > >
> > > If a fault is more expensive than a system call, we're doing
> > > something wrong in the page fault path ;)
> >
> > You can read 128K at a time, but you can't fault 128K...
> 
> Why not ?
> 
> If the pages are present (read-ahead) and the page table
> is present, I see no reason why we couldn't fill in 32
> page table entries at once.

Yes, essentially what you want is to schedule a generic_file_readahead, which 
we'd need to cook up a mechanism for doing.  The other part - much harder - 
is deciding when to readahead, and how much.

I'd amend your original statement to just 'we *suck* at readahead'.

-- 
Daniel

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 18:29                                         ` Jeff Garzik
  2002-02-01 18:44                                           ` arjan
@ 2002-02-05  9:19                                           ` Zdenek Kabelac
  1 sibling, 0 replies; 87+ messages in thread
From: Zdenek Kabelac @ 2002-02-05  9:19 UTC (permalink / raw)
  To: Jeff Garzik

Jeff Garzik wrote:
> 
> On Fri, Feb 01, 2002 at 09:06:37AM -0800, Linus Torvalds wrote:
> > Even databases often use multiple files, and quite frankly, a database
> > that doesn't use mmap and doesn't try very hard to not cause extra system
> > calls is going to be bad performance-wise _regardless_ of any page cache
> > locking.
> 
> I've always thought that read(2) and write(2) would in the end wind up
> faster than mmap(2)...  Tests in my rewritten cp/rm/mv type utilities
> seem to bear this out.
> 
> Is mmap(2) only preferred for large files/databases?
 
I've tried to make faster md5summing program and programmed several
ways of accessing file - for the very large files the fastest
way seemed to be  O_DIRECT with threaded precaching.

For fast mmap access I'd to implement two parallel mmpad areas with
madvise MADV_WILLNEED  - then it was almost as fast as read


-- 
  .''`.  Which fundamental human right do you want to give up today?
 : :' :      Debian GNU/Linux maintainer - www.debian.{org,cz}
 `. `'  Zdenek Kabelac  kabi@{debian.org, users.sf.net, fi.muni.cz}
   `-              When in doubt, just blame the Euro. :)


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 17:06                                       ` Linus Torvalds
  2002-02-01 18:29                                         ` Jeff Garzik
@ 2002-02-01 23:49                                         ` Ingo Molnar
  1 sibling, 0 replies; 87+ messages in thread
From: Ingo Molnar @ 2002-02-01 23:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Momchil Velikov, Anton Blanchard, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel

On Fri, 1 Feb 2002, Linus Torvalds wrote:

> In web-servers, 99% of the content is small files, and if the file is
> cached the expensive parts are all elsewhere. Don't make up "worst
> case schenarios" that simply do no exist.

in fact the locking structure of radix trees have a locking advantage in
the 'multiple small files' case: if one CPU does a sendfile() on one file,
then the lock will be likely CPU-local for the duration of the sendfile(),
while page buckets will access a new spinlock for every page accessed.

	Ingo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  9:04                                 ` Ingo Molnar
  2002-02-01  7:59                                   ` Momchil Velikov
@ 2002-02-01 14:44                                   ` Andrea Arcangeli
  2002-02-01 14:59                                   ` Momchil Velikov
  2 siblings, 0 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2002-02-01 14:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Anton Blanchard, Linus Torvalds, Rik van Riel, Momchil Velikov,
	John Stoffel, linux-kernel

On Fri, Feb 01, 2002 at 10:04:50AM +0100, Ingo Molnar wrote:
> 
> On Fri, 1 Feb 2002, Anton Blanchard wrote:
> 
> > There were a few solutions (from davem and ingo) to allocate a larger
> > hash but with the radix patch we no longer have to worry about this.
> 
> there is one big issue we forgot to consider.
> 
> in the case of radix trees it's not only search depth that gets worse with
> big files. The thing i'm worried about is the 'big pagecache lock' being
> reintroduced again. If eg. a database application puts lots of data into a
> single file (multiple gigabytes - why not), then the
> mapping->i_shared_lock becomes a 'big pagecache lock' again, causing
> serious SMP contention for even the read() case. Benchmarks show that it's
> the distribution of locks that matters on big boxes.

exactly, this is the same thing I mentioned in some past email. It's not
that having per-inode data structures solves the locking completly, DBMS
are used to store stuff in a single file. And of course with a structure
like radix tree it would be a pain to have it scale within the same
file, unlike with the hashtable where each bucket is indipendent from
the others.

> 
> dbench hides this issue, because it uses many temporary files, so the

Indeed, a lot of workloads would benefit from the separate data
structure and locking, but not all, some important one not.

> locking overhead is distributed. Would you be willing to run benchmarks
> that measure the scalability of reading from one bigger file, from
> multiple CPUs?

Agreed, also with DaveM patch applied, sizing the hash properly so it
has a mean distribution of 1 entry per bucket or so, will decrease the
window for the spinlock collisions as well btw.

> 
> with hash based locking, the locking overhead is *always* distributed.
> 
> with radix trees the locking overhead is distributed only if multiple
> files are used. With one big file (or a few big files), the i_shared_lock
> will always bounce between CPUs wildly in read() workloads, degrading
> scalability just as much as it is degraded with the pagecache_lock now.
> 
> 	Ingo


Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01  9:04                                 ` Ingo Molnar
  2002-02-01  7:59                                   ` Momchil Velikov
  2002-02-01 14:44                                   ` Andrea Arcangeli
@ 2002-02-01 14:59                                   ` Momchil Velikov
  2002-02-01 17:03                                     ` Ingo Molnar
  2 siblings, 1 reply; 87+ messages in thread
From: Momchil Velikov @ 2002-02-01 14:59 UTC (permalink / raw)
  To: mingo
  Cc: Anton Blanchard, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel

>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes:

Ingo> files are used. With one big file (or a few big files), the i_shared_lock
Ingo> will always bounce between CPUs wildly in read() workloads, degrading

Will there be difference between bounces of a rwlock in the radix tree
variant and the cache misses in hashed locks variant for the case of
concurrently accessed large file ?

Regards,
-velco

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 14:59                                   ` Momchil Velikov
@ 2002-02-01 17:03                                     ` Ingo Molnar
  2002-02-01 15:26                                       ` Momchil Velikov
  0 siblings, 1 reply; 87+ messages in thread
From: Ingo Molnar @ 2002-02-01 17:03 UTC (permalink / raw)
  To: Momchil Velikov
  Cc: Anton Blanchard, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel


On 1 Feb 2002, Momchil Velikov wrote:

> Ingo> files are used. With one big file (or a few big files), the i_shared_lock
> Ingo> will always bounce between CPUs wildly in read() workloads, degrading
>
> Will there be difference between bounces of a rwlock in the radix tree
> variant and the cache misses in hashed locks variant for the case of
> concurrently accessed large file ?

definitely, because in the case of page buckets there are many locks
hashed in a mapping-neutral way. Ie. different parts of the same file will
likely map to different spinlocks. In the radix tree case all pages in the
inode will map to the same spinlock.

	Ingo


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 17:03                                     ` Ingo Molnar
@ 2002-02-01 15:26                                       ` Momchil Velikov
  2002-02-01 23:45                                         ` Ingo Molnar
  0 siblings, 1 reply; 87+ messages in thread
From: Momchil Velikov @ 2002-02-01 15:26 UTC (permalink / raw)
  To: mingo
  Cc: Anton Blanchard, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel

>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes:

Ingo> On 1 Feb 2002, Momchil Velikov wrote:

Ingo> files are used. With one big file (or a few big files), the i_shared_lock
Ingo> will always bounce between CPUs wildly in read() workloads, degrading
>> 
>> Will there be difference between bounces of a rwlock in the radix tree
>> variant and the cache misses in hashed locks variant for the case of
>> concurrently accessed large file ?

Ingo> definitely, because in the case of page buckets there are many locks
Ingo> hashed in a mapping-neutral way. Ie. different parts of the same file will
Ingo> likely map to different spinlocks.

That's why it's likely to miss on each access.  

Ingo> In the radix tree case all pages in the inode will map to the
Ingo> same spinlock.

That's why it's likely to bounce on each access.

So, is there any difference ? :)

Regards,
-velco

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-02-01 15:26                                       ` Momchil Velikov
@ 2002-02-01 23:45                                         ` Ingo Molnar
  0 siblings, 0 replies; 87+ messages in thread
From: Ingo Molnar @ 2002-02-01 23:45 UTC (permalink / raw)
  To: Momchil Velikov
  Cc: Anton Blanchard, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	John Stoffel, linux-kernel


On 1 Feb 2002, Momchil Velikov wrote:

> Ingo> definitely, because in the case of page buckets there are many locks
> Ingo> hashed in a mapping-neutral way. Ie. different parts of the same file will
> Ingo> likely map to different spinlocks.
>
> That's why it's likely to miss on each access.

yes, you are right.

> Ingo> In the radix tree case all pages in the inode will map to the
> Ingo> same spinlock.
>
> That's why it's likely to bounce on each access.
>
> So, is there any difference ? :)

no difference. I tried to create a testcase that shows the difference
(multiple processes read()ing a single big file on an 8-way box), but
performance was equivalent. So given the clear advantages of radix trees
in other areas, they win hands down. :)

	Ingo


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-30 22:15           ` Momchil Velikov
  2002-01-31  2:33             ` Andrea Arcangeli
@ 2002-01-31 10:41             ` Josh MacDonald
  2002-01-31 14:00               ` Rik van Riel
  2002-01-31 14:21               ` Momchil Velikov
  1 sibling, 2 replies; 87+ messages in thread
From: Josh MacDonald @ 2002-01-31 10:41 UTC (permalink / raw)
  To: Momchil Velikov; +Cc: linux-kernel

Quoting Momchil Velikov (velco@fadata.bg):
> >>>>> "John" == John Stoffel <stoffel@casc.com> writes:
> 
> Momchil> Memory overhead due to allocator overhead is of no concern with the
> Momchil> slab allocator. What matters most is probably the overhead of the
> Momchil> radix tree nodes themselves, compared to the two pointers in struct
> Momchil> page with the hash table approach. rat-4 variant ought to have less
> Momchil> overhead compared to rat-7 at the expense of deeper/higher tree. I
> Momchil> have no figures for the actual memory usage though. For small files it
> Momchil> should be negligible, i.e. one radix tree node, 68 or 516 bytes for
> Momchil> rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes.  The
> Momchil> worst case would be very large file with a few cached pages with
> Momchil> offsets uniformly distributed across the whole file, that is having
> Momchil> deep tree with only one page hanging off each leaf node.
> 
> John> Isn't this a good place to use AVL trees then, since they balance
> John> automatically?  Admittedly, it may be more overhead than we want in
> John> the case where the tree is balanced by default anyway.  
> 
> The widespread opinion is that binary trees are generally way too deep
> compared to radix trees, so searches have larger cache footprint.

I've posted this before -- my cache-optimized skip list solves the
problem of balanced-tree cache footprint.  It uses cacheline-sized
nodes and per-node locking to avoid false-sharing and increase 
concurrency.  The memory usage for the skip list is also less than
the red-black tree for trees larger than several hundred nodes.

I posted a graph on space consumption (using the Linux vm_area_struct to 
calculate space overhead) at:

	http://prdownloads.sourceforge.net/skiplist/slrb_space.gif

There are also results for concurrency and performance as a function 
of node size.

-josh

-- 
PRCS version control system    http://sourceforge.net/projects/prcs
Xdelta storage & transport     http://sourceforge.net/projects/xdelta
Need a concurrent skip list?   http://sourceforge.net/projects/skiplist

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 10:41             ` Josh MacDonald
@ 2002-01-31 14:00               ` Rik van Riel
  2002-01-31 14:21               ` Momchil Velikov
  1 sibling, 0 replies; 87+ messages in thread
From: Rik van Riel @ 2002-01-31 14:00 UTC (permalink / raw)
  To: Josh MacDonald; +Cc: Momchil Velikov, linux-kernel

On Thu, 31 Jan 2002, Josh MacDonald wrote:

> I've posted this before -- my cache-optimized skip list solves the
> problem of balanced-tree cache footprint.  It uses cacheline-sized
> nodes and per-node locking to avoid false-sharing and increase
> concurrency.  The memory usage for the skip list is also less than
> the red-black tree for trees larger than several hundred nodes.

I'd be happy to test a kernel where the page cache uses
these skip lists for indexing.

Where can I download the patch ?  ;)

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-31 10:41             ` Josh MacDonald
  2002-01-31 14:00               ` Rik van Riel
@ 2002-01-31 14:21               ` Momchil Velikov
  1 sibling, 0 replies; 87+ messages in thread
From: Momchil Velikov @ 2002-01-31 14:21 UTC (permalink / raw)
  To: Josh MacDonald; +Cc: linux-kernel

>>>>> "Josh" == Josh MacDonald <jmacd@CS.Berkeley.EDU> writes:

Josh> Quoting Momchil Velikov (velco@fadata.bg):
>> >>>>> "John" == John Stoffel <stoffel@casc.com> writes:

Momchil>  The worst case would be very large file with a few cached
Momchil> pages with offsets uniformly distributed across the whole
Momchil> file, that is having deep tree with only one page hanging off
Momchil> each leaf node.

John> Isn't this a good place to use AVL trees then, since they balance
John> automatically?  Admittedly, it may be more overhead than we want in
John> the case where the tree is balanced by default anyway.  

>> The widespread opinion is that binary trees are generally way too deep
>> compared to radix trees, so searches have larger cache footprint.

Josh> I've posted this before -- my cache-optimized skip list solves the
Josh> problem of balanced-tree cache footprint.  It uses cacheline-sized

I don't think skip lists differ from the balanced trees w.r.t cache
line footprint.

Josh> nodes and per-node locking to avoid false-sharing and increase 

Whether there _is_ a (non-negligible) false sharing would be an open
question.

Josh> concurrency.  The memory usage for the skip list is also less than
Josh> the red-black tree for trees larger than several hundred nodes.

Yes. Skip list or (whatever) b-tree are sure to have less space
overhead in the worst case.  Therefore, I'd be curious to see
comparisons with the three pagecache implementations.  Note that in my
last patch you can do a drop-in replacement of the radix tree with a
skip list, since memory allocation issues are solved.

Regards,
-velco

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-30 22:05         ` John Stoffel
  2002-01-30 22:15           ` Momchil Velikov
@ 2002-01-30 22:22           ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2002-01-30 22:22 UTC (permalink / raw)
  To: John Stoffel; +Cc: Linus Torvalds, linux-kernel, Momchil Velikov

In article <15448.28224.481925.430169@gargle.gargle.HOWL> you wrote:
> Isn't this a good place to use AVL trees then, since they balance
> automatically?  Admittedly, it may be more overhead than we want in
> the case where the tree is balanced by default anyway.  

OpenUnix uses AVL trees for the pagecache.  The overhead in struct page
is immense..

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 23:02   ` Momchil Velikov
  2002-01-29 23:33     ` Linus Torvalds
@ 2002-01-30  3:02     ` Daniel Phillips
  1 sibling, 0 replies; 87+ messages in thread
From: Daniel Phillips @ 2002-01-30  3:02 UTC (permalink / raw)
  To: Momchil Velikov, Linus Torvalds; +Cc: linux-kernel

On January 30, 2002 12:02 am, Momchil Velikov wrote:
> Well, these are dbench numbers from December, it's
> 2.4.17. Unfortunately, it appears OSDL have trouble with 2.5 currently ...

Have you tested with anything besides dbench?

-- 
Daniel

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] Radix-tree pagecache for 2.5
  2002-01-29 15:54 Christoph Hellwig
  2002-01-29 19:27 ` Linus Torvalds
@ 2002-01-29 23:00 ` William Lee Irwin III
  1 sibling, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-01-29 23:00 UTC (permalink / raw)
  To: Christoph Hellwig, linux-kernel, linux-mm, velco

On Tue, Jan 29, 2002 at 04:54:44PM +0100, Christoph Hellwig wrote:
> I've ported my hacked up version of Momchil Velikov's radix tree
> radix tree pagecache to 2.5.3-pre{5,6}.
> The changes over the 2.4.17 version are:
>   o use mempool to avoid OOM situation involving radix nodes.
>   o remove add_to_page_cache_locked, it was unused in the 2.4.17 patch.
>   o unify add_to_page and add_to_page_unique
> It gives nice scalability improvements on big machines and drops the
> memory usage on small ones (if you consider my 64MB Athlon small :)).

I love this patch. My only concern is about worst-case space consumption,
but it is beautiful regardless, and space consumption can be addressed
later if it is a problem in practice. The average case space consumption,
as you have noted, is quite good already, and it seems difficult to
trigger the worst case (I have tested it myself).


Cheers,
Bill

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2002-02-16 16:25 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-02-06 21:34 [PATCH] Radix-tree pagecache for 2.5 rwhron
2002-02-06 21:37 ` Rik van Riel
2002-02-06 22:06   ` rwhron
2002-02-07 11:32   ` Daniel Phillips
2002-02-07 11:32   ` Daniel Phillips
  -- strict thread matches above, loose matches on Subject: below --
2002-02-02 19:23 rwhron
2002-02-03 14:31 ` chris
2002-02-03 23:33 ` Momchil Velikov
2002-02-04  3:59   ` rwhron
2002-02-06  2:04   ` rwhron
2002-02-06 11:44     ` Rik van Riel
2002-01-29 15:54 Christoph Hellwig
2002-01-29 19:27 ` Linus Torvalds
2002-01-29 21:40   ` David S. Miller
2002-01-29 22:07     ` Linus Torvalds
2002-01-29 23:01       ` Rik van Riel
2002-01-29 23:32         ` Alan Cox
2002-01-29 23:35           ` Rik van Riel
2002-01-30  3:00             ` Daniel Phillips
2002-01-31 23:44           ` Anton Blanchard
2002-02-01  0:34             ` Alan Cox
2002-02-01 11:04               ` Rik van Riel
2002-02-01 11:33                 ` Arjan van de Ven
2002-02-02 18:57                 ` Richard Henderson
2002-02-02 21:15                   ` Rik van Riel
2002-01-29 23:02   ` Momchil Velikov
2002-01-29 23:33     ` Linus Torvalds
2002-01-29 23:45       ` Christoph Hellwig
2002-01-30 21:25       ` Momchil Velikov
2002-01-30 22:05         ` John Stoffel
2002-01-30 22:15           ` Momchil Velikov
2002-01-31  2:33             ` Andrea Arcangeli
2002-01-31 13:58               ` Rik van Riel
2002-01-31 14:36                 ` Andrea Arcangeli
2002-01-31 15:32                   ` Alan Cox
2002-01-31 16:39                   ` William Lee Irwin III
2002-01-31 17:19                     ` William Lee Irwin III
2002-01-31 17:21                     ` Andrea Arcangeli
2002-01-31 17:50                       ` William Lee Irwin III
2002-01-31 17:46                   ` Linus Torvalds
2002-01-31 18:02                     ` Andrea Arcangeli
2002-01-31 18:32                       ` Linus Torvalds
2002-01-31 18:38                         ` Rik van Riel
2002-01-31 18:49                           ` Linus Torvalds
2002-01-31 19:09                             ` Momchil Velikov
2002-01-31 19:26                             ` Andrew Morton
2002-01-31 21:12                             ` Momchil Velikov
2002-01-31 19:14                         ` Andrea Arcangeli
2002-01-31 19:23                           ` Linus Torvalds
2002-01-31 21:34                             ` Ingo Molnar
2002-01-31 23:12                               ` Anton Blanchard
2002-01-31 23:55                                 ` Andrea Arcangeli
2002-02-01  0:01                                   ` David S. Miller
2002-02-16 16:20                                     ` Andrea Arcangeli
2002-02-01  3:56                                   ` Anton Blanchard
2002-02-01  6:32                                 ` Momchil Velikov
2002-02-01 18:38                                   ` Anton Blanchard
2002-02-01  9:04                                 ` Ingo Molnar
2002-02-01  7:59                                   ` Momchil Velikov
2002-02-01 10:29                                     ` Ingo Molnar
2002-02-01  9:01                                       ` Momchil Velikov
2002-02-01  9:10                                         ` David S. Miller
2002-02-01  9:07                                       ` David S. Miller
2002-02-01  9:13                                         ` Momchil Velikov
2002-02-01 17:06                                       ` Linus Torvalds
2002-02-01 18:29                                         ` Jeff Garzik
2002-02-01 18:44                                           ` arjan
2002-02-01 19:47                                             ` Jeff Garzik
2002-02-02 15:39                                               ` Rik van Riel
2002-02-05 14:21                                                 ` Pavel Machek
2002-02-05 18:45                                                   ` Rik van Riel
2002-02-05 20:30                                                     ` Eric Dumazet
2002-02-05 20:46                                                     ` Pavel Machek
2002-02-06  9:07                                                     ` Daniel Phillips
2002-02-05  9:19                                           ` Zdenek Kabelac
2002-02-01 23:49                                         ` Ingo Molnar
2002-02-01 14:44                                   ` Andrea Arcangeli
2002-02-01 14:59                                   ` Momchil Velikov
2002-02-01 17:03                                     ` Ingo Molnar
2002-02-01 15:26                                       ` Momchil Velikov
2002-02-01 23:45                                         ` Ingo Molnar
2002-01-31 10:41             ` Josh MacDonald
2002-01-31 14:00               ` Rik van Riel
2002-01-31 14:21               ` Momchil Velikov
2002-01-30 22:22           ` Christoph Hellwig
2002-01-30  3:02     ` Daniel Phillips
2002-01-29 23:00 ` William Lee Irwin III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox