* [rfc] lockless pagecache
@ 2005-06-27 6:29 Nick Piggin
2005-06-27 7:46 ` Andrew Morton
2005-06-29 10:49 ` Hirokazu Takahashi
0 siblings, 2 replies; 25+ messages in thread
From: Nick Piggin @ 2005-06-27 6:29 UTC (permalink / raw)
To: linux-kernel, Linux Memory Management
Hi,
This is going to be a fairly long and probably incoherent post. The
idea and implementation are not completely analysed for holes, and
I wouldn't be surprised if some (even fatal ones) exist.
That said, I wanted something to talk about at Ottawa and I think
this is a promising idea - it is at the stage where it would be good
to have interested parties pick it apart. BTW. this is my main reason
for the PageReserved removal patches, so if this falls apart then
some good will have come from it! :)
OK, so my aim is to remove the requirement to take mapping->tree_lock
when looking up pagecache pages (eg. for a read/write or nopage fault).
Note that this does not deal with insertion and removal of pages from
pagecache mappings - that is usually a slower path operation associated
with IO or page reclaim or truncate. However if there was interest in
making these paths more scalable, there are possibilities for that too.
What for? Well there are probably lots of reasons, but suppose you have
a big app with lots of processes all mmaping and playing around with
various parts of the same big file (say, a shared memory file), then
you might start seeing problems if you want to scale this workload up
to say 32+ CPUs.
Now the tree_lock was recently(ish) converted to an rwlock, precisely
for such a workload and that was apparently very successful. However
an rwlock is significantly heavier, and as machines get faster and
bigger, rwlocks (and any locks) will tend to use more and more of Paul
McKenney's toilet paper due to cacheline bouncing.
So in the interest of saving some trees, let's try it without any locks.
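The lookup side of the idea can be sketched in user-space C11: take a reference only if the page's refcount is still non-zero, then re-check that the slot still points at the same page. This is an illustrative toy, not the actual patches - the names and structures here are stand-ins, and a real kernel version additionally needs RCU (or similar) to keep the page's memory valid during the speculative window:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct page {
	atomic_int count;	/* 0 means the page is being freed */
	long index;
};

/* Take a reference only if someone still holds one: the lockless
 * analogue of atomic_inc_not_zero().  Returns 1 on success, 0 if the
 * page is on its way to being freed. */
static int get_page_speculative(struct page *page)
{
	int old = atomic_load(&page->count);
	do {
		if (old == 0)
			return 0;	/* the freer won; caller must retry */
	} while (!atomic_compare_exchange_weak(&page->count, &old, old + 1));
	return 1;
}

/* Lockless lookup: read the slot, grab a speculative reference, then
 * re-check the slot still points at the same page (it may have been
 * removed and reused while we held no lock). */
static struct page *find_get_page_lockless(_Atomic(struct page *) *slot)
{
	struct page *page;

	for (;;) {
		page = atomic_load(slot);
		if (!page)
			return NULL;
		if (!get_page_speculative(page))
			continue;	/* page was dying, reload the slot */
		if (atomic_load(slot) == page)
			return page;	/* still current: success */
		atomic_fetch_sub(&page->count, 1);	/* raced with removal */
	}
}
```

The re-check after taking the reference is what makes the unlocked read safe: either we got a reference while the page was still in the slot, or we drop it and retry.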
First I'll put up some numbers to get you interested - of a 64-way Altix
with 64 processes each read-faulting in their own 512MB part of a 32GB
file that is preloaded in pagecache (with the proper NUMA memory
allocation).
[best of 5 runs]
plain 2.6.12-git4:
1 proc 0.65u 1.43s 2.09e 99%CPU
64 proc 0.75u 291.30s 4.92e 5927%CPU
64 proc prof:
 3242763 total                         0.5366
 1269413 _read_unlock_irq          19834.5781
  842042 do_no_page                  355.5921
  779373 cond_resched               3479.3438
  100667 ia64_pal_call_static        524.3073
   96469 _spin_lock                 1004.8854
   92857 default_idle                241.8151
   25572 filemap_nopage               15.6691
   11981 ia64_load_scratch_fpregs    187.2031
   11671 ia64_save_scratch_fpregs    182.3594
    2566 page_fault                    2.5867
It has slowed down by a factor of 2.5 when going from serial to 64-way,
and that is due to mapping->tree_lock. The serial run is even at the
disadvantage of reading from remote memory 62 times out of 64.
2.6.12-git4-lockless:
1 proc 0.66u 1.38s 2.04e 99%CPU
64 proc 0.68u 1.42s 0.12e 1686%CPU
64 proc prof:
   81934 total                         0.0136
   31108 ia64_pal_call_static        162.0208
   28394 default_idle                 73.9427
    3796 ia64_save_scratch_fpregs     59.3125
    3736 ia64_load_scratch_fpregs     58.3750
    2208 page_fault                    2.2258
    1380 unmap_vmas                    0.3292
    1298 __mod_page_state              8.1125
    1089 do_no_page                    0.4599
     830 find_get_page                 2.5938
     781 ia64_do_page_fault            0.2805
So we have increased performance exactly 17x when going from 1 to 64 way,
however if you look at the CPU utilisation figure and the elapsed time,
you'll see my test didn't provide enough work to keep all CPUs busy, and
for the amount of CPU time used, we appear to have perfect scalability.
In fact, it is slightly superlinear probably due to remote memory access
on the serial run.
I'll reply to this post with the series of commented patches which is
probably the best way to explain how it is done. They are against
2.6.12-git4 + some future iteration of the PageReserved patches. I
can provide the complete rollup privately on request.
Comments, flames, laughing me out of town, etc. are all very welcome.
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [rfc] lockless pagecache
2005-06-27 6:29 Nick Piggin
@ 2005-06-27 7:46 ` Andrew Morton
2005-06-27 8:02 ` Nick Piggin
` (2 more replies)
2005-06-29 10:49 ` Hirokazu Takahashi
1 sibling, 3 replies; 25+ messages in thread
From: Andrew Morton @ 2005-06-27 7:46 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, linux-mm
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> First I'll put up some numbers to get you interested - of a 64-way Altix
> with 64 processes each read-faulting in their own 512MB part of a 32GB
> file that is preloaded in pagecache (with the proper NUMA memory
> allocation).
I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
16-page faultahead.
* Re: [rfc] lockless pagecache
2005-06-27 7:46 ` Andrew Morton
@ 2005-06-27 8:02 ` Nick Piggin
2005-06-27 8:15 ` Andrew Morton
` (2 more replies)
2005-06-27 14:08 ` Martin J. Bligh
2005-06-27 17:49 ` Christoph Lameter
2 siblings, 3 replies; 25+ messages in thread
From: Nick Piggin @ 2005-06-27 8:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm
Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>First I'll put up some numbers to get you interested - of a 64-way Altix
>> with 64 processes each read-faulting in their own 512MB part of a 32GB
>> file that is preloaded in pagecache (with the proper NUMA memory
>> allocation).
>
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.
>
>
Definitely, for the microbenchmark I was testing with.
However I think for Oracle and others that use shared memory like
this, they are probably not doing linear access, so that would be a
net loss. I'm not completely sure (I don't have access to real loads
at the moment), but I would have thought those guys would have looked
into fault ahead if it were a possibility.
Also, the memory usage regression cases that fault ahead brings make it
a bit contentious.
I like that the lockless patch completely removes the problem at its
source and even makes the serial path lighter. The other thing is, the
speculative get_page may be useful for more code than just pagecache
lookups. But it is fairly tricky, I'll give you that.
Anyway it is obviously not something that can go in tomorrow. At the
very least the PageReserved patches need to go in first, and even they
will need a lot of testing out of tree.
Perhaps it can be discussed at KS and we can think about what to do with
it after that - that kind of time frame. No rush.
Oh yeah, and obviously it would be nice if it provided real improvements
on real workloads too ;)
--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
2005-06-27 8:02 ` Nick Piggin
@ 2005-06-27 8:15 ` Andrew Morton
2005-06-27 8:28 ` Nick Piggin
2005-06-27 8:56 ` Lincoln Dale
2005-06-27 13:17 ` Benjamin LaHaise
2 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2005-06-27 8:15 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, linux-mm
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Also, the memory usage regression cases that fault ahead brings makes it
> a bit contentious.
faultahead consumes no more memory: if the page is present then point a pte
at it. It'll make reclaim work a bit harder in some situations.
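The faultahead being described can be modelled roughly as follows - a hypothetical user-space sketch in which plain arrays stand in for the pagecache and the page tables, showing how one lock round trip covers several ptes without allocating any extra memory for pages that are not resident:

```c
#include <assert.h>
#include <stddef.h>

#define FAULTAHEAD 16	/* pages examined per fault, per Andrew's figure */
#define NR_PAGES   256

struct mapping {
	int present[NR_PAGES];	/* 1 if the "page" is already in pagecache */
	int pte[NR_PAGES];	/* 1 once a pte points at the page */
	int lock_taken;		/* counts tree_lock acquisitions */
};

/* Handle a fault at `index`: under a single tree_lock acquisition,
 * also map up to FAULTAHEAD neighbouring pages that are already
 * present.  Absent pages are simply skipped - no readahead, no new
 * memory consumed. */
static void faultahead(struct mapping *m, size_t index)
{
	m->lock_taken++;	/* one tree_lock round trip for the batch */
	for (size_t i = index; i < index + FAULTAHEAD && i < NR_PAGES; i++) {
		if (!m->present[i])
			continue;	/* not resident: leave it alone */
		m->pte[i] = 1;		/* just point a pte at it */
	}
}
```

Faulting 64 resident pages this way costs 4 lock acquisitions instead of 64 - the 5x-10x traffic reduction claimed above, at the price of extra pte setup/teardown work that reclaim and unmap then have to deal with.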
> I like that the lockless patch completely removes the problem at its
> source and even makes the serial path lighter. The other things is, the
> speculative get_page may be useful for more code than just pagecache
> lookups. But it is fairly tricky I'll give you that.
Yes, it's scary-looking stuff.
> Anyway it is obviously not something that can go in tomorrow. At the
> very least the PageReserved patches need to go in first, and even they
> will need a lot of testing out of tree.
>
> Perhaps it can be discussed at KS and we can think about what to do with
> it after that - that kind of time frame. No rush.
>
> Oh yeah, and obviously it would be nice if it provided real improvements
> on real workloads too ;)
umm, yes.
* Re: [rfc] lockless pagecache
2005-06-27 8:15 ` Andrew Morton
@ 2005-06-27 8:28 ` Nick Piggin
0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2005-06-27 8:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm
Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>Also, the memory usage regression cases that fault ahead brings makes it
>> a bit contentious.
>
>
> faultahead consumes no more memory: if the page is present then point a pte
> at it. It'll make reclaim work a bit harder in some situations.
>
Oh OK we'll call that faultahead and Christoph's thing prefault then.
I suspect it may still be a net loss for those that are running into
tree_lock contention, but we'll see.
--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
2005-06-27 8:02 ` Nick Piggin
2005-06-27 8:15 ` Andrew Morton
@ 2005-06-27 8:56 ` Lincoln Dale
2005-06-27 9:04 ` Nick Piggin
2005-06-27 13:17 ` Benjamin LaHaise
2 siblings, 1 reply; 25+ messages in thread
From: Lincoln Dale @ 2005-06-27 8:56 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm
Nick Piggin wrote:
[..]
> However I think for Oracle and others that use shared memory like
> this, they are probably not doing linear access, so that would be a
> net loss. I'm not completely sure (I don't have access to real loads
> at the moment), but I would have thought those guys would have looked
> into fault ahead if it were a possibility.
i thought those guys used O_DIRECT - in which case, wouldn't the page
cache not be used?
cheers,
lincoln.
* Re: [rfc] lockless pagecache
2005-06-27 8:56 ` Lincoln Dale
@ 2005-06-27 9:04 ` Nick Piggin
2005-06-27 18:14 ` Chen, Kenneth W
0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2005-06-27 9:04 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Andrew Morton, linux-kernel, linux-mm
Lincoln Dale wrote:
> Nick Piggin wrote:
> [..]
>
>> However I think for Oracle and others that use shared memory like
>> this, they are probably not doing linear access, so that would be a
>> net loss. I'm not completely sure (I don't have access to real loads
>> at the moment), but I would have thought those guys would have looked
>> into fault ahead if it were a possibility.
>
>
> i thought those guys used O_DIRECT - in which case, wouldn't the page
> cache not be used?
>
Well I think they do use O_DIRECT for their IO, but they need to
use the Linux pagecache for their shared memory - that shared
memory being the basis for their page cache. I think. Whatever
the setup I believe they have issues with the tree_lock, which is
why it was changed to an rwlock.
--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
2005-06-27 8:02 ` Nick Piggin
2005-06-27 8:15 ` Andrew Morton
2005-06-27 8:56 ` Lincoln Dale
@ 2005-06-27 13:17 ` Benjamin LaHaise
2005-06-28 0:32 ` Nick Piggin
2 siblings, 1 reply; 25+ messages in thread
From: Benjamin LaHaise @ 2005-06-27 13:17 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm
On Mon, Jun 27, 2005 at 06:02:15PM +1000, Nick Piggin wrote:
> However I think for Oracle and others that use shared memory like
> this, they are probably not doing linear access, so that would be a
> net loss. I'm not completely sure (I don't have access to real loads
> at the moment), but I would have thought those guys would have looked
> into fault ahead if it were a possibility.
Shared memory overhead doesn't show up on any of the database benchmarks
I've seen, as they tend to use huge pages that are locked in memory, and
thus don't tend to access the page cache at all after ramp up.
-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler
* Re: [rfc] lockless pagecache
2005-06-27 7:46 ` Andrew Morton
2005-06-27 8:02 ` Nick Piggin
@ 2005-06-27 14:08 ` Martin J. Bligh
2005-06-27 17:49 ` Christoph Lameter
2 siblings, 0 replies; 25+ messages in thread
From: Martin J. Bligh @ 2005-06-27 14:08 UTC (permalink / raw)
To: Andrew Morton, Nick Piggin; +Cc: linux-kernel, linux-mm
--Andrew Morton <akpm@osdl.org> wrote (on Monday, June 27, 2005 00:46:24 -0700):
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>
>> First I'll put up some numbers to get you interested - of a 64-way Altix
>> with 64 processes each read-faulting in their own 512MB part of a 32GB
>> file that is preloaded in pagecache (with the proper NUMA memory
>> allocation).
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.
Maybe true, but when we last tried that, faultahead sucked for performance
in a more general sense. All the extra setup and teardown cost for
unnecessary PTEs kills you, even if it's only 4 pages or so.
M.
* Re: [rfc] lockless pagecache
2005-06-27 7:46 ` Andrew Morton
2005-06-27 8:02 ` Nick Piggin
2005-06-27 14:08 ` Martin J. Bligh
@ 2005-06-27 17:49 ` Christoph Lameter
2 siblings, 0 replies; 25+ messages in thread
From: Christoph Lameter @ 2005-06-27 17:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, linux-mm
On Mon, 27 Jun 2005, Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > First I'll put up some numbers to get you interested - of a 64-way Altix
> > with 64 processes each read-faulting in their own 512MB part of a 32GB
> > file that is preloaded in pagecache (with the proper NUMA memory
> > allocation).
>
> I bet you can get a 5x to 10x reduction in ->tree_lock traffic by doing
> 16-page faultahead.
Could be worked into the prefault patch... Good idea.
* RE: [rfc] lockless pagecache
2005-06-27 9:04 ` Nick Piggin
@ 2005-06-27 18:14 ` Chen, Kenneth W
2005-06-27 18:50 ` Badari Pulavarty
0 siblings, 1 reply; 25+ messages in thread
From: Chen, Kenneth W @ 2005-06-27 18:14 UTC (permalink / raw)
To: 'Nick Piggin', Lincoln Dale; +Cc: Andrew Morton, linux-kernel, linux-mm
Nick Piggin wrote on Monday, June 27, 2005 2:04 AM
> >> However I think for Oracle and others that use shared memory like
> >> this, they are probably not doing linear access, so that would be a
> >> net loss. I'm not completely sure (I don't have access to real loads
> >> at the moment), but I would have thought those guys would have looked
> >> into fault ahead if it were a possibility.
> >
> >
> > i thought those guys used O_DIRECT - in which case, wouldn't the page
> > cache not be used?
> >
>
> Well I think they do use O_DIRECT for their IO, but they need to
> use the Linux pagecache for their shared memory - that shared
> memory being the basis for their page cache. I think. Whatever
> the setup I believe they have issues with the tree_lock, which is
> why it was changed to an rwlock.
Typically the shared memory is used as the db buffer cache, and O_DIRECT
is performed on that buffer cache (hence O_DIRECT on the shared memory).
You must be thinking of some other workload. Nevertheless, for OLTP-type
db workloads, tree_lock hasn't been a problem so far.
- Ken
* RE: [rfc] lockless pagecache
2005-06-27 18:14 ` Chen, Kenneth W
@ 2005-06-27 18:50 ` Badari Pulavarty
2005-06-27 19:05 ` Chen, Kenneth W
0 siblings, 1 reply; 25+ messages in thread
From: Badari Pulavarty @ 2005-06-27 18:50 UTC (permalink / raw)
To: Chen, Kenneth W
Cc: 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel,
linux-mm
On Mon, 2005-06-27 at 11:14 -0700, Chen, Kenneth W wrote:
> Nick Piggin wrote on Monday, June 27, 2005 2:04 AM
> > >> However I think for Oracle and others that use shared memory like
> > >> this, they are probably not doing linear access, so that would be a
> > >> net loss. I'm not completely sure (I don't have access to real loads
> > >> at the moment), but I would have thought those guys would have looked
> > >> into fault ahead if it were a possibility.
> > >
> > >
> > > i thought those guys used O_DIRECT - in which case, wouldn't the page
> > > cache not be used?
> > >
> >
> > Well I think they do use O_DIRECT for their IO, but they need to
> > use the Linux pagecache for their shared memory - that shared
> > memory being the basis for their page cache. I think. Whatever
> > the setup I believe they have issues with the tree_lock, which is
> > why it was changed to an rwlock.
>
> Typically shared memory is used as db buffer cache, and O_DIRECT is
> performed on these buffer cache (hence O_DIRECT on the shared memory).
> You must be thinking some other workload. Nevertheless, for OLTP type
> of db workload, tree_lock hasn't been a problem so far.
What about DSS ? I need to go back and verify some of the profiles
we have.
Thanks,
Badari
* RE: [rfc] lockless pagecache
2005-06-27 18:50 ` Badari Pulavarty
@ 2005-06-27 19:05 ` Chen, Kenneth W
2005-06-27 19:22 ` Christoph Lameter
0 siblings, 1 reply; 25+ messages in thread
From: Chen, Kenneth W @ 2005-06-27 19:05 UTC (permalink / raw)
To: 'Badari Pulavarty'
Cc: 'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel,
linux-mm
Badari Pulavarty wrote on Monday, June 27, 2005 11:51 AM
> On Mon, 2005-06-27 at 11:14 -0700, Chen, Kenneth W wrote:
> > Typically shared memory is used as db buffer cache, and O_DIRECT is
> > performed on these buffer cache (hence O_DIRECT on the shared memory).
> > You must be thinking some other workload. Nevertheless, for OLTP type
> > of db workload, tree_lock hasn't been a problem so far.
>
> What about DSS ? I need to go back and verify some of the profiles
> we have.
I don't recall seeing tree_lock to be a problem for DSS workload either.
* RE: [rfc] lockless pagecache
2005-06-27 19:05 ` Chen, Kenneth W
@ 2005-06-27 19:22 ` Christoph Lameter
2005-06-27 19:42 ` Chen, Kenneth W
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2005-06-27 19:22 UTC (permalink / raw)
To: Chen, Kenneth W
Cc: 'Badari Pulavarty', 'Nick Piggin', Lincoln Dale,
Andrew Morton, linux-kernel, linux-mm
On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> I don't recall seeing tree_lock to be a problem for DSS workload either.
I have seen the tree_lock being a problem a number of times with large
scale NUMA type workloads.
* RE: [rfc] lockless pagecache
2005-06-27 19:22 ` Christoph Lameter
@ 2005-06-27 19:42 ` Chen, Kenneth W
2005-07-05 15:11 ` Sonny Rao
0 siblings, 1 reply; 25+ messages in thread
From: Chen, Kenneth W @ 2005-06-27 19:42 UTC (permalink / raw)
To: 'Christoph Lameter'
Cc: 'Badari Pulavarty', 'Nick Piggin', Lincoln Dale,
Andrew Morton, linux-kernel, linux-mm
Christoph Lameter wrote on Monday, June 27, 2005 12:23 PM
> On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> > I don't recall seeing tree_lock to be a problem for DSS workload either.
>
> I have seen the tree_lock being a problem a number of times with large
> scale NUMA type workloads.
I totally agree! My earlier posts are strictly referring to industry
standard db workloads (OLTP, DSS). I'm not saying it's not a problem
for everyone :-) Obviously you just outlined a few ....
- Ken
* Re: [rfc] lockless pagecache
2005-06-27 13:17 ` Benjamin LaHaise
@ 2005-06-28 0:32 ` Nick Piggin
2005-06-28 1:26 ` William Lee Irwin III
0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2005-06-28 0:32 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Andrew Morton, linux-kernel, linux-mm
Benjamin LaHaise wrote:
> On Mon, Jun 27, 2005 at 06:02:15PM +1000, Nick Piggin wrote:
>
>>However I think for Oracle and others that use shared memory like
>>this, they are probably not doing linear access, so that would be a
>>net loss. I'm not completely sure (I don't have access to real loads
>>at the moment), but I would have thought those guys would have looked
>>into fault ahead if it were a possibility.
>
>
> Shared memory overhead doesn't show up on any of the database benchmarks
> I've seen, as they tend to use huge pages that are locked in memory, and
> thus don't tend to access the page cache at all after ramp up.
>
To be quite honest, I don't have any real workloads here that stress
it; however, I was told that it is a problem for the Oracle database. If
there is anyone else who has problems then I'd be interested to hear
about them as well.
--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
2005-06-28 0:32 ` Nick Piggin
@ 2005-06-28 1:26 ` William Lee Irwin III
0 siblings, 0 replies; 25+ messages in thread
From: William Lee Irwin III @ 2005-06-28 1:26 UTC (permalink / raw)
To: Nick Piggin; +Cc: Benjamin LaHaise, Andrew Morton, linux-kernel, linux-mm
Benjamin LaHaise wrote:
>> Shared memory overhead doesn't show up on any of the database benchmarks
>> I've seen, as they tend to use huge pages that are locked in memory, and
>> thus don't tend to access the page cache at all after ramp up.
On Tue, Jun 28, 2005 at 10:32:51AM +1000, Nick Piggin wrote:
> To be quite honest I don't have any real workloads here that stress
> it, however I was told that it is a problem for oracle database. If
> there is anyone else who has problems then I'd be interested to hear
> them as well.
It's vlm-specific.
-- wli
* RE: [rfc] lockless pagecache
@ 2005-06-28 11:56 David Kearster
2005-06-28 12:20 ` Nick Piggin
0 siblings, 1 reply; 25+ messages in thread
From: David Kearster @ 2005-06-28 11:56 UTC (permalink / raw)
To: nickpiggin, linux-kernel
Hi Nick,
The patches that you posted on lkml regarding VFS scalability - on
which kernel did you build them?
I tried applying them on 2.6.12-git4, 2.6.12-mm1, mm2, and 2.6.12.1, but
to no avail.
Many hunks fail in each of the patches; I guess when the first one
fails, the others are bound to as well.
Also, I'm trying to apply the patches in the following order:
[patch 1] mm: PG_free flag
[patch 2] mm: speculative get_page
[patch 3] radix tree: lookup_slot
[patch 4] radix tree: lockless readside
[patch 5] mm: lockless pagecache lookups
[patch 6] mm: spinlock tree_lock
Is this sequence correct?
Thanks,
Dave
* Re: [rfc] lockless pagecache
2005-06-28 11:56 [rfc] lockless pagecache David Kearster
@ 2005-06-28 12:20 ` Nick Piggin
0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2005-06-28 12:20 UTC (permalink / raw)
To: David Kearster; +Cc: linux-kernel
David Kearster wrote:
> Hi nick,
>
> The patches that you posted on lkml regarding VFS scalability - on
> which kernel did you build them?
> I tried applying them on 2.6.12-git4, 2.6.12-mm1, mm2, and 2.6.12.1, but
> to no avail.
>
Hi David,
They are against 2.6.12-git4 plus a later revision of the PageReserved
removal patchset I posted to linux-mm earlier, which is needed to make
page refcounting consistent.
I have a couple of updates and fixes for both sets of patches, so I
can send you a rollup of the current patches against a current -git
kernel privately if you would like.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
2005-06-27 6:29 Nick Piggin
2005-06-27 7:46 ` Andrew Morton
@ 2005-06-29 10:49 ` Hirokazu Takahashi
2005-06-29 11:38 ` Nick Piggin
1 sibling, 1 reply; 25+ messages in thread
From: Hirokazu Takahashi @ 2005-06-29 10:49 UTC (permalink / raw)
To: nickpiggin; +Cc: linux-kernel, linux-mm
Hi Nick,
Your patches improve the performance if lots of processes are
accessing the same file at the same time, right?
If so, I think we could introduce multiple radix-trees instead:
enhance each inode to be able to hold two or more radix-trees,
to avoid contention when traversing them.
Some decision mechanism would be needed for which radix-tree each
page should go in, and for how many radix-trees to prepare.
It seems to be simple and effective.
What do you think?
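The multi-tree idea above could be sketched like this - a toy model, with the tree count and split function purely illustrative assumptions - where each lookup takes only one of several per-inode locks, so concurrent lookups at different offsets usually contend on different locks:

```c
#include <assert.h>
#include <stddef.h>

#define NR_TREES 4	/* hypothetical per-inode tree count */

struct inode_trees {
	int lock_acquisitions[NR_TREES];  /* stand-in for per-tree tree_locks */
};

/* Choose which tree a page index lives in.  Using the low bits spreads
 * neighbouring pages across all trees; using high bits would instead
 * give each tree a contiguous chunk of the file. */
static size_t tree_for_index(size_t index)
{
	return index % NR_TREES;
}

/* A lookup takes only the chosen tree's lock. */
static void lookup(struct inode_trees *t, size_t index)
{
	t->lock_acquisitions[tree_for_index(index)]++;
	/* ... then e.g. radix_tree_lookup(&t->tree[...], index / NR_TREES) ... */
}
```

Eight lookups at consecutive indices land two on each of the four locks instead of eight on one - which is the claimed benefit, at the cost of extra cache footprint and lock traffic for gang lookups that span trees.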
> Now the tree_lock was recently(ish) converted to an rwlock, precisely
> for such a workload and that was apparently very successful. However
> an rwlock is significantly heavier, and as machines get faster and
> bigger, rwlocks (and any locks) will tend to use more and more of Paul
> McKenney's toilet paper due to cacheline bouncing.
>
> So in the interest of saving some trees, let's try it without any locks.
>
> First I'll put up some numbers to get you interested - of a 64-way Altix
> with 64 processes each read-faulting in their own 512MB part of a 32GB
> file that is preloaded in pagecache (with the proper NUMA memory
> allocation).
Thanks,
Hirokazu Takahashi.
* Re: [rfc] lockless pagecache
2005-06-29 10:49 ` Hirokazu Takahashi
@ 2005-06-29 11:38 ` Nick Piggin
2005-06-30 3:32 ` Hirokazu Takahashi
0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2005-06-29 11:38 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: linux-kernel, linux-mm
Hirokazu Takahashi wrote:
> Hi Nick,
>
Hi,
> Your patches improve the performance if lots of processes are
> accessing the same file at the same time, right?
>
Yes.
> If so, I think we can introduce multiple radix-trees instead,
> which enhance each inode to be able to have two or more radix-trees
> in it to avoid the race condition traversing the trees.
> Some decision mechanism is needed which radix-tree each page
> should be in, how many radix-tree should be prepared.
>
> It seems to be simple and effective.
>
> What do you think?
>
Sure it is a possibility.
I don't think you could call it effective like a completely
lockless version is effective. You might take more locks during
gang lookups, you may have a lot of ugly and not-always-working
heuristics (hey, my app goes really fast if it spreads accesses
over a 1GB file, but falls on its face with a 10MB one). You
might get increased cache footprints for common operations.
I mainly did the patches for a bit of fun rather than to address
a particular problem with a real workload and as such I won't be
pushing to get them in the kernel for the time being.
--
SUSE Labs, Novell Inc.
* Re: [rfc] lockless pagecache
2005-06-29 11:38 ` Nick Piggin
@ 2005-06-30 3:32 ` Hirokazu Takahashi
0 siblings, 0 replies; 25+ messages in thread
From: Hirokazu Takahashi @ 2005-06-30 3:32 UTC (permalink / raw)
To: nickpiggin; +Cc: linux-kernel, linux-mm
Hi,
> > Your patches improve the performance if lots of processes are
> > accessing the same file at the same time, right?
> >
>
> Yes.
>
> > If so, I think we can introduce multiple radix-trees instead,
> > which enhance each inode to be able to have two or more radix-trees
> > in it to avoid the race condition traversing the trees.
> > Some decision mechanism is needed which radix-tree each page
> > should be in, how many radix-tree should be prepared.
> >
> > It seems to be simple and effective.
> >
> > What do you think?
> >
>
> Sure it is a possibility.
>
> I don't think you could call it effective like a completely
> lockless version is effective. You might take more locks during
> gang lookups, you may have a lot of ugly and not-always-working
> heuristics (hey, my app goes really fast if it spreads accesses
> over a 1GB file, but falls on its face with a 10MB one). You
> might get increased cache footprints for common operations.
I guess in most practical cases it would be enough simply to split
a huge file into equal-sized pieces and put each of them in its
associated radix-tree.
And I also feel your approach is interesting.
> I mainly did the patches for a bit of fun rather than to address
> a particular problem with a real workload and as such I won't be
> pushing to get them in the kernel for the time being.
I see.
Let me propose another idea, if you don't mind: a seqlock seems like
it would make your code much simpler, though I'm not sure whether it
would work well under heavy load. The code would become stable without
the tricks, which make the VM hard to enhance in the future.
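For reference, the read side of the seqlock idea can be sketched in C11 atomics - a minimal single-writer toy, not the kernel's seqlock_t, which additionally pairs the counter with explicit memory barriers and a writer-side spinlock:

```c
#include <assert.h>
#include <stdatomic.h>

struct seq_slot {
	atomic_uint seq;	/* odd while a write is in progress */
	long value;		/* stand-in for the protected tree state */
};

/* Writer: bump the sequence to odd before modifying, to even after. */
static void write_slot(struct seq_slot *s, long v)
{
	atomic_fetch_add(&s->seq, 1);	/* odd: write in progress */
	s->value = v;
	atomic_fetch_add(&s->seq, 1);	/* even again: write complete */
}

/* Reader: take no lock; instead retry if a writer was active or the
 * sequence changed underneath us.  Readers never block writers. */
static long read_slot(struct seq_slot *s)
{
	unsigned start;
	long v;

	do {
		do {
			start = atomic_load(&s->seq);
		} while (start & 1);	/* writer active: spin */
		v = s->value;
	} while (atomic_load(&s->seq) != start);	/* changed: retry */
	return v;
}
```

The worry about heavy load is visible in the structure: readers make no forward progress while writers are frequent, whereas the speculative-refcount scheme only retries on an actual collision with removal.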
Thanks,
Hirokazu Takahashi.
* Re: [rfc] lockless pagecache
2005-06-27 19:42 ` Chen, Kenneth W
@ 2005-07-05 15:11 ` Sonny Rao
2005-07-05 15:31 ` Martin J. Bligh
0 siblings, 1 reply; 25+ messages in thread
From: Sonny Rao @ 2005-07-05 15:11 UTC (permalink / raw)
To: Chen, Kenneth W
Cc: 'Christoph Lameter', 'Badari Pulavarty',
'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel,
linux-mm
On Mon, Jun 27, 2005 at 12:42:44PM -0700, Chen, Kenneth W wrote:
> Christoph Lameter wrote on Monday, June 27, 2005 12:23 PM
> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
> >
> > I have seen the tree_lock being a problem a number of times with large
> > scale NUMA type workloads.
>
> I totally agree! My earlier posts are strictly referring to industry
> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
> for everyone :-) Obviously you just outlined a few ....
I'm a bit late to the party here (was gone on vacation), but I do have
profiles from DSS workloads using page-cache rather than O_DIRECT and
I do see spin_lock_irq() in the profiles which I'm pretty certain are
locks spinning for access to the radix_tree. I'll talk about it a bit
more up in Ottawa, but here's the top 5 on my profile (sorry, I don't
have the number of ticks at the moment):
1. dedicated_idle (waiting for I/O)
2. __copy_tofrom_user
3. radix_tree_delete
4. _spin_lock_irq
5. __find_get_block
So, yes, if the page-cache is used in a DSS workload then one will see
the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
NUMA factor.
Sonny
* Re: [rfc] lockless pagecache
2005-07-05 15:11 ` Sonny Rao
@ 2005-07-05 15:31 ` Martin J. Bligh
2005-07-05 15:37 ` Sonny Rao
0 siblings, 1 reply; 25+ messages in thread
From: Martin J. Bligh @ 2005-07-05 15:31 UTC (permalink / raw)
To: Sonny Rao, Chen, Kenneth W
Cc: 'Christoph Lameter', 'Badari Pulavarty',
'Nick Piggin', Lincoln Dale, Andrew Morton, linux-kernel,
linux-mm
>> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
>> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
>> >
>> > I have seen the tree_lock being a problem a number of times with large
>> > scale NUMA type workloads.
>>
>> I totally agree! My earlier posts are strictly referring to industry
>> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
>> for everyone :-) Obviously you just outlined a few ....
>
> I'm a bit late to the party here (was gone on vacation), but I do have
> profiles from DSS workloads using page-cache rather than O_DIRECT and
> I do see spin_lock_irq() in the profiles which I'm pretty certain are
> locks spinning for access to the radix_tree. I'll talk about it a bit
> more up in Ottawa but here's the top 5 on my profile (sorry don't have
> the number of ticks at the momement):
>
> 1. dedicated_idle (waiting for I/O)
> 2. __copy_tofrom_user
> 3. radix_tree_delete
> 4. _spin_lock_irq
> 5. __find_get_block
>
> So, yes, if the page-cache is used in a DSS workload then one will see
> the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
> NUMA factor.
The easiest way to confirm the spin-lock thing is to recompile with
CONFIG_SPINLINE, and take a new profile, then diff the two ...
M.
* Re: [rfc] lockless pagecache
2005-07-05 15:31 ` Martin J. Bligh
@ 2005-07-05 15:37 ` Sonny Rao
0 siblings, 0 replies; 25+ messages in thread
From: Sonny Rao @ 2005-07-05 15:37 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Chen, Kenneth W, 'Christoph Lameter',
'Badari Pulavarty', 'Nick Piggin', Lincoln Dale,
Andrew Morton, linux-kernel, linux-mm
On Tue, Jul 05, 2005 at 08:31:40AM -0700, Martin J. Bligh wrote:
> >> > On Mon, 27 Jun 2005, Chen, Kenneth W wrote:
> >> > > I don't recall seeing tree_lock to be a problem for DSS workload either.
> >> >
> >> > I have seen the tree_lock being a problem a number of times with large
> >> > scale NUMA type workloads.
> >>
> >> I totally agree! My earlier posts are strictly referring to industry
> >> standard db workloads (OLTP, DSS). I'm not saying it's not a problem
> >> for everyone :-) Obviously you just outlined a few ....
> >
> > I'm a bit late to the party here (was gone on vacation), but I do have
> > profiles from DSS workloads using page-cache rather than O_DIRECT and
> > I do see spin_lock_irq() in the profiles which I'm pretty certain are
> > locks spinning for access to the radix_tree. I'll talk about it a bit
> > more up in Ottawa but here's the top 5 on my profile (sorry don't have
> > the number of ticks at the momement):
> >
> > 1. dedicated_idle (waiting for I/O)
> > 2. __copy_tofrom_user
> > 3. radix_tree_delete
> > 4. _spin_lock_irq
> > 5. __find_get_block
> >
> > So, yes, if the page-cache is used in a DSS workload then one will see
> > the tree-lock. BTW, this was on a PPC64 machine w/ a fairly small
> > NUMA factor.
>
> The easiest way to confirm the spin-lock thing is to recompile with
> CONFIG_SPINLINE, and take a new profile, then diff the two ...
Yep...
Unfortunately, this has been broken on PPC64 since 2.6.9-rc2 or
something like that; I never had a chance to track down exactly what
the issue was. IIRC, there was a lot of churn in the spinlock code
around that time.
Sonny