public inbox for linux-kernel@vger.kernel.org
* vmscan.c heuristic adjustment for smaller systems
@ 2004-04-17 19:38 William Lee Irwin III
  2004-04-17 21:29 ` Marc Singer
  0 siblings, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2004-04-17 19:38 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, elf

Marc Singer reported an issue where an embedded ARM system performed
poorly because page replacement could prematurely evict mapped memory
even though very little mapped pagecache was in use to begin with.

The following patch attempts to address the issue by using the
_maximum_ of vm_swappiness and distress to add to the mapped ratio, so
that distress doesn't contribute to swap_tendency until it exceeds
vm_swappiness, and afterward the effect is not cumulative.

The intended effect is that swap_tendency should vary in a more jagged
way, and not be elevated by distress above vm_swappiness until distress
itself exceeds vm_swappiness. For instance, since distress is 100 >>
zone->prev_priority, once distress reaches 100 no distinction is made
between a vm_swappiness of 50 and a vm_swappiness of 90 given the same
mapped_ratio.

Marc Singer has results showing this is an improvement and can
hopefully clarify as needed. Help determining whether this policy
change is an improvement for a broader variety of systems would be
appreciated.


-- wli


Index: singer-2.6.5-mm6/mm/vmscan.c
===================================================================
--- singer-2.6.5-mm6.orig/mm/vmscan.c	2004-04-14 23:21:19.000000000 -0700
+++ singer-2.6.5-mm6/mm/vmscan.c	2004-04-17 11:09:35.000000000 -0700
@@ -636,7 +636,7 @@
 	 *
 	 * A 100% value of vm_swappiness overrides this algorithm altogether.
 	 */
-	swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
+	swap_tendency = mapped_ratio / 2 + max(distress, vm_swappiness);
 
 	/*
 	 * Now use this metric to decide whether to start moving mapped memory

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 19:38 vmscan.c heuristic adjustment for smaller systems William Lee Irwin III
@ 2004-04-17 21:29 ` Marc Singer
  2004-04-17 21:33   ` William Lee Irwin III
  2004-04-17 23:21   ` Andrew Morton
  0 siblings, 2 replies; 27+ messages in thread
From: Marc Singer @ 2004-04-17 21:29 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, akpm, elf

On Sat, Apr 17, 2004 at 12:38:55PM -0700, William Lee Irwin III wrote:
> Marc Singer reported an issue where an embedded ARM system performed
> poorly because page replacement could prematurely evict mapped memory
> even though very little mapped pagecache was in use to begin with.
> 
> Marc Singer has results showing this is an improvement and can
> hopefully clarify as needed. Help determining whether this policy
> change is an improvement for a broader variety of systems would be
> appreciated.

I have some numbers to clarify the 'improvement'.

Setup:
  ARM922 CPU, 200MHz, 32MiB RAM
  NFS mounted rootfs, tcp, hard, v3, 4K blocks
  Test application copies 41MiB file and prints the elapsed time

The two scenarios differ only in the setting of /proc/sys/vm/swappiness.

				 swappiness
			60 (default)		0
			------------		--------
elapsed time(s)		52.48			52.9
			53.13			52.91
			53.13			52.87
			52.53			53.03
			52.35			53.02
			
mean			52.72			52.94

I'd say that there is no statistically significant difference between
these sets of times.  However, after I've run the test program, I run
the command "ls -l /proc"

				 swappiness
			60 (default)		0
			------------		--------
elapsed time(s)		18			1
			30			1
			33			1
			
This is the problem.  Once RAM fills with IO buffers, the kernel's
tendency to evict mapped pages ruins interactive performance.


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 21:29 ` Marc Singer
@ 2004-04-17 21:33   ` William Lee Irwin III
  2004-04-17 21:52     ` Marc Singer
  2004-04-17 23:21   ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2004-04-17 21:33 UTC (permalink / raw)
  To: Marc Singer; +Cc: linux-kernel, akpm

On Sat, Apr 17, 2004 at 12:38:55PM -0700, William Lee Irwin III wrote:
>> Marc Singer reported an issue where an embedded ARM system performed
>> poorly because page replacement could prematurely evict mapped memory
>> even though very little mapped pagecache was in use to begin with.
>> Marc Singer has results showing this is an improvement and can
>> hopefully clarify as needed. Help determining whether this policy
>> change is an improvement for a broader variety of systems would be
>> appreciated.

On Sat, Apr 17, 2004 at 02:29:58PM -0700, Marc Singer wrote:
> I have some numbers to clarify the 'improvement'.
> Setup:
>   ARM922 CPU, 200MHz, 32MiB RAM
>   NFS mounted rootfs, tcp, hard, v3, 4K blocks
>   Test application copies 41MiB file and prints the elapsed time
> The two scenarios differ only in the setting of /proc/sys/vm/swappiness.

This doesn't match your first response. Anyway, this one gets
scrapped. I guess if swappiness solves it, then so much the better.


-- wli


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 21:33   ` William Lee Irwin III
@ 2004-04-17 21:52     ` Marc Singer
  2004-04-18  1:06       ` William Lee Irwin III
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Singer @ 2004-04-17 21:52 UTC (permalink / raw)
  To: William Lee Irwin III, Marc Singer, linux-kernel, akpm

On Sat, Apr 17, 2004 at 02:33:33PM -0700, William Lee Irwin III wrote:
> On Sat, Apr 17, 2004 at 12:38:55PM -0700, William Lee Irwin III wrote:
> >> Marc Singer reported an issue where an embedded ARM system performed
> >> poorly because page replacement could prematurely evict mapped memory
> >> even though very little mapped pagecache was in use to begin with.
> >> Marc Singer has results showing this is an improvement and can
> >> hopefully clarify as needed. Help determining whether this policy
> >> change is an improvement for a broader variety of systems would be
> >> appreciated.
> 
> On Sat, Apr 17, 2004 at 02:29:58PM -0700, Marc Singer wrote:
> > I have some numbers to clarify the 'improvement'.
> > Setup:
> >   ARM922 CPU, 200MHz, 32MiB RAM
> >   NFS mounted rootfs, tcp, hard, v3, 4K blocks
> >   Test application copies 41MiB file and prints the elapsed time
> > The two scenarios differ only in the setting of /proc/sys/vm/swappiness.
> 
> This doesn't match your first response. Anyway, this one gets
> scrapped. I guess if swappiness solves it, then so much the better.

Huh?  Where do you see a discrepancy?  I don't think I claimed that
the test program performance changed.  The noticeable difference is in
interactivity once the page cache fills.  IMHO, 30 seconds to do a
file listing on /proc is extreme.




* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 21:29 ` Marc Singer
  2004-04-17 21:33   ` William Lee Irwin III
@ 2004-04-17 23:21   ` Andrew Morton
  2004-04-17 23:30     ` Marc Singer
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-04-17 23:21 UTC (permalink / raw)
  To: Marc Singer; +Cc: wli, linux-kernel, elf

Marc Singer <elf@buici.com> wrote:
>
>  I'd say that there is no statistically significant difference between
>  these sets of times.  However, after I've run the test program, I run
>  the command "ls -l /proc"
> 
>  				 swappiness
>  			60 (default)		0
>  			------------		--------
>  elapsed time(s)		18			1
>  			30			1
>  			33			1

How on earth can it take half a minute to list /proc?

>  This is the problem.  Once RAM fills with IO buffers, the kernel's
>  tendency to evict mapped pages ruins interactive performance.

Is everything here on NFS, or are local filesystems involved?  (What does
"mount" say?)


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 23:21   ` Andrew Morton
@ 2004-04-17 23:30     ` Marc Singer
  2004-04-17 23:51       ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Singer @ 2004-04-17 23:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marc Singer, wli, linux-kernel

On Sat, Apr 17, 2004 at 04:21:25PM -0700, Andrew Morton wrote:
> Marc Singer <elf@buici.com> wrote:
> >
> >  I'd say that there is no statistically significant difference between
> >  these sets of times.  However, after I've run the test program, I run
> >  the command "ls -l /proc"
> > 
> >  				 swappiness
> >  			60 (default)		0
> >  			------------		--------
> >  elapsed time(s)		18			1
> >  			30			1
> >  			33			1
> 
> How on earth can it take half a minute to list /proc?

I've watched the vmscan code at work.  The memory pressure is so high
that it reclaims mapped pages zealously.  The program's code pages are
being evicted frequently.

I would like to show a video of the ls -l /proc command.  It's
remarkable.  The program pauses after displaying each line.

> >  This is the problem.  Once RAM fills with IO buffers, the kernel's
> >  tendency to evict mapped pages ruins interactive performance.
> 
> Is everything here on NFS, or are local filesystems involved?  (What does
> "mount" say?)

    # mount
    rootfs on / type rootfs (rw)
    /dev/root on / type nfs (rw,v2,rsize=4096,wsize=4096,hard,udp,nolock,addr=192.168.8.1)
    proc on /proc type proc (rw)
    devpts on /dev/pts type devpts (rw)

I've been wondering if the swappiness isn't a red herring.  Is it
reasonable that the distress value (in refill_inactive_zones ()) be
50?



* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 23:30     ` Marc Singer
@ 2004-04-17 23:51       ` Andrew Morton
  2004-04-18  0:11         ` Trond Myklebust
                           ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Andrew Morton @ 2004-04-17 23:51 UTC (permalink / raw)
  To: Marc Singer; +Cc: elf, wli, linux-kernel

Marc Singer <elf@buici.com> wrote:
>
> On Sat, Apr 17, 2004 at 04:21:25PM -0700, Andrew Morton wrote:
> > Marc Singer <elf@buici.com> wrote:
> > >
> > >  I'd say that there is no statistically significant difference between
> > >  these sets of times.  However, after I've run the test program, I run
> > >  the command "ls -l /proc"
> > > 
> > >  				 swappiness
> > >  			60 (default)		0
> > >  			------------		--------
> > >  elapsed time(s)		18			1
> > >  			30			1
> > >  			33			1
> > 
> > How on earth can it take half a minute to list /proc?
> 
> I've watched the vmscan code at work.  The memory pressure is so high
> that it reclaims mapped pages zealously.  The program's code pages are
> being evicted frequently.

Which tends to imply that the VM is not reclaiming any of that nfs-backed
pagecache.

> I've been wondering if the swappiness isn't a red herring.  Is it
> reasonable that the distress value (in refill_inactive_zones ()) be
> 50?

I'd assume that setting swappiness to zero simply means that you still have
all of your libc in pagecache when running ls.

What happens if you do the big file copy, then run `sync', then do the ls?

Have you experimented with the NFS mount options?  v2? UDP?


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 23:51       ` Andrew Morton
@ 2004-04-18  0:11         ` Trond Myklebust
  2004-04-18  0:23         ` Marc Singer
  2004-04-18  1:59         ` William Lee Irwin III
  2 siblings, 0 replies; 27+ messages in thread
From: Trond Myklebust @ 2004-04-18  0:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marc Singer, wli, linux-kernel

On Sat, 2004-04-17 at 16:51, Andrew Morton wrote:

> What happens if you do the big file copy, then run `sync', then do the ls?

You shouldn't ever need to do "sync" with NFS unless you are using
mmap(). close() will suffice to flush out all dirty pages in the case of
ordinary file writes.

Cheers,
  Trond


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 23:51       ` Andrew Morton
  2004-04-18  0:11         ` Trond Myklebust
@ 2004-04-18  0:23         ` Marc Singer
  2004-04-18  3:37           ` Nick Piggin
  2004-04-18  9:29           ` Russell King
  2004-04-18  1:59         ` William Lee Irwin III
  2 siblings, 2 replies; 27+ messages in thread
From: Marc Singer @ 2004-04-18  0:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marc Singer, wli, linux-kernel

On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> Marc Singer <elf@buici.com> wrote:
> >
> > On Sat, Apr 17, 2004 at 04:21:25PM -0700, Andrew Morton wrote:
> > > Marc Singer <elf@buici.com> wrote:
> > > >
> > > >  I'd say that there is no statistically significant difference between
> > > >  these sets of times.  However, after I've run the test program, I run
> > > >  the command "ls -l /proc"
> > > > 
> > > >  				 swappiness
> > > >  			60 (default)		0
> > > >  			------------		--------
> > > >  elapsed time(s)		18			1
> > > >  			30			1
> > > >  			33			1
> > > 
> > > How on earth can it take half a minute to list /proc?
> > 
> > I've watched the vmscan code at work.  The memory pressure is so high
> > that it reclaims mapped pages zealously.  The program's code pages are
> > being evicted frequently.
> 
> Which tends to imply that the VM is not reclaiming any of that nfs-backed
> pagecache.

I don't think that's the whole story.  The question is why.

> > I've been wondering if the swappiness isn't a red herring.  Is it
> > reasonable that the distress value (in refill_inactive_zones ()) be
> > 50?
> 
> I'd assume that setting swappiness to zero simply means that you still have
> all of your libc in pagecache when running ls.

Perhaps.  I think it is more important that it is still mapped.

> 
> What happens if you do the big file copy, then run `sync', then do the ls?

It still takes a long time.  I'm watching the network load as I
perform the ls.  There's almost 20 seconds of no screen activity while
NFS reloads the code. 

> 
> Have you experimented with the NFS mount options?  v2? UDP?

Doesn't seem to matter.  I've used v2, v3, UDP and TCP.

I have more data.

All of these tests are performed at the console, one command at a
time.  I have a telnet daemon available, so I open a second connection
to the target system.  I run a continuous loop of file copies on the
console and I execute 'ls -l /proc' in the telnet window.  It's a
little slow, but it isn't unreasonable.  Hmm.  I then run the copy
command in the telnet window followed by the 'ls -l /proc'.  It works
fine.  I log out of the console session and perform the telnet window
test again.  The 'ls -l /proc' takes 30 seconds.

When there is more than one process running, everything is peachy.
When there is only one process (no context switching) I see the slow
performance.  I had a hypothesis, but my test of that hypothesis
failed.


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 21:52     ` Marc Singer
@ 2004-04-18  1:06       ` William Lee Irwin III
  2004-04-18  5:05         ` Marc Singer
  0 siblings, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2004-04-18  1:06 UTC (permalink / raw)
  To: Marc Singer; +Cc: linux-kernel, akpm

On Sat, Apr 17, 2004 at 02:33:33PM -0700, William Lee Irwin III wrote:
>> This doesn't match your first response. Anyway, this one gets
>> scrapped. I guess if swappiness solves it, then so much the better.

On Sat, Apr 17, 2004 at 02:52:57PM -0700, Marc Singer wrote:
> Huh?  Where do you see a discrepancy?  I don't think I claimed that
> the test program performance changed.  The noticeable difference is in
> interactivity once the page cache fills.  IMHO, 30 seconds to do a
> file listing on /proc is extreme.

Oh, sorry, it was unclear to me whether the test changed anything but
swappiness (i.e. I couldn't tell whether they included the patch, etc.).


-- wli


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-17 23:51       ` Andrew Morton
  2004-04-18  0:11         ` Trond Myklebust
  2004-04-18  0:23         ` Marc Singer
@ 2004-04-18  1:59         ` William Lee Irwin III
  2004-04-18  3:53           ` Andrew Morton
  2 siblings, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2004-04-18  1:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marc Singer, linux-kernel

Marc Singer <elf@buici.com> wrote:
>> I've watched the vmscan code at work.  The memory pressure is so high
>> that it reclaims mapped pages zealously.  The program's code pages are
>> being evicted frequently.

On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> Which tends to imply that the VM is not reclaiming any of that nfs-backed
> pagecache.

The observation that prompted the max() vs. addition was:

	On Sat, Apr 17, 2004 at 10:57:24AM -0700, Marc Singer wrote:
	> I don't think that's the whole story.  I printed distress,
	> mapped_ratio, and swappiness when vmscan starts trying to reclaim
	> mapped pages.
	> reclaim_mapped: distress 50  mapped_ratio 0  swappiness 60
	>   50 + 60 > 100
	> So, part of the problem is swappiness.  I could set that value to 25,
	> for example, to stop the machine from swapping.
	> I'd be fine stopping here, except for you comment about what
	> swappiness means.  In my case, nearly none of memory is mapped.  It is
	> zone priority which has dropped to 1 that is precipitating the
	> eviction.  Is this what you expect and want?


Marc Singer <elf@buici.com> wrote:
>> I've been wondering if the swappiness isn't a red herring.  Is it
>> reasonable that the distress value (in refill_inactive_zones ()) be
>> 50?

On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> I'd assume that setting swappiness to zero simply means that you still have
> all of your libc in pagecache when running ls.
> What happens if you do the big file copy, then run `sync', then do the ls?
> Have you experimented with the NFS mount options?  v2? UDP?

I wonder if the ptep_test_and_clear_young() TLB flushing is related.


-- wli


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  0:23         ` Marc Singer
@ 2004-04-18  3:37           ` Nick Piggin
  2004-04-18  4:17             ` William Lee Irwin III
  2004-04-18  9:29           ` Russell King
  1 sibling, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2004-04-18  3:37 UTC (permalink / raw)
  To: Marc Singer; +Cc: Andrew Morton, wli, linux-kernel

Marc Singer wrote:
> On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> 
>>Marc Singer <elf@buici.com> wrote:
>>
>>>On Sat, Apr 17, 2004 at 04:21:25PM -0700, Andrew Morton wrote:
>>>
>>>>
>>>>How on earth can it take half a minute to list /proc?
>>>
>>>I've watched the vmscan code at work.  The memory pressure is so high
>>>that it reclaims mapped pages zealously.  The program's code pages are
>>>being evicted frequently.
>>
>>Which tends to imply that the VM is not reclaiming any of that nfs-backed
>>pagecache.
> 
> 
> I don't think that's the whole story.  The question is why.
> 

swappiness is pretty arbitrary and unfortunately it means
different things to machines with different sized memory.

Also, once you *have* gone past the reclaim_mapped threshold,
mapped pages aren't really given any preference above
unmapped pages.

I have a small patchset which splits the active list roughly
into mapped and unmapped pages. It might hopefully solve your
problem. Would you give it a try? It is pretty stable here.

Nick


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  1:59         ` William Lee Irwin III
@ 2004-04-18  3:53           ` Andrew Morton
  2004-04-18  5:38             ` Marc Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-04-18  3:53 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: elf, linux-kernel

William Lee Irwin III <wli@holomorphy.com> wrote:
>
>  On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
>  > I'd assume that setting swappiness to zero simply means that you still have
>  > all of your libc in pagecache when running ls.
>  > What happens if you do the big file copy, then run `sync', then do the ls?
>  > Have you experimented with the NFS mount options?  v2? UDP?
> 
>  I wonder if the ptep_test_and_clear_young() TLB flushing is related.

That, or page_referenced() always returns true on this ARM implementation
or some such silliness.  Everything here points at the VM being unable to
reclaim that clean pagecache.


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  3:37           ` Nick Piggin
@ 2004-04-18  4:17             ` William Lee Irwin III
  2004-04-18  4:41               ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2004-04-18  4:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Marc Singer, Andrew Morton, linux-kernel

On Sun, Apr 18, 2004 at 01:37:45PM +1000, Nick Piggin wrote:
> swappiness is pretty arbitrary and unfortunately it means
> different things to machines with different sized memory.
> Also, once you *have* gone past the reclaim_mapped threshold,
> mapped pages aren't really given any preference above
> unmapped pages.
> I have a small patchset which splits the active list roughly
> into mapped and unmapped pages. It might hopefully solve your
> problem. Would you give it a try? It is pretty stable here.

It would be interesting to see the results of this on Marc's system.
It's a more comprehensive solution than tweaking numbers.


-- wli


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  4:17             ` William Lee Irwin III
@ 2004-04-18  4:41               ` Nick Piggin
  2004-04-18  5:10                 ` Marc Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2004-04-18  4:41 UTC (permalink / raw)
  To: Marc Singer; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1235 bytes --]

William Lee Irwin III wrote:
> On Sun, Apr 18, 2004 at 01:37:45PM +1000, Nick Piggin wrote:
> 
>>swappiness is pretty arbitrary and unfortunately it means
>>different things to machines with different sized memory.
>>Also, once you *have* gone past the reclaim_mapped threshold,
>>mapped pages aren't really given any preference above
>>unmapped pages.
>>I have a small patchset which splits the active list roughly
>>into mapped and unmapped pages. It might hopefully solve your
>>problem. Would you give it a try? It is pretty stable here.
> 
> 
> It would be interesting to see the results of this on Marc's system.
> It's a more comprehensive solution than tweaking numbers.
> 

Well, here is the current patch against 2.6.5-mm6. -mm is
different enough from -linus now that it is not 100% trivial
to patch (mainly the rmap and hugepages work).

Marc, if you could test this it would be great. I've been doing
very swap heavy tests for the last 24 hours on an SMP system
here, so it should be fairly stable.

It replaces /proc/sys/vm/swappiness with
/proc/sys/vm/mapped_page_cost, which is in units of unmapped
pages. I have found 8 to be pretty good, so that is the
default. Higher makes it less likely to evict mapped pages.

Nick

[-- Attachment #2: split-active-list.patch --]
[-- Type: text/x-patch, Size: 27964 bytes --]

 linux-2.6-npiggin/include/linux/buffer_head.h |    1 
 linux-2.6-npiggin/include/linux/mm_inline.h   |   48 +++++--
 linux-2.6-npiggin/include/linux/mmzone.h      |   28 ----
 linux-2.6-npiggin/include/linux/page-flags.h  |   54 ++++---
 linux-2.6-npiggin/include/linux/swap.h        |    4 
 linux-2.6-npiggin/kernel/sysctl.c             |    9 -
 linux-2.6-npiggin/mm/filemap.c                |    6 
 linux-2.6-npiggin/mm/hugetlb.c                |    9 -
 linux-2.6-npiggin/mm/memory.c                 |    1 
 linux-2.6-npiggin/mm/page_alloc.c             |   26 ++-
 linux-2.6-npiggin/mm/shmem.c                  |    1 
 linux-2.6-npiggin/mm/swap.c                   |   54 +------
 linux-2.6-npiggin/mm/vmscan.c                 |  176 ++++++++++----------------
 13 files changed, 193 insertions(+), 224 deletions(-)

diff -puN include/linux/buffer_head.h~rollup include/linux/buffer_head.h
--- linux-2.6/include/linux/buffer_head.h~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/include/linux/buffer_head.h	2004-04-18 14:30:21.000000000 +1000
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/linkage.h>
 #include <linux/wait.h>
+#include <linux/mm_inline.h>
 #include <asm/atomic.h>
 
 enum bh_state_bits {
diff -puN include/linux/mm_inline.h~rollup include/linux/mm_inline.h
--- linux-2.6/include/linux/mm_inline.h~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/include/linux/mm_inline.h	2004-04-18 14:30:21.000000000 +1000
@@ -1,9 +1,21 @@
+#ifndef _LINUX_MM_INLINE_H
+#define _LINUX_MM_INLINE_H
+
+#include <linux/mm.h>
+#include <linux/page-flags.h>
 
 static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
+add_page_to_active_mapped_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->active_list);
-	zone->nr_active++;
+	list_add(&page->lru, &zone->active_mapped_list);
+	zone->nr_active_mapped++;
+}
+
+static inline void
+add_page_to_active_unmapped_list(struct zone *zone, struct page *page)
+{
+	list_add(&page->lru, &zone->active_unmapped_list);
+	zone->nr_active_unmapped++;
 }
 
 static inline void
@@ -14,10 +26,17 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
+del_page_from_active_mapped_list(struct zone *zone, struct page *page)
+{
+	list_del(&page->lru);
+	zone->nr_active_mapped--;
+}
+
+static inline void
+del_page_from_active_unmapped_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
-	zone->nr_active--;
+	zone->nr_active_unmapped--;
 }
 
 static inline void
@@ -31,10 +50,23 @@ static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
-	if (PageActive(page)) {
-		ClearPageActive(page);
-		zone->nr_active--;
+	if (PageActiveMapped(page)) {
+		ClearPageActiveMapped(page);
+		zone->nr_active_mapped--;
+	} else if (PageActiveUnmapped(page)) {
+		ClearPageActiveUnmapped(page);
+		zone->nr_active_unmapped--;
 	} else {
 		zone->nr_inactive--;
 	}
 }
+
+/*
+ * Mark a page as having seen activity.
+ */
+static inline void mark_page_accessed(struct page *page)
+{
+	SetPageReferenced(page);
+}
+
+#endif /* _LINUX_MM_INLINE_H */
diff -puN include/linux/mmzone.h~rollup include/linux/mmzone.h
--- linux-2.6/include/linux/mmzone.h~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/include/linux/mmzone.h	2004-04-18 14:30:21.000000000 +1000
@@ -104,11 +104,14 @@ struct zone {
 	ZONE_PADDING(_pad1_)
 
 	spinlock_t		lru_lock;	
-	struct list_head	active_list;
+	struct list_head	active_mapped_list;
+	struct list_head	active_unmapped_list;
 	struct list_head	inactive_list;
-	atomic_t		nr_scan_active;
+	atomic_t		nr_scan_active_mapped;
+	atomic_t		nr_scan_active_unmapped;
 	atomic_t		nr_scan_inactive;
-	unsigned long		nr_active;
+	unsigned long		nr_active_mapped;
+	unsigned long		nr_active_unmapped;
 	unsigned long		nr_inactive;
 	int			all_unreclaimable; /* All pages pinned */
 	unsigned long		pages_scanned;	   /* since last reclaim */
@@ -116,25 +119,6 @@ struct zone {
 	ZONE_PADDING(_pad2_)
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invokation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to free_pages == pages_high.
-	 *
-	 * Access to both these fields is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int temp_priority;
-	int prev_priority;
-
-	/*
 	 * free areas of different sizes
 	 */
 	struct free_area	free_area[MAX_ORDER];
diff -puN include/linux/page-flags.h~rollup include/linux/page-flags.h
--- linux-2.6/include/linux/page-flags.h~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/include/linux/page-flags.h	2004-04-18 14:30:21.000000000 +1000
@@ -58,25 +58,27 @@
 
 #define PG_dirty	 	 4
 #define PG_lru			 5
-#define PG_active		 6
-#define PG_slab			 7	/* slab debug (Suparna wants this) */
+#define PG_active_mapped	 6
+#define PG_active_unmapped	 7
 
-#define PG_highmem		 8
-#define PG_checked		 9	/* kill me in 2.5.<early>. */
-#define PG_arch_1		10
-#define PG_reserved		11
-
-#define PG_private		12	/* Has something at ->private */
-#define PG_writeback		13	/* Page is under writeback */
-#define PG_nosave		14	/* Used for system suspend/resume */
-#define PG_maplock		15	/* Lock bit for rmap to ptes */
-
-#define PG_direct		16	/* ->pte_chain points directly at pte */
-#define PG_mappedtodisk		17	/* Has blocks allocated on-disk */
-#define PG_reclaim		18	/* To be reclaimed asap */
-#define PG_compound		19	/* Part of a compound page */
-#define PG_anon			20	/* Anonymous page: anon_vma in mapping*/
-#define PG_swapcache		21	/* Swap page: swp_entry_t in private */
+#define PG_slab			 8	/* slab debug (Suparna wants this) */
+#define PG_highmem		 9
+#define PG_checked		10	/* kill me in 2.5.<early>. */
+#define PG_arch_1		11
+
+#define PG_reserved		12
+#define PG_private		13	/* Has something at ->private */
+#define PG_writeback		14	/* Page is under writeback */
+#define PG_nosave		15	/* Used for system suspend/resume */
+
+#define PG_maplock		16	/* Lock bit for rmap to ptes */
+#define PG_direct		17	/* ->pte_chain points directly at pte */
+#define PG_mappedtodisk		18	/* Has blocks allocated on-disk */
+#define PG_reclaim		19	/* To be reclaimed asap */
+
+#define PG_compound		20	/* Part of a compound page */
+#define PG_anon			21	/* Anonymous page: anon_vma in mapping*/
+#define PG_swapcache		22	/* Swap page: swp_entry_t in private */
 
 
 /*
@@ -213,11 +215,17 @@ extern void get_full_page_state(struct p
 #define TestSetPageLRU(page)	test_and_set_bit(PG_lru, &(page)->flags)
 #define TestClearPageLRU(page)	test_and_clear_bit(PG_lru, &(page)->flags)
 
-#define PageActive(page)	test_bit(PG_active, &(page)->flags)
-#define SetPageActive(page)	set_bit(PG_active, &(page)->flags)
-#define ClearPageActive(page)	clear_bit(PG_active, &(page)->flags)
-#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
-#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)
+#define PageActiveMapped(page)		test_bit(PG_active_mapped, &(page)->flags)
+#define SetPageActiveMapped(page)	set_bit(PG_active_mapped, &(page)->flags)
+#define ClearPageActiveMapped(page)	clear_bit(PG_active_mapped, &(page)->flags)
+#define TestClearPageActiveMapped(page) test_and_clear_bit(PG_active_mapped, &(page)->flags)
+#define TestSetPageActiveMapped(page) test_and_set_bit(PG_active_mapped, &(page)->flags)
+
+#define PageActiveUnmapped(page)	test_bit(PG_active_unmapped, &(page)->flags)
+#define SetPageActiveUnmapped(page)	set_bit(PG_active_unmapped, &(page)->flags)
+#define ClearPageActiveUnmapped(page)	clear_bit(PG_active_unmapped, &(page)->flags)
+#define TestClearPageActiveUnmapped(page) test_and_clear_bit(PG_active_unmapped, &(page)->flags)
+#define TestSetPageActiveUnmapped(page) test_and_set_bit(PG_active_unmapped, &(page)->flags)
 
 #define PageSlab(page)		test_bit(PG_slab, &(page)->flags)
 #define SetPageSlab(page)	set_bit(PG_slab, &(page)->flags)
diff -puN include/linux/swap.h~rollup include/linux/swap.h
--- linux-2.6/include/linux/swap.h~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/include/linux/swap.h	2004-04-18 14:30:21.000000000 +1000
@@ -165,8 +165,6 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
-extern void FASTCALL(activate_page(struct page *));
-extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int rotate_reclaimable_page(struct page *page);
 extern void swap_setup(void);
@@ -174,7 +172,7 @@ extern void swap_setup(void);
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone **, unsigned int, unsigned int);
 extern int shrink_all_memory(int);
-extern int vm_swappiness;
+extern int vm_mapped_page_cost;
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
diff -puN kernel/sysctl.c~rollup kernel/sysctl.c
--- linux-2.6/kernel/sysctl.c~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/kernel/sysctl.c	2004-04-18 14:30:21.000000000 +1000
@@ -621,6 +621,7 @@ static ctl_table kern_table[] = {
 /* Constants for minimum and maximum testing in vm_table.
    We use these as one-element integer vectors. */
 static int zero;
+static int one = 1;
 static int one_hundred = 100;
 
 
@@ -697,13 +698,13 @@ static ctl_table vm_table[] = {
 	},
 	{
 		.ctl_name	= VM_SWAPPINESS,
-		.procname	= "swappiness",
-		.data		= &vm_swappiness,
-		.maxlen		= sizeof(vm_swappiness),
+		.procname	= "mapped_page_cost",
+		.data		= &vm_mapped_page_cost,
+		.maxlen		= sizeof(vm_mapped_page_cost),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec_minmax,
 		.strategy	= &sysctl_intvec,
-		.extra1		= &zero,
+		.extra1		= &one,
 		.extra2		= &one_hundred,
 	},
 #ifdef CONFIG_HUGETLB_PAGE
diff -puN mm/filemap.c~rollup mm/filemap.c
--- linux-2.6/mm/filemap.c~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/mm/filemap.c	2004-04-18 14:30:21.000000000 +1000
@@ -663,11 +663,7 @@ page_ok:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
-		/*
-		 * Mark the page accessed if we read the beginning.
-		 */
-		if (!offset)
-			mark_page_accessed(page);
+		mark_page_accessed(page);
 
 		/*
 		 * Ok, we have the page, and it's up-to-date, so
diff -puN mm/hugetlb.c~rollup mm/hugetlb.c
--- linux-2.6/mm/hugetlb.c~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/mm/hugetlb.c	2004-04-18 14:30:21.000000000 +1000
@@ -127,9 +127,12 @@ static void update_and_free_page(struct 
 	int i;
 	nr_huge_pages--;
 	for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
-		page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
-				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
-				1 << PG_private | 1<< PG_writeback);
+		page[i].flags &= ~(
+			1 << PG_locked		| 1 << PG_error		|
+			1 << PG_referenced	| 1 << PG_dirty		|
+			1 << PG_active_mapped	| 1 << PG_active_unmapped |
+			1 << PG_reserved	| 1 << PG_private	|
+			1 << PG_writeback);
 		set_page_count(&page[i], 0);
 	}
 	set_page_count(page, 1);
diff -puN mm/memory.c~rollup mm/memory.c
--- linux-2.6/mm/memory.c~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/mm/memory.c	2004-04-18 14:30:21.000000000 +1000
@@ -38,6 +38,7 @@
 
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
+#include <linux/mm_inline.h>
 #include <linux/hugetlb.h>
 #include <linux/mman.h>
 #include <linux/swap.h>
diff -puN mm/page_alloc.c~rollup mm/page_alloc.c
--- linux-2.6/mm/page_alloc.c~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/mm/page_alloc.c	2004-04-18 14:30:21.000000000 +1000
@@ -82,7 +82,7 @@ static void bad_page(const char *functio
 	page->flags &= ~(1 << PG_private	|
 			1 << PG_locked	|
 			1 << PG_lru	|
-			1 << PG_active	|
+			1 << PG_active_mapped	|
 			1 << PG_dirty	|
 			1 << PG_maplock |
 			1 << PG_anon    |
@@ -224,7 +224,8 @@ static inline void free_pages_check(cons
 			1 << PG_lru	|
 			1 << PG_private |
 			1 << PG_locked	|
-			1 << PG_active	|
+			1 << PG_active_mapped	|
+			1 << PG_active_unmapped	|
 			1 << PG_reclaim	|
 			1 << PG_slab	|
 			1 << PG_maplock |
@@ -334,7 +335,8 @@ static void prep_new_page(struct page *p
 			1 << PG_private	|
 			1 << PG_locked	|
 			1 << PG_lru	|
-			1 << PG_active	|
+			1 << PG_active_mapped	|
+			1 << PG_active_unmapped	|
 			1 << PG_dirty	|
 			1 << PG_reclaim	|
 			1 << PG_maplock |
@@ -859,7 +861,8 @@ unsigned int nr_used_zone_pages(void)
 	struct zone *zone;
 
 	for_each_zone(zone)
-		pages += zone->nr_active + zone->nr_inactive;
+		pages += zone->nr_active_mapped + zone->nr_active_unmapped
+			+ zone->nr_inactive;
 
 	return pages;
 }
@@ -996,7 +999,7 @@ void get_zone_counts(unsigned long *acti
 	*inactive = 0;
 	*free = 0;
 	for_each_zone(zone) {
-		*active += zone->nr_active;
+		*active += zone->nr_active_mapped + zone->nr_active_unmapped;
 		*inactive += zone->nr_inactive;
 		*free += zone->free_pages;
 	}
@@ -1114,7 +1117,7 @@ void show_free_areas(void)
 			K(zone->pages_min),
 			K(zone->pages_low),
 			K(zone->pages_high),
-			K(zone->nr_active),
+			K(zone->nr_active_mapped + zone->nr_active_unmapped),
 			K(zone->nr_inactive),
 			K(zone->present_pages)
 			);
@@ -1461,8 +1464,6 @@ static void __init free_area_init_core(s
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
 
-		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
-
 		/*
 		 * The per-cpu-pages pools are set to around 1000th of the
 		 * size of the zone.  But no more than 1/4 of a meg - there's
@@ -1496,11 +1497,14 @@ static void __init free_area_init_core(s
 		}
 		printk("  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
-		INIT_LIST_HEAD(&zone->active_list);
+		INIT_LIST_HEAD(&zone->active_mapped_list);
+		INIT_LIST_HEAD(&zone->active_unmapped_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
-		atomic_set(&zone->nr_scan_active, 0);
+		atomic_set(&zone->nr_scan_active_mapped, 0);
+		atomic_set(&zone->nr_scan_active_unmapped, 0);
 		atomic_set(&zone->nr_scan_inactive, 0);
-		zone->nr_active = 0;
+		zone->nr_active_mapped = 0;
+		zone->nr_active_unmapped = 0;
 		zone->nr_inactive = 0;
 		if (!size)
 			continue;
diff -puN mm/shmem.c~rollup mm/shmem.c
--- linux-2.6/mm/shmem.c~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/mm/shmem.c	2004-04-18 14:30:21.000000000 +1000
@@ -25,6 +25,7 @@
 #include <linux/devfs_fs_kernel.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/mm_inline.h>
 #include <linux/mman.h>
 #include <linux/file.h>
 #include <linux/swap.h>
diff -puN mm/swap.c~rollup mm/swap.c
--- linux-2.6/mm/swap.c~rollup	2004-04-18 14:30:20.000000000 +1000
+++ linux-2.6-npiggin/mm/swap.c	2004-04-18 14:30:21.000000000 +1000
@@ -79,14 +79,18 @@ int rotate_reclaimable_page(struct page 
 		return 1;
 	if (PageDirty(page))
 		return 1;
-	if (PageActive(page))
+	if (PageActiveMapped(page))
+		return 1;
+	if (PageActiveUnmapped(page))
 		return 1;
 	if (!PageLRU(page))
 		return 1;
 
 	zone = page_zone(page);
 	spin_lock_irqsave(&zone->lru_lock, flags);
-	if (PageLRU(page) && !PageActive(page)) {
+	if (PageLRU(page)
+		&& !PageActiveMapped(page) && !PageActiveUnmapped(page)) {
+
 		list_del(&page->lru);
 		list_add_tail(&page->lru, &zone->inactive_list);
 		inc_page_state(pgrotated);
@@ -97,42 +101,6 @@ int rotate_reclaimable_page(struct page 
 	return 0;
 }
 
-/*
- * FIXME: speed this up?
- */
-void fastcall activate_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-
-	spin_lock_irq(&zone->lru_lock);
-	if (PageLRU(page) && !PageActive(page)) {
-		del_page_from_inactive_list(zone, page);
-		SetPageActive(page);
-		add_page_to_active_list(zone, page);
-		inc_page_state(pgactivate);
-	}
-	spin_unlock_irq(&zone->lru_lock);
-}
-
-/*
- * Mark a page as having seen activity.
- *
- * inactive,unreferenced	->	inactive,referenced
- * inactive,referenced		->	active,unreferenced
- * active,unreferenced		->	active,referenced
- */
-void fastcall mark_page_accessed(struct page *page)
-{
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
-		activate_page(page);
-		ClearPageReferenced(page);
-	} else if (!PageReferenced(page)) {
-		SetPageReferenced(page);
-	}
-}
-
-EXPORT_SYMBOL(mark_page_accessed);
-
 /**
  * lru_cache_add: add a page to the page lists
  * @page: the page to add
@@ -331,9 +299,13 @@ void __pagevec_lru_add_active(struct pag
 		}
 		if (TestSetPageLRU(page))
 			BUG();
-		if (TestSetPageActive(page))
-			BUG();
-		add_page_to_active_list(zone, page);
+		if (page_mapped(page)) {
+			SetPageActiveMapped(page);
+			add_page_to_active_mapped_list(zone, page);
+		} else {
+			SetPageActiveUnmapped(page);
+			add_page_to_active_unmapped_list(zone, page);
+		}
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
diff -puN mm/vmscan.c~rollup mm/vmscan.c
--- linux-2.6/mm/vmscan.c~rollup	2004-04-18 14:30:21.000000000 +1000
+++ linux-2.6-npiggin/mm/vmscan.c	2004-04-18 14:30:21.000000000 +1000
@@ -40,10 +40,9 @@
 #include <linux/swapops.h>
 
 /*
- * From 0 .. 100.  Higher means more swappy.
+ * From 1 .. 100.  Higher means less swappy.
  */
-int vm_swappiness = 60;
-static long total_memory;
+int vm_mapped_page_cost = 8;
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
@@ -264,11 +263,11 @@ shrink_list(struct list_head *page_list,
 		if (TestSetPageLocked(page))
 			goto keep;
 
-		/* Double the slab pressure for mapped and swapcache pages */
-		if (page_mapped(page) || PageSwapCache(page))
-			(*nr_scanned)++;
+		/* Increase the slab pressure for mapped pages */
+		if (page_mapped(page))
+			(*nr_scanned) += vm_mapped_page_cost;
 
-		BUG_ON(PageActive(page));
+		BUG_ON(PageActiveMapped(page) || PageActiveUnmapped(page));
 
 		if (PageWriteback(page))
 			goto keep_locked;
@@ -444,7 +443,10 @@ free_it:
 		continue;
 
 activate_locked:
-		SetPageActive(page);
+		if (page_mapped(page))
+			SetPageActiveMapped(page);
+		else
+			SetPageActiveUnmapped(page);
 		pgactivate++;
 keep_locked:
 		unlock_page(page);
@@ -540,8 +542,10 @@ shrink_cache(struct zone *zone, unsigned
 			if (TestSetPageLRU(page))
 				BUG();
 			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
+			if (PageActiveMapped(page))
+				add_page_to_active_mapped_list(zone, page);
+			else if (PageActiveUnmapped(page))
+				add_page_to_active_unmapped_list(zone, page);
 			else
 				add_page_to_inactive_list(zone, page);
 			if (!pagevec_add(&pvec, page)) {
@@ -574,36 +578,32 @@ done:
  * The downside is that we have to touch page->count against each page.
  * But we had to alter page->flags anyway.
  */
-static void
-refill_inactive_zone(struct zone *zone, const int nr_pages_in,
-			struct page_state *ps)
+static void shrink_active_list(struct zone *zone, struct list_head *list,
+		unsigned long *list_count, const int nr_scan,
+		struct page_state *ps)
 {
-	int pgmoved;
+	int pgmoved, pgmoved_unmapped;
 	int pgdeactivate = 0;
-	int nr_pages = nr_pages_in;
+	int nr_pages = nr_scan;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
 	LIST_HEAD(l_inactive);	/* Pages to go onto the inactive_list */
 	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
 	struct page *page;
 	struct pagevec pvec;
-	int reclaim_mapped = 0;
-	long mapped_ratio;
-	long distress;
-	long swap_tendency;
 
 	lru_add_drain();
 	pgmoved = 0;
 	spin_lock_irq(&zone->lru_lock);
-	while (nr_pages && !list_empty(&zone->active_list)) {
-		page = lru_to_page(&zone->active_list);
-		prefetchw_prev_lru_page(page, &zone->active_list, flags);
+	while (nr_pages && !list_empty(list)) {
+		page = lru_to_page(list);
+		prefetchw_prev_lru_page(page, list, flags);
 		if (!TestClearPageLRU(page))
 			BUG();
 		list_del(&page->lru);
 		if (page_count(page) == 0) {
 			/* It is currently in pagevec_release() */
 			SetPageLRU(page);
-			list_add(&page->lru, &zone->active_list);
+			list_add(&page->lru, list);
 		} else {
 			page_cache_get(page);
 			list_add(&page->lru, &l_hold);
@@ -611,61 +611,21 @@ refill_inactive_zone(struct zone *zone, 
 		}
 		nr_pages--;
 	}
-	zone->nr_active -= pgmoved;
+	*list_count -= pgmoved;
 	spin_unlock_irq(&zone->lru_lock);
 
-	/*
-	 * `distress' is a measure of how much trouble we're having reclaiming
-	 * pages.  0 -> no problems.  100 -> great trouble.
-	 */
-	distress = 100 >> zone->prev_priority;
-
-	/*
-	 * The point of this algorithm is to decide when to start reclaiming
-	 * mapped memory instead of just pagecache.  Work out how much memory
-	 * is mapped.
-	 */
-	mapped_ratio = (ps->nr_mapped * 100) / total_memory;
-
-	/*
-	 * Now decide how much we really want to unmap some pages.  The mapped
-	 * ratio is downgraded - just because there's a lot of mapped memory
-	 * doesn't necessarily mean that page reclaim isn't succeeding.
-	 *
-	 * The distress ratio is important - we don't want to start going oom.
-	 *
-	 * A 100% value of vm_swappiness overrides this algorithm altogether.
-	 */
-	swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
-
-	/*
-	 * Now use this metric to decide whether to start moving mapped memory
-	 * onto the inactive list.
-	 */
-	if (swap_tendency >= 100)
-		reclaim_mapped = 1;
-
 	while (!list_empty(&l_hold)) {
+		int referenced;
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
-		if (page_mapped(page)) {
-			if (!reclaim_mapped) {
-				list_add(&page->lru, &l_active);
-				continue;
-			}
-			rmap_lock(page);
-			if (page_referenced(page)) {
-				rmap_unlock(page);
-				list_add(&page->lru, &l_active);
-				continue;
-			}
-			rmap_unlock(page);
-		}
+		rmap_lock(page);
+		referenced = page_referenced(page);
+		rmap_unlock(page);
 		/*
 		 * FIXME: need to consider page_count(page) here if/when we
 		 * reap orphaned pages via the LRU (Daniel's locking stuff)
 		 */
-		if (total_swap_pages == 0 && PageAnon(page)) {
+		if (referenced || (total_swap_pages == 0 && PageAnon(page))) {
 			list_add(&page->lru, &l_active);
 			continue;
 		}
@@ -680,7 +640,8 @@ refill_inactive_zone(struct zone *zone, 
 		prefetchw_prev_lru_page(page, &l_inactive, flags);
 		if (TestSetPageLRU(page))
 			BUG();
-		if (!TestClearPageActive(page))
+		if (!TestClearPageActiveMapped(page)
+				&& !TestClearPageActiveUnmapped(page))
 			BUG();
 		list_move(&page->lru, &zone->inactive_list);
 		pgmoved++;
@@ -704,27 +665,41 @@ refill_inactive_zone(struct zone *zone, 
 	}
 
 	pgmoved = 0;
+	pgmoved_unmapped = 0;
 	while (!list_empty(&l_active)) {
 		page = lru_to_page(&l_active);
 		prefetchw_prev_lru_page(page, &l_active, flags);
 		if (TestSetPageLRU(page))
 			BUG();
-		BUG_ON(!PageActive(page));
-		list_move(&page->lru, &zone->active_list);
-		pgmoved++;
+		if (!TestClearPageActiveMapped(page)
+				&& !TestClearPageActiveUnmapped(page))
+			BUG();
+		if (page_mapped(page)) {
+			SetPageActiveMapped(page);
+			list_move(&page->lru, &zone->active_mapped_list);
+			pgmoved++;
+		} else {
+			SetPageActiveUnmapped(page);
+			list_move(&page->lru, &zone->active_unmapped_list);
+			pgmoved_unmapped++;
+		}
+
 		if (!pagevec_add(&pvec, page)) {
-			zone->nr_active += pgmoved;
+			zone->nr_active_mapped += pgmoved;
 			pgmoved = 0;
+			zone->nr_active_unmapped += pgmoved_unmapped;
+			pgmoved_unmapped = 0;
 			spin_unlock_irq(&zone->lru_lock);
 			__pagevec_release(&pvec);
 			spin_lock_irq(&zone->lru_lock);
 		}
 	}
-	zone->nr_active += pgmoved;
+	zone->nr_active_mapped += pgmoved;
+	zone->nr_active_unmapped += pgmoved_unmapped;
 	spin_unlock_irq(&zone->lru_lock);
 	pagevec_release(&pvec);
 
-	mod_page_state_zone(zone, pgrefill, nr_pages_in - nr_pages);
+	mod_page_state_zone(zone, pgrefill, nr_scan - nr_pages);
 	mod_page_state(pgdeactivate, pgdeactivate);
 }
 
@@ -737,6 +712,8 @@ shrink_zone(struct zone *zone, int max_s
 		int *total_scanned, struct page_state *ps, int do_writepage)
 {
 	unsigned long ratio;
+	unsigned long long mapped_ratio;
+	unsigned long nr_active;
 	int count;
 
 	/*
@@ -749,14 +726,27 @@ shrink_zone(struct zone *zone, int max_s
 	 * just to make sure that the kernel will slowly sift through the
 	 * active list.
 	 */
-	ratio = (unsigned long)SWAP_CLUSTER_MAX * zone->nr_active /
-				((zone->nr_inactive | 1) * 2);
+	nr_active = zone->nr_active_mapped + zone->nr_active_unmapped;
+	ratio = (unsigned long)SWAP_CLUSTER_MAX * nr_active /
+				(zone->nr_inactive * 2 + 1);
+	mapped_ratio = (unsigned long long)ratio * nr_active;
+	do_div(mapped_ratio, (zone->nr_active_unmapped * vm_mapped_page_cost) + 1);
+
+	ratio = ratio - mapped_ratio;
+	atomic_add(ratio+1, &zone->nr_scan_active_unmapped);
+	count = atomic_read(&zone->nr_scan_active_unmapped);
+	if (count >= SWAP_CLUSTER_MAX) {
+		atomic_set(&zone->nr_scan_active_unmapped, 0);
+		shrink_active_list(zone, &zone->active_unmapped_list,
+					&zone->nr_active_unmapped, count, ps);
+	}
 
-	atomic_add(ratio+1, &zone->nr_scan_active);
-	count = atomic_read(&zone->nr_scan_active);
+	atomic_add(mapped_ratio+1, &zone->nr_scan_active_mapped);
+	count = atomic_read(&zone->nr_scan_active_mapped);
 	if (count >= SWAP_CLUSTER_MAX) {
-		atomic_set(&zone->nr_scan_active, 0);
-		refill_inactive_zone(zone, count, ps);
+		atomic_set(&zone->nr_scan_active_mapped, 0);
+		shrink_active_list(zone, &zone->active_mapped_list,
+					&zone->nr_active_mapped, count, ps);
 	}
 
 	atomic_add(max_scan, &zone->nr_scan_inactive);
@@ -796,9 +786,6 @@ shrink_caches(struct zone **zones, int p
 		struct zone *zone = zones[i];
 		int max_scan;
 
-		if (zone->free_pages < zone->pages_high)
-			zone->temp_priority = priority;
-
 		if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 			continue;	/* Let kswapd poll it */
 
@@ -833,15 +820,11 @@ int try_to_free_pages(struct zone **zone
 	int ret = 0;
 	int nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	int i;
 	unsigned long total_scanned = 0;
 	int do_writepage = 0;
 
 	inc_page_state(allocstall);
 
-	for (i = 0; zones[i] != 0; i++)
-		zones[i]->temp_priority = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int scanned = 0;
 		struct page_state ps;
@@ -880,8 +863,6 @@ int try_to_free_pages(struct zone **zone
 	if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
 		out_of_memory();
 out:
-	for (i = 0; zones[i] != 0; i++)
-		zones[i]->prev_priority = zones[i]->temp_priority;
 	return ret;
 }
 
@@ -922,12 +903,6 @@ static int balance_pgdat(pg_data_t *pgda
 
 	inc_page_state(pageoutrun);
 
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->temp_priority = DEF_PRIORITY;
-	}
-
 	for (priority = DEF_PRIORITY; priority; priority--) {
 		int all_zones_ok = 1;
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
@@ -977,7 +952,6 @@ scan:
 				if (zone->free_pages <= zone->pages_high)
 					all_zones_ok = 0;
 			}
-			zone->temp_priority = priority;
 			max_scan = zone->nr_inactive >> priority;
 			reclaimed = shrink_zone(zone, max_scan, GFP_KERNEL,
 					&scanned, ps, do_writepage);
@@ -1012,11 +986,6 @@ scan:
 			blk_congestion_wait(WRITE, HZ/10);
 	}
 out:
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = zone->temp_priority;
-	}
 	return total_reclaimed;
 }
 
@@ -1150,7 +1119,6 @@ static int __init kswapd_init(void)
 	for_each_pgdat(pgdat)
 		pgdat->kswapd
 		= find_task_by_pid(kernel_thread(kswapd, pgdat, CLONE_KERNEL));
-	total_memory = nr_free_pagecache_pages();
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }

_
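
For illustration, the balancing arithmetic in the shrink_zone() hunk above
can be modelled in userspace.  This is only a sketch: the function name is
made up here, and SWAP_CLUSTER_MAX is taken as 32 (its 2.6 value); nothing
below is part of the patch itself.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32	/* as in 2.6 <linux/swap.h> */

/*
 * Mirror of the scan-balancing arithmetic in shrink_zone() above: one
 * round of active-list pressure ("ratio") is split between the mapped
 * and unmapped lists, with mapped pages weighted by vm_mapped_page_cost.
 */
void split_scan(unsigned long nr_active_mapped,
		unsigned long nr_active_unmapped,
		unsigned long nr_inactive,
		int mapped_page_cost,
		unsigned long *scan_unmapped,
		unsigned long *scan_mapped)
{
	unsigned long nr_active = nr_active_mapped + nr_active_unmapped;
	unsigned long ratio = SWAP_CLUSTER_MAX * nr_active /
				(nr_inactive * 2 + 1);
	unsigned long long mapped_ratio = (unsigned long long)ratio * nr_active;

	/* stands in for the do_div() in the patch */
	mapped_ratio /= nr_active_unmapped * mapped_page_cost + 1;

	*scan_mapped = (unsigned long)mapped_ratio;
	*scan_unmapped = ratio - *scan_mapped;
}
```

With 6000 mapped and 2000 unmapped active pages over 8000 inactive pages,
the default cost of 8 directs 7 of the 15 scan units at the mapped list;
raising the cost to 64 directs none at it.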

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  1:06       ` William Lee Irwin III
@ 2004-04-18  5:05         ` Marc Singer
  0 siblings, 0 replies; 27+ messages in thread
From: Marc Singer @ 2004-04-18  5:05 UTC (permalink / raw)
  To: William Lee Irwin III, Marc Singer, linux-kernel, akpm

On Sat, Apr 17, 2004 at 06:06:16PM -0700, William Lee Irwin III wrote:
> On Sat, Apr 17, 2004 at 02:33:33PM -0700, William Lee Irwin III wrote:
> >> This doesn't match your first response. Anyway, this one gets
> >> scrapped. I guess if swappiness solves it, then so much the better.
> 
> On Sat, Apr 17, 2004 at 02:52:57PM -0700, Marc Singer wrote:
> > Huh?  Where do you see a discrepancy?  I don't think I claimed that
> > the test program performance changed.  The noticeable difference is in
> > interactivity once the page cache fills.  IMHO, 30 seconds to do a
> > file listing on /proc is extreme.
> 
> Oh, sorry, it was unclear to me that the test changed anything but
> swappiness (i.e. I couldn't tell they included the patch etc.)

Ah, OK.  Now I understand your confusion.  Based on the numbers, it is
clear that your last patch does exactly the same thing as setting
swappiness.  It is true that I didn't apply it.  Still, I think that
your change is worth consideration since setting swappiness to zero is
such a blunt solution.  I apologize for not making this clear before.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  4:41               ` Nick Piggin
@ 2004-04-18  5:10                 ` Marc Singer
  2004-04-18  5:19                   ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Singer @ 2004-04-18  5:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Marc Singer, William Lee Irwin III, Andrew Morton, linux-kernel

On Sun, Apr 18, 2004 at 02:41:12PM +1000, Nick Piggin wrote:
> William Lee Irwin III wrote:
> >On Sun, Apr 18, 2004 at 01:37:45PM +1000, Nick Piggin wrote:
> >
> >>swappiness is pretty arbitrary and unfortunately it means
> >>different things to machines with different sized memory.
> >>Also, once you *have* gone past the reclaim_mapped threshold,
> >>mapped pages aren't really given any preference above
> >>unmapped pages.
> >>I have a small patchset which splits the active list roughly
> >>into mapped and unmapped pages. It might hopefully solve your
> >>problem. Would you give it a try? It is pretty stable here.
> >
> >
> >It would be interesting to see the results of this on Marc's system.
> >It's a more comprehensive solution than tweaking numbers.
> >
> 
> Well, here is the current patch against 2.6.5-mm6. -mm is
> different enough from -linus now that it is not 100% trivial
> to patch (mainly the rmap and hugepages work).

Will this work against 2.6.5 without -mm6?

As an aside, I've been using SVN to manage my kernel sources.  While
I'd be thrilled to make it work, it simply doesn't seem to have the
heavy lifting capability to handle the kernel work.  I know the
rudiments of using BK.  What I'd like is some sort of HOWTO with
example of common tasks for kernel development.  Know of any?

> Marc if you could test this it would be great. I've been doing
> very swap heavy tests for the last 24 hours on a SMP system
> here, so it should be fairly stable.

I'm game.

> It replaces /proc/sys/vm/swappiness with
> /proc/sys/vm/mapped_page_cost, which is in units of unmapped
> pages. I have found 8 to be pretty good, so that is the
> default. Higher makes it less likely to evict mapped pages.

Sounds good.

Cheers.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  5:10                 ` Marc Singer
@ 2004-04-18  5:19                   ` Nick Piggin
  2004-04-18  5:35                     ` Marc Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2004-04-18  5:19 UTC (permalink / raw)
  To: Marc Singer; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel

Marc Singer wrote:
> On Sun, Apr 18, 2004 at 02:41:12PM +1000, Nick Piggin wrote:
> 
>>William Lee Irwin III wrote:
>>
>>>On Sun, Apr 18, 2004 at 01:37:45PM +1000, Nick Piggin wrote:
>>>
>>>
>>>>swappiness is pretty arbitrary and unfortunately it means
>>>>different things to machines with different sized memory.
>>>>Also, once you *have* gone past the reclaim_mapped threshold,
>>>>mapped pages aren't really given any preference above
>>>>unmapped pages.
>>>>I have a small patchset which splits the active list roughly
>>>>into mapped and unmapped pages. It might hopefully solve your
>>>>problem. Would you give it a try? It is pretty stable here.
>>>
>>>
>>>It would be interesting to see the results of this on Marc's system.
>>>It's a more comprehensive solution than tweaking numbers.
>>>
>>
>>Well, here is the current patch against 2.6.5-mm6. -mm is
>>different enough from -linus now that it is not 100% trivial
>>to patch (mainly the rmap and hugepages work).
> 
> 
> Will this work against 2.6.5 without -mm6?
> 

Unfortunately it won't patch easily. If this is a big
problem for you I could make you up a 2.6.5 version.

> As an aside, I've been using SVN to manage my kernel sources.  While
> I'd be thrilled to make it work, it simply doesn't seem to have the
> heavy lifting capability to handle the kernel work.  I know the
> rudiments of using BK.  What I'd like is some sort of HOWTO with
> example of common tasks for kernel development.  Know of any?
>

Well I don't do a great deal of coding or merging, but I
use Andrew Morton's patch scripts which make things very
easy for me.

Regarding bitkeeper, I have never tried it but there is
some help in Documentation/BK-usage/ which might be of
use to you.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  5:19                   ` Nick Piggin
@ 2004-04-18  5:35                     ` Marc Singer
  2004-04-18  5:41                       ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Singer @ 2004-04-18  5:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Marc Singer, William Lee Irwin III, Andrew Morton, linux-kernel

On Sun, Apr 18, 2004 at 03:19:59PM +1000, Nick Piggin wrote:
> >>Well, here is the current patch against 2.6.5-mm6. -mm is
> >>different enough from -linus now that it is not 100% trivial
> >>to patch (mainly the rmap and hugepages work).
> >
> >
> >Will this work against 2.6.5 without -mm6?
> >
> 
> Unfortunately it won't patch easily. If this is a big
> problem for you I could make you up a 2.6.5 version.

Well, I'll try applying his patch and then yours.  If it doesn't work
I'll let you know.

> 
> >As an aside, I've been using SVN to manage my kernel sources.  While
> >I'd be thrilled to make it work, it simply doesn't seem to have the
> >heavy lifting capability to handle the kernel work.  I know the
> >rudiments of using BK.  What I'd like is some sort of HOWTO with
> >example of common tasks for kernel development.  Know of any?
> >
> 
> Well I don't do a great deal of coding or merging, but I
> use Andrew Morton's patch scripts which make things very
> easy for me.

Where does he keep 'em?

> Regarding bitkeeper, I have never tried it but there is
> some help in Documentation/BK-usage/ which might be of
> use to you.

I'll read it.  Thanks.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  3:53           ` Andrew Morton
@ 2004-04-18  5:38             ` Marc Singer
  2004-04-18  5:52               ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Singer @ 2004-04-18  5:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, elf, linux-kernel

On Sat, Apr 17, 2004 at 08:53:38PM -0700, Andrew Morton wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:
> >
> >  On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> >  > I'd assume that setting swappiness to zero simply means that you still have
> >  > all of your libc in pagecache when running ls.
> >  > What happens if you do the big file copy, then run `sync', then do the ls?
> >  > Have you experimented with the NFS mount options?  v2? UDP?
> > 
> >  I wonder if the ptep_test_and_clear_young() TLB flushing is related.
> 
> That, or page_referenced() always returns true on this ARM implementation
> or some such silliness.  Everything here points at the VM being unable to
> reclaim that clean pagecache.

How can I tell?  Is it something like this: because page_referenced()
always returns true (which I haven't investigated), the page eviction
code cannot distinguish mapped pages from cache pages and therefore
selects valuable, mapped pages?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  5:35                     ` Marc Singer
@ 2004-04-18  5:41                       ` Nick Piggin
  2004-04-18 23:44                         ` Marc Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2004-04-18  5:41 UTC (permalink / raw)
  To: Marc Singer; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel

Marc Singer wrote:
> On Sun, Apr 18, 2004 at 03:19:59PM +1000, Nick Piggin wrote:
> 
>>>>Well, here is the current patch against 2.6.5-mm6. -mm is
>>>>different enough from -linus now that it is not 100% trivial
>>>>to patch (mainly the rmap and hugepages work).
>>>
>>>
>>>Will this work against 2.6.5 without -mm6?
>>>
>>
>>Unfortunately it won't patch easily. If this is a big
>>problem for you I could make you up a 2.6.5 version.
> 
> 
> We'll, I'll try applying his patch and then yours.  If it doesn't work
> I'll let you know.
> 

OK thanks.

> 
>>>As an aside, I've been using SVN to manage my kernel sources.  While
>>>I'd be thrilled to make it work, it simply doesn't seem to have the
>>>heavy lifting capability to handle the kernel work.  I know the
>>>rudiments of using BK.  What I'd like is some sort of HOWTO with
>>>example of common tasks for kernel development.  Know of any?
>>>
>>
>>Well I don't do a great deal of coding or merging, but I
>>use Andrew Morton's patch scripts which make things very
>>easy for me.
> 
> 
> Where does he keep 'em.
> 

http://www.zip.com.au/~akpm/linux/patches/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  5:38             ` Marc Singer
@ 2004-04-18  5:52               ` Andrew Morton
  2004-04-18  6:15                 ` Marc Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-04-18  5:52 UTC (permalink / raw)
  To: Marc Singer; +Cc: wli, elf, linux-kernel

Marc Singer <elf@buici.com> wrote:
>
> On Sat, Apr 17, 2004 at 08:53:38PM -0700, Andrew Morton wrote:
> > William Lee Irwin III <wli@holomorphy.com> wrote:
> > >
> > >  On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> > >  > I'd assume that setting swappiness to zero simply means that you still have
> > >  > all of your libc in pagecache when running ls.
> > >  > What happens if you do the big file copy, then run `sync', then do the ls?
> > >  > Have you experimented with the NFS mount options?  v2? UDP?
> > > 
> > >  I wonder if the ptep_test_and_clear_young() TLB flushing is related.
> > 
> > That, or page_referenced() always returns true on this ARM implementation
> > or some such silliness.  Everything here points at the VM being unable to
> > reclaim that clean pagecache.
> 
> How can I tell?

Well some more descriptions of what the system does after that copy-to-nfs
would help.  Does it _ever_ come good, or is a reboot needed, etc?

What does `vmstat 1' say during the copy, and during the ls?

/proc/vmstats before and after the ls.

Try doing the copy, then when it has finished do the old
memset(malloc(24M)) and monitor the `vmstat 1' output while it runs,
capture /proc/meminfo before and after.

None of the problems you report are present on x86 as far as I can tell,
so...



* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  5:52               ` Andrew Morton
@ 2004-04-18  6:15                 ` Marc Singer
  2004-04-19  0:26                   ` Rik van Riel
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Singer @ 2004-04-18  6:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marc Singer, wli, linux-kernel

On Sat, Apr 17, 2004 at 10:52:43PM -0700, Andrew Morton wrote:
> Well some more descriptions of what the system does after that copy-to-nfs
> would help.  Does it _ever_ come good, or is a reboot needed, etc?

As far as I can tell, it never gets good.  I run the same ls command
over and over.  Ten times.  Always the same behavior.  Sometimes it
gets even slower.  I can set swappiness to zero and it responds
normally, immediately.

> What does `vmstat 1' say during the copy, and during the ls?

I thought I sent a message about this.  I've found that the problem
*only* occurs when there is exactly one process running.  If I open a
second console (via telnet) then the slow-down behavior disappears.
If I logout of the console session and run the tests from the telnet
session then I *do* see the problem again.

> /proc/vmstat before and after the ls.

This one I can do.

BEFORE

    nr_dirty 0
    nr_writeback 0
    nr_unstable 0
    nr_page_table_pages 49
    nr_mapped 163
    nr_slab 255
    pgpgin 4
    pgpgout 0
    pswpin 0
    pswpout 0
    pgalloc_high 0
    pgalloc_normal 0
    pgalloc_dma 78568
    pgfree 79274
    pgactivate 8428
    pgdeactivate 8324
    pgfault 17112
    pgmajfault 1348
    pgrefill_high 0
    pgrefill_normal 0
    pgrefill_dma 543296
    pgsteal_high 0
    pgsteal_normal 0
    pgsteal_dma 60834
    pgscan_kswapd_high 0
    pgscan_kswapd_normal 0
    pgscan_kswapd_dma 189700
    pgscan_direct_high 0
    pgscan_direct_normal 0
    pgscan_direct_dma 31746
    pginodesteal 0
    slabs_scanned 1586
    kswapd_steal 30907
    kswapd_inodesteal 0
    pageoutrun 77994
    allocstall 82
    pgrotated 0

ls -l /proc runs slowly.

AFTER

    nr_dirty 0
    nr_writeback 0
    nr_unstable 0
    nr_page_table_pages 49
    nr_mapped 164
    nr_slab 225
    pgpgin 4
    pgpgout 0
    pswpin 0
    pswpout 0
    pgalloc_high 0
    pgalloc_normal 0
    pgalloc_dma 85378
    pgfree 86106
    pgactivate 11759
    pgdeactivate 11650
    pgfault 21293
    pgmajfault 2316
    pgrefill_high 0
    pgrefill_normal 0
    pgrefill_dma 616785
    pgsteal_high 0
    pgsteal_normal 0
    pgsteal_dma 67511
    pgscan_kswapd_high 0
    pgscan_kswapd_normal 0
    pgscan_kswapd_dma 200241
    pgscan_direct_high 0
    pgscan_direct_normal 0
    pgscan_direct_dma 31746
    pginodesteal 0
    slabs_scanned 1586
    kswapd_steal 37584
    kswapd_inodesteal 0
    pageoutrun 78405
    allocstall 82
    pgrotated 0

ls -l still slow.

Anything interesting? 

> Try doing the copy, then when it has finished do the old
> memset(malloc(24M)) and monitor the `vmstat 1' output while it runs,
> capture /proc/meminfo before and after.

Here's the meminfo version of the same test above.

BEFORE

MemTotal:        30256 kB
MemFree:          3312 kB
Buffers:             0 kB
Cached:          24512 kB
SwapCached:          0 kB
Active:            732 kB
Inactive:        24084 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:        30256 kB
LowFree:          3312 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:            700 kB
Slab:             1024 kB
Committed_AS:      476 kB
PageTables:        196 kB
VmallocTotal:   434176 kB
VmallocUsed:     65924 kB
VmallocChunk:   368252 kB


ls -l runs slowly

AFTER

MemTotal:        30256 kB
MemFree:          3420 kB
Buffers:             0 kB
Cached:          24448 kB
SwapCached:          0 kB
Active:            772 kB
Inactive:        23984 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:        30256 kB
LowFree:          3420 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:            692 kB
Slab:              964 kB
Committed_AS:      476 kB
PageTables:        196 kB
VmallocTotal:   434176 kB
VmallocUsed:     65924 kB
VmallocChunk:   368252 kB

ls -l still runs slowly.

The copy w/vmstat involves two processes.  It doesn't exhibit the
problems.

> None of the problems you report are present on x86 as far as I can tell,
> so...

I don't expect you would see them.  Are you confident that this doesn't
happen on a 386 NFS-root-mounted system in single-user mode?  I don't
have an IA32 system where I can test this scenario.


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  0:23         ` Marc Singer
  2004-04-18  3:37           ` Nick Piggin
@ 2004-04-18  9:29           ` Russell King
  1 sibling, 0 replies; 27+ messages in thread
From: Russell King @ 2004-04-18  9:29 UTC (permalink / raw)
  To: Marc Singer; +Cc: Andrew Morton, wli, linux-kernel

On Sat, Apr 17, 2004 at 05:23:43PM -0700, Marc Singer wrote:
> All of these tests are performed at the console, one command at a
> time.  I have a telnet daemon available, so I open a second connection
> to the target system.  I run a continuous loop of file copies on the
> console and I execute 'ls -l /proc' in the telnet window.  It's a
> little slow, but it isn't unreasonable.  Hmm.  I then run the copy
> command in the telnet window followed by the 'ls -l /proc'.  It works
> fine.  I logout of the console session and perform the telnet window
> test again.  The 'ls -l /proc' takes 30 seconds.
> 
> When there is more than one process running, everything is peachy.
> When there is only one process (no context switching) I see the slow
> performance.  I had a hypothesis, but my test of that hypothesis
> failed.

Guys, this tends to indicate that we _must_ have up-to-date aging
information from the PTE - if not, we're liable to miss out on the
pressure from user applications.  The "lazy" method which 2.4 allows
is not possible with 2.6.

This means we must flush the TLB when we mark the PTE old.

Might be worth reading my thread on linux-mm about this and commenting?
(hint hint)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core


* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  5:41                       ` Nick Piggin
@ 2004-04-18 23:44                         ` Marc Singer
  0 siblings, 0 replies; 27+ messages in thread
From: Marc Singer @ 2004-04-18 23:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Marc Singer, William Lee Irwin III, Andrew Morton, linux-kernel

On Sun, Apr 18, 2004 at 03:41:24PM +1000, Nick Piggin wrote:
> >We'll, I'll try applying his patch and then yours.  If it doesn't work
> >I'll let you know.
> >
> 
> OK thanks.

There appear to be a lot of conflicts between my development tree and
the -mm6 patch.  Even your patch doesn't apply cleanly, though I think
it is only because a piece has already been applied.

I'm starting with 2.6.5, applying Russell King's 2.6.5 patch from the
8th, applying -mm6 patch, and then yours.  It looks like a good bit of
Russell's patch has been included in -mm6.  But not enough of -mm6 is
present in my tree for your patch to apply.

I'm working on the scripts and BK docs.  At this point, I may have to
wait for 2.6.6 before we can make another test.



* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-18  6:15                 ` Marc Singer
@ 2004-04-19  0:26                   ` Rik van Riel
  2004-04-19  0:39                     ` Marc Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2004-04-19  0:26 UTC (permalink / raw)
  To: Marc Singer; +Cc: Andrew Morton, wli, linux-kernel

On Sat, 17 Apr 2004, Marc Singer wrote:

> I thought I sent a message about this.  I've found that the problem
> *only* occurs when there is exactly one process running.

BINGO!  ;)

Looks like this could be the referenced bits not being
flushed from the MMU and not found by the VM...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: vmscan.c heuristic adjustment for smaller systems
  2004-04-19  0:26                   ` Rik van Riel
@ 2004-04-19  0:39                     ` Marc Singer
  0 siblings, 0 replies; 27+ messages in thread
From: Marc Singer @ 2004-04-19  0:39 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Marc Singer, Andrew Morton, wli, linux-kernel

On Sun, Apr 18, 2004 at 08:26:13PM -0400, Rik van Riel wrote:
> On Sat, 17 Apr 2004, Marc Singer wrote:
> 
> > I thought I sent a message about this.  I've found that the problem
> > *only* occurs when there is exactly one process running.
> 
> BINGO!  ;)
> 
> Looks like this could be the referenced bits not being
> flushed from the MMU and not found by the VM...

Can you be a little more verbose for me?  The ARM MMU doesn't keep
track of page references, AFAICT.  How does a context switch change
this?  

I have looked into the case where the TLB for an old page isn't being
flushed (by design), but I've been unable to fix the problem by
forcing a TLB flush whenever a PTE is zeroed.



end of thread, other threads:[~2004-04-19  0:39 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-04-17 19:38 vmscan.c heuristic adjustment for smaller systems William Lee Irwin III
2004-04-17 21:29 ` Marc Singer
2004-04-17 21:33   ` William Lee Irwin III
2004-04-17 21:52     ` Marc Singer
2004-04-18  1:06       ` William Lee Irwin III
2004-04-18  5:05         ` Marc Singer
2004-04-17 23:21   ` Andrew Morton
2004-04-17 23:30     ` Marc Singer
2004-04-17 23:51       ` Andrew Morton
2004-04-18  0:11         ` Trond Myklebust
2004-04-18  0:23         ` Marc Singer
2004-04-18  3:37           ` Nick Piggin
2004-04-18  4:17             ` William Lee Irwin III
2004-04-18  4:41               ` Nick Piggin
2004-04-18  5:10                 ` Marc Singer
2004-04-18  5:19                   ` Nick Piggin
2004-04-18  5:35                     ` Marc Singer
2004-04-18  5:41                       ` Nick Piggin
2004-04-18 23:44                         ` Marc Singer
2004-04-18  9:29           ` Russell King
2004-04-18  1:59         ` William Lee Irwin III
2004-04-18  3:53           ` Andrew Morton
2004-04-18  5:38             ` Marc Singer
2004-04-18  5:52               ` Andrew Morton
2004-04-18  6:15                 ` Marc Singer
2004-04-19  0:26                   ` Rik van Riel
2004-04-19  0:39                     ` Marc Singer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox