* [RFC][PATCH 0/2] Swap token re-tuned
@ 2006-09-29 18:41 Ashwin Chaugule
  2006-10-01 22:56 ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Ashwin Chaugule @ 2006-09-29 18:41 UTC (permalink / raw)
To: linux-kernel

Hi,
Here's a brief write-up on the next two mails.

PATCH 1:

In the current implementation of swap token tuning, grab_swap_token() is
called from:
1) after page_cache_read (filemap.c) and
2) after the readahead logic in do_swap_page (memory.c)

IMO, the contention for the swap token should happen _before_ the
aforementioned calls, because in the event of low system memory, calls
to free up space will be made later from page_cache_read and
read_swap_cache_async, so we want to avoid "false LRU" pages by
grabbing the token before the VM starts searching for replacement
candidates.

PATCH 2:

Instead of using TIMEOUT as a parameter to transfer the token, I think a
better solution is to hand it over to a process that proves its
eligibility.

What my scheme does is find out how frequently a process is calling
these functions. The processes that call them more frequently get a
higher priority. The idea is to guarantee that a high-priority process
gets the token.

The priority of a process is determined by the number of consecutive
calls to swap-in and no-page. I mean "consecutive" not from the
scheduler's point of view, but from the process's point of view. In
other words, if the task called these functions every time it was
scheduled, it means it is not getting any further with its execution.

This way, it's a matter of a simple comparison of task priorities to
decide whether to transfer the token or not.
I did some testing with the two patches combined and the results are as
follows:

Current upstream implementation:
================================

root@ashbert:~/crap# time ./qsbench -n 9000000 -p 3 -s 1420300
seed = 1420300
seed = 1420300
seed = 1420300

real    3m40.124s
user    0m12.060s
sys     0m0.940s

-------------reboot-----------------

With my implementation:
========================

root@ashbert:~/crap# time ./qsbench -n 9000000 -p 3 -s 1420300
seed = 1420300
seed = 1420300
seed = 1420300

real    2m58.708s
user    0m11.880s
sys     0m1.070s

My test machine:
1.69GHz CPU, 64M RAM, 7200rpm hdd, 2MB L2 cache,
vanilla kernel 2.6.18, Ubuntu Dapper with GNOME.

Any comments, suggestions, ideas?

Cheers,
Ashwin

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-09-29 18:41 [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
@ 2006-10-01 22:56 ` Andrew Morton
  2006-10-02  7:35   ` Peter Zijlstra
  ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andrew Morton @ 2006-10-01 22:56 UTC (permalink / raw)
To: ashwin.chaugule; +Cc: linux-kernel, Rik van Riel, Peter Zijlstra

On Sat, 30 Sep 2006 00:11:51 +0530
Ashwin Chaugule <ashwin.chaugule@celunite.com> wrote:

> Hi,
> Here's a brief write-up on the next two mails.

When preparing patches, please give each one's email a different and
meaningful Subject:, and try to put the description of the patch within
the email which contains that patch, thanks.

> PATCH 1:
>
> In the current implementation of swap token tuning, grab_swap_token() is
> called from:
> 1) after page_cache_read (filemap.c) and
> 2) after the readahead logic in do_swap_page (memory.c)
>
> IMO, the contention for the swap token should happen _before_ the
> aforementioned calls, because in the event of low system memory, calls
> to free up space will be made later from page_cache_read and
> read_swap_cache_async, so we want to avoid "false LRU" pages by
> grabbing the token before the VM starts searching for replacement
> candidates.

Seems sane.

> PATCH 2:
>
> Instead of using TIMEOUT as a parameter to transfer the token, I think a
> better solution is to hand it over to a process that proves its
> eligibility.
>
> What my scheme does is find out how frequently a process is calling
> these functions. The processes that call them more frequently get a
> higher priority. The idea is to guarantee that a high-priority process
> gets the token.
>
> The priority of a process is determined by the number of consecutive
> calls to swap-in and no-page. I mean "consecutive" not from the
> scheduler's point of view, but from the process's point of view. In
> other words, if the task called these functions every time it was
> scheduled, it means it is not getting any further with its execution.
>
> This way, it's a matter of a simple comparison of task priorities to
> decide whether to transfer the token or not.

Does this introduce the possibility of starvation, where the
fast-allocating process hogs the system and everything else makes no
progress?

> [qsbench results snipped]

qsbench gives quite unstable results in my experience. How stable is the
above result (say, averaged across ten runs)?

It's quite easy to make changes in this area which speed qsbench up with
one set of arguments, and which slow it down with a different set. Did
you try mixing the tests up a bit?

Also, qsbench isn't really a very good test for swap-intensive
workloads - its re-referencing and locality patterns seem fairly
artificial.

Another workload which it would be useful to benchmark is a kernel
compile - say, boot with `mem=16M' and time `make -j4 vmlinux' (numbers
may need tuning).

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-01 22:56 ` Andrew Morton
@ 2006-10-02  7:35   ` Peter Zijlstra
  2006-10-02  7:59     ` Andrew Morton
  2006-10-02 11:00     ` [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
  2006-10-02  8:20   ` Ashwin Chaugule
  2006-10-02 10:00   ` Ashwin Chaugule
  2 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2006-10-02 7:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: ashwin.chaugule, linux-kernel, Rik van Riel

On Sun, 2006-10-01 at 15:56 -0700, Andrew Morton wrote:
> On Sat, 30 Sep 2006 00:11:51 +0530
> Ashwin Chaugule <ashwin.chaugule@celunite.com> wrote:
> > [PATCH 2 description snipped]
>
> Does this introduce the possibility of starvation, where the
> fast-allocating process hogs the system and everything else makes no
> progress?

I tinkered with this a bit yesterday, and didn't get good results for
mem=64M; make -j5:

-vanilla:    2h32:55
-swap-token: 2h41:48

Various other attempts at tweaking the code only made it worse. (I will
have to rerun these tests, but a ~3h test is, well, a 3h test ;-)

Being frustrated with these results - I mean, the idea made sense, so
what is going on? - I came up with this answer:

Tasks owning the swap token will retain their pages and will hence swap
less; other (contending) tasks will get fewer pages and will fault more
frequently. This prio mechanism will favour exactly those tasks not
holding the token, which makes for token bouncing.

The current mechanism seemingly assigns the token randomly (whoever asks
while it is not held gets it - and the hold time is fixed); however,
this change in paging behaviour (holder less, contenders more) shifts
the odds in favour of one of the contenders. Also, the fixed holding
time will make sure the token doesn't get released too soon and the
holder can make some progress.

So while I agree it would be nice to get rid of all magic variables
(holding time in the current implementation), this proposed solution
hasn't convinced me (for one, it introduces another).

(For the interested, the various attempts I tried are available here:
http://programming.kicks-ass.net/kernel-patches/swap_token/ )

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02  7:35 ` Peter Zijlstra
@ 2006-10-02  7:59   ` Andrew Morton
  2006-10-02  8:14     ` Peter Zijlstra
  ` (3 more replies)
  2006-10-02 11:00   ` [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
  1 sibling, 4 replies; 12+ messages in thread
From: Andrew Morton @ 2006-10-02 7:59 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: ashwin.chaugule, linux-kernel, Rik van Riel

On Mon, 02 Oct 2006 09:35:52 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > > [PATCH 2 description snipped]
> >
> > Does this introduce the possibility of starvation, where the
> > fast-allocating process hogs the system and everything else makes no
> > progress?
>
> I tinkered with this a bit yesterday, and didn't get good results for
> mem=64M; make -j5:
>
> -vanilla:    2h32:55
> -swap-token: 2h41:48
>
> Various other attempts at tweaking the code only made it worse. (I will
> have to rerun these tests, but a ~3h test is, well, a 3h test ;-)

I don't think that's a region of operation where we care a great deal.
What was the average CPU utilisation? Only a few percent. It's just
thrashing too much to bother optimising for. Obviously we want it to
terminate in a sane period of time, and we'd _like_ to improve it. But I
think we'd accept a 10% slowdown in this region of operation if it gave
us a 10% speedup in the 25%-utilisation region.

IOW: does the patch help mem=96M; make -j5?

> Being frustrated with these results - I mean, the idea made sense, so
> what is going on? - I came up with this answer:
>
> Tasks owning the swap token will retain their pages and will hence swap
> less; other (contending) tasks will get fewer pages and will fault more
> frequently. This prio mechanism will favour exactly those tasks not
> holding the token, which makes for token bouncing.

OK.

(We need to do something with
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/broken-out/mm-thrash-detect-process-thrashing-against-itself.patch,
btw. It has been in -mm since March and I'm still waiting for some
benchmarks which would justify its inclusion.)

> The current mechanism seemingly assigns the token randomly (whoever asks
> while it is not held gets it - and the hold time is fixed); however,
> this change in paging behaviour (holder less, contenders more) shifts
> the odds in favour of one of the contenders. Also, the fixed holding
> time will make sure the token doesn't get released too soon and the
> holder can make some progress.
>
> So while I agree it would be nice to get rid of all magic variables
> (holding time in the current implementation), this proposed solution
> hasn't convinced me (for one, it introduces another).
>
> (For the interested, the various attempts I tried are available here:
> http://programming.kicks-ass.net/kernel-patches/swap_token/ )

OK, thanks for looking into it. I do think this is rich ground for
optimisation.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02  7:59 ` Andrew Morton
@ 2006-10-02  8:14   ` Peter Zijlstra
  2006-10-03  7:32   ` Peter Zijlstra
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2006-10-02 8:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: ashwin.chaugule, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 00:59 -0700, Andrew Morton wrote:
> On Mon, 02 Oct 2006 09:35:52 +0200
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > [PATCH 2 discussion snipped]
> >
> > I tinkered with this a bit yesterday, and didn't get good results for
> > mem=64M; make -j5:
> >
> > -vanilla:    2h32:55

	Command being timed: "make -j5"
	User time (seconds): 2726.81
	System time (seconds): 2266.85
	Percent of CPU this job got: 54%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32:55
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 0
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 269956
	Minor (reclaiming a frame) page faults: 8699298
	Voluntary context switches: 414020
	Involuntary context switches: 242365
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

> > -swap-token: 2h41:48

	Command being timed: "make -j5"
	User time (seconds): 2720.54
	System time (seconds): 2428.60
	Percent of CPU this job got: 53%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:41:48
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 0
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 281943
	Minor (reclaiming a frame) page faults: 8692417
	Voluntary context switches: 421770
	Involuntary context switches: 241323
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

> > Various other attempts at tweaking the code only made it worse. (I will
> > have to rerun these tests, but a ~3h test is, well, a 3h test ;-)
>
> I don't think that's a region of operation where we care a great deal.
> What was the average CPU utilisation? Only a few percent.

~50%; it's a slow box, this, a p3-550.

> It's just thrashing too much to bother optimising for. Obviously we want
> it to terminate in a sane period of time, and we'd _like_ to improve it.
> But I think we'd accept a 10% slowdown in this region of operation if it
> gave us a 10% speedup in the 25%-utilisation region.
>
> IOW: does the patch help mem=96M; make -j5?

Will kick off some tests later today.

> (We need to do something with
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/broken-out/mm-thrash-detect-process-thrashing-against-itself.patch,
> btw. It has been in -mm since March and I'm still waiting for some
> benchmarks which would justify its inclusion.)

Hmm, benchmarks; I need VM benchmarks for my page replacement work too
;-) Perhaps I can create a multi-threaded program that knows a few
patterns.

> OK, thanks for looking into it. I do think this is rich ground for
> optimisation.

Given the amazing reduction in speed I accomplished yesterday (worst was
3h09:02), I'd say we're not doing badly, but yeah, I too think there is
room for improvement.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02  7:59 ` Andrew Morton
  2006-10-02  8:14   ` Peter Zijlstra
@ 2006-10-03  7:32   ` Peter Zijlstra
  2006-10-08 20:23   ` [RFC][PATCH 1/2] grab swap token reordered Ashwin Chaugule
  2006-10-08 20:28   ` [RFC][PATCH 2/2] new scheme to preempt swap token Ashwin Chaugule
  3 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2006-10-03 7:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: ashwin.chaugule, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 00:59 -0700, Andrew Morton wrote:
> IOW: does the patch help mem=96M; make -j5?

It's hardly swapping; I'll go back to mem=64M; make -j5 - that got some
decent swapping and still ~50% CPU.

-vanilla:
	Command being timed: "make -j5"
	User time (seconds): 2557.12
	System time (seconds): 1239.14
	Percent of CPU this job got: 87%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12:36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 0
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 50920
	Minor (reclaiming a frame) page faults: 8988166
	Voluntary context switches: 129759
	Involuntary context switches: 146431
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

-swap-token:
	Command being timed: "make -j5"
	User time (seconds): 2557.20
	System time (seconds): 1122.35
	Percent of CPU this job got: 86%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:10:54
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 0
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 56116
	Minor (reclaiming a frame) page faults: 8985073
	Voluntary context switches: 135533
	Involuntary context switches: 145494
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [RFC][PATCH 1/2] grab swap token reordered
  2006-10-02  7:59 ` Andrew Morton
  2006-10-02  8:14   ` Peter Zijlstra
  2006-10-03  7:32   ` Peter Zijlstra
@ 2006-10-08 20:23   ` Ashwin Chaugule
  2006-10-08 20:28   ` [RFC][PATCH 2/2] new scheme to preempt swap token Ashwin Chaugule
  3 siblings, 0 replies; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-08 20:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Peter Zijlstra, linux-kernel, Rik van Riel

This patch makes sure the contention for the token happens _before_ any
read-in, and kicks in the swap-token algorithm only when the VM is under
pressure.

Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
--

diff --git a/mm/filemap.c b/mm/filemap.c
index afcdc72..c17b2ab 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1479,7 +1479,6 @@ no_cached_page:
 	 * effect.
 	 */
 	error = page_cache_read(file, pgoff);
-	grab_swap_token();
 
 	/*
 	 * The page we want has now been added to the page cache.
diff --git a/mm/memory.c b/mm/memory.c
index 92a3ebd..4a877e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1974,6 +1974,7 @@ static int do_swap_page(struct mm_struct
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		grab_swap_token(); /* Contend for token _before_ read-in */
 		swapin_readahead(entry, address, vma);
 		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
@@ -1991,7 +1992,6 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
-		grab_swap_token();
 	}
 
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
--

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [RFC][PATCH 2/2] new scheme to preempt swap token
  2006-10-02  7:59 ` Andrew Morton
  ` (2 preceding siblings ...)
  2006-10-08 20:23 ` [RFC][PATCH 1/2] grab swap token reordered Ashwin Chaugule
@ 2006-10-08 20:28 ` Ashwin Chaugule
  3 siblings, 0 replies; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-08 20:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: Peter Zijlstra, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 00:59 -0700, Andrew Morton wrote:
> On Mon, 02 Oct 2006 09:35:52 +0200
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> It's just thrashing too much to bother optimising for. Obviously we want
> it to terminate in a sane period of time, and we'd _like_ to improve it.
> But I think we'd accept a 10% slowdown in this region of operation if it
> gave us a 10% speedup in the 25%-utilisation region.
>
> IOW: does the patch help mem=96M; make -j5?
>
> > Tasks owning the swap token will retain their pages and will hence swap
> > less; other (contending) tasks will get fewer pages and will fault more
> > frequently. This prio mechanism will favour exactly those tasks not
> > holding the token, which makes for token bouncing.

This algorithm should take care of it. Each task has a priority which is
incremented if it contended for the token in an interval less than its
previous attempt. If the token is acquired, that task's priority is
boosted to prevent the token from bouncing around too often and to let
the task make some progress in its execution.

Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
--

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..c4bb78b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -342,9 +342,16 @@ struct mm_struct {
 	/* Architecture-specific MM context */
 	mm_context_t context;
 
-	/* Token based thrashing protection. */
-	unsigned long swap_token_time;
-	char recent_pagein;
+	/* Swap token stuff */
+	/*
+	 * Last value of global fault stamp as seen by this process.
+	 * In other words, this value gives an indication of how long
+	 * it has been since this task got the token.
+	 * Look at mm/thrash.c
+	 */
+	unsigned int faultstamp;
+	unsigned int token_priority;
+	unsigned int last_interval;
 
 	/* coredumping support */
 	int core_waiters;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e7c36ba..89f8a39 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ extern spinlock_t swap_lock;
 
 /* linux/mm/thrash.c */
 extern struct mm_struct * swap_token_mm;
-extern unsigned long swap_token_default_timeout;
 extern void grab_swap_token(void);
 extern void __put_swap_token(struct mm_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index f9b014e..c4b19b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -470,6 +470,10 @@ static struct mm_struct *dup_mm(struct t
 
 	memcpy(mm, oldmm, sizeof(*mm));
 
+	/* Initializing for Swap token stuff */
+	mm->token_priority = 0;
+	mm->last_interval = 0;
+
 	if (!mm_init(mm))
 		goto fail_nomem;
@@ -532,7 +536,11 @@ static int copy_mm(unsigned long clone_f
 	if (!mm)
 		goto fail_nomem;
 
-good_mm:
+good_mm:
+	/* Initializing for Swap token stuff */
+	mm->token_priority = 0;
+	mm->last_interval = 0;
+
 	tsk->mm = mm;
 	tsk->active_mm = mm;
 	return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index fd43c3e..ef52798 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -910,17 +910,6 @@ static ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
-#ifdef CONFIG_SWAP
-	{
-		.ctl_name	= VM_SWAP_TOKEN_TIMEOUT,
-		.procname	= "swap_token_timeout",
-		.data		= &swap_token_default_timeout,
-		.maxlen		= sizeof(swap_token_default_timeout),
-		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_jiffies,
-		.strategy	= &sysctl_jiffies,
-	},
-#endif
 #ifdef CONFIG_NUMA
 	{
 		.ctl_name	= VM_ZONE_RECLAIM_MODE,
diff --git a/mm/thrash.c b/mm/thrash.c
index f4c560b..c0d9cee 100644
--- a/mm/thrash.c
+++ b/mm/thrash.c
@@ -7,90 +7,66 @@
  *
  * Simple token based thrashing protection, using the algorithm
  * described in:
  * http://www.cs.wm.edu/~sjiang/token.pdf
+ *
+ * Sep 2006, Ashwin Chaugule <ashwin.chaugule@celunite.com>
+ * Improved algorithm to pass token:
+ * Each task has a priority which is incremented if it contended
+ * for the token in an interval less than its previous attempt.
+ * If the token is acquired, that task's priority is boosted to prevent
+ * the token from bouncing around too often and to let the task make
+ * some progress in its execution.
  */
+
 #include <linux/jiffies.h>
 #include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/swap.h>
 
 static DEFINE_SPINLOCK(swap_token_lock);
-static unsigned long swap_token_timeout;
-static unsigned long swap_token_check;
-struct mm_struct * swap_token_mm = &init_mm;
-
-#define SWAP_TOKEN_CHECK_INTERVAL (HZ * 2)
-#define SWAP_TOKEN_TIMEOUT	(300 * HZ)
-/*
- * Currently disabled; Needs further code to work at HZ * 300.
- */
-unsigned long swap_token_default_timeout = SWAP_TOKEN_TIMEOUT;
-
-/*
- * Take the token away if the process had no page faults
- * in the last interval, or if it has held the token for
- * too long.
- */
-#define SWAP_TOKEN_ENOUGH_RSS 1
-#define SWAP_TOKEN_TIMED_OUT 2
-static int should_release_swap_token(struct mm_struct *mm)
-{
-	int ret = 0;
-	if (!mm->recent_pagein)
-		ret = SWAP_TOKEN_ENOUGH_RSS;
-	else if (time_after(jiffies, swap_token_timeout))
-		ret = SWAP_TOKEN_TIMED_OUT;
-	mm->recent_pagein = 0;
-	return ret;
-}
+struct mm_struct * swap_token_mm = NULL;
+unsigned int global_faults = 0;
 
-/*
- * Try to grab the swapout protection token. We only try to
- * grab it once every TOKEN_CHECK_INTERVAL, both to prevent
- * SMP lock contention and to check that the process that held
- * the token before is no longer thrashing.
- */
 void grab_swap_token(void)
 {
-	struct mm_struct *mm;
-	int reason;
-
-	/* We have the token. Let others know we still need it. */
-	if (has_swap_token(current->mm)) {
-		current->mm->recent_pagein = 1;
-		if (unlikely(!swap_token_default_timeout))
-			disable_swap_token();
+	int current_interval = 0;
+
+	global_faults++;
+
+	current_interval = global_faults - current->mm->faultstamp;
+
+	if (!spin_trylock(&swap_token_lock))
 		return;
-	}
 
-	if (time_after(jiffies, swap_token_check)) {
+	/* First come first served */
+	if (swap_token_mm == NULL) {
+		current->mm->token_priority = current->mm->token_priority + 2;
+		swap_token_mm = current->mm;
+		goto out;
+	}
 
-		if (!swap_token_default_timeout) {
-			swap_token_check = jiffies + SWAP_TOKEN_CHECK_INTERVAL;
-			return;
+	if (current->mm != swap_token_mm) {
+		if (current_interval < current->mm->last_interval)
+			current->mm->token_priority++;
+		else {
+			current->mm->token_priority--;
+			if (unlikely(current->mm->token_priority < 0))
+				current->mm->token_priority = 0;
 		}
-
-		/* ... or if we recently held the token. */
-		if (time_before(jiffies, current->mm->swap_token_time))
-			return;
-
-		if (!spin_trylock(&swap_token_lock))
-			return;
-
-		swap_token_check = jiffies + SWAP_TOKEN_CHECK_INTERVAL;
-
-		mm = swap_token_mm;
-		if ((reason = should_release_swap_token(mm))) {
-			unsigned long eligible = jiffies;
-			if (reason == SWAP_TOKEN_TIMED_OUT) {
-				eligible += swap_token_default_timeout;
-			}
-			mm->swap_token_time = eligible;
-			swap_token_timeout = jiffies + swap_token_default_timeout;
+		/* Check if we deserve the token */
+		if (current->mm->token_priority > swap_token_mm->token_priority) {
+			current->mm->token_priority = current->mm->token_priority + 2;
 			swap_token_mm = current->mm;
 		}
-		spin_unlock(&swap_token_lock);
 	}
-	return;
+	else
+		/* Token holder came in again! */
+		current->mm->token_priority = current->mm->token_priority + 2;
+
+out:
+	current->mm->faultstamp = global_faults;
+	current->mm->last_interval = current_interval;
+	spin_unlock(&swap_token_lock);
+	return;
 }
 
 /* Called on process exit. */
@@ -98,9 +74,7 @@ void __put_swap_token(struct mm_struct *
 {
 	spin_lock(&swap_token_lock);
 	if (likely(mm == swap_token_mm)) {
-		mm->swap_token_time = jiffies + SWAP_TOKEN_CHECK_INTERVAL;
-		swap_token_mm = &init_mm;
-		swap_token_check = jiffies;
+		swap_token_mm = NULL;
 	}
 	spin_unlock(&swap_token_lock);
 }
--

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02  7:35 ` Peter Zijlstra
  2006-10-02  7:59   ` Andrew Morton
@ 2006-10-02 11:00   ` Ashwin Chaugule
  2006-10-02 11:08     ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-02 11:00 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 09:35 +0200, Peter Zijlstra wrote:
> Being frustrated with these results - I mean, the idea made sense, so
> what is going on? - I came up with this answer:
>
> Tasks owning the swap token will retain their pages and will hence swap
> less; other (contending) tasks will get fewer pages and will fault more
> frequently. This prio mechanism will favour exactly those tasks not
> holding the token, which makes for token bouncing.

Right. But with the token bouncing around, effectively the RSS of the
processes at that time will keep increasing, and they should be able to
spend more time on execution than I/O. Meanwhile, the priorities of the
tasks that were contending for the token but didn't get it will
increment. So, since fairness is preserved, all the tasks should get
their fair share of execution, and it should result in a speedup
compared to the current upstream implementation.

I took a time instrumentation of the vanilla 2.6.18 kernel build with
-j4 and I've posted the results in the previous mail. I'm testing on an
IBM T42, 1.69GHz, 64M system.

> So while I agree it would be nice to get rid of all magic variables
> (holding time in the current implementation), this proposed solution
> hasn't convinced me (for one, it introduces another).
>
> (For the interested, the various attempts I tried are available here:
> http://programming.kicks-ass.net/kernel-patches/swap_token/ )

Cool!

Had you applied these patches when you posted your test results?

Thanks,
Ashwin

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02 11:00   ` [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
@ 2006-10-02 11:08     ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2006-10-02 11:08 UTC (permalink / raw)
  To: ashwin.chaugule; +Cc: Andrew Morton, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 16:30 +0530, Ashwin Chaugule wrote:
> On Mon, 2006-10-02 at 09:35 +0200, Peter Zijlstra wrote:
> > So while I agree it would be nice to get rid of all magic variables
> > (holding time in the current impl), this proposed solution hasn't
> > convinced me (for one, it introduces another).
> >
> > (For the interested, the various attempts I tried are available here:
> > http://programming.kicks-ass.net/kernel-patches/swap_token/)
>
> Cool!
>
> Had you applied these patches when you posted your test results?

Only my test box ever ran them. They are replacements for your 2nd patch; the timings I got from them were worse than with yours, though - needs more attention.

A variation on 3 I have in mind is to reset the prio of the losing mm to 0 - this should avoid it regaining the token quickly.

^ permalink raw reply	[flat|nested] 12+ messages in thread
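Peter's proposed variation - reset the losing mm's priority to 0 on handoff - can be sketched like this. (The field and function names are hypothetical; his actual attempt-3 patch is only available at the URL above, so this just models the rule stated in the mail.)

```c
#include <stddef.h>

/* Sketch of the "reset the loser's prio" handoff variation. */
struct mm_prio {
	unsigned int token_priority;
};

static struct mm_prio *token_mm = NULL;

/* Returns 1 if 'challenger' took the token, 0 otherwise. */
int try_preempt_token(struct mm_prio *challenger)
{
	if (token_mm && challenger->token_priority <= token_mm->token_priority)
		return 0;			/* holder keeps the token */
	if (token_mm)
		token_mm->token_priority = 0;	/* loser starts from scratch */
	token_mm = challenger;
	return 1;
}
```

The reset is what damps the bouncing: a dethroned mm must re-earn priority from zero before it can contend again, rather than immediately reclaiming the token on its next fault.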
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-01 22:56 ` Andrew Morton
  2006-10-02  7:35   ` Peter Zijlstra
@ 2006-10-02  8:20   ` Ashwin Chaugule
  2006-10-02 10:00   ` Ashwin Chaugule
  2 siblings, 0 replies; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-02 8:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Rik van Riel, Peter Zijlstra

On Sun, 2006-10-01 at 15:56 -0700, Andrew Morton wrote:
> On Sat, 30 Sep 2006 00:11:51 +0530
> Ashwin Chaugule <ashwin.chaugule@celunite.com> wrote:
>
> > Hi,
> > Here's a brief write-up on the next two mails.
>
> When preparing patches, please give each one's email a different and
> meaningful Subject:, and try to put the description of the patch within the
> email which contains that patch, thanks.

Yep, will remember that.

> > PATCH 2:
> >
> > Instead of using TIMEOUT as a parameter to transfer the token, I think a
> > better solution is to hand it over to a process that proves its
> > eligibility.
> >
> > What my scheme does is find out how frequently a process is calling
> > these functions. The processes that call them more frequently get a
> > higher priority. The idea is to guarantee that a high-priority process
> > gets the token. The priority of a process is determined by the number
> > of consecutive calls to swap-in and no-page. I mean "consecutive" not
> > from the scheduler's point of view, but from the process's point of
> > view. In other words, if the task called these functions every time it
> > was scheduled, it means it is not getting any further with its
> > execution.
> >
> > This way, it's a matter of a simple comparison of task priorities to
> > decide whether to transfer the token or not.
>
> Does this introduce the possibility of starvation? Where the
> fast-allocating process hogs the system and everything else makes no
> progress?

A fast-allocating process will start to increase its RSS, and the assumption is that such a process will finish its execution faster and relinquish the token.
Meanwhile, while such a process is allocating, the other processes' pages will be marked as "true LRU" pages, and in the event that they get swapped out, their owners' priorities will also be increased when they get scheduled. So effectively, the chances of starvation are quite minimal.

The key is to grant the token to the most deserving process. In other words, when a task tries to hog the system with allocations and swap-ins, some other process is getting hampered, and when the affected process gets scheduled, the algorithm will make sure it gets immunity from generating false LRU pages.

Also, when the fast-allocating process stops its continuous allocation, or continues allocating only sporadically, i.e. ((global_faults - current->mm->faultstamp) > FAULTSTAMP_DIFF), its priority keeps getting decremented too.

> qsbench gives quite unstable results in my experience. How stable is the
> above result (say, average across ten runs?)

True. I did run the qsbench test several times, and the results were always better by at least 10 seconds with my changes.

> It's quite easy to make changes in this area which speed qsbench up with
> one set of arguments, and which slow it down with a different set. Did you
> try mixing the tests up a bit?

I ran another VM stress app, which spawns several threads, each dedicated to malloc only, I/O only, etc.

Results:

Upstream:

time stress --cpu 2 --io 14 --vm 5 --vm-bytes 50M --timeout 10s --hdd 2
stress: info: [4331] dispatching hogs: 2 cpu, 14 io, 5 vm, 2 hdd
stress: info: [4331] successful run completed in 19s

real	0m19.358s
user	0m9.850s
sys	0m0.210s

My changes:

time stress --cpu 2 --io 14 --vm 5 --vm-bytes 50M --timeout 10s --hdd 2
stress: info: [4498] dispatching hogs: 2 cpu, 14 io, 5 vm, 2 hdd
stress: info: [4498] successful run completed in 16s

real	0m16.813s
user	0m9.850s
sys	0m0.100s

I haven't tested this enough to average it out, but it did show improvements every time I ran it.
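The faultstamp accounting described earlier in this message - priority rises on consecutive faults, and decays once (global_faults - faultstamp) exceeds FAULTSTAMP_DIFF - can be modeled in userspace like this. (The field names follow the mail; the FAULTSTAMP_DIFF value and exact kernel fields are assumptions, not the posted patch itself.)

```c
/* Model of consecutive-fault priority bookkeeping. */
#define FAULTSTAMP_DIFF 2	/* hypothetical threshold */

static unsigned long global_faults;

struct mm_model {
	unsigned long faultstamp;	/* global_faults at this mm's last fault */
	unsigned int token_priority;
};

/* Would be called from the swap-in and no-page paths: a task that
 * faults on nearly every global fault is making no progress, so its
 * claim on the token strengthens; a sporadic faulter's claim decays. */
void account_fault(struct mm_model *mm)
{
	global_faults++;
	if (global_faults - mm->faultstamp <= FAULTSTAMP_DIFF)
		mm->token_priority++;		/* faulting back-to-back */
	else if (mm->token_priority > 0)
		mm->token_priority--;		/* allocation went sporadic */
	mm->faultstamp = global_faults;
}
```

Comparing two such token_priority values is then all the token-transfer decision needs, which is the "simple comparison" the scheme relies on in place of the old TIMEOUT.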
> Also, qsbench isn't really a very good test for swap-intensive workloads -
> its re-referencing and locality patterns seem fairly artificial.

True. In theory, my algorithm should give better results. The earlier TIMEOUT was unfair to processes: in the pre-thrashing stages, it was detrimental to processes badly in need of the token, so their execution didn't get any further. That is what is addressed here.

I was hoping that people would have some other instrumentation tools for the VM. I tried vmregress, but it didn't build against 2.6.18; it needs some mm API fixes.

> Another workload which it would be useful to benchmark is a kernel compile
> - say, boot with `mem=16M' and time `make -j4 vmlinux' (numbers may need
> tuning).

Will test this and post it up.

Thanks!
Ashwin

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-01 22:56 ` Andrew Morton
  2006-10-02  7:35   ` Peter Zijlstra
  2006-10-02  8:20   ` Ashwin Chaugule
@ 2006-10-02 10:00   ` Ashwin Chaugule
  2 siblings, 0 replies; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-02 10:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Rik van Riel, Peter Zijlstra

On Sun, 2006-10-01 at 15:56 -0700, Andrew Morton wrote:
> Another workload which it would be useful to benchmark is a kernel compile
> - say, boot with `mem=16M' and time `make -j4 vmlinux' (numbers may need
> tuning).

This is what I got with mem=64M:

Upstream 2.6.18, make -j 4 vmlinux:

real	31m26.021s
user	4m32.140s
sys	0m23.340s

My patch:

real	27m42.984s
user	4m33.800s
sys	0m22.080s

Ashwin

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2006-10-08 20:28 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-09-29 18:41 [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
2006-10-01 22:56 ` Andrew Morton
2006-10-02  7:35   ` Peter Zijlstra
2006-10-02  7:59     ` Andrew Morton
2006-10-02  8:14       ` Peter Zijlstra
2006-10-03  7:32         ` Peter Zijlstra
2006-10-08 20:23           ` [RFC][PATCH 1/2] grab swap token reordered Ashwin Chaugule
2006-10-08 20:28             ` [RFC][PATCH 2/2] new scheme to preempt swap token Ashwin Chaugule
2006-10-02 11:00   ` [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
2006-10-02 11:08     ` Peter Zijlstra
2006-10-02  8:20   ` Ashwin Chaugule
2006-10-02 10:00   ` Ashwin Chaugule
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox