* [RFC][PATCH 0/2] Swap token re-tuned
@ 2006-09-29 18:41 Ashwin Chaugule
  2006-10-01 22:56 ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Ashwin Chaugule @ 2006-09-29 18:41 UTC (permalink / raw)
To: linux-kernel

Hi,
Here's a brief write-up on the next two mails.

PATCH 1:

In the current implementation of swap token tuning, grab_swap_token() is
called from:
1) after page_cache_read (filemap.c) and
2) after the readahead logic in do_swap_page (memory.c)

IMO, the contention for the swap token should happen _before_ the
aforementioned calls, because in the event of low system memory, calls
to free up space will be made later from page_cache_read and
read_swap_cache_async, so we want to avoid "false LRU" pages by
grabbing the token before the VM starts searching for replacement
candidates.

PATCH 2:

Instead of using TIMEOUT as a parameter to transfer the token, I think a
better solution is to hand it over to a process that proves its
eligibility.

What my scheme does is find out how frequently a process is calling
these functions. The processes that call them more frequently get a
higher priority. The idea is to guarantee that a high-priority process
gets the token.

The priority of a process is determined by the number of consecutive
calls to swap-in and no-page. I mean "consecutive" not from the
scheduler's point of view, but from the process's point of view. In
other words, if the task called these functions every time it was
scheduled, it means it is not getting any further with its execution.

This way, it's a matter of a simple comparison of task priorities to
decide whether to transfer the token or not.
I did some testing with the two patches combined and the results are as
follows:

Current upstream implementation:
================================

root@ashbert:~/crap# time ./qsbench -n 9000000 -p 3 -s 1420300
seed = 1420300
seed = 1420300
seed = 1420300

real    3m40.124s
user    0m12.060s
sys     0m0.940s

-------------reboot-----------------

With my implementation:
========================

root@ashbert:~/crap# time ./qsbench -n 9000000 -p 3 -s 1420300
seed = 1420300
seed = 1420300
seed = 1420300

real    2m58.708s
user    0m11.880s
sys     0m1.070s

My test machine:
1.69GHz CPU, 64M RAM, 7200rpm hdd, 2MB L2 cache,
vanilla kernel 2.6.18, Ubuntu Dapper with GNOME.

Any comments, suggestions, ideas?

Cheers,
Ashwin

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-09-29 18:41 [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
@ 2006-10-01 22:56 ` Andrew Morton
  2006-10-02  7:35   ` Peter Zijlstra
  ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andrew Morton @ 2006-10-01 22:56 UTC (permalink / raw)
To: ashwin.chaugule; +Cc: linux-kernel, Rik van Riel, Peter Zijlstra

On Sat, 30 Sep 2006 00:11:51 +0530
Ashwin Chaugule <ashwin.chaugule@celunite.com> wrote:

> Hi,
> Here's a brief write-up on the next two mails.

When preparing patches, please give each one's email a different and
meaningful Subject:, and try to put the description of the patch within
the email which contains that patch, thanks.

> PATCH 1:
>
> In the current implementation of swap token tuning, grab_swap_token() is
> called from:
> 1) after page_cache_read (filemap.c) and
> 2) after the readahead logic in do_swap_page (memory.c)
>
> IMO, the contention for the swap token should happen _before_ the
> aforementioned calls, because in the event of low system memory, calls
> to free up space will be made later from page_cache_read and
> read_swap_cache_async, so we want to avoid "false LRU" pages by
> grabbing the token before the VM starts searching for replacement
> candidates.

Seems sane.

> PATCH 2:
>
> Instead of using TIMEOUT as a parameter to transfer the token, I think a
> better solution is to hand it over to a process that proves its
> eligibility.
>
> What my scheme does is find out how frequently a process is calling
> these functions. The processes that call them more frequently get a
> higher priority. The idea is to guarantee that a high-priority process
> gets the token.
>
> The priority of a process is determined by the number of consecutive
> calls to swap-in and no-page. I mean "consecutive" not from the
> scheduler's point of view, but from the process's point of view. In
> other words, if the task called these functions every time it was
> scheduled, it means it is not getting any further with its execution.
>
> This way, it's a matter of a simple comparison of task priorities to
> decide whether to transfer the token or not.

Does this introduce the possibility of starvation, where the
fast-allocating process hogs the system and everything else makes no
progress?

> [qsbench results snipped]

qsbench gives quite unstable results in my experience. How stable is the
above result (say, averaged across ten runs)?

It's quite easy to make changes in this area which speed qsbench up with
one set of arguments, and which slow it down with a different set. Did
you try mixing the tests up a bit?

Also, qsbench isn't really a very good test for swap-intensive
workloads - its re-referencing and locality patterns seem fairly
artificial.

Another workload which it would be useful to benchmark is a kernel
compile - say, boot with `mem=16M' and time `make -j4 vmlinux' (numbers
may need tuning).

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-01 22:56 ` Andrew Morton
@ 2006-10-02  7:35   ` Peter Zijlstra
  2006-10-02  7:59     ` Andrew Morton
  2006-10-02 11:00     ` [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
  2006-10-02  8:20   ` Ashwin Chaugule
  2006-10-02 10:00   ` Ashwin Chaugule
  2 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2006-10-02 7:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: ashwin.chaugule, linux-kernel, Rik van Riel

On Sun, 2006-10-01 at 15:56 -0700, Andrew Morton wrote:
> On Sat, 30 Sep 2006 00:11:51 +0530
> Ashwin Chaugule <ashwin.chaugule@celunite.com> wrote:
> > [PATCH 2 description snipped]
>
> Does this introduce the possibility of starvation, where the
> fast-allocating process hogs the system and everything else makes no
> progress?

I tinkered with this a bit yesterday, and didn't get good results for
mem=64M; make -j5:

-vanilla:    2h32:55
-swap-token: 2h41:48

Various other attempts at tweaking the code only made it worse. (I will
have to rerun these tests, but a ~3h test is, well, a 3h test ;-)

Being frustrated with these results - I mean, the idea made sense, so
what is going on? - I came up with this answer:

Tasks owning the swap token will retain their pages and will hence swap
less; other (contending) tasks will get fewer pages and will fault more
frequently. This prio mechanism will favour exactly those tasks not
holding the token, which makes for token bouncing.

The current mechanism seemingly assigns the token randomly (whoever asks
while it is not held gets it - and the hold time is fixed); however,
this change in paging behaviour (holder less, contenders more) shifts
the odds in favour of one of the contenders. Also, the fixed holding
time will make sure the token doesn't get released too soon and the
holder can make some progress.

So while I agree it would be nice to get rid of all magic variables
(holding time in the current implementation), this proposed solution
hasn't convinced me (for one, it introduces another).

(For the interested, the various attempts I tried are available here:
http://programming.kicks-ass.net/kernel-patches/swap_token/ )

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02  7:35 ` Peter Zijlstra
@ 2006-10-02  7:59   ` Andrew Morton
  2006-10-02  8:14     ` Peter Zijlstra
  ` (3 more replies)
  2006-10-02 11:00   ` [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
  1 sibling, 4 replies; 12+ messages in thread
From: Andrew Morton @ 2006-10-02 7:59 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: ashwin.chaugule, linux-kernel, Rik van Riel

On Mon, 02 Oct 2006 09:35:52 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > > [PATCH 2 description snipped]
> >
> > Does this introduce the possibility of starvation, where the
> > fast-allocating process hogs the system and everything else makes no
> > progress?
>
> I tinkered with this a bit yesterday, and didn't get good results for
> mem=64M; make -j5:
>
> -vanilla:    2h32:55
> -swap-token: 2h41:48
>
> Various other attempts at tweaking the code only made it worse. (I will
> have to rerun these tests, but a ~3h test is, well, a 3h test ;-)

I don't think that's a region of operation where we care a great deal.
What was the average CPU utilisation? Only a few percent. It's just
thrashing too much to bother optimising for. Obviously we want it to
terminate in a sane period of time, and we'd _like_ to improve it. But I
think we'd accept a 10% slowdown in this region of operation if it gave
us a 10% speedup in the 25%-utilisation region.

IOW: does the patch help mem=96M; make -j5?

> Being frustrated with these results - I mean, the idea made sense, so
> what is going on? - I came up with this answer:
>
> Tasks owning the swap token will retain their pages and will hence swap
> less; other (contending) tasks will get fewer pages and will fault more
> frequently. This prio mechanism will favour exactly those tasks not
> holding the token, which makes for token bouncing.

OK.

(We need to do something with
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/broken-out/mm-thrash-detect-process-thrashing-against-itself.patch,
btw. It has been in -mm since March and I'm still waiting for some
benchmarks which would justify its inclusion.)

> The current mechanism seemingly assigns the token randomly (whoever asks
> while it is not held gets it - and the hold time is fixed); however,
> this change in paging behaviour (holder less, contenders more) shifts
> the odds in favour of one of the contenders. Also, the fixed holding
> time will make sure the token doesn't get released too soon and the
> holder can make some progress.
>
> So while I agree it would be nice to get rid of all magic variables
> (holding time in the current implementation), this proposed solution
> hasn't convinced me (for one, it introduces another).
>
> (For the interested, the various attempts I tried are available here:
> http://programming.kicks-ass.net/kernel-patches/swap_token/ )

OK, thanks for looking into it. I do think this is rich ground for
optimisation.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02  7:59 ` Andrew Morton
@ 2006-10-02  8:14   ` Peter Zijlstra
  2006-10-03  7:32   ` Peter Zijlstra
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2006-10-02 8:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: ashwin.chaugule, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 00:59 -0700, Andrew Morton wrote:
> On Mon, 02 Oct 2006 09:35:52 +0200
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > [PATCH 2 discussion snipped]
> >
> > I tinkered with this a bit yesterday, and didn't get good results for
> > mem=64M; make -j5:
> >
> > -vanilla:    2h32:55

	Command being timed: "make -j5"
	User time (seconds): 2726.81
	System time (seconds): 2266.85
	Percent of CPU this job got: 54%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32:55
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 0
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 269956
	Minor (reclaiming a frame) page faults: 8699298
	Voluntary context switches: 414020
	Involuntary context switches: 242365
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

> > -swap-token: 2h41:48

	Command being timed: "make -j5"
	User time (seconds): 2720.54
	System time (seconds): 2428.60
	Percent of CPU this job got: 53%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:41:48
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 0
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 281943
	Minor (reclaiming a frame) page faults: 8692417
	Voluntary context switches: 421770
	Involuntary context switches: 241323
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

> > Various other attempts at tweaking the code only made it worse. (I will
> > have to rerun these tests, but a ~3h test is, well, a 3h test ;-)
>
> I don't think that's a region of operation where we care a great deal.
> What was the average CPU utilisation? Only a few percent.

~50%; it's a slow box, this, a p3-550.

> It's just thrashing too much to bother optimising for. Obviously we want
> it to terminate in a sane period of time, and we'd _like_ to improve it.
> But I think we'd accept a 10% slowdown in this region of operation if it
> gave us a 10% speedup in the 25%-utilisation region.
>
> IOW: does the patch help mem=96M; make -j5?

Will kick off some tests later today.

> (We need to do something with
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/broken-out/mm-thrash-detect-process-thrashing-against-itself.patch,
> btw. It has been in -mm since March and I'm still waiting for some
> benchmarks which would justify its inclusion.)

Hmm, benchmarks; I need VM benchmarks for my page replacement work too
;-) Perhaps I can create a multi-threaded program that knows a few
patterns.

> OK, thanks for looking into it. I do think this is rich ground for
> optimisation.

Given the amazing reduction in speed I accomplished yesterday (worst was
3h09:02), I'd say we're not doing badly, but yeah, I too think there is
room for improvement.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02  7:59 ` Andrew Morton
  2006-10-02  8:14   ` Peter Zijlstra
@ 2006-10-03  7:32   ` Peter Zijlstra
  2006-10-08 20:23   ` [RFC][PATCH 1/2] grab swap token reordered Ashwin Chaugule
  2006-10-08 20:28   ` [RFC][PATCH 2/2] new scheme to preempt swap token Ashwin Chaugule
  3 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2006-10-03 7:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: ashwin.chaugule, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 00:59 -0700, Andrew Morton wrote:
> IOW: does the patch help mem=96M; make -j5?

It's hardly swapping; I'll go back to mem=64M; make -j5 - that got some
decent swapping and still ~50% CPU.

-vanilla:
	Command being timed: "make -j5"
	User time (seconds): 2557.12
	System time (seconds): 1239.14
	Percent of CPU this job got: 87%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12:36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 0
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 50920
	Minor (reclaiming a frame) page faults: 8988166
	Voluntary context switches: 129759
	Involuntary context switches: 146431
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

-swap-token:
	Command being timed: "make -j5"
	User time (seconds): 2557.20
	System time (seconds): 1122.35
	Percent of CPU this job got: 86%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:10:54
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 0
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 56116
	Minor (reclaiming a frame) page faults: 8985073
	Voluntary context switches: 135533
	Involuntary context switches: 145494
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [RFC][PATCH 1/2] grab swap token reordered
  2006-10-02  7:59 ` Andrew Morton
  2006-10-02  8:14   ` Peter Zijlstra
  2006-10-03  7:32   ` Peter Zijlstra
@ 2006-10-08 20:23   ` Ashwin Chaugule
  2006-10-08 20:28   ` [RFC][PATCH 2/2] new scheme to preempt swap token Ashwin Chaugule
  3 siblings, 0 replies; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-08 20:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Peter Zijlstra, linux-kernel, Rik van Riel

This patch makes sure the contention for the token happens _before_ any
read-in, and kicks in the swap-token algorithm only when the VM is under
pressure.

Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
--

diff --git a/mm/filemap.c b/mm/filemap.c
index afcdc72..c17b2ab 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1479,7 +1479,6 @@ no_cached_page:
 	 * effect.
 	 */
 	error = page_cache_read(file, pgoff);
-	grab_swap_token();
 
 	/*
 	 * The page we want has now been added to the page cache.
diff --git a/mm/memory.c b/mm/memory.c
index 92a3ebd..4a877e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1974,6 +1974,7 @@ static int do_swap_page(struct mm_struct
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		grab_swap_token(); /* Contend for token _before_ read-in */
 		swapin_readahead(entry, address, vma);
 		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
@@ -1991,7 +1992,6 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
-		grab_swap_token();
 	}
 
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
--

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [RFC][PATCH 2/2] new scheme to preempt swap token
  2006-10-02  7:59 ` Andrew Morton
  ` (2 preceding siblings ...)
  2006-10-08 20:23 ` [RFC][PATCH 1/2] grab swap token reordered Ashwin Chaugule
@ 2006-10-08 20:28 ` Ashwin Chaugule
  3 siblings, 0 replies; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-08 20:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: Peter Zijlstra, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 00:59 -0700, Andrew Morton wrote:
> On Mon, 02 Oct 2006 09:35:52 +0200
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> It's just thrashing too much to bother optimising for. Obviously we want
> it to terminate in a sane period of time, and we'd _like_ to improve it.
> But I think we'd accept a 10% slowdown in this region of operation if it
> gave us a 10% speedup in the 25%-utilisation region.
>
> IOW: does the patch help mem=96M; make -j5?
>
> > Tasks owning the swap token will retain their pages and will hence swap
> > less; other (contending) tasks will get fewer pages and will fault more
> > frequently. This prio mechanism will favour exactly those tasks not
> > holding the token, which makes for token bouncing.

This algorithm should take care of it. Each task has a priority which is
incremented if it contended for the token in an interval less than its
previous attempt. If the token is acquired, that task's priority is
boosted to prevent the token from bouncing around too often and to let
the task make some progress in its execution.

Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
--

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..c4bb78b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -342,9 +342,16 @@ struct mm_struct {
 	/* Architecture-specific MM context */
 	mm_context_t context;
 
-	/* Token based thrashing protection. */
-	unsigned long swap_token_time;
-	char recent_pagein;
+	/* Swap token stuff */
+	/*
+	 * Last value of global fault stamp as seen by this process.
+	 * In other words, this value gives an indication of how long
+	 * it has been since this task got the token.
+	 * Look at mm/thrash.c
+	 */
+	unsigned int faultstamp;
+	unsigned int token_priority;
+	unsigned int last_interval;
 
 	/* coredumping support */
 	int core_waiters;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e7c36ba..89f8a39 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ extern spinlock_t swap_lock;
 
 /* linux/mm/thrash.c */
 extern struct mm_struct * swap_token_mm;
-extern unsigned long swap_token_default_timeout;
 extern void grab_swap_token(void);
 extern void __put_swap_token(struct mm_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index f9b014e..c4b19b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -470,6 +470,10 @@ static struct mm_struct *dup_mm(struct t
 
 	memcpy(mm, oldmm, sizeof(*mm));
 
+	/* Initializing for Swap token stuff */
+	mm->token_priority = 0;
+	mm->last_interval = 0;
+
 	if (!mm_init(mm))
 		goto fail_nomem;
@@ -532,7 +536,11 @@ static int copy_mm(unsigned long clone_f
 	if (!mm)
 		goto fail_nomem;
 
-good_mm:
+good_mm:
+	/* Initializing for Swap token stuff */
+	mm->token_priority = 0;
+	mm->last_interval = 0;
+
 	tsk->mm = mm;
 	tsk->active_mm = mm;
 	return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index fd43c3e..ef52798 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -910,17 +910,6 @@ static ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
-#ifdef CONFIG_SWAP
-	{
-		.ctl_name	= VM_SWAP_TOKEN_TIMEOUT,
-		.procname	= "swap_token_timeout",
-		.data		= &swap_token_default_timeout,
-		.maxlen		= sizeof(swap_token_default_timeout),
-		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_jiffies,
-		.strategy	= &sysctl_jiffies,
-	},
-#endif
 #ifdef CONFIG_NUMA
 	{
 		.ctl_name	= VM_ZONE_RECLAIM_MODE,
diff --git a/mm/thrash.c b/mm/thrash.c
index f4c560b..c0d9cee 100644
--- a/mm/thrash.c
+++ b/mm/thrash.c
@@ -7,90 +7,66 @@
  *
  * Simple token based thrashing protection, using the algorithm
  * described in:
  * http://www.cs.wm.edu/~sjiang/token.pdf
+ *
+ * Sep 2006, Ashwin Chaugule <ashwin.chaugule@celunite.com>
+ * Improved algorithm to pass token:
+ * Each task has a priority which is incremented if it contended
+ * for the token in an interval less than its previous attempt.
+ * If the token is acquired, that task's priority is boosted to prevent
+ * the token from bouncing around too often and to let the task make
+ * some progress in its execution.
  */
+
 #include <linux/jiffies.h>
 #include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/swap.h>
 
 static DEFINE_SPINLOCK(swap_token_lock);
-static unsigned long swap_token_timeout;
-static unsigned long swap_token_check;
-struct mm_struct * swap_token_mm = &init_mm;
-
-#define SWAP_TOKEN_CHECK_INTERVAL (HZ * 2)
-#define SWAP_TOKEN_TIMEOUT	(300 * HZ)
-/*
- * Currently disabled; Needs further code to work at HZ * 300.
- */
-unsigned long swap_token_default_timeout = SWAP_TOKEN_TIMEOUT;
-
-/*
- * Take the token away if the process had no page faults
- * in the last interval, or if it has held the token for
- * too long.
- */
-#define SWAP_TOKEN_ENOUGH_RSS 1
-#define SWAP_TOKEN_TIMED_OUT 2
-static int should_release_swap_token(struct mm_struct *mm)
-{
-	int ret = 0;
-	if (!mm->recent_pagein)
-		ret = SWAP_TOKEN_ENOUGH_RSS;
-	else if (time_after(jiffies, swap_token_timeout))
-		ret = SWAP_TOKEN_TIMED_OUT;
-	mm->recent_pagein = 0;
-	return ret;
-}
+struct mm_struct * swap_token_mm = NULL;
+unsigned int global_faults = 0;
 
-/*
- * Try to grab the swapout protection token. We only try to
- * grab it once every TOKEN_CHECK_INTERVAL, both to prevent
- * SMP lock contention and to check that the process that held
- * the token before is no longer thrashing.
- */
 void grab_swap_token(void)
 {
-	struct mm_struct *mm;
-	int reason;
-
-	/* We have the token. Let others know we still need it. */
-	if (has_swap_token(current->mm)) {
-		current->mm->recent_pagein = 1;
-		if (unlikely(!swap_token_default_timeout))
-			disable_swap_token();
+	int current_interval = 0;
+
+	global_faults++;
+
+	current_interval = global_faults - current->mm->faultstamp;
+
+	if (!spin_trylock(&swap_token_lock))
 		return;
-	}
 
-	if (time_after(jiffies, swap_token_check)) {
+	/* First come first served */
+	if (swap_token_mm == NULL) {
+		current->mm->token_priority = current->mm->token_priority + 2;
+		swap_token_mm = current->mm;
+		goto out;
+	}
 
-		if (!swap_token_default_timeout) {
-			swap_token_check = jiffies + SWAP_TOKEN_CHECK_INTERVAL;
-			return;
+	if (current->mm != swap_token_mm) {
+		if (current_interval < current->mm->last_interval)
+			current->mm->token_priority++;
+		else {
+			current->mm->token_priority--;
+			if (unlikely(current->mm->token_priority < 0))
+				current->mm->token_priority = 0;
 		}
-
-		/* ... or if we recently held the token. */
-		if (time_before(jiffies, current->mm->swap_token_time))
-			return;
-
-		if (!spin_trylock(&swap_token_lock))
-			return;
-
-		swap_token_check = jiffies + SWAP_TOKEN_CHECK_INTERVAL;
-
-		mm = swap_token_mm;
-		if ((reason = should_release_swap_token(mm))) {
-			unsigned long eligible = jiffies;
-			if (reason == SWAP_TOKEN_TIMED_OUT) {
-				eligible += swap_token_default_timeout;
-			}
-			mm->swap_token_time = eligible;
-			swap_token_timeout = jiffies + swap_token_default_timeout;
+		/* Check if we deserve the token */
+		if (current->mm->token_priority > swap_token_mm->token_priority) {
+			current->mm->token_priority = current->mm->token_priority + 2;
 			swap_token_mm = current->mm;
 		}
-		spin_unlock(&swap_token_lock);
 	}
-	return;
+	else
+		/* Token holder came in again! */
+		current->mm->token_priority = current->mm->token_priority + 2;
+
+out:
+	current->mm->faultstamp = global_faults;
+	current->mm->last_interval = current_interval;
+	spin_unlock(&swap_token_lock);
+	return;
 }
 
 /* Called on process exit. */
@@ -98,9 +74,7 @@ void __put_swap_token(struct mm_struct *
 {
 	spin_lock(&swap_token_lock);
 	if (likely(mm == swap_token_mm)) {
-		mm->swap_token_time = jiffies + SWAP_TOKEN_CHECK_INTERVAL;
-		swap_token_mm = &init_mm;
-		swap_token_check = jiffies;
+		swap_token_mm = NULL;
 	}
 	spin_unlock(&swap_token_lock);
 }
--

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02  7:35 ` Peter Zijlstra
  2006-10-02  7:59   ` Andrew Morton
@ 2006-10-02 11:00   ` Ashwin Chaugule
  2006-10-02 11:08     ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-02 11:00 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 09:35 +0200, Peter Zijlstra wrote:
> Being frustrated with these results - I mean, the idea made sense, so
> what is going on? - I came up with this answer:
>
> Tasks owning the swap token will retain their pages and will hence swap
> less; other (contending) tasks will get fewer pages and will fault more
> frequently. This prio mechanism will favour exactly those tasks not
> holding the token, which makes for token bouncing.

Right. But with the token bouncing around, effectively the RSS of the
processes at that time will keep increasing, and they should be able to
spend more time on execution than I/O. Meanwhile, the priorities of the
tasks that were contending for the token but didn't get it will
increment. So, since fairness is preserved, all the tasks should get
their fair share of execution, and it should result in a speedup
compared to the current upstream implementation.

I took a time instrumentation of the vanilla 2.6.18 kernel build with
-j4 and I've posted the results in the previous mail. I'm testing on an
IBM T42, 1.69GHz, 64M system.

> So while I agree it would be nice to get rid of all magic variables
> (holding time in the current implementation), this proposed solution
> hasn't convinced me (for one, it introduces another).
>
> (For the interested, the various attempts I tried are available here:
> http://programming.kicks-ass.net/kernel-patches/swap_token/ )

Cool!

Had you applied these patches when you posted your test results?

Thanks,
Ashwin

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-02 11:00   ` [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
@ 2006-10-02 11:08     ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2006-10-02 11:08 UTC (permalink / raw)
  To: ashwin.chaugule; +Cc: Andrew Morton, linux-kernel, Rik van Riel

On Mon, 2006-10-02 at 16:30 +0530, Ashwin Chaugule wrote:
> On Mon, 2006-10-02 at 09:35 +0200, Peter Zijlstra wrote:
> > So while I agree it would be nice to get rid of all magic variables
> > (holding time in the current impl), this proposed solution hasn't
> > convinced me (for one, it introduces another).
> >
> > (For the interested, the various attempts I tried are available here:
> > http://programming.kicks-ass.net/kernel-patches/swap_token/)
>
> Cool!
>
> Had you applied these patches when you posted your test results?

Only my test box ever ran them. They are replacements for your 2nd patch; the timings I got from them were worse than with yours, though - needs more attention.

A variation on 3 I have in mind is to reset the prio of the losing mm to 0 - this should avoid it regaining the token quickly.

^ permalink raw reply	[flat|nested] 12+ messages in thread
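Peter's proposed variation - reset the losing mm's priority to 0 on handoff - can be sketched like this. (The field and function names are hypothetical; his actual attempt-3 patch is only available at the URL above, so this just models the rule stated in the mail.)

```c
#include <stddef.h>

/* Sketch of the "reset the loser's prio" handoff variation. */
struct mm_prio {
	unsigned int token_priority;
};

static struct mm_prio *token_mm = NULL;

/* Returns 1 if 'challenger' took the token, 0 otherwise. */
int try_preempt_token(struct mm_prio *challenger)
{
	if (token_mm && challenger->token_priority <= token_mm->token_priority)
		return 0;			/* holder keeps the token */
	if (token_mm)
		token_mm->token_priority = 0;	/* loser starts from scratch */
	token_mm = challenger;
	return 1;
}
```

The reset is what damps the bouncing: a dethroned mm must re-earn priority from zero before it can contend again, rather than immediately reclaiming the token on its next fault.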
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-01 22:56 ` Andrew Morton
  2006-10-02  7:35   ` Peter Zijlstra
@ 2006-10-02  8:20   ` Ashwin Chaugule
  2006-10-02 10:00   ` Ashwin Chaugule
  2 siblings, 0 replies; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-02 8:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Rik van Riel, Peter Zijlstra

On Sun, 2006-10-01 at 15:56 -0700, Andrew Morton wrote:
> On Sat, 30 Sep 2006 00:11:51 +0530
> Ashwin Chaugule <ashwin.chaugule@celunite.com> wrote:
>
> > Hi,
> > Here's a brief write-up on the next two mails.
>
> When preparing patches, please give each one's email a different and
> meaningful Subject:, and try to put the description of the patch within the
> email which contains that patch, thanks.

Yep, will remember that.

> > PATCH 2:
> >
> > Instead of using TIMEOUT as a parameter to transfer the token, I think a
> > better solution is to hand it over to a process that proves its
> > eligibility.
> >
> > What my scheme does is find out how frequently a process is calling
> > these functions. The processes that call them more frequently get a
> > higher priority. The idea is to guarantee that a high-priority process
> > gets the token. The priority of a process is determined by the number
> > of consecutive calls to swap-in and no-page. I mean "consecutive" not
> > from the scheduler's point of view, but from the process's point of
> > view. In other words, if the task called these functions every time it
> > was scheduled, it means it is not getting any further with its
> > execution.
> >
> > This way, it's a matter of a simple comparison of task priorities to
> > decide whether to transfer the token or not.
>
> Does this introduce the possibility of starvation? Where the
> fast-allocating process hogs the system and everything else makes no
> progress?

A fast-allocating process will start to increase its RSS, and the assumption is that such a process will finish its execution faster and relinquish the token.
Meanwhile, while such a process is allocating, the other processes' pages will be marked as "true LRU" pages, and in the event that they get swapped out, their owners' priorities will also be increased when they get scheduled. So effectively, the chances of starvation are quite minimal.

The key is to grant the token to the most deserving process. In other words, when a task tries to hog the system with allocations and swap-ins, some other process is getting hampered, and when the affected process gets scheduled, the algorithm will make sure it gets immunity from generating false LRU pages.

Also, when the fast-allocating process stops its continuous allocation, or continues allocating only sporadically, i.e. ((global_faults - current->mm->faultstamp) > FAULTSTAMP_DIFF), its priority keeps getting decremented too.

> qsbench gives quite unstable results in my experience. How stable is the
> above result (say, average across ten runs?)

True. I did run the qsbench test several times, and the results were always better by at least 10 seconds with my changes.

> It's quite easy to make changes in this area which speed qsbench up with
> one set of arguments, and which slow it down with a different set. Did you
> try mixing the tests up a bit?

I ran another VM stress app, which spawns several threads, each dedicated to malloc only, I/O only, etc.

Results:

Upstream:

time stress --cpu 2 --io 14 --vm 5 --vm-bytes 50M --timeout 10s --hdd 2
stress: info: [4331] dispatching hogs: 2 cpu, 14 io, 5 vm, 2 hdd
stress: info: [4331] successful run completed in 19s

real	0m19.358s
user	0m9.850s
sys	0m0.210s

My changes:

time stress --cpu 2 --io 14 --vm 5 --vm-bytes 50M --timeout 10s --hdd 2
stress: info: [4498] dispatching hogs: 2 cpu, 14 io, 5 vm, 2 hdd
stress: info: [4498] successful run completed in 16s

real	0m16.813s
user	0m9.850s
sys	0m0.100s

I haven't tested this enough to average it out, but it did show improvements every time I ran it.
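The faultstamp accounting described earlier in this message - priority rises on consecutive faults, and decays once (global_faults - faultstamp) exceeds FAULTSTAMP_DIFF - can be modeled in userspace like this. (The field names follow the mail; the FAULTSTAMP_DIFF value and exact kernel fields are assumptions, not the posted patch itself.)

```c
/* Model of consecutive-fault priority bookkeeping. */
#define FAULTSTAMP_DIFF 2	/* hypothetical threshold */

static unsigned long global_faults;

struct mm_model {
	unsigned long faultstamp;	/* global_faults at this mm's last fault */
	unsigned int token_priority;
};

/* Would be called from the swap-in and no-page paths: a task that
 * faults on nearly every global fault is making no progress, so its
 * claim on the token strengthens; a sporadic faulter's claim decays. */
void account_fault(struct mm_model *mm)
{
	global_faults++;
	if (global_faults - mm->faultstamp <= FAULTSTAMP_DIFF)
		mm->token_priority++;		/* faulting back-to-back */
	else if (mm->token_priority > 0)
		mm->token_priority--;		/* allocation went sporadic */
	mm->faultstamp = global_faults;
}
```

Comparing two such token_priority values is then all the token-transfer decision needs, which is the "simple comparison" the scheme relies on in place of the old TIMEOUT.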
> Also, qsbench isn't really a very good test for swap-intensive workloads -
> its re-referencing and locality patterns seem fairly artificial.

True. In theory, my algorithm should give better results. The earlier TIMEOUT was unfair to processes: in the pre-thrashing stages, it was detrimental to processes badly in need of the token, so their execution didn't get any further. That is what is addressed here.

I was hoping that people would have some other instrumentation tools for the VM. I tried vmregress, but it didn't build against 2.6.18; it needs some mm API fixes.

> Another workload which it would be useful to benchmark is a kernel compile
> - say, boot with `mem=16M' and time `make -j4 vmlinux' (numbers may need
> tuning).

Will test this and post it up.

Thanks!
Ashwin

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC][PATCH 0/2] Swap token re-tuned
  2006-10-01 22:56 ` Andrew Morton
  2006-10-02  7:35   ` Peter Zijlstra
  2006-10-02  8:20   ` Ashwin Chaugule
@ 2006-10-02 10:00   ` Ashwin Chaugule
  2 siblings, 0 replies; 12+ messages in thread
From: Ashwin Chaugule @ 2006-10-02 10:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Rik van Riel, Peter Zijlstra

On Sun, 2006-10-01 at 15:56 -0700, Andrew Morton wrote:
> Another workload which it would be useful to benchmark is a kernel compile
> - say, boot with `mem=16M' and time `make -j4 vmlinux' (numbers may need
> tuning).

This is what I got with mem=64M:

Upstream 2.6.18, make -j 4 vmlinux:

real	31m26.021s
user	4m32.140s
sys	0m23.340s

My patch:

real	27m42.984s
user	4m33.800s
sys	0m22.080s

Ashwin

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2006-10-08 20:28 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-09-29 18:41 [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
2006-10-01 22:56 ` Andrew Morton
2006-10-02  7:35   ` Peter Zijlstra
2006-10-02  7:59     ` Andrew Morton
2006-10-02  8:14       ` Peter Zijlstra
2006-10-03  7:32         ` Peter Zijlstra
2006-10-08 20:23           ` [RFC][PATCH 1/2] grab swap token reordered Ashwin Chaugule
2006-10-08 20:28             ` [RFC][PATCH 2/2] new scheme to preempt swap token Ashwin Chaugule
2006-10-02 11:00   ` [RFC][PATCH 0/2] Swap token re-tuned Ashwin Chaugule
2006-10-02 11:08     ` Peter Zijlstra
2006-10-02  8:20   ` Ashwin Chaugule
2006-10-02 10:00   ` Ashwin Chaugule
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox