* [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
@ 2005-12-07 22:04 Neil Horman
2005-12-10 0:29 ` Andrew Morton
0 siblings, 1 reply; 7+ messages in thread
From: Neil Horman @ 2005-12-07 22:04 UTC (permalink / raw)
To: linux-kernel; +Cc: mingo, akpm
[-- Attachment #1: Type: text/plain, Size: 3346 bytes --]
Hey all-
I was recently shown this issue: if the kernel was kept full of pagecache by
applications constantly writing large amounts of data to disk, the box could
find itself in a position where the VM, in __alloc_pages, would invoke the OOM
killer repeatedly within try_to_free_pages until the box had no candidate
processes left to kill, at which point it would panic. While this seems like
the right thing to do in general, it occurred to me that if we could simply
force some additional evictions from pagecache before we tried to reclaim
memory in try_to_free_pages, we stood a good chance of avoiding the need to
invoke the OOM killer at all (assuming that the pages freed from pagecache
were physically contiguous). The following patch performs this operation and,
in my testing and that of the original reporter, results in the OOM killer not
being invoked for the workloads which would previously OOM-kill the box to the
point that it would panic.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@redhat.com>
include/linux/writeback.h | 1 +
mm/page-writeback.c | 17 +++++++++++++++++
mm/page_alloc.c | 10 ++++++++++
3 files changed, 28 insertions(+)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -86,6 +86,7 @@ static inline void wait_on_inode(struct
* mm/page-writeback.c
*/
int wakeup_pdflush(long nr_pages);
+void clean_pagecache(long nr_pages);
void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(void);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -350,6 +350,23 @@ static void background_writeout(unsigned
}
/*
+ * Write back nr_pages from pagecache to disk synchronously;
+ * blocks until the writeback is complete.
+ */
+void clean_pagecache(long nr_pages)
+{
+ struct writeback_control wbc = {
+ .bdi = NULL,
+ .sync_mode = WB_SYNC_ALL,
+ .older_than_this = NULL,
+ .nr_to_write = nr_pages,
+ .nonblocking = 0,
+ };
+
+ writeback_inodes(&wbc);
+}
+
+/*
* Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
* the whole world. Returns 0 if a pdflush thread was dispatched. Returns
* -1 if all pdflush threads were busy.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -949,6 +949,16 @@ rebalance:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
+	/*
+	 * We're pinched for memory, so before we try to reclaim some
+	 * pages synchronously, let's try to force some more pages out
+	 * of pagecache, to raise our chances of this succeeding.
+	 * Specifically, let's write out the number of pages that this
+	 * allocation is requesting, in the hope that they will be
+	 * contiguous.
+	 */
+ clean_pagecache(1<<order);
+
did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask);
p->reclaim_state = NULL;
--
/***************************************************
*Neil Horman
*Software Engineer
*gpg keyid: 1024D / 0x92A74FA1 - http://pgp.mit.edu
***************************************************/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-10  0:29 ` Andrew Morton
From: Andrew Morton @ 2005-12-10 0:29 UTC (permalink / raw)
To: Neil Horman; +Cc: linux-kernel, mingo

Neil Horman <nhorman@tuxdriver.com> wrote:
>
> Hey all-
> I was recently shown this issue: if the kernel was kept full of
> pagecache by applications constantly writing large amounts of data to
> disk, the box could find itself in a position where the VM, in
> __alloc_pages, would invoke the OOM killer repeatedly within
> try_to_free_pages until the box had no candidate processes left to
> kill, at which point it would panic.

That's pretty bad.  Are you able to provide a description which would
permit others to reproduce this?

> /*
> + * Write back nr_pages from pagecache to disk synchronously;
> + * blocks until the writeback is complete.
> + */
> +void clean_pagecache(long nr_pages)
> +{
> +	struct writeback_control wbc = {
> +		.bdi = NULL,
> +		.sync_mode = WB_SYNC_ALL,
> +		.older_than_this = NULL,
> +		.nr_to_write = nr_pages,
> +		.nonblocking = 0,
> +	};
> +
> +	writeback_inodes(&wbc);
> +}

Interesting.

> +/*
>  * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
>  * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
>  * -1 if all pdflush threads were busy.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -949,6 +949,16 @@ rebalance:
> 	reclaim_state.reclaimed_slab = 0;
> 	p->reclaim_state = &reclaim_state;
>
> +	/*
> +	 * We're pinched for memory, so before we try to reclaim some
> +	 * pages synchronously, let's try to force some more pages out
> +	 * of pagecache, to raise our chances of this succeeding.
> +	 * Specifically, let's write out the number of pages that this
> +	 * allocation is requesting, in the hope that they will be
> +	 * contiguous.
> +	 */
> +	clean_pagecache(1<<order);
> +
> 	did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask);

I suspect that we should be passing more than (1<<order) into
clean_pagecache() - if we're going to do this sort of writeback then we
might as well do a decent amount.  Maybe something like (number of pages
on the eligible LRUs * proportion of dirty memory) or something.

But then, page reclaim does writeback off the LRU, so none of this
should be needed...  Need to work out why it broke.

And we should not be calling into filesystem writeback unless the
caller specified __GFP_FS.
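[Editor's note: a minimal user-space sketch of the gating and sizing Andrew
describes above. All names here are illustrative — GFP_FS_BIT stands in for
the real __GFP_FS flag value, and writeback_target() is a hypothetical helper,
not a kernel symbol.]

```c
#include <assert.h>

/* GFP_FS_BIT is a stand-in for the kernel's __GFP_FS flag; the real
 * value differs. */
#define GFP_FS_BIT 0x80u

/* Sketch of the suggested sizing: instead of writing back only the
 * 1 << order pages the allocation needs, scale the target by the dirty
 * proportion of memory applied to the eligible LRU pages, and refuse
 * filesystem writeback entirely when the caller's gfp_mask lacks the
 * FS bit. */
static long writeback_target(long lru_pages, long dirty_pages,
                             long total_pages, unsigned int order,
                             unsigned int gfp_mask)
{
    long min_pages = 1L << order;   /* what the bare patch writes back */
    long scaled;

    if (!(gfp_mask & GFP_FS_BIT))
        return 0;                   /* caller forbids fs writeback */
    if (total_pages <= 0)
        return min_pages;

    /* eligible LRU pages * proportion of dirty memory */
    scaled = lru_pages * dirty_pages / total_pages;
    return scaled > min_pages ? scaled : min_pages;
}
```

Whether scaling like this helps in practice is exactly what the reply
questions: reclaim already performs writeback off the LRU, so the real fix may
lie elsewhere.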
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-10 18:25 ` Neil Horman
From: Neil Horman @ 2005-12-10 18:25 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mingo

On Fri, Dec 09, 2005 at 04:29:01PM -0800, Andrew Morton wrote:
> Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > Hey all-
> > I was recently shown this issue: if the kernel was kept full of
> > pagecache by applications constantly writing large amounts of data
> > to disk, the box could find itself in a position where the VM, in
> > __alloc_pages, would invoke the OOM killer repeatedly within
> > try_to_free_pages until the box had no candidate processes left to
> > kill, at which point it would panic.
>
> That's pretty bad.  Are you able to provide a description which would
> permit others to reproduce this?
>
I can provide you what was provided to me (it'll have to wait till
Monday, as that's where my notes are).  The original reproducer requires
multiple nodes in a cluster with more than 4GB of RAM to write 16GB of
data to a common NFS share, but I think it can be reproduced on a single
system with sufficient RAM (specifically more than 4GB, IIRC) writing to
an NFS share.

> I suspect that we should be passing more than (1<<order) into
> clean_pagecache() - if we're going to do this sort of writeback then
> we might as well do a decent amount.  Maybe something like (number of
> pages on the eligible LRUs * proportion of dirty memory) or
> something.  But then, page reclaim does writeback off the LRU, so
> none of this should be needed...  Need to work out why it broke.
>
Understood, but I think if userspace is filling pagecache at a
sufficient rate, then a non-I/O-bound process performing a memory
allocation in kernel space will be able to trigger the OOM killer before
the set of active pdflush tasks have flushed enough pagecache to free up
sufficient lowmem to satisfy the request.  By adding the above
writeback, we can block the allocation until at least some amount of
lowmem is freed.

I understand what you're saying, though, about flushing a decent amount
if we're going to flush synchronously at all.  I can rework the patch to
flush more pagecache when we trigger.  The only reason I used 1<<order
was that I didn't want to be too aggressive and stall the system while
we flushed out more pagecache than we needed to.

Of course, I could be off base on all of this.  As I mentioned to Ingo,
I'm really trying to get more involved in VM work, so I'm just getting
used to some of the code here.  But I can say that this patch fixes the
problem I describe above, and given my limited understanding, it makes
sense to me.

> And we should not be calling into filesystem writeback unless the
> caller specified __GFP_FS.
>
I'll take your word for this, but I'm not sure why that needs to be the
case.  My intent here was to free pagecache whenever a lowmem allocation
fails.  I understand that the pagecache itself may well be in highmem,
but a certain amount of lowmem is used to track and manage that
pagecache allocation, and by flushing pagecache we free that lowmem up,
hopefully in a sufficient amount to allow the allocation at hand to
proceed.

I'll post the full reproducer Monday morning/afternoon.

Thanks & Regards
Neil
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-12 18:22 ` Neil Horman
From: Neil Horman @ 2005-12-12 18:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mingo

On Fri, Dec 09, 2005 at 04:29:01PM -0800, Andrew Morton wrote:
> That's pretty bad.  Are you able to provide a description which would
> permit others to reproduce this?

As promised, here's the reproducer that was given to me and used to
reproduce this problem:

1) Set up an NFS server with a thread count of 2.  Of course, 1 thread
   might make the problem easier to reproduce; I haven't tried it yet.

2) Set up 4 nodes to hammer the NFS-mounted directory.  The 4 nodes
   should hammer out 4 gigs.  2 gigs didn't seem to be enough.

I used a locally developed tool called ior to reproduce this problem.
The tool can be found here:

http://www.llnl.gov/asci/platforms/purple/rfp/benchmarks/limited/ior/

I suppose anything that can write to NFS fast should be fine, but that's
what I did.

If you do this, any node writing to the server that has more than 4GB of
RAM should start OOM killing to the point where it runs out of candidate
processes and panics.

Thanks & Regards
Neil
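[Editor's note: lacking ior, "anything that can write to NFS fast" can be
sketched as a few concurrent dd writers against the mount. The path and sizes
below are illustrative, not from the original report.]

```shell
# Hypothetical load generator standing in for ior: launch several
# concurrent writers that each stream a large sequential file into a
# target directory (an NFS mount in the reproducer), fsyncing on
# completion.
hammer_dir() {
    dir=$1       # target directory
    writers=$2   # number of concurrent writers (4 nodes in the report)
    mb=$3        # megabytes written per writer
    mkdir -p "$dir"
    i=0
    while [ "$i" -lt "$writers" ]; do
        dd if=/dev/zero of="$dir/stress.$i" bs=1M count="$mb" \
            conv=fsync 2>/dev/null &
        i=$((i + 1))
    done
    wait    # block until every writer has finished
}

# e.g., against a mount at /mnt/nfs:  hammer_dir /mnt/nfs/stress 4 4096
```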
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-12 20:16 ` Andrew Morton
From: Andrew Morton @ 2005-12-12 20:16 UTC (permalink / raw)
To: Neil Horman; +Cc: linux-kernel, mingo

Neil Horman <nhorman@tuxdriver.com> wrote:
>
> As promised, here's the reproducer that was given to me and used to
> reproduce this problem:
>
> 1) Set up an NFS server with a thread count of 2.
>
> 2) Set up 4 nodes to hammer the NFS-mounted directory.  The 4 nodes
>    should hammer out 4 gigs.  2 gigs didn't seem to be enough.
>
> I used a locally developed tool called ior to reproduce this problem.
> The tool can be found here:
>
> http://www.llnl.gov/asci/platforms/purple/rfp/benchmarks/limited/ior/
>
> If you do this, any node writing to the server that has more than 4GB
> of RAM should start OOM killing to the point where it runs out of
> candidate processes and panics.

We merged an NFS fix last week which will help throttling under heavy
writeout conditions.
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-12 21:40 ` Neil Horman
From: Neil Horman @ 2005-12-12 21:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mingo

On Mon, Dec 12, 2005 at 12:16:39PM -0800, Andrew Morton wrote:
> We merged an NFS fix last week which will help throttling under heavy
> writeout conditions.

This one?

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bb713d6d38f7be4f4e7d790cddb1b076e7da6699

I guess I must have just missed it during my testing.  I'll give it a
spin and let you know if it fixes my test case.

Thanks & Regards
Neil
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-14 19:43 ` Neil Horman
From: Neil Horman @ 2005-12-14 19:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mingo

On Mon, Dec 12, 2005 at 04:40:43PM -0500, Neil Horman wrote:
> On Mon, Dec 12, 2005 at 12:16:39PM -0800, Andrew Morton wrote:
> > We merged an NFS fix last week which will help throttling under
> > heavy writeout conditions.
> This one?
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bb713d6d38f7be4f4e7d790cddb1b076e7da6699
>
> I guess I must have just missed it during my testing.  I'll give it a
> spin and let you know if it fixes my test case.

Just finished testing with the latest kernel, and the problem appears to
be gone.  I withdraw my patch.  Apologies for the noise.

Thanks & Regards
Neil
end of thread, other threads: [~2005-12-14 19:43 UTC | newest]

Thread overview: 7+ messages:
2005-12-07 22:04 [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory Neil Horman
2005-12-10  0:29 ` Andrew Morton
2005-12-10 18:25   ` Neil Horman
2005-12-12 18:22   ` Neil Horman
2005-12-12 20:16     ` Andrew Morton
2005-12-12 21:40       ` Neil Horman
2005-12-14 19:43         ` Neil Horman