* [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
@ 2005-12-07 22:04 Neil Horman
2005-12-10 0:29 ` Andrew Morton
0 siblings, 1 reply; 7+ messages in thread
From: Neil Horman @ 2005-12-07 22:04 UTC (permalink / raw)
To: linux-kernel; +Cc: mingo, akpm
[-- Attachment #1: Type: text/plain, Size: 3346 bytes --]
Hey all-
I was recently shown this issue: if the kernel was kept full of pagecache by
applications constantly writing large amounts of data to disk, the box could
find itself in a position where the VM, in __alloc_pages, would invoke the OOM
killer repeatedly within try_to_free_pages until the box had no candidate
processes left to kill, at which point it would panic. While this seems like
the right thing to do in general, it occurred to me that if we could simply
force some additional evictions from pagecache before we tried to reclaim
memory in try_to_free_pages, we stood a good chance of avoiding the need to
invoke the OOM killer at all (assuming that the pages freed from pagecache
were physically contiguous). The following patch performs this operation and,
in my testing and that of the original reporter, results in the OOM killer not
being invoked for the workloads which would previously OOM-kill the box to the
point that it would panic.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@redhat.com>
include/linux/writeback.h | 1 +
mm/page-writeback.c | 17 +++++++++++++++++
mm/page_alloc.c | 10 ++++++++++
3 files changed, 28 insertions(+)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -86,6 +86,7 @@ static inline void wait_on_inode(struct
* mm/page-writeback.c
*/
int wakeup_pdflush(long nr_pages);
+void clean_pagecache(long nr_pages);
void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(void);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -350,6 +350,23 @@ static void background_writeout(unsigned
}
/*
+ * Write back nr_pages from pagecache to disk synchronously;
+ * blocks until the writeback is complete.
+ */
+void clean_pagecache(long nr_pages)
+{
+ struct writeback_control wbc = {
+ .bdi = NULL,
+ .sync_mode = WB_SYNC_ALL,
+ .older_than_this = NULL,
+ .nr_to_write = nr_pages,
+ .nonblocking = 0,
+ };
+
+ writeback_inodes(&wbc);
+}
+
+/*
* Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
* the whole world. Returns 0 if a pdflush thread was dispatched. Returns
* -1 if all pdflush threads were busy.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -949,6 +949,16 @@ rebalance:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
+	/*
+	 * We're pinched for memory, so before we try to reclaim some
+	 * pages synchronously, let's try to force some more pages out
+	 * of pagecache, to raise our chances of this succeeding.
+	 * Specifically, let's write out the number of pages that this
+	 * allocation is requesting, in the hope that they will be
+	 * contiguous.
+	 */
+ clean_pagecache(1<<order);
+
did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask);
p->reclaim_state = NULL;
--
/***************************************************
*Neil Horman
*Software Engineer
*gpg keyid: 1024D / 0x92A74FA1 - http://pgp.mit.edu
***************************************************/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-10  0:29 ` Andrew Morton
From: Andrew Morton @ 2005-12-10 0:29 UTC (permalink / raw)
To: Neil Horman; +Cc: linux-kernel, mingo

Neil Horman <nhorman@tuxdriver.com> wrote:
>
> Hey all-
> I was recently shown this issue: if the kernel was kept full of
> pagecache by applications constantly writing large amounts of data to
> disk, the box could find itself in a position where the VM, in
> __alloc_pages, would invoke the OOM killer repeatedly within
> try_to_free_pages until the box had no candidate processes left to
> kill, at which point it would panic.

That's pretty bad.  Are you able to provide a description which would
permit others to reproduce this?

> /*
> + * Write back nr_pages from pagecache to disk synchronously;
> + * blocks until the writeback is complete.
> + */
> +void clean_pagecache(long nr_pages)
> +{
> +	struct writeback_control wbc = {
> +		.bdi = NULL,
> +		.sync_mode = WB_SYNC_ALL,
> +		.older_than_this = NULL,
> +		.nr_to_write = nr_pages,
> +		.nonblocking = 0,
> +	};
> +
> +	writeback_inodes(&wbc);
> +}

Interesting.

> +/*
>  * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
>  * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
>  * -1 if all pdflush threads were busy.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -949,6 +949,16 @@ rebalance:
> 	reclaim_state.reclaimed_slab = 0;
> 	p->reclaim_state = &reclaim_state;
>
> +	/*
> +	 * We're pinched for memory, so before we try to reclaim some
> +	 * pages synchronously, let's try to force some more pages out
> +	 * of pagecache, to raise our chances of this succeeding.
> +	 * Specifically, let's write out the number of pages that this
> +	 * allocation is requesting, in the hope that they will be
> +	 * contiguous.
> +	 */
> +	clean_pagecache(1<<order);
> +
> 	did_some_progress = try_to_free_pages(zonelist->zones, gfp_mask);

I suspect that we should be passing more than (1<<order) into
clean_pagecache() - if we're going to do this sort of writeback then we
might as well do a decent amount.  Maybe something like (number of pages
on the eligible LRUs * proportion of dirty memory) or something.

But then, page reclaim does writeback off the LRU, so none of this
should be needed...  Need to work out why it broke.

And we should not be calling into filesystem writeback unless the
caller specified __GFP_FS.
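[Editor's note: a minimal user-space sketch of the gating and sizing Andrew
describes above. All names here are illustrative — GFP_FS_BIT stands in for
the real __GFP_FS flag value, and writeback_target() is a hypothetical helper,
not a kernel symbol.]

```c
#include <assert.h>

/* GFP_FS_BIT is a stand-in for the kernel's __GFP_FS flag; the real
 * value differs. */
#define GFP_FS_BIT 0x80u

/* Sketch of the suggested sizing: instead of writing back only the
 * 1 << order pages the allocation needs, scale the target by the dirty
 * proportion of memory applied to the eligible LRU pages, and refuse
 * filesystem writeback entirely when the caller's gfp_mask lacks the
 * FS bit. */
static long writeback_target(long lru_pages, long dirty_pages,
                             long total_pages, unsigned int order,
                             unsigned int gfp_mask)
{
    long min_pages = 1L << order;   /* what the bare patch writes back */
    long scaled;

    if (!(gfp_mask & GFP_FS_BIT))
        return 0;                   /* caller forbids fs writeback */
    if (total_pages <= 0)
        return min_pages;

    /* eligible LRU pages * proportion of dirty memory */
    scaled = lru_pages * dirty_pages / total_pages;
    return scaled > min_pages ? scaled : min_pages;
}
```

Whether scaling like this helps in practice is exactly what the reply
questions: reclaim already performs writeback off the LRU, so the real fix may
lie elsewhere.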
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-10 18:25 ` Neil Horman
From: Neil Horman @ 2005-12-10 18:25 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mingo

On Fri, Dec 09, 2005 at 04:29:01PM -0800, Andrew Morton wrote:
> Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > Hey all-
> > I was recently shown this issue: if the kernel was kept full of
> > pagecache by applications constantly writing large amounts of data
> > to disk, the box could find itself in a position where the VM, in
> > __alloc_pages, would invoke the OOM killer repeatedly within
> > try_to_free_pages until the box had no candidate processes left to
> > kill, at which point it would panic.
>
> That's pretty bad.  Are you able to provide a description which would
> permit others to reproduce this?
>
I can provide you what was provided to me (it'll have to wait till
Monday, as that's where my notes are).  The original reproducer requires
multiple nodes in a cluster with more than 4GB of RAM to write 16GB of
data to a common NFS share, but I think it can be reproduced on a single
system with sufficient RAM (specifically more than 4GB, IIRC) writing to
an NFS share.

> I suspect that we should be passing more than (1<<order) into
> clean_pagecache() - if we're going to do this sort of writeback then
> we might as well do a decent amount.  Maybe something like (number of
> pages on the eligible LRUs * proportion of dirty memory) or
> something.  But then, page reclaim does writeback off the LRU, so
> none of this should be needed...  Need to work out why it broke.
>
Understood, but I think if userspace is filling pagecache at a
sufficient rate, then a non-I/O-bound process performing a memory
allocation in kernel space will be able to trigger the OOM killer before
the set of active pdflush tasks have flushed enough pagecache to free up
sufficient lowmem to satisfy the request.  By adding the above
writeback, we can block the allocation until at least some amount of
lowmem is freed.

I understand what you're saying, though, about flushing a decent amount
if we're going to flush synchronously at all.  I can rework the patch to
flush more pagecache when we trigger.  The only reason I used 1<<order
was that I didn't want to be too aggressive and stall the system while
we flushed out more pagecache than we needed to.

Of course, I could be off base on all of this.  As I mentioned to Ingo,
I'm really trying to get more involved in VM work, so I'm just getting
used to some of the code here.  But I can say that this patch fixes the
problem I describe above, and given my limited understanding, it makes
sense to me.

> And we should not be calling into filesystem writeback unless the
> caller specified __GFP_FS.
>
I'll take your word for this, but I'm not sure why that needs to be the
case.  My intent here was to free pagecache whenever a lowmem allocation
fails.  I understand that the pagecache itself may well be in highmem,
but a certain amount of lowmem is used to track and manage that
pagecache allocation, and by flushing pagecache we free that lowmem up,
hopefully in a sufficient amount to allow the allocation at hand to
proceed.

I'll post the full reproducer Monday morning/afternoon.

Thanks & Regards
Neil
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-12 18:22 ` Neil Horman
From: Neil Horman @ 2005-12-12 18:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mingo

On Fri, Dec 09, 2005 at 04:29:01PM -0800, Andrew Morton wrote:
> That's pretty bad.  Are you able to provide a description which would
> permit others to reproduce this?

As promised, here's the reproducer that was given to me and used to
reproduce this problem:

1) Set up an NFS server with a thread count of 2.  Of course, 1 thread
   might make the problem easier to reproduce; I haven't tried it yet.

2) Set up 4 nodes to hammer the NFS-mounted directory.  The 4 nodes
   should hammer out 4 gigs.  2 gigs didn't seem to be enough.

I used a locally developed tool called ior to reproduce this problem.
The tool can be found here:

http://www.llnl.gov/asci/platforms/purple/rfp/benchmarks/limited/ior/

I suppose anything that can write to NFS fast should be fine, but that's
what I did.

If you do this, any node writing to the server that has more than 4GB of
RAM should start OOM killing to the point where it runs out of candidate
processes and panics.

Thanks & Regards
Neil
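[Editor's note: lacking ior, "anything that can write to NFS fast" can be
sketched as a few concurrent dd writers against the mount. The path and sizes
below are illustrative, not from the original report.]

```shell
# Hypothetical load generator standing in for ior: launch several
# concurrent writers that each stream a large sequential file into a
# target directory (an NFS mount in the reproducer), fsyncing on
# completion.
hammer_dir() {
    dir=$1       # target directory
    writers=$2   # number of concurrent writers (4 nodes in the report)
    mb=$3        # megabytes written per writer
    mkdir -p "$dir"
    i=0
    while [ "$i" -lt "$writers" ]; do
        dd if=/dev/zero of="$dir/stress.$i" bs=1M count="$mb" \
            conv=fsync 2>/dev/null &
        i=$((i + 1))
    done
    wait    # block until every writer has finished
}

# e.g., against a mount at /mnt/nfs:  hammer_dir /mnt/nfs/stress 4 4096
```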
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-12 20:16 ` Andrew Morton
From: Andrew Morton @ 2005-12-12 20:16 UTC (permalink / raw)
To: Neil Horman; +Cc: linux-kernel, mingo

Neil Horman <nhorman@tuxdriver.com> wrote:
>
> As promised, here's the reproducer that was given to me and used to
> reproduce this problem:
>
> 1) Set up an NFS server with a thread count of 2.
>
> 2) Set up 4 nodes to hammer the NFS-mounted directory.  The 4 nodes
>    should hammer out 4 gigs.  2 gigs didn't seem to be enough.
>
> I used a locally developed tool called ior to reproduce this problem.
> The tool can be found here:
>
> http://www.llnl.gov/asci/platforms/purple/rfp/benchmarks/limited/ior/
>
> If you do this, any node writing to the server that has more than 4GB
> of RAM should start OOM killing to the point where it runs out of
> candidate processes and panics.

We merged an NFS fix last week which will help throttling under heavy
writeout conditions.
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-12 21:40 ` Neil Horman
From: Neil Horman @ 2005-12-12 21:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mingo

On Mon, Dec 12, 2005 at 12:16:39PM -0800, Andrew Morton wrote:
> We merged an NFS fix last week which will help throttling under heavy
> writeout conditions.

This one?

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bb713d6d38f7be4f4e7d790cddb1b076e7da6699

I guess I must have just missed it during my testing.  I'll give it a
spin and let you know if it fixes my test case.

Thanks & Regards
Neil
* Re: [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory
  2005-12-14 19:43 ` Neil Horman
From: Neil Horman @ 2005-12-14 19:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mingo

On Mon, Dec 12, 2005 at 04:40:43PM -0500, Neil Horman wrote:
> On Mon, Dec 12, 2005 at 12:16:39PM -0800, Andrew Morton wrote:
> > We merged an NFS fix last week which will help throttling under
> > heavy writeout conditions.
> This one?
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bb713d6d38f7be4f4e7d790cddb1b076e7da6699
>
> I guess I must have just missed it during my testing.  I'll give it a
> spin and let you know if it fixes my test case.

Just finished testing with the latest kernel, and the problem appears to
be gone.  I withdraw my patch.  Apologies for the noise.

Thanks & Regards
Neil
end of thread, other threads: [~2005-12-14 19:43 UTC | newest]

Thread overview: 7+ messages:
2005-12-07 22:04 [PATCH] vm: enhance __alloc_pages to prioritize pagecache eviction when pressed for memory Neil Horman
2005-12-10  0:29 ` Andrew Morton
2005-12-10 18:25   ` Neil Horman
2005-12-12 18:22   ` Neil Horman
2005-12-12 20:16     ` Andrew Morton
2005-12-12 21:40       ` Neil Horman
2005-12-14 19:43         ` Neil Horman