public inbox for linux-xfs@vger.kernel.org
* Re: How to handle TIF_MEMDIE stalls?
       [not found]           ` <20150217125315.GA14287@phnom.home.cmpxchg.org>
@ 2015-02-17 22:54             ` Dave Chinner
  2015-02-17 23:32               ` Dave Chinner
                                 ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-17 22:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

[ cc xfs list - experienced kernel devs should not have to be
reminded to do this ]

On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > Johannes Weiner wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > > >  		if (high_zoneidx < ZONE_NORMAL)
> > > >  			goto out;
> > > >  		/* The OOM killer does not compensate for light reclaim */
> > > > -		if (!(gfp_mask & __GFP_FS))
> > > > +		if (!(gfp_mask & __GFP_FS)) {
> > > > +			/*
> > > > +			 * XXX: Page reclaim didn't yield anything,
> > > > +			 * and the OOM killer can't be invoked, but
> > > > +			 * keep looping as per should_alloc_retry().
> > > > +			 */
> > > > +			*did_some_progress = 1;
> > > >  			goto out;
> > > > +		}
> > > 
> > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?
> > 
> > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
> > at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm:
> > page_alloc: embed OOM killing naturally into allocation slowpath" introduced
> > a regression and the one below is the fix.
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >                 /* The OOM killer does not needlessly kill tasks for lowmem */
> >                 if (high_zoneidx < ZONE_NORMAL)
> >                         goto out;
> > -               /* The OOM killer does not compensate for light reclaim */
> > -               if (!(gfp_mask & __GFP_FS))
> > -                       goto out;
> >                 /*
> >                  * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> >                  * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> 
> Again, we don't want to OOM kill on behalf of allocations that can't
> initiate IO, or even actively prevent others from doing it.  Not per
> default anyway, because most callers can deal with the failure without
> having to resort to killing tasks, and NOFS reclaim *can* easily fail.
> It's the exceptions that should be annotated instead:
> 
> void *
> kmem_alloc(size_t size, xfs_km_flags_t flags)
> {
> 	int	retries = 0;
> 	gfp_t	lflags = kmem_flags_convert(flags);
> 	void	*ptr;
> 
> 	do {
> 		ptr = kmalloc(size, lflags);
> 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> 			return ptr;
> 		if (!(++retries % 100))
> 			xfs_err(NULL,
> 		"possible memory allocation deadlock in %s (mode:0x%x)",
> 					__func__, lflags);
> 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> 	} while (1);
> }
> 
> This should use __GFP_NOFAIL, which is not only designed to annotate
> broken code like this, but also recognizes that endless looping on a
> GFP_NOFS allocation needs the OOM killer after all to make progress.
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index a7a3a63bb360..17ced1805d3a 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
>  void *
>  kmem_alloc(size_t size, xfs_km_flags_t flags)
>  {
> -	int	retries = 0;
>  	gfp_t	lflags = kmem_flags_convert(flags);
> -	void	*ptr;
>  
> -	do {
> -		ptr = kmalloc(size, lflags);
> -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> -			return ptr;
> -		if (!(++retries % 100))
> -			xfs_err(NULL,
> -		"possible memory allocation deadlock in %s (mode:0x%x)",
> -					__func__, lflags);
> -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> -	} while (1);
> +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> +		lflags |= __GFP_NOFAIL;
> +
> +	return kmalloc(size, lflags);
>  }

Hmmm - the only reason there is a focus on this loop is that it
emits warnings about allocations failing. It's obvious that the
problem being dealt with here is a fundamental design issue w.r.t.
locking and the OOM killer, but the proposed special casing
hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
in XFS started emitting warnings about allocations failing more
often.

So the answer is to remove the warning?  That's like killing the
canary to stop the methane leak in the coal mine. No canary? No
problems!

Right now, the oom killer is a liability. Over the past 6 months
I've slowly had to exclude filesystem regression tests from running
on small memory machines because the OOM killer is now so unreliable
that it kills the test harness regularly rather than the process
generating memory pressure. That's a big red flag to me that all
this hacking around the edges is not solving the underlying problem,
but instead is breaking things that did once work.

And, well, then there's this (gfp.h):

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.  This modifier is deprecated and no new
 * users should be added.

So, is this another policy revelation from the mm developers about
the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
Or just another symptom of frantic thrashing because nobody actually
understands the problem or those that do are unwilling to throw out
the broken crap and redesign it?

If you are changing allocator behaviour and constraints, then you
better damn well think those changes through fully, then document
those changes, change all the relevant code to use the new API (not
just those that throw warnings in your face) and make sure
*everyone* knows about it. e.g. a LWN article explaining the changes
and how memory allocation is going to work into the future would be
a good start.

Otherwise, this just looks like another knee-jerk band aid for an
architectural problem that needs more than special case hacks to
solve.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 22:54             ` How to handle TIF_MEMDIE stalls? Dave Chinner
@ 2015-02-17 23:32               ` Dave Chinner
  2015-02-18  8:25               ` Michal Hocko
  2015-02-19 10:24               ` Johannes Weiner
  2 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-17 23:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, akpm, torvalds

On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote:
> On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote:
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >                 /* The OOM killer does not needlessly kill tasks for lowmem */
> > >                 if (high_zoneidx < ZONE_NORMAL)
> > >                         goto out;
> > > -               /* The OOM killer does not compensate for light reclaim */
> > > -               if (!(gfp_mask & __GFP_FS))
> > > -                       goto out;
> > >                 /*
> > >                  * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >                  * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > 
> > Again, we don't want to OOM kill on behalf of allocations that can't
> > initiate IO, or even actively prevent others from doing it.  Not per
> > default anyway, because most callers can deal with the failure without
> > having to resort to killing tasks, and NOFS reclaim *can* easily fail.
> > It's the exceptions that should be annotated instead:
> > 
> > void *
> > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > {
> > 	int	retries = 0;
> > 	gfp_t	lflags = kmem_flags_convert(flags);
> > 	void	*ptr;
> > 
> > 	do {
> > 		ptr = kmalloc(size, lflags);
> > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > 			return ptr;
> > 		if (!(++retries % 100))
> > 			xfs_err(NULL,
> > 		"possible memory allocation deadlock in %s (mode:0x%x)",
> > 					__func__, lflags);
> > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > 	} while (1);
> > }
> > 
> > This should use __GFP_NOFAIL, which is not only designed to annotate
> > broken code like this, but also recognizes that endless looping on a
> > GFP_NOFS allocation needs the OOM killer after all to make progress.
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a7a3a63bb360..17ced1805d3a 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  void *
> >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> > -	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >  
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > -					__func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
> 
> Hmmm - the only reason there is a focus on this loop is that it
> emits warnings about allocations failing. It's obvious that the
> problem being dealt with here is a fundamental design issue w.r.t.
> locking and the OOM killer, but the proposed special casing
> hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> in XFS started emitting warnings about allocations failing more
> often.
>
> So the answer is to remove the warning?  That's like killing the
> canary to stop the methane leak in the coal mine. No canary? No
> problems!

I'll also point out that there are two other identical allocation
loops in XFS, one of which is only 30 lines below this one. That's
further indication that this is a "silence the warning" patch rather
than something that actually fixes a problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 22:54             ` How to handle TIF_MEMDIE stalls? Dave Chinner
  2015-02-17 23:32               ` Dave Chinner
@ 2015-02-18  8:25               ` Michal Hocko
  2015-02-18 10:48                 ` Dave Chinner
  2015-02-19 10:24               ` Johannes Weiner
  2 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-02-18  8:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> [ cc xfs list - experienced kernel devs should not have to be
> reminded to do this ]
> 
> On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
[...]
> > void *
> > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > {
> > 	int	retries = 0;
> > 	gfp_t	lflags = kmem_flags_convert(flags);
> > 	void	*ptr;
> > 
> > 	do {
> > 		ptr = kmalloc(size, lflags);
> > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > 			return ptr;
> > 		if (!(++retries % 100))
> > 			xfs_err(NULL,
> > 		"possible memory allocation deadlock in %s (mode:0x%x)",
> > 					__func__, lflags);
> > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > 	} while (1);
> > }
> > 
> > This should use __GFP_NOFAIL, which is not only designed to annotate
> > broken code like this, but also recognizes that endless looping on a
> > GFP_NOFS allocation needs the OOM killer after all to make progress.
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a7a3a63bb360..17ced1805d3a 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  void *
> >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> > -	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >  
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > -					__func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
> 
> Hmmm - the only reason there is a focus on this loop is that it
> emits warnings about allocations failing.

Such a warning should be part of the allocator; the whole reason I
like the patch is that we should really warn in a single place. I was
thinking about a simple warning (e.g. like the above) and having
something more sophisticated when lockdep is enabled.
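A single central warning site might be sketched like this in userspace C; fake_alloc(), the fail_budget knob and the warn counter are all made up for illustration, and the real change would live in the page allocator slowpath:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Test knob: how many times the stand-in allocator fails before it
 * starts succeeding again. */
static int fail_budget;

static void *fake_alloc(size_t size)
{
	if (fail_budget > 0) {
		fail_budget--;
		return NULL;
	}
	return malloc(size);
}

/* One central must-not-fail retry loop: the "possible allocation
 * deadlock" warning is issued here, in a single place, rather than
 * being open-coded in every subsystem. */
static int warn_count;

static void *alloc_retry(size_t size)
{
	int retries = 0;
	void *ptr;

	for (;;) {
		ptr = fake_alloc(size);
		if (ptr)
			return ptr;
		if (!(++retries % 100)) {	/* same cadence as the XFS loop */
			warn_count++;
			fprintf(stderr,
				"possible allocation deadlock (%d tries)\n",
				retries);
		}
	}
}
```

With the warning centralised like this, diagnostics (e.g. a lockdep-aware variant) only need to be taught one call site.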

> It's obvious that the
> problem being dealt with here is a fundamental design issue w.r.t.
> locking and the OOM killer, but the proposed special casing
> hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> in XFS started emitting warnings about allocations failing more
> often.
> 
> So the answer is to remove the warning?  That's like killing the
> canary to stop the methane leak in the coal mine. No canary? No
> problems!

Not at all. I cannot speak for Johannes but I am pretty sure his
motivation wasn't to simply silence the warning. The thing is that no
kernel code path other than the page allocator should emulate
behavior for which we have a gfp flag.

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure.

It would be great to get bug reports.

> That's a big red flag to me that all
> this hacking around the edges is not solving the underlying problem,
> but instead is breaking things that did once work.

I am heavily trying to discourage people from adding random hacks to
the already complicated and subtle OOM code.

> And, well, then there's this (gfp.h):
> 
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.  This modifier is deprecated and no new
>  * users should be added.
> 
> So, is this another policy revelation from the mm developers about
> the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?

It is deprecated and shouldn't be used. But that doesn't mean that users
should work around this by developing their own alternative. I agree the
wording could be clearer and mention that if the allocation failure
is absolutely unacceptable then the flag can be used rather than
open-coding a retry loop around the allocator. What do you think about
the following?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b840e3b2770d..ee6440ccb75d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -57,8 +57,12 @@ struct vm_area_struct;
  * _might_ fail.  This depends upon the particular VM implementation.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures.  This modifier is deprecated and no new
- * users should be added.
+ * cannot handle allocation failures.  This modifier is deprecated for allocations
+ * with order > 1. Besides that, this modifier is very dangerous when the allocation
+ * happens under a lock because it creates a lock dependency invisible to the
+ * OOM killer, so it can livelock. If the allocation failure is _absolutely_
+ * unacceptable then this flag has to be used rather than looping around the
+ * allocator.
  *
  * __GFP_NORETRY: The VM implementation must not retry indefinitely.
  *

> Or just another symptom of frantic thrashing because nobody actually
> understands the problem or those that do are unwilling to throw out
> the broken crap and redesign it?
> 
> If you are changing allocator behaviour and constraints, then you
> better damn well think those changes through fully, then document
> those changes, change all the relevant code to use the new API (not
> just those that throw warnings in your face) and make sure
> *everyone* knows about it. e.g. a LWN article explaining the changes
> and how memory allocation is going to work into the future would be
> a good start.

Well, I think the first step is to change the users of the allocator
to not lie about gfp flags. So if the code is retrying infinitely then
it really should use the __GFP_NOFAIL flag.  In the meantime the page
allocator should develop proper diagnostics to help identify all the
potential dependencies. Next we should start thinking about whether all
the existing __GFP_NOFAIL paths are really necessary or whether the code
can be refactored/reimplemented to accept allocation failures.

> Otherwise, this just looks like another knee-jerk band aid for an
> architectural problem that needs more than special case hacks to
> solve.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18  8:25               ` Michal Hocko
@ 2015-02-18 10:48                 ` Dave Chinner
  2015-02-18 12:16                   ` Michal Hocko
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-02-18 10:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> > [ cc xfs list - experienced kernel devs should not have to be
> > reminded to do this ]
> > 
> > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> [...]
> > > void *
> > > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > > {
> > > 	int	retries = 0;
> > > 	gfp_t	lflags = kmem_flags_convert(flags);
> > > 	void	*ptr;
> > > 
> > > 	do {
> > > 		ptr = kmalloc(size, lflags);
> > > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > > 			return ptr;
> > > 		if (!(++retries % 100))
> > > 			xfs_err(NULL,
> > > 		"possible memory allocation deadlock in %s (mode:0x%x)",
> > > 					__func__, lflags);
> > > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > 	} while (1);
> > > }
> > > 
> > > This should use __GFP_NOFAIL, which is not only designed to annotate
> > > broken code like this, but also recognizes that endless looping on a
> > > GFP_NOFS allocation needs the OOM killer after all to make progress.
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index a7a3a63bb360..17ced1805d3a 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > >  void *
> > >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> > >  {
> > > -	int	retries = 0;
> > >  	gfp_t	lflags = kmem_flags_convert(flags);
> > > -	void	*ptr;
> > >  
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > > -					__func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > >  }
> > 
> > Hmmm - the only reason there is a focus on this loop is that it
> > emits warnings about allocations failing.
> 
> Such a warning should be part of the allocator; the whole reason I
> like the patch is that we should really warn in a single place. I was
> thinking about a simple warning (e.g. like the above) and having
> something more sophisticated when lockdep is enabled.
> 
> > It's obvious that the
> > problem being dealt with here is a fundamental design issue w.r.t.
> > locking and the OOM killer, but the proposed special casing
> > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> > in XFS started emitting warnings about allocations failing more
> > often.
> > 
> > So the answer is to remove the warning?  That's like killing the
> > canary to stop the methane leak in the coal mine. No canary? No
> > problems!
> 
> Not at all. I cannot speak for Johannes but I am pretty sure his
> motivation wasn't to simply silence the warning. The thing is that no
> kernel code path other than the page allocator should emulate
> behavior for which we have a gfp flag.
> 
> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure.
> 
> It would be great to get bug reports.

I thought we were talking about a manifestation of the problems I've
been seeing....

> > That's a big red flag to me that all
> > this hacking around the edges is not solving the underlying problem,
> > but instead is breaking things that did once work.
> 
> I am heavily trying to discourage people from adding random hacks to
> the already complicated and subtle OOM code.
> 
> > And, well, then there's this (gfp.h):
> > 
> >  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >  * cannot handle allocation failures.  This modifier is deprecated and no new
> >  * users should be added.
> > 
> > So, is this another policy revelation from the mm developers about
> > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> 
> It is deprecated and shouldn't be used. But that doesn't mean that users
> should work around this by developing their own alternative.

I'm kinda sick of hearing that, as if saying it enough times will
make reality change. We have a *hard requirement* for memory
allocation to make forwards progress, otherwise we *fail
catastrophically*.

History lesson - June 2004:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b30a2f7bf90593b12dbc912e4390b1b8ee133ea9

So, we're hardly working around the deprecation of GFP_NOFAIL when
the code existed 5 years before GFP_NOFAIL was deprecated. Indeed,
GFP_NOFAIL was shiny and new back then, having been introduced by
Andrew Morton back in 2003.

> I agree the
> wording could be clearer and mention that if the allocation failure
> is absolutely unacceptable then the flag can be used rather than
> open-coding a retry loop around the allocator. What do you think about
> the following?
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b840e3b2770d..ee6440ccb75d 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -57,8 +57,12 @@ struct vm_area_struct;
>   * _might_ fail.  This depends upon the particular VM implementation.
>   *
>   * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> - * cannot handle allocation failures.  This modifier is deprecated and no new
> - * users should be added.
> + * cannot handle allocation failures.  This modifier is deprecated for allocations
> + * with order > 1. Besides that, this modifier is very dangerous when the allocation
> + * happens under a lock because it creates a lock dependency invisible to the
> + * OOM killer, so it can livelock. If the allocation failure is _absolutely_
> + * unacceptable then this flag has to be used rather than looping around the
> + * allocator.

Doesn't change anything from an XFS point of view. We do order > 1
allocations through the kmem_alloc() wrapper, and so we are still doing
something that is "not supported" even if we use GFP_NOFAIL rather
than our own loop.

Also, this reads as an excuse for leaving the OOM killer broken
rather than fixing it.  Keep in mind that we tell the memory alloc/reclaim
subsystem that *we hold locks* when we call into it. That's what
GFP_NOFS originally meant, and it's what it still means today in an
XFS context.

If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
that the invoking context holds, then that is an OOM killer bug, not
a bug in the subsystem calling kmalloc(GFP_NOFS).

>   *
>   * __GFP_NORETRY: The VM implementation must not retry indefinitely.
>   *
> 
> > Or just another symptom of frantic thrashing because nobody actually
> > understands the problem or those that do are unwilling to throw out
> > the broken crap and redesign it?
> > 
> > If you are changing allocator behaviour and constraints, then you
> > better damn well think those changes through fully, then document
> > those changes, change all the relevant code to use the new API (not
> > just those that throw warnings in your face) and make sure
> > *everyone* knows about it. e.g. a LWN article explaining the changes
> > and how memory allocation is going to work into the future would be
> > a good start.
> 
> Well, I think the first step is to change the users of the allocator
> to not lie about gfp flags. So if the code is retrying infinitely then
> it really should use the __GFP_NOFAIL flag.

That's a complete non-issue when it comes to deciding whether it is
safe to invoke the OOM killer or not!

> In the meantime the page allocator
> should develop proper diagnostics to help identify all the potential
> dependencies. Next we should start thinking about whether all the
> existing __GFP_NOFAIL paths are really necessary or whether the code
> can be refactored/reimplemented to accept allocation failures.

Last time the "just make filesystems handle memory allocation
failures" discussion came up, I pointed out what that meant for XFS:
dirty transaction rollback is required. That's freakin' complex, will
double the memory footprint of transactions, roughly double the CPU
cost, and greatly increase the complexity of the transaction
subsystem. It's a *major* rework of a significant amount of the XFS
codebase and will take at least a couple of years to design, test and
stabilise before it could be rolled out to production.

I'm not about to spend a couple of years rewriting XFS just so the
VM can get rid of a GFP_NOFAIL user. Especially as we already
tell the Hammer of Last Resort the context in which it can work.

Move the OOM killer to kswapd - get it out of the direct reclaim
path altogether. If the system is that backed up on locks that it
cannot free any memory and has no reserves to satisfy the allocation
that kicked the OOM killer, then the OOM killer was not invoked soon
enough.

Hell, if you want a better way to proceed, then how about you allow
us to tell the MM subsystem how much memory reserve a specific set
of operations is going to require to complete? That's something that
we can do rough calculations for, and it integrates straight into
the existing transaction reservation system we already use for log
space and disk space, and we can tell the mm subsystem when the
reserve is no longer needed (i.e. last thing in transaction commit).

That way we don't start a transaction until the mm subsystem has
reserved enough pages for us to work with, and the reserve only
needs to be used when normal allocation has already failed, i.e.
rather than looping we get a page allocated from the reserve pool.

The reservations wouldn't be perfect, but the majority of the time
we'd be able to make progress and not need the OOM killer. And best
of all, there's no responsibility on the MM subsystem for preventing
OOM - getting the reservations right is the responsibility of the
subsystem using them.
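Roughly, that accounting could be modelled like this in userspace C; mm_reserve(), mm_unreserve(), alloc_page_reserved() and the page counts are invented for illustration, not a proposed kernel API:

```c
#include <assert.h>

/* Hypothetical page-reserve accounting, invented for illustration:
 * a transaction reserves pages up front, allocations dip into the
 * reserve only when the normal pool is empty, and the remainder is
 * returned at commit. */
static long free_pages = 8;	/* pages the "normal" allocator has */
static long reserved_pages;	/* pages promised to transactions */

/* Called before the transaction starts; fails (so the caller waits)
 * rather than overcommitting. */
static int mm_reserve(long pages)
{
	if (free_pages < pages)
		return -1;
	free_pages -= pages;
	reserved_pages += pages;
	return 0;
}

/* Called at transaction commit to return whatever is left. */
static void mm_unreserve(long pages)
{
	reserved_pages -= pages;
	free_pages += pages;
}

/* Allocation tries the normal pool first and falls back to the
 * caller's reserve instead of looping or invoking the OOM killer. */
static int alloc_page_reserved(void)
{
	if (free_pages > 0) {
		free_pages--;
		return 0;
	}
	if (reserved_pages > 0) {
		reserved_pages--;
		return 0;
	}
	return -1;
}
```

The design choice mirrored here is that admission control happens at mm_reserve() time, before any locks are taken, so a transaction that has started can always complete from its own reserve.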

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 10:48                 ` Dave Chinner
@ 2015-02-18 12:16                   ` Michal Hocko
  2015-02-18 21:31                     ` Dave Chinner
  2015-02-19 11:01                     ` Johannes Weiner
  0 siblings, 2 replies; 83+ messages in thread
From: Michal Hocko @ 2015-02-18 12:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
[...]
> Also, this reads as an excuse for leaving the OOM killer broken
> rather than fixing it.  Keep in mind that we tell the memory alloc/reclaim
> subsystem that *we hold locks* when we call into it. That's what
> GFP_NOFS originally meant, and it's what it still means today in an
> XFS context.

Sure, and the OOM killer will not be invoked in a NOFS context. See
__alloc_pages_may_oom and the __GFP_FS check in there. So I do not see
where the OOM killer is broken.

The crucial problem we are dealing with is not GFP_NOFAIL triggering the
OOM killer but a lock dependency introduced by the following sequence:

	taskA			taskB			taskC
lock(A)							alloc()
alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
# looping for ever if we				    select_bad_process
# cannot make any progress				      victim = taskB

There is no way the OOM killer can tell that taskB is blocked or that
there is a dependency between A and B (without lockdep). That is why I
consider NOFAIL under a lock dangerous and a bug.
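To make that dependency concrete, here is a toy wait-for graph in userspace C (the numbering is invented: node 0 is taskA, node 1 is taskB, node 2 is the OOM kill done on behalf of taskC). With only the edge the OOM killer can see, namely that it waits for its victim taskB, there is no cycle; add the lock edges that lockdep would know about and the livelock shows up as a cycle:

```c
#include <assert.h>

#define NNODES 3

/* waits_on[i] is the node that node i waits on, or -1 for none. */
static int has_cycle(const int waits_on[NNODES])
{
	for (int start = 0; start < NNODES; start++) {
		int seen[NNODES] = { 0 };
		int n = start;

		/* Follow the single outgoing edge until we fall off
		 * the graph or revisit a node. */
		while (n >= 0 && !seen[n]) {
			seen[n] = 1;
			n = waits_on[n];
		}
		if (n >= 0)
			return 1;	/* revisited a node: deadlock cycle */
	}
	return 0;
}
```

The "invisible" part of the problem is exactly that the kernel only ever records some of these edges at runtime; without the lock-held edges the graph looks acyclic.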

> If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
> that the invoking context holds, then that is an OOM killer bug, not
> a bug in the subsystem calling kmalloc(GFP_NOFS).

I guess we are talking about different things here or what am I missing?
 
[...]
> > In the meantime page allocator
> > should develop a proper diagnostic to help identify all the potential
> > dependencies. Next we should start thinking whether all the existing
> > GFP_NOFAIL paths are really necessary or the code can be
> > refactored/reimplemented to accept allocation failures.
> 
> Last time the "just make filesystems handle memory allocation
> failures" discussion came up, I pointed out what that meant for XFS:
> dirty transaction rollback is required. That's freakin' complex, will
> double the memory footprint of transactions, roughly double the CPU
> cost, and greatly increase the complexity of the transaction
> subsystem. It's a *major* rework of a significant amount of the XFS
> codebase and will take at least a couple of years to design, test and
> stabilise before it could be rolled out to production.
> 
> I'm not about to spend a couple of years rewriting XFS just so the
> VM can get rid of a GFP_NOFAIL user. Especially as we already
> tell the Hammer of Last Resort the context in which it can work.
> 
> Move the OOM killer to kswapd - get it out of the direct reclaim
> path altogether.

This doesn't change anything, as explained in the other email. The
triggering path doesn't wait for the victim to die.

> If the system is that backed up on locks that it
> cannot free any memory and has no reserves to satisfy the allocation
> that kicked the OOM killer, then the OOM killer was not invoked soon
> enough.
> 
> Hell, if you want a better way to proceed, then how about you allow
> us to tell the MM subsystem how much memory reserve a specific set
> of operations is going to require to complete? That's something that
> we can do rough calculations for, and it integrates straight into
> the existing transaction reservation system we already use for log
> space and disk space, and we can tell the mm subsystem when the
> reserve is no longer needed (i.e. last thing in transaction commit).
> 
> That way we don't start a transaction until the mm subsystem has
> reserved enough pages for us to work with, and the reserve only
> needs to be used when normal allocation has already failed. i.e
> rather than looping we get a page allocated from the reserve pool.

I am not sure I understand the above, but aren't mempools a tool for
this purpose?
 
> The reservations wouldn't be perfect, but the majority of the time
> we'd be able to make progress and not need the OOM killer. And best
> of all, there's no responsibilty on the MM subsystem for preventing
> OOM - getting the reservations right is the responsibiity of the
> subsystem using them.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 12:16                   ` Michal Hocko
@ 2015-02-18 21:31                     ` Dave Chinner
  2015-02-19  9:40                       ` Michal Hocko
  2015-02-19 11:01                     ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-02-18 21:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> [...]
> > Also, this reads as an excuse for the OOM killer being broken and
> > not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> > subsystem that *we hold locks* when we call into it. That's what
> > GFP_NOFS originally meant, and it's what it still means today in an
> > XFS context.
> 
> Sure, and the OOM killer will not be invoked in NOFS context. See
> __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see
> where the OOM killer is broken.

I suspect that the page cache missing the correct GFP_NOFS was one
of the sources of the problems I've been seeing.

However, the oom killer exceptions are not checked if __GFP_NOFAIL
is present and so if we start using __GFP_NOFAIL then it will be
called in GFP_NOFS contexts...
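A toy model of that gating (flag values invented; only the decision
structure mirrors the __alloc_pages_may_oom() behaviour described
above):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, invented flag values: the !__GFP_FS bail-out is skipped
 * when __GFP_NOFAIL is set, so a NOFS|NOFAIL allocation can still end
 * up invoking the OOM killer. */
#define T_GFP_FS	0x1u
#define T_GFP_NOFAIL	0x2u

static bool may_invoke_oom_killer(unsigned int gfp_mask)
{
	/* The OOM killer does not compensate for light (NOFS) reclaim... */
	if (!(gfp_mask & T_GFP_FS) && !(gfp_mask & T_GFP_NOFAIL))
		return false;
	/* ...unless the allocation is not allowed to fail at all. */
	return true;
}
```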

> The crucial problem we are dealing with is not GFP_NOFAIL triggering the
> OOM killer but a lock dependency introduced by the following sequence:
> 
> 	taskA			taskB			taskC
> lock(A)							alloc()
> alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
> # looping for ever if we				    select_bad_process
> # cannot make any progress				      victim = taskB
> 
> There is no way OOM killer can tell taskB is blocked and that there is
> dependency between A and B (without lockdep). That is why I call NOFAIL
> under a lock as dangerous and a bug.

Sure. However, eventually the OOM killer will select taskA to be
killed because nothing else is working. That, at least, marks
taskA with TIF_MEMDIE and gives us a potential way to break the
deadlock.

But the bigger problem is this:

	taskA			taskB
lock(A)
alloc(GFP_NOFS|GFP_NOFAIL)		lock(A)
  out_of_memory
    select_bad_process
      victim = taskB

Because there is no way to *ever* resolve that dependency: taskA
never leaves the allocator. Even if the oom killer selects
taskA and sets TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE
because GFP_NOFAIL is set and continues to loop.

This is why GFP_NOFAIL is not a solution to the "never fail"
allocation problem. The caller doing the "no fail" allocation _must
be able to set failure policy_. i.e. the choice of aborting and
shutting down because progress cannot be made, or continuing and
hoping for forward progress, is owned by the allocating context, not
the allocator.  The memory allocation subsystem cannot make that
choice for us as it has no concept of the failure characteristics of
the allocating context.
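As a sketch of what "the caller owns the failure policy" could look
like, reduced to plain userspace C (all names invented for
illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: the caller, not the allocator, owns the failure
 * policy.  'policy' is consulted after every failed attempt and decides
 * whether to keep hoping for forward progress or to abort and run the
 * caller's fallback (e.g. shut the filesystem down). */
typedef bool (*retry_policy_t)(int attempts, void *ctx);

static void *alloc_with_policy(size_t size, void *(*try_alloc)(size_t),
			       retry_policy_t policy, void *ctx)
{
	int attempts = 0;

	for (;;) {
		void *p = try_alloc(size);
		if (p)
			return p;
		if (!policy(++attempts, ctx))
			return NULL;  /* caller chose to abort */
	}
}

/* Stand-in for an allocator under memory pressure that resolves after
 * a few attempts. */
static int fail_budget;
static void *flaky_alloc(size_t size)
{
	if (fail_budget-- > 0)
		return NULL;
	return malloc(size);
}

/* One possible caller policy: bounded retries. */
static bool retry_limit(int attempts, void *ctx)
{
	return attempts < *(int *)ctx;
}

static bool demo_retry_until_success(void)
{
	int limit = 10;
	void *p;
	bool ok;

	fail_budget = 3;	/* pressure resolves before the limit */
	p = alloc_with_policy(16, flaky_alloc, retry_limit, &limit);
	ok = p != NULL;
	free(p);
	return ok;
}

static bool demo_caller_aborts(void)
{
	int limit = 5;

	fail_budget = 1000;	/* pressure never resolves in time */
	return alloc_with_policy(16, flaky_alloc, retry_limit, &limit) == NULL;
}
```

The point of the sketch is only that the retry/abort decision lives in
the caller-supplied policy, not inside the allocator loop.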

The situations in which this actually matters are extremely *rare* -
we've had these allocation loops in XFS for > 13 years, and we might
get one or two reports a year of these "possible allocation
deadlock" messages occurring. Changing *everything* for such a rare,
unusual event is not an efficient use of time or resources.

> > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
> > that the invoking context holds, then that is a OOM killer bug, not
> > a bug in the subsystem calling kmalloc(GFP_NOFS).
> 
> I guess we are talking about different things here or what am I missing?

From my perspective, you are tightly focussed on one aspect of the
problem and hence are not seeing the bigger picture: this is a
corner case of behaviour in a "last hope", brute force memory
reclaim technique that no production machine relies on for correct
or performant operation.

> [...]
> > > In the meantime page allocator
> > > should develop a proper diagnostic to help identify all the potential
> > > dependencies. Next we should start thinking whether all the existing
> > > GFP_NOFAIL paths are really necessary or the code can be
> > > refactored/reimplemented to accept allocation failures.
> > 
> > Last time the "just make filesystems handle memory allocation
> > failures" I pointed out what that meant for XFS: dirty transaction
> > rollback is required. That's freakin' complex, will double the
> > memory footprint of transactions, roughly double the CPU cost, and
> > greatly increase the complexity of the transaction subsystem. It's a
> > *major* rework of a significant amount of the XFS codebase and will
> > take at least a couple of years design, test and stabilise before
> > it could be rolled out to production.
> > 
> > I'm not about to spend a couple of years rewriting XFS just so the
> > VM can get rid of a GFP_NOFAIL user. Especially as the we already
> > tell the Hammer of Last Resort the context in which it can work.
> > 
> > Move the OOM killer to kswapd - get it out of the direct reclaim
> > path altogether.
> 
> This doesn't change anything as explained in other email. The triggering
> path doesn't wait for the victim to die.

But it does - we wouldn't be talking about deadlocks if there were
no blocking dependencies. In this case, allocation keeps retrying
until the memory freed by the killed tasks enables it to make
forward progress. That's a side effect of the last revelation made
in this thread: that low order allocations never fail...

> > If the system is that backed up on locks that it
> > cannot free any memory and has no reserves to satisfy the allocation
> > that kicked the OOM killer, then the OOM killer was not invoked soon
> > enough.
> > 
> > Hell, if you want a better way to proceed, then how about you allow
> > us to tell the MM subsystem how much memory reserve a specific set
> > of operations is going to require to complete? That's something that
> > we can do rough calculations for, and it integrates straight into
> > the existing transaction reservation system we already use for log
> > space and disk space, and we can tell the mm subsystem when the
> > reserve is no longer needed (i.e. last thing in transaction commit).
> > 
> > That way we don't start a transaction until the mm subsystem has
> > reserved enough pages for us to work with, and the reserve only
> > needs to be used when normal allocation has already failed. i.e
> > rather than looping we get a page allocated from the reserve pool.
> 
> I am not sure I understand the above but isn't the mempools a tool for
> this purpose?

I knew this question would be the next one - I even deleted a one
line comment from my last email that said "And no, mempools are not
a solution" because that needs a more thorough explanation than a
dismissive one-liner.

As you know, mempools require a forward progress guarantee on a
single type of object and the objects must be slab based.

In transaction context we allocate from inode slabs, xfs_buf slabs,
log item slabs (6 different ones, IIRC), btree cursor slabs, etc,
but then we also have direct page allocations for buffers, vm_map_ram()
for mapping multi-page buffers, uncounted heap allocations, etc.
We cannot make all of these mempools, nor can we meet the forward
progress requirements of a mempool, because other allocations can
block and prevent progress.

Further, the objects have lifetimes that don't correspond to the
transaction life cycles, and hence even if we complete the
transaction there is no guarantee that the objects allocated within
a transaction are going to be returned to the mempool at its
completion.

IOWs, we need forward allocation progress guarantees on
(potentially) several megabytes of allocations from slab caches, the
heap and the page allocator, with all allocations in unpredictable
order, with objects of different lifetimes and life cycles, and
which may, at any time, get stuck behind objects locked in other
transactions and hence can randomly block until some other thread
makes forward progress and completes a transaction and unlocks the
object.
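For reference, here is roughly what a mempool does guarantee, reduced
to a toy userspace sketch (invented names, not the kernel API): a
pre-filled reserve of a *single* object type, whose forward progress
rests on every object coming back to the pool promptly - exactly the
assumption that multi-type, long-lived transaction allocations break:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal mempool sketch: a pre-filled reserve of ONE object type. */
struct toy_mempool {
	void **elems;
	int nr, capacity;
	size_t obj_size;
};

static struct toy_mempool *toy_mempool_create(int min_nr, size_t obj_size)
{
	struct toy_mempool *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->elems = malloc(min_nr * sizeof(void *));
	p->capacity = min_nr;
	p->obj_size = obj_size;
	for (p->nr = 0; p->nr < min_nr; p->nr++)
		p->elems[p->nr] = malloc(obj_size);
	return p;
}

/* Normal allocation first, reserve second.  The real mempool_alloc
 * would sleep here instead of returning NULL, which is why progress
 * depends on objects being returned to the pool. */
static void *toy_mempool_alloc(struct toy_mempool *p,
			       void *(*try_alloc)(size_t))
{
	void *obj = try_alloc(p->obj_size);

	if (obj)
		return obj;
	return p->nr > 0 ? p->elems[--p->nr] : NULL;
}

static void toy_mempool_free(struct toy_mempool *p, void *obj)
{
	if (p->nr < p->capacity)
		p->elems[p->nr++] = obj;
	else
		free(obj);
}

static void *oom_alloc(size_t size)
{
	(void)size;
	return NULL;		/* simulate total memory exhaustion */
}

static bool demo_single_type_guarantee(void)
{
	struct toy_mempool *p = toy_mempool_create(2, 64);
	void *a = toy_mempool_alloc(p, oom_alloc);
	void *b = toy_mempool_alloc(p, oom_alloc);
	void *c = toy_mempool_alloc(p, oom_alloc);  /* reserve empty */

	toy_mempool_free(p, a);			    /* object comes back */
	return a && b && !c && toy_mempool_alloc(p, oom_alloc) != NULL;
}
```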

The reservation would only need to cover the memory we need to
allocate and hold in the transaction (i.e. dirtied objects). There
are potentially unbounded amounts of memory required through demand
paging of buffers to find the metadata we need to modify, but demand
paged metadata that is read and then released is recoverable. i.e.
the shrinkers will free it as other memory demand requires, so it's
not included in reservation pools because it doesn't deplete the
amount of free memory.
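A toy sketch of the reservation scheme described above (all names are
invented; malloc stands in for handing out a page from the reserve):
reserve the transaction's worst case up front, draw on the reserve
only when normal allocation fails, and release the unused remainder as
the last step of commit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static size_t mm_reserve_free = 1 << 20;   /* global reserve, 1 MiB */

struct xact_resv {
	size_t remaining;	/* reserved bytes not yet consumed */
};

/* Transaction start: fail (or block) before any object is dirtied if
 * the worst-case reservation cannot be granted. */
static bool xact_reserve(struct xact_resv *r, size_t worst_case)
{
	if (worst_case > mm_reserve_free)
		return false;
	mm_reserve_free -= worst_case;
	r->remaining = worst_case;
	return true;
}

/* Allocation inside the transaction: the reserve is a fallback, so it
 * is only consumed under memory pressure. */
static void *xact_alloc(struct xact_resv *r, size_t size,
			void *(*try_alloc)(size_t))
{
	void *p = try_alloc(size);

	if (p)
		return p;
	if (size > r->remaining)
		return NULL;	/* reservation was undersized: a bug */
	r->remaining -= size;
	return malloc(size);	/* stands in for "take from reserve" */
}

/* Commit: hand back whatever part of the reservation went unused. */
static void xact_commit(struct xact_resv *r)
{
	mm_reserve_free += r->remaining;
	r->remaining = 0;
}

static void *pressured_alloc(size_t size)
{
	(void)size;
	return NULL;		/* normal allocation always failing */
}

static bool demo_reservation(void)
{
	size_t before = mm_reserve_free;
	struct xact_resv r;
	void *p;

	if (!xact_reserve(&r, 4096))
		return false;
	p = xact_alloc(&r, 512, pressured_alloc);  /* from the reserve */
	xact_commit(&r);
	/* 512 bytes consumed, the unused 3584 returned at commit. */
	return p != NULL && mm_reserve_free == before - 512;
}
```

As with the log and disk space grant heads, the reservation is a
worst-case bound, not an exact prediction.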

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 21:31                     ` Dave Chinner
@ 2015-02-19  9:40                       ` Michal Hocko
  2015-02-19 22:03                         ` Dave Chinner
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-02-19  9:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Thu 19-02-15 08:31:18, Dave Chinner wrote:
> On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> > On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> > [...]
> > > Also, this reads as an excuse for the OOM killer being broken and
> > > not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> > > subsystem that *we hold locks* when we call into it. That's what
> > > GFP_NOFS originally meant, and it's what it still means today in an
> > > XFS context.
> > 
> > Sure, and OOM killer will not be invoked in NOFS context. See
> > __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where
> > is the OOM killer broken.
> 
> I suspect that the page cache missing the correct GFP_NOFS was one
> of the sources of the problems I've been seeing.
> 
> However, the oom killer exceptions are not checked if __GFP_NOFAIL

Yes this is true. This is an effect of 9879de7373fc (mm: page_alloc:
embed OOM killing naturally into allocation slowpath) and IMO a
desirable one. Requiring infinite retrying with a seriously restricted
reclaim context calls for trouble (e.g. a livelock with no way out,
because regular reclaim cannot make any progress and the OOM killer,
the last resort, will not be invoked).

> is present and so if we start using __GFP_NOFAIL then it will be
> called in GFP_NOFS contexts...
> 
> > The crucial problem we are dealing with is not GFP_NOFAIL triggering the
> > OOM killer but a lock dependency introduced by the following sequence:
> > 
> > 	taskA			taskB			taskC
> > lock(A)							alloc()
> > alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
> > # looping for ever if we				    select_bad_process
> > # cannot make any progress				      victim = taskB
> > 
> > There is no way OOM killer can tell taskB is blocked and that there is
> > dependency between A and B (without lockdep). That is why I call NOFAIL
> > under a lock as dangerous and a bug.
> 
> Sure. However, eventually the OOM killer will select task A to be
> killed because nothing else is working.

That would require the OOM killer to be able to select another victim
while the current one is still alive. Time-based heuristics have been
suggested to do this, but I do not think they are the right way to
handle the problem; they should be considered only if all other
options fail.

One potential way would be to give GFP_NOFAIL contexts access to
memory reserves when the allocation domain (global/memcg/cpuset) is
OOM. Andrea was suggesting something like that IIRC.

> That, at least, marks
> taskA with TIF_MEMDIE and gives us a potential way to break the
> deadlock.
> 
> But the bigger problem is this:
> 
> 	taskA			taskB
> lock(A)
> alloc(GFP_NOFS|GFP_NOFAIL)		lock(A)
>   out_of_memory
>     select_bad_process
>       victim = taskB
> 
> Because there is no way to *ever* resolve that dependency because
> taskA never leaves the allocator. Even if the oom killer selects
> taskA and set TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE
> because GFP_NOFAIL is set and continues to loop.

TIF_MEMDIE will at least give the task access to memory reserves. Anyway
this is essentially the same category of livelock as above.

> This is why GFP_NOFAIL is not a solution to the "never fail"
> alloation problem. The caller doing the "no fail" allocation _must
> be able to set failure policy_. i.e. the choice of aborting and
> shutting down because progress cannot be made, or continuing and
> hoping for forwards progress is owned by the allocating context, no
> the allocator.

I completely agree that the failure policy is the caller's
responsibility and I would have no objections to something like:

	do {
		ptr = kmalloc(size, GFP_NOFS);
		if (ptr)
			return ptr;
		if (fatal_signal_pending(current))
			break;
		if (looping_too_long())
			break;
	} while (1);

	fallback_solution();

But this is not the case in kmem_alloc, which is essentially a
GFP_NOFAIL allocation with a warning and congestion_wait. There is no
failure policy defined there. The warning should be part of the
allocator and the NOFAIL policy should be explicit. So why exactly do
you oppose changing kmem_alloc (and others which are doing essentially
the same)?

> The memory allocation subsystem cannot make that
> choice for us as it has no concept of the failure characteristics of
> the allocating context.

Of course. I wasn't arguing we should change allocation loops which
have a fallback policy as well. That is an entirely different thing.
My point was that we want to convert GFP_NOFAIL equivalents to use
GFP_NOFAIL so that the allocator can prevent livelocks where possible.

> The situations in which this actually matters are extremely *rare* -
> we've had these allocaiton loops in XFS for > 13 years, and we might
> get a one or two reports a year of these "possible allocation
> deadlock" messages occurring. Changing *everything* for such a rare,
> unusual event is not an efficient use of time or resources.
> 
> > > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks
> > > that the invoking context holds, then that is a OOM killer bug, not
> > > a bug in the subsystem calling kmalloc(GFP_NOFS).
> > 
> > I guess we are talking about different things here or what am I missing?
> 
> From my perspective, you are tightly focussed on one aspect of the
> problem and hence are not seeing the bigger picture: this is a
> corner case of behaviour in a "last hope", brute force memory
> reclaim technique that no production machine relies on for correct
> or performant operation.

Of course this is a corner case. And I am trying to prevent heuristics
which would optimize for such a corner case (multiple of which were
suggested in this thread).

The reason I care about GFP_NOFAIL is that there are apparently code
paths which do not tell the allocator that they are basically
GFP_NOFAIL without any fallback. This leads to two main problems:
1) we do not have a good overview of how many code paths have such
strong requirements, and so cannot estimate e.g. how big memory
reserves should be, and 2) the allocator cannot help those paths
(e.g. by giving them access to reserves to break out of the livelock).

> > [...]
> > > > In the meantime page allocator
> > > > should develop a proper diagnostic to help identify all the potential
> > > > dependencies. Next we should start thinking whether all the existing
> > > > GFP_NOFAIL paths are really necessary or the code can be
> > > > refactored/reimplemented to accept allocation failures.
> > > 
> > > Last time the "just make filesystems handle memory allocation
> > > failures" I pointed out what that meant for XFS: dirty transaction
> > > rollback is required. That's freakin' complex, will double the
> > > memory footprint of transactions, roughly double the CPU cost, and
> > > greatly increase the complexity of the transaction subsystem. It's a
> > > *major* rework of a significant amount of the XFS codebase and will
> > > take at least a couple of years design, test and stabilise before
> > > it could be rolled out to production.
> > > 
> > > I'm not about to spend a couple of years rewriting XFS just so the
> > > VM can get rid of a GFP_NOFAIL user. Especially as the we already
> > > tell the Hammer of Last Resort the context in which it can work.
> > > 
> > > Move the OOM killer to kswapd - get it out of the direct reclaim
> > > path altogether.
> > 
> > This doesn't change anything as explained in other email. The triggering
> > path doesn't wait for the victim to die.
> 
> But it does - we wouldn't be talking about deadlocks if there were
> no blocking dependencies. In this case, allocation keeps retrying
> until the memory freed by the killed tasks enables it to make
> forward progress. That's a side effect of the last relevation that
> was made in this thread that low order allocations never fail...

Sure, low order allocations being almost GFP_NOFAIL makes things much
worse, of course. And this should be changed. We just have to think
about how to do it without breaking the universe. I hope we can
discuss this at LSF.

But even then I do not see how triggering the OOM killer from kswapd
would help here. Victims would be looping in the allocator whether the
actual killing happens from their own or any other context.

> > > If the system is that backed up on locks that it
> > > cannot free any memory and has no reserves to satisfy the allocation
> > > that kicked the OOM killer, then the OOM killer was not invoked soon
> > > enough.
> > > 
> > > Hell, if you want a better way to proceed, then how about you allow
> > > us to tell the MM subsystem how much memory reserve a specific set
> > > of operations is going to require to complete? That's something that
> > > we can do rough calculations for, and it integrates straight into
> > > the existing transaction reservation system we already use for log
> > > space and disk space, and we can tell the mm subsystem when the
> > > reserve is no longer needed (i.e. last thing in transaction commit).
> > > 
> > > That way we don't start a transaction until the mm subsystem has
> > > reserved enough pages for us to work with, and the reserve only
> > > needs to be used when normal allocation has already failed. i.e
> > > rather than looping we get a page allocated from the reserve pool.
> > 
> > I am not sure I understand the above but isn't the mempools a tool for
> > this purpose?
> 
> I knew this question would be the next one - I even deleted a one
> line comment from my last email that said "And no, mempools are not
> a solution" because that needs a more thorough explanation than a
> dismissive one-liner.
> 
> As you know, mempools require a forward progress guarantee on a
> single type of object and the objects must be slab based.
> 
> In transaction context we allocate from inode slabs, xfs_buf slabs,
> log item slabs (6 different ones, IIRC), btree cursor slabs, etc,
> but then we also have direct page allocations for buffers, vm_map_ram()
> for mapping multi-page buffers, uncounted heap allocations, etc.
> We cannot make all of these mempools, nor can me meet the forwards
> progress requirements of a mempool because other allocations can
> block and prevent progress.
> 
> Further, the object have lifetimes that don't correspond to the
> transaction life cycles, and hence even if we complete the
> transaction there is no guarantee that the objects allocated within
> a transaction are going to be returned to the mempool at it's
> completion.
> 
> IOWs, we have need for forward allocation progress guarantees on
> (potentially) several megabytes of allocations from slab caches, the
> heap and the page allocator, with all allocations all in
> unpredictable order, with objects of different life times and life
> cycles, and at which may, at any time, get stuck behind
> objects locked in other transactions and hence can randomly block
> until some other thread makes forward progress and completes a
> transaction and unlocks the object.

Thanks for the clarification; I have to think about it some more,
though. My thinking was that mempools could be used as an emergency
pool of pre-allocated memory which would be used in the non-failing
contexts.

> The reservation would only need to cover the memory we need to
> allocate and hold in the transaction (i.e. dirtied objects). There
> is potentially unbound amounts of memory required through demand
> paging of buffers to find the metadata we need to modify, but demand
> paged metadata that is read and then released is recoverable. i.e
> the shrinkers will free it as other memory demand requires, so it's
> not included in reservation pools because it doesn't deplete the
> amount of free memory.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-17 22:54             ` How to handle TIF_MEMDIE stalls? Dave Chinner
  2015-02-17 23:32               ` Dave Chinner
  2015-02-18  8:25               ` Michal Hocko
@ 2015-02-19 10:24               ` Johannes Weiner
  2015-02-19 22:52                 ` Dave Chinner
  2 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-02-19 10:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote:
> [ cc xfs list - experienced kernel devs should not have to be
> reminded to do this ]
> 
> On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > Johannes Weiner wrote:
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > > > >  		if (high_zoneidx < ZONE_NORMAL)
> > > > >  			goto out;
> > > > >  		/* The OOM killer does not compensate for light reclaim */
> > > > > -		if (!(gfp_mask & __GFP_FS))
> > > > > +		if (!(gfp_mask & __GFP_FS)) {
> > > > > +			/*
> > > > > +			 * XXX: Page reclaim didn't yield anything,
> > > > > +			 * and the OOM killer can't be invoked, but
> > > > > +			 * keep looping as per should_alloc_retry().
> > > > > +			 */
> > > > > +			*did_some_progress = 1;
> > > > >  			goto out;
> > > > > +		}
> > > > 
> > > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?
> > > 
> > > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings
> > > at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm:
> > > page_alloc: embed OOM killing naturally into allocation slowpath" introduced
> > > a regression and below one is the fix.
> > > 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >                 /* The OOM killer does not needlessly kill tasks for lowmem */
> > >                 if (high_zoneidx < ZONE_NORMAL)
> > >                         goto out;
> > > -               /* The OOM killer does not compensate for light reclaim */
> > > -               if (!(gfp_mask & __GFP_FS))
> > > -                       goto out;
> > >                 /*
> > >                  * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >                  * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > 
> > Again, we don't want to OOM kill on behalf of allocations that can't
> > initiate IO, or even actively prevent others from doing it.  Not per
> > default anyway, because most callers can deal with the failure without
> > having to resort to killing tasks, and NOFS reclaim *can* easily fail.
> > It's the exceptions that should be annotated instead:
> > 
> > void *
> > kmem_alloc(size_t size, xfs_km_flags_t flags)
> > {
> > 	int	retries = 0;
> > 	gfp_t	lflags = kmem_flags_convert(flags);
> > 	void	*ptr;
> > 
> > 	do {
> > 		ptr = kmalloc(size, lflags);
> > 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > 			return ptr;
> > 		if (!(++retries % 100))
> > 			xfs_err(NULL,
> > 		"possible memory allocation deadlock in %s (mode:0x%x)",
> > 					__func__, lflags);
> > 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > 	} while (1);
> > }
> > 
> > This should use __GFP_NOFAIL, which is not only designed to annotate
> > broken code like this, but also recognizes that endless looping on a
> > GFP_NOFS allocation needs the OOM killer after all to make progress.
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a7a3a63bb360..17ced1805d3a 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  void *
> >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> > -	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >  
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > -					__func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
> 
> Hmmm - the only reason there is a focus on this loop is that it
> emits warnings about allocations failing. It's obvious that the
> problem being dealt with here is a fundamental design issue w.r.t.
> to locking and the OOM killer, but the proposed special casing
> hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> in XFS started emitting warnings about allocations failing more
> often.
> 
> So the answer is to remove the warning?  That's like killing the
> canary to stop the methane leak in the coal mine. No canary? No
> problems!

That's not what happened.  The patch that affected behavior here
transformed what was an incoherent collection of conditions into
something that has an actual model.  That model is that we don't loop
in the allocator if there are no means of making forward progress.  In
this case, it was GFP_NOFS triggering an early exit from the allocator,
because it's not allowed to invoke the OOM killer by default, and
there is little point in looping and waiting for things to get better
on their own.

So these deadlock warnings are triggered, ironically, by the page
allocator now bailing out of a locked-up state in which it's not making
forward progress.  They don't strike me as a very useful canary in this
case.

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure. That's a big red flag to me that all
> this hacking around the edges is not solving the underlying problem,
> but instead is breaking things that did once work.
> 
> And, well, then there's this (gfp.h):
> 
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.  This modifier is deprecated and no new
>  * users should be added.
> 
> So, is this another policy revelation from the mm developers about
> the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> Or just another symptom of frantic thrashing because nobody actually
> understands the problem or those that do are unwilling to throw out
> the broken crap and redesign it?

Well, understand our dilemma here.  __GFP_NOFAIL is a liability
because it can trap tasks with unknown state and locks in a
potentially never ending loop, and we don't want people to start using
it as a convenient solution to get out of having a fallback strategy.

However, if your entire architecture around a particular allocation is
that failure is not an option at this point, and you can't reasonably
preallocate - although that would always be preferable - then please
do not open code an endless loop around the call to the allocator but
use __GFP_NOFAIL instead so that these callsites are annotated and can
be reviewed.  By giving the allocator this information, it can then
also adjust its behavior, as is the case right here: we don't
usually want to OOM kill for regular GFP_NOFS allocations because
their reclaim powers are weak and we don't want to kill tasks
prematurely.  But if your NOFS allocation cannot fail under any
circumstances, then the OOM killer should very much be employed to
make any kind of forward progress at all for this allocation.  It's
just that the allocator needs to be made aware of this requirement.

So yes, we are wary of __GFP_NOFAIL allocations, but this is an
instance where it's the right way to communicate with the allocator:
it was introduced to replace such open-coded endless loops and to put
the liability of making progress on the allocator, not the caller.

And please understand that this callsite blowing up is a chance to
better the code and behavior here.  Where previously it would just
endlessly loop in the allocator without any means to make progress,
converting it to a __GFP_NOFAIL allocation tells the allocator that
it's fine to use the OOM killer in such an instance, improving the
chances that this caller will actually make headway under heavy load.
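To illustrate the difference with a toy model (invented names, not
kernel code): an open-coded loop around a plain allocation can spin
forever because the allocator never learns it must succeed, while
passing a NOFAIL-style flag lets the allocator itself apply its
last-resort measures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model: TOY_NOFAIL plays the role of __GFP_NOFAIL.  The
 * "allocator" fails under pressure, and has a last-resort measure
 * (standing in for the OOM killer / memory reserves) that it may only
 * apply when the caller has declared it cannot handle failure. */
#define TOY_NOFAIL	0x1u

static bool under_pressure;

static void *reclaim_and_alloc(size_t size)
{
	return under_pressure ? NULL : malloc(size);
}

static void last_resort(void)
{
	under_pressure = false;	/* e.g. an OOM kill frees memory */
}

static void *toy_alloc(size_t size, unsigned int flags)
{
	void *p = reclaim_and_alloc(size);

	if (p || !(flags & TOY_NOFAIL))
		return p;	/* plain callers just see the failure */
	last_resort();		/* NOFAIL: allocator must make progress */
	return reclaim_and_alloc(size);
}

static bool demo_nofail_makes_progress(void)
{
	under_pressure = true;
	/* An open-coded retry loop around toy_alloc(32, 0) would spin
	 * forever here: the allocator never learns it has to succeed. */
	if (toy_alloc(32, 0) != NULL)
		return false;
	return toy_alloc(32, TOY_NOFAIL) != NULL;
}
```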


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-18 12:16                   ` Michal Hocko
  2015-02-18 21:31                     ` Dave Chinner
@ 2015-02-19 11:01                     ` Johannes Weiner
  2015-02-19 12:29                       ` Michal Hocko
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-02-19 11:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> [...]
> > Also, this reads as an excuse for the OOM killer being broken and
> > not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> > subsystem that *we hold locks* when we call into it. That's what
> > GFP_NOFS originally meant, and it's what it still means today in an
> > XFS context.
> 
> Sure, and OOM killer will not be invoked in NOFS context. See
> __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where
> is the OOM killer broken.
> 
> The crucial problem we are dealing with is not GFP_NOFAIL triggering the
> OOM killer but a lock dependency introduced by the following sequence:
> 
> 	taskA			taskB			taskC
> lock(A)							alloc()
> alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
> # looping for ever if we				    select_bad_process
> # cannot make any progress				      victim = taskB

You don't even need taskC here.  taskA could invoke the OOM killer
with lock(A) held, and taskB could get selected as the victim while
trying to acquire lock(A).  It'll get the signal and TIF_MEMDIE and
then wait for lock(A) while taskA is waiting for it to exit.

But it doesn't matter who is doing the OOM killing - if the allocating
task with the lock/state is waiting for the OOM victim to free memory,
and the victim is waiting for the same lock/state, we have a deadlock.

> There is no way OOM killer can tell taskB is blocked and that there is
> dependency between A and B (without lockdep). That is why I call NOFAIL
> under a lock as dangerous and a bug.

You keep ignoring that it's also one of the main use cases of this
flag.  The caller has state that it can't unwind and thus needs the
allocation to succeed.  Chances are somebody else can get blocked up
on that same state.  And when that somebody else is the first choice
of the OOM killer, we're screwed.

This is exactly why I'm proposing that the OOM killer should not wait
indefinitely for its first choice to exit, but ultimately move on and
try other tasks.  There is no other way to resolve this deadlock.

Preferably, we'd get rid of all nofail allocations and replace them
with preallocated reserves.  But this is not going to happen anytime
soon, so what other option do we have than resolving this on the OOM
killer side?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 11:01                     ` Johannes Weiner
@ 2015-02-19 12:29                       ` Michal Hocko
  2015-02-19 12:58                         ` Michal Hocko
                                           ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Michal Hocko @ 2015-02-19 12:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
[...]
> Preferrably, we'd get rid of all nofail allocations and replace them
> with preallocated reserves.  But this is not going to happen anytime
> soon, so what other option do we have than resolving this on the OOM
> killer side?

As I've mentioned in another email, we might give GFP_NOFAIL allocations
access to memory reserves (by giving them __GFP_HIGH). This is still not a
100% solution because the reserves could get depleted, but this risk is there
even with multiple OOM victims. I would still argue that this would be a
better approach because selecting more victims might hit a pathological
case more easily (other victims might be blocked on the very same lock,
for example).

Something like the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8d52ab18fe0d..4b5cf28a13f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	int oom = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2628,6 +2629,15 @@ retry:
 		wake_all_kswapds(order, ac);
 
 	/*
+	 * __GFP_NOFAIL allocations cannot fail but yet the current context
+	 * might be blocking resources needed by the OOM victim to terminate.
+	 * Allow the caller to dive into memory reserves to succeed the
+	 * allocation and break out from a potential deadlock.
+	 */
+	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
+		gfp_mask |= __GFP_HIGH;
+
+	/*
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
@@ -2759,6 +2769,8 @@ retry:
 				goto got_pg;
 			if (!did_some_progress)
 				goto nopage;
+
+			oom++;
 		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:29                       ` Michal Hocko
@ 2015-02-19 12:58                         ` Michal Hocko
  2015-02-19 15:29                           ` Tetsuo Handa
  2015-02-19 13:29                         ` Tetsuo Handa
  2015-02-19 21:43                         ` Dave Chinner
  2 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-02-19 12:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Thu 19-02-15 13:29:14, Michal Hocko wrote:
[...]
> Something like the following.
__GFP_HIGH doesn't seem to be sufficient, so we would need something
slightly different, but the idea is still the same:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8d52ab18fe0d..2d224bbdf8e8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	int oom = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2635,6 +2636,15 @@ retry:
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
 	/*
+	 * __GFP_NOFAIL allocations cannot fail but yet the current context
+	 * might be blocking resources needed by the OOM victim to terminate.
+	 * Allow the caller to dive into memory reserves to succeed the
+	 * allocation and break out from a potential deadlock.
+	 */
+	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
+		alloc_flags |= ALLOC_NO_WATERMARKS;
+
+	/*
 	 * Find the true preferred zone if the allocation is unconstrained by
 	 * cpusets.
 	 */
@@ -2759,6 +2769,8 @@ retry:
 				goto got_pg;
 			if (!did_some_progress)
 				goto nopage;
+
+			oom++;
 		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:29                       ` Michal Hocko
  2015-02-19 12:58                         ` Michal Hocko
@ 2015-02-19 13:29                         ` Tetsuo Handa
  2015-02-20  9:10                           ` Michal Hocko
  2015-02-19 21:43                         ` Dave Chinner
  2 siblings, 1 reply; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-19 13:29 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: dchinner, oleg, xfs, linux-mm, mgorman, rientjes, linux-fsdevel,
	akpm, fernando_b1, torvalds

Michal Hocko wrote:
> On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> [...]
> > Preferrably, we'd get rid of all nofail allocations and replace them
> > with preallocated reserves.  But this is not going to happen anytime
> > soon, so what other option do we have than resolving this on the OOM
> > killer side?
> 
> As I've mentioned in other email, we might give GFP_NOFAIL allocator
> access to memory reserves (by giving it __GFP_HIGH). This is still not a
> 100% solution because reserves could get depleted but this risk is there
> even with multiple oom victims. I would still argue that this would be a
> better approach because selecting more victims might hit pathological
> case more easily (other victims might be blocked on the very same lock
> e.g.).
> 
Does "multiple OOM victims" mean "select next if first does not die"?
Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
does not deplete memory reserves. ;-)

If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO,
will those who do not want to fail (e.g. journal transactions) start passing
__GFP_NOFAIL?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:58                         ` Michal Hocko
@ 2015-02-19 15:29                           ` Tetsuo Handa
  2015-02-19 21:53                             ` Tetsuo Handa
  2015-02-20  9:13                             ` Michal Hocko
  0 siblings, 2 replies; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-19 15:29 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: dchinner, oleg, xfs, linux-mm, mgorman, rientjes, linux-fsdevel,
	akpm, fernando_b1, torvalds

Michal Hocko wrote:
> On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> [...]
> > Something like the following.
> __GFP_HIGH doesn't seem to be sufficient so we would need something
> slightly else but the idea is still the same:
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8d52ab18fe0d..2d224bbdf8e8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
>  	bool deferred_compaction = false;
>  	int contended_compaction = COMPACT_CONTENDED_NONE;
> +	int oom = 0;
>  
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -2635,6 +2636,15 @@ retry:
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
>  	/*
> +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> +	 * might be blocking resources needed by the OOM victim to terminate.
> +	 * Allow the caller to dive into memory reserves to succeed the
> +	 * allocation and break out from a potential deadlock.
> +	 */

We don't know how many callers will pass __GFP_NOFAIL. But if 1000
threads are doing the same operation, which requires a __GFP_NOFAIL
allocation with a lock held, wouldn't the memory reserves be depleted?

This heuristic cannot continue if the memory reserves are depleted or
contiguous pages of the requested order cannot be found.

> +	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +
> +	/*
>  	 * Find the true preferred zone if the allocation is unconstrained by
>  	 * cpusets.
>  	 */
> @@ -2759,6 +2769,8 @@ retry:
>  				goto got_pg;
>  			if (!did_some_progress)
>  				goto nopage;
> +
> +			oom++;
>  		}
>  		/* Wait for some write requests to complete then retry */
>  		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 12:29                       ` Michal Hocko
  2015-02-19 12:58                         ` Michal Hocko
  2015-02-19 13:29                         ` Tetsuo Handa
@ 2015-02-19 21:43                         ` Dave Chinner
  2015-02-20 12:48                           ` Michal Hocko
  2 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-02-19 21:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> [...]
> > Preferrably, we'd get rid of all nofail allocations and replace them
> > with preallocated reserves.  But this is not going to happen anytime
> > soon, so what other option do we have than resolving this on the OOM
> > killer side?
> 
> As I've mentioned in other email, we might give GFP_NOFAIL allocator
> access to memory reserves (by giving it __GFP_HIGH).

Won't work when you have thousands of concurrent transactions
running in XFS and they are all doing GFP_NOFAIL allocations. That's
why I suggested the per-transaction reserve pool - we can use that
to throttle the number of concurrent contexts demanding memory for
forwards progress, just the same way we throttle the number of
concurrent processes based on maximum log space requirements of the
transactions and the amount of unreserved log space available.

No log space? The transaction reservation waits on an ordered queue
for space to become available. No memory available? The transaction
reservation waits on an ordered queue for memory to become
available.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 15:29                           ` Tetsuo Handa
@ 2015-02-19 21:53                             ` Tetsuo Handa
  2015-02-20  9:13                             ` Michal Hocko
  1 sibling, 0 replies; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-19 21:53 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: dchinner, oleg, xfs, linux-mm, mgorman, rientjes, linux-fsdevel,
	akpm, fernando_b1, torvalds

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> > [...]
> > > Something like the following.
> > __GFP_HIGH doesn't seem to be sufficient so we would need something
> > slightly else but the idea is still the same:
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8d52ab18fe0d..2d224bbdf8e8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> >  	bool deferred_compaction = false;
> >  	int contended_compaction = COMPACT_CONTENDED_NONE;
> > +	int oom = 0;
> >  
> >  	/*
> >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -2635,6 +2636,15 @@ retry:
> >  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >  
> >  	/*
> > +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> > +	 * might be blocking resources needed by the OOM victim to terminate.
> > +	 * Allow the caller to dive into memory reserves to succeed the
> > +	 * allocation and break out from a potential deadlock.
> > +	 */
> 
> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
> threads are doing the same operation which requires __GFP_NOFAIL
> allocation with a lock held, wouldn't memory reserves deplete?
> 
> This heuristic can't continue if memory reserves depleted or
> continuous pages of requested order cannot be found.
> 

Even if the system seems to be stalled, a deadlock may not have occurred.
If the cause is (e.g.) a virtio disk being stuck for an unknown reason
rather than a deadlock, nobody should start consuming the memory reserves
merely because we have waited for a while.

The memory reserves are something like a balloon. To guarantee forward
progress, the balloon must not become empty. Therefore, I think that
throttling heuristics on the memory requester side (the deflator of the
balloon, i.e. the processes that received SIGKILL) should be avoided, and
throttling heuristics on the memory releaser side (the inflator of the
balloon, i.e. the OOM killer that sends SIGKILL) should be used.
If the heuristic is on the deflator side, the memory allocator may
deliver a final blow via ALLOC_NO_WATERMARKS. If it is on the
inflator side, the OOM killer can act as a watchdog when nobody
has volunteered memory within a reasonable period.

> > +	if (oom > 10 && (gfp_mask & __GFP_NOFAIL))
> > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +
> > +	/*
> >  	 * Find the true preferred zone if the allocation is unconstrained by
> >  	 * cpusets.
> >  	 */
> > @@ -2759,6 +2769,8 @@ retry:
> >  				goto got_pg;
> >  			if (!did_some_progress)
> >  				goto nopage;
> > +
> > +			oom++;
> >  		}
> >  		/* Wait for some write requests to complete then retry */
> >  		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> > -- 
> > Michal Hocko
> > SUSE Labs
> > 
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19  9:40                       ` Michal Hocko
@ 2015-02-19 22:03                         ` Dave Chinner
  2015-02-20  9:27                           ` Michal Hocko
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-02-19 22:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Thu, Feb 19, 2015 at 10:40:20AM +0100, Michal Hocko wrote:
> On Thu 19-02-15 08:31:18, Dave Chinner wrote:
> > On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> > > On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > This is why GFP_NOFAIL is not a solution to the "never fail"
> > alloation problem. The caller doing the "no fail" allocation _must
> > be able to set failure policy_. i.e. the choice of aborting and
> > shutting down because progress cannot be made, or continuing and
> > hoping for forwards progress is owned by the allocating context, not
> > the allocator.
> 
> I completely agree that the failure policy is the caller responsibility
> and I would have no objections to something like:
> 
> 	do {
> 		ptr = kmalloc(size, GFP_NOFS);
> 		if (ptr)
> 			return ptr;
> 		if (fatal_signal_pending(current))
> 			break;
> 		if (looping_too_long())
> 			break;
> 	} while (1);
> 
> 	fallback_solution();
> 
> But this is not the case in kmem_alloc which is essentially GFP_NOFAIL
> allocation with a warning and congestion_wait. There is no failure
> policy defined there. The warning should be part of the allocator and
> the NOFAIL policy should be explicit. So why exactly do you oppose to
> changing kmem_alloc (and others which are doing essentially the same)?

I'm opposing changing kmem_alloc() to GFP_NOFAIL precisely because
doing so is *broken*, *and* it removes the policy decision from the
calling context where it belongs.

We are in the process of discussing - at an XFS level - how to
handle errors in a configurable manner. See, for example, this
discussion:

http://oss.sgi.com/archives/xfs/2015-02/msg00343.html

Where we are trying to decide how to expose failure policy to admins
to make decisions about error handling behaviour:

http://oss.sgi.com/archives/xfs/2015-02/msg00346.html

There is little doubt in my mind that this stretches to ENOMEM
handling; it is another case where we consider ENOMEM to be a
transient error and hence retry forever until it succeeds. But some
people are going to want to configure that behaviour, and the API
above allows people to configure exactly how many repeated memory
allocation failures we'd tolerate before considering the situation
hopeless, giving up, and risking a filesystem shutdown....

Converting the code to use GFP_NOFAIL takes us in exactly the
opposite direction to our current line of development w.r.t. to
filesystem error handling.

> The reason I care about GFP_NOFAIL is that there are apparently code
> paths which do not tell allocator they are basically GFP_NOFAIL without
> any fallback. This leads to two main problems 1) we do not have a good
> overview how many code paths have such a strong requirements and so
> cannot estimate e.g. how big memory reserves should be and

Right, when GFP_NOFAIL got deprecated we lost the ability to document
such behaviour and find it easily. People just put retry loops in
instead of using GFP_NOFAIL. Good luck finding them all :/

> 2) allocator
> cannot help those paths (e.g. by giving them access to reserves to break
> out of the livelock).

The allocator should not help. Global reserves are unreliable - make the
allocation context reserve the amount it needs before it enters the
context where it can't back out....

> > IOWs, we have need for forward allocation progress guarantees on
> > (potentially) several megabytes of allocations from slab caches, the
> > heap and the page allocator, with all allocations all in
> > unpredictable order, with objects of different life times and life
> > cycles, and at which may, at any time, get stuck behind
> > objects locked in other transactions and hence can randomly block
> > until some other thread makes forward progress and completes a
> > transaction and unlocks the object.
> 
> Thanks for the clarification, I have to think about it some more,
> though. My thinking was that mempools could be used for an emergency
> pool with pre-allocated memory which would be used in the non-failing
> contexts.

The other problem with mempools is that they aren't exclusive to the
context that needs the reservation. i.e. we can preallocate to the
mempool, but then when the preallocating context goes to allocate,
that preallocation may have already been drained by other contexts.

The memory reservation needs to follow the transaction - we
can pass reservations between tasks, and they need to persist across
sleeping locks, IO, etc., and mempools are simply too constrained to be
usable in this environment.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 10:24               ` Johannes Weiner
@ 2015-02-19 22:52                 ` Dave Chinner
  2015-02-20 10:36                   ` Tetsuo Handa
  2015-02-21 23:52                   ` Johannes Weiner
  0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-19 22:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Thu, Feb 19, 2015 at 05:24:31AM -0500, Johannes Weiner wrote:
> On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote:
> > [ cc xfs list - experienced kernel devs should not have to be
> > reminded to do this ]
> > 
> > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote:
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -		"possible memory allocation deadlock in %s (mode:0x%x)",
> > > -					__func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > >  }
> > 
> > Hmmm - the only reason there is a focus on this loop is that it
> > emits warnings about allocations failing. It's obvious that the
> > problem being dealt with here is a fundamental design issue w.r.t.
> > to locking and the OOM killer, but the proposed special casing
> > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> > in XFS started emitting warnings about allocations failing more
> > often.
> > 
> > So the answer is to remove the warning?  That's like killing the
> > canary to stop the methane leak in the coal mine. No canary? No
> > problems!
> 
> That's not what happened.  The patch that affected behavior here
> transformed code that was an incoherent collection of conditions into
> something that has an actual model.

Which is entirely undocumented. If you have a model, the first thing
to do is document it and communicate that model to everyone who
needs to know about that new model. I have no idea what that model
is. Keeping it in your head and changing code that other people
maintain without giving them any means of understanding WTF you are
doing is a really bad engineering practice.


And yes, I have had a bit to say about this in public recently.
Go watch my recent LCA talk, for example....

And, FWIW, email discussions on a list is no substitute for a
properly documented design that people can take their time to
understand and digest.

> That model is that we don't loop
> in the allocator if there are no means to making forward progress.  In
> this case, it was GFP_NOFS triggering an early exit from the allocator
> because it's not allowed to invoke the OOM killer by default, and
> there is little point in looping to wait for things to get better on their own.

So you keep saying....

> So these deadlock warnings happen, ironically, by the page allocator
> now bailing out of a locked-up state in which it's not making forward
> progress.  They don't strike me as a very useful canary in this case.

... yet we *rarely* see the canary warnings we emit when we do too
many allocation retries; the code has been that way for 13-odd
years.  Hence, despite your protestations that your way is *better*,
we have code that is tried, tested and proven in rugged production
environments. That's far more convincing evidence that the *code
should not change* than your assertions that it is broken and needs
to be fixed.

> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure. That's a big red flag to me that all
> > this hacking around the edges is not solving the underlying problem,
> > but instead is breaking things that did once work.
> > 
> > And, well, then there's this (gfp.h):
> > 
> >  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >  * cannot handle allocation failures.  This modifier is deprecated and no new
> >  * users should be added.
> > 
> > So, is this another policy relevation from the mm developers about
> > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> > Or just another symptom of frantic thrashing because nobody actually
> > understands the problem or those that do are unwilling to throw out
> > the broken crap and redesign it?
> 
> Well, understand our dilemma here.  __GFP_NOFAIL is a liability
> because it can trap tasks with unknown state and locks in a
> potentially never ending loop, and we don't want people to start using
> it as a convenient solution to get out of having a fallback strategy.
> 
> However, if your entire architecture around a particular allocation is
> that failure is not an option at this point, and you can't reasonably
> preallocate - although that would always be preferrable - then please
> do not open code an endless loop around the call to the allocator but
> use __GFP_NOFAIL instead so that these callsites are annotated and can
> be reviewed. 

I will actively work around anything that causes filesystem memory
pressure to increase the chance of oom killer invocations. The OOM
killer is not a solution - it is, by definition, a loose cannon and
so we should be reducing dependencies on it.

I really don't care about the OOM Killer corner cases - it's
completely the wrong line of development to be spending time on
and you aren't going to convince me otherwise. The OOM killer is a
crutch used to justify having a memory allocation subsystem that
can't provide forward progress guarantee mechanisms to callers that
need it.

I've proposed a method of providing this forward progress guarantee
for subsystems of arbitrary complexity, and this removes the
dependency on the OOM killer for forwards allocation progress in such
contexts (e.g. filesystems). We should be discussing how to
implement that, not what bandaids we need to apply to the OOM
killer. I want to fix the underlying problems, not push them under
the OOM-killer bus...

> And please understand that this callsite blowing up is a chance to
> better the code and behavior here.  Where previously it would just
> endlessly loop in the allocator without any means to make progress,

Again, this statement ignores the fact that we have *no credible
evidence* that this is actually a problem in production
environments.

And, besides, even if you do force through changing the XFS code to
GFP_NOFAIL, it'll get changed back to a retry loop in the near
future when we add admin configurable error handling behaviour to
XFS, as I pointed Michal to....
(http://oss.sgi.com/archives/xfs/2015-02/msg00346.html)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 13:29                         ` Tetsuo Handa
@ 2015-02-20  9:10                           ` Michal Hocko
  2015-02-20 12:20                             ` Tetsuo Handa
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-02-20  9:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes,
	linux-fsdevel, akpm, fernando_b1, torvalds

On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > [...]
> > > Preferrably, we'd get rid of all nofail allocations and replace them
> > > with preallocated reserves.  But this is not going to happen anytime
> > > soon, so what other option do we have than resolving this on the OOM
> > > killer side?
> > 
> > As I've mentioned in other email, we might give GFP_NOFAIL allocator
> > access to memory reserves (by giving it __GFP_HIGH). This is still not a
> > 100% solution because reserves could get depleted but this risk is there
> > even with multiple oom victims. I would still argue that this would be a
> > better approach because selecting more victims might hit pathological
> > case more easily (other victims might be blocked on the very same lock
> > e.g.).
> > 
> Does "multiple OOM victims" mean "select next if first does not die"?
> Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
> does not deplete memory reserves. ;-)

It doesn't because
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;

you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion
and for breaking out of the allocator. An exiting task might need memory
to do so, and you basically make all those allocations fail. How do you
know this is not going to blow up?

> If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO,
> those who do not want to fail (e.g. journal transaction) will start passing
> __GFP_NOFAIL?
> 

-- 
Michal Hocko
SUSE Labs


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 15:29                           ` Tetsuo Handa
  2015-02-19 21:53                             ` Tetsuo Handa
@ 2015-02-20  9:13                             ` Michal Hocko
  2015-02-20 13:37                               ` Stefan Ring
  1 sibling, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-02-20  9:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes,
	linux-fsdevel, akpm, fernando_b1, torvalds

On Fri 20-02-15 00:29:29, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 13:29:14, Michal Hocko wrote:
> > [...]
> > > Something like the following.
> > __GFP_HIGH doesn't seem to be sufficient so we would need something
> > slightly else but the idea is still the same:
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8d52ab18fe0d..2d224bbdf8e8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> >  	bool deferred_compaction = false;
> >  	int contended_compaction = COMPACT_CONTENDED_NONE;
> > +	int oom = 0;
> >  
> >  	/*
> >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -2635,6 +2636,15 @@ retry:
> >  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >  
> >  	/*
> > +	 * __GFP_NOFAIL allocations cannot fail but yet the current context
> > +	 * might be blocking resources needed by the OOM victim to terminate.
> > +	 * Allow the caller to dive into memory reserves to succeed the
> > +	 * allocation and break out from a potential deadlock.
> > +	 */
> 
> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
> threads are doing the same operation which requires __GFP_NOFAIL
> allocation with a lock held, wouldn't the memory reserves be depleted?

We shouldn't have an unbounded number of GFP_NOFAIL allocations at the
same time. This would be even more broken. If a load is known to use
such allocations excessively then the administrator can enlarge the
memory reserves.

> This heuristic can't continue if the memory reserves are depleted or
> contiguous pages of the requested order cannot be found.

Once memory reserves are depleted we are screwed anyway and we might
panic.

-- 
Michal Hocko
SUSE Labs


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 22:03                         ` Dave Chinner
@ 2015-02-20  9:27                           ` Michal Hocko
  0 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2015-02-20  9:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Fri 20-02-15 09:03:55, Dave Chinner wrote:
[...]
> Converting the code to use GFP_NOFAIL takes us in exactly the
> opposite direction to our current line of development w.r.t. to
> filesystem error handling.

Fair enough. If there are plans to have a failure policy rather than
GFP_NOFAIL like behavior then I have, of course, no objections. Quite
opposite. This is exactly what I would like to see. GFP_NOFAIL should be
rarely used, really.

The whole point of this discussion, and I am sorry if I didn't make it
clear, is that _if_ there is really a GFP_NOFAIL requirement hidden
from the allocator then it should be changed to use GFP_NOFAIL so that
the allocator knows about this requirement.

> > The reason I care about GFP_NOFAIL is that there are apparently code
> > paths which do not tell the allocator they are basically GFP_NOFAIL
> > without any fallback. This leads to two main problems: 1) we do not
> > have a good overview of how many code paths have such strong
> > requirements and so cannot estimate e.g. how big the memory reserves
> > should be and
> 
> Right, when GFP_NOFAIL got deprecated we lost the ability to document
> such behaviour and find it easily. People just put retry loops in
> instead of using GFP_NOFAIL. Good luck finding them all :/

That will be a PITA, all right, but I guess the deprecation was a mistake
and we should stop this tendency.

> > 2) allocator
> > cannot help those paths (e.g. by giving them access to reserves to break
> > out of the livelock).
> 
> Allocator should not help. Global reserves are unreliable - make the
> allocation context reserve the amount it needs before it enters the
> context where it can't back out....

Sure pre-allocation is preferable. But once somebody asks for GFP_NOFAIL
then it is too late and the allocator only has memory reclaim and
potentially reserves.

[...]
-- 
Michal Hocko
SUSE Labs


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 22:52                 ` Dave Chinner
@ 2015-02-20 10:36                   ` Tetsuo Handa
  2015-02-20 23:15                     ` Dave Chinner
  2015-02-21 23:52                   ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-20 10:36 UTC (permalink / raw)
  To: david, hannes
  Cc: dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm,
	torvalds

Dave Chinner wrote:
> I really don't care about the OOM Killer corner cases - it's
> completely the wrong line of development to be spending time on
> and you aren't going to convince me otherwise. The OOM killer is a
> crutch used to justify having a memory allocation subsystem that
> can't provide forward progress guarantee mechanisms to callers that
> need it.

I really care about the OOM Killer corner cases, for I'm

  (1) seeing trouble cases which occurred in enterprise systems
      under OOM conditions

  (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
      an unprivileged user with a login shell can trivially trigger
      since Linux 2.0) to OOM "Genocide" attacks in order to allow
      OOM-unkillable daemons to restart OOM-killed processes

  (3) waiting for a bandaid for (2) in order to propose changes for
      mitigating OOM "Genocide" attacks (as bad guys will find how to
      trigger OOM "Deadlock or Genocide" attacks from changes for
      mitigating OOM "Genocide" attacks)

I started posting to linux-mm ML in order to make forward progress
about (1) and (2). I don't want the memory allocation subsystem to
lock up an entire system by indefinitely disabling the memory-releasing
mechanism provided by the OOM killer.

> I've proposed a method of providing this forward progress guarantee
> for subsystems of arbitrary complexity, and this removes the
> dependency on the OOM killer for forward allocation progress in such
> contexts (e.g. filesystems). We should be discussing how to
> implement that, not what bandaids we need to apply to the OOM
> killer. I want to fix the underlying problems, not push them under
> the OOM-killer bus...

I'm fine with that direction for new kernels provided that a simple
bandaid which can be backported to distributor kernels for making
OOM "Deadlock" attacks impossible is implemented. Therefore, I'm
discussing what bandaids we need to apply to the OOM killer.


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20  9:10                           ` Michal Hocko
@ 2015-02-20 12:20                             ` Tetsuo Handa
  2015-02-20 12:38                               ` Michal Hocko
  0 siblings, 1 reply; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-20 12:20 UTC (permalink / raw)
  To: mhocko
  Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes,
	linux-fsdevel, akpm, fernando_b1, torvalds

Michal Hocko wrote:
> On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > [...]
> > > > > Preferably, we'd get rid of all nofail allocations and replace them
> > > > with preallocated reserves.  But this is not going to happen anytime
> > > > soon, so what other option do we have than resolving this on the OOM
> > > > killer side?
> > > 
> > > As I've mentioned in another email, we might give the GFP_NOFAIL
> > > allocator access to memory reserves (by giving it __GFP_HIGH). This is
> > > still not a 100% solution because the reserves could get depleted, but
> > > that risk is there even with multiple OOM victims. I would still argue
> > > that this is the better approach, because selecting more victims might
> > > hit a pathological case more easily (e.g. other victims might be
> > > blocked on the very same lock).
> > > 
> > Does "multiple OOM victims" mean "select next if first does not die"?
> > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
> > does not deplete memory reserves. ;-)
> 
> It doesn't because
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
>  		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
> -		else if (!in_interrupt() &&
> -				((current->flags & PF_MEMALLOC) ||
> -				 unlikely(test_thread_flag(TIF_MEMDIE))))
> +		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
> 
> you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion
> and for breaking out of the allocator. An exiting task might need memory
> to do so, and you basically make all those allocations fail. How do you
> know this is not going to blow up?
> 

Well, how about treating exiting tasks as implying __GFP_NOFAIL for cleanup?

We cannot determine the correct task to kill and allow access to memory
reserves based on lock dependencies. Therefore, this patch uniformly
allows no task to access the memory reserves.

An exiting task might need some memory to exit, and not allowing access
to the memory reserves can delay that task's exit. But that task will
eventually get memory released by other tasks killed by the
timeout-based kill-more mechanism. If there are no more killable tasks,
or the panic timeout expires, the result is the same as depleting the
memory reserves.

I think that this situation (automatically making forward progress as
if the administrator were periodically doing SysRq-f until the OOM
condition is resolved, or doing SysRq-c if there are no more killable
tasks or the stall lasts too long) is better than the current situation
(making no forward progress because the exiting task cannot exit due to
a lock dependency, caused by failing to determine the correct task to
kill and allow access to memory reserves).

> > If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO,
> > those who do not want to fail (e.g. journal transaction) will start passing
> > __GFP_NOFAIL?
> > 


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 12:20                             ` Tetsuo Handa
@ 2015-02-20 12:38                               ` Michal Hocko
  0 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2015-02-20 12:38 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes,
	linux-fsdevel, akpm, fernando_b1, torvalds

On Fri 20-02-15 21:20:58, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 19-02-15 22:29:37, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > > [...]
> > > > > Preferably, we'd get rid of all nofail allocations and replace them
> > > > > with preallocated reserves.  But this is not going to happen anytime
> > > > > soon, so what other option do we have than resolving this on the OOM
> > > > > killer side?
> > > > 
> > > > As I've mentioned in another email, we might give the GFP_NOFAIL
> > > > allocator access to memory reserves (by giving it __GFP_HIGH). This is
> > > > still not a 100% solution because the reserves could get depleted, but
> > > > that risk is there even with multiple OOM victims. I would still argue
> > > > that this is the better approach, because selecting more victims might
> > > > hit a pathological case more easily (e.g. other victims might be
> > > > blocked on the very same lock).
> > > > 
> > > Does "multiple OOM victims" mean "select next if first does not die"?
> > > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2
> > > does not deplete memory reserves. ;-)
> > 
> > It doesn't because
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> >  		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > -		else if (!in_interrupt() &&
> > -				((current->flags & PF_MEMALLOC) ||
> > -				 unlikely(test_thread_flag(TIF_MEMDIE))))
> > +		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > 
> > you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion
> > and for breaking out of the allocator. An exiting task might need memory
> > to do so, and you basically make all those allocations fail. How do you
> > know this is not going to blow up?
> > 
> 
> Well, how about treating exiting tasks as implying __GFP_NOFAIL for
> cleanup?
> 
> We cannot determine the correct task to kill and allow access to memory
> reserves based on lock dependencies. Therefore, this patch uniformly
> allows no task to access the memory reserves.
> 
> An exiting task might need some memory to exit, and not allowing access
> to the memory reserves can delay that task's exit. But that task will
> eventually get memory released by other tasks killed by the
> timeout-based kill-more mechanism. If there are no more killable tasks,
> or the panic timeout expires, the result is the same as depleting the
> memory reserves.
> 
> I think that this situation (automatically making forward progress as
> if the administrator were periodically doing SysRq-f until the OOM
> condition is resolved, or doing SysRq-c if there are no more killable
> tasks or the stall lasts too long) is better than the current situation
> (making no forward progress because the exiting task cannot exit due to
> a lock dependency, caused by failing to determine the correct task to
> kill and allow access to memory reserves).

If you really believe this is an improvement then send a proper patch
with justification. But I am _really_ skeptical about such a change to
be honest.
-- 
Michal Hocko
SUSE Labs


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 21:43                         ` Dave Chinner
@ 2015-02-20 12:48                           ` Michal Hocko
  2015-02-20 23:09                             ` Dave Chinner
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-02-20 12:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Fri 20-02-15 08:43:56, Dave Chinner wrote:
> On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > [...]
> > > Preferably, we'd get rid of all nofail allocations and replace them
> > > with preallocated reserves.  But this is not going to happen anytime
> > > soon, so what other option do we have than resolving this on the OOM
> > > killer side?
> > 
> > As I've mentioned in another email, we might give the GFP_NOFAIL
> > allocator access to memory reserves (by giving it __GFP_HIGH).
> 
> Won't work when you have thousands of concurrent transactions
> running in XFS and they are all doing GFP_NOFAIL allocations.

Is there any bound on how many transactions can run at the same time?

> That's why I suggested the per-transaction reserve pool - we can use
> that

I am still not sure what you mean by reserve pool (API wise). How
does it differ from pre-allocating memory before the "may not fail
context"? Could you elaborate on it, please?

> to throttle the number of concurent contexts demanding memory for
> forwards progress, just the same was we throttle the number of
> concurrent processes based on maximum log space requirements of the
> transactions and the amount of unreserved log space available.
> 
> No log space, transaction reservations waits on an ordered queue for
> space to become available. No memory available, transaction
> reservation waits on an ordered queue for memory to become
> available.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20  9:13                             ` Michal Hocko
@ 2015-02-20 13:37                               ` Stefan Ring
  0 siblings, 0 replies; 83+ messages in thread
From: Stefan Ring @ 2015-02-20 13:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, Linux fs XFS, linux-mm, mgorman,
	hannes, linux-fsdevel, rientjes, akpm, fernando_b1, torvalds

>> We don't know how many callers will pass __GFP_NOFAIL. But if 1000
>> threads are doing the same operation which requires __GFP_NOFAIL
>> allocation with a lock held, wouldn't the memory reserves be depleted?
>
> We shouldn't have an unbounded number of GFP_NOFAIL allocations at the
> same time. This would be even more broken. If a load is known to use
> such allocations excessively then the administrator can enlarge the
> memory reserves.
>
>> This heuristic can't continue if the memory reserves are depleted or
>> contiguous pages of the requested order cannot be found.
>
> Once memory reserves are depleted we are screwed anyway and we might
> panic.

This discussion reminds me of a situation I've seen somewhat
regularly, which I have described here:
http://oss.sgi.com/pipermail/xfs/2014-April/035793.html

I've actually seen it more often on another box with OpenVZ and
VirtualBox installed, where it would almost always happen during
startup of a VirtualBox guest machine. This other machine is also
running XFS. I blamed it on OpenVZ or VirtualBox originally, but
having seen the same thing happen on the other machine with neither of
them, the next candidate for taking blame is XFS.

Is this behavior something that can be attributed to these memory
allocation retry loops?


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 12:48                           ` Michal Hocko
@ 2015-02-20 23:09                             ` Dave Chinner
  0 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-20 23:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Fri, Feb 20, 2015 at 01:48:49PM +0100, Michal Hocko wrote:
> On Fri 20-02-15 08:43:56, Dave Chinner wrote:
> > On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > [...]
> > > > Preferably, we'd get rid of all nofail allocations and replace them
> > > > with preallocated reserves.  But this is not going to happen anytime
> > > > soon, so what other option do we have than resolving this on the OOM
> > > > killer side?
> > > 
> > > As I've mentioned in another email, we might give the GFP_NOFAIL
> > > allocator access to memory reserves (by giving it __GFP_HIGH).
> > 
> > Won't work when you have thousands of concurrent transactions
> > running in XFS and they are all doing GFP_NOFAIL allocations.
> 
> Is there any bound on how many transactions can run at the same time?

Yes: as many reservations as can fit in the available log space.

The log can be sized up to 2GB, and for filesystems larger than 4TB it
defaults to 2GB. Log space reservations depend on the operation being
done - an inode timestamp update requires about 5kB of reservation, and
a rename requires about 200kB. Hence we can easily have thousands of
active transactions, even in the worst-case log space reservation
cases.

You're saying it would be insane to have hundreds or thousands of
threads doing GFP_NOFAIL allocations concurrently. Reality check:
XFS has been operating successfully under such workload conditions
in production systems for many years.

> > That's why I suggested the per-transaction reserve pool - we can use
> > that
> 
> I am still not sure what you mean by reserve pool (API wise). How
> does it differ from pre-allocating memory before the "may not fail
> context"? Could you elaborate on it, please?

It is preallocating memory: into a reserve pool associated with the
transaction, done as part of the transaction reservation mechanism
we already have in XFS. The allocator then uses that reserve pool
to allocate from if an allocation would otherwise fail.

There is no way we can preallocate specific objects before the
transaction - that's just insane, especially given the unbounded
demand-paged object requirement. Hence the need for a "preallocated
reserve pool" that the allocator can dip into that covers the memory
we need to *allocate and can't reclaim* during the course of the
transaction.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 10:36                   ` Tetsuo Handa
@ 2015-02-20 23:15                     ` Dave Chinner
  2015-02-21  3:20                       ` Theodore Ts'o
  2015-02-21 11:12                       ` Tetsuo Handa
  0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-20 23:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > I really don't care about the OOM Killer corner cases - it's
> > completely the wrong line of development to be spending time on
> > and you aren't going to convince me otherwise. The OOM killer is a
> > crutch used to justify having a memory allocation subsystem that
> > can't provide forward progress guarantee mechanisms to callers that
> > need it.
> 
> I really care about the OOM Killer corner cases, for I'm
> 
>   (1) seeing trouble cases which occurred in enterprise systems
>       under OOM conditions

You reach OOM, then your SLAs are dead and buried. Reboot the
box - it's a much more reliable way of returning to a working system
than playing Russian Roulette with the OOM killer.

>   (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
>       an unprivileged user with a login shell can trivially trigger
>       since Linux 2.0) to OOM "Genocide" attacks in order to allow
>       OOM-unkillable daemons to restart OOM-killed processes
> 
>   (3) waiting for a bandaid for (2) in order to propose changes for
>       mitigating OOM "Genocide" attacks (as bad guys will find how to
>       trigger OOM "Deadlock or Genocide" attacks from changes for
>       mitigating OOM "Genocide" attacks)

Which is yet another indication that the OOM killer is the wrong
solution to the "lack of forward progress" problem. Anyone can
generate enough memory pressure to trigger the OOM killer; we can't
prevent that from occurring when the OOM killer can be invoked by
user processes.

> I started posting to linux-mm ML in order to make forward progress
> about (1) and (2). I don't want the memory allocation subsystem to
> lock up an entire system by indefinitely disabling the memory-releasing
> mechanism provided by the OOM killer.
> 
> > I've proposed a method of providing this forward progress guarantee
> > for subsystems of arbitrary complexity, and this removes the
> > dependency on the OOM killer for forward allocation progress in such
> > contexts (e.g. filesystems). We should be discussing how to
> > implement that, not what bandaids we need to apply to the OOM
> > killer. I want to fix the underlying problems, not push them under
> > the OOM-killer bus...
> 
> I'm fine with that direction for new kernels provided that a simple
> bandaid which can be backported to distributor kernels for making
> OOM "Deadlock" attacks impossible is implemented. Therefore, I'm
> discussing what bandaids we need to apply to the OOM killer.

The band-aids being proposed are worse than the problem they are
intended to cover up. In which case, the band-aids should not be
applied.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 23:15                     ` Dave Chinner
@ 2015-02-21  3:20                       ` Theodore Ts'o
  2015-02-21  9:19                         ` Andrew Morton
                                           ` (2 more replies)
  2015-02-21 11:12                       ` Tetsuo Handa
  1 sibling, 3 replies; 83+ messages in thread
From: Theodore Ts'o @ 2015-02-21  3:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm,
	mgorman, rientjes, akpm, linux-ext4, torvalds

+akpm

So I'm arriving late to this discussion since I've been in conference
mode for the past week, and I'm only now catching up on this thread.

I'll note that this whole question of whether or not file systems
should use GFP_NOFAIL is one where the mm developers are not of one
mind.

In fact, search for the subject line "fs/reiserfs/journal.c: Remove
obsolete __GFP_NOFAIL", where we recapitulated many of these arguments;
Andrew Morton said that it was better to use GFP_NOFAIL over the
alternatives of (a) panic'ing the kernel because the file system has
no way to move forward other than leaving the file system corrupted,
or (b) looping in the file system to retry the memory allocation to
avoid the unfortunate effects of (a).

So based on akpm's sage advice and wisdom, I added back GFP_NOFAIL to
ext4/jbd2.

It sounds like 9879de7373fc is causing massive file system
errors, and it seems **really** unfortunate it was added so late in
the day (between -rc6 and -rc7).

So at this point, it seems we have two choices.  We can either revert
9879de7373fc, or I can add a whole lot more GFP_NOFAIL flags to ext4's
memory allocations and submit them as stable bug fixes.

Linux MM developers, this is your call.  I will liberally be adding
GFP_NOFAIL to ext4 if you won't revert the commit, because that's the
only way I can fix things with minimal risk of adding additional,
potentially more serious regressions.

						- Ted


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  3:20                       ` Theodore Ts'o
@ 2015-02-21  9:19                         ` Andrew Morton
  2015-02-21 13:48                           ` Tetsuo Handa
                                             ` (2 more replies)
  2015-02-21 12:00                         ` Tetsuo Handa
  2015-02-23 10:26                         ` Michal Hocko
  2 siblings, 3 replies; 83+ messages in thread
From: Andrew Morton @ 2015-02-21  9:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, hannes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, rientjes, linux-ext4, torvalds

On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:

> +akpm

I was hoping not to have to read this thread ;)

afaict there are two (main) issues:

a) whether to oom-kill when __GFP_FS is not set.  The kernel hasn't
   been doing this for ages and nothing has changed recently.

b) whether to keep looping when __GFP_NOFAIL is not set and __GFP_FS
   is not set and we can't oom-kill anything (which goes without
   saying, because __GFP_FS isn't set!).

   And 9879de7373fc ("mm: page_alloc: embed OOM killing naturally
   into allocation slowpath") somewhat inadvertently changed this policy
   - the allocation attempt will now promptly return ENOMEM if
   !__GFP_NOFAIL and !__GFP_FS.

Correct enough?

Question a) seems a bit of a red herring and we can park it for now.


What I'm not really understanding is why the pre-3.19 implementation
actually worked.  We've exhausted the free pages, we're not succeeding
at reclaiming anything, we aren't able to oom-kill anyone.  Yet it
*does* work - we eventually find that memory and everything proceeds.

How come?  Where did that memory come from?


Short term, we need to fix 3.19.x and 3.20 and that appears to be by
applying Johannes's akpm-doesnt-know-why-it-works patch:

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
-		if (!(gfp_mask & __GFP_FS))
+		if (!(gfp_mask & __GFP_FS)) {
+			/*
+			 * XXX: Page reclaim didn't yield anything,
+			 * and the OOM killer can't be invoked, but
+			 * keep looping as per should_alloc_retry().
+			 */
+			*did_some_progress = 1;
 			goto out;
+		}
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.

Have people adequately confirmed that this gets us out of trouble?


And yes, I agree that sites such as xfs's kmem_alloc() should be
passing __GFP_NOFAIL to tell the page allocator what's going on.  I
don't think it matters a lot whether kmem_alloc() retains its retry
loop.  If __GFP_NOFAIL is working correctly then it will never loop
anyway...


Also, this:

On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote:

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure.

David, I did not know this!  If you've been telling us about this then
perhaps it wasn't loud enough.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 23:15                     ` Dave Chinner
  2015-02-21  3:20                       ` Theodore Ts'o
@ 2015-02-21 11:12                       ` Tetsuo Handa
  2015-02-21 21:48                         ` Dave Chinner
  1 sibling, 1 reply; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-21 11:12 UTC (permalink / raw)
  To: david
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, torvalds

My main issue is

  c) whether to oom-kill more processes when the OOM victim cannot be
     terminated presumably due to the OOM killer deadlock.

Dave Chinner wrote:
> On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> > Dave Chinner wrote:
> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer is a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> > 
> > I really care about the OOM Killer corner cases, for I'm
> > 
> >   (1) seeing trouble cases which occurred in enterprise systems
> >       under OOM conditions
> 
> You reach OOM, then your SLAs are dead and buried. Reboot the
> box - it's a much more reliable way of returning to a working system
> than playing Russian Roulette with the OOM killer.

What Service Level Agreements? Such troubles are occurring on RHEL systems
where users are not sitting in front of the console. Unless somebody is
sitting in front of the console, ready to do SysRq-b when trouble
occurs, the downtime of the system becomes significantly longer.

What mechanisms are available for minimizing system downtime when
trouble occurs under OOM conditions? A software/hardware watchdog?
Indeed, they may help, but they may trigger prematurely when the
system has not actually entered the OOM condition. Only the OOM killer knows.

> 
> >   (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
> >       an unprivileged user with a login shell can trivially trigger
> >       since Linux 2.0) to OOM "Genocide" attacks in order to allow
> >       OOM-unkillable daemons to restart OOM-killed processes
> > 
> >   (3) waiting for a bandaid for (2) in order to propose changes for
> >       mitigating OOM "Genocide" attacks (as bad guys will find how to
> >       trigger OOM "Deadlock or Genocide" attacks from changes for
> >       mitigating OOM "Genocide" attacks)
> 
> Which is yet another indication that the OOM killer is the wrong
> solution to the "lack of forward progress" problem. Any one can
> generate enough memory pressure to trigger the OOM killer; we can't
> prevent that from occurring when the OOM killer can be invoked by
> user processes.
> 

We have memory cgroups to reduce the possibility of triggering the OOM
killer, though several bugs remain in RHEL kernels that make
administrators hesitate to use memory cgroups.

> > I started posting to linux-mm ML in order to make forward progress
> > about (1) and (2). I don't want the memory allocation subsystem to
> > lock up an entire system by indefinitely disabling the memory-releasing
> > mechanism provided by the OOM killer.
> > 
> > > I've proposed a method of providing this forward progress guarantee
> > > for subsystems of arbitrary complexity, and this removes the
> > > dependency on the OOM killer for forward allocation progress in such
> > > contexts (e.g. filesystems). We should be discussing how to
> > > implement that, not what bandaids we need to apply to the OOM
> > > killer. I want to fix the underlying problems, not push them under
> > > the OOM-killer bus...
> > 
> > I'm fine with that direction for new kernels provided that a simple
> > bandaid which can be backported to distributor kernels for making
> > OOM "Deadlock" attacks impossible is implemented. Therefore, I'm
> > discussing what bandaids we need to apply to the OOM killer.
> 
> The band-aids being proposed are worse than the problem they are
> intended to cover up. In which case, the band-aids should not be
> applied.
> 

The problem is simple: the /proc/sys/vm/panic_on_oom == 0 setting does not
help if the OOM killer fails to determine the correct task to kill and
grant it access to memory reserves. The OOM killer then waits forever
under the OOM deadlock condition rather than triggering a kernel panic.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-Swapping_and_Out_Of_Memory_Tips.html
says that "Usually, oom_killer can kill rogue processes and the system
will survive." but says nothing about what to do when we hit the OOM
killer deadlock condition.

My band-aids allow the OOM killer to trigger a kernel panic (optionally
followed by kdump and an automatic reboot) for people who want to reboot
the box when the default /proc/sys/vm/panic_on_oom == 0 setting fails to
kill rogue processes, and let the system survive for people who want
that, even when the OOM killer fails to determine the correct task to
kill and grant it access to memory reserves.

Not only can we not expect the OOM killer messages to be saved to
/var/log/messages under the OOM killer deadlock condition, but we also
do not emit the OOM killer messages at all if we hit

    void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
                          unsigned int points, unsigned long totalpages,
                          struct mem_cgroup *memcg, nodemask_t *nodemask,
                          const char *message)
    {
            struct task_struct *victim = p;
            struct task_struct *child;
            struct task_struct *t;
            struct mm_struct *mm;
            unsigned int victim_points = 0;
            static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                                  DEFAULT_RATELIMIT_BURST);
    
            /*
             * If the task is already exiting, don't alarm the sysadmin or kill
             * its children or threads, just set TIF_MEMDIE so it can die quickly
             */
            if (task_will_free_mem(p)) { /***** _THIS_ _CONDITION_ *****/
                    set_tsk_thread_flag(p, TIF_MEMDIE);
                    put_task_struct(p);
                    return;
            }
    
            if (__ratelimit(&oom_rs))
                    dump_header(p, gfp_mask, order, memcg, nodemask);
    
            task_lock(p);
            pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
                    message, task_pid_nr(p), p->comm, points);
            task_unlock(p);

followed by entering the OOM killer deadlock condition. This is
annoying for me because neither the serial console nor netconsole helps
me find out that the system has entered the OOM condition.

If you want to stop people from playing Russian Roulette with the OOM
killer, please remove the OOM killer code entirely from RHEL kernels so that
people must use their systems with hardcoded /proc/sys/vm/panic_on_oom == 1
setting. Can you do it?

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  3:20                       ` Theodore Ts'o
  2015-02-21  9:19                         ` Andrew Morton
@ 2015-02-21 12:00                         ` Tetsuo Handa
  2015-02-23 10:26                         ` Michal Hocko
  2 siblings, 0 replies; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-21 12:00 UTC (permalink / raw)
  To: tytso
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, linux-ext4, torvalds

Theodore Ts'o wrote:
> So at this point, it seems we have two choices.  We can either revert
> 9879de7373fc, or I can add a whole lot more __GFP_NOFAIL flags to ext4's
> memory allocations and submit them as stable bug fixes.

Can you absorb this side effect by simply adding __GFP_NOFAIL to only
ext4's memory allocations? Don't you also depend on lower layers that
use GFP_NOIO?

BTW, while you are using an open-coded __GFP_NOFAIL-style retry loop for
the GFP_NOFS allocation in jbd2, you are already using __GFP_NOFAIL for
the GFP_NOFS allocation in jbd. The failure check there seems redundant,
since a __GFP_NOFAIL allocation never returns NULL.

---------- linux-3.19/fs/jbd2/transaction.c ----------
257 static int start_this_handle(journal_t *journal, handle_t *handle,
258                              gfp_t gfp_mask)
259 {
260         transaction_t   *transaction, *new_transaction = NULL;
261         int             blocks = handle->h_buffer_credits;
262         int             rsv_blocks = 0;
263         unsigned long ts = jiffies;
264 
265         /*
266          * 1/2 of transaction can be reserved so we can practically handle
267          * only 1/2 of maximum transaction size per operation
268          */
269         if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
270                 printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
271                        current->comm, blocks,
272                        journal->j_max_transaction_buffers / 2);
273                 return -ENOSPC;
274         }
275 
276         if (handle->h_rsv_handle)
277                 rsv_blocks = handle->h_rsv_handle->h_buffer_credits;
278 
279 alloc_transaction:
280         if (!journal->j_running_transaction) {
281                 new_transaction = kmem_cache_zalloc(transaction_cache,
282                                                     gfp_mask);
283                 if (!new_transaction) {
284                         /*
285                          * If __GFP_FS is not present, then we may be
286                          * being called from inside the fs writeback
287                          * layer, so we MUST NOT fail.  Since
288                          * __GFP_NOFAIL is going away, we will arrange
289                          * to retry the allocation ourselves.
290                          */
291                         if ((gfp_mask & __GFP_FS) == 0) {
292                                 congestion_wait(BLK_RW_ASYNC, HZ/50);
293                                 goto alloc_transaction;
294                         }
295                         return -ENOMEM;
296                 }
297         }
298 
299         jbd_debug(3, "New handle %p going live.\n", handle);
---------- linux-3.19/fs/jbd2/transaction.c ----------

---------- linux-3.19/fs/jbd/transaction.c ----------
 84 static int start_this_handle(journal_t *journal, handle_t *handle)
 85 {
 86         transaction_t *transaction;
 87         int needed;
 88         int nblocks = handle->h_buffer_credits;
 89         transaction_t *new_transaction = NULL;
 90         int ret = 0;
 91 
 92         if (nblocks > journal->j_max_transaction_buffers) {
 93                 printk(KERN_ERR "JBD: %s wants too many credits (%d > %d)\n",
 94                        current->comm, nblocks,
 95                        journal->j_max_transaction_buffers);
 96                 ret = -ENOSPC;
 97                 goto out;
 98         }
 99 
100 alloc_transaction:
101         if (!journal->j_running_transaction) {
102                 new_transaction = kzalloc(sizeof(*new_transaction),
103                                                 GFP_NOFS|__GFP_NOFAIL);
104                 if (!new_transaction) {
105                         ret = -ENOMEM;
106                         goto out;
107                 }
108         }
109 
110         jbd_debug(3, "New handle %p going live.\n", handle);
---------- linux-3.19/fs/jbd/transaction.c ----------


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  9:19                         ` Andrew Morton
@ 2015-02-21 13:48                           ` Tetsuo Handa
  2015-02-21 21:38                           ` Dave Chinner
  2015-02-22  0:20                           ` Johannes Weiner
  2 siblings, 0 replies; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-21 13:48 UTC (permalink / raw)
  To: akpm
  Cc: tytso, hannes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner,
	rientjes, linux-ext4, torvalds

Andrew Morton wrote:
> On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> > +akpm
> 
> I was hoping not to have to read this thread ;)

Sorry this got so complicated.

> What I'm not really understanding is why the pre-3.19 implementation
> actually worked.  We've exhausted the free pages, we're not succeeding
> at reclaiming anything, we aren't able to oom-kill anyone.  Yet it
> *does* work - we eventually find that memory and everything proceeds.
> 
> How come?  Where did that memory come from?
> 

Even without __GFP_NOFAIL, GFP_NOFS / GFP_NOIO allocations retried forever
(without invoking the OOM killer) as long as order <= PAGE_ALLOC_COSTLY_ORDER
and TIF_MEMDIE was not set. Somebody else freed memory while the allocation
was retrying. This implies a silent hang-up forever if nobody frees memory.

> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop.  If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath") inadvertently changed GFP_NOFS / GFP_NOIO allocations
not to retry unless __GFP_NOFAIL is specified. Therefore, either applying
Johannes's akpm-doesnt-know-why-it-works patch or passing __GFP_NOFAIL
will restore the pre-3.19 behavior (with the possibility of silent hang-up).


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  9:19                         ` Andrew Morton
  2015-02-21 13:48                           ` Tetsuo Handa
@ 2015-02-21 21:38                           ` Dave Chinner
  2015-02-22  0:20                           ` Johannes Weiner
  2 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-21 21:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Tetsuo Handa, hannes, oleg, xfs, mhocko,
	linux-mm, mgorman, dchinner, rientjes, linux-ext4, torvalds

On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> > +akpm
> 
> I was hoping not to have to read this thread ;)

ditto....

> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop.  If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

I'm not about to change behaviour "just because". Any sort of change
like this requires a *lot* of low memory regression testing because
we'd be replacing long-standing known behaviour with behaviour that
changes without warning, e.g. the ext4 low-memory failures that started
because of changes made in 3.19-rc6 to oom-killer behaviour.
Those changes *did not affect XFS* and that's the way I'd like
things to remain.

Put simply: right now I don't trust the mm subsystem to get low memory
behaviour right, and this thread has done nothing to convince me
that it's going to improve any time soon.

> Also, this:
> 
> On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure.
> 
> David, I did not know this!  If you've been telling us about this then
> perhaps it wasn't loud enough.

IME, such bug reports get ignored.

Instead, over the past few months I have been pointing out bugs and
problems in the oom-killer in threads like this because it seems to
be the only way to get any attention to the issues I'm seeing. Bug
reports simply get ignored.  From this process, I've managed to
learn that low-order memory allocation now never fails (contrary to
documentation and long-standing behavioural expectations) and
pointed out bugs that cause the oom killer to get invoked when the
filesystem is saying "I can handle ENOMEM!" (commit 45f87de ("mm:
get rid of radix tree gfp mask for pagecache_get_page")).

And yes, I've definitely mentioned in these discussions that, for
example, xfstests::generic/224 is triggering the oom killer far more
often than it used to on my 1GB RAM vm. The only fix that has been
made recently that's made any difference is 45f87de, so it's a slow
process of raising awareness and trying to ensure things don't get
worse before they get better....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21 11:12                       ` Tetsuo Handa
@ 2015-02-21 21:48                         ` Dave Chinner
  0 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-21 21:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Sat, Feb 21, 2015 at 08:12:08PM +0900, Tetsuo Handa wrote:
> My main issue is
> 
>   c) whether to oom-kill more processes when the OOM victim cannot be
>      terminated presumably due to the OOM killer deadlock.
> 
> Dave Chinner wrote:
> > On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> > > Dave Chinner wrote:
> > > > I really don't care about the OOM Killer corner cases - it's
> > > > completely the wrong line of development to be spending time on
> > > > and you aren't going to convince me otherwise. The OOM killer is a
> > > > crutch used to justify having a memory allocation subsystem that
> > > > can't provide forward progress guarantee mechanisms to callers that
> > > > need it.
> > > 
> > > I really care about the OOM Killer corner cases, for I'm
> > > 
> > >   (1) seeing trouble cases which occurred in enterprise systems
> > >       under OOM conditions
> > 
> > You reach OOM, then your SLAs are dead and buried. Reboot the
> > box - it's a much more reliable way of returning to a working system
> > than playing Russian Roulette with the OOM killer.
> 
> What Service Level Agreements? Such troubles are occurring on RHEL systems
> where users are not sitting in front of the console. Unless somebody is
> sitting in front of the console, ready to do SysRq-b when trouble
> occurs, the downtime of the system becomes significantly longer.
>
> What mechanisms are available for minimizing system downtime when
> trouble occurs under OOM conditions? A software/hardware watchdog?
> Indeed, they may help, but they may trigger prematurely when the
> system has not actually entered the OOM condition. Only the OOM killer knows.

# echo 1 > /proc/sys/vm/panic_on_oom

....

> We have memory cgroups to reduce the possibility of triggering the OOM
> killer, though several bugs remain in RHEL kernels that make
> administrators hesitate to use memory cgroups.

Fix upstream first, then worry about vendor kernels.

....

> Not only can we not expect the OOM killer messages to be saved to
> /var/log/messages under the OOM killer deadlock condition, but also

CONFIG_PSTORE=y and configure appropriately from there.

> we do not emit the OOM killer messages if we hit

So add a warning.

> If you want to stop people from playing Russian Roulette with the OOM
> killer, please remove the OOM killer code entirely from RHEL kernels so that
> people must use their systems with hardcoded /proc/sys/vm/panic_on_oom == 1
> setting. Can you do it?

No. You need to go through vendor channels to get a vendor kernel
config change made.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 22:52                 ` Dave Chinner
  2015-02-20 10:36                   ` Tetsuo Handa
@ 2015-02-21 23:52                   ` Johannes Weiner
  2015-02-23  0:45                     ` Dave Chinner
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-02-21 23:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> I will actively work around anything that causes filesystem memory
> pressure to increase the chance of oom killer invocations. The OOM
> killer is not a solution - it is, by definition, a loose cannon and
> so we should be reducing dependencies on it.

Once we have a better-working alternative, sure.

> I really don't care about the OOM Killer corner cases - it's
> completely the wrong line of development to be spending time on
> and you aren't going to convince me otherwise. The OOM killer is a
> crutch used to justify having a memory allocation subsystem that
> can't provide forward progress guarantee mechanisms to callers that
> need it.

We can provide this.  Are all these callers able to preallocate?

---

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e72a917..af81b8a67651 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -380,6 +380,10 @@ extern void free_kmem_pages(unsigned long addr, unsigned int order);
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
+void register_private_page(struct page *page, unsigned int order);
+int alloc_private_pages(gfp_t gfp_mask, unsigned int order, unsigned int nr);
+void free_private_pages(void);
+
 void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..1fe390779f23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1545,6 +1545,8 @@ struct task_struct {
 #endif
 
 /* VM state */
+	struct list_head private_pages;
+
 	struct reclaim_state *reclaim_state;
 
 	struct backing_dev_info *backing_dev_info;
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139615a0..b6349b0e5da2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1308,6 +1308,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	memset(&p->rss_stat, 0, sizeof(p->rss_stat));
 #endif
 
+	INIT_LIST_HEAD(&p->private_pages);
+
 	p->default_timer_slack_ns = current->timer_slack_ns;
 
 	task_io_accounting_init(&p->ioac);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a47f0b229a1a..546db4e0da75 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -490,12 +490,10 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 static inline void set_page_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
-	__SetPageBuddy(page);
 }
 
 static inline void rmv_page_order(struct page *page)
 {
-	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 }
 
@@ -617,6 +615,7 @@ static inline void __free_one_page(struct page *page,
 			list_del(&buddy->lru);
 			zone->free_area[order].nr_free--;
 			rmv_page_order(buddy);
+			__ClearPageBuddy(buddy);
 		}
 		combined_idx = buddy_idx & page_idx;
 		page = page + (combined_idx - page_idx);
@@ -624,6 +623,7 @@ static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
+	__SetPageBuddy(page);
 
 	/*
 	 * If this is not the largest possible page, check if the buddy
@@ -924,6 +924,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		list_add(&page[size].lru, &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
+		__SetPageBuddy(&page[size]);
 	}
 }
 
@@ -1015,6 +1016,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 							struct page, lru);
 		list_del(&page->lru);
 		rmv_page_order(page);
+		__ClearPageBuddy(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
 		set_freepage_migratetype(page, migratetype);
@@ -1212,6 +1214,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 			/* Remove the page from the freelists */
 			list_del(&page->lru);
 			rmv_page_order(page);
+			__ClearPageBuddy(page);
 
 			expand(zone, page, order, current_order, area,
 					buddy_type);
@@ -1598,6 +1601,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	list_del(&page->lru);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
+	__ClearPageBuddy(page);
 
 	/* Set the pageblock if the isolated page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
@@ -2504,6 +2508,40 @@ retry:
 	return page;
 }
 
+/* Try to allocate from the caller's private memory reserves */
+static inline struct page *
+__alloc_pages_private(gfp_t gfp_mask, unsigned int order,
+		      const struct alloc_context *ac)
+{
+	unsigned int uninitialized_var(alloc_order);
+	struct page *page = NULL;
+	struct page *p;
+
+	/* Dopey, but this is a slowpath right before OOM */
+	list_for_each_entry(p, &current->private_pages, lru) {
+		int o = page_order(p);
+
+		if (o >= order && (!page || o < alloc_order)) {
+			page = p;
+			alloc_order = o;
+		}
+	}
+	if (!page)
+		return NULL;
+
+	list_del(&page->lru);
+	rmv_page_order(page);
+
+	/* Give back the remainder */
+	while (alloc_order > order) {
+		alloc_order--;
+		set_page_order(&page[1 << alloc_order], alloc_order);
+		list_add(&page[1 << alloc_order].lru, &current->private_pages);
+	}
+
+	return page;
+}
+
 /*
  * This is called in the allocator slow-path if the allocation request is of
  * sufficient urgency to ignore watermarks and take other desperate measures
@@ -2753,9 +2791,13 @@ retry:
 		/*
 		 * If we fail to make progress by freeing individual
 		 * pages, but the allocation wants us to keep going,
-		 * start OOM killing tasks.
+		 * dip into private reserves, or start OOM killing.
 		 */
 		if (!did_some_progress) {
+			page = __alloc_pages_private(gfp_mask, order, ac);
+			if (page)
+				goto got_pg;
+
 			page = __alloc_pages_may_oom(gfp_mask, order, ac,
 							&did_some_progress);
 			if (page)
@@ -3046,6 +3088,82 @@ void free_pages_exact(void *virt, size_t size)
 EXPORT_SYMBOL(free_pages_exact);
 
 /**
+ * alloc_private_pages - allocate private memory reserve pages
+ * @gfp_mask: gfp flags for the allocations
+ * @order: order of pages to allocate
+ * @nr: number of pages to allocate
+ *
+ * This allocates @nr pages of order @order as an emergency reserve of
+ * the calling task, to be used by the page allocator if an allocation
+ * would otherwise fail.
+ *
+ * The caller is responsible for calling free_private_pages() once the
+ * reserves are no longer required.
+ */
+int alloc_private_pages(gfp_t gfp_mask, unsigned int order, unsigned int nr)
+{
+	struct page *page, *page2;
+	LIST_HEAD(pages);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		page = alloc_pages(gfp_mask, order);
+		if (!page)
+			goto error;
+		set_page_order(page, order);
+		list_add(&page->lru, &pages);
+	}
+
+	list_splice(&pages, &current->private_pages);
+	return 0;
+
+error:
+	list_for_each_entry_safe(page, page2, &pages, lru) {
+		list_del(&page->lru);
+		rmv_page_order(page);
+		__free_pages(page, order);
+	}
+	return -ENOMEM;
+}
+
+/**
+ * register_private_page - register a private memory reserve page
+ * @page: pre-allocated page
+ * @order: @page's order
+ *
+ * This registers @page as an emergency reserve of the calling task,
+ * to be used by the page allocator if an allocation would otherwise
+ * fail.
+ *
+ * The caller is responsible for calling free_private_pages() once the
+ * reserves are no longer required.
+ */
+void register_private_page(struct page *page, unsigned int order)
+{
+	set_page_order(page, order);
+	list_add(&page->lru, &current->private_pages);
+}
+
+/**
+ * free_private_pages - free all private memory reserve pages
+ *
+ * Frees all (remaining) pages of the calling task's memory reserves
+ * established by alloc_private_pages() and register_private_page().
+ */
+void free_private_pages(void)
+{
+	struct page *page, *page2;
+
+	list_for_each_entry_safe(page, page2, &current->private_pages, lru) {
+		int order = page_order(page);
+
+		list_del(&page->lru);
+		rmv_page_order(page);
+		__free_pages(page, order);
+	}
+}
+
+/**
  * nr_free_zone_pages - count number of pages beyond high watermark
  * @offset: The zone index of the highest zone
  *
@@ -6551,6 +6669,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 		list_del(&page->lru);
 		rmv_page_order(page);
+		__ClearPageBuddy(page);
 		zone->free_area[order].nr_free--;
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  9:19                         ` Andrew Morton
  2015-02-21 13:48                           ` Tetsuo Handa
  2015-02-21 21:38                           ` Dave Chinner
@ 2015-02-22  0:20                           ` Johannes Weiner
  2015-02-23 10:48                             ` Michal Hocko
  2015-02-23 21:33                             ` David Rientjes
  2 siblings, 2 replies; 83+ messages in thread
From: Johannes Weiner @ 2015-02-22  0:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Tetsuo Handa, rientjes, oleg, xfs, mhocko,
	linux-mm, mgorman, dchinner, linux-ext4, torvalds

On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> applying Johannes's akpm-doesnt-know-why-it-works patch:
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (high_zoneidx < ZONE_NORMAL)
>  			goto out;
>  		/* The OOM killer does not compensate for light reclaim */
> -		if (!(gfp_mask & __GFP_FS))
> +		if (!(gfp_mask & __GFP_FS)) {
> +			/*
> +			 * XXX: Page reclaim didn't yield anything,
> +			 * and the OOM killer can't be invoked, but
> +			 * keep looping as per should_alloc_retry().
> +			 */
> +			*did_some_progress = 1;
>  			goto out;
> +		}
>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> 
> Have people adequately confirmed that this gets us out of trouble?

I'd be interested in this too.  Who is seeing these failures?

Andrew, can you please use the following changelog for this patch?

---
From: Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change

Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
into allocation slowpath"), which should have been a simple cleanup
patch, accidentally changed the behavior to aborting the allocation at
that point.  This creates problems with filesystem callers (?) that
currently rely on the allocator waiting for other tasks to intervene.

Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.

Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21 23:52                   ` Johannes Weiner
@ 2015-02-23  0:45                     ` Dave Chinner
  2015-02-23  1:29                       ` Andrew Morton
                                         ` (3 more replies)
  0 siblings, 4 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-23  0:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
> On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> > I will actively work around anything that causes filesystem memory
> > pressure to increase the chance of oom killer invocations. The OOM
> > killer is not a solution - it is, by definition, a loose cannon and
> > so we should be reducing dependencies on it.
> 
> Once we have a better-working alternative, sure.

Great, but first a simple request: please stop writing code and
instead start architecting a solution to the problem. i.e. we need a
design, and we need it documented, before code gets written. If you
watched my recent LCA talk, then you'll understand what I mean
when I say: stop programming and start engineering.

> > I really don't care about the OOM Killer corner cases - it's
> > completely the wrong line of development to be spending time on
> > and you aren't going to convince me otherwise. The OOM killer is a
> > crutch used to justify having a memory allocation subsystem that
> > can't provide forward progress guarantee mechanisms to callers that
> > need it.
> 
> We can provide this.  Are all these callers able to preallocate?

Anything that allocates in transaction context (and therefore is
GFP_NOFS by definition) can preallocate at transaction reservation
time. However, preallocation is dumb, complex, CPU and memory
intensive and will have a *massive* impact on performance.
Allocating 10-100 pages to a reserve which we will almost *never
use* and then free them again *on every single transaction* is a lot
of unnecessary additional fast path overhead.  Hence a "preallocate
for every context" reserve pool is not a viable solution.

And, really, "reservation" != "preallocation".

Maybe it's my filesystem background, but those two things are vastly
different.

Reservations are simply an *accounting* of the maximum amount of a
reserve required by an operation to guarantee forwards progress. In
filesystems, we do this for log space (transactions) and some do it
for filesystem space (e.g. delayed allocation needs correct ENOSPC
detection so we don't overcommit disk space).  The VM already has
such concepts (e.g. watermarks and things like min_free_kbytes) that
it uses to ensure that there are sufficient reserves for certain
types of allocations to succeed.

A reserve memory pool is no different - every time a memory reserve
occurs, a watermark is lifted to accommodate it, and the transaction
is not allowed to proceed until the amount of free memory exceeds
that watermark. The memory allocation subsystem then only allows
*allocations* marked correctly to allocate pages from the
reserve that watermark protects. e.g. only allocations using
__GFP_RESERVE are allowed to dip into the reserve pool.

By using watermarks, freeing of memory will automatically top
up the reserve pool which means that we guarantee that reclaimable
memory allocated for demand paging during transactions doesn't
deplete the reserve pool permanently.  As a result, when there is
plenty of free and/or reclaimable memory, the reserve pool
watermarks will have almost zero impact on performance and
behaviour.
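The watermark-based reserve accounting described above can be sketched as a toy model. This is illustrative Python, not kernel code; the `reserve()`/`may_use_reserve` interface is an assumption standing in for a hypothetical `__GFP_RESERVE`, not a real mm/ API:

```python
class Zone:
    def __init__(self, free_pages, min_watermark):
        self.free = free_pages
        self.watermark = min_watermark  # ordinary allocations must stay above this
        self.reserved = 0               # total outstanding reservations

    def reserve(self, pages):
        # A reservation is pure accounting: lift the watermark, move no pages.
        # The transaction may not proceed until free memory exceeds it.
        self.watermark += pages
        self.reserved += pages
        return self.free > self.watermark

    def unreserve(self, pages):
        self.watermark -= pages
        self.reserved -= pages

    def alloc(self, pages, may_use_reserve=False):
        # Only reserve-marked allocations may dip below the lifted watermark,
        # down to the original (unlifted) level.
        limit = self.watermark - self.reserved if may_use_reserve else self.watermark
        if self.free - pages < limit:
            return False  # caller must reclaim or wait
        self.free -= pages
        return True
```

Note how freeing memory "tops up" the pool automatically: the reserve is just the gap between the lifted and unlifted watermarks, so no pages ever need to be set aside while memory is plentiful.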

Further, because it's just accounting and behavioural thresholds,
this allows the mm subsystem to control how the reserve pool is
accounted internally. e.g. clean, reclaimable pages in the page
cache could serve as reserve pool pages as they can be immediately
reclaimed for allocation. This could be achieved by setting reclaim
targets first to the reserve pool watermark, then the second target
is enough pages to satisfy the current allocation.

And, FWIW, there's nothing stopping this mechanism from having
order-based reserve thresholds. e.g. IB could really do with a 64k reserve
pool threshold and hence help solve the long standing problems they
have with filling the receive ring in GFP_ATOMIC context...

Sure, that's looking further down the track, but my point still
remains: we need a viable long term solution to this problem. Maybe
reservations are not the solution, but I don't see anyone else who
is thinking of how to address this architectural problem at a system
level right now.  We need to design and document the model first,
then review it, then we can start working at the code level to
implement the solution we've designed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45                     ` Dave Chinner
@ 2015-02-23  1:29                       ` Andrew Morton
  2015-02-23  7:32                         ` Dave Chinner
  2015-02-28 16:29                       ` Johannes Weiner
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 83+ messages in thread
From: Andrew Morton @ 2015-02-23  1:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:

> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer is a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> > 
> > We can provide this.  Are all these callers able to preallocate?
> 
> Anything that allocates in transaction context (and therefore is
> GFP_NOFS by definition) can preallocate at transaction reservation
> time. However, preallocation is dumb, complex, CPU and memory
> intensive and will have a *massive* impact on performance.
> Allocating 10-100 pages to a reserve which we will almost *never
> use* and then free them again *on every single transaction* is a lot
> of unnecessary additional fast path overhead.  Hence a "preallocate
> for every context" reserve pool is not a viable solution.

Yup.

> Reservations are simply an *accounting* of the maximum amount of a
> reserve required by an operation to guarantee forwards progress. In
> filesystems, we do this for log space (transactions) and some do it
> for filesystem space (e.g. delayed allocation needs correct ENOSPC
> detection so we don't overcommit disk space).  The VM already has
> such concepts (e.g. watermarks and things like min_free_kbytes) that
> it uses to ensure that there are sufficient reserves for certain
> types of allocations to succeed.

Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc.  Add a dynamic
reserve.  So to reserve N pages we increase the page allocator dynamic
reserve by N, do some reclaim if necessary then deposit N tokens into
the caller's task_struct (it'll be a set of zone/nr-pages tuples I
suppose).

When allocating pages the caller should drain its reserves in
preference to dipping into the regular freelist.  This guy has already
done his reclaim and shouldn't be penalised a second time.  I guess
Johannes's preallocation code should switch to doing this for the same
reason, plus the fact that snipping a page off
task_struct.prealloc_pages is super-fast and needs to be done sometime
anyway so why not do it by default.
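Andrew's token scheme can be sketched as follows. This is a toy, single-zone model; the real proposal is per-zone (zone/nr-pages tuples) and lives in the page allocator and `task_struct`, not a Python dict:

```python
# Reserving N pages raises a global dynamic reserve and deposits N tokens
# with the caller; allocation drains the caller's own tokens before
# touching the shared freelist.

class PageAllocator:
    def __init__(self, free_pages):
        self.free = free_pages
        self.dynamic_reserve = 0  # pages promised to reservation holders

    def reserve(self, task, n):
        if self.free < self.dynamic_reserve + n:
            return False  # would need reclaim first
        self.dynamic_reserve += n
        task["tokens"] += n  # stand-in for task_struct bookkeeping
        return True

    def alloc_page(self, task):
        # Drain the task's own tokens first: it already paid for its
        # reclaim when it reserved, so don't penalise it a second time.
        if task["tokens"] > 0:
            task["tokens"] -= 1
            self.dynamic_reserve -= 1
            self.free -= 1
            return True
        # Ordinary path: may not eat into pages promised to reservers.
        if self.free > self.dynamic_reserve:
            self.free -= 1
            return True
        return False
```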

Both reservation and preallocation are vulnerable to deadlocks - 10,000
tasks all trying to reserve/prealloc 100 pages, they all have 50 pages
and we ran out of memory.  Whoops.  We can undeadlock by returning
ENOMEM but I suspect there will still be problematic situations where
massive numbers of pages are temporarily AWOL.  Perhaps some form of
queuing and throttling will be needed, to limit the peak number of
reserved pages.  Per zone, I guess.
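The deadlock above in miniature: incremental reservation lets every task grab part of what it needs and starve all the others, whereas all-or-nothing admission (queue until the full reservation can be granted) guarantees every admitted task finishes. Toy numbers, illustrative only:

```python
def incremental(free, tasks, need):
    # Each task grabs one page at a time until memory runs out.
    held = [0] * tasks
    while free:
        for i in range(tasks):
            if free and held[i] < need:
                held[i] += 1
                free -= 1
    return sum(1 for h in held if h == need)  # tasks able to proceed

def all_or_nothing(free, tasks, need):
    done = 0
    for _ in range(tasks):
        if free >= need:      # admit only when the whole reservation fits
            free -= need      # run the transaction with its reservation...
            free += need      # ...then release it for the next in queue
            done += 1
    return done
```

With 500 free pages and 10 tasks each needing 100, the incremental scheme leaves every task holding 50 pages and nobody able to proceed, while the queued scheme serialises admission but lets all 10 complete.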

And it'll be a huge pain handling order>0 pages.  I'd be inclined to
make it order-0 only, and tell the lamer callers that
vmap-is-thattaway.  Alas, one lame caller is slub.


But the biggest issue is how the heck does a caller work out how many
pages to reserve/prealloc?  Even a single sb_bread() - it's sitting on
loop on a sparse NTFS file on loop on a five-deep DM stack on a
six-deep MD stack on loop on NFS on an eleventy-deep networking stack. 
And then there will be an unknown number of slab allocations of unknown
size with unknown slabs-per-page rules - how many pages needed for
them?  And to make it much worse, how many pages of which orders? 
Bless its heart, slub will go and use a 1-order page for allocations
which should have been in 0-order pages..


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  1:29                       ` Andrew Morton
@ 2015-02-23  7:32                         ` Dave Chinner
  2015-02-27 18:24                           ` Vlastimil Babka
                                             ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-23  7:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > > > I really don't care about the OOM Killer corner cases - it's
> > > > completely the wrong line of development to be spending time on
> > > > and you aren't going to convince me otherwise. The OOM killer is a
> > > > crutch used to justify having a memory allocation subsystem that
> > > > can't provide forward progress guarantee mechanisms to callers that
> > > > need it.
> > > 
> > > We can provide this.  Are all these callers able to preallocate?
> > 
> > Anything that allocates in transaction context (and therefore is
> > GFP_NOFS by definition) can preallocate at transaction reservation
> > time. However, preallocation is dumb, complex, CPU and memory
> > intensive and will have a *massive* impact on performance.
> > Allocating 10-100 pages to a reserve which we will almost *never
> > use* and then free them again *on every single transaction* is a lot
> > of unnecessary additional fast path overhead.  Hence a "preallocate
> > for every context" reserve pool is not a viable solution.
> 
> Yup.
> 
> > Reservations are simply an *accounting* of the maximum amount of a
> > reserve required by an operation to guarantee forwards progress. In
> > filesystems, we do this for log space (transactions) and some do it
> > for filesystem space (e.g. delayed allocation needs correct ENOSPC
> > detection so we don't overcommit disk space).  The VM already has
> > such concepts (e.g. watermarks and things like min_free_kbytes) that
> > it uses to ensure that there are sufficient reserves for certain
> > types of allocations to succeed.
> 
> Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc.  Add a dynamic
> reserve.  So to reserve N pages we increase the page allocator dynamic
> reserve by N, do some reclaim if necessary then deposit N tokens into
> the caller's task_struct (it'll be a set of zone/nr-pages tuples I
> suppose).
> 
> When allocating pages the caller should drain its reserves in
> preference to dipping into the regular freelist.  This guy has already
> done his reclaim and shouldn't be penalised a second time.  I guess
> Johannes's preallocation code should switch to doing this for the same
> reason, plus the fact that snipping a page off
> task_struct.prealloc_pages is super-fast and needs to be done sometime
> anyway so why not do it by default.

That is at odds with the requirements of demand paging, which
allocates objects that are reclaimable within the course of the
transaction. The reserve is there to ensure forward progress for
allocations for objects that aren't freed until after the
transaction completes, but if we drain it for reclaimable objects we
then have nothing left in the reserve pool when we actually need it.

We do not know ahead of time if the object we are allocating is
going to be modified and hence locked into the transaction. Hence we
can't say "use the reserve for this *specific* allocation", and so
the only guidance we can really give is "we will allocate and
*permanently consume* this much memory", and the reserve pool needs
to cover that consumption to guarantee forwards progress.

Forwards progress for all other allocations is guaranteed because
they are reclaimable objects - they either freed directly back to
their source (slab, heap, page lists) or they are freed by shrinkers
once they have been released from the transaction.

Hence we need allocations to come from the free list and trigger
reclaim, regardless of the fact there is a reserve pool there. The
reserve pool needs to be a last resort once there are no other
avenues to allocate memory. i.e. it would be used to replace the OOM
killer for GFP_NOFAIL allocations.
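The allocation ordering described here can be sketched briefly: freelist first, then reclaim, and the reserve only as a last resort for __GFP_NOFAIL-style callers holding a reservation. Pure illustration; the dict stands in for zone state:

```python
def allocate(state, nofail=False, has_reservation=False):
    if state["free"] > 0:
        state["free"] -= 1          # normal freelist allocation
        return True
    if state["reclaimable"] > 0:
        state["reclaimable"] -= 1   # trigger reclaim as usual
        return True
    if nofail and has_reservation and state["reserve"] > 0:
        state["reserve"] -= 1       # last resort, replacing the OOM killer
        return True
    return False
```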

> Both reservation and preallocation are vulnerable to deadlocks - 10,000
> tasks all trying to reserve/prealloc 100 pages, they all have 50 pages
> and we ran out of memory.  Whoops.

Yes, that's the big problem with preallocation, as well as your
proposed "deplete the reserved memory first" approach. They
*require* up front "preallocation" of free memory, either directly
by the application, or internally by the mm subsystem.

Hence my comments about appropriate classification of "reserved
memory". Reserved memory does not necessarily need to be on the free
list. It could be "immediately reclaimable" memory, so that
reserving memory doesn't need to immediately reclaim memory; it
can be pulled from the reclaimable memory reserves when
memory pressure occurs. If there is no memory pressure, we do
nothing because we have no need to do anything....

> We can undeadlock by returning ENOMEM but I suspect there will
> still be problematic situations where massive numbers of pages are
> temporarily AWOL.  Perhaps some form of queuing and throttling
> will be needed,

Yes, I think that is necessary, but I don't see it as necessary in the
MM subsystem. XFS already has a ticket-based queue mechanisms for
throttling concurrent access to ensure we don't overcommit log space
and I'd want to tie the two together...

> to limit the peak number of reserved pages.  Per
> zone, I guess.

Internal implementation issue that I don't really care about.
When it comes to guaranteeing memory allocation, global context
is all I care about. Locality of allocation simply doesn't matter;
we want the page we reserved, no matter where it is located.

> And it'll be a huge pain handling order>0 pages.  I'd be inclined
> to make it order-0 only, and tell the lamer callers that
> vmap-is-thattaway.  Alas, one lame caller is slub.

Sure, but vmap requires GFP_KERNEL memory allocation and we're
talking about allocation in transactions, which are GFP_NOFS.

I've lost count of the number of times we've asked for that problem
to be fixed. Refusing to fix it has simply led to the growing use
of ugly hacks around that problem (i.e. memalloc_noio_save() and
friends).

> But the biggest issue is how the heck does a caller work out how
> many pages to reserve/prealloc?  Even a single sb_bread() - it's
> sitting on loop on a sparse NTFS file on loop on a five-deep DM
> stack on a six-deep MD stack on loop on NFS on an eleventy-deep
> networking stack. 

Each subsystem needs to take care of itself first, then we can worry
about esoteric stacking requirements.

Besides, stacking requirements through the IO layer is still pretty
trivial - we only need to guarantee single IO progress from the
highest layer as it can be recycled again and again for every IO
that needs to be done.

And, because mempools already give that guarantee to most block
devices and drivers, we won't need to reserve memory for most block
devices to make forwards progress. It's only crazy "recurse through
filesystem" configurations where this will be an issue.

> And then there will be an unknown number of
> slab allocations of unknown size with unknown slabs-per-page rules
> - how many pages needed for them?

However many pages are needed to allocate the number of objects we'll
consume from the slab.

> And to make it much worse, how
> many pages of which orders?  Bless its heart, slub will go and use
> a 1-order page for allocations which should have been in 0-order
> pages..

The majority of allocations will be order-0, though if we know that
they are going to be significant numbers of high order allocations,
then it should be simple enough to tell the mm subsystem "need a
reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have
memory compaction just do its stuff. But, IMO, we should cross that
bridge when somebody actually needs reservations to be that
specific....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  3:20                       ` Theodore Ts'o
  2015-02-21  9:19                         ` Andrew Morton
  2015-02-21 12:00                         ` Tetsuo Handa
@ 2015-02-23 10:26                         ` Michal Hocko
  2 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2015-02-23 10:26 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, hannes, linux-mm, mgorman,
	rientjes, akpm, linux-ext4, torvalds

On Fri 20-02-15 22:20:00, Theodore Ts'o wrote:
[...]
> So based on akpm's sage advise and wisdom, I added back GFP_NOFAIL to
> ext4/jbd2.

I am currently going through opencoded GFP_NOFAIL allocations and have
this in my local branch currently. I assume you did the same so I will
drop mine if you have pushed yours already.
---
>From dc49cef75dbd677d5542c9e5bd27bbfab9a7bc3a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 20 Feb 2015 11:32:58 +0100
Subject: [PATCH] jbd2: revert must-not-fail allocation loops back to
 GFP_NOFAIL

This basically reverts 47def82672b3 (jbd2: Remove __GFP_NOFAIL from jbd2
layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
to open coding the endless loop around the allocator rather than
removing the dependency on the non-failing allocation. So the
deprecation was a clear failure and the reality tells us that
__GFP_NOFAIL is not even close to going away.

It is still true that __GFP_NOFAIL allocations are generally discouraged
and new uses should be evaluated and an alternative (pre-allocations or
reservations) should be considered but it doesn't make any sense to lie
the allocator about the requirements. Allocator can take steps to help
making a progress if it knows the requirements.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 fs/jbd2/journal.c     | 11 +----------
 fs/jbd2/transaction.c | 20 +++++++-------------
 2 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 1df94fabe4eb..878ed3e761f0 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -371,16 +371,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	 */
 	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
 
-retry_alloc:
-	new_bh = alloc_buffer_head(GFP_NOFS);
-	if (!new_bh) {
-		/*
-		 * Failure is not an option, but __GFP_NOFAIL is going
-		 * away; so we retry ourselves here.
-		 */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry_alloc;
-	}
+	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
 
 	/* keep subsequent assertions sane */
 	atomic_set(&new_bh->b_count, 1);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5f09370c90a8..dac4523fa142 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -278,22 +278,16 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
+		/*
+		 * If __GFP_FS is not present, then we may be being called from
+		 * inside the fs writeback layer, so we MUST NOT fail.
+		 */
+		if ((gfp_mask & __GFP_FS) == 0)
+			gfp_mask |= __GFP_NOFAIL;
 		new_transaction = kmem_cache_zalloc(transaction_cache,
 						    gfp_mask);
-		if (!new_transaction) {
-			/*
-			 * If __GFP_FS is not present, then we may be
-			 * being called from inside the fs writeback
-			 * layer, so we MUST NOT fail.  Since
-			 * __GFP_NOFAIL is going away, we will arrange
-			 * to retry the allocation ourselves.
-			 */
-			if ((gfp_mask & __GFP_FS) == 0) {
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
-				goto alloc_transaction;
-			}
+		if (!new_transaction)
 			return -ENOMEM;
-		}
 	}
 
 	jbd_debug(3, "New handle %p going live.\n", handle);
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-22  0:20                           ` Johannes Weiner
@ 2015-02-23 10:48                             ` Michal Hocko
  2015-02-23 11:23                               ` Tetsuo Handa
  2015-02-23 21:33                             ` David Rientjes
  1 sibling, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-02-23 10:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Theodore Ts'o, Tetsuo Handa, rientjes, oleg, xfs, linux-mm,
	mgorman, dchinner, Andrew Morton, linux-ext4, torvalds

On Sat 21-02-15 19:20:58, Johannes Weiner wrote:
> On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> > Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> > applying Johannes's akpm-doesnt-know-why-it-works patch:
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		if (high_zoneidx < ZONE_NORMAL)
> >  			goto out;
> >  		/* The OOM killer does not compensate for light reclaim */
> > -		if (!(gfp_mask & __GFP_FS))
> > +		if (!(gfp_mask & __GFP_FS)) {
> > +			/*
> > +			 * XXX: Page reclaim didn't yield anything,
> > +			 * and the OOM killer can't be invoked, but
> > +			 * keep looping as per should_alloc_retry().
> > +			 */
> > +			*did_some_progress = 1;
> >  			goto out;
> > +		}
> >  		/*
> >  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> >  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > 
> > Have people adequately confirmed that this gets us out of trouble?
> 
> I'd be interested in this too.  Who is seeing these failures?
> 
> Andrew, can you please use the following changelog for this patch?
> 
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> 
> Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> killer once reclaim had failed, but nevertheless kept looping in the
> allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> into allocation slowpath"), which should have been a simple cleanup
> patch, accidentally changed the behavior to aborting the allocation at
> that point.  This creates problems with filesystem callers (?) that
> currently rely on the allocator waiting for other tasks to intervene.
> 
> Revert the behavior as it shouldn't have been changed as part of a
> cleanup patch.

OK, if this is a _short term_ change. I really think that all the requests
except for __GFP_NOFAIL should be able to fail. I would argue that the
callers should be fixed, but it is true that the patch was introduced too
late (rc7) and so it caught other subsystems unprepared, so backporting
to stable makes sense to me. But can we please
move on and stop pretending that allocations do not fail for the
upcoming release?

> Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23 10:48                             ` Michal Hocko
@ 2015-02-23 11:23                               ` Tetsuo Handa
  0 siblings, 0 replies; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-23 11:23 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: tytso, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm,
	linux-ext4, torvalds

Michal Hocko wrote:
> On Sat 21-02-15 19:20:58, Johannes Weiner wrote:
> > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> > > applying Johannes's akpm-doesnt-know-why-it-works patch:
> > > 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  		if (high_zoneidx < ZONE_NORMAL)
> > >  			goto out;
> > >  		/* The OOM killer does not compensate for light reclaim */
> > > -		if (!(gfp_mask & __GFP_FS))
> > > +		if (!(gfp_mask & __GFP_FS)) {
> > > +			/*
> > > +			 * XXX: Page reclaim didn't yield anything,
> > > +			 * and the OOM killer can't be invoked, but
> > > +			 * keep looping as per should_alloc_retry().
> > > +			 */
> > > +			*did_some_progress = 1;
> > >  			goto out;
> > > +		}
> > >  		/*
> > >  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > >  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > > 
> > > Have people adequately confirmed that this gets us out of trouble?
> > 
> > I'd be interested in this too.  Who is seeing these failures?

So far ext4 and xfs. I don't have an environment to test other filesystems.

> > 
> > Andrew, can you please use the following changelog for this patch?
> > 
> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> > 
> > Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> > killer once reclaim had failed, but nevertheless kept looping in the
> > allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> > into allocation slowpath"), which should have been a simple cleanup
> > patch, accidentally changed the behavior to aborting the allocation at
> > that point.  This creates problems with filesystem callers (?) that
> > currently rely on the allocator waiting for other tasks to intervene.
> > 
> > Revert the behavior as it shouldn't have been changed as part of a
> > cleanup patch.
> 
> OK, if this a _short term_ change. I really think that all the requests
> except for __GFP_NOFAIL should be able to fail. I would argue that it
> should be the caller who should be fixed but it is true that the patch
> was introduced too late (rc7) and so it caught other subsystems
> unprepared so backporting to stable makes sense to me. But can we please
> move on and stop pretending that allocations do not fail for the
> upcoming release?
> 
> > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>
> 

Without this patch, I think the system becomes unusable under OOM.
However, with this patch, I know the system may become unusable under
OOM. Please do write patches for handling the condition below.

  Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Johannes's patch will get us out of filesystem error troubles, at
the cost of getting us into stall troubles (as before 3.19-rc6).

I retested http://marc.info/?l=linux-ext4&m=142443125221571&w=2
with debug printk patch shown below.

---------- debug printk patch ----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d503e9c..5144506 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -610,6 +610,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+atomic_t oom_killer_skipped_count = ATOMIC_INIT(0);
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
@@ -679,6 +681,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 				 nodemask, "Out of memory");
 		killed = 1;
 	}
+	else
+		atomic_inc(&oom_killer_skipped_count);
 out:
 	/*
 	 * Give the killed threads a good chance of exiting before trying to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e20f9c..eaea16b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
-		if (!(gfp_mask & __GFP_FS))
+		if (!(gfp_mask & __GFP_FS)) {
+			/*
+			 * XXX: Page reclaim didn't yield anything,
+			 * and the OOM killer can't be invoked, but
+			 * keep looping as per should_alloc_retry().
+			 */
+			*did_some_progress = 1;
 			goto out;
+		}
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -2635,6 +2642,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+extern atomic_t oom_killer_skipped_count;
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2649,6 +2658,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long first_retried_time = 0;
+	unsigned long next_warn_time = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2821,6 +2832,19 @@ retry:
 			if (!did_some_progress)
 				goto nopage;
 		}
+		if (!first_retried_time) {
+			first_retried_time = jiffies;
+			if (!first_retried_time)
+				first_retried_time = 1;
+			next_warn_time = first_retried_time + 5 * HZ;
+		} else if (time_after(jiffies, next_warn_time)) {
+			printk(KERN_INFO "%d (%s) : gfp 0x%X : %lu seconds : "
+			       "OOM-killer skipped %u\n", current->pid,
+			       current->comm, gfp_mask,
+			       (jiffies - first_retried_time) / HZ,
+			       atomic_read(&oom_killer_skipped_count));
+			next_warn_time = jiffies + 5 * HZ;
+		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
---------- debug printk patch ----------
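
The rate-limited warning in the debug patch above follows a standard
jiffies pattern: record a first-retry timestamp (reserving 0 as the
"unset" sentinel), then warn at most once per interval using a
wraparound-safe comparison. A userspace sketch of that pattern, with
hypothetical names (ticks_after mimics what the kernel's time_after
macro does via signed subtraction):

```c
#include <assert.h>

/* Simplified jiffies-style tick comparison: nonzero when a is after b,
 * correct across counter wraparound because the subtraction is
 * evaluated as a signed difference (mirrors the kernel's time_after). */
static int ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct retry_state {
	unsigned long first_retried_time;	/* 0 = not yet retried */
	unsigned long next_warn_time;
};

/* Returns 1 when a "stalling allocation" warning should be printed at
 * time now, updating the state; warns at most once per interval. */
static int should_warn(struct retry_state *s, unsigned long now,
		       unsigned long interval)
{
	if (!s->first_retried_time) {
		/* Reserve 0 as the "unset" sentinel, as the patch does. */
		s->first_retried_time = now ? now : 1;
		s->next_warn_time = s->first_retried_time + interval;
		return 0;
	}
	if (ticks_after(now, s->next_warn_time)) {
		s->next_warn_time = now + interval;
		return 1;
	}
	return 0;
}
```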

GFP_NOFS allocations stalled for 10 minutes waiting for somebody else
to volunteer memory. GFP_FS allocations stalled for 10 minutes waiting
for the OOM killer to kill somebody. The OOM killer stalled for 10
minutes waiting for GFP_NOFS allocations to complete.

I guess the system made forward progress because the number of remaining
a.out processes decreased over time.

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-ext4-patched.txt.xz )
---------- ext4 / Linux 3.19 + patch ----------
[ 1335.187579] Out of memory: Kill process 14156 (a.out) score 760 or sacrifice child
[ 1335.189604] Killed process 14156 (a.out) total-vm:2167392kB, anon-rss:1360196kB, file-rss:0kB
[ 1335.191920] Kill process 14177 (a.out) sharing same memory
[ 1335.193465] Kill process 14178 (a.out) sharing same memory
[ 1335.195013] Kill process 14179 (a.out) sharing same memory
[ 1335.196580] Kill process 14180 (a.out) sharing same memory
[ 1335.198128] Kill process 14181 (a.out) sharing same memory
[ 1335.199674] Kill process 14182 (a.out) sharing same memory
[ 1335.201217] Kill process 14183 (a.out) sharing same memory
[ 1335.202768] Kill process 14184 (a.out) sharing same memory
[ 1335.204316] Kill process 14185 (a.out) sharing same memory
[ 1335.205871] Kill process 14186 (a.out) sharing same memory
[ 1335.207420] Kill process 14187 (a.out) sharing same memory
[ 1335.208974] Kill process 14188 (a.out) sharing same memory
[ 1335.210515] Kill process 14189 (a.out) sharing same memory
[ 1335.212063] Kill process 14190 (a.out) sharing same memory
[ 1335.213611] Kill process 14191 (a.out) sharing same memory
[ 1335.215165] Kill process 14192 (a.out) sharing same memory
[ 1335.216715] Kill process 14193 (a.out) sharing same memory
[ 1335.218286] Kill process 14194 (a.out) sharing same memory
[ 1335.219836] Kill process 14195 (a.out) sharing same memory
[ 1335.221378] Kill process 14196 (a.out) sharing same memory
[ 1335.222918] Kill process 14197 (a.out) sharing same memory
[ 1335.224461] Kill process 14198 (a.out) sharing same memory
[ 1335.225999] Kill process 14199 (a.out) sharing same memory
[ 1335.227545] Kill process 14200 (a.out) sharing same memory
[ 1335.229095] Kill process 14201 (a.out) sharing same memory
[ 1335.230643] Kill process 14202 (a.out) sharing same memory
[ 1335.232184] Kill process 14203 (a.out) sharing same memory
[ 1335.233738] Kill process 14204 (a.out) sharing same memory
[ 1335.235293] Kill process 14205 (a.out) sharing same memory
[ 1335.236834] Kill process 14206 (a.out) sharing same memory
[ 1335.238387] Kill process 14207 (a.out) sharing same memory
[ 1335.239930] Kill process 14208 (a.out) sharing same memory
[ 1335.241471] Kill process 14209 (a.out) sharing same memory
[ 1335.243011] Kill process 14210 (a.out) sharing same memory
[ 1335.244554] Kill process 14211 (a.out) sharing same memory
[ 1335.246101] Kill process 14212 (a.out) sharing same memory
[ 1335.247645] Kill process 14213 (a.out) sharing same memory
[ 1335.249182] Kill process 14214 (a.out) sharing same memory
[ 1335.250718] Kill process 14215 (a.out) sharing same memory
[ 1335.252305] Kill process 14216 (a.out) sharing same memory
[ 1335.253899] Kill process 14217 (a.out) sharing same memory
[ 1335.255443] Kill process 14218 (a.out) sharing same memory
[ 1335.256993] Kill process 14219 (a.out) sharing same memory
[ 1335.258531] Kill process 14220 (a.out) sharing same memory
[ 1335.260066] Kill process 14221 (a.out) sharing same memory
[ 1335.261616] Kill process 14222 (a.out) sharing same memory
[ 1335.263143] Kill process 14223 (a.out) sharing same memory
[ 1335.264647] Kill process 14224 (a.out) sharing same memory
[ 1335.266121] Kill process 14225 (a.out) sharing same memory
[ 1335.267598] Kill process 14226 (a.out) sharing same memory
[ 1335.269077] Kill process 14227 (a.out) sharing same memory
[ 1335.270560] Kill process 14228 (a.out) sharing same memory
[ 1335.272038] Kill process 14229 (a.out) sharing same memory
[ 1335.273508] Kill process 14230 (a.out) sharing same memory
[ 1335.274999] Kill process 14231 (a.out) sharing same memory
[ 1335.276469] Kill process 14232 (a.out) sharing same memory
[ 1335.277947] Kill process 14233 (a.out) sharing same memory
[ 1335.279428] Kill process 14234 (a.out) sharing same memory
[ 1335.280894] Kill process 14235 (a.out) sharing same memory
[ 1335.282361] Kill process 14236 (a.out) sharing same memory
[ 1335.283832] Kill process 14237 (a.out) sharing same memory
[ 1335.285304] Kill process 14238 (a.out) sharing same memory
[ 1335.286768] Kill process 14239 (a.out) sharing same memory
[ 1335.288242] Kill process 14240 (a.out) sharing same memory
[ 1335.289714] Kill process 14241 (a.out) sharing same memory
[ 1335.291196] Kill process 14242 (a.out) sharing same memory
[ 1335.292731] Kill process 14243 (a.out) sharing same memory
[ 1335.294258] Kill process 14244 (a.out) sharing same memory
[ 1335.295734] Kill process 14245 (a.out) sharing same memory
[ 1335.297215] Kill process 14246 (a.out) sharing same memory
[ 1335.298710] Kill process 14247 (a.out) sharing same memory
[ 1335.300188] Kill process 14248 (a.out) sharing same memory
[ 1335.301672] Kill process 14249 (a.out) sharing same memory
[ 1335.303157] Kill process 14250 (a.out) sharing same memory
[ 1335.304655] Kill process 14251 (a.out) sharing same memory
[ 1335.306141] Kill process 14252 (a.out) sharing same memory
[ 1335.307621] Kill process 14253 (a.out) sharing same memory
[ 1335.309107] Kill process 14254 (a.out) sharing same memory
[ 1335.310573] Kill process 14255 (a.out) sharing same memory
[ 1335.312052] Kill process 14256 (a.out) sharing same memory
[ 1335.313528] Kill process 14257 (a.out) sharing same memory
[ 1335.315039] Kill process 14258 (a.out) sharing same memory
[ 1335.316522] Kill process 14259 (a.out) sharing same memory
[ 1335.317992] Kill process 14260 (a.out) sharing same memory
[ 1335.319462] Kill process 14261 (a.out) sharing same memory
[ 1335.320965] Kill process 14262 (a.out) sharing same memory
[ 1335.322459] Kill process 14263 (a.out) sharing same memory
[ 1335.323958] Kill process 14264 (a.out) sharing same memory
[ 1335.325472] Kill process 14265 (a.out) sharing same memory
[ 1335.326966] Kill process 14266 (a.out) sharing same memory
[ 1335.328454] Kill process 14267 (a.out) sharing same memory
[ 1335.329945] Kill process 14268 (a.out) sharing same memory
[ 1335.331444] Kill process 14269 (a.out) sharing same memory
[ 1335.332944] Kill process 14270 (a.out) sharing same memory
[ 1335.334435] Kill process 14271 (a.out) sharing same memory
[ 1335.335930] Kill process 14272 (a.out) sharing same memory
[ 1335.337437] Kill process 14273 (a.out) sharing same memory
[ 1335.338927] Kill process 14274 (a.out) sharing same memory
[ 1335.340400] Kill process 14275 (a.out) sharing same memory
[ 1335.341890] Kill process 14276 (a.out) sharing same memory
[ 1339.640500] 464 (systemd-journal) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459181
[ 1339.649374] 615 (vmtoolsd) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459438
[ 1339.649611] 4079 (pool) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459447
[ 1340.343322] 14258 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275
[ 1340.343331] 14194 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275
[ 1340.343345] 14210 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478276
[ 1340.343360] 14179 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478277
[ 1340.345290] 14154 (su) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22478339
[ 1340.345312] 14180 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339
[ 1340.345319] 14260 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339
[ 1340.345337] 14178 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340
[ 1340.345345] 14245 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340
[ 1340.345361] 14226 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478341
[ 1340.346119] 14256 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478368
[ 1340.346139] 14181 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478369
[ 1340.347082] 14274 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347091] 14267 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347095] 14189 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347099] 14238 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402
[ 1340.347107] 14276 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403
[ 1340.347112] 14183 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403
[ 1340.347397] 14254 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413
[ 1340.347402] 14228 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413
[ 1340.347414] 14185 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347419] 14261 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347423] 14217 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347427] 14203 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414
[ 1340.347439] 14234 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415
[ 1340.347452] 14269 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415
[ 1340.347461] 14255 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347465] 14192 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347473] 14259 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416
[ 1340.347492] 14232 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347497] 14223 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347505] 14220 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417
[ 1340.347523] 14252 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418
[ 1340.347531] 14193 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418
(...snipped...)
[ 1949.672951] 43 (kworker/1:1) : gfp 0x10 : 90 seconds : OOM-killer skipped 41315348
[ 1949.993045] 4079 (pool) : gfp 0x201DA : 615 seconds : OOM-killer skipped 41325108
[ 1950.694909] 14269 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41346727
[ 1950.703945] 14181 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41347003
[ 1950.742087] 14254 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348208
[ 1950.744937] 14193 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348299
[ 1950.748884] 2 (kthreadd) : gfp 0x2000D0 : 10 seconds : OOM-killer skipped 41348418
[ 1950.751565] 14203 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348502
[ 1950.756955] 14232 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348656
[ 1950.776918] 14185 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349279
[ 1950.791214] 14217 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349720
[ 1950.798961] 14179 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349957
[ 1950.806551] 14255 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350209
[ 1950.810860] 14234 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350356
[ 1950.813821] 14258 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350450
[ 1950.860422] 14261 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41351919
[ 1950.864015] 14210 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352033
[ 1950.866636] 14226 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352107
[ 1950.905003] 14238 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353303
[ 1950.907813] 14180 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353381
[ 1950.913963] 14276 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353567
[ 1952.238344] 649 (chronyd) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393388
[ 1952.243228] 4030 (gnome-shell) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393566
[ 1952.247225] 592 (audispd) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393701
[ 1952.258265] 1 (systemd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394041
[ 1952.269296] 1691 (rpcbind) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394365
[ 1952.299073] 702 (rtkit-daemon) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41395288
[ 1952.301231] 627 (lsmd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41395385
[ 1952.350200] 464 (systemd-journal) : gfp 0x201DA : 165 seconds : OOM-killer skipped 41396935
[ 1952.472040] 543 (auditd) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400669
[ 1952.475211] 14154 (su) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400795
[ 1952.527084] 3514 (smbd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402412
[ 1952.543205] 613 (irqbalance) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402892
[ 1952.568276] 12672 (pickup) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403656
[ 1952.572329] 770 (tuned) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41403784
[ 1952.578076] 3392 (master) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403955
[ 1952.597273] 615 (vmtoolsd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41404520
[ 1952.619187] 14146 (sleep) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405206
[ 1952.621214] 811 (NetworkManager) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405265
[ 1952.765035] 3700 (gnome-settings-) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409551
[ 1952.776099] 603 (alsactl) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409856
[ 1952.823163] 661 (crond) : gfp 0x201DA : 325 seconds : OOM-killer skipped 41411303
[ 1953.201269] SysRq : Resetting
---------- ext4 / Linux 3.19 + patch ----------

I also tested on XFS, one run on plain Linux 3.19 and the other on
Linux 3.19 with the debug printk patch shown above. According to the
console logs, on the former kernel oom_kill_process() is only called
via pagefault_out_of_memory(). Presumably because !__GFP_FS
allocations now give up immediately?

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz )
---------- xfs / Linux 3.19 ----------
[  793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
[  793.283102] su cpuset=/ mems_allowed=0
[  793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40
[  793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  793.283161]  0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe
[  793.283162]  ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206
[  793.283163]  0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8
[  793.283164] Call Trace:
[  793.283169]  [<ffffffff816ae9d4>] dump_stack+0x45/0x57
[  793.283171]  [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1
[  793.283174]  [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390
[  793.283177]  [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30
[  793.283178]  [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500
[  793.283179]  [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90
[  793.283180]  [<ffffffff816aab2c>] mm_fault_error+0x67/0x140
[  793.283182]  [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580
[  793.283185]  [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60
[  793.283186]  [<ffffffff81070fcb>] ? do_wait+0x12b/0x240
[  793.283187]  [<ffffffff8105abb1>] do_page_fault+0x31/0x70
[  793.283189]  [<ffffffff816b83e8>] page_fault+0x28/0x30
---------- xfs / Linux 3.19 ----------

On the other hand, a stall is observed with the latter kernel.
I guess that this time the system failed to make forward progress,
since oom_killer_skipped_count kept increasing over time while the
number of remaining a.out processes remained unchanged.

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-patched.txt.xz )
---------- xfs / Linux 3.19 + patch ----------
[ 2062.847965] 505 (abrt-watch-log) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388568
[ 2062.850270] 515 (lsmd) : gfp 0x2015A : 674 seconds : OOM-killer skipped 22388662
[ 2062.850389] 491 (audispd) : gfp 0x2015A : 666 seconds : OOM-killer skipped 22388667
[ 2062.850400] 346 (systemd-journal) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388667
[ 2062.850402] 610 (rtkit-daemon) : gfp 0x2015A : 677 seconds : OOM-killer skipped 22388667
[ 2062.850424] 494 (alsactl) : gfp 0x2015A : 546 seconds : OOM-killer skipped 22388668
[ 2062.850446] 558 (crond) : gfp 0x2015A : 645 seconds : OOM-killer skipped 22388669
[ 2062.850451] 25532 (su) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388669
[ 2062.850456] 516 (vmtoolsd) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388669
[ 2062.850494] 741 (NetworkManager) : gfp 0x2015A : 530 seconds : OOM-killer skipped 22388670
[ 2062.850503] 3132 (master) : gfp 0x2015A : 644 seconds : OOM-killer skipped 22388671
[ 2062.850508] 3144 (pickup) : gfp 0x2015A : 604 seconds : OOM-killer skipped 22388671
[ 2062.850512] 3145 (qmgr) : gfp 0x2015A : 526 seconds : OOM-killer skipped 22388671
[ 2062.850540] 25653 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388672
[ 2062.850561] 655 (tuned) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388673
[ 2062.852404] 10429 (kworker/0:14) : gfp 0x2040D0 : 683 seconds : OOM-killer skipped 22388748
[ 2062.852430] 543 (chronyd) : gfp 0x2015A : 293 seconds : OOM-killer skipped 22388749
[ 2062.852436] 13012 (goa-daemon) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22388749
[ 2062.852449] 1454 (rpcbind) : gfp 0x2015A : 662 seconds : OOM-killer skipped 22388749
[ 2062.854288] 466 (auditd) : gfp 0x2015A : 626 seconds : OOM-killer skipped 22388751
[ 2062.854305] 25622 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751
[ 2062.854426] 1419 (dhclient) : gfp 0x2015A : 388 seconds : OOM-killer skipped 22388751
[ 2062.854443] 25638 (a.out) : gfp 0x204250 : 683 seconds : OOM-killer skipped 22388751
[ 2062.854450] 25582 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751
[ 2062.854462] 25400 (sleep) : gfp 0x2015A : 635 seconds : OOM-killer skipped 22388751
[ 2062.854469] 532 (smartd) : gfp 0x2015A : 246 seconds : OOM-killer skipped 22388751
[ 2062.854486] 2 (kthreadd) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22388752
[ 2062.854497] 3867 (gnome-shell) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388752
[ 2062.854502] 3562 (gnome-settings-) : gfp 0x2015A : 676 seconds : OOM-killer skipped 22388752
[ 2062.854524] 25641 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753
[ 2062.854536] 25566 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753
[ 2062.908915] 61 (kworker/3:1) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22390715
[ 2062.913407] 531 (irqbalance) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22390894
[ 2064.988155] SysRq : Resetting
---------- xfs / Linux 3.19 + patch ----------

Oh, the current code gives too few hints for determining whether forward
progress is being made, because no kernel messages are printed when an
OOM victim fails to die immediately. I wish we had the debug printk
patch shown above and/or something
like http://marc.info/?l=linux-mm&m=141671829611143&w=2 .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-22  0:20                           ` Johannes Weiner
  2015-02-23 10:48                             ` Michal Hocko
@ 2015-02-23 21:33                             ` David Rientjes
  1 sibling, 0 replies; 83+ messages in thread
From: David Rientjes @ 2015-02-23 21:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Theodore Ts'o, Tetsuo Handa, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, Andrew Morton, linux-ext4, torvalds

On Sat, 21 Feb 2015, Johannes Weiner wrote:

> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> 
> Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> killer once reclaim had failed, but nevertheless kept looping in the
> allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> into allocation slowpath"), which should have been a simple cleanup
> patch, accidentally changed the behavior to aborting the allocation at
> that point.  This creates problems with filesystem callers (?) that
> currently rely on the allocator waiting for other tasks to intervene.
> 
> Revert the behavior as it shouldn't have been changed as part of a
> cleanup patch.
> 
> Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Cc: stable@vger.kernel.org [3.19]
Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  7:32                         ` Dave Chinner
@ 2015-02-27 18:24                           ` Vlastimil Babka
  2015-02-28  0:03                             ` Dave Chinner
  2015-03-02  9:39                           ` Vlastimil Babka
  2015-03-02 20:22                           ` Johannes Weiner
  2 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2015-02-27 18:24 UTC (permalink / raw)
  To: Dave Chinner, Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On 02/23/2015 08:32 AM, Dave Chinner wrote:
>> > And then there will be an unknown number of
>> > slab allocations of unknown size with unknown slabs-per-page rules
>> > - how many pages needed for them?
> However many pages needed to allocate the number of objects we'll
> consume from the slab.

I think the best way would be for slab to also learn to provide reserves
for individual objects: either just mark internally how many of them are
reserved, if a sufficient number are free, or translate this into page
allocator reserves, since slab knows which order it uses for the given
objects.
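
As a toy model of that idea (entirely hypothetical, not real slab
code): a cache could track how many of its free objects are promised
to reservers, granting a reservation immediately when enough objects
are already free, and falling back to the page allocator reserves
(not modelled here) otherwise:

```c
#include <assert.h>

/* Hypothetical per-object reserve accounting for a slab-like cache.
 * Invariant: nr_reserved <= nr_free, so the unsigned subtraction below
 * never underflows. */
struct toy_cache {
	unsigned int nr_free;		/* currently free objects */
	unsigned int nr_reserved;	/* objects promised to reservers */
};

/* Try to reserve count objects; returns 1 on success. */
static int toy_obj_reserve(struct toy_cache *c, unsigned int count)
{
	if (c->nr_free - c->nr_reserved >= count) {
		c->nr_reserved += count;
		return 1;
	}
	return 0;	/* would translate into a page allocator reserve */
}

/* Allocating from the reserve consumes a free and a reserved slot. */
static int toy_alloc_reserved(struct toy_cache *c)
{
	if (!c->nr_reserved || !c->nr_free)
		return 0;
	c->nr_reserved--;
	c->nr_free--;
	return 1;
}
```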

>> > And to make it much worse, how
>> > many pages of which orders?  Bless its heart, slub will go and use
>> > a 1-order page for allocations which should have been in 0-order
>> > pages..
> The majority of allocations will be order-0, though if we know that
> they are going to be significant numbers of high order allocations,
> then it should be simple enough to tell the mm subsystem "need a
> reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have
> memory compaction just do it's stuff. But, IMO, we should cross that
> bridge when somebody actually needs reservations to be that
> specific....

Note that watermark checking for higher-order allocations is somewhat fuzzy
compared to order-0 checks, but I guess some kind of reservations could work
there too.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-27 18:24                           ` Vlastimil Babka
@ 2015-02-28  0:03                             ` Dave Chinner
  2015-02-28 15:17                               ` Theodore Ts'o
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-02-28  0:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Fri, Feb 27, 2015 at 07:24:34PM +0100, Vlastimil Babka wrote:
> On 02/23/2015 08:32 AM, Dave Chinner wrote:
> >> > And then there will be an unknown number of
> >> > slab allocations of unknown size with unknown slabs-per-page rules
> >> > - how many pages needed for them?
> > However many pages needed to allocate the number of objects we'll
> > consume from the slab.
> 
> I think the best way is if slab could also learn to provide reserves for
> individual objects. Either just mark internally how many of them are reserved,
> if sufficient number is free, or translate this to the page allocator reserves,
> as slab knows which order it uses for the given objects.

Which is effectively what a slab based mempool is. Mempools don't
guarantee a reserve is available once it's been resized, however,
and we'd have to have mempools configured for every type of
allocation we are going to do. So from that perspective it's not
really a solution.

Further, the kmalloc heap is backed by slab caches. We do *lots* of
variable sized kmalloc allocations in transactions the size of which
aren't known until allocation time.  In that case, we have to assume
it's going to be a page per object, because the allocations could
actually be that size.

AFAICT, the worst case is a slab-backing page allocation for
every slab object that is allocated, so we may as well cater for
that case from the start...
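
The gap between that worst case and the packed best case is easy to
quantify (hypothetical helpers, 4096-byte pages assumed):

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u

/* Best case: objects of a known size pack densely into slab pages. */
static unsigned int pages_if_packed(unsigned int nr_objects,
				    unsigned int object_size)
{
	unsigned int per_page = TOY_PAGE_SIZE / object_size;

	/* Round up: partially filled pages still cost a whole page. */
	return (nr_objects + per_page - 1) / per_page;
}

/* Worst case per the argument above: the kmalloc size is unknown until
 * allocation time and may be as large as a page, so budget one
 * backing page per object. */
static unsigned int pages_worst_case(unsigned int nr_objects)
{
	return nr_objects;
}
```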

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28  0:03                             ` Dave Chinner
@ 2015-02-28 15:17                               ` Theodore Ts'o
  0 siblings, 0 replies; 83+ messages in thread
From: Theodore Ts'o @ 2015-02-28 15:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds,
	Vlastimil Babka

On Sat, Feb 28, 2015 at 11:03:59AM +1100, Dave Chinner wrote:
> > I think the best way is if slab could also learn to provide reserves for
> > individual objects. Either just mark internally how many of them are reserved,
> > if sufficient number is free, or translate this to the page allocator reserves,
> > as slab knows which order it uses for the given objects.
> 
> Which is effectively what a slab based mempool is. Mempools don't
> guarantee a reserve is available once it's been resized, however,
> and we'd have to have mempools configured for every type of
> allocation we are going to do. So from that perspective it's not
> really a solution.

The bigger problem is that it means the upper layer making the
reservation, before it starts taking locks, won't necessarily know
exactly which slab objects it and all of the lower layers might need.

So it's much more flexible, and requires less accuracy, if we can just
request that (a) the mm subsystem reserve at least N pages, and (b)
tell it that, at this point in time, it's safe for the requesting
subsystem to block until N pages are available.
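
A toy model of such a two-call interface might look like this
(entirely hypothetical names and semantics; no such kernel API exists,
and a real implementation would sleep on a waitqueue and kick
reclaim/writeback rather than just report availability):

```c
#include <assert.h>

/* Hypothetical two-step reservation interface: toy_mem_reserve()
 * records how many pages the caller will need; toy_mem_reserve_wait()
 * marks the point where the caller declares it is safe to block until
 * that many pages are actually available. "Blocking" is modelled here
 * as a simple availability check. */
static unsigned long toy_avail_pages;
static unsigned long toy_reserved_pages;

static void toy_mem_reserve(unsigned long nr)
{
	toy_reserved_pages += nr;	/* record intent; never blocks */
}

/* Returns 1 once the outstanding reservations can all be satisfied. */
static int toy_mem_reserve_wait(void)
{
	return toy_avail_pages >= toy_reserved_pages;
}

static void toy_mem_unreserve(unsigned long nr)
{
	toy_reserved_pages -= nr;
}
```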

Can this be guaranteed to be accurate?  No, of course not.  And in
some cases it may not even be possible, since it might depend on
whether the iSCSI device needs to reconnect to the target, or to do
some sort of exception handling, before it can complete its I/O
request.

But it's better than what we have now, which is that once we've taken
certain locks, and/or started a complex transaction, we can't really
back out, so we end up looping either using GFP_NOFAIL, or around the
memory allocation request, if there are still mm developers who are
delusional enough to say, like King Canute, "You must always be able
to handle memory allocation failure at any point in the kernel, and
GFP_NOFAIL is an indication of a subsystem bug!"

I can imagine using some adjustment factors, where a particularly
voracious device might require a hint to the file system to boost its
memory allocation estimate by 30% or 50%.  So yes, it's a very,
*very* rough estimate.  And if we guess wrong, we might end up having
to loop a la GFP_NOFAIL anyway.  But it's better than not having such
an estimate.

I also grant that this doesn't work very well for emergency writeback,
or background writeback, where we can't and shouldn't block waiting
for enough memory to become free, since page cleaning is one of the
ways that we might be able to make memory available.  But if that's
the only problem we have, we're in good shape, since that can be
solved by either (a) doing a better job throttling memory allocations
or memory reservation requests in the first place, and/or (b) starting
the background writeback much more aggressively and earlier.

    	       		      	   		- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45                     ` Dave Chinner
  2015-02-23  1:29                       ` Andrew Morton
@ 2015-02-28 16:29                       ` Johannes Weiner
  2015-02-28 16:41                         ` Theodore Ts'o
  2015-02-28 18:36                       ` Vlastimil Babka
  2015-03-02 15:18                       ` Michal Hocko
  3 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-02-28 16:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Mon, Feb 23, 2015 at 11:45:21AM +1100, Dave Chinner wrote:
> On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
> > On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> > > I will actively work around aanything that causes filesystem memory
> > > pressure to increase the chance of oom killer invocations. The OOM
> > > killer is not a solution - it is, by definition, a loose cannon and
> > > so we should be reducing dependencies on it.
> > 
> > Once we have a better-working alternative, sure.
> 
> Great, but first a simple request: please stop writing code and
> instead start architecting a solution to the problem. i.e. we need a
> design and have that documented before code gets written. If you
> watched my recent LCA talk, then you'll understand what I mean
> when I say: stop programming and start engineering.

This code was for the sake of argument, see below.

> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong way line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> > 
> > We can provide this.  Are all these callers able to preallocate?
> 
> Anything that allocates in transaction context (and therefor is
> GFP_NOFS by definition) can preallocate at transaction reservation
> time. However, preallocation is dumb, complex, CPU and memory
> intensive and will have a *massive* impact on performance.
> Allocating 10-100 pages to a reserve which we will almost *never
> use* and then free them again *on every single transaction* is a lot
> of unnecessary additional fast path overhead.  Hence a "preallocate
> for every context" reserve pool is not a viable solution.

You are missing the point of my question.  Whether we allocate right
away or make sure the memory is allocatable later on is a matter of
cost, but the logical outcome is the same.  That is not my concern
right now.

An OOM killer allows transactional allocation sites to get away
without planning ahead.  You are arguing that the OOM killer is a
cop-out on the MM side but I see it as the opposite: it puts a lot of
complexity in the allocator so that callsites can maneuver themselves
into situations where they absolutely need to get memory - or corrupt
user data - without actually making sure their needs will be covered.

If we replace __GFP_NOFAIL + OOM killer with a reserve system, we are
putting the full responsibility on the user.  Are you sure this is
going to reduce our kernel-wide error rate?

> And, really, "reservation" != "preallocation".

That's an implementation detail.  Yes, the example implementation was
dumb and heavy-handed, but a reservation system that works based on
watermarks, and considers clean cache readily allocatable, is not much
more complex than that.

I'm trying to figure out if the current nofail allocators can get
their memory needs figured out beforehand.  And reliably so - what
good are estimates that are right 90% of the time, when failing the
allocation means corrupting user data?  What is the contingency plan?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 16:29                       ` Johannes Weiner
@ 2015-02-28 16:41                         ` Theodore Ts'o
  2015-02-28 22:15                           ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2015-02-28 16:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> 
> I'm trying to figure out if the current nofail allocators can get
> their memory needs figured out beforehand.  And reliably so - what
> good are estimates that are right 90% of the time, when failing the
> allocation means corrupting user data?  What is the contingency plan?

In the ideal world, we can figure out the exact memory needs
beforehand.  But we live in an imperfect world, and given that block
devices *also* need memory, the answer is "of course not".  We can't
> be perfect.  But we can at least give some kind of hint, and we can offer
to wait before we get into a situation where we need to loop in
GFP_NOWAIT --- which is the contingency/fallback plan.

I'm sure that's not very satisfying, but it's better than what we have
now.

					- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45                     ` Dave Chinner
  2015-02-23  1:29                       ` Andrew Morton
  2015-02-28 16:29                       ` Johannes Weiner
@ 2015-02-28 18:36                       ` Vlastimil Babka
  2015-03-02 15:18                       ` Michal Hocko
  3 siblings, 0 replies; 83+ messages in thread
From: Vlastimil Babka @ 2015-02-28 18:36 UTC (permalink / raw)
  To: Dave Chinner, Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On 23.2.2015 1:45, Dave Chinner wrote:
> On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
>> On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
>>> I will actively work around anything that causes filesystem memory
>>> pressure to increase the chance of oom killer invocations. The OOM
>>> killer is not a solution - it is, by definition, a loose cannon and
>>> so we should be reducing dependencies on it.
>>
>> Once we have a better-working alternative, sure.
> 
> Great, but first a simple request: please stop writing code and
> instead start architecting a solution to the problem. i.e. we need a
> design and have that documented before code gets written. If you
> watched my recent LCA talk, then you'll understand what I mean
> when I say: stop programming and start engineering.

About that... I guess good engineering also means looking at past solutions to
the same problem. I expect there would be a lot of academic work on this, which
might tell us what's (not) possible. And maybe even actual implementations with
real-life experience to learn from?

>>> I really don't care about the OOM Killer corner cases - it's
>>> completely the wrong line of development to be spending time on
>>> and you aren't going to convince me otherwise. The OOM killer is a
>>> crutch used to justify having a memory allocation subsystem that
>>> can't provide forward progress guarantee mechanisms to callers that
>>> need it.
>>
>> We can provide this.  Are all these callers able to preallocate?
> 
> Anything that allocates in transaction context (and therefore is
> GFP_NOFS by definition) can preallocate at transaction reservation
> time. However, preallocation is dumb, complex, CPU and memory
> intensive and will have a *massive* impact on performance.
> Allocating 10-100 pages to a reserve which we will almost *never
> use* and then free them again *on every single transaction* is a lot
> of unnecessary additional fast path overhead.  Hence a "preallocate
> for every context" reserve pool is not a viable solution.

But won't even the reservation have a potentially large impact on performance
if, as you later suggest (IIUC), we don't actually dip into our reserves until
regular reclaim starts failing? Doesn't that mean potentially a lot of wasted
memory? Right, it doesn't have to if we allow clean reclaimable pages to be
part of the reserve, but still...

> And, really, "reservation" != "preallocation".
> 
> Maybe it's my filesystem background, but those two things are vastly
> different things.
> 
> Reservations are simply an *accounting* of the maximum amount of a
> reserve required by an operation to guarantee forwards progress. In
> filesystems, we do this for log space (transactions) and some do it
> for filesystem space (e.g. delayed allocation needs correct ENOSPC
> detection so we don't overcommit disk space).  The VM already has
> such concepts (e.g. watermarks and things like min_free_kbytes) that
> it uses to ensure that there are sufficient reserves for certain
> types of allocations to succeed.
> 
> A reserve memory pool is no different - every time a memory reserve
> occurs, a watermark is lifted to accommodate it, and the transaction
> is not allowed to proceed until the amount of free memory exceeds
> that watermark. The memory allocation subsystem then only allows
> *allocations* marked correctly to allocate pages from the reserve
> that watermark protects. e.g. only allocations using
> __GFP_RESERVE are allowed to dip into the reserve pool.
> 
> By using watermarks, freeing of memory will automatically top
> up the reserve pool which means that we guarantee that reclaimable
> memory allocated for demand paging during transactions doesn't
> deplete the reserve pool permanently.  As a result, when there is
> plenty of free and/or reclaimable memory, the reserve pool
> watermarks will have almost zero impact on performance and
> behaviour.
> 
> Further, because it's just accounting and behavioural thresholds,
> this allows the mm subsystem to control how the reserve pool is
> accounted internally. e.g. clean, reclaimable pages in the page
> cache could serve as reserve pool pages as they can be immediately
> reclaimed for allocation. This could be achieved by setting reclaim
> targets first to the reserve pool watermark, then the second target
> is enough pages to satisfy the current allocation.
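
To make the accounting concrete, here is a minimal userspace sketch of the
watermark-lifting scheme described above. All of the names (free_pages,
reserve_wmark, GFP_RESERVE) and numbers are hypothetical illustrations of the
idea, not the actual kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical illustration only: free_pages, reserve_wmark and
 * GFP_RESERVE are invented names, not kernel interfaces. */
static long free_pages = 1000;   /* pages currently free */
static long reserve_wmark = 50;  /* baseline minimum watermark */

#define GFP_NORMAL  0x0
#define GFP_RESERVE 0x1          /* stand-in for a __GFP_RESERVE flag */

/* Reserving lifts the watermark; the transaction would not proceed
 * until free memory covers the lifted watermark. */
static bool mem_reserve(long pages)
{
        if (free_pages < reserve_wmark + pages)
                return false;    /* caller must wait for reclaim first */
        reserve_wmark += pages;
        return true;
}

static void mem_unreserve(long pages)
{
        reserve_wmark -= pages;
}

/* Normal allocations may not dip below the watermark; only
 * reserve-marked allocations may consume the protected pages. */
static bool alloc_one_page(int gfp)
{
        long floor = (gfp & GFP_RESERVE) ? 0 : reserve_wmark;

        if (free_pages <= floor)
                return false;
        free_pages--;
        return true;
}
```

Note how, because this is pure accounting, freeing pages back into free_pages
automatically "tops up" the reserve: nothing is held aside physically.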

Hmm, but what if the clean pages need us to take some locks to unmap, and some
process holding them is blocked... Also we would need to potentially block a
process that wants to dirty a page; is that being done now?

> And, FWIW, there's nothing stopping this mechanism from have order
> based reserve thresholds. e.g. IB could really do with a 64k reserve
> pool threshold and hence help solve the long standing problems they
> have with filling the receive ring in GFP_ATOMIC context...

I don't know the details here, but if the allocation is done for incoming
packets, i.e. something you can't predict, then how would you set the reserve
for that? If they could predict it, they would already be able to preallocate
the necessary buffers.

> Sure, that's looking further down the track, but my point still
> remains: we need a viable long term solution to this problem. Maybe
> reservations are not the solution, but I don't see anyone else who
> is thinking of how to address this architectural problem at a system
> level right now.  We need to design and document the model first,
> then review it, then we can start working at the code level to
> implement the solution we've designed.

Right. A conference to discuss this at could come in handy :)

> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 16:41                         ` Theodore Ts'o
@ 2015-02-28 22:15                           ` Johannes Weiner
  2015-03-01 11:17                             ` Tetsuo Handa
                                               ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Johannes Weiner @ 2015-02-28 22:15 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > 
> > I'm trying to figure out if the current nofail allocators can get
> > their memory needs figured out beforehand.  And reliably so - what
> > good are estimates that are right 90% of the time, when failing the
> > allocation means corrupting user data?  What is the contingency plan?
> 
> In the ideal world, we can figure out the exact memory needs
> beforehand.  But we live in an imperfect world, and given that block
> devices *also* need memory, the answer is "of course not".  We can't
> be perfect.  But we can at least give some kind of hint, and we can offer
> to wait before we get into a situation where we need to loop in
> GFP_NOWAIT --- which is the contingency/fallback plan.

Overestimating should be fine, the result would be a bit of false memory
pressure.  But underestimating and looping can't be an option or the
original lockups will still be there.  We need to guarantee forward
progress or the problem is somewhat mitigated at best - only now with
quite a bit more complexity in the allocator and the filesystems.

The block code would have to be looked at separately, but doesn't it
already use mempools etc. to guarantee progress?
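
The mempool mechanism referred to here can be modelled in a few lines of
userspace C. This is only a sketch of the concept, with invented names; the
kernel's real mempool_t refills its reserve asynchronously and sleeps rather
than failing:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the mempool idea: keep MIN_NR preallocated objects so
 * that allocation can make progress even when the normal allocator
 * fails.  All names are invented for illustration. */
#define MIN_NR 4

struct toy_mempool {
        void *elements[MIN_NR];
        int curr_nr;            /* objects currently held in reserve */
};

static void toy_mempool_init(struct toy_mempool *p)
{
        for (p->curr_nr = 0; p->curr_nr < MIN_NR; p->curr_nr++)
                p->elements[p->curr_nr] = malloc(64);
}

/* normal_alloc_fails simulates the allocator under memory pressure */
static void *toy_mempool_alloc(struct toy_mempool *p, int normal_alloc_fails)
{
        if (!normal_alloc_fails)
                return malloc(64);
        /* Fall back to the reserve; the kernel would sleep and retry
         * once the pool is refilled, rather than ever fail. */
        if (p->curr_nr > 0)
                return p->elements[--p->curr_nr];
        return NULL;
}

static void toy_mempool_free(struct toy_mempool *p, void *obj)
{
        if (p->curr_nr < MIN_NR)
                p->elements[p->curr_nr++] = obj;  /* refill reserve first */
        else
                free(obj);
}
```

The key property is that objects returned to the pool refill the reserve
before going back to the general allocator, which is what guarantees forward
progress for writeout.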

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 22:15                           ` Johannes Weiner
@ 2015-03-01 11:17                             ` Tetsuo Handa
  2015-03-06 11:53                               ` Tetsuo Handa
  2015-03-01 13:43                             ` Theodore Ts'o
  2015-03-01 21:48                             ` Dave Chinner
  2 siblings, 1 reply; 83+ messages in thread
From: Tetsuo Handa @ 2015-03-01 11:17 UTC (permalink / raw)
  To: hannes, tytso
  Cc: dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm,
	fernando_b1, torvalds

Johannes Weiner wrote:
> On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > 
> > > I'm trying to figure out if the current nofail allocators can get
> > > their memory needs figured out beforehand.  And reliably so - what
> > > good are estimates that are right 90% of the time, when failing the
> > > allocation means corrupting user data?  What is the contingency plan?
> > 
> > In the ideal world, we can figure out the exact memory needs
> > beforehand.  But we live in an imperfect world, and given that block
> > devices *also* need memory, the answer is "of course not".  We can't
> > be perfect.  But we can at least give some kind of hint, and we can offer
> > to wait before we get into a situation where we need to loop in
> > GFP_NOWAIT --- which is the contingency/fallback plan.
> 
> Overestimating should be fine, the result would be a bit of false memory
> pressure.  But underestimating and looping can't be an option or the
> original lockups will still be there.  We need to guarantee forward
> progress or the problem is somewhat mitigated at best - only now with
> quite a bit more complexity in the allocator and the filesystems.
> 
> The block code would have to be looked at separately, but doesn't it
> already use mempools etc. to guarantee progress?
> 

If underestimating is tolerable, can we simply set different watermark
levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations?
For example,

   GFP_KERNEL (or above) can fail if memory usage exceeds 95%
   GFP_NOFS can fail if memory usage exceeds 97%
   GFP_NOIO can fail if memory usage exceeds 98%
   GFP_ATOMIC can fail if memory usage exceeds 99%
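
A minimal sketch of what such tiered watermarks could look like, using the
example percentages above (the names and thresholds here are illustrative
only, not kernel constants):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustration of the proposal above: more-constrained allocation
 * contexts get a deeper watermark.  Names/numbers are examples only. */
enum gfp_ctx { CTX_KERNEL, CTX_NOFS, CTX_NOIO, CTX_ATOMIC };

static int usage_limit_pct(enum gfp_ctx ctx)
{
        switch (ctx) {
        case CTX_KERNEL: return 95;
        case CTX_NOFS:   return 97;
        case CTX_NOIO:   return 98;
        case CTX_ATOMIC: return 99;
        }
        return 95;
}

/* An allocation may proceed only while overall memory usage is below
 * the limit for its context; otherwise it fails instead of looping. */
static bool may_alloc(enum gfp_ctx ctx, int usage_pct)
{
        return usage_pct < usage_limit_pct(ctx);
}
```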

I think it is strange that the order-0 GFP_NOIO allocation below enters a
retry-forever loop as soon as a GFP_KERNEL (or above) allocation starts
waiting for reclaim. Using the same watermark for both prevents kernel worker
threads from processing the workqueue. While it is legal to block in a
workqueue item, being blocked forever monopolizes the workqueue; other jobs
queued on it get stuck.

[  907.302050] kworker/1:0     R  running task        0 10832      2 0x00000080
[  907.303961] Workqueue: events_freezable_power_ disk_events_workfn
[  907.305706]  ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190
[  907.307761]  0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190
[  907.309894]  0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408
[  907.311949] Call Trace:
[  907.312989]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  907.314578]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  907.316182]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  907.317889]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  907.319535]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  907.321259]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  907.322945]  [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380
[  907.324606]  [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160
[  907.326196]  [<ffffffff8125c119>] bio_copy_kern+0x49/0x100
[  907.327788]  [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100
[  907.329549]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  907.331184]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  907.332877]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  907.334452]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  907.336156]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  907.337893]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  907.339539]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  907.341289]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  907.343115]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  907.344771]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  907.346421]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  907.348057]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  907.349650]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  907.351295]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  907.352765]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  907.354520]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  907.356097]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0

When I changed GFP_NOIO in scsi_execute() to GFP_ATOMIC, the trace above went
away. If we can reserve some amount of memory for the block / filesystem
layers rather than handing it to non-critical allocations, the trace above
will likely go away as well.

Or, instead maybe we can change GFP_NOIO to do

  (1) try allocation using GFP_ATOMIC|GFP_NOWARN
  (2) try allocating from freelist for GFP_NOIO
  (3) fail the allocation with warning message

steps, if we can implement a freelist for GFP_NOIO. Ditto for GFP_NOFS.
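
The three steps above could be sketched roughly as follows; try_atomic() and
the per-context freelist are hypothetical stand-ins, since no such freelist
exists in the kernel today:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of the suggested GFP_NOIO fallback order. */
static void *noio_freelist[8];
static int noio_freelist_nr;

/* Simulates step (1): a GFP_ATOMIC | __GFP_NOWARN attempt. */
static void *try_atomic(int atomic_pool_empty)
{
        return atomic_pool_empty ? NULL : malloc(64);
}

static void *noio_alloc(int atomic_pool_empty)
{
        void *p;

        /* (1) try the atomic path without warning */
        p = try_atomic(atomic_pool_empty);
        if (p)
                return p;
        /* (2) fall back to a freelist kept for GFP_NOIO callers */
        if (noio_freelist_nr > 0)
                return noio_freelist[--noio_freelist_nr];
        /* (3) fail the allocation (a warning would be printed here) */
        return NULL;
}
```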

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 22:15                           ` Johannes Weiner
  2015-03-01 11:17                             ` Tetsuo Handa
@ 2015-03-01 13:43                             ` Theodore Ts'o
  2015-03-01 16:15                               ` Johannes Weiner
  2015-03-01 20:17                               ` Johannes Weiner
  2015-03-01 21:48                             ` Dave Chinner
  2 siblings, 2 replies; 83+ messages in thread
From: Theodore Ts'o @ 2015-03-01 13:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> Overestimating should be fine, the result would be a bit of false memory
> pressure.  But underestimating and looping can't be an option or the
> original lockups will still be there.  We need to guarantee forward
> progress or the problem is somewhat mitigated at best - only now with
> quite a bit more complexity in the allocator and the filesystems.

We've lived with looping as it is and in practice it's actually worked
well.  I can only speak for ext4, but I do a lot of testing under very
high memory pressure situations, and it is used in *production* under
very high stress situations --- and the only time we've run into
trouble is when the looping behaviour somehow got accidentally
*removed*.

There have been MM experts who have been worrying about this situation
for a very long time, but honestly, it seems to be much more of a
theoretical than actual concern.  So if you don't want to get
hints/estimates about how much memory the file system is about to use,
when the file system is willing to wait or even potentially return
ENOMEM (although I suspect starting to return ENOMEM where most user
space applications don't expect it will cause more problems), I'm
personally happy to just use GFP_NOFAIL everywhere --- or to hard code
my own infinite loops if the MM developers want to take GFP_NOFAIL
away.  Because in my experience, looping simply hasn't been as awful
as some folks on this thread have made it out to be.

So if you don't like the complexity because the perfect is the enemy
of the good, we can just drop this and the file systems can simply
continue to loop around their memory allocation calls...  or if that
fails we can start adding subsystem specific mempools, which would be
even more wasteful of memory and probably at least as complicated.

							- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 13:43                             ` Theodore Ts'o
@ 2015-03-01 16:15                               ` Johannes Weiner
  2015-03-01 19:36                                 ` Theodore Ts'o
  2015-03-01 20:17                               ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-03-01 16:15 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > Overestimating should be fine, the result would be a bit of false memory
> > pressure.  But underestimating and looping can't be an option or the
> > original lockups will still be there.  We need to guarantee forward
> > progress or the problem is somewhat mitigated at best - only now with
> > quite a bit more complexity in the allocator and the filesystems.
> 
> We've lived with looping as it is and in practice it's actually worked
> well.  I can only speak for ext4, but I do a lot of testing under very
> high memory pressure situations, and it is used in *production* under
> very high stress situations --- and the only time we've run into
> trouble is when the looping behaviour somehow got accidentally
> *removed*.
> 
> There have been MM experts who have been worrying about this situation
> for a very long time, but honestly, it seems to be much more of a
> theoretical than actual concern.

Well, looping is a valid thing to do in most situations because on a
loaded system there is a decent chance that an unrelated thread will
volunteer some unreclaimable memory, or exit altogether.  Right now,
we rely on this happening, and it works most of the time.  Maybe all
the time, depending on how your machine is used.  But when it doesn't,
machines do lock up in practice.

We had these lockups in cgroups with just a handful of threads, which
all got stuck in the allocator and there was nobody left to volunteer
unreclaimable memory.  When this was being addressed, we knew that the
same can theoretically happen on the system-level but weren't aware of
any reports.  Well now, here we are.

It's been argued in this thread that systems shouldn't be pushed to
such extremes in real life and that we simply expect failure at some
point.  If that's the consensus, then yes, we can stop this and tell
users that they should scale back.  But I'm not convinced just yet
that this is the best we can do.

> So if you don't want to get hints/estimates about how much memory
> the file system is about to use, when the file system is willing to
> wait or even potentially return ENOMEM (although I suspect starting
> to return ENOMEM where most user space applications don't expect it
> will cause more problems), I'm personally happy to just use
> GFP_NOFAIL everywhere --- or to hard code my own infinite loops if
> the MM developers want to take GFP_NOFAIL away.  Because in my
> experience, looping simply hasn't been as awful as some folks on
> this thread have made it out to be.

As I've said before, I'd be happy to get estimates from the filesystem
so that we can adjust our reserves, instead of simply running against
the wall at some point and hoping that the OOM killer heuristics will
save the day.

Until then, I'd much prefer __GFP_NOFAIL over open-coded loops.  If
the OOM killer is too aggressive, we can tone it down, but as it
stands that mechanism is the last attempt at forward progress if
looping doesn't work out.  In addition, when we finally transition to
private memory reserves, we can easily find the callsites that need to
be annotated with __GFP_MAY_DIP_INTO_PRIVATE_RESERVES.

> So if you don't like the complexity because the perfect is the enemy
> of the good, we can just drop this and the file systems can simply
> continue to loop around their memory allocation calls...  or if that
> fails we can start adding subsystem specific mempools, which would be
> even more wasteful of memory and probably at least as complicated.

It really depends on what the goal here is.  You don't have to be
perfectly accurate, but if you can give us a worst-case estimate we
can actually guarantee forward progress and eliminate these lockups
entirely, like in the block layer.  Sure, there will be bugs and the
estimates won't be right from the start, but we can converge towards
the right answer.  If the allocations which are allowed to dip into
the reserves - the current nofail sites? - can be annotated with a gfp
flag, we can easily verify the estimates by serving those sites
exclusively from the private reserve pool and emit warnings when that
runs dry.  We wouldn't even have to stress the system for that.

But there are legitimate concerns that this might never work.  For
example, the requirements could be so unpredictable, or assessing them
with reasonable accuracy could be so expensive, that the margin of
error would make the worst case estimate too big to be useful.  Big
enough that the reserves would harm well-behaved systems.  And if
useful worst-case estimates are unattainable, I don't think we need to
bother with reserves.  We can just stick with looping and OOM killing,
that works most of the time, too.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 16:15                               ` Johannes Weiner
@ 2015-03-01 19:36                                 ` Theodore Ts'o
  2015-03-01 20:44                                   ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2015-03-01 19:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote:
> 
> We had these lockups in cgroups with just a handful of threads, which
> all got stuck in the allocator and there was nobody left to volunteer
> unreclaimable memory.  When this was being addressed, we knew that the
> same can theoretically happen on the system-level but weren't aware of
> any reports.  Well now, here we are.

I think the "few threads in a small cgroup" problem is a little
different, because in those cases the global system very often has
enough memory, and there is always the possibility that we might relax
the memory cgroup guarantees a little in order to allow forward
progress.

In fact, arguably this *is* the right thing to do, because we have
situations where (a) the VFS takes the directory mutex, (b) the
directory blocks have been pushed out of memory, and so (c) a system
call running in a container with a small amount of memory and/or a small
amount of disk bandwidth allowed via its prop I/O settings ends up
taking a very long time for the directory blocks to be read into
memory.  If a high priority process, like say a cluster management
daemon, also tries to read the same directory, it can end up
stalled for long enough for the software watchdog to take out the
entire machine from the cluster.

The hard problem here is that the lock is taken by the VFS, *before*
it calls into the file system specific layer, and so the VFS has no
idea (a) how much memory or disk bandwidth it needs, and (b) whether
it needs any memory or disk bandwidth in the first place in order to
service a directory lookup operation (most of the time, it doesn't).
So there may be situations in the restricted cgroup where it would be
useful for the file system to be able to say, "you know, we're holding
onto a lock and the fact that the disk controller is going to force
this low priority cgroup to wait over a minute for the I/O to even be
queued out to the disk, maybe we should make an exception and bust the
disk controller cgroup cap".

(There is a related problem where a cgroup with a low disk bandwidth
quota is slowing down writeback, and we are desperately short on
global memory, and where relaxing the disk bandwidth limit via some
kind of priority inheritance scheme would prevent "innocent"
high-priority cgroups from having some of their processes OOM-killed.
I suppose one could claim that the high priority cgroups tend to
belong to the sysadmin, who set the stupid disk bandwidth caps in the
first place, so there is a certain justice in having the high priority
processes getting OOM killed, but still, it would be nice if we could
do the right thing automatically.)


But in any case, some of these workarounds, where we relax a
particularly tightly constrained cgroup limit, are obviously not going
to help when the entire system is low on memory.

> It really depends on what the goal here is.  You don't have to be
> perfectly accurate, but if you can give us a worst-case estimate we
> can actually guarantee forward progress and eliminate these lockups
> entirely, like in the block layer.  Sure, there will be bugs and the
> estimates won't be right from the start, but we can converge towards
> the right answer.  If the allocations which are allowed to dip into
> the reserves - the current nofail sites? - can be annotated with a gfp
> flag, we can easily verify the estimates by serving those sites
> exclusively from the private reserve pool and emit warnings when that
> runs dry.  We wouldn't even have to stress the system for that.
> 
> But there are legitimate concerns that this might never work.  For
> example, the requirements could be so unpredictable, or assessing them
> with reasonable accuracy could be so expensive, that the margin of
> error would make the worst case estimate too big to be useful.  Big
> enough that the reserves would harm well-behaved systems.  And if
> useful worst-case estimates are unattainable, I don't think we need to
> bother with reserves.  We can just stick with looping and OOM killing,
> that works most of the time, too.

I'm not sure that you want to reserve for the worst-case.  What might
work is if subsystems (probably primarily file systems) give you
estimates for the usual case and the worst case, and you reserve for
something in between these two bounds.  In practice there will be
a huge number of file systems operations taking place in your typical
super-busy system, and if you reserve for the worst case, it probably
will be too much.  We need to make sure there is enough memory
available for some forward progress, and if we need to stall a few
operations with some sleeping loops, so be it.  So the "heads up"
amounts don't have to be strict reservations in the sense that the
memory will be available instantly without any sleeping or looping.

I would also suggest that "reservations" be tied to a task struct and
not to some magic __GFP_* flag, since it's not just allocations done
by the file system, but also by the block device drivers, and if
certain write operations fail, the results will be catastrophic -- and
the block device can't tell the difference between an I/O operation
that must succeed -- or else we declare the file system as needing
manual recovery and potentially reboot the entire system -- and an I/O
operation whose failure could be handled by reflecting ENOMEM back up
to userspace.  The difference is a property of the call stack, so the
simplest way of handling this is to store the reservation in the task
struct, and let
the reservation get automatically returned to the system when a
particular process makes a transition from kernel space to user space.
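
A rough userspace model of this task-based reservation idea; all names here
(toy_task, task_reserve, and so on) are hypothetical, invented purely to
illustrate the call-stack property described above:

```c
#include <assert.h>

/* Hypothetical sketch: the reservation lives in the task, not in a
 * gfp flag, so allocations made further down the same call stack
 * (e.g. by a block driver) benefit automatically. */
struct toy_task {
        long reserved_pages;     /* pages this task may still claim */
};

static long global_reserve = 128;

/* Made at the top of the call stack, e.g. at syscall entry. */
static int task_reserve(struct toy_task *t, long pages)
{
        if (global_reserve < pages)
                return -1;       /* caller waits until reserve refills */
        global_reserve -= pages;
        t->reserved_pages += pages;
        return 0;
}

/* Any allocation on this task's kernel stack may draw on the
 * reservation, whether from the filesystem or the driver below it. */
static int task_alloc_page(struct toy_task *t)
{
        if (t->reserved_pages == 0)
                return -1;       /* would fall back to a normal alloc */
        t->reserved_pages--;
        return 0;
}

/* Called on the kernel->user transition: return what's left. */
static void task_unreserve(struct toy_task *t)
{
        global_reserve += t->reserved_pages;
        t->reserved_pages = 0;
}
```

Returning the unused remainder on the kernel-to-user transition is what makes
the scheme automatic: no callsite has to remember to release its reservation.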

The bottom line is that I agree that looping and OOM-killing works
most of the time, and so I'm happy with something that makes life a
little bit better and a little bit more predictable for the VM, if
that makes the system behave a bit more smoothly under high memory
pressure.  But at the same time, we don't want to make things too
complicated; that may mean we don't try to achieve perfection, or that
we simply stop worrying about the global memory pressure situation and
instead think about other solutions for the "small number of threads in
a container" case: OOM kill a bit less frequently, force the container
to loop/sleep for a bit, and then allow a random foreground kernel
thread in the container to "borrow" a small amount of memory so it can
hopefully make forward progress, especially if it is holding locks or
is in the process of exiting.

						- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 13:43                             ` Theodore Ts'o
  2015-03-01 16:15                               ` Johannes Weiner
@ 2015-03-01 20:17                               ` Johannes Weiner
  1 sibling, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2015-03-01 20:17 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > Overestimating should be fine, the result would be a bit of false memory
> > pressure.  But underestimating and looping can't be an option or the
> > original lockups will still be there.  We need to guarantee forward
> > progress or the problem is somewhat mitigated at best - only now with
> > quite a bit more complexity in the allocator and the filesystems.
> 
> We've lived with looping as it is and in practice it's actually worked
> well.  I can only speak for ext4, but I do a lot of testing under very
> high memory pressure situations, and it is used in *production* under
> very high stress situations --- and the only time we've run into
> trouble is when the looping behaviour somehow got accidentally
> *removed*.

Memory is a finite resource and there are (unlimited) consumers that
do not allow their share to be reclaimed/recycled.  Mainly this is the
kernel itself, but it also includes anon memory once swap space runs
out, as well as mlocked and dirty memory.  It's not a question of
whether there exists a true point of OOM (where not enough memory is
recyclable to satisfy new allocations).  That point inevitably exists.
It's a policy question of how to inform userspace once it is reached.

We agree that we can't unconditionally fail allocations, because we
might be in the middle of a transaction, where an allocation failure
can potentially corrupt userdata.  However, endlessly looping for
progress that can not happen at this point has the exact same effect:
the transaction won't finish.  Only the machine locks up in addition.
It's great that your setups don't ever truly go out of memory, but
that doesn't mean it can't happen in practice.

One answer to users at this point could certainly be to stay away from
the true point of OOM, and if you don't then that's your problem.  But
the issue I take with this answer is that, for the sake of memory
utilization, users kind of do want to get fairly close to this point,
and at the same time it's hard to reliably predict the memory
consumption of a workload in advance.  It can depend on the timing
between threads, it can depend on user/network-supplied input, and it
can simply be a bug in the application.  And if that OOM situation is
accidentally entered, I'd prefer we had a better answer than locking
up the machine and blaming the user.

So one attempt to make progress in this situation is to kill userspace
applications that are pinning unreclaimable memory.  This is what we
are doing now, but there are several problems with it.  For one, we
are doing a terrible job and might still get stuck sometimes, which
deteriorates the situation back to failing the allocation and
corrupting the filesystem.  Secondly, killing tasks is disruptive, and
because it's driven by heuristics we're never going to kill the
"right" one in all situations.

Reserves would allow us to look ahead and avoid starting transactions
that can not be finished given the available resources.  So we are at
least avoiding filesystem corruption.  The tasks could probably be put
to sleep for some time in the hope that ongoing transactions complete
and release memory, but there might not be any, and eventually the OOM
situation has to be communicated to userspace.  Arguably, an -ENOMEM
from a syscall at this point might be easier to handle than a SIGKILL
from the OOM killer in an unrelated task.
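The look-ahead policy described above might be modeled like this (a hypothetical userspace sketch; `try_reserve()`, `wait_for_memory()`, and the retry bound are all invented, not an actual kernel interface):

```c
#include <assert.h>
#include <errno.h>

static long pool_pages = 8;     /* stand-in for reclaimable free memory */

/* Reserve the transaction's worst-case need up front, or fail. */
static int try_reserve(long pages)
{
	if (pages > pool_pages)
		return -ENOMEM;
	pool_pages -= pages;
	return 0;
}

/* Would sleep waiting for ongoing transactions to release memory;
 * a no-op in this model. */
static void wait_for_memory(void) { }

/* Don't start a transaction we can't finish: retry for a bounded time,
 * then surface -ENOMEM instead of looping forever. */
static int trans_start(long worst_case_pages, int max_retries)
{
	int tries = 0;

	while (try_reserve(worst_case_pages) == -ENOMEM) {
		if (++tries > max_retries)
			return -ENOMEM;  /* communicate OOM, don't lock up */
		wait_for_memory();
	}
	return 0;
}
```

The design choice being illustrated is only the ordering: the failure happens before any filesystem state is modified, which is what makes -ENOMEM safe to return.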

So if we could pull off reserves, they look like the most attractive
solution to me.  If not, the OOM killer needs to be fixed to always
make forward progress instead.  I proposed a patch for that already.
But infinite loops that force the user to reboot the machine at the
point of OOM seem like a terrible policy.


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 19:36                                 ` Theodore Ts'o
@ 2015-03-01 20:44                                   ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2015-03-01 20:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Sun, Mar 01, 2015 at 02:36:35PM -0500, Theodore Ts'o wrote:
> On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote:
> > 
> > We had these lockups in cgroups with just a handful of threads, which
> > all got stuck in the allocator and there was nobody left to volunteer
> > unreclaimable memory.  When this was being addressed, we knew that the
> > same can theoretically happen on the system-level but weren't aware of
> > any reports.  Well now, here we are.
> 
> I think the "few threads in a small cgroup" problem is a little
> different, because in those cases very often the global system has
> enough memory, and there is always the possibility that we might relax
> the memory cgroup guarantees a little in order to allow forward
> progress.

That's exactly how we fixed it.  __GFP_NOFAIL allocations are allowed to
bypass the cgroup memory limits when reclaim within the group fails to
make room for the allocation.  I'm just mentioning that because the
global case doesn't have the same out, but is susceptible to the same
deadlock situation when there are no other threads volunteering pages.

If your machines are loaded with hundreds or thousands of threads,
chances are that a thread stuck in the allocator will be bailed out by
the other threads in the system (or that you run into CPU limits
first), but if you have only a handful of memory-intensive tasks, this
might not be the case.  The cgroup problem was closer to that second
scenario, where a few threads split all available memory between them.
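The __GFP_NOFAIL bypass described above can be modeled in miniature (a userspace sketch; the struct fields and helpers are invented names, not the actual memcg code):

```c
#include <assert.h>
#include <stdbool.h>

struct memcg {
	long usage;
	long limit;
	long reclaimable;   /* pages reclaim could free inside the group */
};

/* Reclaim up to @pages inside the group; true if it freed all of them. */
static bool memcg_reclaim(struct memcg *cg, long pages)
{
	long freed = cg->reclaimable < pages ? cg->reclaimable : pages;

	cg->reclaimable -= freed;
	cg->usage -= freed;
	return freed == pages;
}

/* Returns 0 on success, -1 on failure (think -ENOMEM). */
static int memcg_charge(struct memcg *cg, long pages, bool nofail)
{
	if (cg->usage + pages <= cg->limit)
		goto done;
	if (memcg_reclaim(cg, pages) && cg->usage + pages <= cg->limit)
		goto done;
	if (!nofail)
		return -1;
	/* nofail: overrun the limit rather than deadlock the group. */
done:
	cg->usage += pages;
	return 0;
}
```

Usage can exceed `limit` after a nofail bypass; that overrun is the "out" the global case doesn't have, since there is no larger pool to borrow from.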


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-28 22:15                           ` Johannes Weiner
  2015-03-01 11:17                             ` Tetsuo Handa
  2015-03-01 13:43                             ` Theodore Ts'o
@ 2015-03-01 21:48                             ` Dave Chinner
  2015-03-02  0:17                               ` Dave Chinner
  2 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-03-01 21:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, akpm, torvalds

On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > 
> > > I'm trying to figure out if the current nofail allocators can get
> > > their memory needs figured out beforehand.  And reliably so - what
> > > good are estimates that are right 90% of the time, when failing the
> > > allocation means corrupting user data?  What is the contingency plan?
> > 
> > In the ideal world, we can figure out the exact memory needs
> > beforehand.  But we live in an imperfect world, and given that block
> > devices *also* need memory, the answer is "of course not".  We can't
> > be perfect.  But we can at least give some kind of hint, and we can offer
> > to wait before we get into a situation where we need to loop in
> > GFP_NOWAIT --- which is the contingency/fallback plan.
> 
> > Overestimating should be fine, the result would be a bit of false memory
> pressure.  But underestimating and looping can't be an option or the
> original lockups will still be there.  We need to guarantee forward
> progress or the problem is somewhat mitigated at best - only now with
> quite a bit more complexity in the allocator and the filesystems.

The additional complexity in XFS is actually quite minor, and
initial "rough worst case" memory usage estimates are not that hard
to measure....

> The block code would have to be looked at separately, but doesn't it
> already use mempools etc. to guarantee progress?

Yes, it does. I'm not concerned about the block layer.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 21:48                             ` Dave Chinner
@ 2015-03-02  0:17                               ` Dave Chinner
  2015-03-02 12:46                                 ` Brian Foster
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-03-02  0:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Theodore Ts'o, Tetsuo Handa, rientjes, oleg, xfs, mhocko,
	linux-mm, mgorman, dchinner, akpm, torvalds

On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote:
> On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > > 
> > > > I'm trying to figure out if the current nofail allocators can get
> > > > their memory needs figured out beforehand.  And reliably so - what
> > > > good are estimates that are right 90% of the time, when failing the
> > > > allocation means corrupting user data?  What is the contingency plan?
> > > 
> > > In the ideal world, we can figure out the exact memory needs
> > > beforehand.  But we live in an imperfect world, and given that block
> > > devices *also* need memory, the answer is "of course not".  We can't
> > > be perfect.  But we can at least give some kind of hint, and we can offer
> > > to wait before we get into a situation where we need to loop in
> > > GFP_NOWAIT --- which is the contingency/fallback plan.
> > 
> > Overestimating should be fine, the result would be a bit of false memory
> > pressure.  But underestimating and looping can't be an option or the
> > original lockups will still be there.  We need to guarantee forward
> > progress or the problem is somewhat mitigated at best - only now with
> > quite a bit more complexity in the allocator and the filesystems.
> 
> The additional complexity in XFS is actually quite minor, and
> initial "rough worst case" memory usage estimates are not that hard
> to measure....

And, just to point out that the OOM killer can be invoked without a
single transaction-based filesystem ENOMEM failure, here's what
xfs/084 does on 4.0-rc1:

[  148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[  148.822113] resvtest cpuset=/ mems_allowed=0
[  148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825
[  148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  148.826471]  0000000000000000 ffff88003ba2b988 ffffffff81dcb570 000000000000000c
[  148.828220]  ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 0000000000000000
[  148.829958]  0000000000000000 ffff88003ba2b9a8 0000000000000206 ffff88003ba2b9d8
[  148.831734] Call Trace:
[  148.832325]  [<ffffffff81dcb570>] dump_stack+0x4c/0x65
[  148.833493]  [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb
[  148.834855]  [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0
[  148.836195]  [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40
[  148.837633]  [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500
[  148.838925]  [<ffffffff8117e44b>] out_of_memory+0x5b/0x80
[  148.840162]  [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810
[  148.841592]  [<ffffffff811c0531>] alloc_pages_current+0x91/0x100
[  148.842950]  [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0
[  148.844286]  [<ffffffff8117c688>] filemap_fault+0x1b8/0x420
[  148.845545]  [<ffffffff811a05ed>] __do_fault+0x3d/0x70
[  148.846706]  [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230
[  148.848042]  [<ffffffff81090305>] __do_page_fault+0x1a5/0x460
[  148.849333]  [<ffffffff81090675>] trace_do_page_fault+0x45/0x130
[  148.850681]  [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0
[  148.852025]  [<ffffffff81dd1567>] ? schedule+0x37/0x90
[  148.853187]  [<ffffffff81dd8b88>] async_page_fault+0x28/0x30
[  148.854456] Mem-Info:
[  148.854986] Node 0 DMA per-cpu:
[  148.855727] CPU    0: hi:    0, btch:   1 usd:   0
[  148.856820] Node 0 DMA32 per-cpu:
[  148.857600] CPU    0: hi:  186, btch:  31 usd:   0
[  148.858688] active_anon:119251 inactive_anon:119329 isolated_anon:0
[  148.858688]  active_file:19 inactive_file:2 isolated_file:0
[  148.858688]  unevictable:0 dirty:0 writeback:0 unstable:0
[  148.858688]  free:1965 slab_reclaimable:2816 slab_unreclaimable:2184
[  148.858688]  mapped:3 shmem:2 pagetables:1259 bounce:0
[  148.858688]  free_cma:0
[  148.865606] Node 0 DMA free:3916kB min:60kB low:72kB high:88kB active_anon:5100kB inactive_anon:5324kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(as
[  148.874431] lowmem_reserve[]: 0 966 966 966
[  148.875504] Node 0 DMA32 free:3944kB min:3944kB low:4928kB high:5916kB active_anon:471904kB inactive_anon:471992kB active_file:76kB inactive_file:0kB unevictable:0s
[  148.884817] lowmem_reserve[]: 0 0 0 0
[  148.885770] Node 0 DMA: 1*4kB (M) 1*8kB (U) 2*16kB (UM) 3*32kB (UM) 1*64kB (M) 1*128kB (M) 0*256kB 1*512kB (M) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 3916kB
[  148.889385] Node 0 DMA32: 8*4kB (UEM) 2*8kB (UR) 3*16kB (M) 1*32kB (M) 2*64kB (MR) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3968kB
[  148.893068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  148.894949] 47361 total pagecache pages
[  148.895816] 47334 pages in swap cache
[  148.896657] Swap cache stats: add 124669, delete 77335, find 83/169
[  148.898057] Free swap  = 0kB
[  148.898714] Total swap = 497976kB
[  148.899470] 262044 pages RAM
[  148.900145] 0 pages HighMem/MovableOnly
[  148.901006] 10253 pages reserved
[  148.901735] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  148.903637] [ 1204]     0  1204     6039        1      15       3      163         -1000 udevd
[  148.905571] [ 1323]     0  1323     6038        1      14       3      165         -1000 udevd
[  148.907499] [ 1324]     0  1324     6038        1      14       3      164         -1000 udevd
[  148.909439] [ 2176]     0  2176     2524        0       6       2      571             0 dhclient
[  148.911427] [ 2227]     0  2227     9267        0      22       3       95             0 rpcbind
[  148.913392] [ 2632]     0  2632    64981       30      29       3      136             0 rsyslogd
[  148.915391] [ 2686]     0  2686     1062        1       6       3       36             0 acpid
[  148.917325] [ 2826]     0  2826     4753        0      12       2       44             0 atd
[  148.919209] [ 2877]     0  2877     6473        0      17       3       66             0 cron
[  148.921120] [ 2911]   104  2911     7078        1      17       3       81             0 dbus-daemon
[  148.923150] [ 3591]     0  3591    13731        0      28       2      165         -1000 sshd
[  148.925073] [ 3603]     0  3603    22024        0      43       2      215             0 winbindd
[  148.927066] [ 3612]     0  3612    22024        0      42       2      216             0 winbindd
[  148.929062] [ 3636]     0  3636     3722        1      11       3       41             0 getty
[  148.930981] [ 3637]     0  3637     3722        1      11       3       40             0 getty
[  148.932915] [ 3638]     0  3638     3722        1      11       3       39             0 getty
[  148.934835] [ 3639]     0  3639     3722        1      11       3       40             0 getty
[  148.936789] [ 3640]     0  3640     3722        1      11       3       40             0 getty
[  148.938704] [ 3641]     0  3641     3722        1      10       3       38             0 getty
[  148.940635] [ 3642]     0  3642     3677        1      11       3       40             0 getty
[  148.942550] [ 3643]     0  3643    25894        2      52       2      248             0 sshd
[  148.944469] [ 3649]     0  3649   146652        1      35       4      320             0 console-kit-dae
[  148.946578] [ 3716]     0  3716    48287        1      31       4      171             0 polkitd
[  148.948552] [ 3722]  1000  3722    25894        0      51       2      250             0 sshd
[  148.950457] [ 3723]  1000  3723     5435        3      15       3      495             0 bash
[  148.952375] [ 3742]     0  3742    17157        1      37       2      160             0 sudo
[  148.954275] [ 3743]     0  3743     3365        1      11       3      516             0 check
[  148.956229] [ 4130]     0  4130     3334        1      11       3      484             0 084
[  148.958108] [ 4342]     0  4342   314556   191159     619       4   119808             0 resvtest
[  148.960104] [ 4343]     0  4343     3334        0      11       3      485             0 084
[  148.961990] [ 4344]     0  4344     3334        0      11       3      485             0 084
[  148.963876] [ 4345]     0  4345     3305        0      11       3       36             0 sed
[  148.965766] [ 4346]     0  4346     3305        0      11       3       37             0 sed
[  148.967652] Out of memory: Kill process 4342 (resvtest) score 803 or sacrifice child
[  148.969390] Killed process 4342 (resvtest) total-vm:1258224kB, anon-rss:764636kB, file-rss:0kB
[  149.415288] XFS (vda): Unmounting Filesystem
[  150.211229] XFS (vda): Mounting V5 Filesystem
[  150.292092] XFS (vda): Ending clean mount
[  150.342307] XFS (vda): Unmounting Filesystem
[  150.346522] XFS (vdb): Unmounting Filesystem
[  151.264135] XFS: kmalloc allocations by trans type
[  151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024
[  151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144
[  151.267735] XFS: 7: count 9, bytes 2784, fails 0, max_size 536
[  151.269022] XFS: 16: count 1, bytes 696, fails 0, max_size 696
[  151.270286] XFS: 26: count 1, bytes 384, fails 0, max_size 384
[  151.271550] XFS: 35: count 1, bytes 696, fails 0, max_size 696
[  151.272833] XFS: slab allocations by trans type
[  151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0
[  151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0
[  151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0
[  151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0
[  151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0
[  151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0
[  151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0
[  151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0
[  151.283476] XFS: vmalloc allocations by trans type
[  151.284535] XFS: page allocations by trans type

Those XFS allocation stats are the largest measured allocations done
under transaction context, broken down by allocation and transaction
type.  No failures that would result in looping, even though the
system invoked the OOM killer on a filesystem workload....

I need to break the slab allocations down further by cache (other
workloads are generating over 50 slab allocations per transaction),
but another hour's work and a few days of observation of the stats
in my normal day-to-day work will get me all the information I need
to do a decent first pass at memory reservation requirements for
XFS.
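The bookkeeping behind stats like these is conceptually simple; a minimal userspace sketch (the names are hypothetical, not the instrumentation patch Dave is running) might be:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRANS_TYPE_MAX 48   /* illustrative bound on transaction types */

struct trans_alloc_stats {
	unsigned long count;     /* allocations seen in this trans type */
	unsigned long bytes;     /* total bytes requested */
	unsigned long fails;     /* allocation failures */
	size_t max_size;         /* largest single request */
};

static struct trans_alloc_stats kmalloc_stats[TRANS_TYPE_MAX];

/* Hook called for every allocation made in transaction context. */
static void trans_account_alloc(int trans_type, size_t size, bool failed)
{
	struct trans_alloc_stats *s = &kmalloc_stats[trans_type];

	s->count++;
	s->bytes += size;
	if (failed)
		s->fails++;
	if (size > s->max_size)
		s->max_size = size;
}
```

Tracking `max_size` per transaction type is what turns raw counters into a worst-case estimate: the reservation for a transaction type needs to cover at least its observed maximum, not its average.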

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  7:32                         ` Dave Chinner
  2015-02-27 18:24                           ` Vlastimil Babka
@ 2015-03-02  9:39                           ` Vlastimil Babka
  2015-03-02 22:31                             ` Dave Chinner
  2015-03-02 20:22                           ` Johannes Weiner
  2 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2015-03-02  9:39 UTC (permalink / raw)
  To: Dave Chinner, Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko,
	linux-mm, mgorman, rientjes, torvalds

On 02/23/2015 08:32 AM, Dave Chinner wrote:
> On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
>> On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:
>>
>> Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc.  Add a dynamic
>> reserve.  So to reserve N pages we increase the page allocator dynamic
>> reserve by N, do some reclaim if necessary then deposit N tokens into
>> the caller's task_struct (it'll be a set of zone/nr-pages tuples I
>> suppose).
>>
>> When allocating pages the caller should drain its reserves in
>> preference to dipping into the regular freelist.  This guy has already
>> done his reclaim and shouldn't be penalised a second time.  I guess
>> Johannes's preallocation code should switch to doing this for the same
>> reason, plus the fact that snipping a page off
>> task_struct.prealloc_pages is super-fast and needs to be done sometime
>> anyway so why not do it by default.
>
> That is at odds with the requirements of demand paging, which
> allocate for objects that are reclaimable within the course of the
> transaction. The reserve is there to ensure forward progress for
> allocations for objects that aren't freed until after the
> transaction completes, but if we drain it for reclaimable objects we
> then have nothing left in the reserve pool when we actually need it.
>
> We do not know ahead of time if the object we are allocating is
> going to be modified and hence locked into the transaction. Hence we
> can't say "use the reserve for this *specific* allocation", and so
> the only guidance we can really give is "we will allocate and
> *permanently consume* this much memory", and the reserve pool needs
> to cover that consumption to guarantee forwards progress.

I'm not sure I understand properly. You don't know if a specific 
allocation is permanent or reclaimable, but you can tell in advance how 
much in total will be permanent? Is it because you are conservative and 
assume everything will be permanent, or how?

Can you at least, at some later point in the transaction, recognize
that "OK, this object was not permanent after all" and tell the mm
that it can lower your reserve?

> Forwards progress for all other allocations is guaranteed because
> they are reclaimable objects - they either freed directly back to
> their source (slab, heap, page lists) or they are freed by shrinkers
> once they have been released from the transaction.

Which are the "all other allocations"?  Above you wrote that all
allocations are treated as potentially permanent.  Also, how does the
fact that an object is later reclaimable affect forward progress
during its allocation?  Or are you talking about allocations from
contexts that don't use reserves?

> Hence we need allocations to come from the free list and trigger
> reclaim, regardless of the fact there is a reserve pool there. The
> reserve pool needs to be a last resort once there are no other
> avenues to allocate memory. i.e. it would be used to replace the OOM
> killer for GFP_NOFAIL allocations.

That's probably going to result in a lot of wasted memory, and I still
don't understand why it's needed if your reserve estimate is
guaranteed to cover the worst case.
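For concreteness, the "last resort" ordering Dave argues for, as opposed to draining the reserve first, might be modeled like this (all names and numbers are illustrative):

```c
#include <assert.h>

static long freelist = 2;      /* free pages */
static long reclaimable = 1;   /* pages reclaim can recover */
static long reserve = 4;       /* the transaction's reserve pool */

/* Returns which source satisfied the allocation:
 * 0 = freelist, 1 = reclaim, 2 = reserve, -1 = failure. */
static int alloc_page_last_resort(int nofail)
{
	if (freelist > 0) {
		freelist--;
		return 0;
	}
	if (reclaimable > 0) {          /* reclaim refills the freelist */
		reclaimable--;
		return 1;
	}
	if (nofail && reserve > 0) {    /* last resort, instead of OOM kill */
		reserve--;
		return 2;
	}
	return -1;
}
```

In this ordering the reserve is untouched as long as normal allocation and reclaim make progress, so it is still intact when the permanently-consumed allocations at the end of the transaction need it.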

>> Both reservation and preallocation are vulnerable to deadlocks - 10,000
>> tasks all trying to reserve/prealloc 100 pages, they all have 50 pages
>> and we ran out of memory.  Whoops.
>
> Yes, that's the big problem with preallocation, as well as your
> proposed "deplete the reserved memory first" approach. They
> *require* up front "preallocation" of free memory, either directly
> by the application, or internally by the mm subsystem.

I don't see why it would deadlock if, at reserve time, the mm can
return ENOMEM, as the reserver should be able to back out at that
point.



* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02  0:17                               ` Dave Chinner
@ 2015-03-02 12:46                                 ` Brian Foster
  0 siblings, 0 replies; 83+ messages in thread
From: Brian Foster @ 2015-03-02 12:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Tetsuo Handa, Johannes Weiner, oleg, xfs,
	mhocko, linux-mm, mgorman, dchinner, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 11:17:23AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote:
> > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > > > 
> > > > > I'm trying to figure out if the current nofail allocators can get
> > > > > their memory needs figured out beforehand.  And reliably so - what
> > > > > good are estimates that are right 90% of the time, when failing the
> > > > > allocation means corrupting user data?  What is the contingency plan?
> > > > 
> > > > In the ideal world, we can figure out the exact memory needs
> > > > beforehand.  But we live in an imperfect world, and given that block
> > > > devices *also* need memory, the answer is "of course not".  We can't
> > > > be perfect.  But we can at least give some kind of hint, and we can offer
> > > > to wait before we get into a situation where we need to loop in
> > > > GFP_NOWAIT --- which is the contingency/fallback plan.
> > > 
> > > Overestimating should be fine, the result would be a bit of false memory
> > > pressure.  But underestimating and looping can't be an option or the
> > > original lockups will still be there.  We need to guarantee forward
> > > progress or the problem is somewhat mitigated at best - only now with
> > > quite a bit more complexity in the allocator and the filesystems.
> > 
> > The additional complexity in XFS is actually quite minor, and
> > initial "rough worst case" memory usage estimates are not that hard
> > to measure....
> 
> And, just to point out that the OOM killer can be invoked without a
> single transaction-based filesystem ENOMEM failure, here's what
> xfs/084 does on 4.0-rc1:
> 
> [  148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
> [  148.822113] resvtest cpuset=/ mems_allowed=0
> [  148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825
> [  148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [  148.826471]  0000000000000000 ffff88003ba2b988 ffffffff81dcb570 000000000000000c
> [  148.828220]  ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 0000000000000000
> [  148.829958]  0000000000000000 ffff88003ba2b9a8 0000000000000206 ffff88003ba2b9d8
> [  148.831734] Call Trace:
> [  148.832325]  [<ffffffff81dcb570>] dump_stack+0x4c/0x65
> [  148.833493]  [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb
> [  148.834855]  [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0
> [  148.836195]  [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40
> [  148.837633]  [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500
> [  148.838925]  [<ffffffff8117e44b>] out_of_memory+0x5b/0x80
> [  148.840162]  [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810
> [  148.841592]  [<ffffffff811c0531>] alloc_pages_current+0x91/0x100
> [  148.842950]  [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0
> [  148.844286]  [<ffffffff8117c688>] filemap_fault+0x1b8/0x420
> [  148.845545]  [<ffffffff811a05ed>] __do_fault+0x3d/0x70
> [  148.846706]  [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230
> [  148.848042]  [<ffffffff81090305>] __do_page_fault+0x1a5/0x460
> [  148.849333]  [<ffffffff81090675>] trace_do_page_fault+0x45/0x130
> [  148.850681]  [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0
> [  148.852025]  [<ffffffff81dd1567>] ? schedule+0x37/0x90
> [  148.853187]  [<ffffffff81dd8b88>] async_page_fault+0x28/0x30
> [  148.854456] Mem-Info:
> [  148.854986] Node 0 DMA per-cpu:
> [  148.855727] CPU    0: hi:    0, btch:   1 usd:   0
> [  148.856820] Node 0 DMA32 per-cpu:
> [  148.857600] CPU    0: hi:  186, btch:  31 usd:   0
> [  148.858688] active_anon:119251 inactive_anon:119329 isolated_anon:0
> [  148.858688]  active_file:19 inactive_file:2 isolated_file:0
> [  148.858688]  unevictable:0 dirty:0 writeback:0 unstable:0
> [  148.858688]  free:1965 slab_reclaimable:2816 slab_unreclaimable:2184
> [  148.858688]  mapped:3 shmem:2 pagetables:1259 bounce:0
> [  148.858688]  free_cma:0
> [  148.865606] Node 0 DMA free:3916kB min:60kB low:72kB high:88kB active_anon:5100kB inactive_anon:5324kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(as
> [  148.874431] lowmem_reserve[]: 0 966 966 966
> [  148.875504] Node 0 DMA32 free:3944kB min:3944kB low:4928kB high:5916kB active_anon:471904kB inactive_anon:471992kB active_file:76kB inactive_file:0kB unevictable:0s
> [  148.884817] lowmem_reserve[]: 0 0 0 0
> [  148.885770] Node 0 DMA: 1*4kB (M) 1*8kB (U) 2*16kB (UM) 3*32kB (UM) 1*64kB (M) 1*128kB (M) 0*256kB 1*512kB (M) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 3916kB
> [  148.889385] Node 0 DMA32: 8*4kB (UEM) 2*8kB (UR) 3*16kB (M) 1*32kB (M) 2*64kB (MR) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3968kB
> [  148.893068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [  148.894949] 47361 total pagecache pages
> [  148.895816] 47334 pages in swap cache
> [  148.896657] Swap cache stats: add 124669, delete 77335, find 83/169
> [  148.898057] Free swap  = 0kB
> [  148.898714] Total swap = 497976kB
> [  148.899470] 262044 pages RAM
> [  148.900145] 0 pages HighMem/MovableOnly
> [  148.901006] 10253 pages reserved
> [  148.901735] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [  148.903637] [ 1204]     0  1204     6039        1      15       3      163         -1000 udevd
> [  148.905571] [ 1323]     0  1323     6038        1      14       3      165         -1000 udevd
> [  148.907499] [ 1324]     0  1324     6038        1      14       3      164         -1000 udevd
> [  148.909439] [ 2176]     0  2176     2524        0       6       2      571             0 dhclient
> [  148.911427] [ 2227]     0  2227     9267        0      22       3       95             0 rpcbind
> [  148.913392] [ 2632]     0  2632    64981       30      29       3      136             0 rsyslogd
> [  148.915391] [ 2686]     0  2686     1062        1       6       3       36             0 acpid
> [  148.917325] [ 2826]     0  2826     4753        0      12       2       44             0 atd
> [  148.919209] [ 2877]     0  2877     6473        0      17       3       66             0 cron
> [  148.921120] [ 2911]   104  2911     7078        1      17       3       81             0 dbus-daemon
> [  148.923150] [ 3591]     0  3591    13731        0      28       2      165         -1000 sshd
> [  148.925073] [ 3603]     0  3603    22024        0      43       2      215             0 winbindd
> [  148.927066] [ 3612]     0  3612    22024        0      42       2      216             0 winbindd
> [  148.929062] [ 3636]     0  3636     3722        1      11       3       41             0 getty
> [  148.930981] [ 3637]     0  3637     3722        1      11       3       40             0 getty
> [  148.932915] [ 3638]     0  3638     3722        1      11       3       39             0 getty
> [  148.934835] [ 3639]     0  3639     3722        1      11       3       40             0 getty
> [  148.936789] [ 3640]     0  3640     3722        1      11       3       40             0 getty
> [  148.938704] [ 3641]     0  3641     3722        1      10       3       38             0 getty
> [  148.940635] [ 3642]     0  3642     3677        1      11       3       40             0 getty
> [  148.942550] [ 3643]     0  3643    25894        2      52       2      248             0 sshd
> [  148.944469] [ 3649]     0  3649   146652        1      35       4      320             0 console-kit-dae
> [  148.946578] [ 3716]     0  3716    48287        1      31       4      171             0 polkitd
> [  148.948552] [ 3722]  1000  3722    25894        0      51       2      250             0 sshd
> [  148.950457] [ 3723]  1000  3723     5435        3      15       3      495             0 bash
> [  148.952375] [ 3742]     0  3742    17157        1      37       2      160             0 sudo
> [  148.954275] [ 3743]     0  3743     3365        1      11       3      516             0 check
> [  148.956229] [ 4130]     0  4130     3334        1      11       3      484             0 084
> [  148.958108] [ 4342]     0  4342   314556   191159     619       4   119808             0 resvtest
> [  148.960104] [ 4343]     0  4343     3334        0      11       3      485             0 084
> [  148.961990] [ 4344]     0  4344     3334        0      11       3      485             0 084
> [  148.963876] [ 4345]     0  4345     3305        0      11       3       36             0 sed
> [  148.965766] [ 4346]     0  4346     3305        0      11       3       37             0 sed
> [  148.967652] Out of memory: Kill process 4342 (resvtest) score 803 or sacrifice child
> [  148.969390] Killed process 4342 (resvtest) total-vm:1258224kB, anon-rss:764636kB, file-rss:0kB
> [  149.415288] XFS (vda): Unmounting Filesystem
> [  150.211229] XFS (vda): Mounting V5 Filesystem
> [  150.292092] XFS (vda): Ending clean mount
> [  150.342307] XFS (vda): Unmounting Filesystem
> [  150.346522] XFS (vdb): Unmounting Filesystem
> [  151.264135] XFS: kmalloc allocations by trans type
> [  151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024
> [  151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144
> [  151.267735] XFS: 7: count 9, bytes 2784, fails 0, max_size 536
> [  151.269022] XFS: 16: count 1, bytes 696, fails 0, max_size 696
> [  151.270286] XFS: 26: count 1, bytes 384, fails 0, max_size 384
> [  151.271550] XFS: 35: count 1, bytes 696, fails 0, max_size 696
> [  151.272833] XFS: slab allocations by trans type
> [  151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0
> [  151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0
> [  151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0
> [  151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0
> [  151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0
> [  151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0
> [  151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0
> [  151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0
> [  151.283476] XFS: vmalloc allocations by trans type
> [  151.284535] XFS: page allocations by trans type
> 
> Those XFS allocation stats are the largest measured allocations done
> under transaction context, broken down by allocation and transaction
> type.  No failures that would result in looping, even though the
> system invoked the OOM killer on a filesystem workload....
> 
> I need to break the slab allocations down further by cache (other
> workloads are generating over 50 slab allocations per transaction),
> but another hour's work and a few days of observation of the stats
> in my normal day-to-day work will get me all the information I need
> to do a decent first pass at memory reservation requirements for
> XFS.
> 
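
A sketch of how such per-transaction-type stats could be collected
(illustrative names and layout only, not the actual debug patch Dave
is running):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on XFS transaction type numbers. */
#define XFS_TRANS_TYPE_MAX 48

struct xfs_alloc_stat {
	unsigned long	count;		/* allocations seen */
	unsigned long	bytes;		/* total bytes requested */
	unsigned long	fails;		/* allocations that failed and looped */
	size_t		max_size;	/* largest single request */
};

static struct xfs_alloc_stat kmalloc_stats[XFS_TRANS_TYPE_MAX];

/* Called from an allocation wrapper while a transaction is running. */
static void xfs_stat_alloc(int trans_type, size_t size, int failed)
{
	struct xfs_alloc_stat *s = &kmalloc_stats[trans_type];

	s->count++;
	s->bytes += size;
	if (failed)
		s->fails++;
	if (size > s->max_size)
		s->max_size = size;
}
```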

This sounds like something that would serve us well under sysfs,
particularly if we do adopt the kind of reservation model being
discussed in this thread. I wouldn't expect these values to change
drastically or that often, but they could certainly adjust over time to
the point of being out of line with a reservation. A tool like this,
combined with Johannes' idea of a warning or something along those lines
for a reservation overrun, should give us all we need to identify when
something is wrong and the ability to fix it.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  0:45                     ` Dave Chinner
                                         ` (2 preceding siblings ...)
  2015-02-28 18:36                       ` Vlastimil Babka
@ 2015-03-02 15:18                       ` Michal Hocko
  2015-03-02 16:05                         ` Johannes Weiner
  2015-03-02 16:39                         ` Theodore Ts'o
  3 siblings, 2 replies; 83+ messages in thread
From: Michal Hocko @ 2015-03-02 15:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Mon 23-02-15 11:45:21, Dave Chinner wrote:
[...]
> A reserve memory pool is no different - every time a memory reserve
> occurs, a watermark is lifted to accommodate it, and the transaction
> is not allowed to proceed until the amount of free memory exceeds
> that watermark. The memory allocation subsystem then only allows
> *allocations* marked correctly to allocate pages from the
> reserve that the watermark protects. e.g. only allocations using
> __GFP_RESERVE are allowed to dip into the reserve pool.
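
The quoted scheme might look very roughly like this (a minimal
userspace sketch of the accounting only; mem_reserve(), can_alloc()
and the single global "zone" are all made up for illustration, and the
real thing would need per-zone state, reclaim and locking):

```c
#include <assert.h>

static long min_wmark = 1024;	/* pages */
static long reserved;		/* sum of active reservations */
static long free_pages = 8192;

/* Lift the watermark by the reservation; the transaction may not
 * proceed until free memory exceeds the lifted watermark. */
static int mem_reserve(long pages)
{
	reserved += pages;
	if (free_pages < min_wmark + reserved) {
		/* real code would reclaim up to the lifted watermark;
		 * the sketch just fails the reservation */
		reserved -= pages;
		return -1;	/* -ENOMEM */
	}
	return 0;
}

static void mem_reserve_release(long pages)
{
	reserved -= pages;
}

/* Only __GFP_RESERVE-style callers may dip below the lifted
 * watermark; everyone else sees min_wmark + reserved as the floor. */
static int can_alloc(int gfp_reserve)
{
	long floor = gfp_reserve ? min_wmark : min_wmark + reserved;

	return free_pages > floor;
}
```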

The idea is sound. But I am pretty sure we will find many corner
cases. E.g. what if the mere reservation attempt causes the system
to go OOM and trigger the OOM killer? Sure that wouldn't be too much
different from the OOM triggered during the allocation but there is one
major difference. Reservations need to be estimated and I expect the
estimation would be on the more conservative side and so the OOM might
not happen without them.

> By using watermarks, freeing of memory will automatically top
> up the reserve pool which means that we guarantee that reclaimable
> memory allocated for demand paging during transactions doesn't
> deplete the reserve pool permanently.  As a result, when there is
> plenty of free and/or reclaimable memory, the reserve pool
> watermarks will have almost zero impact on performance and
> behaviour.

A typical busy system won't be very far away from the high watermark,
so there would be reclaim performed during increased watermarks
(aka reservations), and that might lead to visible performance
degradation. This might be acceptable, but it also adds a certain level
of unpredictability where performance characteristics might change
suddenly.

> Further, because it's just accounting and behavioural thresholds,
> this allows the mm subsystem to control how the reserve pool is
> accounted internally. e.g. clean, reclaimable pages in the page
> cache could serve as reserve pool pages as they can be immediately
> reclaimed for allocation.

But they can also turn hard or impossible to reclaim. Clean
pages might get dirty, and e.g. swap-backed pages can run out of their
backing storage. So I guess we cannot count on those pages without
reclaiming them first and hiding them in the reserve. That is probably
what you suggest below, but I wasn't really sure...

> This could be achieved by setting reclaim targets first to the reserve
> pool watermark, then the second target is enough pages to satisfy the
> current allocation.
> 
> And, FWIW, there's nothing stopping this mechanism from having
> order-based reserve thresholds. e.g. IB could really do with a 64k reserve
> pool threshold and hence help solve the long standing problems they
> have with filling the receive ring in GFP_ATOMIC context...
> 
> Sure, that's looking further down the track, but my point still
> remains: we need a viable long term solution to this problem. Maybe
> reservations are not the solution, but I don't see anyone else who
> is thinking of how to address this architectural problem at a system
> level right now.

I think the idea is good! It will just be quite tricky to get there
without causing more problems than those being solved. The biggest
question mark so far seems to be the reservation size estimation. If
it is hard for any caller to know the size beforehand (which would
be really close to the actually used size) then the whole complexity
in the code sounds like overkill, and asking the administrator to tune
min_free_kbytes seems a better fit (we would still have to teach the
allocator to access reserves when really necessary) because the system
would behave more predictably (although some memory would be wasted).

> We need to design and document the model first, then review it, then
> we can start working at the code level to implement the solution we've
> designed.

I have already asked James to add this to the LSF agenda, but nothing has
materialized on the schedule yet. I will poke him again.

-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 15:18                       ` Michal Hocko
@ 2015-03-02 16:05                         ` Johannes Weiner
  2015-03-02 17:10                           ` Michal Hocko
  2015-03-02 16:39                         ` Theodore Ts'o
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-03-02 16:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> On Mon 23-02-15 11:45:21, Dave Chinner wrote:
> [...]
> > A reserve memory pool is no different - every time a memory reserve
> > occurs, a watermark is lifted to accommodate it, and the transaction
> > is not allowed to proceed until the amount of free memory exceeds
> > that watermark. The memory allocation subsystem then only allows
> > *allocations* marked correctly to allocate pages from the
> > reserve that the watermark protects. e.g. only allocations using
> > __GFP_RESERVE are allowed to dip into the reserve pool.
> 
> The idea is sound. But I am pretty sure we will find many corner
> cases. E.g. what if the mere reservation attempt causes the system
> to go OOM and trigger the OOM killer? Sure that wouldn't be too much
> different from the OOM triggered during the allocation but there is one
> major difference. Reservations need to be estimated and I expect the
> estimation would be on the more conservative side and so the OOM might
> not happen without them.

The whole idea is that filesystems request the reserves while they can
still sleep for progress or fail the macro-operation with -ENOMEM.

And the estimate wouldn't just be on the conservative side, it would
have to be the worst-case scenario.  If we run out of reserves in an
allocation that cannot fail, that would be a bug that can lock up the
machine.  We can then fall back to the OOM killer in a last-ditch
effort to make forward progress, but as the victim tasks can get stuck
behind state/locks held by the allocation side, the machine might lock
up after all.

> > By using watermarks, freeing of memory will automatically top
> > up the reserve pool which means that we guarantee that reclaimable
> > memory allocated for demand paging during transactions doesn't
> > deplete the reserve pool permanently.  As a result, when there is
> > plenty of free and/or reclaimable memory, the reserve pool
> > watermarks will have almost zero impact on performance and
> > behaviour.
> 
> Typical busy system won't be very far away from the high watermark
> so there would be a reclaim performed during increased watermarks
> (aka reservation) and that might lead to visible performance
> degradation. This might be acceptable but it also adds a certain level
> of unpredictability when performance characteristics might change
> suddenly.

There is usually a good deal of clean cache.  As Dave pointed out
before, clean cache can be considered re-allocatable from NOFS
contexts, and so we'd only have to maintain this invariant:

	min_wmark + private_reserves < free_pages + clean_cache
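
A toy check of that invariant (hypothetical helper in plain C, all
quantities in pages, nothing here is existing kernel code):

```c
#include <assert.h>

/* Both the page allocator and reclaim would have to preserve this:
 * the watermark plus all private reserves must stay covered by the
 * sum of free pages and immediately-reclaimable clean cache. */
static int reserves_covered(long min_wmark, long private_reserves,
			    long free_pages, long clean_cache)
{
	return min_wmark + private_reserves < free_pages + clean_cache;
}
```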

> > Further, because it's just accounting and behavioural thresholds,
> > this allows the mm subsystem to control how the reserve pool is
> > accounted internally. e.g. clean, reclaimable pages in the page
> > cache could serve as reserve pool pages as they can be immediately
> > reclaimed for allocation.
> 
> But they also can turn into hard/impossible to reclaim as well. Clean
> pages might get dirty and e.g. swap backed pages run out of their
> backing storage. So I guess we cannot count on those pages without
> reclaiming them first and hiding them into the reserve. Which is what
> you suggest below probably but I wasn't really sure...

Pages reserved for use by the page cleaning path can't be considered
dirtyable.  They have to be included in the dirty_balance_reserve.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 15:18                       ` Michal Hocko
  2015-03-02 16:05                         ` Johannes Weiner
@ 2015-03-02 16:39                         ` Theodore Ts'o
  2015-03-02 16:58                           ` Michal Hocko
  1 sibling, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2015-03-02 16:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> The idea is sound. But I am pretty sure we will find many corner
> cases. E.g. what if the mere reservation attempt causes the system
> to go OOM and trigger the OOM killer?

Doctor, doctor, it hurts when I do that....

So don't trigger the OOM killer.  We can let the caller decide
whether the reservation request should block or return ENOMEM, but the
whole point of the reservation request idea is that this happens
*before* we've taken any mutexes, so blocking won't prevent forward
progress.

The file system could send down a different flag if we are doing
writebacks for page cleaning purposes, in which case the reservation
request would be a "just a heads up, we *will* be needing this much
memory, but this is not something where we can block or return ENOMEM,
so please give us the highest priority for using the free reserves".
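
A rough sketch of such a reservation call (all names here, including
RESV_WAIT, RESV_CLEANING and mem_reserve(), are hypothetical, not an
existing kernel API):

```c
#include <assert.h>
#include <errno.h>

#define RESV_WAIT	0x1	/* block until the reservation is met */
#define RESV_CLEANING	0x2	/* writeback for page cleaning: advisory
				 * heads-up, cannot block or fail, gets
				 * highest priority for the reserves   */

static long reserve_avail = 4096;	/* pages */

static int mem_reserve(long pages, unsigned flags)
{
	if (pages <= reserve_avail) {
		reserve_avail -= pages;
		return 0;
	}
	if (flags & RESV_CLEANING)
		return 0;	/* advisory only: never blocks or fails */
	if (!(flags & RESV_WAIT))
		return -ENOMEM;
	/* would sleep for reclaim here; the sketch just fails */
	return -ENOMEM;
}
```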

> I think the idea is good! It will just be quite tricky to get there
> without causing more problems than those being solved. The biggest
> question mark so far seems to be the reservation size estimation. If
> it is hard for any caller to know the size beforehand (which would
> be really close to the actually used size) then the whole complexity
> in the code sounds like overkill and asking the administrator to tune
> min_free_kbytes seems a better fit (we would still have to teach the
> allocator to access reserves when really necessary) because the system
> would behave more predictably (although some memory would be wasted).

If we do need to teach the allocator to access reserves when really
necessary, don't we have that already via GFP_NOIO/GFP_NOFS and
GFP_NOFAIL?  If the goal is to do something more fine-grained,
unfortunately at least for the short-term we'll need to preserve the
existing behaviour and issue warnings until the file system starts
adding GFP_NOFAIL to those memory allocations where previously,
GFP_NOFS was effectively guaranteeing that failures would almost
never happen.

I know of at least one place, discovered with the recent change (and revert),
where I'll be fixing ext4, but I suspect it won't be the only one,
especially in the block device drivers.

						- Ted

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 16:39                         ` Theodore Ts'o
@ 2015-03-02 16:58                           ` Michal Hocko
  2015-03-04 12:52                             ` Dave Chinner
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-03-02 16:58 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm,
	mgorman, rientjes, akpm, torvalds

On Mon 02-03-15 11:39:13, Theodore Ts'o wrote:
> On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> > The idea is sound. But I am pretty sure we will find many corner
> > cases. E.g. what if the mere reservation attempt causes the system
> > to go OOM and trigger the OOM killer?
> 
> Doctor, doctor, it hurts when I do that....
> 
> So don't trigger the OOM killer.  We can let the caller decide whether
> the reservation request should block or return ENOMEM, but the whole
> point of the reservation request idea is that this happens *before*
> we've taken any mutexes, so blocking won't prevent forward progress.

Maybe I wasn't clear. I wasn't concerned about the context which
is doing the reservation. I was more concerned about all the other
allocation requests which might fail now (because they do not have
access to the reserves). So you think that we should simply disable the
OOM killer while there is any reservation active? Wouldn't that be even
more fragile when something goes terribly wrong?

> The file system could send down a different flag if we are doing
> writebacks for page cleaning purposes, in which case the reservation
> request would be a "just a heads up, we *will* be needing this much
> memory, but this is not something where we can block or return ENOMEM,
> so please give us the highest priority for using the free reserves".

Sure, that part is clear.
 
> > I think the idea is good! It will just be quite tricky to get there
> > without causing more problems than those being solved. The biggest
> > question mark so far seems to be the reservation size estimation. If
> > it is hard for any caller to know the size beforehand (which would
> > be really close to the actually used size) then the whole complexity
> > in the code sounds like overkill and asking the administrator to tune
> > min_free_kbytes seems a better fit (we would still have to teach the
> > allocator to access reserves when really necessary) because the system
> > would behave more predictably (although some memory would be wasted).
> 
> If we do need to teach the allocator to access reserves when really
> necessary, don't we have that already via GFP_NOIO/GFP_NOFS and
> GFP_NOFAIL?

GFP_NOFAIL doesn't sound like the best fit. Not all NOFAIL callers need
to access reserves - e.g. if they are not blocking anybody from making
progress.

> If the goal is to do something more fine-grained,
> unfortunately at least for the short-term we'll need to preserve the
> existing behaviour and issue warnings until the file system starts
> adding GFP_NOFAIL to those memory allocations where previously,
> GFP_NOFS was effectively guaranteeing that failures would almost
> never happen.

GFP_NOFS not failing is even worse than GFP_KERNEL not failing,
because the former has only very limited ways to perform reclaim. It
basically relies on somebody else to make progress, and that is
definitely a bad model.

> I know of at least one place, discovered with the recent change (and revert),
> where I'll be fixing ext4, but I suspect it won't be the only one,
> especially in the block device drivers.
> 
> 						- Ted

-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 16:05                         ` Johannes Weiner
@ 2015-03-02 17:10                           ` Michal Hocko
  2015-03-02 17:27                             ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2015-03-02 17:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Mon 02-03-15 11:05:37, Johannes Weiner wrote:
> On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
[...]
> > Typical busy system won't be very far away from the high watermark
> > so there would be a reclaim performed during increased watermarks
> > (aka reservation) and that might lead to visible performance
> > degradation. This might be acceptable but it also adds a certain level
> > of unpredictability when performance characteristics might change
> > suddenly.
> 
> There is usually a good deal of clean cache.  As Dave pointed out
> before, clean cache can be considered re-allocatable from NOFS
> contexts, and so we'd only have to maintain this invariant:
> 
> 	min_wmark + private_reserves < free_pages + clean_cache

Do I understand you correctly that we do not have to reclaim clean pages
as per the above invariant?

If yes, how do you reflect overcommit on the clean_cache from multiple
requestors (who are doing reservations)?
My point was that if we keep clean pages on the LRU rather than forcing
them to be reclaimed via increased watermarks, then it might happen that
different callers with access to reserves wouldn't get the promised
amount of reserved memory, because clean_cache is basically a shared
resource.

[...]
-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 17:10                           ` Michal Hocko
@ 2015-03-02 17:27                             ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2015-03-02 17:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes,
	akpm, torvalds

On Mon, Mar 02, 2015 at 06:10:58PM +0100, Michal Hocko wrote:
> On Mon 02-03-15 11:05:37, Johannes Weiner wrote:
> > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> [...]
> > > Typical busy system won't be very far away from the high watermark
> > > so there would be a reclaim performed during increased watermarks
> > > (aka reservation) and that might lead to visible performance
> > > degradation. This might be acceptable but it also adds a certain level
> > > of unpredictability when performance characteristics might change
> > > suddenly.
> > 
> > There is usually a good deal of clean cache.  As Dave pointed out
> > before, clean cache can be considered re-allocatable from NOFS
> > contexts, and so we'd only have to maintain this invariant:
> > 
> > 	min_wmark + private_reserves < free_pages + clean_cache
> 
> Do I understand you correctly that we do not have to reclaim clean pages
> as per the above invariant?
> 
> If yes, how do you reflect overcommit on the clean_cache from multiple
> requestors (who are doing reservations)?
> My point was that if we keep clean pages on the LRU rather than forcing
> to reclaim them via increased watermarks then it might happen that
> different callers with access to reserves wouldn't get the promised amount
> of reserved memory because clean_cache is basically a shared resource.

The sum of all private reservations has to be accounted globally; we
obviously can't overcommit the available resources in order to solve
problems stemming from overcommitting the available resources.

The page allocator can't hand out free pages and page reclaim can not
reclaim clean cache unless that invariant is met.  They both have to
consider them consumed.  It's the same as pre-allocation, the only
thing we save is having to actually reclaim the pages and take them
off the freelist at reservation time - which is a good optimization
since the filesystem might not actually need them all.
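
That accounting, reservation as bookkeeping rather than pre-allocation,
might be sketched as follows (hypothetical names, a single global
counter standing in for per-zone state):

```c
#include <assert.h>

/* Pages are not taken off the freelist at reservation time; they are
 * merely accounted as consumed, so the allocator and reclaim both see
 * a smaller effective pool. Not kernel code. */
static long total_reserved;	/* sum of all private reservations */

/* What non-reserve allocations may actually consume. */
static long effective_avail(long free_pages, long clean_cache)
{
	return free_pages + clean_cache - total_reserved;
}

static void resv_take(long pages)   { total_reserved += pages; }
static void resv_return(long pages) { total_reserved -= pages; }
```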

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-02-23  7:32                         ` Dave Chinner
  2015-02-27 18:24                           ` Vlastimil Babka
  2015-03-02  9:39                           ` Vlastimil Babka
@ 2015-03-02 20:22                           ` Johannes Weiner
  2015-03-02 23:12                             ` Dave Chinner
  2 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-03-02 20:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote:
> On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> > When allocating pages the caller should drain its reserves in
> > preference to dipping into the regular freelist.  This guy has already
> > done his reclaim and shouldn't be penalised a second time.  I guess
> > Johannes's preallocation code should switch to doing this for the same
> > reason, plus the fact that snipping a page off
> > task_struct.prealloc_pages is super-fast and needs to be done sometime
> > anyway so why not do it by default.
> 
> That is at odds with the requirements of demand paging, which
> allocate for objects that are reclaimable within the course of the
> transaction. The reserve is there to ensure forward progress for
> allocations for objects that aren't freed until after the
> transaction completes, but if we drain it for reclaimable objects we
> then have nothing left in the reserve pool when we actually need it.
>
> We do not know ahead of time if the object we are allocating is
> going to be modified and hence locked into the transaction. Hence we
> can't say "use the reserve for this *specific* allocation", and so
> the only guidance we can really give is "we will allocate and
> *permanently consume* this much memory", and the reserve pool needs
> to cover that consumption to guarantee forwards progress.
> 
> Forwards progress for all other allocations is guaranteed because
> they are reclaimable objects - they either freed directly back to
> their source (slab, heap, page lists) or they are freed by shrinkers
> once they have been released from the transaction.
> 
> Hence we need allocations to come from the free list and trigger
> reclaim, regardless of the fact there is a reserve pool there. The
> reserve pool needs to be a last resort once there are no other
> avenues to allocate memory. i.e. it would be used to replace the OOM
> killer for GFP_NOFAIL allocations.

That won't work.  Clean cache can be temporarily unavailable and
off-LRU for several reasons - compaction, migration, pending page
promotion, other reclaimers.  How often are we trying before we dip
into the reserve pool?  As you have noticed, the OOM killer goes off
seemingly prematurely at times, and the reason for that is that we
simply don't KNOW the exact point when we ran out of reclaimable
memory.  We cannot take an atomic snapshot of all zones, of all nodes,
of all tasks running in order to determine this reliably, we have to
approximate it.  That's why OOM is defined as "we have scanned a great
many pages and couldn't free any of them."

So unless you tell us which allocations should come from previously
declared reserves, and which ones should rely on reclaim and may fail,
the reserves can deplete prematurely and we're back to square one.

> > And to make it much worse, how
> > many pages of which orders?  Bless its heart, slub will go and use
> > a 1-order page for allocations which should have been in 0-order
> > pages..

It can always fall back to the minimum order.

> The majority of allocations will be order-0, though if we know that
> they are going to be significant numbers of high order allocations,
> then it should be simple enough to tell the mm subsystem "need a
> reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have
> memory compaction just do it's stuff. But, IMO, we should cross that
> bridge when somebody actually needs reservations to be that
> specific....

Compaction can be at an impasse for the same reasons mentioned above.
It cannot just stop_machine() to guarantee it can assemble a higher
order page from a bunch of in-use order-0 cache pages.  If you need
higher-order allocations in a transaction, you have to pre-allocate.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02  9:39                           ` Vlastimil Babka
@ 2015-03-02 22:31                             ` Dave Chinner
  2015-03-03  9:13                               ` Vlastimil Babka
  2015-03-07  0:20                               ` Johannes Weiner
  0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2015-03-02 22:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote:
> On 02/23/2015 08:32 AM, Dave Chinner wrote:
> >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:
> >>
> >>Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc.  Add a dynamic
> >>reserve.  So to reserve N pages we increase the page allocator dynamic
> >>reserve by N, do some reclaim if necessary then deposit N tokens into
> >>the caller's task_struct (it'll be a set of zone/nr-pages tuples I
> >>suppose).
> >>
> >>When allocating pages the caller should drain its reserves in
> >>preference to dipping into the regular freelist.  This guy has already
> >>done his reclaim and shouldn't be penalised a second time.  I guess
> >>Johannes's preallocation code should switch to doing this for the same
> >>reason, plus the fact that snipping a page off
> >>task_struct.prealloc_pages is super-fast and needs to be done sometime
> >>anyway so why not do it by default.
> >
> >That is at odds with the requirements of demand paging, which
> >allocate for objects that are reclaimable within the course of the
> >transaction. The reserve is there to ensure forward progress for
> >allocations for objects that aren't freed until after the
> >transaction completes, but if we drain it for reclaimable objects we
> >then have nothing left in the reserve pool when we actually need it.
> >
> >We do not know ahead of time if the object we are allocating is
> >going to modified and hence locked into the transaction. Hence we
> >can't say "use the reserve for this *specific* allocation", and so
> >the only guidance we can really give is "we will allocate and
> >*permanently consume* this much memory", and the reserve pool needs
> >to cover that consumption to guarantee forwards progress.
> 
> I'm not sure I understand properly. You don't know if a specific
> allocation is permanent or reclaimable, but you can tell in advance
> how much in total will be permanent? Is it because you are
> conservative and assume everything will be permanent, or how?

Because we know the worst case object modification constraints
*exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know
exactly what in memory objects we lock into the transaction and what
memory is required to modify and track those objects. e.g: for a
data extent allocation, the log reservation is as such:

/*
 * In a write transaction we can allocate a maximum of 2
 * extents.  This gives:
 *    the inode getting the new extents: inode size
 *    the inode's bmap btree: max depth * block size
 *    the agfs of the ags from which the extents are allocated: 2 * sector
 *    the superblock free block counter: sector size
 *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
 * And the bmap_finish transaction can free bmap blocks in a join:
 *    the agfs of the ags containing the blocks: 2 * sector size
 *    the agfls of the ags containing the blocks: 2 * sector size
 *    the super block free block counter: sector size
 *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
 */
STATIC uint
xfs_calc_write_reservation(
        struct xfs_mount        *mp)
{
        return XFS_DQUOT_LOGRES(mp) +
                MAX((xfs_calc_inode_res(mp, 1) +
                     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
                                      XFS_FSB_TO_B(mp, 1)) +
                     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
                     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
                                      XFS_FSB_TO_B(mp, 1))),
                    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
                     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
                                      XFS_FSB_TO_B(mp, 1))));
}

It's trivial to extend this logic to memory allocation
requirements, because the above is an exact encoding of all the
objects we "permanently consume" memory for within the transaction.

What we don't know is how many objects we might need to scan to find
the objects we will eventually modify.  Here's an (admittedly
extreme) example to demonstrate a worst case scenario: allocate a
64k data extent. Because it is an exact size allocation, we look it
up in the by-size free space btree. Free space is fragmented, so
there are about a million 64k free space extents in the tree.

Once we find the first 64k extent, we search them to find the best
locality target match.  The btree records are 16 bytes each, so we
fit roughly 500 to a 4k block. Say we search half the extents to
find the best match - i.e. we walk a thousand leaf blocks before
finding the match we want, and modify that leaf block.

Now, the modification removed an entry from the leaf and that
triggers leaf merge thresholds, so a merge with the 1002nd block
occurs. That block now demand pages in and we then modify and join
it to the transaction. Now we walk back up the btree to update
indexes, merging blocks all the way back up to the root.  We have a
worst case size btree (5 levels) and we merge at every level meaning
we demand page another 8 btree blocks and modify them.

In this case, we've demand paged ~1010 btree blocks, but only
modified 10 of them. i.e. the memory we consumed permanently was
only 10 4k buffers (approx. 10 slab and 10 page allocations), but
the allocation demand was 2 orders of magnitude more than the
unreclaimable memory consumption of the btree modification.
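The arithmetic of that scenario can be written down directly (a toy model using the numbers above: a million extents, ~500 records per leaf, searching half the tree, a 5-level worst-case btree; the structure and function names are invented for illustration):

```c
/* Cost of the worst-case by-size btree search described above. */
struct walk_cost {
        int demand_paged;       /* blocks touched: reclaimable memory */
        int modified;           /* blocks dirtied: pinned until commit */
};

struct walk_cost btree_search_cost(long extents, int recs_per_block,
                                   int levels)
{
        struct walk_cost c;
        /* walk half the extents to find the best locality match */
        int leaves_walked = (int)(extents / 2 / recs_per_block);  /* ~1000 */
        int merge_neighbour = 1;          /* extra leaf pulled in by the merge */
        int upper_merges = 2 * (levels - 1); /* merges all the way to the root */

        c.demand_paged = leaves_walked + merge_neighbour + upper_merges;
        c.modified = 2 + upper_merges;    /* both leaves + interior blocks */
        return c;
}
```

With those inputs the model reproduces the ~1010 demand-paged vs. 10 modified blocks: allocation demand roughly two orders of magnitude above the permanent consumption.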

I hope you start to see the scope of the problem now...

> Can you at least at some later point in transaction recognize that
> "OK, this object was not permanent after all" and tell mm that it
> can lower your reserve?

I'm not including any memory used by objects we know won't be locked
into the transaction in the reserve. Demand paged object memory is
essentially unbound but is easily reclaimable. That reclaim will
give us forward progress guarantees on the memory required here.

> >Yes, that's the big problem with preallocation, as well as your
> >proposed "depelete the reserved memory first" approach. They
> >*require* up front "preallocation" of free memory, either directly
> >by the application, or internally by the mm subsystem.
> 
> I don't see why it would deadlock, if during reserve time the mm can
> return ENOMEM as the reserver should be able to back out at that
> point.

Preallocated reserves do not allow for unbound demand paging of
reclaimable objects within reserved allocation contexts.

Cheers

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 20:22                           ` Johannes Weiner
@ 2015-03-02 23:12                             ` Dave Chinner
  2015-03-03  2:50                               ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-03-02 23:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote:
> On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote:
> > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> > > When allocating pages the caller should drain its reserves in
> > > preference to dipping into the regular freelist.  This guy has already
> > > done his reclaim and shouldn't be penalised a second time.  I guess
> > > Johannes's preallocation code should switch to doing this for the same
> > > reason, plus the fact that snipping a page off
> > > task_struct.prealloc_pages is super-fast and needs to be done sometime
> > > anyway so why not do it by default.
> > 
> > That is at odds with the requirements of demand paging, which
> > allocate for objects that are reclaimable within the course of the
> > transaction. The reserve is there to ensure forward progress for
> > allocations for objects that aren't freed until after the
> > transaction completes, but if we drain it for reclaimable objects we
> > then have nothing left in the reserve pool when we actually need it.
> >
> > We do not know ahead of time if the object we are allocating is
> > going to be modified and hence locked into the transaction. Hence we
> > can't say "use the reserve for this *specific* allocation", and so
> > the only guidance we can really give is "we will allocate and
> > *permanently consume* this much memory", and the reserve pool needs
> > to cover that consumption to guarantee forwards progress.
> > 
> > Forwards progress for all other allocations is guaranteed because
> > they are reclaimable objects - they either freed directly back to
> > their source (slab, heap, page lists) or they are freed by shrinkers
> > once they have been released from the transaction.
> > 
> > Hence we need allocations to come from the free list and trigger
> > reclaim, regardless of the fact there is a reserve pool there. The
> > reserve pool needs to be a last resort once there are no other
> > avenues to allocate memory. i.e. it would be used to replace the OOM
> > killer for GFP_NOFAIL allocations.
> 
> That won't work.

I don't see why not...

> Clean cache can be temporarily unavailable and
> off-LRU for several reasons - compaction, migration, pending page
> promotion, other reclaimers.  How often are we trying before we dip
> into the reserve pool?  As you have noticed, the OOM killer goes off
> seemingly prematurely at times, and the reason for that is that we
> simply don't KNOW the exact point when we ran out of reclaimable
> memory.

Sure, but that's irrelevant to the problem at hand. At some point,
the MM subsystem is going to decide "we're at OOM" - it's *what
happens next* that matters.

> We cannot take an atomic snapshot of all zones, of all nodes,
> of all tasks running in order to determine this reliably, we have to
> approximate it.  That's why OOM is defined as "we have scanned a great
> many pages and couldn't free any of them."

Yes, and reserve pools *do not change* the logic that leads to that
decision. What changes is that we don't "kick the OOM killer",
instead we "allocate from the reserve pool." The reserve pool
*replaces* the OOM killer as a method of guaranteeing forwards
allocation progress for those subsystems that can use reservations.
If there is no reserve pool for the current task, then you can still
kick the OOM killer....
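The proposed decision point can be sketched as a toy model of the slowpath (an assumption about the shape of the change, not kernel code; names are invented): reclaim behaviour is untouched, and only the "what happens next" at the OOM decision differs based on whether the task holds a reservation.

```c
#include <stdbool.h>

/* Model of the allocation slowpath with reservations: the "we're at
 * OOM" decision is unchanged; what changes is what happens after it. */
enum alloc_result { ALLOC_FREELIST, ALLOC_RESERVE, ALLOC_OOM_KILL };

struct task_model {
        long reserve_pages;     /* pages granted at transaction start */
};

enum alloc_result slowpath_alloc(struct task_model *task,
                                 long free_pages, bool reclaim_progress)
{
        /* normal path: free list and reclaim, exactly as today */
        if (free_pages > 0 || reclaim_progress)
                return ALLOC_FREELIST;

        /* the MM has decided "we're at OOM" */
        if (task->reserve_pages > 0) {
                task->reserve_pages--;
                return ALLOC_RESERVE;   /* replaces the OOM killer */
        }
        return ALLOC_OOM_KILL;          /* no reservation: old behaviour */
}
```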

> So unless you tell us which allocations should come from previously
> declared reserves, and which ones should rely on reclaim and may fail,
> the reserves can deplete prematurely and we're back to square one.

Like the OOM killer, filesystems are not omnipotent and are not
perfect.  Requiring us to be so is entirely unreasonable, and is
*entirely unnecessary* from the POV of the mm subsystem.

Reservations give the mm subsystem a *strong model* for guaranteeing
forwards allocation progress, and it can be independently verified
and tested without having to care about how some subsystem uses it.
The mm subsystem supplies the *mechanism*, and mm developers are
entirely focussed around ensuring the mechanism works and is
verifiable.  i.e. you could write some debug kernel module to
exercise, verify and regression test the model behaviour, which is
something that simply cannot be done with the OOM killer.

Reservation sizes required by a subsystem are *policy*. They are not
a problem the mm subsystem needs to be concerned with as the
subsystem has to get the reservations right for the mechanism to
work. i.e. Managing reservation sizes is my responsibility as a
subsystem maintainer, just like it's currently my responsibility for
ensuring that transient ENOMEM conditions don't result in a
filesystem shutdown....

> Compaction can be at an impasse for the same reasons mentioned above.
> It can not just stop_machine() to guarantee it can assemble a higher
> order page from a bunch of in-use order-0 cache pages.  If you need
> higher-order allocations in a transaction, you have to pre-allocate.

It's much simpler just to use order-0 reservations and vmalloc if we
can't get high order allocations. We already do this in most places
where high order allocations are required, so there's really no
change needed here. ;)
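The order-0-with-fallback idiom looks roughly like this userspace sketch (the kernel version would try kmalloc and fall back to vmalloc; here the `fail_contig` flag stands in for high-order allocation failure, which we can't trigger from userspace, and the fallback `malloc()` is a placeholder for `vmalloc()`):

```c
#include <stdlib.h>

/* Sketch: prefer a physically contiguous allocation, fall back to a
 * virtually contiguous one rather than demanding high-order pages. */
void *alloc_large(size_t size, int fail_contig, int *used_fallback)
{
        void *p = NULL;

        if (!fail_contig)
                p = malloc(size);       /* kmalloc analogue: contiguous */
        if (p) {
                *used_fallback = 0;
                return p;
        }
        *used_fallback = 1;
        return malloc(size);            /* placeholder for vmalloc(size) */
}
```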

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 23:12                             ` Dave Chinner
@ 2015-03-03  2:50                               ` Johannes Weiner
  2015-03-04  6:52                                 ` Dave Chinner
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-03-03  2:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Tue, Mar 03, 2015 at 10:12:06AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote:
> > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> > > > When allocating pages the caller should drain its reserves in
> > > > preference to dipping into the regular freelist.  This guy has already
> > > > done his reclaim and shouldn't be penalised a second time.  I guess
> > > > Johannes's preallocation code should switch to doing this for the same
> > > > reason, plus the fact that snipping a page off
> > > > task_struct.prealloc_pages is super-fast and needs to be done sometime
> > > > anyway so why not do it by default.
> > > 
> > > That is at odds with the requirements of demand paging, which
> > > allocate for objects that are reclaimable within the course of the
> > > transaction. The reserve is there to ensure forward progress for
> > > allocations for objects that aren't freed until after the
> > > transaction completes, but if we drain it for reclaimable objects we
> > > then have nothing left in the reserve pool when we actually need it.
> > >
> > > We do not know ahead of time if the object we are allocating is
> > > going to be modified and hence locked into the transaction. Hence we
> > > can't say "use the reserve for this *specific* allocation", and so
> > > the only guidance we can really give is "we will allocate and
> > > *permanently consume* this much memory", and the reserve pool needs
> > > to cover that consumption to guarantee forwards progress.
> > > 
> > > Forwards progress for all other allocations is guaranteed because
> > > they are reclaimable objects - they either freed directly back to
> > > their source (slab, heap, page lists) or they are freed by shrinkers
> > > once they have been released from the transaction.
> > > 
> > > Hence we need allocations to come from the free list and trigger
> > > reclaim, regardless of the fact there is a reserve pool there. The
> > > reserve pool needs to be a last resort once there are no other
> > > avenues to allocate memory. i.e. it would be used to replace the OOM
> > > killer for GFP_NOFAIL allocations.
> > 
> > That won't work.
> 
> I don't see why not...
> 
> > Clean cache can be temporarily unavailable and
> > off-LRU for several reasons - compaction, migration, pending page
> > promotion, other reclaimers.  How often are we trying before we dip
> > into the reserve pool?  As you have noticed, the OOM killer goes off
> > seemingly prematurely at times, and the reason for that is that we
> > simply don't KNOW the exact point when we ran out of reclaimable
> > memory.
> 
> Sure, but that's irrelevant to the problem at hand. At some point,
> the MM subsystem is going to decide "we're at OOM" - it's *what
> happens next* that matters.

It's not irrelevant at all.  That point is an arbitrary magic number
that is a byproduct of many implementation details and concurrency in
the memory management layer.  It's completely fine to tie allocations
which can fail to this point, but you can't reasonably calibrate your
emergency reserves, which are supposed to guarantee progress, to such
an unpredictable variable.

When you reserve based on the share of allocations that you know will
be unreclaimable, you are assuming that all other allocations will be
reclaimable, and that is simply flawed.  There is so much concurrency
in the MM subsystem that you can't reasonably expect a single scanner
instance to recover the majority of theoretically reclaimable memory.

> > We cannot take an atomic snapshot of all zones, of all nodes,
> > of all tasks running in order to determine this reliably, we have to
> > approximate it.  That's why OOM is defined as "we have scanned a great
> > many pages and couldn't free any of them."
> 
> Yes, and reserve pools *do not change* the logic that leads to that
> decision. What changes is that we don't "kick the OOM killer",
> instead we "allocate from the reserve pool." The reserve pool
> *replaces* the OOM killer as a method of guaranteeing forwards
> allocation progress for those subsystems that can use reservations.

In order to replace the OOM killer in its role as progress guarantee,
the reserves can't run dry during the transaction.  Because what are
we going to do in that case?

> If there is no reserve pool for the current task, then you can still
> kick the OOM killer....

... so we are not actually replacing the OOM killer, we just defer it
with reserves that were calibrated to an anecdotal snapshot of a fuzzy
quantity of reclaim activity?  Is the idea here to just pile sh*tty,
mostly-working mechanisms on top of each other in the hope that one of
them will kick things along just enough to avoid locking up?

> > So unless you tell us which allocations should come from previously
> > declared reserves, and which ones should rely on reclaim and may fail,
> > the reserves can deplete prematurely and we're back to square one.
> 
> Like the OOM killer, filesystems are not omnipotent and are not
> perfect.  Requiring us to be so is entirely unreasonable, and is
> *entirely unnecessary* from the POV of the mm subsystem.
> 
> Reservations give the mm subsystem a *strong model* for guaranteeing
> forwards allocation progress, and it can be independently verified
> and tested without having to care about how some subsystem uses it.
> The mm subsystem supplies the *mechanism*, and mm developers are
> entirely focussed around ensuring the mechanism works and is
> verifiable.  i.e. you could write some debug kernel module to
> exercise, verify and regression test the model behaviour, which is
> something that simply cannot be done with the OOM killer.
> 
> Reservation sizes required by a subsystem are *policy*. They are not
> a problem the mm subsystem needs to be concerned with as the
> subsystem has to get the reservations right for the mechanism to
> work. i.e. Managing reservation sizes is my responsibility as a
> subsystem maintainer, just like it's currently my responsibility for
> ensuring that transient ENOMEM conditions don't result in a
> filesystem shutdown....

Anything that depends on the point at which the memory management
system gives up reclaiming pages is not verifiable in the slightest.
It will vary from kernel to kernel, from workload to workload, from
run to run.  It will regress in the blink of an eye.


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 22:31                             ` Dave Chinner
@ 2015-03-03  9:13                               ` Vlastimil Babka
  2015-03-04  1:33                                 ` Dave Chinner
  2015-03-07  0:20                               ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2015-03-03  9:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On 03/02/2015 11:31 PM, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote:
>> On 02/23/2015 08:32 AM, Dave Chinner wrote:
>> >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
>> >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote:
>> >We do not know ahead of time if the object we are allocating is
>> >going to be modified and hence locked into the transaction. Hence we
>> >can't say "use the reserve for this *specific* allocation", and so
>> >the only guidance we can really give is "we will allocate and
>> >*permanently consume* this much memory", and the reserve pool needs
>> >to cover that consumption to guarantee forwards progress.
>> 
>> I'm not sure I understand properly. You don't know if a specific
>> allocation is permanent or reclaimable, but you can tell in advance
>> how much in total will be permanent? Is it because you are
>> conservative and assume everything will be permanent, or how?
> 
> Because we know the worst case object modification constraints
> *exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know
> exactly what in memory objects we lock into the transaction and what
> memory is required to modify and track those objects. e.g: for a
> data extent allocation, the log reservation is as such:
> 
> /*
>  * In a write transaction we can allocate a maximum of 2
>  * extents.  This gives:
>  *    the inode getting the new extents: inode size
>  *    the inode's bmap btree: max depth * block size
>  *    the agfs of the ags from which the extents are allocated: 2 * sector
>  *    the superblock free block counter: sector size
>  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
>  * And the bmap_finish transaction can free bmap blocks in a join:
>  *    the agfs of the ags containing the blocks: 2 * sector size
>  *    the agfls of the ags containing the blocks: 2 * sector size
>  *    the super block free block counter: sector size
>  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
>  */
> STATIC uint
> xfs_calc_write_reservation(
>         struct xfs_mount        *mp)
> {
>         return XFS_DQUOT_LOGRES(mp) +
>                 MAX((xfs_calc_inode_res(mp, 1) +
>                      xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
>                                       XFS_FSB_TO_B(mp, 1)) +
>                      xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
>                      xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
>                                       XFS_FSB_TO_B(mp, 1))),
>                     (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
>                      xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
>                                       XFS_FSB_TO_B(mp, 1))));
> }
> 
> It's trivial to extend this logic to memory allocation
> requirements, because the above is an exact encoding of all the
> objects we "permanently consume" memory for within the transaction.
> 
> What we don't know is how many objects we might need to scan to find
> the objects we will eventually modify.  Here's an (admittedly
> extreme) example to demonstrate a worst case scenario: allocate a
> 64k data extent. Because it is an exact size allocation, we look it
> up in the by-size free space btree. Free space is fragmented, so
> there are about a million 64k free space extents in the tree.
> 
> Once we find the first 64k extent, we search them to find the best
> locality target match.  The btree records are 16 bytes each, so we
> fit roughly 500 to a 4k block. Say we search half the extents to
> find the best match - i.e. we walk a thousand leaf blocks before
> finding the match we want, and modify that leaf block.
> 
> Now, the modification removed an entry from the leaf and that
> triggers leaf merge thresholds, so a merge with the 1002nd block
> occurs. That block now demand pages in and we then modify and join
> it to the transaction. Now we walk back up the btree to update
> indexes, merging blocks all the way back up to the root.  We have a
> worst case size btree (5 levels) and we merge at every level meaning
> we demand page another 8 btree blocks and modify them.
> 
> In this case, we've demand paged ~1010 btree blocks, but only
> modified 10 of them. i.e. the memory we consumed permanently was
> only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> the allocation demand was 2 orders of magnitude more than the
> unreclaimable memory consumption of the btree modification.
> 
> I hope you start to see the scope of the problem now...

Thanks, that example did help me understand your position much better.
So you would need to reserve for a worst case number of the objects you modify,
plus some slack for the demand-paged objects that you need to temporarily
access, before you can drop and reclaim them (I suppose that in some of the tree
operations, you need to be holding references to e.g. two nodes at a time, or
maybe the full depth). Or maybe since all these temporary objects are
potentially modifiable, it's already accounted for in the "might be modified" part.

>> Can you at least at some later point in transaction recognize that
>> "OK, this object was not permanent after all" and tell mm that it
>> can lower your reserve?
> 
> I'm not including any memory used by objects we know won't be locked
> into the transaction in the reserve. Demand paged object memory is
> essentially unbound but is easily reclaimable. That reclaim will
> give us forward progress guarantees on the memory required here.
> 
>> >Yes, that's the big problem with preallocation, as well as your
>> >proposed "depelete the reserved memory first" approach. They
>> >*require* up front "preallocation" of free memory, either directly
>> >by the application, or internally by the mm subsystem.
>> 
>> I don't see why it would deadlock, if during reserve time the mm can
>> return ENOMEM as the reserver should be able to back out at that
>> point.
> 
> Preallocated reserves do not allow for unbound demand paging of
> reclaimable objects within reserved allocation contexts.

OK I think I get the point now.

So, lots of the concerns by me and others were about the wasted memory due to
reservations, and increased pressure on the rest of the system. I was thinking,
are you able, at the beginning of the transaction (for these purposes, I think of
transaction as the work that starts with the memory reservation, then it cannot
rollback and relies on the reserves, until it commits and frees the memory),
determine whether the transaction cannot be blocked in its progress by any other
transaction, and the only thing that would block it would be inability to
allocate memory during its course?

If that was the case, we could "share" the reserved memory for all ongoing
transactions of a single class (i.e. xfs transactions). If a transaction knows
it cannot be blocked by anything else, only then it passes the
GFP_CAN_USE_RESERVE flag to the allocator. Once the allocator gives part of the
reserve to one such transaction, it will deny the reserves to other such
transactions, until the first one finishes. In practice it would be more complex
of course, but it should guarantee forward progress without lots of
wasted memory (maybe we wouldn't have to rely on treating clean reclaimable pages
as reserve in that case, which was also pointed out to be problematic).

Of course it all depends on whether you are able to determine the "guaranteed to
not block". I can however easily imagine it's not possible...

> Cheers
> 
> Dave.
> 


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-03  9:13                               ` Vlastimil Babka
@ 2015-03-04  1:33                                 ` Dave Chinner
  2015-03-04  8:50                                   ` Vlastimil Babka
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-03-04  1:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote:
> On 03/02/2015 11:31 PM, Dave Chinner wrote:
> > On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote:
> > 
> > /*
> >  * In a write transaction we can allocate a maximum of 2
> >  * extents.  This gives:
> >  *    the inode getting the new extents: inode size
> >  *    the inode's bmap btree: max depth * block size
> >  *    the agfs of the ags from which the extents are allocated: 2 * sector
> >  *    the superblock free block counter: sector size
> >  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.....
> Thanks, that example did help me understand your position much better.
> So you would need to reserve for a worst case number of the objects you modify,
> plus some slack for the demand-paged objects that you need to temporarily
> access, before you can drop and reclaim them (I suppose that in some of the tree
> operations, you need to be holding references to e.g. two nodes at a time, or
> maybe the full depth). Or maybe since all these temporary objects are
> potentially modifiable, it's already accounted for in the "might be modified" part.

Already accounted for in the "might be modified path".

> >> Can you at least at some later point in transaction recognize that
> >> "OK, this object was not permanent after all" and tell mm that it
> >> can lower your reserve?
> > 
> > I'm not including any memory used by objects we know won't be locked
> > into the transaction in the reserve. Demand paged object memory is
> > essentially unbound but is easily reclaimable. That reclaim will
> > give us forward progress guarantees on the memory required here.
> > 
> >> >Yes, that's the big problem with preallocation, as well as your
> >> >proposed "depelete the reserved memory first" approach. They
> >> >*require* up front "preallocation" of free memory, either directly
> >> >by the application, or internally by the mm subsystem.
> >> 
> >> I don't see why it would deadlock, if during reserve time the mm can
> >> return ENOMEM as the reserver should be able to back out at that
> >> point.
> > 
> > Preallocated reserves do not allow for unbound demand paging of
> > reclaimable objects within reserved allocation contexts.
> 
> OK I think I get the point now.
> 
> So, lots of the concerns by me and others were about the wasted memory due to
> reservations, and increased pressure on the rest of the system. I was thinking,
> are you able, at the beginning of the transaction (for these purposes, I think of
> transaction as the work that starts with the memory reservation, then it cannot
> rollback and relies on the reserves, until it commits and frees the memory),
> determine whether the transaction cannot be blocked in its progress by any other
> transaction, and the only thing that would block it would be inability to
> allocate memory during its course?

No. e.g. any transaction that requires allocation or freeing of an
inode or extent can get stuck behind any other transaction that is
allocating/freeing an inode/extent. And this will happen when
holding inode locks, which means other transactions on that inode
will then get stuck on the inode lock, and so on. Blocking
dependencies within transactions are everywhere and cannot be
avoided.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-03  2:50                               ` Johannes Weiner
@ 2015-03-04  6:52                                 ` Dave Chinner
  2015-03-04 15:04                                   ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-03-04  6:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Mon, Mar 02, 2015 at 09:50:23PM -0500, Johannes Weiner wrote:
> On Tue, Mar 03, 2015 at 10:12:06AM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote:
> > > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote:
> > > > > When allocating pages the caller should drain its reserves in
> > > > > preference to dipping into the regular freelist.  This guy has already
> > > > > done his reclaim and shouldn't be penalised a second time.  I guess
> > > > > Johannes's preallocation code should switch to doing this for the same
> > > > > reason, plus the fact that snipping a page off
> > > > > task_struct.prealloc_pages is super-fast and needs to be done sometime
> > > > > anyway so why not do it by default.
> > > > 
> > > > That is at odds with the requirements of demand paging, which
> > > > allocate for objects that are reclaimable within the course of the
> > > > transaction. The reserve is there to ensure forward progress for
> > > > allocations for objects that aren't freed until after the
> > > > transaction completes, but if we drain it for reclaimable objects we
> > > > then have nothing left in the reserve pool when we actually need it.
> > > >
> > > > We do not know ahead of time if the object we are allocating is
> > > > going to modified and hence locked into the transaction. Hence we
> > > > can't say "use the reserve for this *specific* allocation", and so
> > > > the only guidance we can really give is "we will allocate and
> > > > *permanently consume* this much memory", and the reserve pool needs
> > > > to cover that consumption to guarantee forwards progress.
> > > > 
> > > > Forwards progress for all other allocations is guaranteed because
> > > > they are reclaimable objects - they either freed directly back to
> > > > their source (slab, heap, page lists) or they are freed by shrinkers
> > > > once they have been released from the transaction.
> > > > 
> > > > Hence we need allocations to come from the free list and trigger
> > > > reclaim, regardless of the fact there is a reserve pool there. The
> > > > reserve pool needs to be a last resort once there are no other
> > > > avenues to allocate memory. i.e. it would be used to replace the OOM
> > > > killer for GFP_NOFAIL allocations.
> > > 
> > > That won't work.
> > 
> > I don't see why not...
> > 
> > > Clean cache can be temporarily unavailable and
> > > off-LRU for several reasons - compaction, migration, pending page
> > > promotion, other reclaimers.  How often are we trying before we dip
> > > into the reserve pool?  As you have noticed, the OOM killer goes off
> > > seemingly prematurely at times, and the reason for that is that we
> > > simply don't KNOW the exact point when we ran out of reclaimable
> > > memory.
> > 
> > Sure, but that's irrelevant to the problem at hand. At some point,
> > the MM subsystem is going to decide "we're at OOM" - it's *what
> > happens next* that matters.
> 
> It's not irrelevant at all.  That point is an arbitrary magic number
> that is a byproduct of many implementation details and concurrency in
> the memory management layer.  It's completely fine to tie allocations
> which can fail to this point, but you can't reasonably calibrate your
> emergency reserves, which are supposed to guarantee progress, to such
> an unpredictable variable.
> 
> When you reserve based on the share of allocations that you know will
> be unreclaimable, you are assuming that all other allocations will be
> reclaimable, and that is simply flawed.  There is so much concurrency
> in the MM subsystem that you can't reasonably expect a single scanner
> instance to recover the majority of theoretically reclaimable memory.

On one hand you say "memory accounting is unreliable, so detecting
OOM is unreliable, and so we have an unreliable trigger point".

On the other hand you say "single scanner instance can't reclaim all
memory", again stating we have an unreliable trigger point.

On the gripping hand, that unreliable trigger point is what
kicks the OOM killer.

Yet you consider that point to be reliable enough to kick the OOM
killer, but too unreliable to trigger allocation from a reserve
pool?

Say what?

I suspect you've completely misunderstood what I've been suggesting.

By definition, we have the pages we reserved in the reserve pool,
and unless we've exhausted that reservation with permanent
allocations we should always be able to allocate from it. If the
pool got emptied by demand page allocations, then we back off and
retry reclaim until the reclaimable objects are released back into
the reserve pool. i.e. reclaim fills reserve pools first, then when
they are full pages can go back on free lists for normal
allocations.  This provides the mechanism for forwards progress, and
it's essentially the same mechanism that mempools use to guarantee
forwards progress. The only difference is that reserve pool refilling
comes through reclaim via shrinker invocation...
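As a rough standalone illustration of that mechanism (purely a toy model; none of these names or structures exist in the kernel), reclaim refilling a reserve before the general free list looks like:

```c
/* Toy model of the reserve-pool scheme described above: "reclaim"
 * fills the reserve first, and only once it is full do freed pages
 * go back on the normal free list.  Illustrative only, not kernel code. */
#include <assert.h>

struct reserve_pool {
	int reserved;	/* pages currently held in the reserve */
	int target;	/* reservation size negotiated up front */
	int free_list;	/* pages on the normal free list */
};

/* Reclaimed pages top up the reserve before the free list. */
static void reclaim_pages(struct reserve_pool *p, int freed)
{
	int to_reserve = p->target - p->reserved;

	if (to_reserve > freed)
		to_reserve = freed;
	p->reserved += to_reserve;
	p->free_list += freed - to_reserve;
}

/* Normal allocations come from the free list; only a context holding
 * a reservation may fall back to the reserve as a last resort. */
static int alloc_page_sim(struct reserve_pool *p, int may_use_reserve)
{
	if (p->free_list > 0) {
		p->free_list--;
		return 1;
	}
	if (may_use_reserve && p->reserved > 0) {
		p->reserved--;
		return 1;
	}
	return 0;	/* back off and retry reclaim */
}
```

A failed `alloc_page_sim()` here corresponds to backing off and retrying reclaim until the reserve is replenished, as described above.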

In reality, though, I don't really care how the mm subsystem
implements that pool as long as it handles the cases I've described
(e.g. http://oss.sgi.com/archives/xfs/2015-03/msg00039.html). I don't
think we're making progress here, anyway, so unless you come up with
some other solution this thread is going to die here....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04  1:33                                 ` Dave Chinner
@ 2015-03-04  8:50                                   ` Vlastimil Babka
  2015-03-04 11:03                                     ` Dave Chinner
  0 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2015-03-04  8:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On 03/04/2015 02:33 AM, Dave Chinner wrote:
> On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote:
>>>
>>> Preallocated reserves do not allow for unbound demand paging of
>>> reclaimable objects within reserved allocation contexts.
>>
>> OK I think I get the point now.
>>
>> So, lots of the concerns by me and others were about the wasted memory due to
>> reservations, and increased pressure on the rest of the system. I was thinking,
>> are you able, at the beginning of the transaction (for this purpose, I think of
>> transaction as the work that starts with the memory reservation, then it cannot
>> rollback and relies on the reserves, until it commits and frees the memory),
>> determine whether the transaction cannot be blocked in its progress by any other
>> transaction, and the only thing that would block it would be inability to
>> allocate memory during its course?
>
> No. e.g. any transaction that requires allocation or freeing of an
> inode or extent can get stuck behind any other transaction that is
> allocating/freeing an inode/extent. And this will happen when
> holding inode locks, which means other transactions on that inode
> will then get stuck on the inode lock, and so on. Blocking
> dependencies within transactions are everywhere and cannot be
> avoided.

Hm, I see. I thought that perhaps to avoid deadlocks between 
transactions (which you already have to do somehow), either the 
dependencies have to be structured in a way that there's always some 
transaction that can't block on others. Or you have a way to detect 
potential deadlocks before they happen, and stall somebody who tries to 
lock. Both should (at least theoretically) mean that you would be able 
to point to such transaction, although I can imagine the cost of being 
able to do that could be prohibitive.


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04  8:50                                   ` Vlastimil Babka
@ 2015-03-04 11:03                                     ` Dave Chinner
  0 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2015-03-04 11:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 09:50:58AM +0100, Vlastimil Babka wrote:
> On 03/04/2015 02:33 AM, Dave Chinner wrote:
> >On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote:
> >>>
> >>>Preallocated reserves do not allow for unbound demand paging of
> >>>reclaimable objects within reserved allocation contexts.
> >>
> >>OK I think I get the point now.
> >>
> >>So, lots of the concerns by me and others were about the wasted memory due to
> >>reservations, and increased pressure on the rest of the system. I was thinking,
> >>are you able, at the beginning of the transaction (for this purpose, I think of
> >>transaction as the work that starts with the memory reservation, then it cannot
> >>rollback and relies on the reserves, until it commits and frees the memory),
> >>determine whether the transaction cannot be blocked in its progress by any other
> >>transaction, and the only thing that would block it would be inability to
> >>allocate memory during its course?
> >
> >No. e.g. any transaction that requires allocation or freeing of an
> >inode or extent can get stuck behind any other transaction that is
> >allocating/freeing an inode/extent. And this will happen when
> >holding inode locks, which means other transactions on that inode
> >will then get stuck on the inode lock, and so on. Blocking
> >dependencies within transactions are everywhere and cannot be
> >avoided.
> 
> Hm, I see. I thought that perhaps to avoid deadlocks between
> transactions (which you already have to do somehow),

Of course, by following lock ordering rules, rules about holding
locks over transaction reservations, allowing bulk reservations for
rolling transactions that don't unlock objects between transaction
commits, having allocation group ordering rules, block allocation
ordering rules, transactional lock recursion support to prevent
transaction deadlocking walking over objects already locked into the
transaction, etc.

By following those rules, we guarantee forwards progress in the
transaction subsystem. If we can also guarantee forwards progress in
memory allocation inside transaction context (like Irix did all
those years ago :P), then we can guarantee that transactions will
always complete unless there is a bug or corruption is detected
during an operation...

> either the
> dependencies have to be structured in a way that there's always some
> transaction that can't block on others. Or you have a way to detect
> potential deadlocks before they happen, and stall somebody who tries
> to lock.

$ git grep ASSERT fs/xfs |wc -l
1716

About 3% of the code in XFS is ASSERT statements used to verify
context specific state is correct in CONFIG_XFS_DEBUG=y builds.

FYI, from cloc:

Subsystem       files          blank        comment           code
-------------------------------------------------------------------------------
fs/xfs            157          10841          25339          69140
mm/                97          13923          25534          67870
fs/btrfs           86          14443          15097          85065

Cheers,

Dave.

PS: XFS userspace has another 110,000 lines of code in xfsprogs and
60,000 lines of code in xfsdump, and there's also 80,000 lines of
test code in xfstests.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 16:58                           ` Michal Hocko
@ 2015-03-04 12:52                             ` Dave Chinner
  0 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2015-03-04 12:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs,
	Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds

On Mon, Mar 02, 2015 at 05:58:23PM +0100, Michal Hocko wrote:
> On Mon 02-03-15 11:39:13, Theodore Ts'o wrote:
> > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> > > The idea is sound. But I am pretty sure we will find many corner
> > > cases. E.g. what if the mere reservation attempt causes the system
> > > to go OOM and trigger the OOM killer?
> > 
> > Doctor, doctor, it hurts when I do that....
> > 
> > So don't trigger the OOM killer.  We can let the caller decide whether
> > the reservation request should block or return ENOMEM, but the whole
> > point of the reservation request idea is that this happens *before*
> > we've taken any mutexes, so blocking won't prevent forward progress.
> 
> Maybe I wasn't clear. I wasn't concerned about the context which
> is doing to reservation. I was more concerned about all the other
> allocation requests which might fail now (becasuse they do not have
> access to the reserves). So you think that we should simply disable OOM
> killer while there is any reservation active? Wouldn't that be even more
> fragile when something goes terribly wrong?

That's a silly strawman.  Why wouldn't you simply block them until
the reserves are released when the transaction completes and the
unused memory goes back to the free pool?

Let me try another tack. My qualifications are as a
distributed control system engineer, not a computer scientist. I
see everything as a system of interconnected feedback loops: an
operating system is nothing but a set of very complex, tightly
interconnected control systems.

Precedence? IO-less dirty throttling - that came about after I'd
been advocating a control theory based algorithm for several years
to solve the breakdown problems of dirty page throttling.  We look
at the code Fenguang Wu wrote as one of the major success stories of
Linux - the writeback code just works and nobody ever has to tune it
anymore.

I see the problem of direct memory reclaim as being very similar to
the problems the old IO based write throttling had: it has unbound
concurrency, severe unfairness and breaks down badly when heavily
loaded.  As a control system, it has the same terrible design
as the IO-based write throttling had.

There are other many similarities, too.

Allocation can only take place at the rate at which reclaim occurs,
and we only have a limited budget of allocatable pages. This is the
same as the dirty page throttling - dirtying pages is limited to the
rate we can clean pages, and there are a limited budget of dirty
pages in the system.

Reclaiming pages is also done most efficiently by a single thread
per zone where lots of internal context can be kept (kswapd). This
is similar to how optimal writeback of dirty pages requires a
single thread with internal context per block device.

Waiting for free pages to arrive can be done by an ordered queuing
system, and we can account for the number of pages each allocation
requires in the queueing system and hence only need wake the number
of waiters that will consume the memory just freed. Just like we do
with the dirty page throttling queue.

As such, the same solutions could be applied. As the allocation
demand exceeds the supply of free pages, we throttle allocation by
sleeping on an ordered queue and only waking waiters at the rate
at which kswapd reclaim can free pages. It's trivial to account
accurately, and the feedback loop is relatively simple, too.
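A toy version of that ordered queue makes the accounting concrete (invented names; nothing here is a real kernel interface): each waiter records how many pages it needs, and a reclaim event wakes only as many waiters as the freed pages can fully satisfy.

```c
/* Toy model of the ordered allocation-throttling queue described
 * above.  Waiters queue FIFO with the number of pages they need;
 * reclaim wakes waiters in order until the freed pages run out. */
#include <assert.h>

#define MAX_WAITERS 16

struct alloc_queue {
	int need[MAX_WAITERS];	/* pages each queued waiter requires */
	int head, tail;
};

static void queue_waiter(struct alloc_queue *q, int pages)
{
	q->need[q->tail++] = pages;
}

/* Called when reclaim frees 'freed' pages: wake waiters in order,
 * stopping at the first waiter whose demand cannot be met in full.
 * Returns the number of waiters woken. */
static int pages_freed(struct alloc_queue *q, int freed)
{
	int woken = 0;

	while (q->head < q->tail && q->need[q->head] <= freed) {
		freed -= q->need[q->head++];
		woken++;
	}
	return woken;
}
```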

We can also easily maintain a reserve of free pages this way, usable
only by allocation marked with special flags.  The reserve threshold
can be dynamic, and tasks that request it to change can be blocked
until the reserve has been built up to meet caller requirements.
Allocations that are allowed to dip into the reserve may do so
rather than being added to the queue that waits for reclaim.

Reclaim would always fill the reserve back up to its limits first,
and tasks that have reservations can release them gradually as they
mark them as consumed by the reservation context (e.g. when a
filesystem joins an object to a transaction and modifies it),
thereby reducing the reserve that task has available as it
progresses.
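One possible shape for that dynamic reserve, as a sketch under invented names (real code would block on the reclaim queue where this toy simply fails):

```c
/* Toy model: a task grows the reserve by its worst-case need before
 * starting, marks pages consumed as the transaction permanently
 * swallows them, and returns the unused remainder on commit. */
#include <assert.h>

struct reserve {
	int pool;		/* pages currently set aside */
};

struct reservation_ctx {
	struct reserve *r;
	int remaining;		/* worst case not yet permanently consumed */
};

/* Fails (where the real thing would block) if reclaim cannot yet
 * supply enough pages to build the reserve up. */
static int reserve_begin(struct reserve *r, struct reservation_ctx *ctx,
			 int worst_case, int reclaimable_pages)
{
	if (reclaimable_pages < worst_case)
		return 0;
	r->pool += worst_case;
	ctx->r = r;
	ctx->remaining = worst_case;
	return 1;
}

/* An object permanently joined to the transaction consumes reserve. */
static void reserve_consume(struct reservation_ctx *ctx, int pages)
{
	ctx->remaining -= pages;
	ctx->r->pool -= pages;
}

/* On commit, the unconsumed part of the reservation is released. */
static int reserve_end(struct reservation_ctx *ctx)
{
	int unused = ctx->remaining;

	ctx->r->pool -= unused;
	ctx->remaining = 0;
	return unused;
}
```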

So, there's yet another possible solution to the allocation
reservation problem, and one that solves several other problems that
are being described as making reservation pools difficult or even
impossible to implement.

Seriously, I'm not expecting this problem to be solved tomorrow;
what I want is reliable, deterministic memory allocation behaviour
from the mm subsystem. I want people to be thinking about how to
achieve that rather than limiting their solutions by what we have
now and can hack into the current code, because otherwise we'll
never end up with a reliable memory allocation reservation system....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04  6:52                                 ` Dave Chinner
@ 2015-03-04 15:04                                   ` Johannes Weiner
  2015-03-04 17:38                                     ` Theodore Ts'o
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-03-04 15:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 05:52:42PM +1100, Dave Chinner wrote:
> I suspect you've completely misunderstood what I've been suggesting.
> 
> By definition, we have the pages we reserved in the reserve pool,
> and unless we've exhausted that reservation with permanent
> allocations we should always be able to allocate from it. If the
> pool got emptied by demand page allocations, then we back off and
> retry reclaim until the reclaimable objects are released back into
> the reserve pool. i.e. reclaim fills reserve pools first, then when
> they are full pages can go back on free lists for normal
> allocations.  This provides the mechanism for forwards progress, and
> it's essentially the same mechanism that mempools use to guarantee
> forwards progess. the only difference is that reserve pool refilling
> comes through reclaim via shrinker invocation...

Yes, I had something else in mind.

In order to rely on replenishing through reclaim, you have to make
sure that all allocations taken out of the pool are guaranteed to come
back in a reasonable time frame.  So once Ted said that the filesystem
will not be able to declare which allocations of a task are allowed to
dip into its reserves, and thus allocations of indefinite lifetime can
enter the picture, my mind went to a one-off reserve pool that doesn't
rely on replenishing in order to make forward progress.  You declare
the worst-case, finish the transaction, and return what is left of the
reserves.  This obviously conflicts with the estimation model that you
are proposing, I hope it's now clear where our misunderstanding lies.

Yes, we can make this work if you can tell us which allocations have
limited/controllable lifetime.
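The "one-off" pool contrasted here with the replenished model can be sketched as follows (hypothetical names; a sketch of the idea, not a proposal):

```c
/* Toy model of a one-off reserve: the declared worst case is
 * preallocated into a private stash before the transaction starts,
 * the transaction draws on it without relying on reclaim, and
 * whatever is left is handed back afterwards. */
#include <assert.h>

struct prealloc_stash {
	int pages;	/* preallocated pages private to this task */
};

static void stash_fill(struct prealloc_stash *s, int worst_case)
{
	s->pages = worst_case;	/* assume the up-front allocation succeeded */
}

static int stash_alloc(struct prealloc_stash *s)
{
	if (s->pages == 0)
		return 0;	/* the worst case was underestimated */
	s->pages--;
	return 1;
}

static int stash_release(struct prealloc_stash *s)
{
	int unused = s->pages;

	s->pages = 0;
	return unused;		/* returned to the system free list */
}
```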


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04 15:04                                   ` Johannes Weiner
@ 2015-03-04 17:38                                     ` Theodore Ts'o
  2015-03-04 23:17                                       ` Dave Chinner
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2015-03-04 17:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 10:04:36AM -0500, Johannes Weiner wrote:
> Yes, we can make this work if you can tell us which allocations have
> limited/controllable lifetime.

It may be helpful to be a bit precise about definitions here.  There
are a number of different object lifetimes:

a) will be released before the kernel thread returns control to
userspace

b) will be released once the current I/O operation finishes.  (In the
case of nbd where the remote server has unexpectedly gone away might be
quite a while, but I'm not sure how much we care about that scenario)

c) can be trivially released if the mm subsystem asks via calling a
shrinker

d) can be released only after doing some amount of bounded work (i.e.,
cleaning a dirty page)

e) impossible to predict when it can be released (e.g., dcache, inodes
attached to an open file descriptors, buffer heads that won't be freed
until the file system is umounted, etc.)


I'm guessing that what you mean is (b), but what about cases such as
(c)?

Would the mm subsystem find it helpful if it had more information
about object lifetime?  For example, the CMA folks seem to really care
about knowing whether memory allocations fall in category (e) or not.

						- Ted
						


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-04 17:38                                     ` Theodore Ts'o
@ 2015-03-04 23:17                                       ` Dave Chinner
  0 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2015-03-04 23:17 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm,
	mgorman, dchinner, rientjes, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 12:38:41PM -0500, Theodore Ts'o wrote:
> On Wed, Mar 04, 2015 at 10:04:36AM -0500, Johannes Weiner wrote:
> > Yes, we can make this work if you can tell us which allocations have
> > limited/controllable lifetime.
> 
> It may be helpful to be a bit precise about definitions here.  There
> are a number of different object lifetimes:
> 
> a) will be released before the kernel thread returns control to
> userspace
> 
> b) will be released once the current I/O operation finishes.  (In the
> case of nbd where the remote server has unexpectedly gone away might be
> quite a while, but I'm not sure how much we care about that scenario)
> 
> c) can be trivially released if the mm subsystem asks via calling a
> shrinker
> 
> d) can be released only after doing some amount of bounded work (i.e.,
> cleaning a dirty page)
> 
> e) impossible to predict when it can be released (e.g., dcache, inodes
> attached to an open file descriptors, buffer heads that won't be freed
> until the file system is umounted, etc.)
> 
> 
> I'm guessing that what you mean is (b), but what about cases such as
> (c)?

The thing is, in the XFS transaction case we are hitting e) for
every allocation, and only after IO and/or some processing do we
know whether it will fall into c), d) or whether it will be
permanently consumed.

> Would the mm subsystem find it helpful if it had more information
> about object lifetime?  For example, the CMA folks seem to really care
> about knowing whether memory allocations fall in category (e) or not.

The problem is that most filesystem allocations fall into category
(e). Worse is that the state of an object can change without
allocations having taken place e.g. an object on a reclaimable LRU
can be found via a cache lookup, then joined to and modified in a
transaction. Hence objects can change state from "reclaimable" to
"permanently consumed" without actually going through memory reclaim
and allocation.

IOWs, what is really required is the ability to say "this amount of
allocation reserve is now consumed" /some time after/ we've done the
allocation. i.e. when we join the object to the transaction and
modify it, that's when we need to be able to reduce the reservation
limit as that memory is now permanently consumed by the transaction
context. Objects that fall into c) and d) don't need to have anything
special done, because reclaim will eventually free the memory they
hold once the allocating context releases them.

Indeed, this model works even when we find those c) and d) objects
in cache rather than allocating them. They would get correctly
accounted as "consumed reserve" because we no longer need to
allocate that memory in transaction context and so that reserve can
be released back to the free pool....
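A minimal sketch of that accounting (invented names): the reservation is charged only when an object is joined to the transaction and modified, regardless of whether the object came from a fresh allocation or a cache lookup.

```c
/* Toy model: objects start reclaimable; joining one to the
 * transaction flips it to "permanently consumed" and charges the
 * transaction's remaining reservation at that point. */
#include <assert.h>

struct object {
	int reclaimable;	/* 1: reclaim may free it, 0: pinned */
};

struct txn {
	int reserve_left;	/* worst-case reserve not yet consumed */
};

/* Fresh allocation and cache lookup both yield a reclaimable object;
 * neither charges the reservation yet. */
static void object_init(struct object *o)
{
	o->reclaimable = 1;
}

/* Joining and modifying the object is the point of consumption. */
static int txn_join(struct txn *t, struct object *o, int pages)
{
	if (t->reserve_left < pages)
		return 0;	/* the reservation was underestimated */
	o->reclaimable = 0;
	t->reserve_left -= pages;
	return 1;
}
```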

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-01 11:17                             ` Tetsuo Handa
@ 2015-03-06 11:53                               ` Tetsuo Handa
  0 siblings, 0 replies; 83+ messages in thread
From: Tetsuo Handa @ 2015-03-06 11:53 UTC (permalink / raw)
  To: david
  Cc: tytso, hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, fernando_b1, torvalds

Tetsuo Handa wrote:
> If underestimating is tolerable, can we simply set different watermark
> levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations?
> For example,
> 
>    GFP_KERNEL (or above) can fail if memory usage exceeds 95%
>    GFP_NOFS can fail if memory usage exceeds 97%
>    GFP_NOIO can fail if memory usage exceeds 98%
>    GFP_ATOMIC can fail if memory usage exceeds 99%
> 
> I think that below order-0 GFP_NOIO allocation enters into retry-forever loop
> when GFP_KERNEL (or above) allocation starts waiting for reclaim sounds
> strange. Use of same watermark is preventing kernel worker threads from
> processing workqueue. While it is legal to do blocking operation from
> workqueue, being blocked forever is an exclusive occupation for workqueue;
> other jobs in the workqueue get stuck.
> 

Below experimental patch which raises zone watermark works for me.

----------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..92233e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1710,6 +1710,7 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+	gfp_t gfp_mask;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7abfa70..1a6b830 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1810,6 +1810,12 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
+	if (min == mark) {
+		if (current->gfp_mask & __GFP_FS)
+			min <<= 1;
+		if (current->gfp_mask & __GFP_IO)
+			min <<= 1;
+	}
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
 	if (!(alloc_flags & ALLOC_CMA))
@@ -2810,6 +2816,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		.nodemask = nodemask,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
+	gfp_t orig_gfp_mask;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2831,6 +2838,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 
+	orig_gfp_mask = current->gfp_mask;
+	current->gfp_mask = gfp_mask;
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
@@ -2873,6 +2882,7 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
+	current->gfp_mask = orig_gfp_mask;
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
----------

Thanks again to Jonathan Corbet for writing https://lwn.net/Articles/635354/ .
Is Dave Chinner's "reservations" suggestion conceptually doing what the patch above does?

Dave's suggestion is to ask each GFP_NOFS and GFP_NOIO user to estimate
how many pages they need for their transaction, like

	if (min == mark) {
		if (current->gfp_mask & __GFP_FS)
			min += atomic_read(&reservation_for_gfp_fs);
		if (current->gfp_mask & __GFP_IO)
			min += atomic_read(&reservation_for_gfp_io);
	}

rather than to ask the administrator to specify a static amount, like

	if (min == mark) {
		if (current->gfp_mask & __GFP_FS)
			min += sysctl_reservation_for_gfp_fs;
		if (current->gfp_mask & __GFP_IO)
			min += sysctl_reservation_for_gfp_io;
	}

?

The retry-forever loop will happen if underestimated, won't it?
Then, how to handle it when the OOM killer missed the target (due to
__GFP_FS) or the OOM killer cannot be invoked (due to !__GFP_FS)?
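For reference, the tiered-watermark idea quoted at the top of this mail can be written out standalone (thresholds as proposed there; the MY_GFP_* flag names are invented stand-ins to avoid clashing with the real GFP flags):

```c
/* Toy check: allocations with fewer reclaim options are allowed to
 * push usage closer to 100% before failing. */
#include <assert.h>

#define MY_GFP_FS	0x1	/* may enter filesystem reclaim */
#define MY_GFP_IO	0x2	/* may start I/O */
#define MY_GFP_ATOMIC	0x4	/* cannot sleep at all */

/* Returns 1 if an allocation with these flags may proceed at the
 * given memory usage (percent of RAM in use). */
static int watermark_ok(int flags, int usage_pct)
{
	if (flags & MY_GFP_ATOMIC)
		return usage_pct <= 99;
	if (!(flags & MY_GFP_IO))
		return usage_pct <= 98;		/* GFP_NOIO */
	if (!(flags & MY_GFP_FS))
		return usage_pct <= 97;		/* GFP_NOFS */
	return usage_pct <= 95;			/* GFP_KERNEL and above */
}
```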


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-02 22:31                             ` Dave Chinner
  2015-03-03  9:13                               ` Vlastimil Babka
@ 2015-03-07  0:20                               ` Johannes Weiner
  2015-03-07  3:43                                 ` Dave Chinner
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2015-03-07  0:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, Andrew Morton, torvalds, Vlastimil Babka

On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote:
> What we don't know is how many objects we might need to scan to find
> the objects we will eventually modify.  Here's an (admittedly
> extreme) example to demonstrate a worst case scenario: allocate a
> 64k data extent. Because it is an exact size allocation, we look it
> up in the by-size free space btree. Free space is fragmented, so
> there are about a million 64k free space extents in the tree.
> 
> Once we find the first 64k extent, we search them to find the best
> locality target match.  The btree records are 16 bytes each, so we
> fit roughly 500 to a 4k block. Say we search half the extents to
> find the best match - i.e. we walk a thousand leaf blocks before
> finding the match we want, and modify that leaf block.
> 
> Now, the modification removed an entry from the leaf and tht
> triggers leaf merge thresholds, so a merge with the 1002nd block
> occurs. That block now demand pages in and we then modify and join
> it to the transaction. Now we walk back up the btree to update
> indexes, merging blocks all the way back up to the root.  We have a
> worst case size btree (5 levels) and we merge at every level meaning
> we demand page another 8 btree blocks and modify them.
> 
> In this case, we've demand paged ~1010 btree blocks, but only
> modified 10 of them. i.e. the memory we consumed permanently was
> only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> the allocation demand was 2 orders of magnitude more than the
> unreclaimable memory consumption of the btree modification.
> 
> I hope you start to see the scope of the problem now...

Isn't this bounded one way or another?  Sure, the inaccuracy itself is
high, but when you put the absolute numbers in perspective it really
doesn't seem to matter: with your extreme case of 3MB per transaction,
you can still run 5k+ of them in parallel on a small 16G machine.
Occupy a generous 75% of RAM with anonymous pages, and you can STILL
run over a thousand transactions concurrently.  That would seem like a
decent pipeline to keep the storage device occupied.

The level of precision that you are asking for comes with complexity
and fragility that I'm not convinced are necessary or justified.
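The arithmetic behind those concurrency figures is easy to check (sizes in MB, so 16GB = 16384MB with a 3MB worst-case reservation per transaction):

```c
/* How many worst-case reservations fit in the RAM not otherwise
 * occupied: 16384/3 ~ 5461 on an idle 16GB machine, and still
 * 4096/3 ~ 1365 with 75% of RAM held by anonymous pages. */
#include <assert.h>

static long concurrent_transactions(long ram_mb, long busy_pct,
				    long resv_mb)
{
	long avail_mb = ram_mb * (100 - busy_pct) / 100;

	return avail_mb / resv_mb;
}
```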


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-07  0:20                               ` Johannes Weiner
@ 2015-03-07  3:43                                 ` Dave Chinner
  2015-03-07 15:08                                   ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2015-03-07  3:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, Andrew Morton, torvalds, Vlastimil Babka

On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote:
> On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote:
> > What we don't know is how many objects we might need to scan to find
> > the objects we will eventually modify.  Here's an (admittedly
> > extreme) example to demonstrate a worst case scenario: allocate a
> > 64k data extent. Because it is an exact size allocation, we look it
> > up in the by-size free space btree. Free space is fragmented, so
> > there are about a million 64k free space extents in the tree.
> > 
> > Once we find the first 64k extent, we search them to find the best
> > locality target match.  The btree records are 16 bytes each, so we
> > fit roughly 500 to a 4k block. Say we search half the extents to
> > find the best match - i.e. we walk a thousand leaf blocks before
> > finding the match we want, and modify that leaf block.
> > 
> > Now, the modification removed an entry from the leaf and that
> > triggers leaf merge thresholds, so a merge with the 1002nd block
> > occurs. That block now demand pages in and we then modify and join
> > it to the transaction. Now we walk back up the btree to update
> > indexes, merging blocks all the way back up to the root.  We have a
> > worst case size btree (5 levels) and we merge at every level meaning
> > we demand page another 8 btree blocks and modify them.
> > 
> > In this case, we've demand paged ~1010 btree blocks, but only
> > modified 10 of them. i.e. the memory we consumed permanently was
> > only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> > the allocation demand was 2 orders of magnitude more than the
> > unreclaimable memory consumption of the btree modification.
> > 
> > I hope you start to see the scope of the problem now...
> 
> Isn't this bounded one way or another?

For a single transaction? No.

> Sure, the inaccuracy itself is
> high, but when you put the absolute numbers in perspective it really
> doesn't seem to matter: with your extreme case of 3MB per transaction,
> you can still run 5k+ of them in parallel on a small 16G machine.

No you can't. The number of concurrent transactions is bounded by
the size of the log and the amount of unused space available for
reservation in the log. Under heavy modification loads, that's
usually somewhere between 15-25% of the log, so worst case is a few
hundred megabytes. The memory reservation demand is in the same
order of magnitude as the log space reservation demand.....
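As a rough illustration of that bound (the 2GB log size here is a
hypothetical figure, and the 20% usable fraction is just the midpoint
of the 15-25% range above):

```python
# Rough model of how log space, not RAM, bounds transaction concurrency.
LOG_MB = 2 * 1024                 # hypothetical 2GB journal
USABLE_FRACTION = 0.20            # midpoint of the 15-25% range above
PER_TX_RESERVATION_MB = 3         # quoted worst-case reservation

usable_mb = LOG_MB * USABLE_FRACTION   # ~410MB: "a few hundred megabytes"
max_concurrent = int(usable_mb // PER_TX_RESERVATION_MB)
print(max_concurrent)             # 136 - far below the 5k+ RAM would allow
```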

> Occupy a generous 75% of RAM with anonymous pages, and you can STILL
> run over a thousand transactions concurrently.  That would seem like a
> decent pipeline to keep the storage device occupied.

Typical systems won't ever get to that - they don't do more than a
handful of concurrent transactions at a time - the "thousands of
transactions" occur on dedicated storage servers like petabyte scale
NFS servers that have hundreds of gigabytes of RAM and
hundreds-to-thousands of processing threads to keep the request
pipeline full. The memory in those machines is entirely dedicated to
the filesystem, so keeping a usable pool of a few gigabytes for
transaction reservations isn't a big deal.

The point here is that you're taking what I'm describing as the
requirements of a reservation pool and then applying the worst case
to situations where it is completely inappropriate. That's what I
meant when I told Michal to stop building silly strawman situations;
large amounts of concurrency are required for huge machines, not
your desktop workstation.

And, realistically, sizing that reservation pool appropriately is my
problem to solve - it will depend on many factors, one of which is
the actual geometry of the filesystem itself. You need to stop
thinking that you can control how applications use the memory
allocation and reclaim subsystem, and start to trust that we will
manage our memory usage appropriately to maintain maximum system
throughput.

After all, we already do that for all the filesystem caches the mm
subsystem doesn't control - why do you think I have had such an
interest in shrinker scalability? For XFS, the only cache we
actually don't control reclaim from is user data in the page cache -
we control everything else directly from custom shrinkers.....

> The level of precision that you are asking for comes with complexity
> and fragility that I'm not convinced is necessary, or justified.

Look, if you don't think reservations will work, then how about you
suggest something that will. I don't really care what you implement,
as long as it meets the needs of demand paging, gives me direct
control over memory usage and concurrency policy, and the allocation
mechanism guarantees forward progress without needing the OOM
killer.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: How to handle TIF_MEMDIE stalls?
  2015-03-07  3:43                                 ` Dave Chinner
@ 2015-03-07 15:08                                   ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2015-03-07 15:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, Andrew Morton, torvalds, Vlastimil Babka

On Sat, Mar 07, 2015 at 02:43:47PM +1100, Dave Chinner wrote:
> On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote:
> > On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote:
> > > What we don't know is how many objects we might need to scan to find
> > > the objects we will eventually modify.  Here's an (admittedly
> > > extreme) example to demonstrate a worst case scenario: allocate a
> > > 64k data extent. Because it is an exact size allocation, we look it
> > > up in the by-size free space btree. Free space is fragmented, so
> > > there are about a million 64k free space extents in the tree.
> > > 
> > > Once we find the first 64k extent, we search them to find the best
> > > locality target match.  The btree records are 16 bytes each, so we
> > > fit roughly 500 to a 4k block. Say we search half the extents to
> > > find the best match - i.e. we walk a thousand leaf blocks before
> > > finding the match we want, and modify that leaf block.
> > > 
> > > Now, the modification removed an entry from the leaf and that
> > > triggers leaf merge thresholds, so a merge with the 1002nd block
> > > occurs. That block now demand pages in and we then modify and join
> > > it to the transaction. Now we walk back up the btree to update
> > > indexes, merging blocks all the way back up to the root.  We have a
> > > worst case size btree (5 levels) and we merge at every level meaning
> > > we demand page another 8 btree blocks and modify them.
> > > 
> > > In this case, we've demand paged ~1010 btree blocks, but only
> > > modified 10 of them. i.e. the memory we consumed permanently was
> > > only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> > > the allocation demand was 2 orders of magnitude more than the
> > > unreclaimable memory consumption of the btree modification.
> > > 
> > > I hope you start to see the scope of the problem now...
> > 
> > Isn't this bounded one way or another?
> 
> For a single transaction? No.

So you can have an infinite number of allocations in the context of a
transaction, and only the objects that are going to be locked in are
bounded?

> > Sure, the inaccuracy itself is
> > high, but when you put the absolute numbers in perspective it really
> > doesn't seem to matter: with your extreme case of 3MB per transaction,
> > you can still run 5k+ of them in parallel on a small 16G machine.
> 
> No you can't. The number of concurrent transactions is bounded by
> the size of the log and the amount of unused space available for
> reservation in the log. Under heavy modification loads, that's
> usually somewhere between 15-25% of the log, so worst case is a few
> hundred megabytes. The memory reservation demand is in the same
> order of magnitude as the log space reservation demand.....
> 
> > Occupy a generous 75% of RAM with anonymous pages, and you can STILL
> > run over a thousand transactions concurrently.  That would seem like a
> > decent pipeline to keep the storage device occupied.
> 
> Typical systems won't ever get to that - they don't do more than a
> handful of concurrent transactions at a time - the "thousands of
> transactions" occur on dedicated storage servers like petabyte scale
> NFS servers that have hundreds of gigabytes of RAM and
> hundreds-to-thousands of processing threads to keep the request
> pipeline full. The memory in those machines is entirely dedicated to
> the filesystem, so keeping a usable pool of a few gigabytes for
> transaction reservations isn't a big deal.
> 
> The point here is that you're taking what I'm describing as the
> requirements of a reservation pool and then applying the worst case
> to situations where it is completely inappropriate. That's what I
> meant when I told Michal to stop building silly strawman situations;
> large amounts of concurrency are required for huge machines, not
> your desktop workstation.

Why do you have to take everything I say in bad faith and choose to be
smug instead of constructive?  This is unnecessary.  OF COURSE you
know your constraints better than we do.  Now explain how they matter
in practice, because that's what dictates the design in engineering.

I'm trying to figure out your requirements to find the simplest model,
and yes I'm obviously going to follow up when you give me incomplete
information.  I'm responding to this:

: What we don't know is how many objects we might need to scan to find
: the objects we will eventually modify.  Here's an (admittedly
: extreme) example to demonstrate a worst case scenario:

You gave us numbers that you called "worst case", so I took them and
put them in a scenario where it looks like memory wouldn't be the
bottleneck in real life, even if we just had simple pre-allocation
semantics.  If it was a silly example, why not provide a better one?

I'm fine with reservations and I'm fine with adding more complexity
when you demonstrate that it's needed.  Your argument seems to have
been that worst-case estimates are way off, but can you please just
demonstrate why it matters in practice?  Instead of having me do it
and calling my attempts strawman arguments?  I can only guess at
your constraints; it's up to you to make a case for your requirements.

Here is another example where you responded to akpm:

---
> When allocating pages the caller should drain its reserves in
> preference to dipping into the regular freelist.  This guy has already
> done his reclaim and shouldn't be penalised a second time.  I guess
> Johannes's preallocation code should switch to doing this for the same
> reason, plus the fact that snipping a page off
> task_struct.prealloc_pages is super-fast and needs to be done sometime
> anyway so why not do it by default.

That is at odds with the requirements of demand paging, which
allocates memory for objects that are reclaimable within the course of the
transaction. The reserve is there to ensure forward progress for
allocations for objects that aren't freed until after the
transaction completes, but if we drain it for reclaimable objects we
then have nothing left in the reserve pool when we actually need it.

We do not know ahead of time if the object we are allocating is
going to modified and hence locked into the transaction. Hence we
can't say "use the reserve for this *specific* allocation", and so
the only guidance we can really give is "we will allocate and
*permanently consume* this much memory", and the reserve pool needs
to cover that consumption to guarantee forwards progress.

Forwards progress for all other allocations is guaranteed because
they are reclaimable objects - they either freed directly back to
their source (slab, heap, page lists) or they are freed by shrinkers
once they have been released from the transaction.

Hence we need allocations to come from the free list and trigger
reclaim, regardless of the fact there is a reserve pool there. The
reserve pool needs to be a last resort once there are no other
avenues to allocate memory. i.e. it would be used to replace the OOM
killer for GFP_NOFAIL allocations.
---
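The allocation policy described in that passage - normal allocations go
through the free lists and reclaim, and the reserve is touched only as
a last resort for must-succeed (GFP_NOFAIL-style) allocations - can be
sketched as a toy model (all names below are hypothetical illustrations,
not kernel APIs):

```python
# Toy model of "reserve pool as last resort", per the passage above.
class ToyAllocator:
    def __init__(self, free_pages, reserve_pages):
        self.free = free_pages        # normal free list
        self.reserve = reserve_pages  # forward-progress reserve

    def reclaim(self):
        # Stand-in for page reclaim; a real kernel would scan LRUs here.
        return 0

    def alloc(self, nofail=False):
        # 1. Normal path: free list first, refilled by reclaim.
        if self.free == 0:
            self.free += self.reclaim()
        if self.free > 0:
            self.free -= 1
            return "freelist"
        # 2. Last resort: the reserve, only for must-succeed allocations,
        #    so it is still full when forward progress depends on it.
        if nofail and self.reserve > 0:
            self.reserve -= 1
            return "reserve"
        return None  # caller would loop or fail; reserve stays intact

a = ToyAllocator(free_pages=1, reserve_pages=1)
print(a.alloc())             # "freelist" - ordinary reclaimable allocation
print(a.alloc())             # None - ordinary allocs never drain the reserve
print(a.alloc(nofail=True))  # "reserve" - GFP_NOFAIL-style allocation
```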

Andrew makes a proposal and backs it up with real life benefits:
simpler, faster.  You on the other hand follow up with a list of
unfounded claims and your only counter-argument really seems to be
that Andrew's proposal differs from what you've had in mind.  What you
had in mind was obviously driven by constraints known to you, but it's
not an argument until you actually include them.  We're not taking
your claims at face value; that's not how this ever works.

Just explain why and how your requirements, demand paging reserves in
this case, matter in real life.  Then we can take them seriously.

> And, realistically, sizing that reservation pool appropriately is my
> problem to solve - it will depend on many factors, one of which is
> the actual geometry of the filesystem itself. You need to stop
> thinking that you can control how applications use the memory
> allocation and reclaim subsystem, and start to trust that we will
> manage our memory usage appropriately to maintain maximum system
> throughput.

You've been working on the kernel long enough to know that this is not
how it goes.  I don't care about getting a list of things you claim
you need and implementing them blindly, trusting that you know what
you're doing when it comes to memory.  If you want us to expose an
interface, which puts constraints on our implementation, then you
better provide justification for every single requirement.

> After all, we already do that for all the filesystem caches the mm
> subsystem doesn't control - why do you think I have had such an
> interest in shrinker scalability? For XFS, the only cache we
> actually don't control reclaim from is user data in the page cache -
> we control everything else directly from custom shrinkers.....

You mean those global object pools that are aged through unrelated and
independent per-zone pressure values?

Look, we are specialized in different subsystems, which means we know
the details in front of us better than the details in the surrounding
areas.  You are quick to dismiss constraints and scalability concerns
in the memory subsystem, and I do the same for memory users.  We are
having this discussion in order to explore where our problem spaces
intersect, and we could be making more progress if you stopped
assuming that everybody else is an idiot and you already found the
perfect solution.

We need data on your parameters in order to make a basic cost-benefit
analysis of any proposed solutions.  Don't just propose something and
talk down to us when we ask for clarifications on your constraints.
It's not getting us anywhere.  Explore the problem space with us,
explain your constraints and exact requirements based on real life
data, and then we can look for potential solutions.  That is how we
evaluate every single proposal for the kernel, and it's how it's going
to work in this case.  It's not that complicated.

> > The level of precision that you are asking for comes with complexity
> > and fragility that I'm not convinced is necessary, or justified.
> 
> Look, if you don't think reservations will work, then how about you
> suggest something that will. I don't really care what you implement,
> as long as it meets the needs of demand paging, gives me direct
> control over memory usage and concurrency policy, and the allocation
> mechanism guarantees forward progress without needing the OOM
> killer.

Reservations are fine and I also want them to replace the OOM killer,
we agree on that.

The only thing my email was about was that, in light of the worst-case
numbers you quoted, it didn't look like the demand paging requirement
is strictly necessary to make the system work in practice, which is
why I'm questioning that particular requirement and prompting you to
clarify your position.  You have yet to address this.

Until then, the simplest semantics are preallocation semantics, where
you in advance establish private reserve pools (which can be backed by
clean cache) from which you allocate directly using __GFP_RESERVE.  If
the pool is empty it's immediately detectable and attributable to the
culprit, and the other reserves are not impacted by it.
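As a toy model of those preallocation semantics (`__GFP_RESERVE` is the
flag discussed in the thread; the Python classes, names, and pool sizes
below are purely illustrative):

```python
# Toy model of per-user preallocated reserve pools: each user draws only
# from its own pool, so an empty pool is immediately attributable.
class ReservePool:
    def __init__(self, owner, pages):
        self.owner = owner
        self.pages = pages   # established up front, e.g. at transaction start

    def alloc_reserved(self):
        # Analogue of an allocation with __GFP_RESERVE against this pool.
        if self.pages == 0:
            # The culprit is known: this owner outgrew its own estimate.
            raise MemoryError(f"{self.owner} exhausted its reserve")
        self.pages -= 1

xfs = ReservePool("xfs-transaction", pages=3)
driver = ReservePool("buggy-driver", pages=1)

xfs.alloc_reserved()
xfs.alloc_reserved()
driver.alloc_reserved()
try:
    driver.alloc_reserved()  # pins more than it reserved for
except MemoryError as e:
    print(e)                 # blame lands on "buggy-driver" alone
print(xfs.pages)             # 1 - the xfs pool is untouched by the overrun
```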

A globally shared demand-paged pool is much more fragile because you
trust other participants in the system to keep their promise and not
pin more objects than they reserved for.  Otherwise, they deadlock
your transaction and corrupt your userdata.  How does "XFS filesystem
corrupted because it shares its emergency memory pool to ensure data
integrity with some buggy driver" sound to you?

It's also harder to verify.  If one of the participants misbehaves and
pins more objects than they initially reserved for, how do we identify
the culprit when the system locks up?

Make an actual case why preallocation semantics are unworkable on real
systems with real memory and real filesystems and real data on them,
then we can consider making the model more complex and fragile.


end of thread, other threads:[~2015-03-07 15:08 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20141230112158.GA15546@dhcp22.suse.cz>
     [not found] ` <201502092044.JDG39081.LVFOOtFHQFOMSJ@I-love.SAKURA.ne.jp>
     [not found]   ` <201502102258.IFE09888.OVQFJOMSFtOLFH@I-love.SAKURA.ne.jp>
     [not found]     ` <20150210151934.GA11212@phnom.home.cmpxchg.org>
     [not found]       ` <201502111123.ICD65197.FMLOHSQJFVOtFO@I-love.SAKURA.ne.jp>
     [not found]         ` <201502172123.JIE35470.QOLMVOFJSHOFFt@I-love.SAKURA.ne.jp>
     [not found]           ` <20150217125315.GA14287@phnom.home.cmpxchg.org>
2015-02-17 22:54             ` How to handle TIF_MEMDIE stalls? Dave Chinner
2015-02-17 23:32               ` Dave Chinner
2015-02-18  8:25               ` Michal Hocko
2015-02-18 10:48                 ` Dave Chinner
2015-02-18 12:16                   ` Michal Hocko
2015-02-18 21:31                     ` Dave Chinner
2015-02-19  9:40                       ` Michal Hocko
2015-02-19 22:03                         ` Dave Chinner
2015-02-20  9:27                           ` Michal Hocko
2015-02-19 11:01                     ` Johannes Weiner
2015-02-19 12:29                       ` Michal Hocko
2015-02-19 12:58                         ` Michal Hocko
2015-02-19 15:29                           ` Tetsuo Handa
2015-02-19 21:53                             ` Tetsuo Handa
2015-02-20  9:13                             ` Michal Hocko
2015-02-20 13:37                               ` Stefan Ring
2015-02-19 13:29                         ` Tetsuo Handa
2015-02-20  9:10                           ` Michal Hocko
2015-02-20 12:20                             ` Tetsuo Handa
2015-02-20 12:38                               ` Michal Hocko
2015-02-19 21:43                         ` Dave Chinner
2015-02-20 12:48                           ` Michal Hocko
2015-02-20 23:09                             ` Dave Chinner
2015-02-19 10:24               ` Johannes Weiner
2015-02-19 22:52                 ` Dave Chinner
2015-02-20 10:36                   ` Tetsuo Handa
2015-02-20 23:15                     ` Dave Chinner
2015-02-21  3:20                       ` Theodore Ts'o
2015-02-21  9:19                         ` Andrew Morton
2015-02-21 13:48                           ` Tetsuo Handa
2015-02-21 21:38                           ` Dave Chinner
2015-02-22  0:20                           ` Johannes Weiner
2015-02-23 10:48                             ` Michal Hocko
2015-02-23 11:23                               ` Tetsuo Handa
2015-02-23 21:33                             ` David Rientjes
2015-02-21 12:00                         ` Tetsuo Handa
2015-02-23 10:26                         ` Michal Hocko
2015-02-21 11:12                       ` Tetsuo Handa
2015-02-21 21:48                         ` Dave Chinner
2015-02-21 23:52                   ` Johannes Weiner
2015-02-23  0:45                     ` Dave Chinner
2015-02-23  1:29                       ` Andrew Morton
2015-02-23  7:32                         ` Dave Chinner
2015-02-27 18:24                           ` Vlastimil Babka
2015-02-28  0:03                             ` Dave Chinner
2015-02-28 15:17                               ` Theodore Ts'o
2015-03-02  9:39                           ` Vlastimil Babka
2015-03-02 22:31                             ` Dave Chinner
2015-03-03  9:13                               ` Vlastimil Babka
2015-03-04  1:33                                 ` Dave Chinner
2015-03-04  8:50                                   ` Vlastimil Babka
2015-03-04 11:03                                     ` Dave Chinner
2015-03-07  0:20                               ` Johannes Weiner
2015-03-07  3:43                                 ` Dave Chinner
2015-03-07 15:08                                   ` Johannes Weiner
2015-03-02 20:22                           ` Johannes Weiner
2015-03-02 23:12                             ` Dave Chinner
2015-03-03  2:50                               ` Johannes Weiner
2015-03-04  6:52                                 ` Dave Chinner
2015-03-04 15:04                                   ` Johannes Weiner
2015-03-04 17:38                                     ` Theodore Ts'o
2015-03-04 23:17                                       ` Dave Chinner
2015-02-28 16:29                       ` Johannes Weiner
2015-02-28 16:41                         ` Theodore Ts'o
2015-02-28 22:15                           ` Johannes Weiner
2015-03-01 11:17                             ` Tetsuo Handa
2015-03-06 11:53                               ` Tetsuo Handa
2015-03-01 13:43                             ` Theodore Ts'o
2015-03-01 16:15                               ` Johannes Weiner
2015-03-01 19:36                                 ` Theodore Ts'o
2015-03-01 20:44                                   ` Johannes Weiner
2015-03-01 20:17                               ` Johannes Weiner
2015-03-01 21:48                             ` Dave Chinner
2015-03-02  0:17                               ` Dave Chinner
2015-03-02 12:46                                 ` Brian Foster
2015-02-28 18:36                       ` Vlastimil Babka
2015-03-02 15:18                       ` Michal Hocko
2015-03-02 16:05                         ` Johannes Weiner
2015-03-02 17:10                           ` Michal Hocko
2015-03-02 17:27                             ` Johannes Weiner
2015-03-02 16:39                         ` Theodore Ts'o
2015-03-02 16:58                           ` Michal Hocko
2015-03-04 12:52                             ` Dave Chinner
