* Re: How to handle TIF_MEMDIE stalls? [not found] ` <20150217125315.GA14287@phnom.home.cmpxchg.org> @ 2015-02-17 22:54 ` Dave Chinner 2015-02-17 23:32 ` Dave Chinner ` (2 more replies) 0 siblings, 3 replies; 83+ messages in thread From: Dave Chinner @ 2015-02-17 22:54 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds [ cc xfs list - experienced kernel devs should not have to be reminded to do this ] On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote: > > Tetsuo Handa wrote: > > > Johannes Weiner wrote: > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644 > > > > --- a/mm/page_alloc.c > > > > +++ b/mm/page_alloc.c > > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > > if (high_zoneidx < ZONE_NORMAL) > > > > goto out; > > > > /* The OOM killer does not compensate for light reclaim */ > > > > - if (!(gfp_mask & __GFP_FS)) > > > > + if (!(gfp_mask & __GFP_FS)) { > > > > + /* > > > > + * XXX: Page reclaim didn't yield anything, > > > > + * and the OOM killer can't be invoked, but > > > > + * keep looping as per should_alloc_retry(). > > > > + */ > > > > + *did_some_progress = 1; > > > > goto out; > > > > + } > > > > > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations? > > > > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings > > at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm: > > page_alloc: embed OOM killing naturally into allocation slowpath" introduced > > a regression and below one is the fix. 
> > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > /* The OOM killer does not needlessly kill tasks for lowmem */ > > if (high_zoneidx < ZONE_NORMAL) > > goto out; > > - /* The OOM killer does not compensate for light reclaim */ > > - if (!(gfp_mask & __GFP_FS)) > > - goto out; > > /* > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > Again, we don't want to OOM kill on behalf of allocations that can't > initiate IO, or even actively prevent others from doing it. Not per > default anyway, because most callers can deal with the failure without > having to resort to killing tasks, and NOFS reclaim *can* easily fail. > It's the exceptions that should be annotated instead: > > void * > kmem_alloc(size_t size, xfs_km_flags_t flags) > { > int retries = 0; > gfp_t lflags = kmem_flags_convert(flags); > void *ptr; > > do { > ptr = kmalloc(size, lflags); > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > return ptr; > if (!(++retries % 100)) > xfs_err(NULL, > "possible memory allocation deadlock in %s (mode:0x%x)", > __func__, lflags); > congestion_wait(BLK_RW_ASYNC, HZ/50); > } while (1); > } > > This should use __GFP_NOFAIL, which is not only designed to annotate > broken code like this, but also recognizes that endless looping on a > GFP_NOFS allocation needs the OOM killer after all to make progress. 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > index a7a3a63bb360..17ced1805d3a 100644 > --- a/fs/xfs/kmem.c > +++ b/fs/xfs/kmem.c > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > void * > kmem_alloc(size_t size, xfs_km_flags_t flags) > { > - int retries = 0; > gfp_t lflags = kmem_flags_convert(flags); > - void *ptr; > > - do { > - ptr = kmalloc(size, lflags); > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > - return ptr; > - if (!(++retries % 100)) > - xfs_err(NULL, > - "possible memory allocation deadlock in %s (mode:0x%x)", > - __func__, lflags); > - congestion_wait(BLK_RW_ASYNC, HZ/50); > - } while (1); > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > + lflags |= __GFP_NOFAIL; > + > + return kmalloc(size, lflags); > } Hmmm - the only reason there is a focus on this loop is that it emits warnings about allocations failing. It's obvious that the problem being dealt with here is a fundamental design issue w.r.t. to locking and the OOM killer, but the proposed special casing hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code in XFS started emitting warnings about allocations failing more often. So the answer is to remove the warning? That's like killing the canary to stop the methane leak in the coal mine. No canary? No problems! Right now, the oom killer is a liability. Over the past 6 months I've slowly had to exclude filesystem regression tests from running on small memory machines because the OOM killer is now so unreliable that it kills the test harness regularly rather than the process generating memory pressure. That's a big red flag to me that all this hacking around the edges is not solving the underlying problem, but instead is breaking things that did once work. And, well, then there's this (gfp.h): * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller * cannot handle allocation failures. This modifier is deprecated and no new * users should be added. 
So, is this another policy revelation from the mm developers about the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated? Or just another symptom of frantic thrashing because nobody actually understands the problem or those that do are unwilling to throw out the broken crap and redesign it? If you are changing allocator behaviour and constraints, then you better damn well think through those changes fully, then document those changes, change all the relevant code to use the new API (not just those that throw warnings in your face) and make sure *everyone* knows about it. e.g. a LWN article explaining the changes and how memory allocation is going to work into the future would be a good start. Otherwise, this just looks like another knee-jerk band aid for an architectural problem that needs more than special case hacks to solve. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
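Johannes' proposed conversion above boils down to a flag translation: callers that may not fail hand the retry obligation to the allocator via __GFP_NOFAIL instead of open-coding a loop. A minimal userspace sketch of that logic follows; the KM_*/__GFP_* values and kmem_flags_convert() are simplified stand-ins for the kernel definitions, not the real ones.

```c
#include <assert.h>

/* Simplified stand-ins for the XFS and gfp flag definitions. */
#define KM_SLEEP	0x0001u
#define KM_NOSLEEP	0x0002u
#define KM_NOFS		0x0004u
#define KM_MAYFAIL	0x0008u

#define __GFP_WAIT	0x010u
#define __GFP_FS	0x080u
#define __GFP_NOFAIL	0x800u

typedef unsigned int gfp_t;
typedef unsigned int xfs_km_flags_t;

/* Crude approximation of kmem_flags_convert(): sleeping allocations
 * may wait, and KM_NOFS strips the ability to recurse into the fs. */
static gfp_t kmem_flags_convert(xfs_km_flags_t flags)
{
	gfp_t lflags = 0;

	if (!(flags & KM_NOSLEEP))
		lflags |= __GFP_WAIT;
	if (!(flags & KM_NOFS))
		lflags |= __GFP_FS;
	return lflags;
}

/* The core of the proposed patch: if the caller did not opt into
 * failure (KM_MAYFAIL) or atomicity (KM_NOSLEEP), annotate the
 * allocation as must-not-fail instead of looping in the caller. */
static gfp_t kmem_alloc_gfp(xfs_km_flags_t flags)
{
	gfp_t lflags = kmem_flags_convert(flags);

	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
		lflags |= __GFP_NOFAIL;
	return lflags;
}
```

The interesting property is that GFP_NOFS callers (KM_NOFS) still end up with __GFP_NOFAIL set, which is exactly the interaction with the OOM killer that the rest of the thread argues about.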
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 22:54 ` How to handle TIF_MEMDIE stalls? Dave Chinner @ 2015-02-17 23:32 ` Dave Chinner 2015-02-18 8:25 ` Michal Hocko 2015-02-19 10:24 ` Johannes Weiner 2 siblings, 0 replies; 83+ messages in thread From: Dave Chinner @ 2015-02-17 23:32 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, akpm, torvalds On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote: > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote: > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > /* The OOM killer does not needlessly kill tasks for lowmem */ > > > if (high_zoneidx < ZONE_NORMAL) > > > goto out; > > > - /* The OOM killer does not compensate for light reclaim */ > > > - if (!(gfp_mask & __GFP_FS)) > > > - goto out; > > > /* > > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > > > Again, we don't want to OOM kill on behalf of allocations that can't > > initiate IO, or even actively prevent others from doing it. Not per > > default anyway, because most callers can deal with the failure without > > having to resort to killing tasks, and NOFS reclaim *can* easily fail. 
> > It's the exceptions that should be annotated instead: > > > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > void *ptr; > > > > do { > > ptr = kmalloc(size, lflags); > > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > return ptr; > > if (!(++retries % 100)) > > xfs_err(NULL, > > "possible memory allocation deadlock in %s (mode:0x%x)", > > __func__, lflags); > > congestion_wait(BLK_RW_ASYNC, HZ/50); > > } while (1); > > } > > > > This should use __GFP_NOFAIL, which is not only designed to annotate > > broken code like this, but also recognizes that endless looping on a > > GFP_NOFS allocation needs the OOM killer after all to make progress. > > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > > index a7a3a63bb360..17ced1805d3a 100644 > > --- a/fs/xfs/kmem.c > > +++ b/fs/xfs/kmem.c > > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > - int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > - void *ptr; > > > > - do { > > - ptr = kmalloc(size, lflags); > > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > - return ptr; > > - if (!(++retries % 100)) > > - xfs_err(NULL, > > - "possible memory allocation deadlock in %s (mode:0x%x)", > > - __func__, lflags); > > - congestion_wait(BLK_RW_ASYNC, HZ/50); > > - } while (1); > > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > > + lflags |= __GFP_NOFAIL; > > + > > + return kmalloc(size, lflags); > > } > > Hmmm - the only reason there is a focus on this loop is that it > emits warnings about allocations failing. It's obvious that the > problem being dealt with here is a fundamental design issue w.r.t. > to locking and the OOM killer, but the proposed special casing > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code > in XFS started emitting warnings about allocations failing more > often. 
> > So the answer is to remove the warning? That's like killing the > canary to stop the methane leak in the coal mine. No canary? No > problems! I'll also point out that there are two other identical allocation loops in XFS, one of which is only 30 lines below this one. That's further indication that this is a "silence the warning" patch rather than something that actually fixes a problem.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
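The open-coded retry pattern that appears in these three places can be modeled deterministically in userspace. Here fake_kmalloc(), failures_left, and warnings_emitted are test scaffolding invented for illustration, and congestion_wait() is reduced to a comment; the loop shape itself is the one quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int failures_left;	/* test knob: failures before success */
static int warnings_emitted;	/* counts the xfs_err() firings */

/* Stand-in for kmalloc(): fails failures_left times, then succeeds. */
static void *fake_kmalloc(size_t size)
{
	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return malloc(size);
}

/* The kmem_alloc() pattern: retry forever, warning every 100 failed
 * attempts, exactly like the loop under discussion. */
static void *looping_alloc(size_t size)
{
	int retries = 0;
	void *ptr;

	do {
		ptr = fake_kmalloc(size);
		if (ptr)
			return ptr;
		if (!(++retries % 100))
			warnings_emitted++;	/* xfs_err(...) in the kernel */
		/* congestion_wait(BLK_RW_ASYNC, HZ/50) would sleep here */
	} while (1);
}
```

Run against 250 simulated failures, the loop warns twice (at retries 100 and 200) and then returns the allocation, which is why these messages only ever show up under sustained reclaim failure.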
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 22:54 ` How to handle TIF_MEMDIE stalls? Dave Chinner 2015-02-17 23:32 ` Dave Chinner @ 2015-02-18 8:25 ` Michal Hocko 2015-02-18 10:48 ` Dave Chinner 2015-02-19 10:24 ` Johannes Weiner 2 siblings, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-02-18 8:25 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Wed 18-02-15 09:54:30, Dave Chinner wrote: > [ cc xfs list - experienced kernel devs should not have to be > reminded to do this ] > > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: [...] > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > void *ptr; > > > > do { > > ptr = kmalloc(size, lflags); > > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > return ptr; > > if (!(++retries % 100)) > > xfs_err(NULL, > > "possible memory allocation deadlock in %s (mode:0x%x)", > > __func__, lflags); > > congestion_wait(BLK_RW_ASYNC, HZ/50); > > } while (1); > > } > > > > This should use __GFP_NOFAIL, which is not only designed to annotate > > broken code like this, but also recognizes that endless looping on a > > GFP_NOFS allocation needs the OOM killer after all to make progress. 
> > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > > index a7a3a63bb360..17ced1805d3a 100644 > > --- a/fs/xfs/kmem.c > > +++ b/fs/xfs/kmem.c > > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > - int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > - void *ptr; > > > > - do { > > - ptr = kmalloc(size, lflags); > > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > - return ptr; > > - if (!(++retries % 100)) > > - xfs_err(NULL, > > - "possible memory allocation deadlock in %s (mode:0x%x)", > > - __func__, lflags); > > - congestion_wait(BLK_RW_ASYNC, HZ/50); > > - } while (1); > > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > > + lflags |= __GFP_NOFAIL; > > + > > + return kmalloc(size, lflags); > > } > > Hmmm - the only reason there is a focus on this loop is that it > emits warnings about allocations failing. Such a warning should be part of the allocator and the whole point why I like the patch is that we should really warn at a single place. I was thinking about a simple warning (e.g. like the above) and having something more sophisticated when lockdep is enabled. > It's obvious that the > problem being dealt with here is a fundamental design issue w.r.t. > to locking and the OOM killer, but the proposed special casing > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code > in XFS started emitting warnings about allocations failing more > often. > > So the answer is to remove the warning? That's like killing the > canary to stop the methane leak in the coal mine. No canary? No > problems! Not at all. I cannot speak for Johannes but I am pretty sure his motivation wasn't to simply silence the warning. The thing is that no kernel code paths except for the page allocator shouldn't emulate behavior for which we have a gfp flag. > Right now, the oom killer is a liability. 
Over the past 6 months > I've slowly had to exclude filesystem regression tests from running > on small memory machines because the OOM killer is now so unreliable > that it kills the test harness regularly rather than the process > generating memory pressure. It would be great to get bug reports. > That's a big red flag to me that all > this hacking around the edges is not solving the underlying problem, > but instead is breaking things that did once work. I am heavily trying to discourage people from adding random hacks to the already complicated and subtle OOM code. > And, well, then there's this (gfp.h): > > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > * cannot handle allocation failures. This modifier is deprecated and no new > * users should be added. > > So, is this another policy relevation from the mm developers about > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated? It is deprecated and shouldn't be used. But that doesn't mean that users should workaround this by developing their own alternative. I agree the wording could be more clear and mention that if the allocation failure is absolutely unacceptable then the flags can be used rather than creating the loop around. What do you think about the following? diff --git a/include/linux/gfp.h b/include/linux/gfp.h index b840e3b2770d..ee6440ccb75d 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -57,8 +57,12 @@ struct vm_area_struct; * _might_ fail. This depends upon the particular VM implementation. * * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller - * cannot handle allocation failures. This modifier is deprecated and no new - * users should be added. + * cannot handle allocation failures. This modifier is deprecated for allocation + * with order > 1. Besides that this modifier is very dangerous when allocation + * happens under a lock because it creates a lock dependency invisible for the + * OOM killer so it can livelock. 
If the allocation failure is _absolutely_ + * unacceptable then the flags has to be used rather than looping around + * allocator. * * __GFP_NORETRY: The VM implementation must not retry indefinitely. * > Or just another symptom of frantic thrashing because nobody actually > understands the problem or those that do are unwilling to throw out > the broken crap and redesign it? > > If you are changing allocator behaviour and constraints, then you > better damn well think through that changes fully, then document > those changes, change all the relevant code to use the new API (not > just those that throw warnings in your face) and make sure > *everyone* knows about it. e.g. a LWN article explaining the changes > and how memory allocation is going to work into the future would be > a good start. Well, I think the first step is to change the users of the allocator to not lie about gfp flags. So if the code is infinitely trying then it really should use GFP_NOFAIL flag. In the meantime page allocator should develop a proper diagnostic to help identify all the potential dependencies. Next we should start thinking whether all the existing GFP_NOFAIL paths are really necessary or the code can be refactored/reimplemented to accept allocation failures. > Otherwise, this just looks like another knee-jerk band aid for an > architectural problem that needs more than special case hacks to > solve. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 83+ messages in thread
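Michal's point that the deadlock warning should live in a single place can be sketched as an allocator entry point that loops internally for __GFP_NOFAIL callers and owns the warning, while ordinary callers simply see NULL. This is a userspace model under invented names (try_alloc, page_alloc, allocator_warnings), not the real page allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define __GFP_NOFAIL 0x800u	/* illustrative value */

static int allocator_warnings;	/* the single, central warning site */
static int simulated_failures;	/* test knob: failures before success */

/* Stand-in for one reclaim-plus-allocation attempt. */
static void *try_alloc(size_t size)
{
	if (simulated_failures > 0) {
		simulated_failures--;
		return NULL;
	}
	return malloc(size);
}

static void *page_alloc(size_t size, unsigned int gfp)
{
	int retries = 0;
	void *ptr;

	while (!(ptr = try_alloc(size))) {
		if (!(gfp & __GFP_NOFAIL))
			return NULL;	/* ordinary callers see the failure */
		if (!(++retries % 100))
			allocator_warnings++;	/* warn once, centrally */
	}
	return ptr;
}
```

With the loop inside the allocator, every __GFP_NOFAIL user shares one warning and one diagnostic point, which is what makes lockdep-style dependency reporting feasible there.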
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 8:25 ` Michal Hocko @ 2015-02-18 10:48 ` Dave Chinner 2015-02-18 12:16 ` Michal Hocko 0 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-02-18 10:48 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > > [ cc xfs list - experienced kernel devs should not have to be > > reminded to do this ] > > > > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > [...] > > > void * > > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > > { > > > int retries = 0; > > > gfp_t lflags = kmem_flags_convert(flags); > > > void *ptr; > > > > > > do { > > > ptr = kmalloc(size, lflags); > > > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > > return ptr; > > > if (!(++retries % 100)) > > > xfs_err(NULL, > > > "possible memory allocation deadlock in %s (mode:0x%x)", > > > __func__, lflags); > > > congestion_wait(BLK_RW_ASYNC, HZ/50); > > > } while (1); > > > } > > > > > > This should use __GFP_NOFAIL, which is not only designed to annotate > > > broken code like this, but also recognizes that endless looping on a > > > GFP_NOFS allocation needs the OOM killer after all to make progress. 
> > > > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > > > index a7a3a63bb360..17ced1805d3a 100644 > > > --- a/fs/xfs/kmem.c > > > +++ b/fs/xfs/kmem.c > > > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize) > > > void * > > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > > { > > > - int retries = 0; > > > gfp_t lflags = kmem_flags_convert(flags); > > > - void *ptr; > > > > > > - do { > > > - ptr = kmalloc(size, lflags); > > > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > > - return ptr; > > > - if (!(++retries % 100)) > > > - xfs_err(NULL, > > > - "possible memory allocation deadlock in %s (mode:0x%x)", > > > - __func__, lflags); > > > - congestion_wait(BLK_RW_ASYNC, HZ/50); > > > - } while (1); > > > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > > > + lflags |= __GFP_NOFAIL; > > > + > > > + return kmalloc(size, lflags); > > > } > > > > Hmmm - the only reason there is a focus on this loop is that it > > emits warnings about allocations failing. > > Such a warning should be part of the allocator and the whole point why > I like the patch is that we should really warn at a single place. I > was thinking about a simple warning (e.g. like the above) and having > something more sophisticated when lockdep is enabled. > > > It's obvious that the > > problem being dealt with here is a fundamental design issue w.r.t. > > to locking and the OOM killer, but the proposed special casing > > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code > > in XFS started emitting warnings about allocations failing more > > often. > > > > So the answer is to remove the warning? That's like killing the > > canary to stop the methane leak in the coal mine. No canary? No > > problems! > > Not at all. I cannot speak for Johannes but I am pretty sure his > motivation wasn't to simply silence the warning. The thing is that no > kernel code paths except for the page allocator shouldn't emulate > behavior for which we have a gfp flag. 
> > > Right now, the oom killer is a liability. Over the past 6 months > > I've slowly had to exclude filesystem regression tests from running > > on small memory machines because the OOM killer is now so unreliable > > that it kills the test harness regularly rather than the process > > generating memory pressure. > > It would be great to get bug reports. I thought we were talking about a manifestation of the problems I've been seeing.... > > That's a big red flag to me that all > > this hacking around the edges is not solving the underlying problem, > > but instead is breaking things that did once work. > > I am heavily trying to discourage people from adding random hacks to > the already complicated and subtle OOM code. > > > And, well, then there's this (gfp.h): > > > > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > > * cannot handle allocation failures. This modifier is deprecated and no new > > * users should be added. > > > > So, is this another policy relevation from the mm developers about > > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated? > > It is deprecated and shouldn't be used. But that doesn't mean that users > should workaround this by developing their own alternative. I'm kinda sick of hearing that, as if saying it enough times will make reality change. We have a *hard requirement* for memory allocation to make forwards progress, otherwise we *fail catastrophically*. History lesson - June 2004: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b30a2f7bf90593b12dbc912e4390b1b8ee133ea9 So, we're hardly working around the deprecation of GFP_NOFAIL when the code existed 5 years before GFP_NOFAIL was deprecated. Indeed, GFP_NOFAIL was shiny and new back then, having been introduced by Andrew Morton back in 2003. 
> I agree the > wording could be more clear and mention that if the allocation failure > is absolutely unacceptable then the flags can be used rather than > creating the loop around. What do you think about the following? > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index b840e3b2770d..ee6440ccb75d 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -57,8 +57,12 @@ struct vm_area_struct; > * _might_ fail. This depends upon the particular VM implementation. > * > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > - * cannot handle allocation failures. This modifier is deprecated and no new > - * users should be added. > + * cannot handle allocation failures. This modifier is deprecated for allocation > + * with order > 1. Besides that this modifier is very dangerous when allocation > + * happens under a lock because it creates a lock dependency invisible for the > + * OOM killer so it can livelock. If the allocation failure is _absolutely_ > + * unacceptable then the flags has to be used rather than looping around > + * allocator. Doesn't change anything from an XFS point of view. We do order >1 allocations through kmem_alloc() wrapper, and so we are still doing something that is "not supported" even if we use GFP_NOFAIL rather than our own loop. Also, this reads as an excuse for the OOM killer being broken and not fixing it. Keep in mind that we tell the memory alloc/reclaim subsystem that *we hold locks* when we call into it. That's what GFP_NOFS originally meant, and it's what it still means today in an XFS context. If the OOM killer is not obeying GFP_NOFS and deadlocking on locks that the invoking context holds, then that is a OOM killer bug, not a bug in the subsystem calling kmalloc(GFP_NOFS). > * > * __GFP_NORETRY: The VM implementation must not retry indefinitely. 
> * > > > Or just another symptom of frantic thrashing because nobody actually > > understands the problem or those that do are unwilling to throw out > > the broken crap and redesign it? > > > > If you are changing allocator behaviour and constraints, then you > > better damn well think through that changes fully, then document > > those changes, change all the relevant code to use the new API (not > > just those that throw warnings in your face) and make sure > > *everyone* knows about it. e.g. a LWN article explaining the changes > > and how memory allocation is going to work into the future would be > > a good start. > > Well, I think the first step is to change the users of the allocator > to not lie about gfp flags. So if the code is infinitely trying then > it really should use GFP_NOFAIL flag. That's a complete non-issue when it comes to deciding whether it is safe to invoke the OOM killer or not! > In the meantime page allocator > should develop a proper diagnostic to help identify all the potential > dependencies. Next we should start thinking whether all the existing > GFP_NOFAIL paths are really necessary or the code can be > refactored/reimplemented to accept allocation failures. Last time the "just make filesystems handle memory allocation failures" I pointed out what that meant for XFS: dirty transaction rollback is required. That's freakin' complex, will double the memory footprint of transactions, roughly double the CPU cost, and greatly increase the complexity of the transaction subsystem. It's a *major* rework of a significant amount of the XFS codebase and will take at least a couple of years design, test and stabilise before it could be rolled out to production. I'm not about to spend a couple of years rewriting XFS just so the VM can get rid of a GFP_NOFAIL user. Especially as the we already tell the Hammer of Last Resort the context in which it can work. Move the OOM killer to kswapd - get it out of the direct reclaim path altogether. 
If the system is that backed up on locks that it cannot free any memory and has no reserves to satisfy the allocation that kicked the OOM killer, then the OOM killer was not invoked soon enough. Hell, if you want a better way to proceed, then how about you allow us to tell the MM subsystem how much memory reserve a specific set of operations is going to require to complete? That's something that we can do rough calculations for, and it integrates straight into the existing transaction reservation system we already use for log space and disk space, and we can tell the mm subsystem when the reserve is no longer needed (i.e. last thing in transaction commit). That way we don't start a transaction until the mm subsystem has reserved enough pages for us to work with, and the reserve only needs to be used when normal allocation has already failed. i.e. rather than looping we get a page allocated from the reserve pool. The reservations wouldn't be perfect, but the majority of the time we'd be able to make progress and not need the OOM killer. And best of all, there's no responsibility on the MM subsystem for preventing OOM - getting the reservations right is the responsibility of the subsystem using them. Cheers, Dave. -- Dave Chinner david@fromorbit.com
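The reservation scheme Dave proposes can be sketched as: reserve a worst-case page count before the transaction starts, try normal allocation first, draw on the reserve only when that fails, and return the reserve at commit. The mem_reserve()/mem_reserve_release() API and the page counters below are hypothetical, invented purely to illustrate the shape of the idea.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned int free_pages = 8;	/* toy model of free memory */
static unsigned int reserved_pages;	/* pages set aside for a transaction */

/* Called before the transaction starts; if the reservation cannot be
 * made, the transaction is simply not started -- no OOM killer needed. */
static bool mem_reserve(unsigned int nr)
{
	if (free_pages < nr)
		return false;
	free_pages -= nr;
	reserved_pages += nr;
	return true;
}

/* Allocation inside the transaction: normal allocation first, then the
 * reserve, instead of looping in the allocator. */
static bool alloc_page_for_trans(void)
{
	if (free_pages > 0) {
		free_pages--;
		return true;
	}
	if (reserved_pages > 0) {
		reserved_pages--;
		return true;
	}
	return false;
}

/* Last thing in transaction commit: hand back whatever is left. */
static void mem_reserve_release(void)
{
	free_pages += reserved_pages;
	reserved_pages = 0;
}
```

The reservation is deliberately pessimistic: most transactions never touch the reserve, but when reclaim fails the pages are already there, which is the "majority of the time we'd make progress" claim above.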
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 10:48 ` Dave Chinner @ 2015-02-18 12:16 ` Michal Hocko 2015-02-18 21:31 ` Dave Chinner 2015-02-19 11:01 ` Johannes Weiner 0 siblings, 2 replies; 83+ messages in thread From: Michal Hocko @ 2015-02-18 12:16 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Wed 18-02-15 21:48:59, Dave Chinner wrote: > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: [...] > Also, this reads as an excuse for the OOM killer being broken and > not fixing it. Keep in mind that we tell the memory alloc/reclaim > subsystem that *we hold locks* when we call into it. That's what > GFP_NOFS originally meant, and it's what it still means today in an > XFS context. Sure, and OOM killer will not be invoked in NOFS context. See __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where is the OOM killer broken. The crucial problem we are dealing with is not GFP_NOFAIL triggering the OOM killer but a lock dependency introduced by the following sequence: taskA taskB taskC lock(A) alloc() alloc(gfp | __GFP_NOFAIL) lock(A) out_of_memory # looping for ever if we select_bad_process # cannot make any progress victim = taskB There is no way OOM killer can tell taskB is blocked and that there is dependency between A and B (without lockdep). That is why I call NOFAIL under a lock as dangerous and a bug. > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks > that the invoking context holds, then that is a OOM killer bug, not > a bug in the subsystem calling kmalloc(GFP_NOFS). I guess we are talking about different things here or what am I missing? [...] > > In the meantime page allocator > > should develop a proper diagnostic to help identify all the potential > > dependencies. 
Next we should start thinking whether all the existing > > GFP_NOFAIL paths are really necessary or the code can be > > refactored/reimplemented to accept allocation failures. > > Last time the "just make filesystems handle memory allocation > failures" I pointed out what that meant for XFS: dirty transaction > rollback is required. That's freakin' complex, will double the > memory footprint of transactions, roughly double the CPU cost, and > greatly increase the complexity of the transaction subsystem. It's a > *major* rework of a significant amount of the XFS codebase and will > take at least a couple of years design, test and stabilise before > it could be rolled out to production. > > I'm not about to spend a couple of years rewriting XFS just so the > VM can get rid of a GFP_NOFAIL user. Especially as the we already > tell the Hammer of Last Resort the context in which it can work. > > Move the OOM killer to kswapd - get it out of the direct reclaim > path altogether. This doesn't change anything as explained in other email. The triggering path doesn't wait for the victim to die. > If the system is that backed up on locks that it > cannot free any memory and has no reserves to satisfy the allocation > that kicked the OOM killer, then the OOM killer was not invoked soon > enough. > > Hell, if you want a better way to proceed, then how about you allow > us to tell the MM subsystem how much memory reserve a specific set > of operations is going to require to complete? That's something that > we can do rough calculations for, and it integrates straight into > the existing transaction reservation system we already use for log > space and disk space, and we can tell the mm subsystem when the > reserve is no longer needed (i.e. last thing in transaction commit). > > That way we don't start a transaction until the mm subsystem has > reserved enough pages for us to work with, and the reserve only > needs to be used when normal allocation has already failed. 
> i.e. rather than looping we get a page allocated from the reserve pool. I am not sure I understand the above, but aren't mempools a tool for this purpose? > The reservations wouldn't be perfect, but the majority of the time > we'd be able to make progress and not need the OOM killer. And best > of all, there's no responsibility on the MM subsystem for preventing > OOM - getting the reservations right is the responsibility of the > subsystem using them. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Michal Hocko SUSE Labs
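The mempools Michal refers to work on a related but narrower principle: a fixed number of elements is preallocated up front, allocation falls back to that reserve when the normal allocator fails, and freed elements refill the reserve first. The sketch below mirrors the shape of the kernel's mempool_create()/mempool_alloc()/mempool_free(), but it is a userspace model, not the real interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN 4	/* guaranteed reserve depth */

struct mempool {
	void *elem[POOL_MIN];
	int nr;		/* elements currently held in reserve */
};

static int allocator_broken;	/* test knob: simulate memory exhaustion */

/* Stand-in for the underlying allocator. */
static void *raw_alloc(size_t size)
{
	return allocator_broken ? NULL : malloc(size);
}

/* Fill the reserve up front, while memory is still available. */
static int mempool_init(struct mempool *p, size_t size)
{
	for (p->nr = 0; p->nr < POOL_MIN; p->nr++)
		if (!(p->elem[p->nr] = raw_alloc(size)))
			return -1;
	return 0;
}

static void *mempool_alloc(struct mempool *p, size_t size)
{
	void *ptr = raw_alloc(size);	/* normal allocation first */

	if (!ptr && p->nr > 0)
		ptr = p->elem[--p->nr];	/* fall back to the reserve */
	return ptr;
}

static void mempool_free(struct mempool *p, void *ptr)
{
	if (p->nr < POOL_MIN)
		p->elem[p->nr++] = ptr;	/* refill the reserve first */
	else
		free(ptr);
}
```

The catch, and the gap between this and Dave's proposal, is that a mempool only guarantees progress if elements cycle through it: it serves fixed-size objects on a steady free/alloc rhythm, not an arbitrary worst-case reservation sized per transaction.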
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 12:16 ` Michal Hocko @ 2015-02-18 21:31 ` Dave Chinner 2015-02-19 9:40 ` Michal Hocko 2015-02-19 11:01 ` Johannes Weiner 1 sibling, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-02-18 21:31 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > [...] > > Also, this reads as an excuse for the OOM killer being broken and > > not fixing it. Keep in mind that we tell the memory alloc/reclaim > > subsystem that *we hold locks* when we call into it. That's what > > GFP_NOFS originally meant, and it's what it still means today in an > > XFS context. > > Sure, and the OOM killer will not be invoked in NOFS context. See > __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see where > the OOM killer is broken. I suspect that the page cache missing the correct GFP_NOFS was one of the sources of the problems I've been seeing. However, the oom killer exceptions are not checked if __GFP_NOFAIL is present and so if we start using __GFP_NOFAIL then it will be called in GFP_NOFS contexts... > The crucial problem we are dealing with is not GFP_NOFAIL triggering the > OOM killer but a lock dependency introduced by the following sequence:
>
> taskA                        taskB      taskC
> lock(A)                                 alloc()
> alloc(gfp | __GFP_NOFAIL)    lock(A)    out_of_memory
> # looping for ever if we                select_bad_process
> # cannot make any progress              victim = taskB
>
> There is no way the OOM killer can tell taskB is blocked and that there is > a dependency between A and B (without lockdep). That is why I call NOFAIL > under a lock as dangerous and a bug. Sure. However, eventually the OOM killer will select task A to be killed because nothing else is working.
That, at least, marks taskA with TIF_MEMDIE and gives us a potential way to break the deadlock. But the bigger problem is this:

taskA                        taskB
lock(A)
alloc(GFP_NOFS|GFP_NOFAIL)   lock(A)
out_of_memory
  select_bad_process
    victim = taskB

Because there is no way to *ever* resolve that dependency: taskA never leaves the allocator. Even if the oom killer selects taskA and sets TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE because GFP_NOFAIL is set and continues to loop. This is why GFP_NOFAIL is not a solution to the "never fail" allocation problem. The caller doing the "no fail" allocation _must be able to set failure policy_. i.e. the choice of aborting and shutting down because progress cannot be made, or continuing and hoping for forwards progress, is owned by the allocating context, not the allocator. The memory allocation subsystem cannot make that choice for us as it has no concept of the failure characteristics of the allocating context. The situations in which this actually matters are extremely *rare* - we've had these allocation loops in XFS for > 13 years, and we might get one or two reports a year of these "possible allocation deadlock" messages occurring. Changing *everything* for such a rare, unusual event is not an efficient use of time or resources. > > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks > > that the invoking context holds, then that is an OOM killer bug, not > > a bug in the subsystem calling kmalloc(GFP_NOFS). > > I guess we are talking about different things here or what am I missing? From my perspective, you are tightly focussed on one aspect of the problem and hence are not seeing the bigger picture: this is a corner case of behaviour in a "last hope", brute force memory reclaim technique that no production machine relies on for correct or performant operation.
Next we should start thinking whether all the existing > > > GFP_NOFAIL paths are really necessary or the code can be > > > refactored/reimplemented to accept allocation failures. > > > > Last time the "just make filesystems handle memory allocation > > failures" argument came up, I pointed out what that meant for XFS: dirty transaction > > rollback is required. That's freakin' complex, will double the > > memory footprint of transactions, roughly double the CPU cost, and > > greatly increase the complexity of the transaction subsystem. It's a > > *major* rework of a significant amount of the XFS codebase and will > > take at least a couple of years to design, test and stabilise before > > it could be rolled out to production. > > > > I'm not about to spend a couple of years rewriting XFS just so the > > VM can get rid of a GFP_NOFAIL user. Especially as we already > > tell the Hammer of Last Resort the context in which it can work. > > > > Move the OOM killer to kswapd - get it out of the direct reclaim > > path altogether. > > This doesn't change anything as explained in the other email. The triggering > path doesn't wait for the victim to die. But it does - we wouldn't be talking about deadlocks if there were no blocking dependencies. In this case, allocation keeps retrying until the memory freed by the killed tasks enables it to make forward progress. That's a side effect of the last revelation that was made in this thread that low order allocations never fail... > > If the system is that backed up on locks that it > > cannot free any memory and has no reserves to satisfy the allocation > > that kicked the OOM killer, then the OOM killer was not invoked soon > > enough. > > > > Hell, if you want a better way to proceed, then how about you allow > > us to tell the MM subsystem how much memory reserve a specific set > > of operations is going to require to complete?
That's something that > > we can do rough calculations for, and it integrates straight into > > the existing transaction reservation system we already use for log > > space and disk space, and we can tell the mm subsystem when the > > reserve is no longer needed (i.e. last thing in transaction commit). > > > > That way we don't start a transaction until the mm subsystem has > > reserved enough pages for us to work with, and the reserve only > > needs to be used when normal allocation has already failed. i.e > > rather than looping we get a page allocated from the reserve pool. > > I am not sure I understand the above but aren't mempools a tool for > this purpose? I knew this question would be the next one - I even deleted a one line comment from my last email that said "And no, mempools are not a solution" because that needs a more thorough explanation than a dismissive one-liner. As you know, mempools require a forward progress guarantee on a single type of object and the objects must be slab based. In transaction context we allocate from inode slabs, xfs_buf slabs, log item slabs (6 different ones, IIRC), btree cursor slabs, etc, but then we also have direct page allocations for buffers, vm_map_ram() for mapping multi-page buffers, uncounted heap allocations, etc. We cannot make all of these mempools, nor can we meet the forwards progress requirements of a mempool because other allocations can block and prevent progress. Further, the objects have lifetimes that don't correspond to the transaction life cycles, and hence even if we complete the transaction there is no guarantee that the objects allocated within a transaction are going to be returned to the mempool at its completion.
IOWs, we have need for forward allocation progress guarantees on (potentially) several megabytes of allocations from slab caches, the heap and the page allocator, with all allocations in unpredictable order, with objects of different life times and life cycles, which may, at any time, get stuck behind objects locked in other transactions and hence can randomly block until some other thread makes forward progress and completes a transaction and unlocks the object. The reservation would only need to cover the memory we need to allocate and hold in the transaction (i.e. dirtied objects). There is a potentially unbounded amount of memory required through demand paging of buffers to find the metadata we need to modify, but demand paged metadata that is read and then released is recoverable. i.e. the shrinkers will free it as other memory demand requires, so it's not included in reservation pools because it doesn't deplete the amount of free memory. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 21:31 ` Dave Chinner @ 2015-02-19 9:40 ` Michal Hocko 2015-02-19 22:03 ` Dave Chinner 0 siblings, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-02-19 9:40 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Thu 19-02-15 08:31:18, Dave Chinner wrote: > On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > > [...] > > > Also, this reads as an excuse for the OOM killer being broken and > > > not fixing it. Keep in mind that we tell the memory alloc/reclaim > > > subsystem that *we hold locks* when we call into it. That's what > > > GFP_NOFS originally meant, and it's what it still means today in an > > > XFS context. > > > > Sure, and the OOM killer will not be invoked in NOFS context. See > > __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see where > > the OOM killer is broken. > > I suspect that the page cache missing the correct GFP_NOFS was one > of the sources of the problems I've been seeing. > > However, the oom killer exceptions are not checked if __GFP_NOFAIL Yes, this is true. This is an effect of 9879de7373fc (mm: page_alloc: embed OOM killing naturally into allocation slowpath) and IMO a desirable one. Requiring infinite retrying with a seriously restricted reclaim context calls for trouble (e.g. livelock with no way out because regular reclaim cannot make any progress and the OOM killer as the last resort will not happen). > is present and so if we start using __GFP_NOFAIL then it will be > called in GFP_NOFS contexts...
> > The crucial problem we are dealing with is not GFP_NOFAIL triggering the > > OOM killer but a lock dependency introduced by the following sequence:
> >
> > taskA                        taskB      taskC
> > lock(A)                                 alloc()
> > alloc(gfp | __GFP_NOFAIL)    lock(A)    out_of_memory
> > # looping for ever if we                select_bad_process
> > # cannot make any progress              victim = taskB
> >
> > There is no way the OOM killer can tell taskB is blocked and that there is > > a dependency between A and B (without lockdep). That is why I call NOFAIL > > under a lock as dangerous and a bug. > > Sure. However, eventually the OOM killer will select task A to be > killed because nothing else is working. That would require the OOM killer to be able to select another victim while the current one is still alive. There were time based heuristics suggested to do this but I do not think they are the right way to handle the problem and they should be considered only if all other options fail. One potential way would be to give GFP_NOFAIL contexts access to memory reserves when the allocation domain (global/memcg/cpuset) is OOM. Andrea was suggesting something like that IIRC. > That, at least, marks > taskA with TIF_MEMDIE and gives us a potential way to break the > deadlock. > > But the bigger problem is this:
>
> taskA                        taskB
> lock(A)
> alloc(GFP_NOFS|GFP_NOFAIL)   lock(A)
> out_of_memory
>   select_bad_process
>     victim = taskB
>
> Because there is no way to *ever* resolve that dependency because > taskA never leaves the allocator. Even if the oom killer selects > taskA and sets TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE > because GFP_NOFAIL is set and continues to loop. TIF_MEMDIE will at least give the task access to memory reserves. Anyway this is essentially the same category of livelock as above. > This is why GFP_NOFAIL is not a solution to the "never fail" > allocation problem. The caller doing the "no fail" allocation _must > be able to set failure policy_. i.e.
the choice of aborting and > shutting down because progress cannot be made, or continuing and > hoping for forwards progress, is owned by the allocating context, not > the allocator. I completely agree that the failure policy is the caller's responsibility and I would have no objections to something like:

	do {
		ptr = kmalloc(size, GFP_NOFS);
		if (ptr)
			return ptr;
		if (fatal_signal_pending(current))
			break;
		if (looping_too_long())
			break;
	} while (1);

	fallback_solution();

But this is not the case in kmem_alloc, which is essentially a GFP_NOFAIL allocation with a warning and congestion_wait. There is no failure policy defined there. The warning should be part of the allocator and the NOFAIL policy should be explicit. So why exactly do you oppose changing kmem_alloc (and others which are doing essentially the same)? > The memory allocation subsystem cannot make that > choice for us as it has no concept of the failure characteristics of > the allocating context. Of course. I wasn't arguing we should change allocation loops which have a fallback policy as well. That is an entirely different thing. My point was that we want to turn GFP_NOFAIL equivalents into explicit GFP_NOFAIL so that the allocator can prevent livelocks if possible. > The situations in which this actually matters are extremely *rare* - > we've had these allocation loops in XFS for > 13 years, and we might > get one or two reports a year of these "possible allocation > deadlock" messages occurring. Changing *everything* for such a rare, > unusual event is not an efficient use of time or resources. > > > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks > > > that the invoking context holds, then that is an OOM killer bug, not > > > a bug in the subsystem calling kmalloc(GFP_NOFS). > > > > I guess we are talking about different things here or what am I missing?
> > From my perspective, you are tightly focussed on one aspect of the > problem and hence are not seeing the bigger picture: this is a > corner case of behaviour in a "last hope", brute force memory > reclaim technique that no production machine relies on for correct > or performant operation. Of course this is a corner case. And I am trying to prevent heuristics which would optimize for such a corner case (there were multiple of them suggested in this thread). The reason I care about GFP_NOFAIL is that there are apparently code paths which do not tell the allocator they are basically GFP_NOFAIL without any fallback. This leads to two main problems: 1) we do not have a good overview of how many code paths have such strong requirements and so cannot estimate e.g. how big memory reserves should be, and 2) the allocator cannot help those paths (e.g. by giving them access to reserves to break out of the livelock). > > [...] > > > > In the meantime page allocator > > > > should develop a proper diagnostic to help identify all the potential > > > > dependencies. Next we should start thinking whether all the existing > > > > GFP_NOFAIL paths are really necessary or the code can be > > > > refactored/reimplemented to accept allocation failures. > > > > > > Last time the "just make filesystems handle memory allocation > > > failures" argument came up, I pointed out what that meant for XFS: dirty transaction > > > rollback is required. That's freakin' complex, will double the > > > memory footprint of transactions, roughly double the CPU cost, and > > > greatly increase the complexity of the transaction subsystem. It's a > > > *major* rework of a significant amount of the XFS codebase and will > > > take at least a couple of years to design, test and stabilise before > > > it could be rolled out to production. > > > > > > I'm not about to spend a couple of years rewriting XFS just so the > > > VM can get rid of a GFP_NOFAIL user.
Especially as we already > > > tell the Hammer of Last Resort the context in which it can work. > > > > > > Move the OOM killer to kswapd - get it out of the direct reclaim > > > path altogether. > > > > This doesn't change anything as explained in the other email. The triggering > > path doesn't wait for the victim to die. > > But it does - we wouldn't be talking about deadlocks if there were > no blocking dependencies. In this case, allocation keeps retrying > until the memory freed by the killed tasks enables it to make > forward progress. That's a side effect of the last revelation that > was made in this thread that low order allocations never fail... Sure, low order allocations being almost GFP_NOFAIL makes things much worse of course. And this should be changed. We just have to think about how to do it without breaking the universe. I hope we can discuss this at LSF. But even then I do not see how triggering the OOM killer from kswapd would help here. Victims would be looping in the allocator whether the actual killing happens from their own or any other context. > > > If the system is that backed up on locks that it > > > cannot free any memory and has no reserves to satisfy the allocation > > > that kicked the OOM killer, then the OOM killer was not invoked soon > > > enough. > > > > > > Hell, if you want a better way to proceed, then how about you allow > > > us to tell the MM subsystem how much memory reserve a specific set > > > of operations is going to require to complete? That's something that > > > we can do rough calculations for, and it integrates straight into > > > the existing transaction reservation system we already use for log > > > space and disk space, and we can tell the mm subsystem when the > > > reserve is no longer needed (i.e. last thing in transaction commit).
> > > > > > That way we don't start a transaction until the mm subsystem has > > > reserved enough pages for us to work with, and the reserve only > > > needs to be used when normal allocation has already failed. i.e > > > rather than looping we get a page allocated from the reserve pool. > > > > I am not sure I understand the above but aren't mempools a tool for > > this purpose? > > I knew this question would be the next one - I even deleted a one > line comment from my last email that said "And no, mempools are not > a solution" because that needs a more thorough explanation than a > dismissive one-liner. > > As you know, mempools require a forward progress guarantee on a > single type of object and the objects must be slab based. > > In transaction context we allocate from inode slabs, xfs_buf slabs, > log item slabs (6 different ones, IIRC), btree cursor slabs, etc, > but then we also have direct page allocations for buffers, vm_map_ram() > for mapping multi-page buffers, uncounted heap allocations, etc. > We cannot make all of these mempools, nor can we meet the forwards > progress requirements of a mempool because other allocations can > block and prevent progress. > > Further, the objects have lifetimes that don't correspond to the > transaction life cycles, and hence even if we complete the > transaction there is no guarantee that the objects allocated within > a transaction are going to be returned to the mempool at its > completion.
Thanks for the clarification, I have to think about it some more, though. My thinking was that mempools could be used as an emergency pool with pre-allocated memory which would be used in the non-failing contexts. > The reservation would only need to cover the memory we need to > allocate and hold in the transaction (i.e. dirtied objects). There > is a potentially unbounded amount of memory required through demand > paging of buffers to find the metadata we need to modify, but demand > paged metadata that is read and then released is recoverable. i.e. > the shrinkers will free it as other memory demand requires, so it's > not included in reservation pools because it doesn't deplete the > amount of free memory. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Michal Hocko SUSE Labs
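The caller-side failure policy Michal sketched earlier in this exchange - retry a bounded number of times, then fall back instead of looping forever - can be modelled in userspace. This is a toy sketch with hypothetical names (`looping_too_long()` is modelled by the bounded retry count), not the kernel `kmalloc` path:

```python
def alloc_with_policy(try_alloc, max_retries=5, fallback=lambda: "fallback"):
    """Retry a failing allocator a bounded number of times, then apply
    the caller's own failure policy rather than looping forever."""
    for _attempt in range(max_retries):
        ptr = try_alloc()
        if ptr is not None:
            return ptr
        # looping_too_long() is modelled by exhausting the bounded range
    return fallback()


# An allocator that only succeeds on the third attempt:
attempts = {"n": 0}

def flaky_alloc():
    attempts["n"] += 1
    return "ptr" if attempts["n"] >= 3 else None


result = alloc_with_policy(flaky_alloc)
```

The key property is that the failure policy (the `fallback` callable) stays with the caller, which is exactly the division of responsibility both sides of the thread agree on.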
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 9:40 ` Michal Hocko @ 2015-02-19 22:03 ` Dave Chinner 2015-02-20 9:27 ` Michal Hocko 0 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-02-19 22:03 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Thu, Feb 19, 2015 at 10:40:20AM +0100, Michal Hocko wrote: > On Thu 19-02-15 08:31:18, Dave Chinner wrote: > > On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > > > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > This is why GFP_NOFAIL is not a solution to the "never fail" > > allocation problem. The caller doing the "no fail" allocation _must > > be able to set failure policy_. i.e. the choice of aborting and > > shutting down because progress cannot be made, or continuing and > > hoping for forwards progress is owned by the allocating context, not > > the allocator. > > I completely agree that the failure policy is the caller's responsibility > and I would have no objections to something like:
>
> 	do {
> 		ptr = kmalloc(size, GFP_NOFS);
> 		if (ptr)
> 			return ptr;
> 		if (fatal_signal_pending(current))
> 			break;
> 		if (looping_too_long())
> 			break;
> 	} while (1);
>
> 	fallback_solution();
>
> But this is not the case in kmem_alloc which is essentially GFP_NOFAIL > allocation with a warning and congestion_wait. There is no failure > policy defined there. The warning should be part of the allocator and > the NOFAIL policy should be explicit. So why exactly do you oppose > changing kmem_alloc (and others which are doing essentially the same)? I'm opposing changing kmem_alloc() to GFP_NOFAIL precisely because doing so is *broken*, *and* it removes the policy decision from the calling context where it belongs. We are in the process of discussing - at an XFS level - how to handle errors in a configurable manner.
See, for example, this discussion: http://oss.sgi.com/archives/xfs/2015-02/msg00343.html Where we are trying to decide how to expose failure policy to admins to make decisions about error handling behaviour: http://oss.sgi.com/archives/xfs/2015-02/msg00346.html There is little doubt in my mind that this stretches to ENOMEM handling; it is another case where we consider ENOMEM to be a transient error and hence retry forever until it succeeds. But some people are going to want to configure that behaviour, and the API above allows people to configure exactly how many repeated memory allocation failures we'd accept before considering the situation hopeless, failing, and risking a filesystem shutdown.... Converting the code to use GFP_NOFAIL takes us in exactly the opposite direction to our current line of development w.r.t. filesystem error handling. > The reason I care about GFP_NOFAIL is that there are apparently code > paths which do not tell the allocator they are basically GFP_NOFAIL without > any fallback. This leads to two main problems: 1) we do not have a good > overview of how many code paths have such strong requirements and so > cannot estimate e.g. how big memory reserves should be, and Right, when GFP_NOFAIL got deprecated we lost the ability to document such behaviour and find it easily. People just put retry loops in instead of using GFP_NOFAIL. Good luck finding them all :/ > 2) the allocator > cannot help those paths (e.g. by giving them access to reserves to break > out of the livelock). The allocator should not help. Global reserves are unreliable - make the allocation context reserve the amount it needs before it enters the context where it can't back out....
> > IOWs, we have need for forward allocation progress guarantees on > > (potentially) several megabytes of allocations from slab caches, the > > heap and the page allocator, with all allocations in > > unpredictable order, with objects of different life times and life > > cycles, which may, at any time, get stuck behind > > objects locked in other transactions and hence can randomly block > > until some other thread makes forward progress and completes a > > transaction and unlocks the object. > > Thanks for the clarification, I have to think about it some more, > though. My thinking was that mempools could be used as an emergency > pool with pre-allocated memory which would be used in the non-failing > contexts. The other problem with mempools is that they aren't exclusive to the context that needs the reservation. i.e. we can preallocate to the mempool, but then when the preallocating context goes to allocate, that preallocation may have already been drained by other contexts. The memory reservation needs to follow the transaction - we can pass it between tasks, and it needs to persist across sleeping locks, IO, etc, and mempools are simply too constrained to be usable in this environment. Cheers, Dave. -- Dave Chinner david@fromorbit.com
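Dave's non-exclusivity point - a shared emergency pool can be drained by competing contexts before the reserving task ever uses it, while a per-transaction reserve cannot - can be shown with a trivial toy model (all names here are illustrative, not any real mempool or XFS API):

```python
# A shared emergency pool, visible to every allocating context:
shared_pool = ["page"] * 2

def shared_take():
    """Anyone can drain the shared pool; no ownership is enforced."""
    return shared_pool.pop() if shared_pool else None

# taskA "preallocates" into the shared pool, but taskB gets there first
# and drains it, so taskA's reservation has evaporated:
taskB_got = [shared_take(), shared_take()]
taskA_got = shared_take()  # None: the shared reservation gave no guarantee

# A per-transaction reserve is private to its owner, so no other
# context can drain it out from under the transaction:
txn_reserve = ["page"] * 2
taskA_private = txn_reserve.pop()
```

The model is deliberately minimal: the only difference between the two cases is who is allowed to pop from the pool, which is exactly the exclusivity property the reservation scheme provides and a shared mempool does not.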
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 22:03 ` Dave Chinner @ 2015-02-20 9:27 ` Michal Hocko 0 siblings, 0 replies; 83+ messages in thread From: Michal Hocko @ 2015-02-20 9:27 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Fri 20-02-15 09:03:55, Dave Chinner wrote: [...] > Converting the code to use GFP_NOFAIL takes us in exactly the > opposite direction to our current line of development w.r.t. > filesystem error handling. Fair enough. If there are plans to have a failure policy rather than GFP_NOFAIL-like behavior then I have, of course, no objections. Quite the opposite. This is exactly what I would like to see. GFP_NOFAIL should be rarely used, really. The whole point of this discussion, and I am sorry if I didn't make it clear, is that _if_ there is really a GFP_NOFAIL requirement hidden from the allocator then it should be changed to use GFP_NOFAIL so that the allocator knows about this requirement. > > The reason I care about GFP_NOFAIL is that there are apparently code > > paths which do not tell the allocator they are basically GFP_NOFAIL without > > any fallback. This leads to two main problems: 1) we do not have a good > > overview of how many code paths have such strong requirements and so > > cannot estimate e.g. how big memory reserves should be, and > > Right, when GFP_NOFAIL got deprecated we lost the ability to document > such behaviour and find it easily. People just put retry loops in > instead of using GFP_NOFAIL. Good luck finding them all :/ That will be a PITA, all right, but I guess the deprecation was a mistake and we should stop this tendency. > > 2) the allocator > > cannot help those paths (e.g. by giving them access to reserves to break > > out of the livelock). > > The allocator should not help. Global reserves are unreliable - make the > allocation context reserve the amount it needs before it enters the > context where it can't back out....
Sure, pre-allocation is preferable. But once somebody asks for GFP_NOFAIL then it is too late and the allocator only has memory reclaim and potentially reserves. [...] -- Michal Hocko SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2015-02-18 12:16 ` Michal Hocko 2015-02-18 21:31 ` Dave Chinner @ 2015-02-19 11:01 ` Johannes Weiner 2015-02-19 12:29 ` Michal Hocko 1 sibling, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-02-19 11:01 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > [...] > > Also, this reads as an excuse for the OOM killer being broken and > > not fixing it. Keep in mind that we tell the memory alloc/reclaim > > subsystem that *we hold locks* when we call into it. That's what > > GFP_NOFS originally meant, and it's what it still means today in an > > XFS context. > > Sure, and the OOM killer will not be invoked in NOFS context. See > __alloc_pages_may_oom and the __GFP_FS check in there. So I do not see where > the OOM killer is broken. > > The crucial problem we are dealing with is not GFP_NOFAIL triggering the > OOM killer but a lock dependency introduced by the following sequence:
>
> taskA                        taskB      taskC
> lock(A)                                 alloc()
> alloc(gfp | __GFP_NOFAIL)    lock(A)    out_of_memory
> # looping for ever if we                select_bad_process
> # cannot make any progress              victim = taskB
You don't even need taskC here. taskA could invoke the OOM killer with lock(A) held, with taskB getting selected as the victim while trying to acquire lock(A). It'll get the signal and TIF_MEMDIE and then wait for lock(A) while taskA is waiting for it to exit. But it doesn't matter who is doing the OOM killing - if the allocating task with the lock/state is waiting for the OOM victim to free memory, and the victim is waiting for the same lock/state, we have a deadlock.
> There is no way the OOM killer can tell taskB is blocked and that there is > a dependency between A and B (without lockdep). That is why I call NOFAIL > under a lock as dangerous and a bug. You keep ignoring that it's also one of the main usecases of this flag. The caller has state that it can't unwind and thus needs the allocation to succeed. Chances are somebody else can get blocked on that same state. And when that somebody else is the first choice of the OOM killer, we're screwed. This is exactly why I'm proposing that the OOM killer should not wait indefinitely for its first choice to exit, but ultimately move on and try other tasks. There is no other way to resolve this deadlock. Preferably, we'd get rid of all nofail allocations and replace them with preallocated reserves. But this is not going to happen anytime soon, so what other option do we have than resolving this on the OOM killer side?
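Johannes' proposal - don't wait indefinitely on the first victim, ultimately move on to other tasks - can be sketched as victim selection that skips a previously chosen victim which has failed to exit within some timeout. This is a toy userspace model with invented names and fields; the real kernel victim selection works differently:

```python
def select_victim(tasks, stalled_timeout=10):
    """Pick the 'worst' candidate, but skip victims that were already
    chosen and have not exited within the timeout (e.g. stuck on a
    lock held by the allocating task)."""
    candidates = [t for t in tasks
                  if t.get("memdie_for", 0) < stalled_timeout]
    if not candidates:
        return None  # nothing killable is making progress
    return max(candidates, key=lambda t: t["badness"])


tasks = [
    {"name": "taskB", "badness": 90, "memdie_for": 15},  # stuck victim
    {"name": "taskC", "badness": 50, "memdie_for": 0},
]
victim = select_victim(tasks)  # moves on to taskC despite taskB's badness
```

The deadlock in the quoted scenario is broken precisely because taskB's stall no longer pins victim selection forever; whether such a timeout heuristic is acceptable is exactly what the thread is arguing about.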
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 11:01 ` Johannes Weiner @ 2015-02-19 12:29 ` Michal Hocko 2015-02-19 12:58 ` Michal Hocko ` (2 more replies) 0 siblings, 3 replies; 83+ messages in thread From: Michal Hocko @ 2015-02-19 12:29 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Thu 19-02-15 06:01:24, Johannes Weiner wrote: [...] > Preferably, we'd get rid of all nofail allocations and replace them > with preallocated reserves. But this is not going to happen anytime > soon, so what other option do we have than resolving this on the OOM > killer side? As I've mentioned in the other email, we might give GFP_NOFAIL allocations access to memory reserves (by giving them __GFP_HIGH). This is still not a 100% solution because reserves could get depleted but this risk is there even with multiple oom victims. I would still argue that this would be a better approach because selecting more victims might hit a pathological case more easily (other victims might be blocked on the very same lock e.g.). Something like the following:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8d52ab18fe0d..4b5cf28a13f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	int oom = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2628,6 +2629,15 @@ retry:
 	wake_all_kswapds(order, ac);
 
 	/*
+	 * __GFP_NOFAIL allocations cannot fail but yet the current context
+	 * might be blocking resources needed by the OOM victim to terminate.
+	 * Allow the caller to dive into memory reserves to succeed the
+	 * allocation and break out from a potential deadlock.
+ */ + if (oom > 10 && (gfp_mask & __GFP_NOFAIL)) + gfp_mask |= __GFP_HIGH; + + /* * OK, we're below the kswapd watermark and have kicked background * reclaim. Now things get more complex, so set up alloc_flags according * to how we want to proceed. @@ -2759,6 +2769,8 @@ retry: goto got_pg; if (!did_some_progress) goto nopage; + + oom++; } /* Wait for some write requests to complete then retry */ wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 83+ messages in thread
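The retry-counter idea in the patch above can be sketched in userspace (all names here are hypothetical toy stand-ins; the real logic would live in `__alloc_pages_slowpath()`): a __GFP_NOFAIL request only gains access to the reserves after more than ten failed reclaim rounds, so the reserves are not touched on a transient pressure spike.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_GFP_NOFAIL 0x1u

/* Toy watermark check: the request succeeds if any pages are available,
 * optionally counting the reserves. */
static bool sketch_try_alloc(long free_pages, long reserve_pages,
                             bool use_reserves)
{
    return free_pages + (use_reserves ? reserve_pages : 0) > 0;
}

/*
 * Retry loop mirroring the patch above: only after more than 10 failed
 * rounds does a __GFP_NOFAIL allocation gain access to the reserves.
 * Returns the number of failed rounds before success, or -1 if a
 * failable allocation gives up.
 */
static int sketch_alloc_rounds(unsigned gfp, long free_pages,
                               long reserve_pages)
{
    int oom = 0;

    for (;;) {
        bool use_reserves = oom > 10 && (gfp & SKETCH_GFP_NOFAIL);

        if (sketch_try_alloc(free_pages, reserve_pages, use_reserves))
            return oom;
        if (!(gfp & SKETCH_GFP_NOFAIL))
            return -1;  /* ordinary allocations may simply fail */
        oom++;          /* ...but __GFP_NOFAIL keeps retrying */
    }
}
```

With no free pages and a non-empty reserve, a NOFAIL request succeeds on round 11; a plain request fails immediately, which is exactly the asymmetry the patch introduces.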
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 12:29 ` Michal Hocko @ 2015-02-19 12:58 ` Michal Hocko 2015-02-19 15:29 ` Tetsuo Handa 2015-02-19 13:29 ` Tetsuo Handa 2015-02-19 21:43 ` Dave Chinner 2 siblings, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-02-19 12:58 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Thu 19-02-15 13:29:14, Michal Hocko wrote: [...] > Something like the following. __GFP_HIGH doesn't seem to be sufficient so we would need something slightly different, but the idea is still the same: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8d52ab18fe0d..2d224bbdf8e8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, enum migrate_mode migration_mode = MIGRATE_ASYNC; bool deferred_compaction = false; int contended_compaction = COMPACT_CONTENDED_NONE; + int oom = 0; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -2635,6 +2636,15 @@ retry: alloc_flags = gfp_to_alloc_flags(gfp_mask); /* + * __GFP_NOFAIL allocations cannot fail but yet the current context + * might be blocking resources needed by the OOM victim to terminate. + * Allow the caller to dive into memory reserves to succeed the + * allocation and break out from a potential deadlock. + */ + if (oom > 10 && (gfp_mask & __GFP_NOFAIL)) + alloc_flags |= ALLOC_NO_WATERMARKS; + + /* * Find the true preferred zone if the allocation is unconstrained by * cpusets. */ @@ -2759,6 +2769,8 @@ retry: goto got_pg; if (!did_some_progress) goto nopage; + + oom++; } /* Wait for some write requests to complete then retry */ wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 12:58 ` Michal Hocko @ 2015-02-19 15:29 ` Tetsuo Handa 2015-02-19 21:53 ` Tetsuo Handa 2015-02-20 9:13 ` Michal Hocko 0 siblings, 2 replies; 83+ messages in thread From: Tetsuo Handa @ 2015-02-19 15:29 UTC (permalink / raw) To: mhocko, hannes Cc: dchinner, oleg, xfs, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds Michal Hocko wrote: > On Thu 19-02-15 13:29:14, Michal Hocko wrote: > [...] > > Something like the following. > __GFP_HIGH doesn't seem to be sufficient so we would need something > slightly else but the idea is still the same: > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 8d52ab18fe0d..2d224bbdf8e8 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > enum migrate_mode migration_mode = MIGRATE_ASYNC; > bool deferred_compaction = false; > int contended_compaction = COMPACT_CONTENDED_NONE; > + int oom = 0; > > /* > * In the slowpath, we sanity check order to avoid ever trying to > @@ -2635,6 +2636,15 @@ retry: > alloc_flags = gfp_to_alloc_flags(gfp_mask); > > /* > + * __GFP_NOFAIL allocations cannot fail but yet the current context > + * might be blocking resources needed by the OOM victim to terminate. > + * Allow the caller to dive into memory reserves to succeed the > + * allocation and break out from a potential deadlock. > + */ We don't know how many callers will pass __GFP_NOFAIL. But if 1000 threads are doing the same operation which requires __GFP_NOFAIL allocation with a lock held, wouldn't memory reserves deplete? This heuristic can't continue if memory reserves depleted or continuous pages of requested order cannot be found. > + if (oom > 10 && (gfp_mask & __GFP_NOFAIL)) > + alloc_flags |= ALLOC_NO_WATERMARKS; > + > + /* > * Find the true preferred zone if the allocation is unconstrained by > * cpusets. 
> */ > @@ -2759,6 +2769,8 @@ retry: > goto got_pg; > if (!did_some_progress) > goto nopage; > + > + oom++; > } > /* Wait for some write requests to complete then retry */ > wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); > -- > Michal Hocko > SUSE Labs > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
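Tetsuo's depletion concern can be made concrete with a toy model (userspace only; names and numbers are hypothetical): a fixed reserve shared by many concurrent __GFP_NOFAIL holders runs dry as soon as their combined demand exceeds it, regardless of how each individual request behaves.

```c
#include <assert.h>

/*
 * Toy model of the depletion concern above: a fixed reserve is handed
 * out to concurrent __GFP_NOFAIL holders until it runs dry.
 * Returns how many of the holders could be satisfied.
 */
static long sketch_grant_from_reserve(long reserve_pages, long holders,
                                      long pages_each)
{
    long satisfied = 0;

    while (holders-- > 0 && reserve_pages >= pages_each) {
        reserve_pages -= pages_each;
        satisfied++;
    }
    return satisfied;
}
```

A 100-page reserve satisfies only 100 of 1000 holders needing one page each; the remaining 900 are stuck with nowhere left to dip into, which is the scenario described above. Enlarging the reserves only moves the threshold.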
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 15:29 ` Tetsuo Handa @ 2015-02-19 21:53 ` Tetsuo Handa 2015-02-20 9:13 ` Michal Hocko 1 sibling, 0 replies; 83+ messages in thread From: Tetsuo Handa @ 2015-02-19 21:53 UTC (permalink / raw) To: mhocko, hannes Cc: dchinner, oleg, xfs, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 19-02-15 13:29:14, Michal Hocko wrote: > > [...] > > > Something like the following. > > __GFP_HIGH doesn't seem to be sufficient so we would need something > > slightly else but the idea is still the same: > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 8d52ab18fe0d..2d224bbdf8e8 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > enum migrate_mode migration_mode = MIGRATE_ASYNC; > > bool deferred_compaction = false; > > int contended_compaction = COMPACT_CONTENDED_NONE; > > + int oom = 0; > > > > /* > > * In the slowpath, we sanity check order to avoid ever trying to > > @@ -2635,6 +2636,15 @@ retry: > > alloc_flags = gfp_to_alloc_flags(gfp_mask); > > > > /* > > + * __GFP_NOFAIL allocations cannot fail but yet the current context > > + * might be blocking resources needed by the OOM victim to terminate. > > + * Allow the caller to dive into memory reserves to succeed the > > + * allocation and break out from a potential deadlock. > > + */ > > We don't know how many callers will pass __GFP_NOFAIL. But if 1000 > threads are doing the same operation which requires __GFP_NOFAIL > allocation with a lock held, wouldn't memory reserves deplete? > > This heuristic can't continue if memory reserves depleted or > continuous pages of requested order cannot be found. > Even if the system seems to be stalled, deadlocks may not have occurred. If the cause is (e.g.) 
a virtio disk being stuck for some unknown reason rather than a deadlock, we should not start consuming the memory reserves merely because we waited for a while. The memory reserves are something like a balloon. To guarantee forward progress, the balloon must not become empty. Therefore, I think that throttling heuristics on the memory requester side (the deflator of the balloon, i.e. the SIGKILL-receiving processes) should be avoided, and throttling heuristics on the memory releaser side (the inflator of the balloon, i.e. the SIGKILL-sending OOM killer) should be used instead. If the heuristic is used on the deflator side, the memory allocator may deliver a final blow via ALLOC_NO_WATERMARKS. If the heuristic is used on the inflator side, the OOM killer can act as a watchdog when nobody has volunteered memory within a reasonable period. > > + if (oom > 10 && (gfp_mask & __GFP_NOFAIL)) > > + alloc_flags |= ALLOC_NO_WATERMARKS; > > + > > + /* > > * Find the true preferred zone if the allocation is unconstrained by > > * cpusets. > > */ > > @@ -2759,6 +2769,8 @@ retry: > > goto got_pg; > > if (!did_some_progress) > > goto nopage; > > + > > + oom++; > > } > > /* Wait for some write requests to complete then retry */ > > wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); > > -- > > Michal Hocko > > SUSE Labs > > > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
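The inflator-side ("kill more") watchdog described above can be sketched like this (userspace toy; all names and time units are hypothetical): if the current victim has not exited within a timeout, the killer moves on to the next candidate instead of letting allocators dip into the reserves.

```c
#include <assert.h>

/*
 * Watchdog sketch of the inflator-side idea: exits_at[i] is the simulated
 * time at which victim i is able to exit; -1 means the victim is stuck
 * (e.g. blocked on a lock the killer cannot see).
 * Returns the index of the victim that finally freed memory, or -1.
 */
static int sketch_oom_watchdog(const long *exits_at, int ntasks,
                               long now, long timeout)
{
    for (int i = 0; i < ntasks; i++) {
        long deadline = now + timeout;

        if (exits_at[i] >= 0 && exits_at[i] <= deadline)
            return i;    /* victim exited in time; memory was freed */
        now = deadline;  /* timed out: move on to the next victim */
    }
    return -1;           /* nobody volunteered memory: panic territory */
}

/* Two stuck victims followed by one that can exit: the watchdog moves on. */
static int sketch_watchdog_demo(void)
{
    const long exits_at[] = { -1, -1, 5 };

    return sketch_oom_watchdog(exits_at, 3, 0, 10);
}

/* Every victim stuck: the watchdog gives up. */
static int sketch_watchdog_stuck_demo(void)
{
    const long exits_at[] = { -1, -1 };

    return sketch_oom_watchdog(exits_at, 2, 0, 10);
}
```

The point of the sketch is that progress comes from selecting further victims (SysRq-f style), never from handing out reserve memory to the blocked requesters.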
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 15:29 ` Tetsuo Handa 2015-02-19 21:53 ` Tetsuo Handa @ 2015-02-20 9:13 ` Michal Hocko 2015-02-20 13:37 ` Stefan Ring 1 sibling, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-02-20 9:13 UTC (permalink / raw) To: Tetsuo Handa Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds On Fri 20-02-15 00:29:29, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 19-02-15 13:29:14, Michal Hocko wrote: > > [...] > > > Something like the following. > > __GFP_HIGH doesn't seem to be sufficient so we would need something > > slightly else but the idea is still the same: > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 8d52ab18fe0d..2d224bbdf8e8 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2599,6 +2599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > enum migrate_mode migration_mode = MIGRATE_ASYNC; > > bool deferred_compaction = false; > > int contended_compaction = COMPACT_CONTENDED_NONE; > > + int oom = 0; > > > > /* > > * In the slowpath, we sanity check order to avoid ever trying to > > @@ -2635,6 +2636,15 @@ retry: > > alloc_flags = gfp_to_alloc_flags(gfp_mask); > > > > /* > > + * __GFP_NOFAIL allocations cannot fail but yet the current context > > + * might be blocking resources needed by the OOM victim to terminate. > > + * Allow the caller to dive into memory reserves to succeed the > > + * allocation and break out from a potential deadlock. > > + */ > > We don't know how many callers will pass __GFP_NOFAIL. But if 1000 > threads are doing the same operation which requires __GFP_NOFAIL > allocation with a lock held, wouldn't memory reserves deplete? We shouldn't have an unbounded number of GFP_NOFAIL allocations at the same time. This would be even more broken. If a load is known to use such allocations excessively then the administrator can enlarge the memory reserves. 
> This heuristic can't continue if memory reserves depleted or > continuous pages of requested order cannot be found. Once memory reserves are depleted we are screwed anyway and we might panic. -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-20 9:13 ` Michal Hocko @ 2015-02-20 13:37 ` Stefan Ring 0 siblings, 0 replies; 83+ messages in thread From: Stefan Ring @ 2015-02-20 13:37 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, Linux fs XFS, linux-mm, mgorman, hannes, linux-fsdevel, rientjes, akpm, fernando_b1, torvalds >> We don't know how many callers will pass __GFP_NOFAIL. But if 1000 >> threads are doing the same operation which requires __GFP_NOFAIL >> allocation with a lock held, wouldn't memory reserves deplete? > > We shouldn't have an unbounded number of GFP_NOFAIL allocations at the > same time. This would be even more broken. If a load is known to use > such allocations excessively then the administrator can enlarge the > memory reserves. > >> This heuristic can't continue if memory reserves depleted or >> continuous pages of requested order cannot be found. > > Once memory reserves are depleted we are screwed anyway and we might > panic. This discussion reminds me of a situation I've seen somewhat regularly, which I have described here: http://oss.sgi.com/pipermail/xfs/2014-April/035793.html I've actually seen it more often on another box with OpenVZ and VirtualBox installed, where it would almost always happen during startup of a VirtualBox guest machine. This other machine is also running XFS. I blamed it on OpenVZ or VirtualBox originally, but having seen the same thing happen on the other machine with neither of them, the next candidate for taking blame is XFS. Is this behavior something that can be attributed to these memory allocation retry loops? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 12:29 ` Michal Hocko 2015-02-19 12:58 ` Michal Hocko @ 2015-02-19 13:29 ` Tetsuo Handa 2015-02-20 9:10 ` Michal Hocko 2015-02-19 21:43 ` Dave Chinner 2 siblings, 1 reply; 83+ messages in thread From: Tetsuo Handa @ 2015-02-19 13:29 UTC (permalink / raw) To: mhocko, hannes Cc: dchinner, oleg, xfs, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds Michal Hocko wrote: > On Thu 19-02-15 06:01:24, Johannes Weiner wrote: > [...] > > Preferrably, we'd get rid of all nofail allocations and replace them > > with preallocated reserves. But this is not going to happen anytime > > soon, so what other option do we have than resolving this on the OOM > > killer side? > > As I've mentioned in other email, we might give GFP_NOFAIL allocator > access to memory reserves (by giving it __GFP_HIGH). This is still not a > 100% solution because reserves could get depleted but this risk is there > even with multiple oom victims. I would still argue that this would be a > better approach because selecting more victims might hit pathological > case more easily (other victims might be blocked on the very same lock > e.g.). > Does "multiple OOM victims" mean "select next if first does not die"? Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2 does not deplete memory reserves. ;-) If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO, those who do not want to fail (e.g. journal transaction) will start passing __GFP_NOFAIL? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 13:29 ` Tetsuo Handa @ 2015-02-20 9:10 ` Michal Hocko 2015-02-20 12:20 ` Tetsuo Handa 0 siblings, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-02-20 9:10 UTC (permalink / raw) To: Tetsuo Handa Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds On Thu 19-02-15 22:29:37, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote: > > [...] > > > Preferrably, we'd get rid of all nofail allocations and replace them > > > with preallocated reserves. But this is not going to happen anytime > > > soon, so what other option do we have than resolving this on the OOM > > > killer side? > > > > As I've mentioned in other email, we might give GFP_NOFAIL allocator > > access to memory reserves (by giving it __GFP_HIGH). This is still not a > > 100% solution because reserves could get depleted but this risk is there > > even with multiple oom victims. I would still argue that this would be a > > better approach because selecting more victims might hit pathological > > case more easily (other victims might be blocked on the very same lock > > e.g.). > > > Does "multiple OOM victims" mean "select next if first does not die"? > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2 > does not deplete memory reserves. ;-) It doesn't because --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask) alloc_flags |= ALLOC_NO_WATERMARKS; else if (in_serving_softirq() && (current->flags & PF_MEMALLOC)) alloc_flags |= ALLOC_NO_WATERMARKS; - else if (!in_interrupt() && - ((current->flags & PF_MEMALLOC) || - unlikely(test_thread_flag(TIF_MEMDIE)))) + else if (!in_interrupt() && (current->flags & PF_MEMALLOC)) alloc_flags |= ALLOC_NO_WATERMARKS; you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion and break out from the allocator. 
An exiting task might need memory to do so, and you basically make all those allocations fail. How do you know this is not going to blow up? > If we change to permit invocation of the OOM killer for GFP_NOFS / GFP_NOIO, > those who do not want to fail (e.g. journal transaction) will start passing > __GFP_NOFAIL? -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
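For reference, the heuristic the quoted hunk removes can be sketched as follows (userspace toy; the SKETCH_* names are hypothetical stand-ins): before the change, an OOM victim (TIF_MEMDIE) is treated like a PF_MEMALLOC context and allowed to ignore the watermarks so that it can allocate whatever it needs to exit quickly.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PF_MEMALLOC         0x1u
#define SKETCH_ALLOC_NO_WATERMARKS 0x2u

/*
 * Simplified gfp_to_alloc_flags() decision before the quoted change:
 * outside interrupt context, both PF_MEMALLOC contexts and TIF_MEMDIE
 * (OOM victim) tasks may ignore the watermarks.
 */
static unsigned sketch_alloc_flags(bool in_interrupt, unsigned task_flags,
                                   bool tif_memdie)
{
    unsigned flags = 0;

    if (!in_interrupt &&
        ((task_flags & SKETCH_PF_MEMALLOC) || tif_memdie))
        flags |= SKETCH_ALLOC_NO_WATERMARKS;
    return flags;
}
```

The quoted patch deletes the `tif_memdie` half of the condition, which is exactly why Michal asks how the victim is supposed to get the memory it needs to exit.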
* Re: How to handle TIF_MEMDIE stalls? 2015-02-20 9:10 ` Michal Hocko @ 2015-02-20 12:20 ` Tetsuo Handa 2015-02-20 12:38 ` Michal Hocko 0 siblings, 1 reply; 83+ messages in thread From: Tetsuo Handa @ 2015-02-20 12:20 UTC (permalink / raw) To: mhocko Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds Michal Hocko wrote: > On Thu 19-02-15 22:29:37, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote: > > > [...] > > > > Preferrably, we'd get rid of all nofail allocations and replace them > > > > with preallocated reserves. But this is not going to happen anytime > > > > soon, so what other option do we have than resolving this on the OOM > > > > killer side? > > > > > > As I've mentioned in other email, we might give GFP_NOFAIL allocator > > > access to memory reserves (by giving it __GFP_HIGH). This is still not a > > > 100% solution because reserves could get depleted but this risk is there > > > even with multiple oom victims. I would still argue that this would be a > > > better approach because selecting more victims might hit pathological > > > case more easily (other victims might be blocked on the very same lock > > > e.g.). > > > > > Does "multiple OOM victims" mean "select next if first does not die"? > > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2 > > does not deplete memory reserves. 
;-) > > It doesn't because > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask) > alloc_flags |= ALLOC_NO_WATERMARKS; > else if (in_serving_softirq() && (current->flags & PF_MEMALLOC)) > alloc_flags |= ALLOC_NO_WATERMARKS; > - else if (!in_interrupt() && > - ((current->flags & PF_MEMALLOC) || > - unlikely(test_thread_flag(TIF_MEMDIE)))) > + else if (!in_interrupt() && (current->flags & PF_MEMALLOC)) > alloc_flags |= ALLOC_NO_WATERMARKS; > > you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion > and break out from the allocator. Exiting task might need a memory to do > so and you make all those allocations fail basically. How do you know > this is not going to blow up? > Well, should we treat exiting tasks as implying __GFP_NOFAIL for cleanup? We cannot determine the correct task to kill and grant access to memory reserves based on lock dependencies. Therefore, this patch uniformly allows no task to access the memory reserves. An exiting task might need some memory to exit, and not allowing access to the memory reserves can delay that task's exit. But that task will eventually get memory released by other tasks killed via the timeout-based kill-more mechanism. If there are no more killable tasks, or the panic timeout has expired, the result is the same as depleting the memory reserves. I think this situation (automatically making forward progress, as if the administrator were periodically doing SysRq-f until the OOM condition is resolved, or doing SysRq-c if there are no more killable tasks or we have stalled too long) is better than the current situation (making no forward progress because the exiting task cannot exit due to a lock dependency, caused by failing to determine the correct task to kill and grant access to memory reserves).
* Re: How to handle TIF_MEMDIE stalls? 2015-02-20 12:20 ` Tetsuo Handa @ 2015-02-20 12:38 ` Michal Hocko 0 siblings, 0 replies; 83+ messages in thread From: Michal Hocko @ 2015-02-20 12:38 UTC (permalink / raw) To: Tetsuo Handa Cc: dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes, linux-fsdevel, akpm, fernando_b1, torvalds On Fri 20-02-15 21:20:58, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 19-02-15 22:29:37, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote: > > > > [...] > > > > > Preferrably, we'd get rid of all nofail allocations and replace them > > > > > with preallocated reserves. But this is not going to happen anytime > > > > > soon, so what other option do we have than resolving this on the OOM > > > > > killer side? > > > > > > > > As I've mentioned in other email, we might give GFP_NOFAIL allocator > > > > access to memory reserves (by giving it __GFP_HIGH). This is still not a > > > > 100% solution because reserves could get depleted but this risk is there > > > > even with multiple oom victims. I would still argue that this would be a > > > > better approach because selecting more victims might hit pathological > > > > case more easily (other victims might be blocked on the very same lock > > > > e.g.). > > > > > > > Does "multiple OOM victims" mean "select next if first does not die"? > > > Then, I think my timeout patch http://marc.info/?l=linux-mm&m=142002495532320&w=2 > > > does not deplete memory reserves. 
;-) > > > > It doesn't because > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2603,9 +2603,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask) > > alloc_flags |= ALLOC_NO_WATERMARKS; > > else if (in_serving_softirq() && (current->flags & PF_MEMALLOC)) > > alloc_flags |= ALLOC_NO_WATERMARKS; > > - else if (!in_interrupt() && > > - ((current->flags & PF_MEMALLOC) || > > - unlikely(test_thread_flag(TIF_MEMDIE)))) > > + else if (!in_interrupt() && (current->flags & PF_MEMALLOC)) > > alloc_flags |= ALLOC_NO_WATERMARKS; > > > > you disabled the TIF_MEMDIE heuristic and use it only for OOM exclusion > > and break out from the allocator. Exiting task might need a memory to do > > so and you make all those allocations fail basically. How do you know > > this is not going to blow up? > > > > Well, treat exiting tasks to imply __GFP_NOFAIL for clean up? > > We cannot determine correct task to kill + allow access to memory reserves > based on lock dependency. Therefore, this patch evenly allow no tasks to > access to memory reserves. > > Exiting task might need some memory to exit, and not allowing access to > memory reserves can retard exit of that task. But that task will eventually > get memory released by other tasks killed by timeout-based kill-more > mechanism. If no more killable tasks or expired panic-timeout, it is > the same result with depletion of memory reserves. > > I think that this situation (automatically making forward progress as if > the administrator is periodically doing SysRq-f until the OOM condition > is solved, or is doing SysRq-c if no more killable tasks or stalled too > long) is better than current situation (not making forward progress since > the exiting task cannot exit due to lock dependency, caused by failing to > determine correct task to kill + allow access to memory reserves). If you really believe this is an improvement then send a proper patch with justification. But I am _really_ skeptical about such a change to be honest. 
-- Michal Hocko SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 12:29 ` Michal Hocko 2015-02-19 12:58 ` Michal Hocko 2015-02-19 13:29 ` Tetsuo Handa @ 2015-02-19 21:43 ` Dave Chinner 2015-02-20 12:48 ` Michal Hocko 2 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-02-19 21:43 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote: > On Thu 19-02-15 06:01:24, Johannes Weiner wrote: > [...] > > Preferrably, we'd get rid of all nofail allocations and replace them > > with preallocated reserves. But this is not going to happen anytime > > soon, so what other option do we have than resolving this on the OOM > > killer side? > > As I've mentioned in other email, we might give GFP_NOFAIL allocator > access to memory reserves (by giving it __GFP_HIGH). Won't work when you have thousands of concurrent transactions running in XFS and they are all doing GFP_NOFAIL allocations. That's why I suggested the per-transaction reserve pool - we can use that to throttle the number of concurrent contexts demanding memory for forwards progress, just the same way we throttle the number of concurrent processes based on maximum log space requirements of the transactions and the amount of unreserved log space available. No log space, transaction reservations wait on an ordered queue for space to become available. No memory available, transaction reservations wait on an ordered queue for memory to become available. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 21:43 ` Dave Chinner @ 2015-02-20 12:48 ` Michal Hocko 2015-02-20 23:09 ` Dave Chinner 0 siblings, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-02-20 12:48 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Fri 20-02-15 08:43:56, Dave Chinner wrote: > On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote: > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote: > > [...] > > > Preferrably, we'd get rid of all nofail allocations and replace them > > > with preallocated reserves. But this is not going to happen anytime > > > soon, so what other option do we have than resolving this on the OOM > > > killer side? > > > > As I've mentioned in other email, we might give GFP_NOFAIL allocator > > access to memory reserves (by giving it __GFP_HIGH). > > Won't work when you have thousands of concurrent transactions > running in XFS and they are all doing GFP_NOFAIL allocations. Is there any bound on how many transactions can run at the same time? > That's why I suggested the per-transaction reserve pool - we can use > that I am still not sure what you mean by reserve pool (API wise). How does it differ from pre-allocating memory before the "may not fail context"? Could you elaborate on it, please? > to throttle the number of concurent contexts demanding memory for > forwards progress, just the same was we throttle the number of > concurrent processes based on maximum log space requirements of the > transactions and the amount of unreserved log space available. > > No log space, transaction reservations waits on an ordered queue for > space to become available. No memory available, transaction > reservation waits on an ordered queue for memory to become > available. > > Cheers, > > Dave. 
> -- > Dave Chinner > david@fromorbit.com -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-20 12:48 ` Michal Hocko @ 2015-02-20 23:09 ` Dave Chinner 0 siblings, 0 replies; 83+ messages in thread From: Dave Chinner @ 2015-02-20 23:09 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Fri, Feb 20, 2015 at 01:48:49PM +0100, Michal Hocko wrote: > On Fri 20-02-15 08:43:56, Dave Chinner wrote: > > On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote: > > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote: > > > [...] > > > > Preferrably, we'd get rid of all nofail allocations and replace them > > > > with preallocated reserves. But this is not going to happen anytime > > > > soon, so what other option do we have than resolving this on the OOM > > > > killer side? > > > > > > As I've mentioned in other email, we might give GFP_NOFAIL allocator > > > access to memory reserves (by giving it __GFP_HIGH). > > > > Won't work when you have thousands of concurrent transactions > > running in XFS and they are all doing GFP_NOFAIL allocations. > > Is there any bound on how many transactions can run at the same time? Yes: as many reservations as can fit in the available log space. The log can be sized up to 2GB, and filesystems larger than 4TB default to a 2GB log. Log space reservations depend on the operation being done - an inode timestamp update requires about 5kB of reservation, and a rename requires about 200kB. Hence we can easily have thousands of active transactions, even in the worst-case log space reservation cases. You're saying it would be insane to have hundreds or thousands of threads doing GFP_NOFAIL allocations concurrently. Reality check: XFS has been operating successfully under such workload conditions in production systems for many years. > > That's why I suggested the per-transaction reserve pool - we can use > > that > > I am still not sure what you mean by reserve pool (API wise). 
How > does it differ from pre-allocating memory before the "may not fail > context"? Could you elaborate on it, please? It is preallocating memory: into a reserve pool associated with the transaction, done as part of the transaction reservation mechanism we already have in XFS. The allocator then uses that reserve pool to allocate from if an allocation would otherwise fail. There is no way we can preallocate specific objects before the transaction - that's just insane, especially given the unbounded, demand-paged object requirement. Hence the need for a "preallocated reserve pool" that the allocator can dip into, covering the memory we need to *allocate and can't reclaim* during the course of the transaction. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
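Dave's per-transaction reserve pool can be sketched as a small userspace model (the `sketch_tx_*` API and all numbers are hypothetical; the real mechanism would piggyback on the XFS transaction reservation code): memory is set aside when the transaction starts, allocations inside the transaction fall back to that pool instead of looping or invoking the OOM killer, and the unused remainder is returned at commit.

```c
#include <assert.h>
#include <stdbool.h>

static long sketch_global_free = 64;   /* toy "free memory", in pages */

struct sketch_tx_reserve {
    long pool;   /* pages set aside for this transaction */
};

/* Like a log space reservation: granted up front, or the caller would
 * wait on an ordered queue (here: simply fail). */
static bool sketch_tx_reserve_begin(struct sketch_tx_reserve *r, long pages)
{
    if (sketch_global_free < pages)
        return false;
    sketch_global_free -= pages;
    r->pool = pages;
    return true;
}

/* Allocation inside the transaction: try global memory first, then dip
 * into the transaction's own reserve instead of looping or OOM killing. */
static bool sketch_tx_alloc(struct sketch_tx_reserve *r, long pages)
{
    if (sketch_global_free >= pages) {
        sketch_global_free -= pages;
        return true;
    }
    if (r->pool >= pages) {
        r->pool -= pages;
        return true;
    }
    return false;   /* reservation was undersized: a bug in the caller */
}

/* Commit: return whatever was not used. */
static void sketch_tx_reserve_end(struct sketch_tx_reserve *r)
{
    sketch_global_free += r->pool;
    r->pool = 0;
}

/* Reserve 16 pages, then hit global memory exhaustion mid-transaction:
 * the allocation is served from the reserve and the rest is returned. */
static bool sketch_tx_demo(void)
{
    struct sketch_tx_reserve r;
    bool ok;

    if (!sketch_tx_reserve_begin(&r, 16))
        return false;
    sketch_global_free = 0;        /* simulate sudden memory pressure */
    ok = sketch_tx_alloc(&r, 8);   /* served from the reserve */
    sketch_tx_reserve_end(&r);
    return ok && sketch_global_free == 8;
}
```

The throttling Dave describes falls out naturally: a new transaction whose reserve cannot be granted waits (queues) before it takes any locks, rather than deadlocking inside them.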
* Re: How to handle TIF_MEMDIE stalls? 2015-02-17 22:54 ` How to handle TIF_MEMDIE stalls? Dave Chinner 2015-02-17 23:32 ` Dave Chinner 2015-02-18 8:25 ` Michal Hocko @ 2015-02-19 10:24 ` Johannes Weiner 2015-02-19 22:52 ` Dave Chinner 2 siblings, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-02-19 10:24 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote: > [ cc xfs list - experienced kernel devs should not have to be > reminded to do this ] > > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > > On Tue, Feb 17, 2015 at 09:23:26PM +0900, Tetsuo Handa wrote: > > > Tetsuo Handa wrote: > > > > Johannes Weiner wrote: > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > index 8e20f9c2fa5a..f77c58ebbcfa 100644 > > > > > --- a/mm/page_alloc.c > > > > > +++ b/mm/page_alloc.c > > > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > > > if (high_zoneidx < ZONE_NORMAL) > > > > > goto out; > > > > > /* The OOM killer does not compensate for light reclaim */ > > > > > - if (!(gfp_mask & __GFP_FS)) > > > > > + if (!(gfp_mask & __GFP_FS)) { > > > > > + /* > > > > > + * XXX: Page reclaim didn't yield anything, > > > > > + * and the OOM killer can't be invoked, but > > > > > + * keep looping as per should_alloc_retry(). > > > > > + */ > > > > > + *did_some_progress = 1; > > > > > goto out; > > > > > + } > > > > > > > > Why do you omit out_of_memory() call for GFP_NOIO / GFP_NOFS allocations? > > > > > > I can see "possible memory allocation deadlock in %s (mode:0x%x)" warnings > > > at kmem_alloc() in fs/xfs/kmem.c . I think commit 9879de7373fcfb46 "mm: > > > page_alloc: embed OOM killing naturally into allocation slowpath" introduced > > > a regression and below one is the fix. 
> > > > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -2381,9 +2381,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > /* The OOM killer does not needlessly kill tasks for lowmem */ > > > if (high_zoneidx < ZONE_NORMAL) > > > goto out; > > > - /* The OOM killer does not compensate for light reclaim */ > > > - if (!(gfp_mask & __GFP_FS)) > > > - goto out; > > > /* > > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > > > Again, we don't want to OOM kill on behalf of allocations that can't > > initiate IO, or even actively prevent others from doing it. Not per > > default anyway, because most callers can deal with the failure without > > having to resort to killing tasks, and NOFS reclaim *can* easily fail. > > It's the exceptions that should be annotated instead: > > > > void * > > kmem_alloc(size_t size, xfs_km_flags_t flags) > > { > > int retries = 0; > > gfp_t lflags = kmem_flags_convert(flags); > > void *ptr; > > > > do { > > ptr = kmalloc(size, lflags); > > if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > return ptr; > > if (!(++retries % 100)) > > xfs_err(NULL, > > "possible memory allocation deadlock in %s (mode:0x%x)", > > __func__, lflags); > > congestion_wait(BLK_RW_ASYNC, HZ/50); > > } while (1); > > } > > > > This should use __GFP_NOFAIL, which is not only designed to annotate > > broken code like this, but also recognizes that endless looping on a > > GFP_NOFS allocation needs the OOM killer after all to make progress. 
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a7a3a63bb360..17ced1805d3a 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -45,20 +45,12 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  void *
> >  kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> > -	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -	"possible memory allocation deadlock in %s (mode:0x%x)",
> > -				__func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & (KM_MAYFAIL | KM_NOSLEEP)))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
>
> Hmmm - the only reason there is a focus on this loop is that it
> emits warnings about allocations failing. It's obvious that the
> problem being dealt with here is a fundamental design issue w.r.t.
> to locking and the OOM killer, but the proposed special casing
> hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code
> in XFS started emitting warnings about allocations failing more
> often.
>
> So the answer is to remove the warning? That's like killing the
> canary to stop the methane leak in the coal mine. No canary? No
> problems!

That's not what happened. The patch that affected behavior here
transformed code that was an incoherent collection of conditions into
something that has an actual model. That model is that we don't loop
in the allocator if there are no means of making forward progress. In
this case, it was GFP_NOFS triggering an early exit from the allocator
because it's not allowed to invoke the OOM killer per default, and
there is little point in looping and waiting for things to get better
on their own.
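The model described above - keep looping only while some mechanism can still make forward progress - can be condensed into a small decision function. The following is a hedged userspace sketch with made-up flag values and a hypothetical name; it is not the kernel's actual gfp machinery, and the `compat_override` parameter stands in for the `*did_some_progress = 1` hunk discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * After reclaim has failed:
 *   - a must-not-fail (NOFAIL-style) request keeps looping regardless;
 *   - a FS-capable request may invoke the OOM killer, so looping can
 *     still lead to progress;
 *   - a !FS request has no remaining means of progress, and only keeps
 *     looping if the compatibility override is in effect.
 * Flag values below are illustrative only.
 */
#define SKETCH_GFP_FS     0x1u
#define SKETCH_GFP_NOFAIL 0x2u

static bool may_keep_looping(unsigned int flags, bool compat_override)
{
	if (flags & SKETCH_GFP_NOFAIL)
		return true;		/* caller cannot handle failure */
	if (flags & SKETCH_GFP_FS)
		return true;		/* OOM killer can free memory */
	return compat_override;		/* pretend progress, as pre-3.19 */
}
```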
So these deadlock warnings happen, ironically, because the page
allocator now bails out of a locked-up state in which it's not making
forward progress. They don't strike me as a very useful canary in this
case.

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure. That's a big red flag to me that all
> this hacking around the edges is not solving the underlying problem,
> but instead is breaking things that did once work.
>
> And, well, then there's this (gfp.h):
>
> * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> * cannot handle allocation failures. This modifier is deprecated and no new
> * users should be added.
>
> So, is this another policy revelation from the mm developers about
> the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> Or just another symptom of frantic thrashing because nobody actually
> understands the problem or those that do are unwilling to throw out
> the broken crap and redesign it?

Well, understand our dilemma here. __GFP_NOFAIL is a liability
because it can trap tasks with unknown state and locks in a
potentially never-ending loop, and we don't want people to start using
it as a convenient solution to get out of having a fallback strategy.

However, if your entire architecture around a particular allocation is
that failure is not an option at this point, and you can't reasonably
preallocate - although that would always be preferable - then please
do not open code an endless loop around the call to the allocator but
use __GFP_NOFAIL instead so that these callsites are annotated and can
be reviewed.
By giving the allocator this information, it can then also adjust its
behavior, as is the case right here: we don't usually want to OOM kill
for regular GFP_NOFS allocations because their reclaim powers are weak
and we don't want to kill tasks prematurely. But if your NOFS
allocation cannot fail under any circumstances, then the OOM killer
should very much be employed to make any kind of forward progress at
all for this allocation. It's just that the allocator needs to be made
aware of this requirement.

So yes, we are wary of __GFP_NOFAIL allocations, but this is an
instance where it's the right way to communicate with the allocator:
it was introduced to replace such open-coded endless loops and to
place the liability of making progress on the allocator, not the
caller.

And please understand that this callsite blowing up is a chance to
better the code and behavior here. Where previously it would just
endlessly loop in the allocator without any means to make progress,
converting it to a __GFP_NOFAIL allocation tells the allocator that
it's fine to use the OOM killer in such an instance, improving the
chances that this caller will actually make headway under heavy load.
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 10:24 ` Johannes Weiner @ 2015-02-19 22:52 ` Dave Chinner 2015-02-20 10:36 ` Tetsuo Handa 2015-02-21 23:52 ` Johannes Weiner 0 siblings, 2 replies; 83+ messages in thread From: Dave Chinner @ 2015-02-19 22:52 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Thu, Feb 19, 2015 at 05:24:31AM -0500, Johannes Weiner wrote: > On Wed, Feb 18, 2015 at 09:54:30AM +1100, Dave Chinner wrote: > > [ cc xfs list - experienced kernel devs should not have to be > > reminded to do this ] > > > > On Tue, Feb 17, 2015 at 07:53:15AM -0500, Johannes Weiner wrote: > > > - do { > > > - ptr = kmalloc(size, lflags); > > > - if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP))) > > > - return ptr; > > > - if (!(++retries % 100)) > > > - xfs_err(NULL, > > > - "possible memory allocation deadlock in %s (mode:0x%x)", > > > - __func__, lflags); > > > - congestion_wait(BLK_RW_ASYNC, HZ/50); > > > - } while (1); > > > + if (!(flags & (KM_MAYFAIL | KM_NOSLEEP))) > > > + lflags |= __GFP_NOFAIL; > > > + > > > + return kmalloc(size, lflags); > > > } > > > > Hmmm - the only reason there is a focus on this loop is that it > > emits warnings about allocations failing. It's obvious that the > > problem being dealt with here is a fundamental design issue w.r.t. > > to locking and the OOM killer, but the proposed special casing > > hack^H^H^H^Hband aid^W^Wsolution is not "working" because some code > > in XFS started emitting warnings about allocations failing more > > often. > > > > So the answer is to remove the warning? That's like killing the > > canary to stop the methane leak in the coal mine. No canary? No > > problems! > > That's not what happened. The patch that affected behavior here > transformed code that an incoherent collection of conditions to > something that has an actual model. Which is entirely undocumented. 
If you have a model, the first thing to do is document it and
communicate that model to everyone who needs to know about that new
model. I have no idea what that model is. Keeping it in your head and
changing code that other people maintain without giving them any means
of understanding WTF you are doing is a really bad engineering
practice. And yes, I have had a bit to say about this in public
recently. Go watch my recent LCA talk, for example....

And, FWIW, email discussions on a list are no substitute for a
properly documented design that people can take their time to
understand and digest.

> That model is that we don't loop
> in the allocator if there are no means of making forward progress. In
> this case, it was GFP_NOFS triggering an early exit from the allocator
> because it's not allowed to invoke the OOM killer per default, and
> there is little point in looping and waiting for things to get better
> on their own.

So you keep saying....

> So these deadlock warnings happen, ironically, by the page allocator
> now bailing out of a locked-up state in which it's not making forward
> progress. They don't strike me as a very useful canary in this case.

... yet we *rarely* see the canary warnings we emit when we do too
many allocation retries; the code has been that way for 13-odd years.
Hence, despite your protestations that your way is *better*, we have
code that is tried, tested and proven in rugged production
environments. That's far more convincing evidence that the *code
should not change* than your assertions that it is broken and needs to
be fixed.
> > That's a big red flag to me that all
> > this hacking around the edges is not solving the underlying problem,
> > but instead is breaking things that did once work.
> >
> > And, well, then there's this (gfp.h):
> >
> > * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> > * cannot handle allocation failures. This modifier is deprecated and no new
> > * users should be added.
> >
> > So, is this another policy revelation from the mm developers about
> > the kmalloc API? i.e. that __GFP_NOFAIL is no longer deprecated?
> > Or just another symptom of frantic thrashing because nobody actually
> > understands the problem or those that do are unwilling to throw out
> > the broken crap and redesign it?
>
> Well, understand our dilemma here. __GFP_NOFAIL is a liability
> because it can trap tasks with unknown state and locks in a
> potentially never-ending loop, and we don't want people to start using
> it as a convenient solution to get out of having a fallback strategy.
>
> However, if your entire architecture around a particular allocation is
> that failure is not an option at this point, and you can't reasonably
> preallocate - although that would always be preferable - then please
> do not open code an endless loop around the call to the allocator but
> use __GFP_NOFAIL instead so that these callsites are annotated and can
> be reviewed.

I will actively work around anything that causes filesystem memory
pressure to increase the chance of oom killer invocations. The OOM
killer is not a solution - it is, by definition, a loose cannon and so
we should be reducing dependencies on it.

I really don't care about the OOM Killer corner cases - it's
completely the wrong line of development to be spending time on and
you aren't going to convince me otherwise. The OOM killer is a crutch
used to justify having a memory allocation subsystem that can't
provide forward progress guarantee mechanisms to callers that need it.
I've proposed a method of providing this forward progress guarantee
for subsystems of arbitrary complexity, and this removes the
dependency on the OOM killer for forward allocation progress in such
contexts (e.g. filesystems). We should be discussing how to implement
that, not what bandaids we need to apply to the OOM killer. I want to
fix the underlying problems, not push them under the OOM-killer bus...

> And please understand that this callsite blowing up is a chance to
> better the code and behavior here. Where previously it would just
> endlessly loop in the allocator without any means to make progress,

Again, this statement ignores the fact we have *no credible evidence*
that this is actually a problem in production environments.

And, besides, even if you do force through changing the XFS code to
GFP_NOFAIL, it'll get changed back to a retry loop in the near future
when we add admin-configurable error handling behaviour to XFS, as I
pointed Michal to....

(http://oss.sgi.com/archives/xfs/2015-02/msg00346.html)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-19 22:52           ` Dave Chinner
@ 2015-02-20 10:36             ` Tetsuo Handa
  2015-02-20 23:15               ` Dave Chinner
  2015-02-21 23:52             ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-20 10:36 UTC (permalink / raw)
  To: david, hannes
  Cc: dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm,
	torvalds

Dave Chinner wrote:
> I really don't care about the OOM Killer corner cases - it's
> completely the wrong line of development to be spending time on
> and you aren't going to convince me otherwise. The OOM killer is a
> crutch used to justify having a memory allocation subsystem that
> can't provide forward progress guarantee mechanisms to callers that
> need it.

I really care about the OOM Killer corner cases, for I'm

  (1) seeing trouble cases which occurred in enterprise systems
      under OOM conditions

  (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
      an unprivileged user with a login shell can trivially trigger
      since Linux 2.0) to OOM "Genocide" attacks in order to allow
      OOM-unkillable daemons to restart OOM-killed processes

  (3) waiting for a bandaid for (2) in order to propose changes for
      mitigating OOM "Genocide" attacks (as bad guys will find how to
      trigger OOM "Deadlock or Genocide" attacks from changes for
      mitigating OOM "Genocide" attacks)

I started posting to linux-mm ML in order to make forward progress
about (1) and (2). I don't want the memory allocation subsystem to
lock up an entire system by indefinitely disabling the memory
releasing mechanism provided by the OOM killer.

> I've proposed a method of providing this forward progress guarantee
> for subsystems of arbitrary complexity, and this removes the
> dependency on the OOM killer for forward allocation progress in such
> contexts (e.g. filesystems). We should be discussing how to
> implement that, not what bandaids we need to apply to the OOM
> killer.
> I want to fix the underlying problems, not push them under
> the OOM-killer bus...

I'm fine with that direction for new kernels provided that a simple
bandaid which can be backported to distributor kernels for making OOM
"Deadlock" attacks impossible is implemented. Therefore, I'm
discussing what bandaids we need to apply to the OOM killer.
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 10:36             ` Tetsuo Handa
@ 2015-02-20 23:15               ` Dave Chinner
  2015-02-21  3:20                 ` Theodore Ts'o
  2015-02-21 11:12                 ` Tetsuo Handa
  0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2015-02-20 23:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman,
	rientjes, akpm, torvalds

On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > I really don't care about the OOM Killer corner cases - it's
> > completely the wrong line of development to be spending time on
> > and you aren't going to convince me otherwise. The OOM killer is a
> > crutch used to justify having a memory allocation subsystem that
> > can't provide forward progress guarantee mechanisms to callers that
> > need it.
>
> I really care about the OOM Killer corner cases, for I'm
>
>   (1) seeing trouble cases which occurred in enterprise systems
>       under OOM conditions

You reach OOM, then your SLAs are dead and buried. Reboot the box -
it's a much more reliable way of returning to a working system than
playing Russian Roulette with the OOM killer.

>   (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which
>       an unprivileged user with a login shell can trivially trigger
>       since Linux 2.0) to OOM "Genocide" attacks in order to allow
>       OOM-unkillable daemons to restart OOM-killed processes
>
>   (3) waiting for a bandaid for (2) in order to propose changes for
>       mitigating OOM "Genocide" attacks (as bad guys will find how to
>       trigger OOM "Deadlock or Genocide" attacks from changes for
>       mitigating OOM "Genocide" attacks)

Which is yet another indication that the OOM killer is the wrong
solution to the "lack of forward progress" problem. Anyone can
generate enough memory pressure to trigger the OOM killer; we can't
prevent that from occurring when the OOM killer can be invoked by user
processes.
> I started posting to linux-mm ML in order to make forward progress
> about (1) and (2). I don't want the memory allocation subsystem to
> lock up an entire system by indefinitely disabling memory releasing
> mechanism provided by the OOM killer.
>
> > I've proposed a method of providing this forward progress guarantee
> > for subsystems of arbitrary complexity, and this removes the
> > dependency on the OOM killer for forward allocation progress in such
> > contexts (e.g. filesystems). We should be discussing how to
> > implement that, not what bandaids we need to apply to the OOM
> > killer. I want to fix the underlying problems, not push them under
> > the OOM-killer bus...
>
> I'm fine with that direction for new kernels provided that a simple
> bandaid which can be backported to distributor kernels for making
> OOM "Deadlock" attacks impossible is implemented. Therefore, I'm
> discussing what bandaids we need to apply to the OOM killer.

The band-aids being proposed are worse than the problem they are
intended to cover up. In which case, the band-aids should not be
applied.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-20 23:15               ` Dave Chinner
@ 2015-02-21  3:20                 ` Theodore Ts'o
  2015-02-21  9:19                   ` Andrew Morton
                                      ` (2 more replies)
  2015-02-21 11:12                 ` Tetsuo Handa
  1 sibling, 3 replies; 83+ messages in thread
From: Theodore Ts'o @ 2015-02-21  3:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm,
	mgorman, rientjes, akpm, linux-ext4, torvalds

+akpm

So I'm arriving late to this discussion since I've been in conference
mode for the past week, and I'm only now catching up on this thread.

I'll note that this whole question of whether or not file systems
should use GFP_NOFAIL is one where the mm developers are not of one
mind. In fact, search for the subject line "fs/reiserfs/journal.c:
Remove obsolete __GFP_NOFAIL", where we recapitulated many of these
arguments; Andrew Morton said that it was better to use GFP_NOFAIL
over the alternatives of (a) panic'ing the kernel because the file
system has no way to move forward other than leaving the file system
corrupted, or (b) looping in the file system to retry the memory
allocation to avoid the unfortunate effects of (a).

So based on akpm's sage advice and wisdom, I added back GFP_NOFAIL to
ext4/jbd2.

It sounds like 9879de7373fc is causing massive file system errors, and
it seems **really** unfortunate it was added so late in the day
(between -rc6 and rc7). So at this point, it seems we have two
choices. We can either revert 9879de7373fc, or I can add a whole lot
more GFP_NOFAIL flags to ext4's memory allocations and submit them as
stable bug fixes.

Linux MM developers, this is your call. I will liberally be adding
GFP_NOFAIL to ext4 if you won't revert the commit, because that's the
only way I can fix things with minimal risk of adding additional,
potentially more serious regressions.
- Ted
* Re: How to handle TIF_MEMDIE stalls?
  2015-02-21  3:20                 ` Theodore Ts'o
@ 2015-02-21  9:19                   ` Andrew Morton
  2015-02-21 13:48                     ` Tetsuo Handa
                                        ` (2 more replies)
  2015-02-21 12:00                     ` Tetsuo Handa
  2015-02-23 10:26                     ` Michal Hocko
  2 siblings, 3 replies; 83+ messages in thread
From: Andrew Morton @ 2015-02-21  9:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Tetsuo Handa, hannes, oleg, xfs, mhocko, linux-mm, mgorman,
	dchinner, rientjes, linux-ext4, torvalds

On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:

> +akpm

I was hoping not to have to read this thread ;)

afaict there are two (main) issues:

a) whether to oom-kill when __GFP_FS is not set. The kernel hasn't
   been doing this for ages and nothing has changed recently.

b) whether to keep looping when __GFP_NOFAIL is not set and __GFP_FS
   is not set and we can't oom-kill anything (which goes without
   saying, because __GFP_FS isn't set!).

   And 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into
   allocation slowpath") somewhat inadvertently changed this policy -
   the allocation attempt will now promptly return ENOMEM if
   !__GFP_NOFAIL and !__GFP_FS.

Correct enough?

Question a) seems a bit of a red herring and we can park it for now.

What I'm not really understanding is why the pre-3.19 implementation
actually worked. We've exhausted the free pages, we're not succeeding
at reclaiming anything, we aren't able to oom-kill anyone. Yet it
*does* work - we eventually find that memory and everything proceeds.

How come? Where did that memory come from?
Short term, we need to fix 3.19.x and 3.20 and that appears to be by applying Johannes's akpm-doesnt-know-why-it-works patch: --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, if (high_zoneidx < ZONE_NORMAL) goto out; /* The OOM killer does not compensate for light reclaim */ - if (!(gfp_mask & __GFP_FS)) + if (!(gfp_mask & __GFP_FS)) { + /* + * XXX: Page reclaim didn't yield anything, + * and the OOM killer can't be invoked, but + * keep looping as per should_alloc_retry(). + */ + *did_some_progress = 1; goto out; + } /* * GFP_THISNODE contains __GFP_NORETRY and we never hit this. * Sanity check for bare calls of __GFP_THISNODE, not real OOM. Have people adequately confirmed that this gets us out of trouble? And yes, I agree that sites such as xfs's kmem_alloc() should be passing __GFP_NOFAIL to tell the page allocator what's going on. I don't think it matters a lot whether kmem_alloc() retains its retry loop. If __GFP_NOFAIL is working correctly then it will never loop anyway... Also, this: On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote: > Right now, the oom killer is a liability. Over the past 6 months > I've slowly had to exclude filesystem regression tests from running > on small memory machines because the OOM killer is now so unreliable > that it kills the test harness regularly rather than the process > generating memory pressure. David, I did not know this! If you've been telling us about this then perhaps it wasn't loud enough. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 9:19 ` Andrew Morton @ 2015-02-21 13:48 ` Tetsuo Handa 2015-02-21 21:38 ` Dave Chinner 2015-02-22 0:20 ` Johannes Weiner 2 siblings, 0 replies; 83+ messages in thread From: Tetsuo Handa @ 2015-02-21 13:48 UTC (permalink / raw) To: akpm Cc: tytso, hannes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, linux-ext4, torvalds Andrew Morton wrote: > On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote: > > > +akpm > > I was hoping not to have to read this thread ;) Sorry for getting so complicated. > What I'm not really understanding is why the pre-3.19 implementation > actually worked. We've exhausted the free pages, we're not succeeding > at reclaiming anything, we aren't able to oom-kill anyone. Yet it > *does* work - we eventually find that memory and everything proceeds. > > How come? Where did that memory come from? > Even without __GFP_NOFAIL, GFP_NOFS / GFP_NOIO allocations retried forever (without invoking the OOM killer) if order <= PAGE_ALLOC_COSTLY_ORDER and TIF_MEMDIE is not set. Somebody else volunteered that memory while retrying. This implies silent hang-up forever if nobody volunteers memory. > And yes, I agree that sites such as xfs's kmem_alloc() should be > passing __GFP_NOFAIL to tell the page allocator what's going on. I > don't think it matters a lot whether kmem_alloc() retains its retry > loop. If __GFP_NOFAIL is working correctly then it will never loop > anyway... Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") inadvertently changed GFP_NOFS / GFP_NOIO allocations not to retry unless __GFP_NOFAIL is specified. Therefore, either applying Johannes's akpm-doesnt-know-why-it-works patch or passing __GFP_NOFAIL will restore the pre-3.19 behavior (with possibility of silent hang-up). 
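The pre-3.19 rule described above can be sketched as a predicate. This is a hypothetical userspace model of the decision shape only; the boolean parameters stand in for the real gfp bits and the TIF_MEMDIE task flag, and none of this is the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Pre-3.19 behavior as described in the text: without __GFP_NOFAIL,
 * a GFP_NOFS/GFP_NOIO allocation still retried forever as long as
 * the order was at most PAGE_ALLOC_COSTLY_ORDER and the task had not
 * been marked TIF_MEMDIE.
 */
static bool pre_3_19_would_retry(bool gfp_nofail, int order, bool tif_memdie)
{
	if (gfp_nofail)
		return true;		/* explicit must-not-fail request */
	if (tif_memdie)
		return false;		/* OOM-killed task: let it exit */
	return order <= PAGE_ALLOC_COSTLY_ORDER;
}
```

The "silent hang-up" Tetsuo mentions is the first branch-free case: a small-order allocation loops forever waiting for somebody else to volunteer memory that may never come.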
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 9:19 ` Andrew Morton 2015-02-21 13:48 ` Tetsuo Handa @ 2015-02-21 21:38 ` Dave Chinner 2015-02-22 0:20 ` Johannes Weiner 2 siblings, 0 replies; 83+ messages in thread From: Dave Chinner @ 2015-02-21 21:38 UTC (permalink / raw) To: Andrew Morton Cc: Theodore Ts'o, Tetsuo Handa, hannes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, linux-ext4, torvalds On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote: > On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote: > > > +akpm > > I was hoping not to have to read this thread ;) ditto.... > And yes, I agree that sites such as xfs's kmem_alloc() should be > passing __GFP_NOFAIL to tell the page allocator what's going on. I > don't think it matters a lot whether kmem_alloc() retains its retry > loop. If __GFP_NOFAIL is working correctly then it will never loop > anyway... I'm not about to change behaviour "just because". Any sort of change like this requires a *lot* of low memory regression testing because we'd be replacing long standing known behaviour with behaviour that changes without warning. e.g the ext4 low memory failures starting because of changes made in 3.19-rc6 due to changes in oom-killer behaviour. Those changes *did not affect XFS* and that's the way I'd like things to remain. Put simply: right now I don't trust the mm subsystem to get low memory behaviour right, and this thread has done nothing to convince me that it's going to improve any time soon. > Also, this: > > On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote: > > > Right now, the oom killer is a liability. Over the past 6 months > > I've slowly had to exclude filesystem regression tests from running > > on small memory machines because the OOM killer is now so unreliable > > that it kills the test harness regularly rather than the process > > generating memory pressure. > > David, I did not know this! 
> If you've been telling us about this then
> perhaps it wasn't loud enough.

IME, such bug reports get ignored. Instead, over the past few months I
have been pointing out bugs and problems in the oom-killer in threads
like this because it seems to be the only way to get any attention to
the issues I'm seeing. Bug reports simply get ignored.

From this process, I've managed to learn that low order memory
allocation now never fails (contrary to documentation and long
standing behavioural expectations) and pointed out bugs that cause the
oom killer to get invoked when the filesystem is saying "I can handle
ENOMEM!" (commit 45f87de ("mm: get rid of radix tree gfp mask for
pagecache_get_page")).

And yes, I've definitely mentioned in these discussions that, for
example, xfstests::generic/224 is triggering the oom killer far more
often than it used to on my 1GB RAM vm. The only fix that has been
made recently that's made any difference is 45f87de, so it's a slow
process of raising awareness and trying to ensure things don't get
worse before they get better....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 9:19 ` Andrew Morton 2015-02-21 13:48 ` Tetsuo Handa 2015-02-21 21:38 ` Dave Chinner @ 2015-02-22 0:20 ` Johannes Weiner 2015-02-23 10:48 ` Michal Hocko 2015-02-23 21:33 ` David Rientjes 2 siblings, 2 replies; 83+ messages in thread From: Johannes Weiner @ 2015-02-22 0:20 UTC (permalink / raw) To: Andrew Morton Cc: Theodore Ts'o, Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, linux-ext4, torvalds On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote: > Short term, we need to fix 3.19.x and 3.20 and that appears to be by > applying Johannes's akpm-doesnt-know-why-it-works patch: > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > if (high_zoneidx < ZONE_NORMAL) > goto out; > /* The OOM killer does not compensate for light reclaim */ > - if (!(gfp_mask & __GFP_FS)) > + if (!(gfp_mask & __GFP_FS)) { > + /* > + * XXX: Page reclaim didn't yield anything, > + * and the OOM killer can't be invoked, but > + * keep looping as per should_alloc_retry(). > + */ > + *did_some_progress = 1; > goto out; > + } > /* > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > Have people adequately confirmed that this gets us out of trouble? I'd be interested in this too. Who is seeing these failures? Andrew, can you please use the following changelog for this patch? --- From: Johannes Weiner <hannes@cmpxchg.org> mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change Historically, !__GFP_FS allocations were not allowed to invoke the OOM killer once reclaim had failed, but nevertheless kept looping in the allocator. 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath"), which should have been a simple cleanup patch, accidentally changed the behavior to aborting the allocation at that point. 
This creates problems with filesystem callers (?) that currently rely
on the allocator waiting for other tasks to intervene.

Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.

Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
* Re: How to handle TIF_MEMDIE stalls? 2015-02-22 0:20 ` Johannes Weiner @ 2015-02-23 10:48 ` Michal Hocko 2015-02-23 11:23 ` Tetsuo Handa 2015-02-23 21:33 ` David Rientjes 1 sibling, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-02-23 10:48 UTC (permalink / raw) To: Johannes Weiner Cc: Theodore Ts'o, Tetsuo Handa, rientjes, oleg, xfs, linux-mm, mgorman, dchinner, Andrew Morton, linux-ext4, torvalds On Sat 21-02-15 19:20:58, Johannes Weiner wrote: > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote: > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by > > applying Johannes's akpm-doesnt-know-why-it-works patch: > > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > if (high_zoneidx < ZONE_NORMAL) > > goto out; > > /* The OOM killer does not compensate for light reclaim */ > > - if (!(gfp_mask & __GFP_FS)) > > + if (!(gfp_mask & __GFP_FS)) { > > + /* > > + * XXX: Page reclaim didn't yield anything, > > + * and the OOM killer can't be invoked, but > > + * keep looping as per should_alloc_retry(). > > + */ > > + *did_some_progress = 1; > > goto out; > > + } > > /* > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > > > Have people adequately confirmed that this gets us out of trouble? > > I'd be interested in this too. Who is seeing these failures? > > Andrew, can you please use the following changelog for this patch? > > --- > From: Johannes Weiner <hannes@cmpxchg.org> > > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change > > Historically, !__GFP_FS allocations were not allowed to invoke the OOM > killer once reclaim had failed, but nevertheless kept looping in the > allocator. 
9879de7373fc ("mm: page_alloc: embed OOM killing naturally > into allocation slowpath"), which should have been a simple cleanup > patch, accidentally changed the behavior to aborting the allocation at > that point. This creates problems with filesystem callers (?) that > currently rely on the allocator waiting for other tasks to intervene. > > Revert the behavior as it shouldn't have been changed as part of a > cleanup patch. OK, if this is a _short term_ change. I really think that all the requests except for __GFP_NOFAIL should be able to fail. I would argue that it should be the caller who should be fixed but it is true that the patch was introduced too late (rc7) and so it caught other subsystems unprepared, so backporting to stable makes sense to me. But can we please move on and stop pretending that allocations do not fail for the upcoming release? > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 10:48 ` Michal Hocko @ 2015-02-23 11:23 ` Tetsuo Handa 0 siblings, 0 replies; 83+ messages in thread From: Tetsuo Handa @ 2015-02-23 11:23 UTC (permalink / raw) To: mhocko, hannes Cc: tytso, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, linux-ext4, torvalds Michal Hocko wrote: > On Sat 21-02-15 19:20:58, Johannes Weiner wrote: > > On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote: > > > Short term, we need to fix 3.19.x and 3.20 and that appears to be by > > > applying Johannes's akpm-doesnt-know-why-it-works patch: > > > > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > > > if (high_zoneidx < ZONE_NORMAL) > > > goto out; > > > /* The OOM killer does not compensate for light reclaim */ > > > - if (!(gfp_mask & __GFP_FS)) > > > + if (!(gfp_mask & __GFP_FS)) { > > > + /* > > > + * XXX: Page reclaim didn't yield anything, > > > + * and the OOM killer can't be invoked, but > > > + * keep looping as per should_alloc_retry(). > > > + */ > > > + *did_some_progress = 1; > > > goto out; > > > + } > > > /* > > > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > > > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > > > > > > Have people adequately confirmed that this gets us out of trouble? > > > > I'd be interested in this too. Who is seeing these failures? So far ext4 and xfs. I don't have environment to test other filesystems. > > > > Andrew, can you please use the following changelog for this patch? > > > > --- > > From: Johannes Weiner <hannes@cmpxchg.org> > > > > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change > > > > Historically, !__GFP_FS allocations were not allowed to invoke the OOM > > killer once reclaim had failed, but nevertheless kept looping in the > > allocator. 
9879de7373fc ("mm: page_alloc: embed OOM killing naturally > > into allocation slowpath"), which should have been a simple cleanup > > patch, accidentally changed the behavior to aborting the allocation at > > that point. This creates problems with filesystem callers (?) that > > currently rely on the allocator waiting for other tasks to intervene. > > > > Revert the behavior as it shouldn't have been changed as part of a > > cleanup patch. > > OK, if this is a _short term_ change. I really think that all the requests > except for __GFP_NOFAIL should be able to fail. I would argue that it > should be the caller who should be fixed but it is true that the patch > was introduced too late (rc7) and so it caught other subsystems > unprepared so backporting to stable makes sense to me. But can we please > move on and stop pretending that allocations do not fail for the > upcoming release? > > > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > Acked-by: Michal Hocko <mhocko@suse.cz> > Without this patch, I think the system becomes unusable under OOM. However, with this patch, I know the system may become unusable under OOM. Please do write patches for handling the condition below. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Johannes's patch will get us out of filesystem error troubles, at the cost of getting us into stall troubles (as seen until 3.19-rc6). I retested http://marc.info/?l=linux-ext4&m=142443125221571&w=2 with the debug printk patch shown below. 
---------- debug printk patch ---------- diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d503e9c..5144506 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -610,6 +610,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) spin_unlock(&zone_scan_lock); } +atomic_t oom_killer_skipped_count = ATOMIC_INIT(0); + /** * out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer @@ -679,6 +681,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, nodemask, "Out of memory"); killed = 1; } + else + atomic_inc(&oom_killer_skipped_count); out: /* * Give the killed threads a good chance of exiting before trying to diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8e20f9c..eaea16b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, if (high_zoneidx < ZONE_NORMAL) goto out; /* The OOM killer does not compensate for light reclaim */ - if (!(gfp_mask & __GFP_FS)) + if (!(gfp_mask & __GFP_FS)) { + /* + * XXX: Page reclaim didn't yield anything, + * and the OOM killer can't be invoked, but + * keep looping as per should_alloc_retry(). + */ + *did_some_progress = 1; goto out; + } /* * GFP_THISNODE contains __GFP_NORETRY and we never hit this. * Sanity check for bare calls of __GFP_THISNODE, not real OOM. 
@@ -2635,6 +2642,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS); } +extern atomic_t oom_killer_skipped_count; + static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, @@ -2649,6 +2658,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, enum migrate_mode migration_mode = MIGRATE_ASYNC; bool deferred_compaction = false; int contended_compaction = COMPACT_CONTENDED_NONE; + unsigned long first_retried_time = 0; + unsigned long next_warn_time = 0; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -2821,6 +2832,19 @@ retry: if (!did_some_progress) goto nopage; } + if (!first_retried_time) { + first_retried_time = jiffies; + if (!first_retried_time) + first_retried_time = 1; + next_warn_time = first_retried_time + 5 * HZ; + } else if (time_after(jiffies, next_warn_time)) { + printk(KERN_INFO "%d (%s) : gfp 0x%X : %lu seconds : " + "OOM-killer skipped %u\n", current->pid, + current->comm, gfp_mask, + (jiffies - first_retried_time) / HZ, + atomic_read(&oom_killer_skipped_count)); + next_warn_time = jiffies + 5 * HZ; + } /* Wait for some write requests to complete then retry */ wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50); goto retry; ---------- debug printk patch ---------- GFP_NOFS allocations stalled for 10 minutes waiting for somebody else to volunteer memory. GFP_FS allocations stalled for 10 minutes waiting for the OOM killer to kill somebody. The OOM killer stalled for 10 minutes waiting for GFP_NOFS allocations to complete. I guess the system made forward progress because the number of remaining a.out processes decreased over time. 
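Incidentally, the 5-second rate limiting in the debug patch above relies on the wraparound-safe jiffies comparison time_after() from include/linux/jiffies.h. A small userspace model of that logic (HZ value and the helper are reimplemented here purely for illustration; this is not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long jiffies_t;

#define HZ 1000UL		/* assumed tick rate for the sketch */

/* Wraparound-safe "a is later than b", as in include/linux/jiffies.h. */
static bool time_after(jiffies_t a, jiffies_t b)
{
	return (long)(b - a) < 0;
}

/*
 * Decide whether to emit the "OOM-killer skipped" line, updating the
 * timestamps the same way the debug patch does: arm on first retry,
 * then warn at most once per 5 seconds.  0 means "not yet armed".
 */
static bool should_warn(jiffies_t now, jiffies_t *first, jiffies_t *next)
{
	if (!*first) {
		*first = now ? now : 1;	/* avoid the 0 sentinel */
		*next = *first + 5 * HZ;
		return false;
	}
	if (time_after(now, *next)) {
		*next = now + 5 * HZ;
		return true;
	}
	return false;
}
```

The cast-to-signed subtraction is what keeps the comparison correct when jiffies wraps around ULONG_MAX, which a plain `now > next` check would get wrong.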
(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-ext4-patched.txt.xz ) ---------- ext4 / Linux 3.19 + patch ---------- [ 1335.187579] Out of memory: Kill process 14156 (a.out) score 760 or sacrifice child [ 1335.189604] Killed process 14156 (a.out) total-vm:2167392kB, anon-rss:1360196kB, file-rss:0kB [ 1335.191920] Kill process 14177 (a.out) sharing same memory [ 1335.193465] Kill process 14178 (a.out) sharing same memory [ 1335.195013] Kill process 14179 (a.out) sharing same memory [ 1335.196580] Kill process 14180 (a.out) sharing same memory [ 1335.198128] Kill process 14181 (a.out) sharing same memory [ 1335.199674] Kill process 14182 (a.out) sharing same memory [ 1335.201217] Kill process 14183 (a.out) sharing same memory [ 1335.202768] Kill process 14184 (a.out) sharing same memory [ 1335.204316] Kill process 14185 (a.out) sharing same memory [ 1335.205871] Kill process 14186 (a.out) sharing same memory [ 1335.207420] Kill process 14187 (a.out) sharing same memory [ 1335.208974] Kill process 14188 (a.out) sharing same memory [ 1335.210515] Kill process 14189 (a.out) sharing same memory [ 1335.212063] Kill process 14190 (a.out) sharing same memory [ 1335.213611] Kill process 14191 (a.out) sharing same memory [ 1335.215165] Kill process 14192 (a.out) sharing same memory [ 1335.216715] Kill process 14193 (a.out) sharing same memory [ 1335.218286] Kill process 14194 (a.out) sharing same memory [ 1335.219836] Kill process 14195 (a.out) sharing same memory [ 1335.221378] Kill process 14196 (a.out) sharing same memory [ 1335.222918] Kill process 14197 (a.out) sharing same memory [ 1335.224461] Kill process 14198 (a.out) sharing same memory [ 1335.225999] Kill process 14199 (a.out) sharing same memory [ 1335.227545] Kill process 14200 (a.out) sharing same memory [ 1335.229095] Kill process 14201 (a.out) sharing same memory [ 1335.230643] Kill process 14202 (a.out) sharing same memory [ 1335.232184] Kill process 14203 (a.out) sharing same memory [ 1335.233738] 
Kill process 14204 (a.out) sharing same memory [ 1335.235293] Kill process 14205 (a.out) sharing same memory [ 1335.236834] Kill process 14206 (a.out) sharing same memory [ 1335.238387] Kill process 14207 (a.out) sharing same memory [ 1335.239930] Kill process 14208 (a.out) sharing same memory [ 1335.241471] Kill process 14209 (a.out) sharing same memory [ 1335.243011] Kill process 14210 (a.out) sharing same memory [ 1335.244554] Kill process 14211 (a.out) sharing same memory [ 1335.246101] Kill process 14212 (a.out) sharing same memory [ 1335.247645] Kill process 14213 (a.out) sharing same memory [ 1335.249182] Kill process 14214 (a.out) sharing same memory [ 1335.250718] Kill process 14215 (a.out) sharing same memory [ 1335.252305] Kill process 14216 (a.out) sharing same memory [ 1335.253899] Kill process 14217 (a.out) sharing same memory [ 1335.255443] Kill process 14218 (a.out) sharing same memory [ 1335.256993] Kill process 14219 (a.out) sharing same memory [ 1335.258531] Kill process 14220 (a.out) sharing same memory [ 1335.260066] Kill process 14221 (a.out) sharing same memory [ 1335.261616] Kill process 14222 (a.out) sharing same memory [ 1335.263143] Kill process 14223 (a.out) sharing same memory [ 1335.264647] Kill process 14224 (a.out) sharing same memory [ 1335.266121] Kill process 14225 (a.out) sharing same memory [ 1335.267598] Kill process 14226 (a.out) sharing same memory [ 1335.269077] Kill process 14227 (a.out) sharing same memory [ 1335.270560] Kill process 14228 (a.out) sharing same memory [ 1335.272038] Kill process 14229 (a.out) sharing same memory [ 1335.273508] Kill process 14230 (a.out) sharing same memory [ 1335.274999] Kill process 14231 (a.out) sharing same memory [ 1335.276469] Kill process 14232 (a.out) sharing same memory [ 1335.277947] Kill process 14233 (a.out) sharing same memory [ 1335.279428] Kill process 14234 (a.out) sharing same memory [ 1335.280894] Kill process 14235 (a.out) sharing same memory [ 1335.282361] Kill process 
14236 (a.out) sharing same memory [ 1335.283832] Kill process 14237 (a.out) sharing same memory [ 1335.285304] Kill process 14238 (a.out) sharing same memory [ 1335.286768] Kill process 14239 (a.out) sharing same memory [ 1335.288242] Kill process 14240 (a.out) sharing same memory [ 1335.289714] Kill process 14241 (a.out) sharing same memory [ 1335.291196] Kill process 14242 (a.out) sharing same memory [ 1335.292731] Kill process 14243 (a.out) sharing same memory [ 1335.294258] Kill process 14244 (a.out) sharing same memory [ 1335.295734] Kill process 14245 (a.out) sharing same memory [ 1335.297215] Kill process 14246 (a.out) sharing same memory [ 1335.298710] Kill process 14247 (a.out) sharing same memory [ 1335.300188] Kill process 14248 (a.out) sharing same memory [ 1335.301672] Kill process 14249 (a.out) sharing same memory [ 1335.303157] Kill process 14250 (a.out) sharing same memory [ 1335.304655] Kill process 14251 (a.out) sharing same memory [ 1335.306141] Kill process 14252 (a.out) sharing same memory [ 1335.307621] Kill process 14253 (a.out) sharing same memory [ 1335.309107] Kill process 14254 (a.out) sharing same memory [ 1335.310573] Kill process 14255 (a.out) sharing same memory [ 1335.312052] Kill process 14256 (a.out) sharing same memory [ 1335.313528] Kill process 14257 (a.out) sharing same memory [ 1335.315039] Kill process 14258 (a.out) sharing same memory [ 1335.316522] Kill process 14259 (a.out) sharing same memory [ 1335.317992] Kill process 14260 (a.out) sharing same memory [ 1335.319462] Kill process 14261 (a.out) sharing same memory [ 1335.320965] Kill process 14262 (a.out) sharing same memory [ 1335.322459] Kill process 14263 (a.out) sharing same memory [ 1335.323958] Kill process 14264 (a.out) sharing same memory [ 1335.325472] Kill process 14265 (a.out) sharing same memory [ 1335.326966] Kill process 14266 (a.out) sharing same memory [ 1335.328454] Kill process 14267 (a.out) sharing same memory [ 1335.329945] Kill process 14268 (a.out) 
sharing same memory [ 1335.331444] Kill process 14269 (a.out) sharing same memory [ 1335.332944] Kill process 14270 (a.out) sharing same memory [ 1335.334435] Kill process 14271 (a.out) sharing same memory [ 1335.335930] Kill process 14272 (a.out) sharing same memory [ 1335.337437] Kill process 14273 (a.out) sharing same memory [ 1335.338927] Kill process 14274 (a.out) sharing same memory [ 1335.340400] Kill process 14275 (a.out) sharing same memory [ 1335.341890] Kill process 14276 (a.out) sharing same memory [ 1339.640500] 464 (systemd-journal) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459181 [ 1339.649374] 615 (vmtoolsd) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459438 [ 1339.649611] 4079 (pool) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22459447 [ 1340.343322] 14258 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275 [ 1340.343331] 14194 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478275 [ 1340.343345] 14210 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478276 [ 1340.343360] 14179 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478277 [ 1340.345290] 14154 (su) : gfp 0x201DA : 5 seconds : OOM-killer skipped 22478339 [ 1340.345312] 14180 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339 [ 1340.345319] 14260 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478339 [ 1340.345337] 14178 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340 [ 1340.345345] 14245 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478340 [ 1340.345361] 14226 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478341 [ 1340.346119] 14256 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478368 [ 1340.346139] 14181 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478369 [ 1340.347082] 14274 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402 [ 1340.347091] 14267 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402 [ 1340.347095] 14189 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 
22478402 [ 1340.347099] 14238 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478402 [ 1340.347107] 14276 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403 [ 1340.347112] 14183 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478403 [ 1340.347397] 14254 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413 [ 1340.347402] 14228 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478413 [ 1340.347414] 14185 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414 [ 1340.347419] 14261 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414 [ 1340.347423] 14217 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414 [ 1340.347427] 14203 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478414 [ 1340.347439] 14234 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415 [ 1340.347452] 14269 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478415 [ 1340.347461] 14255 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416 [ 1340.347465] 14192 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416 [ 1340.347473] 14259 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478416 [ 1340.347492] 14232 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417 [ 1340.347497] 14223 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417 [ 1340.347505] 14220 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478417 [ 1340.347523] 14252 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418 [ 1340.347531] 14193 (a.out) : gfp 0x50 : 5 seconds : OOM-killer skipped 22478418 (...snipped...) 
[ 1949.672951] 43 (kworker/1:1) : gfp 0x10 : 90 seconds : OOM-killer skipped 41315348 [ 1949.993045] 4079 (pool) : gfp 0x201DA : 615 seconds : OOM-killer skipped 41325108 [ 1950.694909] 14269 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41346727 [ 1950.703945] 14181 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41347003 [ 1950.742087] 14254 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348208 [ 1950.744937] 14193 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348299 [ 1950.748884] 2 (kthreadd) : gfp 0x2000D0 : 10 seconds : OOM-killer skipped 41348418 [ 1950.751565] 14203 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348502 [ 1950.756955] 14232 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41348656 [ 1950.776918] 14185 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349279 [ 1950.791214] 14217 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349720 [ 1950.798961] 14179 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41349957 [ 1950.806551] 14255 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350209 [ 1950.810860] 14234 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350356 [ 1950.813821] 14258 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41350450 [ 1950.860422] 14261 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41351919 [ 1950.864015] 14210 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352033 [ 1950.866636] 14226 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41352107 [ 1950.905003] 14238 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353303 [ 1950.907813] 14180 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353381 [ 1950.913963] 14276 (a.out) : gfp 0x50 : 615 seconds : OOM-killer skipped 41353567 [ 1952.238344] 649 (chronyd) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393388 [ 1952.243228] 4030 (gnome-shell) : gfp 0x201DA : 25 seconds : OOM-killer skipped 41393566 [ 1952.247225] 592 (audispd) : gfp 0x201DA : 25 
seconds : OOM-killer skipped 41393701 [ 1952.258265] 1 (systemd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394041 [ 1952.269296] 1691 (rpcbind) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41394365 [ 1952.299073] 702 (rtkit-daemon) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41395288 [ 1952.301231] 627 (lsmd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41395385 [ 1952.350200] 464 (systemd-journal) : gfp 0x201DA : 165 seconds : OOM-killer skipped 41396935 [ 1952.472040] 543 (auditd) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400669 [ 1952.475211] 14154 (su) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41400795 [ 1952.527084] 3514 (smbd) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402412 [ 1952.543205] 613 (irqbalance) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41402892 [ 1952.568276] 12672 (pickup) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403656 [ 1952.572329] 770 (tuned) : gfp 0x201DA : 95 seconds : OOM-killer skipped 41403784 [ 1952.578076] 3392 (master) : gfp 0x201DA : 35 seconds : OOM-killer skipped 41403955 [ 1952.597273] 615 (vmtoolsd) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41404520 [ 1952.619187] 14146 (sleep) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405206 [ 1952.621214] 811 (NetworkManager) : gfp 0x201DA : 105 seconds : OOM-killer skipped 41405265 [ 1952.765035] 3700 (gnome-settings-) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409551 [ 1952.776099] 603 (alsactl) : gfp 0x201DA : 315 seconds : OOM-killer skipped 41409856 [ 1952.823163] 661 (crond) : gfp 0x201DA : 325 seconds : OOM-killer skipped 41411303 [ 1953.201269] SysRq : Resetting ---------- ext4 / Linux 3.19 + patch ---------- I also tested on XFS. One is Linux 3.19 and the other is Linux 3.19 with debug printk patch shown above. According to console logs, oom_kill_process() is trivially called via pagefault_out_of_memory() for the former kernel. Due to giving up !GFP_FS allocations immediately? 
(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz ) ---------- xfs / Linux 3.19 ---------- [ 793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0 [ 793.283102] su cpuset=/ mems_allowed=0 [ 793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40 [ 793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 793.283161] 0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe [ 793.283162] ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206 [ 793.283163] 0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8 [ 793.283164] Call Trace: [ 793.283169] [<ffffffff816ae9d4>] dump_stack+0x45/0x57 [ 793.283171] [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1 [ 793.283174] [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390 [ 793.283177] [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30 [ 793.283178] [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500 [ 793.283179] [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90 [ 793.283180] [<ffffffff816aab2c>] mm_fault_error+0x67/0x140 [ 793.283182] [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580 [ 793.283185] [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60 [ 793.283186] [<ffffffff81070fcb>] ? do_wait+0x12b/0x240 [ 793.283187] [<ffffffff8105abb1>] do_page_fault+0x31/0x70 [ 793.283189] [<ffffffff816b83e8>] page_fault+0x28/0x30 ---------- xfs / Linux 3.19 ---------- On the other hand, stall is observed for the latter kernel. I guess that this time the system failed to make forward progress, for oom_killer_skipped_count is increasing over time but the number of remaining a.out processes remained unchanged. 
(From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-patched.txt.xz ) ---------- xfs / Linux 3.19 + patch ---------- [ 2062.847965] 505 (abrt-watch-log) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388568 [ 2062.850270] 515 (lsmd) : gfp 0x2015A : 674 seconds : OOM-killer skipped 22388662 [ 2062.850389] 491 (audispd) : gfp 0x2015A : 666 seconds : OOM-killer skipped 22388667 [ 2062.850400] 346 (systemd-journal) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388667 [ 2062.850402] 610 (rtkit-daemon) : gfp 0x2015A : 677 seconds : OOM-killer skipped 22388667 [ 2062.850424] 494 (alsactl) : gfp 0x2015A : 546 seconds : OOM-killer skipped 22388668 [ 2062.850446] 558 (crond) : gfp 0x2015A : 645 seconds : OOM-killer skipped 22388669 [ 2062.850451] 25532 (su) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388669 [ 2062.850456] 516 (vmtoolsd) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388669 [ 2062.850494] 741 (NetworkManager) : gfp 0x2015A : 530 seconds : OOM-killer skipped 22388670 [ 2062.850503] 3132 (master) : gfp 0x2015A : 644 seconds : OOM-killer skipped 22388671 [ 2062.850508] 3144 (pickup) : gfp 0x2015A : 604 seconds : OOM-killer skipped 22388671 [ 2062.850512] 3145 (qmgr) : gfp 0x2015A : 526 seconds : OOM-killer skipped 22388671 [ 2062.850540] 25653 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388672 [ 2062.850561] 655 (tuned) : gfp 0x2015A : 682 seconds : OOM-killer skipped 22388673 [ 2062.852404] 10429 (kworker/0:14) : gfp 0x2040D0 : 683 seconds : OOM-killer skipped 22388748 [ 2062.852430] 543 (chronyd) : gfp 0x2015A : 293 seconds : OOM-killer skipped 22388749 [ 2062.852436] 13012 (goa-daemon) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22388749 [ 2062.852449] 1454 (rpcbind) : gfp 0x2015A : 662 seconds : OOM-killer skipped 22388749 [ 2062.854288] 466 (auditd) : gfp 0x2015A : 626 seconds : OOM-killer skipped 22388751 [ 2062.854305] 25622 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751 [ 
2062.854426] 1419 (dhclient) : gfp 0x2015A : 388 seconds : OOM-killer skipped 22388751 [ 2062.854443] 25638 (a.out) : gfp 0x204250 : 683 seconds : OOM-killer skipped 22388751 [ 2062.854450] 25582 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388751 [ 2062.854462] 25400 (sleep) : gfp 0x2015A : 635 seconds : OOM-killer skipped 22388751 [ 2062.854469] 532 (smartd) : gfp 0x2015A : 246 seconds : OOM-killer skipped 22388751 [ 2062.854486] 2 (kthreadd) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22388752 [ 2062.854497] 3867 (gnome-shell) : gfp 0x2015A : 683 seconds : OOM-killer skipped 22388752 [ 2062.854502] 3562 (gnome-settings-) : gfp 0x2015A : 676 seconds : OOM-killer skipped 22388752 [ 2062.854524] 25641 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753 [ 2062.854536] 25566 (a.out) : gfp 0x102005A : 683 seconds : OOM-killer skipped 22388753 [ 2062.908915] 61 (kworker/3:1) : gfp 0x2040D0 : 682 seconds : OOM-killer skipped 22390715 [ 2062.913407] 531 (irqbalance) : gfp 0x2015A : 679 seconds : OOM-killer skipped 22390894 [ 2064.988155] SysRq : Resetting ---------- xfs / Linux 3.19 + patch ---------- As it stands, the current code gives too few hints to determine whether forward progress is being made, for no kernel messages are printed when an OOM victim fails to die immediately. I wish we had the debug printk patch shown above and/or something like http://marc.info/?l=linux-mm&m=141671829611143&w=2 . _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-22 0:20 ` Johannes Weiner 2015-02-23 10:48 ` Michal Hocko @ 2015-02-23 21:33 ` David Rientjes 1 sibling, 0 replies; 83+ messages in thread From: David Rientjes @ 2015-02-23 21:33 UTC (permalink / raw) To: Johannes Weiner Cc: Theodore Ts'o, Tetsuo Handa, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, linux-ext4, torvalds On Sat, 21 Feb 2015, Johannes Weiner wrote: > From: Johannes Weiner <hannes@cmpxchg.org> > > mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change > > Historically, !__GFP_FS allocations were not allowed to invoke the OOM > killer once reclaim had failed, but nevertheless kept looping in the > allocator. 9879de7373fc ("mm: page_alloc: embed OOM killing naturally > into allocation slowpath"), which should have been a simple cleanup > patch, accidentally changed the behavior to aborting the allocation at > that point. This creates problems with filesystem callers (?) that > currently rely on the allocator waiting for other tasks to intervene. > > Revert the behavior as it shouldn't have been changed as part of a > cleanup patch. > > Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org [3.19] Acked-by: David Rientjes <rientjes@google.com> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 3:20 ` Theodore Ts'o 2015-02-21 9:19 ` Andrew Morton @ 2015-02-21 12:00 ` Tetsuo Handa 2015-02-23 10:26 ` Michal Hocko 2 siblings, 0 replies; 83+ messages in thread From: Tetsuo Handa @ 2015-02-21 12:00 UTC (permalink / raw) To: tytso Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, linux-ext4, torvalds Theodore Ts'o wrote: > So at this point, it seems we have two choices. We can either revert > 9879de7373fc, or I can add a whole lot more GFP_NOFAIL flags to ext4's > memory allocations and submit them as stable bug fixes. Can you absorb this side effect by simply adding GFP_NOFAIL to only ext4's memory allocations? Don't you also depend on lower layers which use GFP_NOIO? BTW, while you are using an open-coded GFP_NOFAIL retry loop for GFP_NOFS allocation in jbd2, you are already using GFP_NOFAIL for GFP_NOFS allocation in jbd. The failure check there seems redundant, since GFP_NOFAIL allocations do not return NULL. ---------- linux-3.19/fs/jbd2/transaction.c ---------- 257 static int start_this_handle(journal_t *journal, handle_t *handle, 258 gfp_t gfp_mask) 259 { 260 transaction_t *transaction, *new_transaction = NULL; 261 int blocks = handle->h_buffer_credits; 262 int rsv_blocks = 0; 263 unsigned long ts = jiffies; 264 265 /* 266 * 1/2 of transaction can be reserved so we can practically handle 267 * only 1/2 of maximum transaction size per operation 268 */ 269 if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) { 270 printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n", 271 current->comm, blocks, 272 journal->j_max_transaction_buffers / 2); 273 return -ENOSPC; 274 } 275 276 if (handle->h_rsv_handle) 277 rsv_blocks = handle->h_rsv_handle->h_buffer_credits; 278 279 alloc_transaction: 280 if (!journal->j_running_transaction) { 281 new_transaction = kmem_cache_zalloc(transaction_cache, 282 gfp_mask); 283 if (!new_transaction) { 284 /* 285 * If __GFP_FS is not present, then we may be 286 * being called from 
inside the fs writeback 287 * layer, so we MUST NOT fail. Since 288 * __GFP_NOFAIL is going away, we will arrange 289 * to retry the allocation ourselves. 290 */ 291 if ((gfp_mask & __GFP_FS) == 0) { 292 congestion_wait(BLK_RW_ASYNC, HZ/50); 293 goto alloc_transaction; 294 } 295 return -ENOMEM; 296 } 297 } 298 299 jbd_debug(3, "New handle %p going live.\n", handle); ---------- linux-3.19/fs/jbd2/transaction.c ---------- ---------- linux-3.19/fs/jbd/transaction.c ---------- 84 static int start_this_handle(journal_t *journal, handle_t *handle) 85 { 86 transaction_t *transaction; 87 int needed; 88 int nblocks = handle->h_buffer_credits; 89 transaction_t *new_transaction = NULL; 90 int ret = 0; 91 92 if (nblocks > journal->j_max_transaction_buffers) { 93 printk(KERN_ERR "JBD: %s wants too many credits (%d > %d)\n", 94 current->comm, nblocks, 95 journal->j_max_transaction_buffers); 96 ret = -ENOSPC; 97 goto out; 98 } 99 100 alloc_transaction: 101 if (!journal->j_running_transaction) { 102 new_transaction = kzalloc(sizeof(*new_transaction), 103 GFP_NOFS|__GFP_NOFAIL); 104 if (!new_transaction) { 105 ret = -ENOMEM; 106 goto out; 107 } 108 } 109 110 jbd_debug(3, "New handle %p going live.\n", handle); ---------- linux-3.19/fs/jbd/transaction.c ---------- _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
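The jbd2 loop quoted above is the pattern that __GFP_NOFAIL replaces: retry until the allocation succeeds, backing off between attempts. A minimal userspace sketch of that pattern (the flaky allocator and its failure count are invented for illustration; the kernel's backoff is congestion_wait()):

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator that fails a few times before succeeding,
 * standing in for transient memory pressure. */
static int failures_left = 3;

static void *flaky_alloc(size_t size)
{
	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return malloc(size);
}

/* The open-coded "must not fail" pattern: retry forever. In the
 * kernel, congestion_wait(BLK_RW_ASYNC, HZ/50) sits between the
 * attempts; __GFP_NOFAIL moves this loop into the allocator itself,
 * which can then also engage the OOM killer to make progress. */
static void *must_not_fail_alloc(size_t size)
{
	void *ptr;

	do {
		ptr = flaky_alloc(size);
		/* kernel: congestion_wait(BLK_RW_ASYNC, HZ/50); */
	} while (!ptr);
	return ptr;
}
```

The loop can never report failure to its caller, which is exactly why open-coding it hides the hard requirement from the allocator.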
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 3:20 ` Theodore Ts'o 2015-02-21 9:19 ` Andrew Morton 2015-02-21 12:00 ` Tetsuo Handa @ 2015-02-23 10:26 ` Michal Hocko 2 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2015-02-23 10:26 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Tetsuo Handa, dchinner, oleg, xfs, hannes, linux-mm, mgorman, rientjes, akpm, linux-ext4, torvalds

On Fri 20-02-15 22:20:00, Theodore Ts'o wrote:
[...]
> So based on akpm's sage advise and wisdom, I added back GFP_NOFAIL to
> ext4/jbd2.

I am currently going through open-coded GFP_NOFAIL allocation loops and have this in my local branch. I assume you did the same, so I will drop mine if you have pushed yours already.

---
From dc49cef75dbd677d5542c9e5bd27bbfab9a7bc3a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 20 Feb 2015 11:32:58 +0100
Subject: [PATCH] jbd2: revert must-not-fail allocation loops back to GFP_NOFAIL

This basically reverts 47def82672b3 (jbd2: Remove __GFP_NOFAIL from jbd2 layer). The deprecation of __GFP_NOFAIL was a bad choice because it led to open-coding the endless loop around the allocator rather than removing the dependency on the non-failing allocation. So the deprecation was a clear failure, and reality tells us that __GFP_NOFAIL is nowhere close to going away.

It is still true that __GFP_NOFAIL allocations are generally discouraged, and that new uses should be evaluated and an alternative (pre-allocations or reservations) considered, but it doesn't make any sense to lie to the allocator about the requirements. The allocator can take steps to help make progress if it knows the requirements.
Signed-off-by: Michal Hocko <mhocko@suse.cz> --- fs/jbd2/journal.c | 11 +---------- fs/jbd2/transaction.c | 20 +++++++------------- 2 files changed, 8 insertions(+), 23 deletions(-) diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 1df94fabe4eb..878ed3e761f0 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -371,16 +371,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction, */ J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in)); -retry_alloc: - new_bh = alloc_buffer_head(GFP_NOFS); - if (!new_bh) { - /* - * Failure is not an option, but __GFP_NOFAIL is going - * away; so we retry ourselves here. - */ - congestion_wait(BLK_RW_ASYNC, HZ/50); - goto retry_alloc; - } + new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL); /* keep subsequent assertions sane */ atomic_set(&new_bh->b_count, 1); diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 5f09370c90a8..dac4523fa142 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -278,22 +278,16 @@ static int start_this_handle(journal_t *journal, handle_t *handle, alloc_transaction: if (!journal->j_running_transaction) { + /* + * If __GFP_FS is not present, then we may be being called from + * inside the fs writeback layer, so we MUST NOT fail. + */ + if ((gfp_mask & __GFP_FS) == 0) + gfp_mask |= __GFP_NOFAIL; new_transaction = kmem_cache_zalloc(transaction_cache, gfp_mask); - if (!new_transaction) { - /* - * If __GFP_FS is not present, then we may be - * being called from inside the fs writeback - * layer, so we MUST NOT fail. Since - * __GFP_NOFAIL is going away, we will arrange - * to retry the allocation ourselves. 
- */ - if ((gfp_mask & __GFP_FS) == 0) { - congestion_wait(BLK_RW_ASYNC, HZ/50); - goto alloc_transaction; - } + if (!new_transaction) return -ENOMEM; - } } jbd_debug(3, "New handle %p going live.\n", handle); -- 2.1.4 -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-20 23:15 ` Dave Chinner 2015-02-21 3:20 ` Theodore Ts'o @ 2015-02-21 11:12 ` Tetsuo Handa 2015-02-21 21:48 ` Dave Chinner 1 sibling, 1 reply; 83+ messages in thread
From: Tetsuo Handa @ 2015-02-21 11:12 UTC (permalink / raw)
To: david
Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds

My main issue is

c) whether to oom-kill more processes when the OOM victim cannot be terminated, presumably due to the OOM killer deadlock.

Dave Chinner wrote:
> On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote:
> > Dave Chinner wrote:
> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong way line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> >
> > I really care about the OOM Killer corner cases, for I'm
> >
> > (1) seeing trouble cases which occurred in enterprise systems
> >     under OOM conditions
>
> You reach OOM, then your SLAs are dead and buried. Reboot the
> box - its a much more reliable way of returning to a working system
> than playing Russian Roulette with the OOM killer.

What Service Level Agreements? Such troubles are occurring on RHEL systems where users are not sitting in front of the console. Unless somebody is sitting in front of the console, ready to do SysRq-b when trouble occurs, the system's downtime becomes significantly longer.

What mechanisms are available for minimizing the system's downtime when trouble occurs under OOM conditions? A software/hardware watchdog? Indeed it may help, but it may be triggered prematurely when the system has not actually entered the OOM condition. Only the OOM killer knows.
> > > (2) trying to downgrade OOM "Deadlock or Genocide" attacks (which > > an unprivileged user with a login shell can trivially trigger > > since Linux 2.0) to OOM "Genocide" attacks in order to allow > > OOM-unkillable daemons to restart OOM-killed processes > > > > (3) waiting for a bandaid for (2) in order to propose changes for > > mitigating OOM "Genocide" attacks (as bad guys will find how to > > trigger OOM "Deadlock or Genocide" attacks from changes for > > mitigating OOM "Genocide" attacks) > > Which is yet another indication that the OOM killer is the wrong > solution to the "lack of forward progress" problem. Any one can > generate enough memory pressure to trigger the OOM killer; we can't > prevent that from occurring when the OOM killer can be invoked by > user processes. > We have memory cgroups to reduce the possibility of triggering the OOM killer, though there will be several bugs remaining in RHEL kernels which make administrators hesitate to use memory cgroups. > > I started posting to linux-mm ML in order to make forward progress > > about (1) and (2). I don't want the memory allocation subsystem to > > lock up an entire system by indefinitely disabling memory releasing > > mechanism provided by the OOM killer. > > > > > I've proposed a method of providing this forward progress guarantee > > > for subsystems of arbitrary complexity, and this removes the > > > dependency on the OOM killer for fowards allocation progress in such > > > contexts (e.g. filesystems). We should be discussing how to > > > implement that, not what bandaids we need to apply to the OOM > > > killer. I want to fix the underlying problems, not push them under > > > the OOM-killer bus... > > > > I'm fine with that direction for new kernels provided that a simple > > bandaid which can be backported to distributor kernels for making > > OOM "Deadlock" attacks impossible is implemented. Therefore, I'm > > discussing what bandaids we need to apply to the OOM killer. 
>
> The band-aids being proposed are worse than the problem they are
> intended to cover up. In which case, the band-aids should not be
> applied.
>

The problem is simple. The /proc/sys/vm/panic_on_oom == 0 setting does not help if the OOM killer fails to determine the correct task to kill and grant it access to memory reserves. Under the OOM deadlock condition, the OOM killer waits forever rather than triggering a kernel panic.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-Swapping_and_Out_Of_Memory_Tips.html says that "Usually, oom_killer can kill rogue processes and the system will survive." but says nothing about what to do when we hit the OOM killer deadlock condition.

My band-aids allow the OOM killer to trigger a kernel panic (optionally followed by kdump and an automatic reboot) for people who want to reboot the box when the default /proc/sys/vm/panic_on_oom == 0 setting fails to kill rogue processes, and let the system survive for people who want that even when the OOM killer failed to determine the correct task to kill and grant it access to memory reserves.
Not only can we not expect the OOM killer messages to be saved to /var/log/messages under the OOM killer deadlock condition, but we also do not emit the OOM killer messages at all if we hit

void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
		      unsigned int points, unsigned long totalpages,
		      struct mem_cgroup *memcg, nodemask_t *nodemask,
		      const char *message)
{
	struct task_struct *victim = p;
	struct task_struct *child;
	struct task_struct *t;
	struct mm_struct *mm;
	unsigned int victim_points = 0;
	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	/*
	 * If the task is already exiting, don't alarm the sysadmin or kill
	 * its children or threads, just set TIF_MEMDIE so it can die quickly
	 */
	if (task_will_free_mem(p)) { /***** _THIS_ _CONDITION_ *****/
		set_tsk_thread_flag(p, TIF_MEMDIE);
		put_task_struct(p);
		return;
	}

	if (__ratelimit(&oom_rs))
		dump_header(p, gfp_mask, order, memcg, nodemask);

	task_lock(p);
	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
		message, task_pid_nr(p), p->comm, points);
	task_unlock(p);

followed by entering the OOM killer deadlock condition. This is annoying for me because neither the serial console nor netconsole helps find out that the system has entered the OOM condition.

If you want to stop people from playing Russian Roulette with the OOM killer, please remove the OOM killer code entirely from RHEL kernels, so that people must use their systems with a hardcoded /proc/sys/vm/panic_on_oom == 1 setting. Can you do it?

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>

_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 11:12 ` Tetsuo Handa @ 2015-02-21 21:48 ` Dave Chinner 0 siblings, 0 replies; 83+ messages in thread From: Dave Chinner @ 2015-02-21 21:48 UTC (permalink / raw) To: Tetsuo Handa Cc: hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sat, Feb 21, 2015 at 08:12:08PM +0900, Tetsuo Handa wrote: > My main issue is > > c) whether to oom-kill more processes when the OOM victim cannot be > terminated presumably due to the OOM killer deadlock. > > Dave Chinner wrote: > > On Fri, Feb 20, 2015 at 07:36:33PM +0900, Tetsuo Handa wrote: > > > Dave Chinner wrote: > > > > I really don't care about the OOM Killer corner cases - it's > > > > completely the wrong way line of development to be spending time on > > > > and you aren't going to convince me otherwise. The OOM killer a > > > > crutch used to justify having a memory allocation subsystem that > > > > can't provide forward progress guarantee mechanisms to callers that > > > > need it. > > > > > > I really care about the OOM Killer corner cases, for I'm > > > > > > (1) seeing trouble cases which occurred in enterprise systems > > > under OOM conditions > > > > You reach OOM, then your SLAs are dead and buried. Reboot the > > box - its a much more reliable way of returning to a working system > > than playing Russian Roulette with the OOM killer. > > What Service Level Agreements? Such troubles are occurring on RHEL systems > where users are not sitting in front of the console. Unless somebody is > sitting in front of the console in order to do SysRq-b when troubles > occur, the down time of system will become significantly longer. > > What mechanisms are available for minimizing the down time of system > when troubles under OOM condition occur? Software/hardware watchdog? > Indeed they may help, but they may be triggered prematurely when the > system has not entered into the OOM condition. Only the OOM killer knows. 
# echo 1 > /proc/sys/vm/panic_on_oom .... > We have memory cgroups to reduce the possibility of triggering the OOM > killer, though there will be several bugs remaining in RHEL kernels > which make administrators hesitate to use memory cgroups. Fix upstream first, then worry about vendor kernels. .... > Not only we cannot expect that the OOM killer messages being saved to > /var/log/messages under the OOM killer deadlock condition, but also CONFIG_PSTORE=y and configure appropriately from there. > we do not emit the OOM killer messages if we hit So add a warning. > If you want to stop people from playing Russian Roulette with the OOM > killer, please remove the OOM killer code entirely from RHEL kernels so that > people must use their systems with hardcoded /proc/sys/vm/panic_on_oom == 1 > setting. Can you do it? No. You need to go through vendor channels to get a vendor kernel config change made. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-19 22:52 ` Dave Chinner 2015-02-20 10:36 ` Tetsuo Handa @ 2015-02-21 23:52 ` Johannes Weiner 2015-02-23 0:45 ` Dave Chinner 1 sibling, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-02-21 23:52 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote: > I will actively work around aanything that causes filesystem memory > pressure to increase the chance of oom killer invocations. The OOM > killer is not a solution - it is, by definition, a loose cannon and > so we should be reducing dependencies on it. Once we have a better-working alternative, sure. > I really don't care about the OOM Killer corner cases - it's > completely the wrong way line of development to be spending time on > and you aren't going to convince me otherwise. The OOM killer a > crutch used to justify having a memory allocation subsystem that > can't provide forward progress guarantee mechanisms to callers that > need it. We can provide this. Are all these callers able to preallocate? 
--- diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 51bd1e72a917..af81b8a67651 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -380,6 +380,10 @@ extern void free_kmem_pages(unsigned long addr, unsigned int order); #define __free_page(page) __free_pages((page), 0) #define free_page(addr) free_pages((addr), 0) +void register_private_page(struct page *page, unsigned int order); +int alloc_private_pages(gfp_t gfp_mask, unsigned int order, unsigned int nr); +void free_private_pages(void); + void page_alloc_init(void); void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); void drain_all_pages(struct zone *zone); diff --git a/include/linux/sched.h b/include/linux/sched.h index 6d77432e14ff..1fe390779f23 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1545,6 +1545,8 @@ struct task_struct { #endif /* VM state */ + struct list_head private_pages; + struct reclaim_state *reclaim_state; struct backing_dev_info *backing_dev_info; diff --git a/kernel/fork.c b/kernel/fork.c index cf65139615a0..b6349b0e5da2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1308,6 +1308,8 @@ static struct task_struct *copy_process(unsigned long clone_flags, memset(&p->rss_stat, 0, sizeof(p->rss_stat)); #endif + INIT_LIST_HEAD(&p->private_pages); + p->default_timer_slack_ns = current->timer_slack_ns; task_io_accounting_init(&p->ioac); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a47f0b229a1a..546db4e0da75 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -490,12 +490,10 @@ static inline void clear_page_guard(struct zone *zone, struct page *page, static inline void set_page_order(struct page *page, unsigned int order) { set_page_private(page, order); - __SetPageBuddy(page); } static inline void rmv_page_order(struct page *page) { - __ClearPageBuddy(page); set_page_private(page, 0); } @@ -617,6 +615,7 @@ static inline void __free_one_page(struct page *page, list_del(&buddy->lru); zone->free_area[order].nr_free--; 
rmv_page_order(buddy); + __ClearPageBuddy(buddy); } combined_idx = buddy_idx & page_idx; page = page + (combined_idx - page_idx); @@ -624,6 +623,7 @@ static inline void __free_one_page(struct page *page, order++; } set_page_order(page, order); + __SetPageBuddy(page); /* * If this is not the largest possible page, check if the buddy @@ -924,6 +924,7 @@ static inline void expand(struct zone *zone, struct page *page, list_add(&page[size].lru, &area->free_list[migratetype]); area->nr_free++; set_page_order(&page[size], high); + __SetPageBuddy(page); } } @@ -1015,6 +1016,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, struct page, lru); list_del(&page->lru); rmv_page_order(page); + __ClearPageBuddy(page); area->nr_free--; expand(zone, page, order, current_order, area, migratetype); set_freepage_migratetype(page, migratetype); @@ -1212,6 +1214,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype) /* Remove the page from the freelists */ list_del(&page->lru); rmv_page_order(page); + __ClearPageBuddy(page); expand(zone, page, order, current_order, area, buddy_type); @@ -1598,6 +1601,7 @@ int __isolate_free_page(struct page *page, unsigned int order) list_del(&page->lru); zone->free_area[order].nr_free--; rmv_page_order(page); + __ClearPageBuddy(page); /* Set the pageblock if the isolated page is at least a pageblock */ if (order >= pageblock_order - 1) { @@ -2504,6 +2508,40 @@ retry: return page; } +/* Try to allocate from the caller's private memory reserves */ +static inline struct page * +__alloc_pages_private(gfp_t gfp_mask, unsigned int order, + const struct alloc_context *ac) +{ + unsigned int uninitialized_var(alloc_order); + struct page *page = NULL; + struct page *p; + + /* Dopy, but this is a slowpath right before OOM */ + list_for_each_entry(p, ¤t->private_pages, lru) { + int o = page_order(p); + + if (o >= order && (!page || o < alloc_order)) { + page = p; + alloc_order = o; + } + } + if (!page) + 
return NULL; + + list_del(&page->lru); + rmv_page_order(page); + + /* Give back the remainder */ + while (alloc_order > order) { + alloc_order--; + set_page_order(&page[1 << alloc_order], alloc_order); + list_add(&page[1 << alloc_order].lru, ¤t->private_pages); + } + + return page; +} + /* * This is called in the allocator slow-path if the allocation request is of * sufficient urgency to ignore watermarks and take other desperate measures @@ -2753,9 +2791,13 @@ retry: /* * If we fail to make progress by freeing individual * pages, but the allocation wants us to keep going, - * start OOM killing tasks. + * dip into private reserves, or start OOM killing. */ if (!did_some_progress) { + page = __alloc_pages_private(gfp_mask, order, ac); + if (page) + goto got_pg; + page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); if (page) @@ -3046,6 +3088,82 @@ void free_pages_exact(void *virt, size_t size) EXPORT_SYMBOL(free_pages_exact); /** + * alloc_private_pages - allocate private memory reserve pages + * @gfp_mask: gfp flags for the allocations + * @order: order of pages to allocate + * @nr: number of pages to allocate + * + * This allocates @nr pages of order @order as an emergency reserve of + * the calling task, to be used by the page allocator if an allocation + * would otherwise fail. + * + * The caller is responsible for calling free_private_pages() once the + * reserves are no longer required. 
+ */ +int alloc_private_pages(gfp_t gfp_mask, unsigned int order, unsigned int nr) +{ + struct page *page, *page2; + LIST_HEAD(pages); + unsigned int i; + + for (i = 0; i < nr; i++) { + page = alloc_pages(gfp_mask, order); + if (!page) + goto error; + set_page_order(page, order); + list_add(&page->lru, &pages); + } + + list_splice(&pages, ¤t->private_pages); + return 0; + +error: + list_for_each_entry_safe(page, page2, &pages, lru) { + list_del(&page->lru); + rmv_page_order(page); + __free_pages(page, order); + } + return -ENOMEM; +} + +/** + * register_private_page - register a private memory reserve page + * @page: pre-allocated page + * @order: @page's order + * + * This registers @page as an emergency reserve of the calling task, + * to be used by the page allocator if an allocation would otherwise + * fail. + * + * The caller is responsible for calling free_private_pages() once the + * reserves are no longer required. + */ +void register_private_page(struct page *page, unsigned int order) +{ + set_page_order(page, order); + list_add(&page->lru, ¤t->private_pages); +} + +/** + * free_private_pages - free all private memory reserve pages + * + * Frees all (remaining) pages of the calling task's memory reserves + * established by alloc_private_pages() and register_private_page(). 
+ */ +void free_private_pages(void) +{ + struct page *page, *page2; + + list_for_each_entry_safe(page, page2, ¤t->private_pages, lru) { + int order = page_order(page); + + list_del(&page->lru); + rmv_page_order(page); + __free_pages(page, order); + } +} + +/** * nr_free_zone_pages - count number of pages beyond high watermark * @offset: The zone index of the highest zone * @@ -6551,6 +6669,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) #endif list_del(&page->lru); rmv_page_order(page); + __ClearPageBuddy(page); zone->free_area[order].nr_free--; for (i = 0; i < (1 << order); i++) SetPageReserved((page+i)); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 83+ messages in thread
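To make the proposed interface above concrete, here is a small userspace model of the same idea: a task-local reserve of preallocated blocks that the allocation slow path falls back to before failing. The struct, function names, and sizes are illustrative stand-ins, not the kernel API from the patch:

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace model of a per-task private reserve: a task owns a few
 * preallocated blocks and dips into them only when the normal
 * allocator fails. All names here are hypothetical. */
#define RESERVE_SLOTS 4

struct task_reserve {
	void *pages[RESERVE_SLOTS];
	int nr;
};

/* Fill the reserve up front (analogous to alloc_private_pages()):
 * returns -1 if the reserve cannot be established, so the caller can
 * fail the operation *before* entering the critical section. */
static int reserve_fill(struct task_reserve *r, size_t size, int nr)
{
	int i;

	for (i = 0; i < nr && r->nr < RESERVE_SLOTS; i++) {
		void *p = malloc(size);
		if (!p)
			return -1;	/* -ENOMEM: caller can back out */
		r->pages[r->nr++] = p;
	}
	return 0;
}

/* Allocation slow path: try the allocator, then the private pool. */
static void *reserve_alloc(struct task_reserve *r,
			   void *(*alloc)(size_t), size_t size)
{
	void *p = alloc(size);

	if (p)
		return p;
	if (r->nr > 0)
		return r->pages[--r->nr];	/* dip into the reserve */
	return NULL;
}

/* Return unused reserve blocks (analogous to free_private_pages()). */
static void reserve_drain(struct task_reserve *r)
{
	while (r->nr > 0)
		free(r->pages[--r->nr]);
}

/* Allocator that always fails, to exercise the fallback path. */
static void *failing_alloc(size_t size) { (void)size; return NULL; }
```

Note that this is exactly the per-operation preallocate/free cycle whose fast-path cost the follow-up replies object to; it guarantees forward progress, but the reserve is filled on every operation whether it is used or not.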
* Re: How to handle TIF_MEMDIE stalls? 2015-02-21 23:52 ` Johannes Weiner @ 2015-02-23 0:45 ` Dave Chinner 2015-02-23 1:29 ` Andrew Morton ` (3 more replies) 0 siblings, 4 replies; 83+ messages in thread From: Dave Chinner @ 2015-02-23 0:45 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote: > On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote: > > I will actively work around aanything that causes filesystem memory > > pressure to increase the chance of oom killer invocations. The OOM > > killer is not a solution - it is, by definition, a loose cannon and > > so we should be reducing dependencies on it. > > Once we have a better-working alternative, sure. Great, but first a simple request: please stop writing code and instead start architecting a solution to the problem. i.e. we need a design and have that documented before code gets written. If you watched my recent LCA talk, then you'll understand what I mean when I say: stop programming and start engineering. > > I really don't care about the OOM Killer corner cases - it's > > completely the wrong way line of development to be spending time on > > and you aren't going to convince me otherwise. The OOM killer a > > crutch used to justify having a memory allocation subsystem that > > can't provide forward progress guarantee mechanisms to callers that > > need it. > > We can provide this. Are all these callers able to preallocate? Anything that allocates in transaction context (and therefor is GFP_NOFS by definition) can preallocate at transaction reservation time. However, preallocation is dumb, complex, CPU and memory intensive and will have a *massive* impact on performance. Allocating 10-100 pages to a reserve which we will almost *never use* and then free them again *on every single transaction* is a lot of unnecessary additional fast path overhead. 
Hence a "preallocate for every context" reserve pool is not a viable solution.

And, really, "reservation" != "preallocation". Maybe it's my filesystem background, but those two things are vastly different things. Reservations are simply an *accounting* of the maximum amount of a reserve required by an operation to guarantee forward progress. In filesystems, we do this for log space (transactions) and some do it for filesystem space (e.g. delayed allocation needs correct ENOSPC detection so we don't overcommit disk space). The VM already has such concepts (e.g. watermarks and things like min_free_kbytes) that it uses to ensure that there are sufficient reserves for certain types of allocations to succeed.

A reserve memory pool is no different - every time a memory reserve occurs, a watermark is lifted to accommodate it, and the transaction is not allowed to proceed until the amount of free memory exceeds that watermark. The memory allocation subsystem then only allows *allocations* marked correctly to allocate pages from the reserve that watermark protects. e.g. only allocations using __GFP_RESERVE are allowed to dip into the reserve pool.

By using watermarks, freeing of memory will automatically top up the reserve pool, which means that we guarantee that reclaimable memory allocated for demand paging during transactions doesn't deplete the reserve pool permanently. As a result, when there is plenty of free and/or reclaimable memory, the reserve pool watermarks will have almost zero impact on performance and behaviour.

Further, because it's just accounting and behavioural thresholds, this allows the mm subsystem to control how the reserve pool is accounted internally. e.g. clean, reclaimable pages in the page cache could serve as reserve pool pages as they can be immediately reclaimed for allocation. This could be achieved by setting reclaim targets first to the reserve pool watermark, then the second target is enough pages to satisfy the current allocation.
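As a toy model, the watermark-based accounting described above might look like the following. This is plain userspace accounting, not kernel code; the struct, the `may_use_reserve` flag (standing in for the suggested __GFP_RESERVE), and all numbers are hypothetical:

```c
/* Sketch of watermark-style reserve accounting: a reservation does
 * not set pages aside, it just lifts the free-page floor. Only
 * allocations flagged as reserve users may consume pages below the
 * lifted floor. */
struct zone_acct {
	long nr_free;
	long watermark;		/* base watermark */
	long reserved;		/* sum of active reservations */
};

/* A transaction reserves its worst-case page count up front; it may
 * not proceed until free memory clears the lifted watermark. */
static int reserve_pages(struct zone_acct *z, long nr)
{
	if (z->nr_free < z->watermark + z->reserved + nr)
		return -1;	/* caller must wait for reclaim first */
	z->reserved += nr;
	return 0;
}

static void unreserve_pages(struct zone_acct *z, long nr)
{
	z->reserved -= nr;
}

/* may_use_reserve models a __GFP_RESERVE-style flag: reserve users
 * see only the base watermark, everyone else sees the lifted one. */
static int alloc_page_acct(struct zone_acct *z, int may_use_reserve)
{
	long floor = may_use_reserve ? z->watermark
				     : z->watermark + z->reserved;

	if (z->nr_free <= floor)
		return -1;
	z->nr_free--;
	return 0;
}
```

Because nothing is physically set aside, unrelated allocations proceed normally until free memory approaches the lifted floor, which is why the scheme costs nearly nothing when memory is plentiful.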
And, FWIW, there's nothing stopping this mechanism from having order-based reserve thresholds. e.g. IB could really do with a 64k reserve pool threshold and hence help solve the long-standing problems they have with filling the receive ring in GFP_ATOMIC context...

Sure, that's looking further down the track, but my point still remains: we need a viable long-term solution to this problem. Maybe reservations are not the solution, but I don't see anyone else who is thinking about how to address this architectural problem at a system level right now. We need to design and document the model first, then review it, then we can start working at the code level to implement the solution we've designed.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 0:45 ` Dave Chinner @ 2015-02-23 1:29 ` Andrew Morton 2015-02-23 7:32 ` Dave Chinner 2015-02-28 16:29 ` Johannes Weiner ` (2 subsequent siblings) 3 siblings, 1 reply; 83+ messages in thread From: Andrew Morton @ 2015-02-23 1:29 UTC (permalink / raw) To: Dave Chinner Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, torvalds On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: > > > I really don't care about the OOM Killer corner cases - it's > > > completely the wrong way line of development to be spending time on > > > and you aren't going to convince me otherwise. The OOM killer a > > > crutch used to justify having a memory allocation subsystem that > > > can't provide forward progress guarantee mechanisms to callers that > > > need it. > > > > We can provide this. Are all these callers able to preallocate? > > Anything that allocates in transaction context (and therefor is > GFP_NOFS by definition) can preallocate at transaction reservation > time. However, preallocation is dumb, complex, CPU and memory > intensive and will have a *massive* impact on performance. > Allocating 10-100 pages to a reserve which we will almost *never > use* and then free them again *on every single transaction* is a lot > of unnecessary additional fast path overhead. Hence a "preallocate > for every context" reserve pool is not a viable solution. Yup. > Reservations are simply an *accounting* of the maximum amount of a > reserve required by an operation to guarantee forwards progress. In > filesystems, we do this for log space (transactions) and some do it > for filesystem space (e.g. delayed allocation needs correct ENOSPC > detection so we don't overcommit disk space). The VM already has > such concepts (e.g. watermarks and things like min_free_kbytes) that > it uses to ensure that there are sufficient reserves for certain > types of allocations to succeed. 
Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic reserve. So to reserve N pages we increase the page allocator dynamic reserve by N, do some reclaim if necessary then deposit N tokens into the caller's task_struct (it'll be a set of zone/nr-pages tuples I suppose). When allocating pages the caller should drain its reserves in preference to dipping into the regular freelist. This guy has already done his reclaim and shouldn't be penalised a second time. I guess Johannes's preallocation code should switch to doing this for the same reason, plus the fact that snipping a page off task_struct.prealloc_pages is super-fast and needs to be done sometime anyway so why not do it by default. Both reservation and preallocation are vulnerable to deadlocks - 10,000 tasks all trying to reserve/prealloc 100 pages, they all have 50 pages and we ran out of memory. Whoops. We can undeadlock by returning ENOMEM but I suspect there will still be problematic situations where massive numbers of pages are temporarily AWOL. Perhaps some form of queuing and throttling will be needed, to limit the peak number of reserved pages. Per zone, I guess. And it'll be a huge pain handling order>0 pages. I'd be inclined to make it order-0 only, and tell the lamer callers that vmap-is-thattaway. Alas, one lame caller is slub. But the biggest issue is how the heck does a caller work out how many pages to reserve/prealloc? Even a single sb_bread() - it's sitting on loop on a sparse NTFS file on loop on a five-deep DM stack on a six-deep MD stack on loop on NFS on an eleventy-deep networking stack. And then there will be an unknown number of slab allocations of unknown size with unknown slabs-per-page rules - how many pages needed for them? And to make it much worse, how many pages of which orders? Bless its heart, slub will go and use a 1-order page for allocations which should have been in 0-order pages.. 
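A rough userspace model of this token scheme (a single zone for simplicity; all names are illustrative, not a proposed kernel interface):

```c
/* Toy model of the dynamic-reserve/token idea sketched above:
 * reserving N pages lifts a zone-wide dynamic reserve and deposits
 * N tokens with the caller; the caller's own allocations drain its
 * tokens before touching the shared freelist. */
struct zone_tok {
	long nr_free;
	long dyn_reserve;	/* sum of outstanding tokens */
};

struct task_tok {
	long tokens;		/* this task's deposit for one zone */
};

/* Lift the dynamic reserve and hand out tokens; in the kernel this
 * is where reclaim would run if the reserve cannot be covered. */
static int reserve_tokens(struct zone_tok *z, struct task_tok *t, long nr)
{
	if (z->nr_free - z->dyn_reserve < nr)
		return -1;	/* would need reclaim first */
	z->dyn_reserve += nr;
	t->tokens += nr;
	return 0;
}

static int alloc_page_tok(struct zone_tok *z, struct task_tok *t)
{
	if (t->tokens > 0) {
		/* Drain own tokens first: this task already paid the
		 * reclaim cost and shouldn't be penalised again. */
		t->tokens--;
		z->dyn_reserve--;
		z->nr_free--;
		return 0;
	}
	if (z->nr_free <= z->dyn_reserve)
		return -1;	/* only reserved pages remain */
	z->nr_free--;
	return 0;
}
```

The deadlock the paragraph above worries about shows up here as reserve_tokens() failing once outstanding deposits approach nr_free, which is where queueing/throttling would have to come in.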
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 1:29 ` Andrew Morton @ 2015-02-23 7:32 ` Dave Chinner 2015-02-27 18:24 ` Vlastimil Babka ` (2 more replies) 0 siblings, 3 replies; 83+ messages in thread From: Dave Chinner @ 2015-02-23 7:32 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, torvalds On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: > > > > > I really don't care about the OOM Killer corner cases - it's > > > > completely the wrong line of development to be spending time on > > > > and you aren't going to convince me otherwise. The OOM killer is a > > > > crutch used to justify having a memory allocation subsystem that > > > > can't provide forward progress guarantee mechanisms to callers that > > > > need it. > > > > > > We can provide this. Are all these callers able to preallocate? > > > > Anything that allocates in transaction context (and therefore is > > GFP_NOFS by definition) can preallocate at transaction reservation > > time. However, preallocation is dumb, complex, CPU and memory > > intensive and will have a *massive* impact on performance. > > Allocating 10-100 pages to a reserve which we will almost *never > > use* and then free them again *on every single transaction* is a lot > > of unnecessary additional fast path overhead. Hence a "preallocate > > for every context" reserve pool is not a viable solution. > > Yup. > > > Reservations are simply an *accounting* of the maximum amount of a > > reserve required by an operation to guarantee forwards progress. In > > filesystems, we do this for log space (transactions) and some do it > > for filesystem space (e.g. delayed allocation needs correct ENOSPC > > detection so we don't overcommit disk space). The VM already has > > such concepts (e.g. 
watermarks and things like min_free_kbytes) that > > it uses to ensure that there are sufficient reserves for certain > > types of allocations to succeed. > > Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic > reserve. So to reserve N pages we increase the page allocator dynamic > reserve by N, do some reclaim if necessary then deposit N tokens into > the caller's task_struct (it'll be a set of zone/nr-pages tuples I > suppose). > > When allocating pages the caller should drain its reserves in > preference to dipping into the regular freelist. This guy has already > done his reclaim and shouldn't be penalised a second time. I guess > Johannes's preallocation code should switch to doing this for the same > reason, plus the fact that snipping a page off > task_struct.prealloc_pages is super-fast and needs to be done sometime > anyway so why not do it by default. That is at odds with the requirements of demand paging, which allocates for objects that are reclaimable within the course of the transaction. The reserve is there to ensure forward progress for allocations for objects that aren't freed until after the transaction completes, but if we drain it for reclaimable objects we then have nothing left in the reserve pool when we actually need it. We do not know ahead of time if the object we are allocating is going to be modified and hence locked into the transaction. Hence we can't say "use the reserve for this *specific* allocation", and so the only guidance we can really give is "we will allocate and *permanently consume* this much memory", and the reserve pool needs to cover that consumption to guarantee forwards progress. Forwards progress for all other allocations is guaranteed because they are reclaimable objects - they are either freed directly back to their source (slab, heap, page lists) or they are freed by shrinkers once they have been released from the transaction. 
Hence we need allocations to come from the free list and trigger reclaim, regardless of the fact there is a reserve pool there. The reserve pool needs to be a last resort once there are no other avenues to allocate memory. i.e. it would be used to replace the OOM killer for GFP_NOFAIL allocations. > Both reservation and preallocation are vulnerable to deadlocks - 10,000 > tasks all trying to reserve/prealloc 100 pages, they all have 50 pages > and we ran out of memory. Whoops. Yes, that's the big problem with preallocation, as well as your proposed "deplete the reserved memory first" approach. They *require* up front "preallocation" of free memory, either directly by the application, or internally by the mm subsystem. Hence my comments about appropriate classification of "reserved memory". Reserved memory does not necessarily need to be on the free list. It could be "immediately reclaimable" memory, so that reserving memory doesn't need to immediately reclaim memory, but it can be pulled from the reclaimable memory reserves when memory pressure occurs. If there is no memory pressure, we do nothing because we have no need to do anything.... > We can undeadlock by returning ENOMEM but I suspect there will > still be problematic situations where massive numbers of pages are > temporarily AWOL. Perhaps some form of queuing and throttling > will be needed, Yes, I think that is necessary, but I don't see it as necessary in the MM subsystem. XFS already has a ticket-based queueing mechanism for throttling concurrent access to ensure we don't overcommit log space and I'd want to tie the two together... > to limit the peak number of reserved pages. Per > zone, I guess. Internal implementation issue that I don't really care about. When it comes to guaranteeing memory allocation, global context is all I care about. Locality of allocation simply doesn't matter; we want that page we reserved, no matter where it is located. > And it'll be a huge pain handling order>0 pages. 
I'd be inclined > to make it order-0 only, and tell the lamer callers that > vmap-is-thattaway. Alas, one lame caller is slub. Sure, but vmap requires GFP_KERNEL memory allocation and we're talking about allocation in transactions, which are GFP_NOFS. I've lost count of the number of times we've asked for that problem to be fixed. Refusing to fix it has simply led to the growing use of ugly hacks around that problem (i.e. memalloc_noio_save() and friends). > But the biggest issue is how the heck does a caller work out how > many pages to reserve/prealloc? Even a single sb_bread() - it's > sitting on loop on a sparse NTFS file on loop on a five-deep DM > stack on a six-deep MD stack on loop on NFS on an eleventy-deep > networking stack. Each subsystem needs to take care of itself first, then we can worry about esoteric stacking requirements. Besides, stacking requirements through the IO layer are still pretty trivial - we only need to guarantee single IO progress from the highest layer as it can be recycled again and again for every IO that needs to be done. And, because mempools already give that guarantee to most block devices and drivers, we won't need to reserve memory for most block devices to make forwards progress. It's only crazy "recurse through filesystem" configurations where this will be an issue. > And then there will be an unknown number of > slab allocations of unknown size with unknown slabs-per-page rules > - how many pages needed for them? However many pages needed to allocate the number of objects we'll consume from the slab. > And to make it much worse, how > many pages of which orders? Bless its heart, slub will go and use > a 1-order page for allocations which should have been in 0-order > pages.. 
The majority of allocations will be order-0, though if we know that there are going to be significant numbers of high order allocations, then it should be simple enough to tell the mm subsystem "need a reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have memory compaction just do its stuff. But, IMO, we should cross that bridge when somebody actually needs reservations to be that specific.... Cheers, Dave. -- Dave Chinner david@fromorbit.com 
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 7:32 ` Dave Chinner @ 2015-02-27 18:24 ` Vlastimil Babka 2015-02-28 0:03 ` Dave Chinner 2015-03-02 9:39 ` Vlastimil Babka 2015-03-02 20:22 ` Johannes Weiner 2 siblings, 1 reply; 83+ messages in thread From: Vlastimil Babka @ 2015-02-27 18:24 UTC (permalink / raw) To: Dave Chinner, Andrew Morton Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, torvalds On 02/23/2015 08:32 AM, Dave Chinner wrote: >> > And then there will be an unknown number of >> > slab allocations of unknown size with unknown slabs-per-page rules >> > - how many pages needed for them? > However many pages needed to allocate the number of objects we'll > consume from the slab. I think the best way is if slab could also learn to provide reserves for individual objects. Either just mark internally how many of them are reserved, if sufficient number is free, or translate this to the page allocator reserves, as slab knows which order it uses for the given objects. >> > And to make it much worse, how >> > many pages of which orders? Bless its heart, slub will go and use >> > a 1-order page for allocations which should have been in 0-order >> > pages.. > The majority of allocations will be order-0, though if we know that > they are going to be significant numbers of high order allocations, > then it should be simple enough to tell the mm subsystem "need a > reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have > memory compaction just do it's stuff. But, IMO, we should cross that > bridge when somebody actually needs reservations to be that > specific.... Note that watermark checking for higher-order allocations is somewhat fuzzy compared to order-0 checks, but I guess some kind of reservations could work there too. 
* Re: How to handle TIF_MEMDIE stalls? 2015-02-27 18:24 ` Vlastimil Babka @ 2015-02-28 0:03 ` Dave Chinner 2015-02-28 15:17 ` Theodore Ts'o 0 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-02-28 0:03 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Fri, Feb 27, 2015 at 07:24:34PM +0100, Vlastimil Babka wrote: > On 02/23/2015 08:32 AM, Dave Chinner wrote: > >> > And then there will be an unknown number of > >> > slab allocations of unknown size with unknown slabs-per-page rules > >> > - how many pages needed for them? > > However many pages needed to allocate the number of objects we'll > > consume from the slab. > > I think the best way is if slab could also learn to provide reserves for > individual objects. Either just mark internally how many of them are reserved, > if sufficient number is free, or translate this to the page allocator reserves, > as slab knows which order it uses for the given objects. Which is effectively what a slab based mempool is. Mempools don't guarantee a reserve is available once it's been resized, however, and we'd have to have mempools configured for every type of allocation we are going to do. So from that perspective it's not really a solution. Further, the kmalloc heap is backed by slab caches. We do *lots* of variable sized kmalloc allocations in transactions the size of which aren't known until allocation time. In that case, we have to assume it's going to be a page per object, because the allocations could actually be that size. AFAICT, the worst case is a slab-backing page allocation for every slab object that is allocated, so we may as well cater for that case from the start... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com 
* Re: How to handle TIF_MEMDIE stalls? 2015-02-28 0:03 ` Dave Chinner @ 2015-02-28 15:17 ` Theodore Ts'o 0 siblings, 0 replies; 83+ messages in thread From: Theodore Ts'o @ 2015-02-28 15:17 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds, Vlastimil Babka On Sat, Feb 28, 2015 at 11:03:59AM +1100, Dave Chinner wrote: > > I think the best way is if slab could also learn to provide reserves for > > individual objects. Either just mark internally how many of them are reserved, > > if sufficient number is free, or translate this to the page allocator reserves, > > as slab knows which order it uses for the given objects. > > Which is effectively what a slab based mempool is. Mempools don't > guarantee a reserve is available once it's been resized, however, > and we'd have to have mempools configured for every type of > allocation we are going to do. So from that perspective it's not > really a solution. The bigger problem is it means that the upper layer which is making the reservation before it starts taking locks won't necessarily know exactly which slab objects it and all of the lower layers might need. So it's much more flexible, and requires less accuracy, if we can just request that (a) the mm subsystem reserves at least N pages, and (b) tell it that at this point in time, it's safe for the requesting subsystem to block until N pages are available. Can this be guaranteed to be accurate? No, of course not. And in some cases, it may not be possible to predict, since it might depend on whether the iSCSI device needs to reconnect to the target, or some sort of exception handling, before it can complete its I/O request. 
But it's better than what we have now, which is that once we've taken certain locks, and/or started a complex transaction, we can't really back out, so we end up looping either using GFP_NOFAIL, or around the memory allocation request if there are still mm developers who are delusional enough to believe, like King Canute, that "You must always be able to handle memory allocation at any point in the kernel and GFP_NOFAIL is an indication of a subsystem bug!" I can imagine using some adjustment factors, where a particular voracious device might require a hint to the file system to boost its memory allocation estimate by 30%, or 50%. So yes, it's a very, *very* rough estimate. And if we guess wrong, we might end up having to loop ala GFP_NOFAIL anyway. But it's better than not having such an estimate. I also grant that this doesn't work very well for emergency writeback, or background writeback, where we can't and shouldn't block waiting for enough memory to become free, since page cleaning is one of the ways that we might be able to make memory available. But if that's the only problem we have, we're in good shape, since that can be solved by either (a) doing a better job throttling memory allocations or memory reservation requests in the first place, and/or (b) starting the background writeback much more aggressively and earlier. - Ted 
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 7:32 ` Dave Chinner 2015-02-27 18:24 ` Vlastimil Babka @ 2015-03-02 9:39 ` Vlastimil Babka 2015-03-02 22:31 ` Dave Chinner 2015-03-02 20:22 ` Johannes Weiner 2 siblings, 1 reply; 83+ messages in thread From: Vlastimil Babka @ 2015-03-02 9:39 UTC (permalink / raw) To: Dave Chinner, Andrew Morton Cc: Johannes Weiner, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, torvalds On 02/23/2015 08:32 AM, Dave Chinner wrote: > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: >> On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: >> >> Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic >> reserve. So to reserve N pages we increase the page allocator dynamic >> reserve by N, do some reclaim if necessary then deposit N tokens into >> the caller's task_struct (it'll be a set of zone/nr-pages tuples I >> suppose). >> >> When allocating pages the caller should drain its reserves in >> preference to dipping into the regular freelist. This guy has already >> done his reclaim and shouldn't be penalised a second time. I guess >> Johannes's preallocation code should switch to doing this for the same >> reason, plus the fact that snipping a page off >> task_struct.prealloc_pages is super-fast and needs to be done sometime >> anyway so why not do it by default. > > That is at odds with the requirements of demand paging, which > allocate for objects that are reclaimable within the course of the > transaction. The reserve is there to ensure forward progress for > allocations for objects that aren't freed until after the > transaction completes, but if we drain it for reclaimable objects we > then have nothing left in the reserve pool when we actually need it. > > We do not know ahead of time if the object we are allocating is > going to be modified and hence locked into the transaction. 
Hence we > can't say "use the reserve for this *specific* allocation", and so > the only guidance we can really give is "we will allocate and > *permanently consume* this much memory", and the reserve pool needs > to cover that consumption to guarantee forwards progress. I'm not sure I understand properly. You don't know if a specific allocation is permanent or reclaimable, but you can tell in advance how much in total will be permanent? Is it because you are conservative and assume everything will be permanent, or how? Can you at least at some later point in the transaction recognize that "OK, this object was not permanent after all" and tell mm that it can lower your reserve? > Forwards progress for all other allocations is guaranteed because > they are reclaimable objects - they either freed directly back to > their source (slab, heap, page lists) or they are freed by shrinkers > once they have been released from the transaction. Which are the "all other allocations?" Above you wrote that all allocations are treated as potentially permanent. Also how does the fact that an object is later reclaimable affect forward progress during its allocation? Or are you talking about allocations from contexts that don't use reserves? > Hence we need allocations to come from the free list and trigger > reclaim, regardless of the fact there is a reserve pool there. The > reserve pool needs to be a last resort once there are no other > avenues to allocate memory. i.e. it would be used to replace the OOM > killer for GFP_NOFAIL allocations. That's probably going to result in a lot of wasted memory and I still don't understand why it's needed, if your reserve estimate is guaranteed to cover the worst-case. >> Both reservation and preallocation are vulnerable to deadlocks - 10,000 >> tasks all trying to reserve/prealloc 100 pages, they all have 50 pages >> and we ran out of memory. Whoops. 
> > Yes, that's the big problem with preallocation, as well as your > proposed "deplete the reserved memory first" approach. They > *require* up front "preallocation" of free memory, either directly > by the application, or internally by the mm subsystem. I don't see why it would deadlock, if during reserve time the mm can return ENOMEM as the reserver should be able to back out at that point. 
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 9:39 ` Vlastimil Babka @ 2015-03-02 22:31 ` Dave Chinner 2015-03-03 9:13 ` Vlastimil Babka 2015-03-07 0:20 ` Johannes Weiner 0 siblings, 2 replies; 83+ messages in thread From: Dave Chinner @ 2015-03-02 22:31 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote: > On 02/23/2015 08:32 AM, Dave Chinner wrote: > >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: > >> > >>Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic > >>reserve. So to reserve N pages we increase the page allocator dynamic > >>reserve by N, do some reclaim if necessary then deposit N tokens into > >>the caller's task_struct (it'll be a set of zone/nr-pages tuples I > >>suppose). > >> > >>When allocating pages the caller should drain its reserves in > >>preference to dipping into the regular freelist. This guy has already > >>done his reclaim and shouldn't be penalised a second time. I guess > >>Johannes's preallocation code should switch to doing this for the same > >>reason, plus the fact that snipping a page off > >>task_struct.prealloc_pages is super-fast and needs to be done sometime > >>anyway so why not do it by default. > > > >That is at odds with the requirements of demand paging, which > >allocate for objects that are reclaimable within the course of the > >transaction. The reserve is there to ensure forward progress for > >allocations for objects that aren't freed until after the > >transaction completes, but if we drain it for reclaimable objects we > >then have nothing left in the reserve pool when we actually need it. > > > >We do not know ahead of time if the object we are allocating is > >going to be modified and hence locked into the transaction. 
Hence we > >can't say "use the reserve for this *specific* allocation", and so > >the only guidance we can really give is "we will to allocate and > >*permanently consume* this much memory", and the reserve pool needs > >to cover that consumption to guarantee forwards progress. > > I'm not sure I understand properly. You don't know if a specific > allocation is permanent or reclaimable, but you can tell in advance > how much in total will be permanent? Is it because you are > conservative and assume everything will be permanent, or how? Because we know the worst case object modification constraints *exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know exactly what in memory objects we lock into the transaction and what memory is required to modify and track those objects. e.g: for a data extent allocation, the log reservation is as such: /* * In a write transaction we can allocate a maximum of 2 * extents. This gives: * the inode getting the new extents: inode size * the inode's bmap btree: max depth * block size * the agfs of the ags from which the extents are allocated: 2 * sector * the superblock free block counter: sector size * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size * And the bmap_finish transaction can free bmap blocks in a join: * the agfs of the ags containing the blocks: 2 * sector size * the agfls of the ags containing the blocks: 2 * sector size * the super block free block counter: sector size * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size */ STATIC uint xfs_calc_write_reservation( struct xfs_mount *mp) { return XFS_DQUOT_LOGRES(mp) + MAX((xfs_calc_inode_res(mp, 1) + xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), XFS_FSB_TO_B(mp, 1)) + xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2), XFS_FSB_TO_B(mp, 1))), (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) + xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2), XFS_FSB_TO_B(mp, 1)))); } It's 
trivial to extend this logic to memory allocation requirements, because the above is an exact encoding of all the objects we "permanently consume" memory for within the transaction. What we don't know is how many objects we might need to scan to find the objects we will eventually modify. Here's an (admittedly extreme) example to demonstrate a worst case scenario: allocate a 64k data extent. Because it is an exact size allocation, we look it up in the by-size free space btree. Free space is fragmented, so there are about a million 64k free space extents in the tree. Once we find the first 64k extent, we search them to find the best locality target match. The btree records are 16 bytes each, so we fit roughly 500 to a 4k block. Say we search half the extents to find the best match - i.e. we walk a thousand leaf blocks before finding the match we want, and modify that leaf block. Now, the modification removed an entry from the leaf and that triggers leaf merge thresholds, so a merge with the 1002nd block occurs. That block now demand pages in and we then modify and join it to the transaction. Now we walk back up the btree to update indexes, merging blocks all the way back up to the root. We have a worst case size btree (5 levels) and we merge at every level meaning we demand page another 8 btree blocks and modify them. In this case, we've demand paged ~1010 btree blocks, but only modified 10 of them. i.e. the memory we consumed permanently was only 10 4k buffers (approx. 10 slab and 10 page allocations), but the allocation demand was 2 orders of magnitude more than the unreclaimable memory consumption of the btree modification. I hope you start to see the scope of the problem now... > Can you at least at some later point in transaction recognize that > "OK, this object was not permanent after all" and tell mm that it > can lower your reserve? I'm not including any memory used by objects we know won't be locked into the transaction in the reserve. 
Demand paged object memory is essentially unbound but is easily reclaimable. That reclaim will give us forward progress guarantees on the memory required here. > >Yes, that's the big problem with preallocation, as well as your > >proposed "depelete the reserved memory first" approach. They > >*require* up front "preallocation" of free memory, either directly > >by the application, or internally by the mm subsystem. > > I don't see why it would deadlock, if during reserve time the mm can > return ENOMEM as the reserver should be able to back out at that > point. Preallocated reserves do not allow for unbound demand paging of reclaimable objects within reserved allocation contexts. Cheers Dave. -- Dave Chinner david@fromorbit.com 
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 22:31 ` Dave Chinner @ 2015-03-03 9:13 ` Vlastimil Babka 2015-03-04 1:33 ` Dave Chinner 2015-03-07 0:20 ` Johannes Weiner 1 sibling, 1 reply; 83+ messages in thread From: Vlastimil Babka @ 2015-03-03 9:13 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On 03/02/2015 11:31 PM, Dave Chinner wrote: > On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote: >> On 02/23/2015 08:32 AM, Dave Chinner wrote: >> >On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: >> >>On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@fromorbit.com> wrote: >> >We do not know ahead of time if the object we are allocating is >> >going to be modified and hence locked into the transaction. Hence we >> >can't say "use the reserve for this *specific* allocation", and so >> >the only guidance we can really give is "we will allocate and >> >*permanently consume* this much memory", and the reserve pool needs >> >to cover that consumption to guarantee forwards progress. >> >> I'm not sure I understand properly. You don't know if a specific >> allocation is permanent or reclaimable, but you can tell in advance >> how much in total will be permanent? Is it because you are >> conservative and assume everything will be permanent, or how? > > Because we know the worst case object modification constraints > *exactly* (e.g. see fs/xfs/libxfs/xfs_trans_resv.c), we know > exactly what in memory objects we lock into the transaction and what > memory is required to modify and track those objects. e.g: for a > data extent allocation, the log reservation is as such: > > /* > * In a write transaction we can allocate a maximum of 2 > * extents. 
This gives: > * the inode getting the new extents: inode size > * the inode's bmap btree: max depth * block size > * the agfs of the ags from which the extents are allocated: 2 * sector > * the superblock free block counter: sector size > * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size > * And the bmap_finish transaction can free bmap blocks in a join: > * the agfs of the ags containing the blocks: 2 * sector size > * the agfls of the ags containing the blocks: 2 * sector size > * the super block free block counter: sector size > * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size > */ > STATIC uint > xfs_calc_write_reservation( > struct xfs_mount *mp) > { > return XFS_DQUOT_LOGRES(mp) + > MAX((xfs_calc_inode_res(mp, 1) + > xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), > XFS_FSB_TO_B(mp, 1)) + > xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + > xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2), > XFS_FSB_TO_B(mp, 1))), > (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) + > xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2), > XFS_FSB_TO_B(mp, 1)))); > } > > It's trivial to extend this logic to to memory allocation > requirements, because the above is an exact encoding of all the > objects we "permanently consume" memory for within the transaction. > > What we don't know is how many objects we might need to scan to find > the objects we will eventually modify. Here's an (admittedly > extreme) example to demonstrate a worst case scenario: allocate a > 64k data extent. Because it is an exact size allocation, we look it > up in the by-size free space btree. Free space is fragmented, so > there are about a million 64k free space extents in the tree. > > Once we find the first 64k extent, we search them to find the best > locality target match. The btree records are 16 bytes each, so we > fit roughly 500 to a 4k block. Say we search half the extents to > find the best match - i.e. 
we walk a thousand leaf blocks before > finding the match we want, and modify that leaf block. > > Now, the modification removed an entry from the leaf and that > triggers leaf merge thresholds, so a merge with the 1002nd block > occurs. That block now demand pages in and we then modify and join > it to the transaction. Now we walk back up the btree to update > indexes, merging blocks all the way back up to the root. We have a > worst case size btree (5 levels) and we merge at every level meaning > we demand page another 8 btree blocks and modify them. > > In this case, we've demand paged ~1010 btree blocks, but only > modified 10 of them. i.e. the memory we consumed permanently was > only 10 4k buffers (approx. 10 slab and 10 page allocations), but > the allocation demand was 2 orders of magnitude more than the > unreclaimable memory consumption of the btree modification. > > I hope you start to see the scope of the problem now... Thanks, that example did help me understand your position much better. So you would need to reserve for a worst case number of the objects you modify, plus some slack for the demand-paged objects that you need to temporarily access, before you can drop and reclaim them (I suppose that in some of the tree operations, you need to be holding references to e.g. two nodes at a time, or maybe the full depth). Or maybe since all these temporary objects are potentially modifiable, it's already accounted for in the "might be modified" part. >> Can you at least at some later point in transaction recognize that >> "OK, this object was not permanent after all" and tell mm that it >> can lower your reserve? > > I'm not including any memory used by objects we know won't be locked > into the transaction in the reserve. Demand paged object memory is > essentially unbound but is easily reclaimable. That reclaim will > give us forward progress guarantees on the memory required here. 
> >> >Yes, that's the big problem with preallocation, as well as your >> >proposed "deplete the reserved memory first" approach. They >> >*require* up front "preallocation" of free memory, either directly >> >by the application, or internally by the mm subsystem. >> >> I don't see why it would deadlock, if during reserve time the mm can >> return ENOMEM as the reserver should be able to back out at that >> point. > > Preallocated reserves do not allow for unbound demand paging of > reclaimable objects within reserved allocation contexts. OK I think I get the point now. So, lots of the concerns by me and others were about the wasted memory due to reservations, and increased pressure on the rest of the system. I was thinking, are you able, at the beginning of the transaction (for these purposes, I think of transaction as the work that starts with the memory reservation, then it cannot roll back and relies on the reserves, until it commits and frees the memory), determine whether the transaction cannot be blocked in its progress by any other transaction, and the only thing that would block it would be inability to allocate memory during its course? If that was the case, we could "share" the reserved memory for all ongoing transactions of a single class (i.e. xfs transactions). If a transaction knows it cannot be blocked by anything else, only then it passes the GFP_CAN_USE_RESERVE flag to the allocator. Once the allocator gives part of the reserve to one such transaction, it will deny the reserves to other such transactions, until the first one finishes. In practice it would be more complex of course, but it should guarantee forward progress without lots of wasted memory (maybe we wouldn't have to rely on treating clean reclaimable pages as reserve in that case, which was also pointed out to be problematic). Of course it all depends on whether you are able to determine the "guaranteed to not block". I can however easily imagine it's not possible... > Cheers > > Dave.
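A minimal userspace sketch of the single-owner gating proposed here. GFP_CAN_USE_RESERVE, reserve_alloc() and the other identifiers are hypothetical names invented for illustration; a real allocator-side implementation would need locking and far more careful bookkeeping:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: a reserve shared by one class of transactions
 * (e.g. xfs transactions), granted to at most one "cannot be blocked"
 * transaction at a time, as proposed above.
 */
#define GFP_CAN_USE_RESERVE 0x1	/* hypothetical allocator flag */

static void *reserve_owner;		/* transaction currently granted the reserve */
static long reserve_bytes = 1 << 20;	/* shared pool for the transaction class */

/* Grant the reserve to at most one transaction; deny all others. */
static int reserve_alloc(void *trans, unsigned int flags, long size)
{
	if (!(flags & GFP_CAN_USE_RESERVE))
		return -1;		/* caller must use the normal path */
	if (reserve_owner && reserve_owner != trans)
		return -1;		/* reserve busy: deny until owner finishes */
	if (size > reserve_bytes)
		return -1;		/* pool exhausted */
	reserve_owner = trans;
	reserve_bytes -= size;
	return 0;
}

/* On commit, the owning transaction returns the memory and releases the gate. */
static void reserve_release(void *trans, long size)
{
	if (reserve_owner == trans) {
		reserve_bytes += size;
		reserve_owner = NULL;
	}
}
```

The point of the sketch is only the gating policy: one owner at a time, so a single shared pool can back many transactions without each holding a private worst-case reservation.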
> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-03 9:13 ` Vlastimil Babka @ 2015-03-04 1:33 ` Dave Chinner 2015-03-04 8:50 ` Vlastimil Babka 0 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-03-04 1:33 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote: > On 03/02/2015 11:31 PM, Dave Chinner wrote: > > On Mon, Mar 02, 2015 at 10:39:54AM +0100, Vlastimil Babka wrote: > > > > /* > > * In a write transaction we can allocate a maximum of 2 > > * extents. This gives: > > * the inode getting the new extents: inode size > > * the inode's bmap btree: max depth * block size > > * the agfs of the ags from which the extents are allocated: 2 * sector > > * the superblock free block counter: sector size > > * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ..... > Thanks, that example did help me understand your position much better. > So you would need to reserve for a worst case number of the objects you modify, > plus some slack for the demand-paged objects that you need to temporarily > access, before you can drop and reclaim them (I suppose that in some of the tree > operations, you need to be holding references to e.g. two nodes at a time, or > maybe the full depth). Or maybe since all these temporary objects are > potentially modifiable, it's already accounted for in the "might be modified" part. Already accounted for in the "might be modified path". > >> Can you at least at some later point in transaction recognize that > >> "OK, this object was not permanent after all" and tell mm that it > >> can lower your reserve? > > > > I'm not including any memory used by objects we know won't be locked > > into the transaction in the reserve. 
Demand paged object memory is > > essentially unbound but is easily reclaimable. That reclaim will > > give us forward progress guarantees on the memory required here. > > > >> >Yes, that's the big problem with preallocation, as well as your > >> >proposed "deplete the reserved memory first" approach. They > >> >*require* up front "preallocation" of free memory, either directly > >> >by the application, or internally by the mm subsystem. > >> > >> I don't see why it would deadlock, if during reserve time the mm can > >> return ENOMEM as the reserver should be able to back out at that > >> point. > > > > Preallocated reserves do not allow for unbound demand paging of > > reclaimable objects within reserved allocation contexts. > > OK I think I get the point now. > > So, lots of the concerns by me and others were about the wasted memory due to > reservations, and increased pressure on the rest of the system. I was thinking, > are you able, at the beginning of the transaction (for these purposes, I think of > transaction as the work that starts with the memory reservation, then it cannot > roll back and relies on the reserves, until it commits and frees the memory), > determine whether the transaction cannot be blocked in its progress by any other > transaction, and the only thing that would block it would be inability to > allocate memory during its course? No. e.g. any transaction that requires allocation or freeing of an inode or extent can get stuck behind any other transaction that is allocating/freeing an inode/extent. And this will happen when holding inode locks, which means other transactions on that inode will then get stuck on the inode lock, and so on. Blocking dependencies within transactions are everywhere and cannot be avoided. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 1:33 ` Dave Chinner @ 2015-03-04 8:50 ` Vlastimil Babka 2015-03-04 11:03 ` Dave Chinner 0 siblings, 1 reply; 83+ messages in thread From: Vlastimil Babka @ 2015-03-04 8:50 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On 03/04/2015 02:33 AM, Dave Chinner wrote: > On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote: >>> >>> Preallocated reserves do not allow for unbound demand paging of >>> reclaimable objects within reserved allocation contexts. >> >> OK I think I get the point now. >> >> So, lots of the concerns by me and others were about the wasted memory due to >> reservations, and increased pressure on the rest of the system. I was thinking, >> are you able, at the beginning of the transaction (for this purposes, I think of >> transaction as the work that starts with the memory reservation, then it cannot >> rollback and relies on the reserves, until it commits and frees the memory), >> determine whether the transaction cannot be blocked in its progress by any other >> transaction, and the only thing that would block it would be inability to >> allocate memory during its course? > > No. e.g. any transaction that requires allocation or freeing of an > inode or extent can get stuck behind any other transaction that is > allocating/freeing and inode/extent. And this will happen when > holding inode locks, which means other transactions on that inode > will then get stuck on the inode lock, and so on. Blocking > dependencies within transactions are everywhere and cannot be > avoided. Hm, I see. I thought that perhaps to avoid deadlocks between transactions (which you already have to do somehow), either the dependencies have to be structured in a way that there's always some transaction that can't block on others. 
Or you have a way to detect potential deadlocks before they happen, and stall somebody who tries to lock. Both should (at least theoretically) mean that you would be able to point to such a transaction, although I can imagine the cost of being able to do that could be prohibitive. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 8:50 ` Vlastimil Babka @ 2015-03-04 11:03 ` Dave Chinner 0 siblings, 0 replies; 83+ messages in thread From: Dave Chinner @ 2015-03-04 11:03 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Wed, Mar 04, 2015 at 09:50:58AM +0100, Vlastimil Babka wrote: > On 03/04/2015 02:33 AM, Dave Chinner wrote: > >On Tue, Mar 03, 2015 at 10:13:04AM +0100, Vlastimil Babka wrote: > >>> > >>>Preallocated reserves do not allow for unbound demand paging of > >>>reclaimable objects within reserved allocation contexts. > >> > >>OK I think I get the point now. > >> > >>So, lots of the concerns by me and others were about the wasted memory due to > >>reservations, and increased pressure on the rest of the system. I was thinking, > >>are you able, at the beginning of the transaction (for this purposes, I think of > >>transaction as the work that starts with the memory reservation, then it cannot > >>rollback and relies on the reserves, until it commits and frees the memory), > >>determine whether the transaction cannot be blocked in its progress by any other > >>transaction, and the only thing that would block it would be inability to > >>allocate memory during its course? > > > >No. e.g. any transaction that requires allocation or freeing of an > >inode or extent can get stuck behind any other transaction that is > >allocating/freeing and inode/extent. And this will happen when > >holding inode locks, which means other transactions on that inode > >will then get stuck on the inode lock, and so on. Blocking > >dependencies within transactions are everywhere and cannot be > >avoided. > > Hm, I see. 
I thought that perhaps to avoid deadlocks between > transactions (which you already have to do somehow), Of course, by following lock ordering rules, rules about holding locks over transaction reservations, allowing bulk reservations for rolling transactions that don't unlock objects between transaction commits, having allocation group ordering rules, block allocation ordering rules, transactional lock recursion support to prevent transaction deadlocking walking over objects already locked into the transaction, etc. By following those rules, we guarantee forwards progress in the transaction subsystem. If we can also guarantee forwards progress in memory allocation inside transaction context (like Irix did all those years ago :P), then we can guarantee that transactions will always complete unless there is a bug or corruption is detected during an operation... > either the > dependencies have to be structured in a way that there's always some > transaction that can't block on others. Or you have a way to detect > potential deadlocks before they happen, and stall somebody who tries > to lock.

$ git grep ASSERT fs/xfs | wc -l
1716

About 3% of the code in XFS is ASSERT statements used to verify context specific state is correct in CONFIG_XFS_DEBUG=y builds. FYI, from cloc:

Subsystem   files   blank   comment    code
-------------------------------------------
fs/xfs        157   10841     25339   69140
mm/            97   13923     25534   67870
fs/btrfs       86   14443     15097   85065

Cheers, Dave. PS: XFS userspace has another 110,000 lines of code in xfsprogs and 60,000 lines of code in xfsdump, and there's also 80,000 lines of test code in xfstests. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 22:31 ` Dave Chinner 2015-03-03 9:13 ` Vlastimil Babka @ 2015-03-07 0:20 ` Johannes Weiner 2015-03-07 3:43 ` Dave Chinner 1 sibling, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-03-07 0:20 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, torvalds, Vlastimil Babka On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote: > What we don't know is how many objects we might need to scan to find > the objects we will eventually modify. Here's an (admittedly > extreme) example to demonstrate a worst case scenario: allocate a > 64k data extent. Because it is an exact size allocation, we look it > up in the by-size free space btree. Free space is fragmented, so > there are about a million 64k free space extents in the tree. > > Once we find the first 64k extent, we search them to find the best > locality target match. The btree records are 16 bytes each, so we > fit roughly 500 to a 4k block. Say we search half the extents to > find the best match - i.e. we walk a thousand leaf blocks before > finding the match we want, and modify that leaf block. > > Now, the modification removed an entry from the leaf and tht > triggers leaf merge thresholds, so a merge with the 1002nd block > occurs. That block now demand pages in and we then modify and join > it to the transaction. Now we walk back up the btree to update > indexes, merging blocks all the way back up to the root. We have a > worst case size btree (5 levels) and we merge at every level meaning > we demand page another 8 btree blocks and modify them. > > In this case, we've demand paged ~1010 btree blocks, but only > modified 10 of them. i.e. the memory we consumed permanently was > only 10 4k buffers (approx. 10 slab and 10 page allocations), but > the allocation demand was 2 orders of magnitude more than the > unreclaimable memory consumption of the btree modification. 
> > I hope you start to see the scope of the problem now... Isn't this bounded one way or another? Sure, the inaccuracy itself is high, but when you put the absolute numbers in perspective it really doesn't seem to matter: with your extreme case of 3MB per transaction, you can still run 5k+ of them in parallel on a small 16G machine. Occupy a generous 75% of RAM with anonymous pages, and you can STILL run over a thousand transactions concurrently. That would seem like a decent pipeline to keep the storage device occupied. The level of precision that you are asking for comes with complexity and fragility that I'm not convinced is necessary, or justified. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
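The capacity estimate in the message above is easy to check as arithmetic. This sketch just encodes the back-of-envelope numbers quoted there (a 16G machine, ~3MB worst case per transaction, 75% of RAM in anonymous pages); the figures are illustrative only:

```c
#include <assert.h>

/*
 * Back-of-envelope estimate from the message above: how many
 * worst-case transactions fit in the RAM not claimed by anonymous
 * pages. Illustrative arithmetic, not a real mm calculation.
 */
static long max_concurrent_transactions(long ram_mb, int anon_pct,
					long per_trans_mb)
{
	long avail_mb = ram_mb * (100 - anon_pct) / 100;

	return avail_mb / per_trans_mb;
}
```

With all 16G available that gives roughly 5400 transactions ("5k+"), and with 75% of RAM occupied by anonymous pages still around 1300 ("over a thousand").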
* Re: How to handle TIF_MEMDIE stalls? 2015-03-07 0:20 ` Johannes Weiner @ 2015-03-07 3:43 ` Dave Chinner 2015-03-07 15:08 ` Johannes Weiner 0 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-03-07 3:43 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, torvalds, Vlastimil Babka On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote: > On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote: > > What we don't know is how many objects we might need to scan to find > > the objects we will eventually modify. Here's an (admittedly > > extreme) example to demonstrate a worst case scenario: allocate a > > 64k data extent. Because it is an exact size allocation, we look it > > up in the by-size free space btree. Free space is fragmented, so > > there are about a million 64k free space extents in the tree. > > > > Once we find the first 64k extent, we search them to find the best > > locality target match. The btree records are 16 bytes each, so we > > fit roughly 500 to a 4k block. Say we search half the extents to > > find the best match - i.e. we walk a thousand leaf blocks before > > finding the match we want, and modify that leaf block. > > > > Now, the modification removed an entry from the leaf and tht > > triggers leaf merge thresholds, so a merge with the 1002nd block > > occurs. That block now demand pages in and we then modify and join > > it to the transaction. Now we walk back up the btree to update > > indexes, merging blocks all the way back up to the root. We have a > > worst case size btree (5 levels) and we merge at every level meaning > > we demand page another 8 btree blocks and modify them. > > > > In this case, we've demand paged ~1010 btree blocks, but only > > modified 10 of them. i.e. the memory we consumed permanently was > > only 10 4k buffers (approx. 
10 slab and 10 page allocations), but > > the allocation demand was 2 orders of magnitude more than the > > unreclaimable memory consumption of the btree modification. > > > > I hope you start to see the scope of the problem now... > > Isn't this bounded one way or another? For a single transaction? No. > Sure, the inaccuracy itself is > high, but when you put the absolute numbers in perspective it really > doesn't seem to matter: with your extreme case of 3MB per transaction, > you can still run 5k+ of them in parallel on a small 16G machine. No you can't. The number of concurrent transactions is bounded by the size of the log and the amount of unused space available for reservation in the log. Under heavy modification loads, that's usually somewhere between 15-25% of the log, so worst case is a few hundred megabytes. The memory reservation demand is in the same order of magnitude as the log space reservation demand..... > Occupy a generous 75% of RAM with anonymous pages, and you can STILL > run over a thousand transactions concurrently. That would seem like a > decent pipeline to keep the storage device occupied. Typical systems won't ever get to that - they don't do more than a handful of concurrent transactions at a time - the "thousands of transactions" occur on dedicated storage servers like petabyte scale NFS servers that have hundreds of gigabytes of RAM and hundreds-to-thousands of processing threads to keep the request pipeline full. The memory in those machines is entirely dedicated to the filesystem, so keeping a usable pool of a few gigabytes for transaction reservations isn't a big deal. The point here is that you're taking what I'm describing as the requirements of a reservation pool and then applying the worst case to situations where it is completely inappropriate. That's what I mean when I told Michal to stop building silly strawman situations; large amounts of concurrency are required for huge machines, not your desktop workstation.
And, realistically, sizing that reservation pool appropriately is my problem to solve - it will depend on many factors, one of which is the actual geometry of the filesystem itself. You need to stop thinking like you can control how applications use the memory allocation and reclaim subsystem and start to trust that we will manage our memory usage appropriately to maintain maximum system throughput. After all, we already do that for all the filesystem caches the mm subsystem doesn't control - why do you think I have had such an interest in shrinker scalability? For XFS, the only cache we actually don't control reclaim from is user data in the page cache - we control everything else directly from custom shrinkers..... > The level of precision that you are asking for comes with complexity > and fragility that I'm not convinced is necessary, or justified. Look, if you don't think reservations will work, then how about you suggest something that will. I don't really care what you implement, as long as it meets the needs of demand paging, gives me direct control over memory usage and concurrency policy, and the allocation mechanism guarantees forward progress without needing the OOM killer. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-07 3:43 ` Dave Chinner @ 2015-03-07 15:08 ` Johannes Weiner 0 siblings, 0 replies; 83+ messages in thread From: Johannes Weiner @ 2015-03-07 15:08 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, torvalds, Vlastimil Babka On Sat, Mar 07, 2015 at 02:43:47PM +1100, Dave Chinner wrote: > On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote: > > On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote: > > > What we don't know is how many objects we might need to scan to find > > > the objects we will eventually modify. Here's an (admittedly > > > extreme) example to demonstrate a worst case scenario: allocate a > > > 64k data extent. Because it is an exact size allocation, we look it > > > up in the by-size free space btree. Free space is fragmented, so > > > there are about a million 64k free space extents in the tree. > > > > > > Once we find the first 64k extent, we search them to find the best > > > locality target match. The btree records are 16 bytes each, so we > > > fit roughly 500 to a 4k block. Say we search half the extents to > > > find the best match - i.e. we walk a thousand leaf blocks before > > > finding the match we want, and modify that leaf block. > > > > > > Now, the modification removed an entry from the leaf and tht > > > triggers leaf merge thresholds, so a merge with the 1002nd block > > > occurs. That block now demand pages in and we then modify and join > > > it to the transaction. Now we walk back up the btree to update > > > indexes, merging blocks all the way back up to the root. We have a > > > worst case size btree (5 levels) and we merge at every level meaning > > > we demand page another 8 btree blocks and modify them. > > > > > > In this case, we've demand paged ~1010 btree blocks, but only > > > modified 10 of them. i.e. the memory we consumed permanently was > > > only 10 4k buffers (approx. 
10 slab and 10 page allocations), but > > > the allocation demand was 2 orders of magnitude more than the > > > unreclaimable memory consumption of the btree modification. > > > > > > I hope you start to see the scope of the problem now... > > > > Isn't this bounded one way or another? > > Fo a single transaction? No. So you can have an infinite number of allocations in the context of a transaction, and only the objects that are going to be locked in are bounded? > > Sure, the inaccuracy itself is > > high, but when you put the absolute numbers in perspective it really > > doesn't seem to matter: with your extreme case of 3MB per transaction, > > you can still run 5k+ of them in parallel on a small 16G machine. > > No you can't. The number of concurrent transactions is bounded by > the size of the log and the amount of unused space available for > reservation in the log. Under heavy modification loads, that's > usually somewhere between 15-25% of the log, so worst case is a few > hundred megabytes. The memory reservation demand is in the same > order of magnitude as the log space reservation demand..... > > > Occupy a generous 75% of RAM with anonymous pages, and you can STILL > > run over a thousand transactions concurrently. That would seem like a > > decent pipeline to keep the storage device occupied. > > Typical systems won't ever get to that - they don't do more than a > handful of current transactions at a time - the "thousands of > transactions" occur on dedicated storage servers like petabyte scale > NFS servers that have hundreds of gigabytes of RAM and > hundreds-to-thousands of processing threads to keep the request > pipeline full. The memory in those machines is entirely dedicated to > the filesystem, so keeping a usuable pool of a few gigabytes for > transaction reservations isn't a big deal. 
> > The point here is that you're taking what I'm describing as the > requirements of a reservation pool and then applying the worst case > to situations where it is completely inappropriate. That's what I mean > when I told Michal to stop building silly strawman situations; large > amounts of concurrency are required for huge machines, not your > desktop workstation. Why do you have to take everything I say in bad faith and choose to be smug instead of constructive? This is unnecessary. OF COURSE you know your constraints better than we do. Now explain how they matter in practice, because that's what dictates the design in engineering. I'm trying to figure out your requirements to find the simplest model, and yes I'm obviously going to follow up when you give me incomplete information. I'm responding to this: : What we don't know is how many objects we might need to scan to find : the objects we will eventually modify. Here's an (admittedly : extreme) example to demonstrate a worst case scenario: You gave us numbers that you called "worst case", so I took them and put them in a scenario where it looks like memory wouldn't be the bottleneck in real life, even if we just had simple pre-allocation semantics. If it was a silly example, why not provide a better one? I'm fine with reservations and I'm fine with adding more complexity when you demonstrate that it's needed. Your argument seems to have been that worst-case estimates are way off, but can you please just demonstrate why it matters in practice? Instead of having me do it and calling my attempts strawman arguments? I can just guess your constraints, it's up to you to make a case for your requirements. Here is another example where you responded to akpm: --- > When allocating pages the caller should drain its reserves in > preference to dipping into the regular freelist. This guy has already > done his reclaim and shouldn't be penalised a second time.
I guess > Johannes's preallocation code should switch to doing this for the same > reason, plus the fact that snipping a page off > task_struct.prealloc_pages is super-fast and needs to be done sometime > anyway so why not do it by default. That is at odds with the requirements of demand paging, which allocates for objects that are reclaimable within the course of the transaction. The reserve is there to ensure forward progress for allocations for objects that aren't freed until after the transaction completes, but if we drain it for reclaimable objects we then have nothing left in the reserve pool when we actually need it. We do not know ahead of time if the object we are allocating is going to be modified and hence locked into the transaction. Hence we can't say "use the reserve for this *specific* allocation", and so the only guidance we can really give is "we will need to allocate and *permanently consume* this much memory", and the reserve pool needs to cover that consumption to guarantee forwards progress. Forwards progress for all other allocations is guaranteed because they are reclaimable objects - they are either freed directly back to their source (slab, heap, page lists) or they are freed by shrinkers once they have been released from the transaction. Hence we need allocations to come from the free list and trigger reclaim, regardless of the fact there is a reserve pool there. The reserve pool needs to be a last resort once there are no other avenues to allocate memory. i.e. it would be used to replace the OOM killer for GFP_NOFAIL allocations. --- Andrew makes a proposal and backs it up with real life benefits: simpler, faster. You on the other hand follow up with a list of unfounded claims and your only counter-argument really seems to be that Andrew's proposal differs from what you've had in mind. What you had in mind was obviously driven by constraints known to you, but it's not an argument until you actually include them.
We're not taking your claims at face value, that's not how this ever works. Just explain why and how your requirements, demand paging reserves in this case, matter in real life. Then we can take them seriously. > And, realistically, sizing that reservation pool appropriately is my > problem to solve - it will depend on many factors, one of which is > the actual geometry of the filesystem itself. You need to stop > thinking like you can control how application use the memory > allocation and reclaim subsystem and start to trust we will our > memory usage appropriately to maintain maximum system throughput. You've been working on the kernel long enough to know that this is not how it goes. I don't care about getting a list of things you claim you need and implementing them blindly, trusting that you know what you're doing when it comes to memory. If you want us to expose an interface, which puts constraints on our implementation, then you better provide justification for every single requirement. > After all, we already do that for all the filesystem caches the mm > subsystem doesn't control - why do you think I have had such an > interest in shrinker scalability? For XFS, the only cache we > actually don't control reclaim from is user data in the page cache - > we control everything else directly from custom shrinkers..... You mean those global object pools that are aged through unrelated and independent per-zone pressure values? Look, we are specialized in different subsystems, which means we know the details in front of us better than the details in the surrounding areas. You are quick to dismiss constraints and scalability concerns in the memory subsystem, and I do the same for memory users. We are having this discussion in order to explore where our problem spaces intersect, and we could be making more progress if you stopped assuming that everybody else is an idiot and you already found the perfect solution. 
We need data on your parameters in order to make a basic cost-benefit analysis of any proposed solutions. Don't just propose something and talk down to us when we ask for clarifications on your constraints. It's not getting us anywhere. Explore the problem space with us, explain your constraints and exact requirements based on real life data, and then we can look for potential solutions. That is how we evaluate every single proposal for the kernel, and it's how it's going to work in this case. It's not that complicated. > > The level of precision that you are asking for comes with complexity > > and fragility that I'm not convinced is necessary, or justified. > > Look, if you dont think reservations will work, then how about you > suggest something that will. I don't really care what you implement, > as long as it meets the needs of demand paging, I have direct > control over memory usage and concurrency policy and the allocation > mechanism guarantees forward progress without needing the OOM > killer. Reservations are fine and I also want them to replace the OOM killer, we agree on that. The only thing my email was about was that, in light of the worst-case numbers you quoted, it didn't look like the demand paging requirement is strictly necessary to make the system work in practice, which is why I'm questioning that particular requirement and prompting you to clarify your position. You have yet to address this. Until then, the simplest semantics are preallocation semantics, where you in advance establish private reserve pools (which can be backed by clean cache) from which you allocate directly using __GFP_RESERVE. If the pool is empty it's immediately detectable and attributable to the culprit, and the other reserves are not impacted by it. A globally shared demand-paged pool is much more fragile because you trust other participants in the system to keep their promise and not pin more objects than they reserved for. 
Otherwise, they deadlock your transaction and corrupt your userdata. How does "XFS filesystem corrupted because it shares its emergency memory pool to ensure data integrity with some buggy driver" sound to you? It's also harder to verify. If one of the participants misbehaves and pins more objects than they initially reserved for, how do we identify the culprit when the system locks up? Make an actual case why preallocation semantics are unworkable on real systems with real memory and real filesystems and real data on them, then we can consider making the model more complex and fragile. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
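A toy sketch of the preallocation semantics argued for above: a private, per-context reserve pool whose depletion is immediately detectable and attributable to its owner, rather than silently draining a globally shared pool. The structure and function names here are hypothetical, not a real kernel API:

```c
#include <assert.h>

/*
 * Hypothetical sketch of private preallocation semantics: each
 * context establishes its own reserve up front and draws from it
 * under a __GFP_RESERVE-style flag. Names are invented for
 * illustration.
 */
struct mem_reserve {
	const char *owner;	/* who established the pool, for attribution */
	long pages;		/* order-0 pages still available */
};

/*
 * Establishing the pool is the point where failure is still safe:
 * in a real implementation this could return -ENOMEM while the
 * caller can still back out cleanly.
 */
static int reserve_establish(struct mem_reserve *r, const char *owner,
			     long pages)
{
	r->owner = owner;
	r->pages = pages;
	return 0;
}

/*
 * Drawing a page from the private pool. An empty pool immediately
 * identifies the culprit (r->owner) instead of deadlocking every
 * other user of a shared reserve.
 */
static int reserve_get_page(struct mem_reserve *r)
{
	if (r->pages <= 0)
		return -1;	/* depletion attributable to r->owner */
	r->pages--;
	return 0;
}
```

The design point is attribution: if a participant pins more objects than it reserved for, only its own pool runs dry, and the failure names the owner.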
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 7:32 ` Dave Chinner 2015-02-27 18:24 ` Vlastimil Babka 2015-03-02 9:39 ` Vlastimil Babka @ 2015-03-02 20:22 ` Johannes Weiner 2015-03-02 23:12 ` Dave Chinner 2 siblings, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-03-02 20:22 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote: > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > > When allocating pages the caller should drain its reserves in > > preference to dipping into the regular freelist. This guy has already > > done his reclaim and shouldn't be penalised a second time. I guess > > Johannes's preallocation code should switch to doing this for the same > > reason, plus the fact that snipping a page off > > task_struct.prealloc_pages is super-fast and needs to be done sometime > > anyway so why not do it by default. > > That is at odds with the requirements of demand paging, which > allocate for objects that are reclaimable within the course of the > transaction. The reserve is there to ensure forward progress for > allocations for objects that aren't freed until after the > transaction completes, but if we drain it for reclaimable objects we > then have nothing left in the reserve pool when we actually need it. > > We do not know ahead of time if the object we are allocating is > going to modified and hence locked into the transaction. Hence we > can't say "use the reserve for this *specific* allocation", and so > the only guidance we can really give is "we will to allocate and > *permanently consume* this much memory", and the reserve pool needs > to cover that consumption to guarantee forwards progress. 
> > Forwards progress for all other allocations is guaranteed because > they are reclaimable objects - they either freed directly back to > their source (slab, heap, page lists) or they are freed by shrinkers > once they have been released from the transaction. > > Hence we need allocations to come from the free list and trigger > reclaim, regardless of the fact there is a reserve pool there. The > reserve pool needs to be a last resort once there are no other > avenues to allocate memory. i.e. it would be used to replace the OOM > killer for GFP_NOFAIL allocations. That won't work. Clean cache can be temporarily unavailable and off-LRU for several reasons - compaction, migration, pending page promotion, other reclaimers. How often are we trying before we dip into the reserve pool? As you have noticed, the OOM killer goes off seemingly prematurely at times, and the reason for that is that we simply don't KNOW the exact point when we ran out of reclaimable memory. We cannot take an atomic snapshot of all zones, of all nodes, of all tasks running in order to determine this reliably, we have to approximate it. That's why OOM is defined as "we have scanned a great many pages and couldn't free any of them." So unless you tell us which allocations should come from previously declared reserves, and which ones should rely on reclaim and may fail, the reserves can deplete prematurely and we're back to square one. > > And to make it much worse, how > > many pages of which orders? Bless its heart, slub will go and use > > a 1-order page for allocations which should have been in 0-order > > pages.. It can always fall back to the minimum order. > The majority of allocations will be order-0, though if we know that > they are going to be significant numbers of high order allocations, > then it should be simple enough to tell the mm subsystem "need a > reserve of 32 order-0, 4 order-1 and 1 order-3 allocations" and have > memory compaction just do it's stuff. 
> But, IMO, we should cross that
> bridge when somebody actually needs reservations to be that
> specific....

Compaction can be at an impasse for the same reasons mentioned above. It cannot just stop_machine() to guarantee it can assemble a higher-order page from a bunch of in-use order-0 cache pages. If you need higher-order allocations in a transaction, you have to pre-allocate.
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 20:22 ` Johannes Weiner @ 2015-03-02 23:12 ` Dave Chinner 2015-03-03 2:50 ` Johannes Weiner 0 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-03-02 23:12 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote: > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote: > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > > > When allocating pages the caller should drain its reserves in > > > preference to dipping into the regular freelist. This guy has already > > > done his reclaim and shouldn't be penalised a second time. I guess > > > Johannes's preallocation code should switch to doing this for the same > > > reason, plus the fact that snipping a page off > > > task_struct.prealloc_pages is super-fast and needs to be done sometime > > > anyway so why not do it by default. > > > > That is at odds with the requirements of demand paging, which > > allocate for objects that are reclaimable within the course of the > > transaction. The reserve is there to ensure forward progress for > > allocations for objects that aren't freed until after the > > transaction completes, but if we drain it for reclaimable objects we > > then have nothing left in the reserve pool when we actually need it. > > > > We do not know ahead of time if the object we are allocating is > > going to modified and hence locked into the transaction. Hence we > > can't say "use the reserve for this *specific* allocation", and so > > the only guidance we can really give is "we will to allocate and > > *permanently consume* this much memory", and the reserve pool needs > > to cover that consumption to guarantee forwards progress. 
> > Forwards progress for all other allocations is guaranteed because
> > they are reclaimable objects - they either freed directly back to
> > their source (slab, heap, page lists) or they are freed by shrinkers
> > once they have been released from the transaction.
> >
> > Hence we need allocations to come from the free list and trigger
> > reclaim, regardless of the fact there is a reserve pool there. The
> > reserve pool needs to be a last resort once there are no other
> > avenues to allocate memory. i.e. it would be used to replace the OOM
> > killer for GFP_NOFAIL allocations.
>
> That won't work.

I don't see why not...

> Clean cache can be temporarily unavailable and
> off-LRU for several reasons - compaction, migration, pending page
> promotion, other reclaimers. How often are we trying before we dip
> into the reserve pool? As you have noticed, the OOM killer goes off
> seemingly prematurely at times, and the reason for that is that we
> simply don't KNOW the exact point when we ran out of reclaimable
> memory.

Sure, but that's irrelevant to the problem at hand. At some point, the MM subsystem is going to decide "we're at OOM" - it's *what happens next* that matters.

> We cannot take an atomic snapshot of all zones, of all nodes,
> of all tasks running in order to determine this reliably, we have to
> approximate it. That's why OOM is defined as "we have scanned a great
> many pages and couldn't free any of them."

Yes, and reserve pools *do not change* the logic that leads to that decision. What changes is that we don't "kick the OOM killer", instead we "allocate from the reserve pool." The reserve pool *replaces* the OOM killer as a method of guaranteeing forwards allocation progress for those subsystems that can use reservations. If there is no reserve pool for the current task, then you can still kick the OOM killer....
> So unless you tell us which allocations should come from previously > declared reserves, and which ones should rely on reclaim and may fail, > the reserves can deplete prematurely and we're back to square one. Like the OOM killer, filesystems are not omnipotent and are not perfect. Requiring us to be so is entirely unreasonable, and is *entirely unnecessary* from the POV of the mm subsystem. Reservations give the mm subsystem a *strong model* for guaranteeing forwards allocation progress, and it can be independently verified and tested without having to care about how some subsystem uses it. The mm subsystem supplies the *mechanism*, and mm developers are entirely focussed around ensuring the mechanism works and is verifiable. i.e. you could write some debug kernel module to exercise, verify and regression test the model behaviour, which is something that simply cannot be done with the OOM killer. Reservation sizes required by a subsystem are *policy*. They are not a problem the mm subsystem needs to be concerned with as the subsystem has to get the reservations right for the mechanism to work. i.e. Managing reservation sizes is my responsibility as a subsystem maintainer, just like it's currently my responsibility for ensuring that transient ENOMEM conditions don't result in a filesystem shutdown.... > Compaction can be at an impasse for the same reasons mentioned above. > It can not just stop_machine() to guarantee it can assemble a higher > order page from a bunch of in-use order-0 cache pages. If you need > higher-order allocations in a transaction, you have to pre-allocate. It's much simpler just to use order-0 reservations and vmalloc if we can't get high order allocations. We already do this in most places where high order allocations are required, so there's really no change needed here. ;) Cheers, Dave. 
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 23:12 ` Dave Chinner @ 2015-03-03 2:50 ` Johannes Weiner 2015-03-04 6:52 ` Dave Chinner 0 siblings, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-03-03 2:50 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Tue, Mar 03, 2015 at 10:12:06AM +1100, Dave Chinner wrote: > On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote: > > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote: > > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > > > > When allocating pages the caller should drain its reserves in > > > > preference to dipping into the regular freelist. This guy has already > > > > done his reclaim and shouldn't be penalised a second time. I guess > > > > Johannes's preallocation code should switch to doing this for the same > > > > reason, plus the fact that snipping a page off > > > > task_struct.prealloc_pages is super-fast and needs to be done sometime > > > > anyway so why not do it by default. > > > > > > That is at odds with the requirements of demand paging, which > > > allocate for objects that are reclaimable within the course of the > > > transaction. The reserve is there to ensure forward progress for > > > allocations for objects that aren't freed until after the > > > transaction completes, but if we drain it for reclaimable objects we > > > then have nothing left in the reserve pool when we actually need it. > > > > > > We do not know ahead of time if the object we are allocating is > > > going to modified and hence locked into the transaction. Hence we > > > can't say "use the reserve for this *specific* allocation", and so > > > the only guidance we can really give is "we will to allocate and > > > *permanently consume* this much memory", and the reserve pool needs > > > to cover that consumption to guarantee forwards progress. 
> > > > > > Forwards progress for all other allocations is guaranteed because > > > they are reclaimable objects - they either freed directly back to > > > their source (slab, heap, page lists) or they are freed by shrinkers > > > once they have been released from the transaction. > > > > > > Hence we need allocations to come from the free list and trigger > > > reclaim, regardless of the fact there is a reserve pool there. The > > > reserve pool needs to be a last resort once there are no other > > > avenues to allocate memory. i.e. it would be used to replace the OOM > > > killer for GFP_NOFAIL allocations. > > > > That won't work. > > I don't see why not... > > > Clean cache can be temporarily unavailable and > > off-LRU for several reasons - compaction, migration, pending page > > promotion, other reclaimers. How often are we trying before we dip > > into the reserve pool? As you have noticed, the OOM killer goes off > > seemingly prematurely at times, and the reason for that is that we > > simply don't KNOW the exact point when we ran out of reclaimable > > memory. > > Sure, but that's irrelevant to the problem at hand. At some point, > the Mm subsystem is going to decide "we're at OOM" - it's *what > happens next* that matters. It's not irrelevant at all. That point is an arbitrary magic number that is a byproduct of many implementation details and concurrency in the memory management layer. It's completely fine to tie allocations which can fail to this point, but you can't reasonably calibrate your emergency reserves, which are supposed to guarantee progress, to such an unpredictable variable. When you reserve based on the share of allocations that you know will be unreclaimable, you are assuming that all other allocations will be reclaimable, and that is simply flawed. There is so much concurrency in the MM subsystem that you can't reasonably expect a single scanner instance to recover the majority of theoretically reclaimable memory. 
> > We cannot take an atomic snapshot of all zones, of all nodes, > > of all tasks running in order to determine this reliably, we have to > > approximate it. That's why OOM is defined as "we have scanned a great > > many pages and couldn't free any of them." > > Yes, and reserve pools *do not change* the logic that leads to that > decision. What changes is that we don't "kick the OOM killer", > instead we "allocate from the reserve pool." The reserve pool > *replaces* the OOM killer as a method of guaranteeing forwards > allocation progress for those subsystems that can use reservations. In order to replace the OOM killer in its role as progress guarantee, the reserves can't run dry during the transaction. Because what are we going to do in that case? > If there is no reserve pool for the current task, then you can still > kick the OOM killer.... ... so we are not actually replacing the OOM killer, we just defer it with reserves that were calibrated to an anecdotal snapshot of a fuzzy quantity of reclaim activity? Is the idea here to just pile sh*tty, mostly-working mechanisms on top of each other in the hope that one of them will kick things along just enough to avoid locking up? > > So unless you tell us which allocations should come from previously > > declared reserves, and which ones should rely on reclaim and may fail, > > the reserves can deplete prematurely and we're back to square one. > > Like the OOM killer, filesystems are not omnipotent and are not > perfect. Requiring us to be so is entirely unreasonable, and is > *entirely unnecessary* from the POV of the mm subsystem. > > Reservations give the mm subsystem a *strong model* for guaranteeing > forwards allocation progress, and it can be independently verified > and tested without having to care about how some subsystem uses it. > The mm subsystem supplies the *mechanism*, and mm developers are > entirely focussed around ensuring the mechanism works and is > verifiable. i.e. 
> you could write some debug kernel module to
> exercise, verify and regression test the model behaviour, which is
> something that simply cannot be done with the OOM killer.
>
> Reservation sizes required by a subsystem are *policy*. They are not
> a problem the mm subsystem needs to be concerned with as the
> subsystem has to get the reservations right for the mechanism to
> work. i.e. Managing reservation sizes is my responsibility as a
> subsystem maintainer, just like it's currently my responsibility for
> ensuring that transient ENOMEM conditions don't result in a
> filesystem shutdown....

Anything that depends on the point at which the memory management system gives up reclaiming pages is not verifiable in the slightest. It will vary from kernel to kernel, from workload to workload, from run to run. It will regress in the blink of an eye.
* Re: How to handle TIF_MEMDIE stalls? 2015-03-03 2:50 ` Johannes Weiner @ 2015-03-04 6:52 ` Dave Chinner 2015-03-04 15:04 ` Johannes Weiner 0 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-03-04 6:52 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Mon, Mar 02, 2015 at 09:50:23PM -0500, Johannes Weiner wrote: > On Tue, Mar 03, 2015 at 10:12:06AM +1100, Dave Chinner wrote: > > On Mon, Mar 02, 2015 at 03:22:28PM -0500, Johannes Weiner wrote: > > > On Mon, Feb 23, 2015 at 06:32:35PM +1100, Dave Chinner wrote: > > > > On Sun, Feb 22, 2015 at 05:29:30PM -0800, Andrew Morton wrote: > > > > > When allocating pages the caller should drain its reserves in > > > > > preference to dipping into the regular freelist. This guy has already > > > > > done his reclaim and shouldn't be penalised a second time. I guess > > > > > Johannes's preallocation code should switch to doing this for the same > > > > > reason, plus the fact that snipping a page off > > > > > task_struct.prealloc_pages is super-fast and needs to be done sometime > > > > > anyway so why not do it by default. > > > > > > > > That is at odds with the requirements of demand paging, which > > > > allocate for objects that are reclaimable within the course of the > > > > transaction. The reserve is there to ensure forward progress for > > > > allocations for objects that aren't freed until after the > > > > transaction completes, but if we drain it for reclaimable objects we > > > > then have nothing left in the reserve pool when we actually need it. > > > > > > > > We do not know ahead of time if the object we are allocating is > > > > going to modified and hence locked into the transaction. 
Hence we > > > > can't say "use the reserve for this *specific* allocation", and so > > > > the only guidance we can really give is "we will to allocate and > > > > *permanently consume* this much memory", and the reserve pool needs > > > > to cover that consumption to guarantee forwards progress. > > > > > > > > Forwards progress for all other allocations is guaranteed because > > > > they are reclaimable objects - they either freed directly back to > > > > their source (slab, heap, page lists) or they are freed by shrinkers > > > > once they have been released from the transaction. > > > > > > > > Hence we need allocations to come from the free list and trigger > > > > reclaim, regardless of the fact there is a reserve pool there. The > > > > reserve pool needs to be a last resort once there are no other > > > > avenues to allocate memory. i.e. it would be used to replace the OOM > > > > killer for GFP_NOFAIL allocations. > > > > > > That won't work. > > > > I don't see why not... > > > > > Clean cache can be temporarily unavailable and > > > off-LRU for several reasons - compaction, migration, pending page > > > promotion, other reclaimers. How often are we trying before we dip > > > into the reserve pool? As you have noticed, the OOM killer goes off > > > seemingly prematurely at times, and the reason for that is that we > > > simply don't KNOW the exact point when we ran out of reclaimable > > > memory. > > > > Sure, but that's irrelevant to the problem at hand. At some point, > > the Mm subsystem is going to decide "we're at OOM" - it's *what > > happens next* that matters. > > It's not irrelevant at all. That point is an arbitrary magic number > that is a byproduct of many imlementation details and concurrency in > the memory management layer. It's completely fine to tie allocations > which can fail to this point, but you can't reasonably calibrate your > emergency reserves, which are supposed to guarantee progress, to such > an unpredictable variable. 
> When you reserve based on the share of allocations that you know will
> be unreclaimable, you are assuming that all other allocations will be
> reclaimable, and that is simply flawed. There is so much concurrency
> in the MM subsystem that you can't reasonably expect a single scanner
> instance to recover the majority of theoretically reclaimable memory.

On the one hand you say "memory accounting is unreliable, so detecting OOM is unreliable, and so we have an unreliable trigger point." On the other hand you say "a single scanner instance can't reclaim all memory", again stating we have an unreliable trigger point. On the gripping hand, that unreliable trigger point is what kicks the OOM killer. Yet you consider that point to be reliable enough to kick the OOM killer, but too unreliable to trigger allocation from a reserve pool? Say what?

I suspect you've completely misunderstood what I've been suggesting.

By definition, we have the pages we reserved in the reserve pool, and unless we've exhausted that reservation with permanent allocations we should always be able to allocate from it. If the pool got emptied by demand page allocations, then we back off and retry reclaim until the reclaimable objects are released back into the reserve pool. i.e. reclaim fills reserve pools first, then when they are full pages can go back on free lists for normal allocations. This provides the mechanism for forwards progress, and it's essentially the same mechanism that mempools use to guarantee forwards progress. The only difference is that reserve pool refilling comes through reclaim via shrinker invocation...

In reality, though, I don't really care how the mm subsystem implements that pool as long as it handles the cases I've described (e.g. http://oss.sgi.com/archives/xfs/2015-03/msg00039.html). I don't think we're making progress here, anyway, so unless you come up with some other solution this thread is going to die here....

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 6:52 ` Dave Chinner @ 2015-03-04 15:04 ` Johannes Weiner 2015-03-04 17:38 ` Theodore Ts'o 0 siblings, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-03-04 15:04 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, Andrew Morton, torvalds On Wed, Mar 04, 2015 at 05:52:42PM +1100, Dave Chinner wrote: > I suspect you've completely misunderstood what I've been suggesting. > > By definition, we have the pages we reserved in the reserve pool, > and unless we've exhausted that reservation with permanent > allocations we should always be able to allocate from it. If the > pool got emptied by demand page allocations, then we back off and > retry reclaim until the reclaimable objects are released back into > the reserve pool. i.e. reclaim fills reserve pools first, then when > they are full pages can go back on free lists for normal > allocations. This provides the mechanism for forwards progress, and > it's essentially the same mechanism that mempools use to guarantee > forwards progess. the only difference is that reserve pool refilling > comes through reclaim via shrinker invocation... Yes, I had something else in mind. In order to rely on replenishing through reclaim, you have to make sure that all allocations taken out of the pool are guaranteed to come back in a reasonable time frame. So once Ted said that the filesystem will not be able to declare which allocations of a task are allowed to dip into its reserves, and thus allocations of indefinite lifetime can enter the picture, my mind went to a one-off reserve pool that doesn't rely on replenishing in order to make forward progress. You declare the worst-case, finish the transaction, and return what is left of the reserves. This obviously conflicts with the estimation model that you are proposing, I hope it's now clear where our misunderstanding lies. 
Yes, we can make this work if you can tell us which allocations have limited/controllable lifetime.
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 15:04 ` Johannes Weiner @ 2015-03-04 17:38 ` Theodore Ts'o 2015-03-04 23:17 ` Dave Chinner 0 siblings, 1 reply; 83+ messages in thread From: Theodore Ts'o @ 2015-03-04 17:38 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, Andrew Morton, torvalds

On Wed, Mar 04, 2015 at 10:04:36AM -0500, Johannes Weiner wrote:
> Yes, we can make this work if you can tell us which allocations have
> limited/controllable lifetime.

It may be helpful to be a bit precise about definitions here. There are a number of different object lifetimes:

a) will be released before the kernel thread returns control to userspace

b) will be released once the current I/O operation finishes. (In the case of nbd where the remote server has unexpectedly gone away, this might be quite a while, but I'm not sure how much we care about that scenario)

c) can be trivially released if the mm subsystem asks via calling a shrinker

d) can be released only after doing some amount of bounded work (i.e., cleaning a dirty page)

e) impossible to predict when it can be released (e.g., dcache, inodes attached to open file descriptors, buffer heads that won't be freed until the file system is umounted, etc.)

I'm guessing that what you mean is (b), but what about cases such as (c)? Would the mm subsystem find it helpful if it had more information about object lifetime? For example, the CMA folks seem to really care about knowing whether a memory allocation falls in category (e) or not.

- Ted
* Re: How to handle TIF_MEMDIE stalls? 2015-03-04 17:38 ` Theodore Ts'o @ 2015-03-04 23:17 ` Dave Chinner 0 siblings, 0 replies; 83+ messages in thread From: Dave Chinner @ 2015-03-04 23:17 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, Andrew Morton, torvalds On Wed, Mar 04, 2015 at 12:38:41PM -0500, Theodore Ts'o wrote: > On Wed, Mar 04, 2015 at 10:04:36AM -0500, Johannes Weiner wrote: > > Yes, we can make this work if you can tell us which allocations have > > limited/controllable lifetime. > > It may be helpful to be a bit precise about definitions here. There > are a number of different object lifetimes: > > a) will be released before the kernel thread returns control to > userspace > > b) will be released once the current I/O operation finishes. (In the > case of nbd where the remote server has unexpectedy gone away might be > quite a while, but I'm not sure how much we care about that scenario) > > c) can be trivially released if the mm subsystem asks via calling a > shrinker > > d) can be released only after doing some amount of bounded work (i.e., > cleaning a dirty page) > > e) impossible to predict when it can be released (e.g., dcache, inodes > attached to an open file descriptors, buffer heads that won't be freed > until the file system is umounted, etc.) > > > I'm guessing that what you mean is (b), but what about cases such as > (c)? The thing is, in the XFS transaction case we are hitting e) for every allocation, and only after IO and/or some processing do we know whether it will fall into c), d) or whether it will be permanently consumed. > Would the mm subsystem find it helpful if it had more information > about object lifetime? For example, the CMA folks seem to really care > about know whether memory allocations falls in category (e) or not. The problem is that most filesystem allocations fall into category (e). 
Worse is that the state of an object can change without allocations having taken place, e.g. an object on a reclaimable LRU can be found via a cache lookup, then joined to and modified in a transaction. Hence objects can change state from "reclaimable" to "permanently consumed" without actually going through memory reclaim and allocation.

IOWs, what is really required is the ability to say "this amount of allocation reserve is now consumed" /some time after/ we've done the allocation. i.e. when we join the object to the transaction and modify it, that's when we need to be able to reduce the reservation limit as that memory is now permanently consumed by the transaction context.

Objects that fall into c) and d) don't need to have anything special done, because reclaim will eventually free the memory they hold once the allocating context releases them. Indeed, this model works even when we find those c) and d) objects in cache rather than allocating them. They would get correctly accounted as "consumed reserve" because we no longer need to allocate that memory in transaction context and so that reserve can be released back to the free pool....

Cheers, Dave.

--
Dave Chinner
david@fromorbit.com
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 0:45 ` Dave Chinner 2015-02-23 1:29 ` Andrew Morton @ 2015-02-28 16:29 ` Johannes Weiner 2015-02-28 16:41 ` Theodore Ts'o 2015-02-28 18:36 ` Vlastimil Babka 2015-03-02 15:18 ` Michal Hocko 3 siblings, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-02-28 16:29 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Mon, Feb 23, 2015 at 11:45:21AM +1100, Dave Chinner wrote: > On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote: > > On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote: > > > I will actively work around aanything that causes filesystem memory > > > pressure to increase the chance of oom killer invocations. The OOM > > > killer is not a solution - it is, by definition, a loose cannon and > > > so we should be reducing dependencies on it. > > > > Once we have a better-working alternative, sure. > > Great, but first a simple request: please stop writing code and > instead start architecting a solution to the problem. i.e. we need a > design and have that documented before code gets written. If you > watched my recent LCA talk, then you'll understand what I mean > when I say: stop programming and start engineering. This code was for the sake of argument, see below. > > > I really don't care about the OOM Killer corner cases - it's > > > completely the wrong way line of development to be spending time on > > > and you aren't going to convince me otherwise. The OOM killer a > > > crutch used to justify having a memory allocation subsystem that > > > can't provide forward progress guarantee mechanisms to callers that > > > need it. > > > > We can provide this. Are all these callers able to preallocate? > > Anything that allocates in transaction context (and therefor is > GFP_NOFS by definition) can preallocate at transaction reservation > time. 
However, preallocation is dumb, complex, CPU and memory > intensive and will have a *massive* impact on performance. > Allocating 10-100 pages to a reserve which we will almost *never > use* and then free them again *on every single transaction* is a lot > of unnecessary additional fast path overhead. Hence a "preallocate > for every context" reserve pool is not a viable solution. You are missing the point of my question. Whether we allocate right away or make sure the memory is allocatable later on is a matter of cost, but the logical outcome is the same. That is not my concern right now. An OOM killer allows transactional allocation sites to get away without planning ahead. You are arguing that the OOM killer is a cop-out on the MM side but I see it as the opposite: it puts a lot of complexity in the allocator so that callsites can maneuver themselves into situations where they absolutely need to get memory - or corrupt user data - without actually making sure their needs will be covered. If we replace __GFP_NOFAIL + OOM killer with a reserve system, we are putting the full responsibility on the user. Are you sure this is going to reduce our kernel-wide error rate? > And, really, "reservation" != "preallocation". That's an implementation detail. Yes, the example implementation was dumb and heavy-handed, but a reservation system that works based on watermarks, and considers clean cache readily allocatable, is not much more complex than that. I'm trying to figure out if the current nofail allocators can get their memory needs figured out beforehand. And reliably so - what good are estimates that are right 90% of the time, when failing the allocation means corrupting user data? What is the contingency plan? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-28 16:29 ` Johannes Weiner @ 2015-02-28 16:41 ` Theodore Ts'o 2015-02-28 22:15 ` Johannes Weiner 0 siblings, 1 reply; 83+ messages in thread From: Theodore Ts'o @ 2015-02-28 16:41 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > I'm trying to figure out if the current nofail allocators can get > their memory needs figured out beforehand. And reliably so - what > good are estimates that are right 90% of the time, when failing the > allocation means corrupting user data? What is the contingency plan? In the ideal world, we can figure out the exact memory needs beforehand. But we live in an imperfect world, and given that block devices *also* need memory, the answer is "of course not". We can't be perfect. But we can at least give some kind of hint, and we can offer to wait before we get into a situation where we need to loop in GFP_NOWAIT --- which is the contingency/fallback plan. I'm sure that's not very satisfying, but it's better than what we have now. - Ted _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-28 16:41 ` Theodore Ts'o @ 2015-02-28 22:15 ` Johannes Weiner 2015-03-01 11:17 ` Tetsuo Handa ` (2 more replies) 0 siblings, 3 replies; 83+ messages in thread From: Johannes Weiner @ 2015-02-28 22:15 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > I'm trying to figure out if the current nofail allocators can get > > their memory needs figured out beforehand. And reliably so - what > > good are estimates that are right 90% of the time, when failing the > > allocation means corrupting user data? What is the contingency plan? > > In the ideal world, we can figure out the exact memory needs > beforehand. But we live in an imperfect world, and given that block > devices *also* need memory, the answer is "of course not". We can't > be perfect. But we can at least give some kind of hint, and we can offer > to wait before we get into a situation where we need to loop in > GFP_NOWAIT --- which is the contingency/fallback plan. Overestimating should be fine, the result would be a bit of false memory pressure. But underestimating and looping can't be an option or the original lockups will still be there. We need to guarantee forward progress or the problem is somewhat mitigated at best - only now with quite a bit more complexity in the allocator and the filesystems. The block code would have to be looked at separately, but doesn't it already use mempools etc. to guarantee progress? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
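For context, the mempool mechanism referenced here keeps a minimum number of preallocated elements that allocations fall back on when the regular allocator fails, which is what guarantees forward progress in the block layer. A toy userspace model of the idea follows; this is not the kernel's actual mempool_t implementation, and fail_regular merely stands in for an allocator under memory pressure:

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_MIN 4   /* guaranteed reserve, like mempool's min_nr */

struct pool {
	void *elem[POOL_MIN];
	int count;
};

/* Fill the reserve up front, when allocation is still easy. */
static int pool_init(struct pool *p, size_t size)
{
	p->count = 0;
	for (int i = 0; i < POOL_MIN; i++) {
		void *e = malloc(size);
		if (!e)
			return -1;
		p->elem[p->count++] = e;
	}
	return 0;
}

/* Like mempool_alloc(): try the regular allocator first, then fall
 * back to the reserve so progress is guaranteed. */
static void *pool_alloc(struct pool *p, size_t size, int fail_regular)
{
	void *e = fail_regular ? NULL : malloc(size);
	if (e)
		return e;
	if (p->count > 0)
		return p->elem[--p->count];
	return NULL;   /* the real mempool_alloc would sleep for a freed element */
}

/* Frees refill the reserve before going back to the general allocator. */
static void pool_free(struct pool *p, void *e)
{
	if (p->count < POOL_MIN)
		p->elem[p->count++] = e;
	else
		free(e);
}
```

The key property is that pool users are a closed set: as long as every element borrowed from the reserve is eventually freed back, at least POOL_MIN allocations can always complete.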
* Re: How to handle TIF_MEMDIE stalls? 2015-02-28 22:15 ` Johannes Weiner @ 2015-03-01 11:17 ` Tetsuo Handa 2015-03-06 11:53 ` Tetsuo Handa 2015-03-01 13:43 ` Theodore Ts'o 2015-03-01 21:48 ` Dave Chinner 2 siblings, 1 reply; 83+ messages in thread From: Tetsuo Handa @ 2015-03-01 11:17 UTC (permalink / raw) To: hannes, tytso Cc: dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, fernando_b1, torvalds Johannes Weiner wrote: > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > > > I'm trying to figure out if the current nofail allocators can get > > > their memory needs figured out beforehand. And reliably so - what > > > good are estimates that are right 90% of the time, when failing the > > > allocation means corrupting user data? What is the contingency plan? > > > > In the ideal world, we can figure out the exact memory needs > > beforehand. But we live in an imperfect world, and given that block > > devices *also* need memory, the answer is "of course not". We can't > > be perfect. But we can at least give some kind of hint, and we can offer > > to wait before we get into a situation where we need to loop in > > GFP_NOWAIT --- which is the contingency/fallback plan. > > Overestimating should be fine, the result would be a bit of false memory > pressure. But underestimating and looping can't be an option or the > original lockups will still be there. We need to guarantee forward > progress or the problem is somewhat mitigated at best - only now with > quite a bit more complexity in the allocator and the filesystems. > > The block code would have to be looked at separately, but doesn't it > already use mempools etc. to guarantee progress? > If underestimating is tolerable, can we simply set different watermark levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations?
For example, GFP_KERNEL (or above) can fail if memory usage exceeds 95% GFP_NOFS can fail if memory usage exceeds 97% GFP_NOIO can fail if memory usage exceeds 98% GFP_ATOMIC can fail if memory usage exceeds 99% I think it strange that the below order-0 GFP_NOIO allocation enters a retry-forever loop when a GFP_KERNEL (or above) allocation starts waiting for reclaim. Use of the same watermark prevents kernel worker threads from processing the workqueue. While it is legal to do blocking operations from a workqueue, being blocked forever amounts to exclusive occupation of the workqueue; other jobs in the workqueue get stuck. [ 907.302050] kworker/1:0 R running task 0 10832 2 0x00000080 [ 907.303961] Workqueue: events_freezable_power_ disk_events_workfn [ 907.305706] ffff88007c8ab7d8 0000000000000046 ffff88007c8ab8a0 ffff88007c894190 [ 907.307761] 0000000000012500 ffff88007c8abfd8 0000000000012500 ffff88007c894190 [ 907.309894] 0000000000000020 ffff88007c8ab8b0 0000000000000002 ffffffff81848408 [ 907.311949] Call Trace: [ 907.312989] [<ffffffff8159f814>] _cond_resched+0x24/0x40 [ 907.314578] [<ffffffff81122119>] shrink_slab+0x139/0x150 [ 907.316182] [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0 [ 907.317889] [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0 [ 907.319535] [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40 [ 907.321259] [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100 [ 907.322945] [<ffffffff8125bed6>] bio_copy_user_iov+0x1d6/0x380 [ 907.324606] [<ffffffff8125e4cd>] ? blk_rq_init+0xed/0x160 [ 907.326196] [<ffffffff8125c119>] bio_copy_kern+0x49/0x100 [ 907.327788] [<ffffffff810a14a0>] ? prepare_to_wait_event+0x100/0x100 [ 907.329549] [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130 [ 907.331184] [<ffffffff8116393e>] ?
kmem_cache_alloc+0x48e/0x4b0 [ 907.332877] [<ffffffff813a66cf>] scsi_execute+0x12f/0x160 [ 907.334452] [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0 [ 907.336156] [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod] [ 907.337893] [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0 [ 907.339539] [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom] [ 907.341289] [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod] [ 907.343115] [<ffffffff812701c6>] disk_check_events+0x56/0x1b0 [ 907.344771] [<ffffffff81270331>] disk_events_workfn+0x11/0x20 [ 907.346421] [<ffffffff8107ceaf>] process_one_work+0x13f/0x370 [ 907.348057] [<ffffffff8107de99>] worker_thread+0x119/0x500 [ 907.349650] [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350 [ 907.351295] [<ffffffff81082f7c>] kthread+0xdc/0x100 [ 907.352765] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 [ 907.354520] [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0 [ 907.356097] [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0 If I change GFP_NOIO in scsi_execute() to GFP_ATOMIC, the above trace goes away. If we can reserve some amount of memory for the block / filesystem layer rather than letting non-critical allocations consume it, the above trace will likely go away. Or, instead, maybe we can change GFP_NOIO to take the following steps, provided we can implement a freelist for GFP_NOIO: (1) try the allocation using GFP_ATOMIC|GFP_NOWARN, (2) try allocating from the freelist reserved for GFP_NOIO, (3) fail the allocation with a warning message. Ditto for GFP_NOFS. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
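The three-step fallback proposed here can be sketched as a simple decision ladder. This is a toy model only; atomic_watermark_ok and noio_freelist_pages are invented stand-ins for real zone state and a hypothetical dedicated GFP_NOIO reserve, and none of these names exist in the kernel:

```c
#include <assert.h>

enum alloc_result { FROM_ATOMIC, FROM_NOIO_FREELIST, ALLOC_FAILED };

static int atomic_watermark_ok;    /* stand-in: can we allocate without sleeping? */
static int noio_freelist_pages;    /* stand-in: pages in the GFP_NOIO reserve */

static enum alloc_result noio_alloc(void)
{
	/* (1) opportunistic attempt, GFP_ATOMIC | __GFP_NOWARN style */
	if (atomic_watermark_ok)
		return FROM_ATOMIC;

	/* (2) fall back to the freelist reserved for GFP_NOIO callers */
	if (noio_freelist_pages > 0) {
		noio_freelist_pages--;
		return FROM_NOIO_FREELIST;
	}

	/* (3) fail with a warning rather than retrying forever */
	return ALLOC_FAILED;
}
```

The point of step (3) is that the caller gets a definite failure it can handle, instead of the retry-forever loop shown in the trace above.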
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 11:17 ` Tetsuo Handa @ 2015-03-06 11:53 ` Tetsuo Handa 0 siblings, 0 replies; 83+ messages in thread From: Tetsuo Handa @ 2015-03-06 11:53 UTC (permalink / raw) To: david Cc: tytso, hannes, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, fernando_b1, torvalds Tetsuo Handa wrote: > If underestimating is tolerable, can we simply set different watermark > levels for GFP_ATOMIC / GFP_NOIO / GFP_NOFS / GFP_KERNEL allocations? > For example, > > GFP_KERNEL (or above) can fail if memory usage exceeds 95% > GFP_NOFS can fail if memory usage exceeds 97% > GFP_NOIO can fail if memory usage exceeds 98% > GFP_ATOMIC can fail if memory usage exceeds 99% > > I think it strange that the below order-0 GFP_NOIO allocation enters a > retry-forever loop when a GFP_KERNEL (or above) allocation starts waiting > for reclaim. Use of the same watermark prevents kernel worker threads from > processing the workqueue. While it is legal to do blocking operations from a > workqueue, being blocked forever amounts to exclusive occupation of the > workqueue; other jobs in the workqueue get stuck. > The experimental patch below, which raises the zone watermark, works for me. ---------- diff --git a/include/linux/sched.h b/include/linux/sched.h index 6d77432..92233e1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1710,6 +1710,7 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif + gfp_t gfp_mask; }; /* Future-safe accessor for struct task_struct's cpus_allowed.
*/ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7abfa70..1a6b830 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1810,6 +1810,12 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order, min -= min / 2; if (alloc_flags & ALLOC_HARDER) min -= min / 4; + if (min == mark) { + if (current->gfp_mask & __GFP_FS) + min <<= 1; + if (current->gfp_mask & __GFP_IO) + min <<= 1; + } #ifdef CONFIG_CMA /* If allocation can't use CMA areas don't use free CMA pages */ if (!(alloc_flags & ALLOC_CMA)) @@ -2810,6 +2816,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, .nodemask = nodemask, .migratetype = gfpflags_to_migratetype(gfp_mask), }; + gfp_t orig_gfp_mask; gfp_mask &= gfp_allowed_mask; @@ -2831,6 +2838,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE) alloc_flags |= ALLOC_CMA; + orig_gfp_mask = current->gfp_mask; + current->gfp_mask = gfp_mask; retry_cpuset: cpuset_mems_cookie = read_mems_allowed_begin(); @@ -2873,6 +2882,7 @@ out: if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) goto retry_cpuset; + current->gfp_mask = orig_gfp_mask; return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); ---------- Thanks again to Jonathan Corbet for writing https://lwn.net/Articles/635354/ . Is Dave Chinner's "reservations" suggestion conceptually the same as the patch above? Dave's suggestion is to ask each of the GFP_NOFS and GFP_NOIO users to estimate how many pages they need for their transaction, like if (min == mark) { if (current->gfp_mask & __GFP_FS) min += atomic_read(&reservation_for_gfp_fs); if (current->gfp_mask & __GFP_IO) min += atomic_read(&reservation_for_gfp_io); } rather than to ask the administrator to specify a static amount, like if (min == mark) { if (current->gfp_mask & __GFP_FS) min += sysctl_reservation_for_gfp_fs; if (current->gfp_mask & __GFP_IO) min += sysctl_reservation_for_gfp_io; } ?
The retry-forever loop will happen if underestimated, won't it? Then, how to handle it when the OOM killer missed the target (due to __GFP_FS) or the OOM killer cannot be invoked (due to !__GFP_FS)? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 83+ messages in thread
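The effect of the experimental watermark patch can be modelled in isolation: each reclaim capability a context has (__GFP_IO, __GFP_FS) doubles the bar it must clear, so GFP_KERNEL needs four times the base watermark, GFP_NOFS twice, and GFP_NOIO only the base value. That yields exactly the ordering proposed earlier in the thread. A standalone sketch follows; the flag bits are stand-ins rather than the kernel's actual __GFP_* values, and the min == mark guard from the patch is omitted for brevity:

```c
#include <assert.h>

/* Stand-in flag values, not the kernel's actual __GFP_* bits. */
#define TOY_GFP_IO (1u << 0)
#define TOY_GFP_FS (1u << 1)

/* Mirrors the experimental hunk above: contexts with more reclaim
 * capability must clear a higher watermark, leaving a slice of memory
 * that only GFP_NOIO/GFP_NOFS allocations can reach. */
static int watermark_ok(long free_pages, long min, unsigned int gfp_mask)
{
	if (gfp_mask & TOY_GFP_FS)
		min <<= 1;
	if (gfp_mask & TOY_GFP_IO)
		min <<= 1;
	return free_pages > min;
}
```

With a base watermark of 100 pages and 150 free, a GFP_NOIO-like allocation (neither bit set) still succeeds while a GFP_NOFS-like one (IO bit set, effective watermark 200) already fails, so the restricted context keeps a private cushion.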
* Re: How to handle TIF_MEMDIE stalls? 2015-02-28 22:15 ` Johannes Weiner 2015-03-01 11:17 ` Tetsuo Handa @ 2015-03-01 13:43 ` Theodore Ts'o 2015-03-01 16:15 ` Johannes Weiner 2015-03-01 20:17 ` Johannes Weiner 2015-03-01 21:48 ` Dave Chinner 2 siblings, 2 replies; 83+ messages in thread From: Theodore Ts'o @ 2015-03-01 13:43 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > Overestimating should be fine, the result would be a bit of false memory > pressure. But underestimating and looping can't be an option or the > original lockups will still be there. We need to guarantee forward > progress or the problem is somewhat mitigated at best - only now with > quite a bit more complexity in the allocator and the filesystems. We've lived with looping as it is and in practice it's actually worked well. I can only speak for ext4, but I do a lot of testing under very high memory pressure situations, and it is used in *production* under very high stress situations --- and the only time we've run into trouble is when the looping behaviour somehow got accidentally *removed*. There have been MM experts who have been worrying about this situation for a very long time, but honestly, it seems to be much more of a theoretical than actual concern. So if you don't want to get hints/estimates about how much memory the file system is about to use, when the file system is willing to wait or even potentially return ENOMEM (although I suspect starting to return ENOMEM where most user space applications don't expect it will cause more problems), I'm personally happy to just use GFP_NOFAIL everywhere --- or to hard code my own infinite loops if the MM developers want to take GFP_NOFAIL away. Because in my experience, looping simply hasn't been as awful as some folks on this thread have made it out to be.
So if you don't like the complexity because the perfect is the enemy of the good, we can just drop this and the file systems can simply continue to loop around their memory allocation calls... or if that fails we can start adding subsystem specific mempools, which would be even more wasteful of memory and probably at least as complicated. - Ted _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 13:43 ` Theodore Ts'o @ 2015-03-01 16:15 ` Johannes Weiner 2015-03-01 19:36 ` Theodore Ts'o 2015-03-01 20:17 ` Johannes Weiner 1 sibling, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-03-01 16:15 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote: > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > Overestimating should be fine, the result would be a bit of false memory > > pressure. But underestimating and looping can't be an option or the > > original lockups will still be there. We need to guarantee forward > > progress or the problem is somewhat mitigated at best - only now with > > quite a bit more complexity in the allocator and the filesystems. > > We've lived with looping as it is and in practice it's actually worked > well. I can only speak for ext4, but I do a lot of testing under very > high memory pressure situations, and it is used in *production* under > very high stress situations --- and the only time we've run into > trouble is when the looping behaviour somehow got accidentally > *removed*. > > There have been MM experts who have been worrying about this situation > for a very long time, but honestly, it seems to be much more of a > theoretical than actual concern. Well, looping is a valid thing to do in most situations because on a loaded system there is a decent chance that an unrelated thread will volunteer some unreclaimable memory, or exit altogether. Right now, we rely on this happening, and it works most of the time. Maybe all the time, depending on how your machine is used. But when it doesn't, machines do lock up in practice. We had these lockups in cgroups with just a handful of threads, which all got stuck in the allocator and there was nobody left to volunteer unreclaimable memory.
When this was being addressed, we knew that the same can theoretically happen on the system-level but weren't aware of any reports. Well now, here we are. It's been argued in this thread that systems shouldn't be pushed to such extremes in real life and that we simply expect failure at some point. If that's the consensus, then yes, we can stop this and tell users that they should scale back. But I'm not convinced just yet that this is the best we can do. > So if you don't want to get hints/estimates about how much memory > the file system is about to use, when the file system is willing to > wait or even potentially return ENOMEM (although I suspect starting > to return ENOMEM where most user space application don't expect it > will cause more problems), I'm personally happy to just use > GFP_NOFAIL everywhere --- or to hard code my own infinite loops if > the MM developers want to take GFP_NOFAIL away. Because in my > experience, looping simply hasn't been as awful as some folks on > this thread have made it out to be. As I've said before, I'd be happy to get estimates from the filesystem so that we can adjust our reserves, instead of simply running against the wall at some point and hoping that the OOM killer heuristics will save the day. Until then, I'd much prefer __GFP_NOFAIL over open-coded loops. If the OOM killer is too aggressive, we can tone it down, but as it stands that mechanism is the last attempt at forward progress if looping doesn't work out. In addition, when we finally transition to private memory reserves, we can easily find the callsites that need to be annotated with __GFP_MAY_DIP_INTO_PRIVATE_RESERVES. > So if you don't like the complexity because the perfect is the enemy > of the good, we can just drop this and the file systems can simply > continue to loop around their memory allocation calls... 
or if that > fails we can start adding subsystem specific mempools, which would be > even more wasteful of memory and probably at least as complicated. It really depends on what the goal here is. You don't have to be perfectly accurate, but if you can give us a worst-case estimate we can actually guarantee forward progress and eliminate these lockups entirely, like in the block layer. Sure, there will be bugs and the estimates won't be right from the start, but we can converge towards the right answer. If the allocations which are allowed to dip into the reserves - the current nofail sites? - can be annotated with a gfp flag, we can easily verify the estimates by serving those sites exclusively from the private reserve pool and emit warnings when that runs dry. We wouldn't even have to stress the system for that. But there are legitimate concerns that this might never work. For example, the requirements could be so unpredictable, or assessing them with reasonable accuracy could be so expensive, that the margin of error would make the worst case estimate too big to be useful. Big enough that the reserves would harm well-behaved systems. And if useful worst-case estimates are unattainable, I don't think we need to bother with reserves. We can just stick with looping and OOM killing, that works most of the time, too. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
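The verification scheme Johannes describes, serving annotated sites exclusively from a private reserve and warning when it runs dry, needs little more than a counter and a low-water mark. A hypothetical sketch with invented names (reserve_fill(), reserve_alloc()); the low-water mark is what lets the estimate be checked without ever stressing the system:

```c
#include <assert.h>
#include <stdio.h>

static long reserve_pages;      /* sized from the filesystem's estimate */
static long reserve_low_water;  /* closest the reserve ever got to empty */
static int  reserve_warned;

static void reserve_fill(long estimate)
{
	reserve_pages = reserve_low_water = estimate;
	reserve_warned = 0;
}

/* Called only from the annotated (would-be nofail) allocation sites. */
static int reserve_alloc(long pages)
{
	if (reserve_pages < pages) {
		/* estimate was too small: surface it as a warning */
		reserve_warned = 1;
		fprintf(stderr, "reserve ran dry: worst-case estimate too low\n");
		return -1;
	}
	reserve_pages -= pages;
	if (reserve_pages < reserve_low_water)
		reserve_low_water = reserve_pages;
	return 0;
}
```

Comparing reserve_low_water against the original estimate after a run shows how much margin the estimate had, so over- and underestimates both converge toward the right answer.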
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 16:15 ` Johannes Weiner @ 2015-03-01 19:36 ` Theodore Ts'o 2015-03-01 20:44 ` Johannes Weiner 0 siblings, 1 reply; 83+ messages in thread From: Theodore Ts'o @ 2015-03-01 19:36 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote: > > We had these lockups in cgroups with just a handful of threads, which > all got stuck in the allocator and there was nobody left to volunteer > unreclaimable memory. When this was being addressed, we knew that the > same can theoretically happen on the system-level but weren't aware of > any reports. Well now, here we are. I think the "few threads in a small cgroup" problem is a little different, because in those cases very often the global system has enough memory, and there is always the possibility that we might relax the memory cgroup guarantees a little in order to allow forward progress. In fact, arguably this *is* the right thing to do, because we have situations where (a) the VFS takes the directory mutex, (b) the directory blocks have been pushed out of memory, and so (c) a system call running in a container with a small amount of memory and/or a small amount of disk bandwidth allowed via its prop I/O settings ends up taking a very long time for the directory blocks to be read into memory. If a high priority process, like say a cluster management daemon, also tries to read the same directory, it can end up stalled for long enough for the software watchdog to take out the entire machine from the cluster.
The hard problem here is that the lock is taken by the VFS, *before* it calls into the file system specific layer, and so the VFS has no idea (a) how much memory or disk bandwidth it needs, and (b) whether it needs any memory or disk bandwidth in the first place in order to service a directory lookup operation (most of the time, it doesn't). So there may be situations where, in the restricted cgroup, it would be useful for the file system to be able to say, "you know, we're holding onto a lock, and the disk controller is going to force this low priority cgroup to wait over a minute for the I/O to even be queued out to the disk, so maybe we should make an exception and bust the disk controller cgroup cap". (There is a related problem where a cgroup with a low disk bandwidth quota is slowing down writeback, and we are desperately short on global memory, and where relaxing the disk bandwidth limit via some kind of priority inheritance scheme would prevent "innocent" high-priority cgroups from having some of their processes get OOM-killed. I suppose one could claim that the high priority cgroups tend to belong to the sysadmin, who set the stupid disk bandwidth caps in the first place, so there is a certain justice in having the high priority processes getting OOM killed, but still, it would be nice if we could do the right thing automatically.) But in any case, some of these workarounds, where we relax a particularly tightly constrained cgroup limit, are obviously not going to help when the entire system is low on memory. > It really depends on what the goal here is. You don't have to be > perfectly accurate, but if you can give us a worst-case estimate we > can actually guarantee forward progress and eliminate these lockups > entirely, like in the block layer. Sure, there will be bugs and the > estimates won't be right from the start, but we can converge towards > the right answer.
If the allocations which are allowed to dip into > the reserves - the current nofail sites? - can be annotated with a gfp > flag, we can easily verify the estimates by serving those sites > exclusively from the private reserve pool and emit warnings when that > runs dry. We wouldn't even have to stress the system for that. > > But there are legitimate concerns that this might never work. For > example, the requirements could be so unpredictable, or assessing them > with reasonable accuracy could be so expensive, that the margin of > error would make the worst case estimate too big to be useful. Big > enough that the reserves would harm well-behaved systems. And if > useful worst-case estimates are unattainable, I don't think we need to > bother with reserves. We can just stick with looping and OOM killing, > that works most of the time, too. I'm not sure that you want to reserve for the worst-case. What might work is if subsystems (probably primarily file systems) give you estimates for the usual case and the worst case, and you reserve for something in between these two bounds. In practice there will be a huge number of file system operations taking place in your typical super-busy system, and if you reserve for the worst case, it probably will be too much. We need to make sure there is enough memory available for some forward progress, and if we need to stall a few operations with some sleeping loops, so be it. So the "heads up" amounts don't have to be strict reservations in the sense that the memory will be available instantly without any sleeping or looping.
I would also suggest that "reservations" be tied to a task struct and not to some magic __GFP_* flag, since it's not just allocations done by the file system, but also by the block device drivers, and if certain write operations fail, the results will be catastrophic -- and the block device can't tell the difference between an I/O operation that must succeed (or else we declare the file system as needing manual recovery and potentially reboot the entire system) and an I/O operation where a failure could be handled by reflecting ENOMEM back up to userspace. The difference is a property of the call stack, so the simplest way of handling this is to store the reservation in the task struct, and let the reservation get automatically returned to the system when a particular process makes a transition from kernel space to user space. The bottom line is that I agree that looping and OOM-killing works most of the time, and so I'm happy with something that makes life a little bit better and a little bit more predictable for the VM, if that makes the system behave a bit more smoothly under high memory pressures. But at the same time, we don't want to make things too complicated; whether that means that we don't try to achieve perfection, or that we simply don't worry about the global memory pressure situation and instead think about other solutions to the "small number of threads in a container" case: OOM kill a bit less frequently, force the container to loop/sleep for a bit, and then allow a random foreground kernel thread in the container to "borrow" a small amount of memory to hopefully let it make forward progress, especially if it is holding locks, or is in the process of exiting, etc. - Ted _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
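Ted's task-struct-based scheme might look schematically like this: the reservation is charged when the task enters the kernel, debited by allocations made anywhere on its call stack (file system or block driver alike, with no gfp flag needed), and the remainder returned automatically on the transition back to user space. A userspace toy with invented names, not a proposed kernel interface:

```c
#include <assert.h>

static long global_reserve = 1000;   /* pages held back for reservations */

struct task {
	long reserved;   /* pages this task may still draw */
};

/* Syscall entry: charge the estimate, or throttle before starting. */
static int task_enter_syscall(struct task *t, long estimate)
{
	if (global_reserve < estimate)
		return -1;   /* caller should sleep/retry before taking locks */
	global_reserve -= estimate;
	t->reserved = estimate;
	return 0;
}

/* Any allocation on this task's kernel stack draws from its own
 * reservation, no matter which layer (fs, block driver) performs it. */
static int task_alloc(struct task *t, long pages)
{
	if (t->reserved < pages)
		return -1;
	t->reserved -= pages;
	return 0;
}

/* Kernel-to-user transition: leftover reservation is returned
 * automatically, so nothing leaks if the syscall used less than
 * estimated. */
static void task_return_to_user(struct task *t)
{
	global_reserve += t->reserved;
	t->reserved = 0;
}

static long reserve_available(void) { return global_reserve; }
```

Because the reservation travels with the task rather than with a gfp flag, the block driver's allocations are covered by the file system's estimate without either layer knowing about the other.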
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 19:36 ` Theodore Ts'o @ 2015-03-01 20:44 ` Johannes Weiner 0 siblings, 0 replies; 83+ messages in thread From: Johannes Weiner @ 2015-03-01 20:44 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sun, Mar 01, 2015 at 02:36:35PM -0500, Theodore Ts'o wrote: > On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote: > > > > We had these lockups in cgroups with just a handful of threads, which > > all got stuck in the allocator and there was nobody left to volunteer > > unreclaimable memory. When this was being addressed, we knew that the > > same can theoretically happen on the system-level but weren't aware of > > any reports. Well now, here we are. > > I think the "few threads in a small cgroup" problem is a little > different, because in those cases very often the global system has > enough memory, and there is always the possibility that we might relax > the memory cgroup guarantees a little in order to allow forward > progress. That's exactly how we fixed it. __GFP_NOFAIL allocations are allowed to simply bypass the cgroup memory limits when reclaim within the group fails to make room for the allocation. I'm just mentioning that because the global case doesn't have the same out, but is susceptible to the same deadlock situation when there are no other threads volunteering pages. If your machines are loaded with hundreds or thousands of threads, it is likely that a thread stuck in the allocator will be bailed out by the other threads in the system (or that you run into CPU limits first), but if you have only a handful of memory-intensive tasks, this might not be the case. The cgroup problem was closer to that second scenario, where a few threads split all available memory between them.
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 13:43 ` Theodore Ts'o 2015-03-01 16:15 ` Johannes Weiner @ 2015-03-01 20:17 ` Johannes Weiner 1 sibling, 0 replies; 83+ messages in thread From: Johannes Weiner @ 2015-03-01 20:17 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote: > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > Overestimating should be fine, the result would a bit of false memory > > pressure. But underestimating and looping can't be an option or the > > original lockups will still be there. We need to guarantee forward > > progress or the problem is somewhat mitigated at best - only now with > > quite a bit more complexity in the allocator and the filesystems. > > We've lived with looping as it is and in practice it's actually worked > well. I can only speak for ext4, but I do a lot of testing under very > high memory pressure situations, and it is used in *production* under > very high stress situations --- and the only time we'e run into > trouble is when the looping behaviour somehow got accidentally > *removed*. Memory is a finite resource and there are (unlimited) consumers that do not allow their share to be reclaimed/recycled. Mainly this is the kernel itself, but it also includes anon memory once swap space runs out, as well as mlocked and dirty memory. It's not a question of whether there exists a true point of OOM (where not enough memory is recyclable to satisfy new allocations). That point inevitably exists. It's a policy question of how to inform userspace once it is reached. We agree that we can't unconditionally fail allocations, because we might be in the middle of a transaction, where an allocation failure can potentially corrupt userdata. However, endlessly looping for progress that can not happen at this point has the exact same effect: the transaction won't finish. 
Only the machine locks up in addition. It's great that your setups don't ever truly go out of memory, but that doesn't mean it can't happen in practice. One answer to users at this point could certainly be to stay away from the true point of OOM, and if you don't then that's your problem. But the issue I take with this answer is that, for the sake of memory utilization, users kind of do want to get fairly close to this point, and at the same time it's hard to reliably predict the memory consumption of a workload in advance. It can depend on the timing between threads, it can depend on user/network-supplied input, and it can simply be a bug in the application. And if that OOM situation is accidentally entered, I'd prefer we had a better answer than locking up the machine and blame the user. So one attempt to make progress in this situation is to kill userspace applications that are pinning unreclaimable memory. This is what we are doing now, but there are several problems with it. For one, we are doing a terrible job and might still get stuck sometimes, which deteriorates the situation back to failing the allocation and corrupting the filesystem. Secondly, killing tasks is disruptive, and because it's driven by heuristics we're never going to kill the "right" one in all situations. Reserves would allow us to look ahead and avoid starting transactions that can not be finished given the available resources. So we are at least avoiding filesystem corruption. The tasks could probably be put to sleep for some time in the hope that ongoing transactions complete and release memory, but there might not be any, and eventually the OOM situation has to be communicated to userspace. Arguably, an -ENOMEM from a syscall at this point might be easier to handle than a SIGKILL from the OOM killer in an unrelated task. So if we could pull off reserves, they look like the most attractive solution to me. If not, the OOM killer needs to be fixed to always make forward progress instead. 
I proposed a patch for that already. But infinite loops that force the user to reboot the machine at the point of OOM seem like a terrible policy. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-28 22:15 ` Johannes Weiner 2015-03-01 11:17 ` Tetsuo Handa 2015-03-01 13:43 ` Theodore Ts'o @ 2015-03-01 21:48 ` Dave Chinner 2015-03-02 0:17 ` Dave Chinner 2 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-03-01 21:48 UTC (permalink / raw) To: Johannes Weiner Cc: Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > > > I'm trying to figure out if the current nofail allocators can get > > > their memory needs figured out beforehand. And reliably so - what > > > good are estimates that are right 90% of the time, when failing the > > > allocation means corrupting user data? What is the contingency plan? > > > > In the ideal world, we can figure out the exact memory needs > > beforehand. But we live in an imperfect world, and given that block > > devices *also* need memory, the answer is "of course not". We can't > > be perfect. But we can least give some kind of hint, and we can offer > > to wait before we get into a situation where we need to loop in > > GFP_NOWAIT --- which is the contingency/fallback plan. > > Overestimating should be fine, the result would a bit of false memory > pressure. But underestimating and looping can't be an option or the > original lockups will still be there. We need to guarantee forward > progress or the problem is somewhat mitigated at best - only now with > quite a bit more complexity in the allocator and the filesystems. The additional complexity in XFS is actually quite minor, and initial "rough worst case" memory usage estimates are not that hard to measure.... > The block code would have to be looked at separately, but doesn't it > already use mempools etc. to guarantee progress? Yes, it does. 
I'm not concerned about the block layer. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-01 21:48 ` Dave Chinner @ 2015-03-02 0:17 ` Dave Chinner 2015-03-02 12:46 ` Brian Foster 0 siblings, 1 reply; 83+ messages in thread From: Dave Chinner @ 2015-03-02 0:17 UTC (permalink / raw) To: Johannes Weiner Cc: Theodore Ts'o, Tetsuo Handa, rientjes, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, akpm, torvalds On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote: > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > > > > > I'm trying to figure out if the current nofail allocators can get > > > > their memory needs figured out beforehand. And reliably so - what > > > > good are estimates that are right 90% of the time, when failing the > > > > allocation means corrupting user data? What is the contingency plan? > > > > > > In the ideal world, we can figure out the exact memory needs > > > beforehand. But we live in an imperfect world, and given that block > > > devices *also* need memory, the answer is "of course not". We can't > > > be perfect. But we can least give some kind of hint, and we can offer > > > to wait before we get into a situation where we need to loop in > > > GFP_NOWAIT --- which is the contingency/fallback plan. > > > > Overestimating should be fine, the result would a bit of false memory > > pressure. But underestimating and looping can't be an option or the > > original lockups will still be there. We need to guarantee forward > > progress or the problem is somewhat mitigated at best - only now with > > quite a bit more complexity in the allocator and the filesystems. > > The additional complexity in XFS is actually quite minor, and > initial "rough worst case" memory usage estimates are not that hard > to measure.... 
And, just to point out that the OOM killer can be invoked without a single transaction-based filesystem ENOMEM failure, here's what xfs/084 does on 4.0-rc1: [ 148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [ 148.822113] resvtest cpuset=/ mems_allowed=0 [ 148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825 [ 148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 148.826471] 0000000000000000 ffff88003ba2b988 ffffffff81dcb570 000000000000000c [ 148.828220] ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 0000000000000000 [ 148.829958] 0000000000000000 ffff88003ba2b9a8 0000000000000206 ffff88003ba2b9d8 [ 148.831734] Call Trace: [ 148.832325] [<ffffffff81dcb570>] dump_stack+0x4c/0x65 [ 148.833493] [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb [ 148.834855] [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0 [ 148.836195] [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40 [ 148.837633] [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500 [ 148.838925] [<ffffffff8117e44b>] out_of_memory+0x5b/0x80 [ 148.840162] [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810 [ 148.841592] [<ffffffff811c0531>] alloc_pages_current+0x91/0x100 [ 148.842950] [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0 [ 148.844286] [<ffffffff8117c688>] filemap_fault+0x1b8/0x420 [ 148.845545] [<ffffffff811a05ed>] __do_fault+0x3d/0x70 [ 148.846706] [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230 [ 148.848042] [<ffffffff81090305>] __do_page_fault+0x1a5/0x460 [ 148.849333] [<ffffffff81090675>] trace_do_page_fault+0x45/0x130 [ 148.850681] [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0 [ 148.852025] [<ffffffff81dd1567>] ? 
schedule+0x37/0x90 [ 148.853187] [<ffffffff81dd8b88>] async_page_fault+0x28/0x30 [ 148.854456] Mem-Info: [ 148.854986] Node 0 DMA per-cpu: [ 148.855727] CPU 0: hi: 0, btch: 1 usd: 0 [ 148.856820] Node 0 DMA32 per-cpu: [ 148.857600] CPU 0: hi: 186, btch: 31 usd: 0 [ 148.858688] active_anon:119251 inactive_anon:119329 isolated_anon:0 [ 148.858688] active_file:19 inactive_file:2 isolated_file:0 [ 148.858688] unevictable:0 dirty:0 writeback:0 unstable:0 [ 148.858688] free:1965 slab_reclaimable:2816 slab_unreclaimable:2184 [ 148.858688] mapped:3 shmem:2 pagetables:1259 bounce:0 [ 148.858688] free_cma:0 [ 148.865606] Node 0 DMA free:3916kB min:60kB low:72kB high:88kB active_anon:5100kB inactive_anon:5324kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(as [ 148.874431] lowmem_reserve[]: 0 966 966 966 [ 148.875504] Node 0 DMA32 free:3944kB min:3944kB low:4928kB high:5916kB active_anon:471904kB inactive_anon:471992kB active_file:76kB inactive_file:0kB unevictable:0s [ 148.884817] lowmem_reserve[]: 0 0 0 0 [ 148.885770] Node 0 DMA: 1*4kB (M) 1*8kB (U) 2*16kB (UM) 3*32kB (UM) 1*64kB (M) 1*128kB (M) 0*256kB 1*512kB (M) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 3916kB [ 148.889385] Node 0 DMA32: 8*4kB (UEM) 2*8kB (UR) 3*16kB (M) 1*32kB (M) 2*64kB (MR) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3968kB [ 148.893068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 148.894949] 47361 total pagecache pages [ 148.895816] 47334 pages in swap cache [ 148.896657] Swap cache stats: add 124669, delete 77335, find 83/169 [ 148.898057] Free swap = 0kB [ 148.898714] Total swap = 497976kB [ 148.899470] 262044 pages RAM [ 148.900145] 0 pages HighMem/MovableOnly [ 148.901006] 10253 pages reserved [ 148.901735] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [ 148.903637] [ 1204] 0 1204 6039 1 15 3 163 -1000 udevd [ 148.905571] [ 1323] 0 1323 6038 1 14 3 165 -1000 udevd [ 148.907499] [ 1324] 0 1324 6038 
1 14 3 164 -1000 udevd [ 148.909439] [ 2176] 0 2176 2524 0 6 2 571 0 dhclient [ 148.911427] [ 2227] 0 2227 9267 0 22 3 95 0 rpcbind [ 148.913392] [ 2632] 0 2632 64981 30 29 3 136 0 rsyslogd [ 148.915391] [ 2686] 0 2686 1062 1 6 3 36 0 acpid [ 148.917325] [ 2826] 0 2826 4753 0 12 2 44 0 atd [ 148.919209] [ 2877] 0 2877 6473 0 17 3 66 0 cron [ 148.921120] [ 2911] 104 2911 7078 1 17 3 81 0 dbus-daemon [ 148.923150] [ 3591] 0 3591 13731 0 28 2 165 -1000 sshd [ 148.925073] [ 3603] 0 3603 22024 0 43 2 215 0 winbindd [ 148.927066] [ 3612] 0 3612 22024 0 42 2 216 0 winbindd [ 148.929062] [ 3636] 0 3636 3722 1 11 3 41 0 getty [ 148.930981] [ 3637] 0 3637 3722 1 11 3 40 0 getty [ 148.932915] [ 3638] 0 3638 3722 1 11 3 39 0 getty [ 148.934835] [ 3639] 0 3639 3722 1 11 3 40 0 getty [ 148.936789] [ 3640] 0 3640 3722 1 11 3 40 0 getty [ 148.938704] [ 3641] 0 3641 3722 1 10 3 38 0 getty [ 148.940635] [ 3642] 0 3642 3677 1 11 3 40 0 getty [ 148.942550] [ 3643] 0 3643 25894 2 52 2 248 0 sshd [ 148.944469] [ 3649] 0 3649 146652 1 35 4 320 0 console-kit-dae [ 148.946578] [ 3716] 0 3716 48287 1 31 4 171 0 polkitd [ 148.948552] [ 3722] 1000 3722 25894 0 51 2 250 0 sshd [ 148.950457] [ 3723] 1000 3723 5435 3 15 3 495 0 bash [ 148.952375] [ 3742] 0 3742 17157 1 37 2 160 0 sudo [ 148.954275] [ 3743] 0 3743 3365 1 11 3 516 0 check [ 148.956229] [ 4130] 0 4130 3334 1 11 3 484 0 084 [ 148.958108] [ 4342] 0 4342 314556 191159 619 4 119808 0 resvtest [ 148.960104] [ 4343] 0 4343 3334 0 11 3 485 0 084 [ 148.961990] [ 4344] 0 4344 3334 0 11 3 485 0 084 [ 148.963876] [ 4345] 0 4345 3305 0 11 3 36 0 sed [ 148.965766] [ 4346] 0 4346 3305 0 11 3 37 0 sed [ 148.967652] Out of memory: Kill process 4342 (resvtest) score 803 or sacrifice child [ 148.969390] Killed process 4342 (resvtest) total-vm:1258224kB, anon-rss:764636kB, file-rss:0kB [ 149.415288] XFS (vda): Unmounting Filesystem [ 150.211229] XFS (vda): Mounting V5 Filesystem [ 150.292092] XFS (vda): Ending clean mount [ 150.342307] XFS (vda): 
Unmounting Filesystem [ 150.346522] XFS (vdb): Unmounting Filesystem [ 151.264135] XFS: kmalloc allocations by trans type [ 151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024 [ 151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144 [ 151.267735] XFS: 7: count 9, bytes 2784, fails 0, max_size 536 [ 151.269022] XFS: 16: count 1, bytes 696, fails 0, max_size 696 [ 151.270286] XFS: 26: count 1, bytes 384, fails 0, max_size 384 [ 151.271550] XFS: 35: count 1, bytes 696, fails 0, max_size 696 [ 151.272833] XFS: slab allocations by trans type [ 151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0 [ 151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0 [ 151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0 [ 151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0 [ 151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0 [ 151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0 [ 151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0 [ 151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0 [ 151.283476] XFS: vmalloc allocations by trans type [ 151.284535] XFS: page allocations by trans type Those XFS allocation stats are largest measured allocations done under transaction context broken down by allocation and transaction type. No failures that would result in looping, even though the system invoked the OOM killer on a filesystem workload.... I need to break the slab allocations down further by cache (other workloads are generating over 50 slab allocations per transaction), but another hour's work and a few days of observation of the stats in my normal day-to-day work wll get me all the information I need to do a decent first pass at memory reservation requirements for XFS. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 0:17 ` Dave Chinner @ 2015-03-02 12:46 ` Brian Foster 0 siblings, 0 replies; 83+ messages in thread From: Brian Foster @ 2015-03-02 12:46 UTC (permalink / raw) To: Dave Chinner Cc: Theodore Ts'o, Tetsuo Handa, Johannes Weiner, oleg, xfs, mhocko, linux-mm, mgorman, dchinner, rientjes, akpm, torvalds On Mon, Mar 02, 2015 at 11:17:23AM +1100, Dave Chinner wrote: > On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote: > > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote: > > > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote: > > > > > > > > > > I'm trying to figure out if the current nofail allocators can get > > > > > their memory needs figured out beforehand. And reliably so - what > > > > > good are estimates that are right 90% of the time, when failing the > > > > > allocation means corrupting user data? What is the contingency plan? > > > > > > > > In the ideal world, we can figure out the exact memory needs > > > > beforehand. But we live in an imperfect world, and given that block > > > > devices *also* need memory, the answer is "of course not". We can't > > > > be perfect. But we can least give some kind of hint, and we can offer > > > > to wait before we get into a situation where we need to loop in > > > > GFP_NOWAIT --- which is the contingency/fallback plan. > > > > > > Overestimating should be fine, the result would a bit of false memory > > > pressure. But underestimating and looping can't be an option or the > > > original lockups will still be there. We need to guarantee forward > > > progress or the problem is somewhat mitigated at best - only now with > > > quite a bit more complexity in the allocator and the filesystems. > > > > The additional complexity in XFS is actually quite minor, and > > initial "rough worst case" memory usage estimates are not that hard > > to measure.... 
> > And, just to point out that the OOM killer can be invoked without a single transaction-based filesystem ENOMEM failure, here's what xfs/084 does on 4.0-rc1: > > [OOM-killer report and XFS allocation stats snipped; see the message above] > > Those XFS allocation stats are largest measured allocations done under transaction context broken down by allocation and transaction type. No failures that would result in looping, even though the system invoked the OOM killer on a filesystem workload.... > > I need to break the slab allocations down further by cache (other workloads are generating over 50 slab allocations per transaction), but another hour's work and a few days of observation of the stats in my normal day-to-day work wll get me all the information I need to do a decent first pass at memory reservation requirements for XFS.

> This sounds like something that would serve us well under sysfs, particularly if we do adopt the kind of reservation model being discussed in this thread. I wouldn't expect these values to change drastically or that often, but they could certainly adjust over time to the point of being out of line with a reservation. A tool like this combined with Johannes' idea of a warning or something along those lines for a reservation overrun should give us all we need to identify something is wrong and have the ability to fix it. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 0:45 ` Dave Chinner 2015-02-23 1:29 ` Andrew Morton 2015-02-28 16:29 ` Johannes Weiner @ 2015-02-28 18:36 ` Vlastimil Babka 2015-03-02 15:18 ` Michal Hocko 3 siblings, 0 replies; 83+ messages in thread From: Vlastimil Babka @ 2015-02-28 18:36 UTC (permalink / raw) To: Dave Chinner, Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, mhocko, linux-mm, mgorman, rientjes, akpm, torvalds On 23.2.2015 1:45, Dave Chinner wrote: > On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote: >> On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote: >>> I will actively work around anything that causes filesystem memory >>> pressure to increase the chance of oom killer invocations. The OOM >>> killer is not a solution - it is, by definition, a loose cannon and >>> so we should be reducing dependencies on it. >> >> Once we have a better-working alternative, sure. > > Great, but first a simple request: please stop writing code and > instead start architecting a solution to the problem. i.e. we need a > design and have it documented before code gets written. If you > watched my recent LCA talk, then you'll understand what I mean > when I say: stop programming and start engineering. About that... I guess good engineering also means looking at past solutions to the same problem. I expect there would be a lot of academic work on this, which might tell us what's (not) possible. And maybe even actual implementations with real-life experience to learn from? >>> I really don't care about the OOM Killer corner cases - it's >>> completely the wrong line of development to be spending time on >>> and you aren't going to convince me otherwise. The OOM killer is a >>> crutch used to justify having a memory allocation subsystem that >>> can't provide forward progress guarantee mechanisms to callers that >>> need it. >> >> We can provide this. Are all these callers able to preallocate? 
> > Anything that allocates in transaction context (and therefore is > GFP_NOFS by definition) can preallocate at transaction reservation > time. However, preallocation is dumb, complex, CPU and memory > intensive and will have a *massive* impact on performance. > Allocating 10-100 pages to a reserve which we will almost *never > use* and then free them again *on every single transaction* is a lot > of unnecessary additional fast path overhead. Hence a "preallocate > for every context" reserve pool is not a viable solution. But won't even the reservation have a potentially large impact on performance, if, as you later suggest (IIUC), we don't actually dip into our reserves until regular reclaim starts failing? Doesn't that mean potentially a lot of wasted memory? Right, it doesn't have to be if we allow clean reclaimable pages to be part of the reserve, but still... > And, really, "reservation" != "preallocation". > > Maybe it's my filesystem background, but those two things are vastly > different things. > > Reservations are simply an *accounting* of the maximum amount of a > reserve required by an operation to guarantee forwards progress. In > filesystems, we do this for log space (transactions) and some do it > for filesystem space (e.g. delayed allocation needs correct ENOSPC > detection so we don't overcommit disk space). The VM already has > such concepts (e.g. watermarks and things like min_free_kbytes) that > it uses to ensure that there are sufficient reserves for certain > types of allocations to succeed. > > A reserve memory pool is no different - every time a memory reserve > occurs, a watermark is lifted to accommodate it, and the transaction > is not allowed to proceed until the amount of free memory exceeds > that watermark. The memory allocation subsystem then only allows > *allocations* marked correctly to allocate pages from the > reserve that watermark protects. e.g. only allocations using > __GFP_RESERVE are allowed to dip into the reserve pool. 
> > By using watermarks, freeing of memory will automatically top > up the reserve pool which means that we guarantee that reclaimable > memory allocated for demand paging during transactions doesn't > deplete the reserve pool permanently. As a result, when there is > plenty of free and/or reclaimable memory, the reserve pool > watermarks will have almost zero impact on performance and > behaviour. > > Further, because it's just accounting and behavioural thresholds, > this allows the mm subsystem to control how the reserve pool is > accounted internally. e.g. clean, reclaimable pages in the page > cache could serve as reserve pool pages as they can be immediately > reclaimed for allocation. This could be achieved by setting reclaim > targets first to the reserve pool watermark, then the second target > is enough pages to satisfy the current allocation. Hmm, but what if the clean pages need us to take some locks to unmap and some process holding them is blocked... Also we would need to potentially block a process that wants to dirty a page; is that being done now? > And, FWIW, there's nothing stopping this mechanism from having order- > based reserve thresholds. e.g. IB could really do with a 64k reserve > pool threshold and hence help solve the long standing problems they > have with filling the receive ring in GFP_ATOMIC context... I don't know the details here, but if the allocation is done for incoming packets, i.e. something you can't predict, then how would you set the reserve for that? If they could predict, they would be able to preallocate the necessary buffers already. > Sure, that's looking further down the track, but my point still > remains: we need a viable long term solution to this problem. Maybe > reservations are not the solution, but I don't see anyone else who > is thinking of how to address this architectural problem at a system > level right now. 
We need to design and document the model first, > then review it, then we can start working at the code level to > implement the solution we've designed. Right. A conference to discuss this on could come in handy :) > Cheers, > > Dave. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: How to handle TIF_MEMDIE stalls? 2015-02-23 0:45 ` Dave Chinner ` (2 preceding siblings ...) 2015-02-28 18:36 ` Vlastimil Babka @ 2015-03-02 15:18 ` Michal Hocko 2015-03-02 16:05 ` Johannes Weiner 2015-03-02 16:39 ` Theodore Ts'o 3 siblings, 2 replies; 83+ messages in thread From: Michal Hocko @ 2015-03-02 15:18 UTC (permalink / raw) To: Dave Chinner Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Mon 23-02-15 11:45:21, Dave Chinner wrote: [...] > A reserve memory pool is no different - every time a memory reserve > occurs, a watermark is lifted to accommodate it, and the transaction > is not allowed to proceed until the amount of free memory exceeds > that watermark. The memory allocation subsystem then only allows > *allocations* marked correctly to allocate pages from that the > reserve that watermark protects. e.g. only allocations using > __GFP_RESERVE are allowed to dip into the reserve pool. The idea is sound. But I am pretty sure we will find many corner cases. E.g. what if the mere reservation attempt causes the system to go OOM and trigger the OOM killer? Sure that wouldn't be too much different from the OOM triggered during the allocation but there is one major difference. Reservations need to be estimated and I expect the estimation would be on the more conservative side and so the OOM might not happen without them. > By using watermarks, freeing of memory will automatically top > up the reserve pool which means that we guarantee that reclaimable > memory allocated for demand paging during transacitons doesn't > deplete the reserve pool permanently. As a result, when there is > plenty of free and/or reclaimable memory, the reserve pool > watermarks will have almost zero impact on performance and > behaviour. 
A typical busy system won't be very far from the high watermark, so reclaim would be performed while watermarks are raised (aka reservation), and that might lead to visible performance degradation. This might be acceptable, but it also adds a certain level of unpredictability, where performance characteristics might change suddenly. > Further, because it's just accounting and behavioural thresholds, > this allows the mm subsystem to control how the reserve pool is > accounted internally. e.g. clean, reclaimable pages in the page > cache could serve as reserve pool pages as they can be immediately > reclaimed for allocation. But they can also turn hard or impossible to reclaim: clean pages might get dirty, and e.g. swap-backed pages can run out of their backing storage. So I guess we cannot count on those pages without reclaiming them first and hiding them in the reserve. Which is probably what you suggest below, but I wasn't really sure... > This could be achieved by setting reclaim targets first to the reserve > pool watermark, then the second target is enough pages to satisfy the > current allocation. > > And, FWIW, there's nothing stopping this mechanism from having > order-based reserve thresholds. e.g. IB could really do with a 64k reserve > pool threshold and hence help solve the long standing problems they > have with filling the receive ring in GFP_ATOMIC context... > > Sure, that's looking further down the track, but my point still > remains: we need a viable long term solution to this problem. Maybe > reservations are not the solution, but I don't see anyone else who > is thinking of how to address this architectural problem at a system > level right now. I think the idea is good! It will just be quite tricky to get there without causing more problems than those being solved. The biggest question mark so far seems to be the reservation size estimation.
If it is hard for any caller to know the size beforehand (which would be really close to the actually used size), then the whole complexity in the code sounds like overkill, and asking the administrator to tune min_free_kbytes seems a better fit (we would still have to teach the allocator to access reserves when really necessary) because the system would behave more predictably (although some memory would be wasted). > We need to design and document the model first, then review it, then > we can start working at the code level to implement the solution we've > designed. I have already asked James to add this to the LSF agenda but nothing has materialized on the schedule yet. I will poke him again. -- Michal Hocko SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 15:18 ` Michal Hocko @ 2015-03-02 16:05 ` Johannes Weiner 2015-03-02 17:10 ` Michal Hocko 2015-03-02 16:39 ` Theodore Ts'o 1 sibling, 1 reply; 83+ messages in thread From: Johannes Weiner @ 2015-03-02 16:05 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote: > On Mon 23-02-15 11:45:21, Dave Chinner wrote: > [...] > > A reserve memory pool is no different - every time a memory reserve > > occurs, a watermark is lifted to accommodate it, and the transaction > > is not allowed to proceed until the amount of free memory exceeds > > that watermark. The memory allocation subsystem then only allows > > *allocations* marked correctly to allocate pages from the > > reserve that watermark protects. e.g. only allocations using > > __GFP_RESERVE are allowed to dip into the reserve pool. > > The idea is sound. But I am pretty sure we will find many corner > cases. E.g. what if the mere reservation attempt causes the system > to go OOM and trigger the OOM killer? Sure that wouldn't be too much > different from the OOM triggered during the allocation but there is one > major difference. Reservations need to be estimated and I expect the > estimation would be on the more conservative side and so the OOM might > not happen without them. The whole idea is that filesystems request the reserves while they can still sleep for progress or fail the macro-operation with -ENOMEM. And the estimate wouldn't just be on the conservative side, it would have to be the worst-case scenario. If we run out of reserves in an allocation that cannot fail, that would be a bug that can lock up the machine. We can then fall back to the OOM killer in a last-ditch effort to make forward progress, but as the victim tasks can get stuck behind state/locks held by the allocation side, the machine might lock up after all.
> > By using watermarks, freeing of memory will automatically top > > up the reserve pool which means that we guarantee that reclaimable > > memory allocated for demand paging during transactions doesn't > > deplete the reserve pool permanently. As a result, when there is > > plenty of free and/or reclaimable memory, the reserve pool > > watermarks will have almost zero impact on performance and > > behaviour. > > Typical busy system won't be very far away from the high watermark > so there would be a reclaim performed during increased watermarks > (aka reservation) and that might lead to visible performance > degradation. This might be acceptable but it also adds a certain level > of unpredictability when performance characteristics might change > suddenly. There is usually a good deal of clean cache. As Dave pointed out before, clean cache can be considered re-allocatable from NOFS contexts, and so we'd only have to maintain this invariant: min_wmark + private_reserves < free_pages + clean_cache > > Further, because it's just accounting and behavioural thresholds, > > this allows the mm subsystem to control how the reserve pool is > > accounted internally. e.g. clean, reclaimable pages in the page > > cache could serve as reserve pool pages as they can be immediately > > reclaimed for allocation. > > But they also can turn hard/impossible to reclaim as well. Clean > pages might get dirty and e.g. swap backed pages run out of their > backing storage. So I guess we cannot count on those pages without > reclaiming them first and hiding them in the reserve. Which is what > you suggest below probably but I wasn't really sure... Pages reserved for use by the page cleaning path can't be considered dirtyable. They have to be included in the dirty_balance_reserve.
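[Editor's note: the invariant Johannes states can be checked directly. A minimal sketch, assuming page counts are tracked globally; the function names are made up for illustration.]

```c
#include <assert.h>
#include <stdbool.h>

/* The invariant: the baseline minimum watermark plus all private
 * reservations must be covered by free pages plus clean page cache,
 * since clean cache is re-allocatable even from NOFS contexts. */
static bool reserves_covered(long min_wmark, long private_reserves,
                             long free_pages, long clean_cache)
{
    return min_wmark + private_reserves < free_pages + clean_cache;
}

/* A new reservation is admissible only if the invariant still holds
 * with the request added to the global sum of private reserves. */
static bool can_reserve(long request, long min_wmark,
                        long private_reserves,
                        long free_pages, long clean_cache)
{
    return reserves_covered(min_wmark, private_reserves + request,
                            free_pages, clean_cache);
}
```

Note that the check is against free plus clean pages, not free pages alone - that is the optimization over plain pre-allocation, since nothing has to be reclaimed at reservation time.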
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 16:05 ` Johannes Weiner @ 2015-03-02 17:10 ` Michal Hocko 2015-03-02 17:27 ` Johannes Weiner 0 siblings, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-03-02 17:10 UTC (permalink / raw) To: Johannes Weiner Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Mon 02-03-15 11:05:37, Johannes Weiner wrote: > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote: [...] > > Typical busy system won't be very far away from the high watermark > > so there would be a reclaim performed during increased watermarks > > (aka reservation) and that might lead to visible performance > > degradation. This might be acceptable but it also adds a certain level > > of unpredictability when performance characteristics might change > > suddenly. > > There is usually a good deal of clean cache. As Dave pointed out > before, clean cache can be considered re-allocatable from NOFS > contexts, and so we'd only have to maintain this invariant: > > min_wmark + private_reserves < free_pages + clean_cache Do I understand you correctly that we do not have to reclaim clean pages as per the above invariant? If yes, how do you reflect overcommit on the clean_cache from multiple requestors (who are doing reservations)? My point was that if we keep clean pages on the LRU rather than forcing them to be reclaimed via increased watermarks, then it might happen that different callers with access to reserves wouldn't get the promised amount of reserved memory, because clean_cache is basically a shared resource. [...] -- Michal Hocko SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 17:10 ` Michal Hocko @ 2015-03-02 17:27 ` Johannes Weiner 0 siblings, 0 replies; 83+ messages in thread From: Johannes Weiner @ 2015-03-02 17:27 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, linux-mm, mgorman, rientjes, akpm, torvalds On Mon, Mar 02, 2015 at 06:10:58PM +0100, Michal Hocko wrote: > On Mon 02-03-15 11:05:37, Johannes Weiner wrote: > > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote: > [...] > > > Typical busy system won't be very far away from the high watermark > > > so there would be a reclaim performed during increased watermarks > > > (aka reservation) and that might lead to visible performance > > > degradation. This might be acceptable but it also adds a certain level > > > of unpredictability when performance characteristics might change > > > suddenly. > > > > There is usually a good deal of clean cache. As Dave pointed out > > before, clean cache can be considered re-allocatable from NOFS > > contexts, and so we'd only have to maintain this invariant: > > > > min_wmark + private_reserves < free_pages + clean_cache > > Do I understand you correctly that we do not have to reclaim clean pages > as per the above invariant? > > If yes, how do you reflect overcommit on the clean_cache from multiple > requestors (who are doing reservations)? > My point was that if we keep clean pages on the LRU rather than forcing > to reclaim them via increased watermarks then it might happen that > different callers with access to reserves wouldn't get the promised amount > of reserved memory because clean_cache is basically a shared resource. The sum of all private reservations has to be accounted globally; we obviously can't overcommit the available resources in order to solve problems stemming from overcommitting the available resources. The page allocator can't hand out free pages and page reclaim cannot reclaim clean cache unless that invariant is met.
They both have to consider them consumed. It's the same as pre-allocation, the only thing we save is having to actually reclaim the pages and take them off the freelist at reservation time - which is a good optimization since the filesystem might not actually need them all.
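[Editor's note: the rule that allocator and reclaim must treat reserved pages as already consumed can be expressed as a one-line admission check. A sketch with made-up names, assuming the invariant from the previous message.]

```c
#include <assert.h>
#include <stdbool.h>

/* Reclaim may only hand a clean page to an ordinary (non-reserve)
 * allocation if the reservation invariant still holds afterwards:
 * giving the page away shrinks the free+clean pool by one. */
static bool may_reclaim_clean_page(long min_wmark, long private_reserves,
                                   long free_pages, long clean_cache)
{
    return min_wmark + private_reserves <
           free_pages + (clean_cache - 1);
}
```

When the check fails, the clean cache is effectively consumed by the reservations, even though the pages are still sitting on the LRU.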
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 15:18 ` Michal Hocko 2015-03-02 16:05 ` Johannes Weiner @ 2015-03-02 16:39 ` Theodore Ts'o 2015-03-02 16:58 ` Michal Hocko 1 sibling, 1 reply; 83+ messages in thread From: Theodore Ts'o @ 2015-03-02 16:39 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote: > The idea is sound. But I am pretty sure we will find many corner > cases. E.g. what if the mere reservation attempt causes the system > to go OOM and trigger the OOM killer? Doctor, doctor, it hurts when I do that.... So don't trigger the OOM killer. We can let the caller decide whether the reservation request should block or return ENOMEM, but the whole point of the reservation request idea is that this happens *before* we've taken any mutexes, so blocking won't prevent forward progress. The file system could send down a different flag if we are doing writebacks for page cleaning purposes, in which case the reservation request would be a "just a heads up, we *will* be needing this much memory, but this is not something where we can block or return ENOMEM, so please give us the highest priority for using the free reserves". > I think the idea is good! It will just be quite tricky to get there > without causing more problems than those being solved. The biggest > question mark so far seems to be the reservation size estimation. If > it is hard for any caller to know the size beforehand (which would > be really close to the actually used size) then the whole complexity > in the code sounds like an overkill and asking administrator to tune > min_free_kbytes seems a better fit (we would still have to teach the > allocator to access reserves when really necessary) because the system > would behave more predictably (although some memory would be wasted). 
If we do need to teach the allocator to access reserves when really necessary, don't we have that already via GFP_NOIO/GFP_NOFS and GFP_NOFAIL? If the goal is to do something more fine-grained, unfortunately at least for the short term we'll need to preserve the existing behaviour and issue warnings until the file system starts adding GFP_NOFAIL to those memory allocations where previously GFP_NOFS was effectively guaranteeing that failures would almost never happen. I know at least one place discovered with the recent change (and revert) where I'll be fixing ext4, but I suspect it won't be the only one, especially in the block device drivers. - Ted
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 16:39 ` Theodore Ts'o @ 2015-03-02 16:58 ` Michal Hocko 2015-03-04 12:52 ` Dave Chinner 0 siblings, 1 reply; 83+ messages in thread From: Michal Hocko @ 2015-03-02 16:58 UTC (permalink / raw) To: Theodore Ts'o Cc: Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Mon 02-03-15 11:39:13, Theodore Ts'o wrote: > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote: > > The idea is sound. But I am pretty sure we will find many corner > > cases. E.g. what if the mere reservation attempt causes the system > > to go OOM and trigger the OOM killer? > > Doctor, doctor, it hurts when I do that.... > > So don't trigger the OOM killer. We can let the caller decide whether > the reservation request should block or return ENOMEM, but the whole > point of the reservation request idea is that this happens *before* > we've taken any mutexes, so blocking won't prevent forward progress. Maybe I wasn't clear. I wasn't concerned about the context which is doing the reservation. I was more concerned about all the other allocation requests which might fail now (because they do not have access to the reserves). So you think that we should simply disable the OOM killer while there is any reservation active? Wouldn't that be even more fragile when something goes terribly wrong? > The file system could send down a different flag if we are doing > writebacks for page cleaning purposes, in which case the reservation > request would be a "just a heads up, we *will* be needing this much > memory, but this is not something where we can block or return ENOMEM, > so please give us the highest priority for using the free reserves". Sure, that much is clear. > > I think the idea is good! It will just be quite tricky to get there > > without causing more problems than those being solved. The biggest > > question mark so far seems to be the reservation size estimation.
> > If it is hard for any caller to know the size beforehand (which would > > be really close to the actually used size) then the whole complexity > > in the code sounds like an overkill and asking administrator to tune > > min_free_kbytes seems a better fit (we would still have to teach the > > allocator to access reserves when really necessary) because the system > > would behave more predictably (although some memory would be wasted). > > If we do need to teach the allocator to access reserves when really > necessary, don't we have that already via GFP_NOIO/GFP_NOFS and > GFP_NOFAIL? GFP_NOFAIL doesn't sound like the best fit. Not all NOFAIL callers need to access reserves - e.g. if they are not blocking anybody from making progress. > If the goal is to do something more fine-grained, > unfortunately at least for the short-term we'll need to preserve the > existing behaviour and issue warnings until the file system starts > adding GFP_NOFAIL to those memory allocations where previously, > GFP_NOFS was effectively guaranteeing that failures would almost > never happen. GFP_NOFS not failing is even worse than GFP_KERNEL not failing, because the former has only very limited ways to perform reclaim. It basically relies on somebody else to make progress, and that is definitely a bad model. > I know at least one place discovered with recent change (and revert) > where I'll be fixing ext4, but I suspect it won't be the only one, > especially in the block device drivers. > > - Ted -- Michal Hocko SUSE Labs
* Re: How to handle TIF_MEMDIE stalls? 2015-03-02 16:58 ` Michal Hocko @ 2015-03-04 12:52 ` Dave Chinner 0 siblings, 0 replies; 83+ messages in thread From: Dave Chinner @ 2015-03-04 12:52 UTC (permalink / raw) To: Michal Hocko Cc: Theodore Ts'o, Tetsuo Handa, dchinner, oleg, xfs, Johannes Weiner, linux-mm, mgorman, rientjes, akpm, torvalds On Mon, Mar 02, 2015 at 05:58:23PM +0100, Michal Hocko wrote: > On Mon 02-03-15 11:39:13, Theodore Ts'o wrote: > > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote: > > > The idea is sound. But I am pretty sure we will find many corner > > > cases. E.g. what if the mere reservation attempt causes the system > > > to go OOM and trigger the OOM killer? > > > > Doctor, doctor, it hurts when I do that.... > > > > So don't trigger the OOM killer. We can let the caller decide whether > > the reservation request should block or return ENOMEM, but the whole > > point of the reservation request idea is that this happens *before* > > we've taken any mutexes, so blocking won't prevent forward progress. > > Maybe I wasn't clear. I wasn't concerned about the context which > is doing the reservation. I was more concerned about all the other > allocation requests which might fail now (because they do not have > access to the reserves). So you think that we should simply disable OOM > killer while there is any reservation active? Wouldn't that be even more > fragile when something goes terribly wrong? That's a silly strawman. Why wouldn't you simply block them until the reserves are released when the transaction completes and the unused memory goes back to the free pool? Let me try another tack. My qualifications are as a distributed control system engineer, not a computer scientist. I see everything as a system of interconnected feedback loops: an operating system is nothing but a set of very complex, tightly interconnected control systems. Precedent?
IO-less dirty throttling - that came about after I'd been advocating a control theory based algorithm for several years to solve the breakdown problems of dirty page throttling. We look at the code Fenguang Wu wrote as one of the major success stories of Linux - the writeback code just works and nobody ever has to tune it anymore. I see the problem of direct memory reclaim as being very similar to the problems the old IO based write throttling had: it has unbound concurrency, severe unfairness and breaks down badly when heavily loaded. As a control system, it has the same terrible design as the IO-based write throttling had. There are many other similarities, too. Allocation can only take place at the rate at which reclaim occurs, and we only have a limited budget of allocatable pages. This is the same as dirty page throttling - dirtying pages is limited to the rate we can clean pages, and there is a limited budget of dirty pages in the system. Reclaiming pages is also done most efficiently by a single thread per zone where lots of internal context can be kept (kswapd). This is similar to how optimal writeback of dirty pages requires a single thread with internal context per block device. Waiting for free pages to arrive can be done by an ordered queuing system, and we can account for the number of pages each allocation requires in the queueing system and hence only need to wake the number of waiters that will consume the memory just freed. Just like we do with the dirty page throttling queue. As such, the same solutions could be applied. As the allocation demand exceeds the supply of free pages, we throttle allocation by sleeping on an ordered queue and only waking waiters at the rate at which kswapd reclaim can free pages. It's trivial to account accurately, and the feedback loop is relatively simple, too. We can also easily maintain a reserve of free pages this way, usable only by allocations marked with special flags.
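[Editor's note: the ordered queuing scheme Dave sketches - waiters queue with their page demand, and each batch of freed pages wakes only as many waiters, in arrival order, as it can fully satisfy - can be modelled in a few lines. Illustrative userspace code; none of these names are kernel API.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QMAX 16

/* One task waiting for an allocation of 'want' pages. */
struct alloc_waiter { long want; bool woken; };

/* FIFO of allocation waiters plus the free-page budget. */
struct alloc_queue {
    struct alloc_waiter w[QMAX];
    size_t head, tail;      /* FIFO indices; head chases tail */
    long free_pages;
};

/* A task that cannot be satisfied immediately joins the queue. */
static bool queue_wait(struct alloc_queue *q, long want)
{
    if (q->tail - q->head >= QMAX)
        return false;
    q->w[q->tail % QMAX] = (struct alloc_waiter){ want, false };
    q->tail++;
    return true;
}

/* Called as reclaim frees pages: wake only the waiters whose
 * demand the freed pages fully cover, strictly in arrival order.
 * Returns how many waiters were woken. */
static long pages_freed(struct alloc_queue *q, long n)
{
    long woken = 0;

    q->free_pages += n;
    while (q->head < q->tail &&
           q->w[q->head % QMAX].want <= q->free_pages) {
        q->free_pages -= q->w[q->head % QMAX].want;
        q->w[q->head % QMAX].woken = true;
        q->head++;
        woken++;
    }
    return woken;
}
```

The FIFO ordering is what gives the fairness the current direct-reclaim free-for-all lacks, and the per-waiter accounting is what lets exactly the right number of tasks run after each reclaim batch.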
The reserve threshold can be dynamic, and tasks that request it to change can be blocked until the reserve has been built up to meet caller requirements. Allocations that are allowed to dip into the reserve may do so rather than being added to the queue that waits for reclaim. Reclaim would always fill the reserve back up to its limits first, and tasks that have reservations can release them gradually as they mark them as consumed by the reservation context (e.g. when a filesystem joins an object to a transaction and modifies it), thereby reducing the reserve that task has available as it progresses. So, there's yet another possible solution to the allocation reservation problem, and one that solves several other problems that are being described as making reservation pools difficult or even impossible to implement. Seriously, I'm not expecting this problem to be solved tomorrow; what I want is reliable, deterministic memory allocation behaviour from the mm subsystem. I want people to be thinking about how to achieve that rather than limiting their solutions by what we have now and can hack into the current code, because otherwise we'll never end up with a reliable memory allocation reservation system.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
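[Editor's note: the gradual-release idea above - reserve the worst case at transaction start, convert reserved pages to consumed as objects join the transaction, return the unused remainder at commit - can be sketched as follows. All names are hypothetical; this is loosely analogous to how XFS accounts log-space reservations, not actual kernel code.]

```c
#include <assert.h>

/* Global sum of all active reservations; stands in for the
 * allocator-side accounting. Purely illustrative. */
static long total_reserved;

struct tx_reserve { long remaining; };

/* Reserve the worst-case page count before taking any locks. */
static void tx_begin(struct tx_reserve *r, long worst_case)
{
    r->remaining = worst_case;
    total_reserved += worst_case;
}

/* An object joined the transaction and was modified: its pages are
 * now really allocated, so they leave both the transaction's and
 * the global reserve. */
static void tx_consume(struct tx_reserve *r, long pages)
{
    if (pages > r->remaining)
        pages = r->remaining;   /* never consume more than reserved */
    r->remaining -= pages;
    total_reserved -= pages;
}

/* Commit: whatever the worst case over-estimated goes back to the
 * common pool immediately. */
static void tx_commit(struct tx_reserve *r)
{
    total_reserved -= r->remaining;
    r->remaining = 0;
}
```

Because the estimate is a worst case, most transactions release a sizeable remainder at commit, which is why deferring the actual reclaim to allocation time (rather than pre-allocating) matters.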
end of thread, other threads:[~2015-03-07 15:08 UTC | newest]
Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
[not found] <20141230112158.GA15546@dhcp22.suse.cz>
[not found] ` <201502092044.JDG39081.LVFOOtFHQFOMSJ@I-love.SAKURA.ne.jp>
[not found] ` <201502102258.IFE09888.OVQFJOMSFtOLFH@I-love.SAKURA.ne.jp>
[not found] ` <20150210151934.GA11212@phnom.home.cmpxchg.org>
[not found] ` <201502111123.ICD65197.FMLOHSQJFVOtFO@I-love.SAKURA.ne.jp>
[not found] ` <201502172123.JIE35470.QOLMVOFJSHOFFt@I-love.SAKURA.ne.jp>
[not found] ` <20150217125315.GA14287@phnom.home.cmpxchg.org>
2015-02-17 22:54 ` How to handle TIF_MEMDIE stalls? Dave Chinner
2015-02-17 23:32 ` Dave Chinner
2015-02-18 8:25 ` Michal Hocko
2015-02-18 10:48 ` Dave Chinner
2015-02-18 12:16 ` Michal Hocko
2015-02-18 21:31 ` Dave Chinner
2015-02-19 9:40 ` Michal Hocko
2015-02-19 22:03 ` Dave Chinner
2015-02-20 9:27 ` Michal Hocko
2015-02-19 11:01 ` Johannes Weiner
2015-02-19 12:29 ` Michal Hocko
2015-02-19 12:58 ` Michal Hocko
2015-02-19 15:29 ` Tetsuo Handa
2015-02-19 21:53 ` Tetsuo Handa
2015-02-20 9:13 ` Michal Hocko
2015-02-20 13:37 ` Stefan Ring
2015-02-19 13:29 ` Tetsuo Handa
2015-02-20 9:10 ` Michal Hocko
2015-02-20 12:20 ` Tetsuo Handa
2015-02-20 12:38 ` Michal Hocko
2015-02-19 21:43 ` Dave Chinner
2015-02-20 12:48 ` Michal Hocko
2015-02-20 23:09 ` Dave Chinner
2015-02-19 10:24 ` Johannes Weiner
2015-02-19 22:52 ` Dave Chinner
2015-02-20 10:36 ` Tetsuo Handa
2015-02-20 23:15 ` Dave Chinner
2015-02-21 3:20 ` Theodore Ts'o
2015-02-21 9:19 ` Andrew Morton
2015-02-21 13:48 ` Tetsuo Handa
2015-02-21 21:38 ` Dave Chinner
2015-02-22 0:20 ` Johannes Weiner
2015-02-23 10:48 ` Michal Hocko
2015-02-23 11:23 ` Tetsuo Handa
2015-02-23 21:33 ` David Rientjes
2015-02-21 12:00 ` Tetsuo Handa
2015-02-23 10:26 ` Michal Hocko
2015-02-21 11:12 ` Tetsuo Handa
2015-02-21 21:48 ` Dave Chinner
2015-02-21 23:52 ` Johannes Weiner
2015-02-23 0:45 ` Dave Chinner
2015-02-23 1:29 ` Andrew Morton
2015-02-23 7:32 ` Dave Chinner
2015-02-27 18:24 ` Vlastimil Babka
2015-02-28 0:03 ` Dave Chinner
2015-02-28 15:17 ` Theodore Ts'o
2015-03-02 9:39 ` Vlastimil Babka
2015-03-02 22:31 ` Dave Chinner
2015-03-03 9:13 ` Vlastimil Babka
2015-03-04 1:33 ` Dave Chinner
2015-03-04 8:50 ` Vlastimil Babka
2015-03-04 11:03 ` Dave Chinner
2015-03-07 0:20 ` Johannes Weiner
2015-03-07 3:43 ` Dave Chinner
2015-03-07 15:08 ` Johannes Weiner
2015-03-02 20:22 ` Johannes Weiner
2015-03-02 23:12 ` Dave Chinner
2015-03-03 2:50 ` Johannes Weiner
2015-03-04 6:52 ` Dave Chinner
2015-03-04 15:04 ` Johannes Weiner
2015-03-04 17:38 ` Theodore Ts'o
2015-03-04 23:17 ` Dave Chinner
2015-02-28 16:29 ` Johannes Weiner
2015-02-28 16:41 ` Theodore Ts'o
2015-02-28 22:15 ` Johannes Weiner
2015-03-01 11:17 ` Tetsuo Handa
2015-03-06 11:53 ` Tetsuo Handa
2015-03-01 13:43 ` Theodore Ts'o
2015-03-01 16:15 ` Johannes Weiner
2015-03-01 19:36 ` Theodore Ts'o
2015-03-01 20:44 ` Johannes Weiner
2015-03-01 20:17 ` Johannes Weiner
2015-03-01 21:48 ` Dave Chinner
2015-03-02 0:17 ` Dave Chinner
2015-03-02 12:46 ` Brian Foster
2015-02-28 18:36 ` Vlastimil Babka
2015-03-02 15:18 ` Michal Hocko
2015-03-02 16:05 ` Johannes Weiner
2015-03-02 17:10 ` Michal Hocko
2015-03-02 17:27 ` Johannes Weiner
2015-03-02 16:39 ` Theodore Ts'o
2015-03-02 16:58 ` Michal Hocko
2015-03-04 12:52 ` Dave Chinner