* [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-25 13:33 UTC
To: lsf-pc; +Cc: linux-mm, linux-fsdevel

Hi,
I would like to propose the following topics (mainly for the MM track
but some of them might be of interest for FS people as well):

- gfp flags for allocation requests seem to be quite complicated and
  used arbitrarily by many subsystems. GFP_REPEAT is one such example.
  Half of the current usage is for low order allocation requests where
  it is basically ignored. Moreover, the documentation claims that such
  a request does _not_ retry endlessly, which is true only for costly
  high order allocations. I think we should get rid of most of the
  users of this flag (basically all low order ones) and then come up
  with something like GFP_BEST_EFFORT which would work for all orders
  consistently [1].

- GFP_NOFS is another one which would be good to discuss. Its primary
  use is to prevent reclaim recursion back into the FS. This makes such
  an allocation context weaker, and historically we haven't triggered
  the OOM killer but rather hopelessly retried the request, relying on
  somebody else to make progress for us. There are two issues here.
  First, we shouldn't retry endlessly but rather fail the allocation
  and allow the FS to handle the error. As per my experiments, most
  filesystems cope with that quite reasonably. Btrfs unfortunately
  handles many of those failures by BUG_ON.
  Another issue is that GFP_NOFS is quite often used without any
  obvious reason. It is not clear which lock is held and could be taken
  from the reclaim path. Wouldn't it be much better if the no-recursion
  behavior was bound to the lock scope rather than the particular
  allocation request? We already have something like this for PM:
  pm_res{trict,tore}_gfp_mask resp. memalloc_noio_{save,restore}. It
  would be great if we could unify this and use the context based NOFS
  in the FS.

- The OOM killer has been discussed a lot throughout this year. We
  discussed this topic last year at LSF and there has been quite some
  progress since then. We have async memory tear down for the OOM
  victim [2], which should help in many corner cases. We are still
  waiting to make mmap_sem for write killable, which would help in some
  other classes of corner cases. Whatever we do, however, will not work
  in 100% of cases. So the primary question is how far we are willing
  to go to support different corner cases. Do we want a global
  panic_after_timeout knob, or to allow multiple OOM victims after a
  timeout?

- sysrq+f to trigger the OOM killer follows some heuristics used by the
  OOM killer invoked by the system, which means that it is unreliable
  and it might end up killing no task at all without any explanation
  why. The semantics of the knob do not seem to be clear and it has
  even been suggested [3] to remove it altogether as a not particularly
  useful debugging aid. Is this really a general consensus?

- One of the long lasting issues related to OOM handling is when to
  actually declare OOM. There are workloads which might be thrashing on
  the few last remaining pagecache pages or on swap, which makes the
  system completely unusable for a considerable amount of time, yet the
  OOM killer is not invoked. Can we finally do something about that?
[1] http://lkml.kernel.org/r/1446740160-29094-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
[3] http://lkml.kernel.org/r/alpine.DEB.2.10.1601141347220.16227@chino.kir.corp.google.com

--
Michal Hocko
SUSE Labs
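To make the scope-based NOFS idea above concrete, here is a minimal
sketch of what such helpers could look like, modeled directly on the
existing memalloc_noio_{save,restore} pair; the memalloc_nofs_* names
and the PF_MEMALLOC_NOFS task flag are hypothetical here, not existing
kernel API at the time of this discussion:

/*
 * Hypothetical scope-based NOFS helpers, modeled on
 * memalloc_noio_{save,restore}, as they might live in
 * include/linux/sched.h.  PF_MEMALLOC_NOFS is an assumed task flag.
 */
static inline unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

	current->flags |= PF_MEMALLOC_NOFS;
	return flags;
}

static inline void memalloc_nofs_restore(unsigned int flags)
{
	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
}

/*
 * The allocator and reclaim entry points would then mask __GFP_FS off
 * any allocation performed inside such a scope:
 */
static inline gfp_t current_gfp_context(gfp_t flags)
{
	if (unlikely(current->flags & PF_MEMALLOC_NOFS))
		flags &= ~__GFP_FS;
	return flags;
}

A filesystem would then bracket a transaction or other lock-holding
section with memalloc_nofs_save()/memalloc_nofs_restore() and could use
plain GFP_KERNEL for the allocations inside it.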
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Jan Kara @ 2016-01-25 14:21 UTC
To: Michal Hocko; +Cc: lsf-pc, linux-fsdevel, linux-mm

Hi!

On Mon 25-01-16 14:33:57, Michal Hocko wrote:
> - GFP_NOFS is another one which would be good to discuss. Its primary
>   use is to prevent reclaim recursion back into the FS. This makes
>   such an allocation context weaker, and historically we haven't
>   triggered the OOM killer but rather hopelessly retried the request,
>   relying on somebody else to make progress for us. There are two
>   issues here.
>   First, we shouldn't retry endlessly but rather fail the allocation
>   and allow the FS to handle the error. As per my experiments, most
>   filesystems cope with that quite reasonably. Btrfs unfortunately
>   handles many of those failures by BUG_ON.
>   Another issue is that GFP_NOFS is quite often used without any
>   obvious reason. It is not clear which lock is held and could be
>   taken from the reclaim path. Wouldn't it be much better if the
>   no-recursion behavior was bound to the lock scope rather than the
>   particular allocation request? We already have something like this
>   for PM: pm_res{trict,tore}_gfp_mask resp.
>   memalloc_noio_{save,restore}. It would be great if we could unify
>   this and use the context based NOFS in the FS.

I like the idea that we'd protect lock scopes from reclaim recursion,
but the effort to do so would be IMHO rather big. E.g. there are ~75
instances of GFP_NOFS allocations in the ext4/jbd2 codebase, and making
sure all are properly covered will take quite some auditing... I'm not
saying we shouldn't do something like this, just that you will have to
be good at selling the benefits :).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-25 14:40 UTC
To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Mon 25-01-16 15:21:39, Jan Kara wrote:
> Hi!
>
> On Mon 25-01-16 14:33:57, Michal Hocko wrote:
[...]
> > Another issue is that GFP_NOFS is quite often used without any
> > obvious reason. It is not clear which lock is held and could be
> > taken from the reclaim path. Wouldn't it be much better if the
> > no-recursion behavior was bound to the lock scope rather than the
> > particular allocation request? We already have something like this
> > for PM: pm_res{trict,tore}_gfp_mask resp.
> > memalloc_noio_{save,restore}. It would be great if we could unify
> > this and use the context based NOFS in the FS.
>
> I like the idea that we'd protect lock scopes from reclaim recursion,
> but the effort to do so would be IMHO rather big. E.g. there are ~75
> instances of GFP_NOFS allocations in the ext4/jbd2 codebase, and
> making sure all are properly covered will take quite some auditing...
> I'm not saying we shouldn't do something like this, just that you will
> have to be good at selling the benefits :).

My idea was that the first step would be using the helpers to mark the
scopes; other usage of ~__GFP_FS inside such a scope could then be
identified much more easily (e.g. by a debugging WARN_ON or something
like that). That can be done in the longer term. Then I would hope to
reduce the GFP_NOFS usage coming from mapping_gfp_mask. I realize this
is a lot of work, but I believe it will pay off long term. And
especially the first step shouldn't be that hard, because locks used
from the reclaim path shouldn't be that hard to identify. GFP_NOFS is a
mess these days and it is far from trivial to tell whether it should be
used or not in some paths.
--
Michal Hocko
SUSE Labs
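A rough illustration of the debugging aid mentioned above, reusing the
hypothetical PF_MEMALLOC_NOFS scope flag from the earlier sketch
(warn_redundant_nofs is a made-up name, not an existing function):

/*
 * Sketch only: warn when a call site still passes an explicit
 * GFP_NOFS inside a scope already marked with the hypothetical
 * PF_MEMALLOC_NOFS flag, so redundant annotations can be hunted
 * down and removed.  Only reclaiming allocations are interesting
 * here; GFP_ATOMIC and friends lack __GFP_FS for other reasons.
 */
static inline void warn_redundant_nofs(gfp_t gfp_mask)
{
	WARN_ON_ONCE((current->flags & PF_MEMALLOC_NOFS) &&
		     (gfp_mask & __GFP_DIRECT_RECLAIM) &&
		     !(gfp_mask & __GFP_FS));
}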
* Re: [LSF/MM TOPIC] proposals for topics
From: Tetsuo Handa @ 2016-01-25 15:08 UTC
To: Michal Hocko, lsf-pc; +Cc: linux-mm, linux-fsdevel

Michal Hocko wrote:
> Another issue is that GFP_NOFS is quite often used without any obvious
> reason. It is not clear which lock is held and could be taken from the
> reclaim path. Wouldn't it be much better if the no-recursion behavior
> was bound to the lock scope rather than the particular allocation
> request? We already have something like this for PM:
> pm_res{trict,tore}_gfp_mask resp. memalloc_noio_{save,restore}. It
> would be great if we could unify this and use the context based NOFS
> in the FS.

Yes, I do want it. I think some of the LSM hooks are called from
GFP_NOFS context, but it is too difficult for me to tell whether we are
using GFP_NOFS correctly.

> First, we shouldn't retry endlessly but rather fail the allocation and
> allow the FS to handle the error. As per my experiments, most
> filesystems cope with that quite reasonably. Btrfs unfortunately
> handles many of those failures by BUG_ON.

If it turned out that we are using GFP_NOFS from LSM hooks correctly,
I'd expect such GFP_NOFS allocations to retry unless SIGKILL is
pending. Filesystems might be able to handle GFP_NOFS allocation
failures, but userspace might not be able to handle system call
failures caused by GFP_NOFS allocation failures; OOM-unkillable
processes might unexpectedly terminate as if they were OOM-killed.
Would you please add GFP_KILLABLE to the list of topics?

> - The OOM killer has been discussed a lot throughout this year. We
>   discussed this topic last year at LSF and there has been quite some
>   progress since then. We have async memory tear down for the OOM
>   victim [2], which should help in many corner cases. We are still
>   waiting to make mmap_sem for write killable, which would help in
>   some other classes of corner cases. Whatever we do, however, will
>   not work in 100% of cases. So the primary question is how far we are
>   willing to go to support different corner cases. Do we want a global
>   panic_after_timeout knob, or to allow multiple OOM victims after a
>   timeout?

A sequence for handling any corner case (as long as the OOM killer is
invoked) was proposed at
http://lkml.kernel.org/r/201601222259.GJB90663.MLOJtFFOQFVHSO@I-love.SAKURA.ne.jp .

> - sysrq+f to trigger the OOM killer follows some heuristics used by
>   the OOM killer invoked by the system, which means that it is
>   unreliable and it might end up killing no task at all without any
>   explanation why. The semantics of the knob do not seem to be clear
>   and it has even been suggested [3] to remove it altogether as a not
>   particularly useful debugging aid. Is this really a general
>   consensus?

Even if we remove SysRq-f from future kernels, please give us a fix for
current kernels. ;-)
* Re: [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-26 9:43 UTC
To: Tetsuo Handa; +Cc: lsf-pc, linux-mm, linux-fsdevel

On Tue 26-01-16 00:08:28, Tetsuo Handa wrote:
[...]
> If it turned out that we are using GFP_NOFS from LSM hooks correctly,
> I'd expect such GFP_NOFS allocations to retry unless SIGKILL is
> pending. Filesystems might be able to handle GFP_NOFS allocation
> failures, but userspace might not be able to handle system call
> failures caused by GFP_NOFS allocation failures; OOM-unkillable
> processes might unexpectedly terminate as if they were OOM-killed.
> Would you please add GFP_KILLABLE to the list of topics?

Are there so many places to justify a flag? Isn't it easier to check
for fatal_signal_pending in the failure path and retry otherwise? This
allows for a more flexible fallback strategy, e.g. drop the locks and
retry again, sleep for a reasonable time, wait for some event etc. This
sounds much more extensible than a single flag buried down in the
allocator path.

Besides that, all allocations other than __GFP_NOFAIL and GFP_NOFS are
already killable - the first one by definition and the latter because
of the current implementation restrictions, which we can hopefully fix
long term.
--
Michal Hocko
SUSE Labs
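The caller-side pattern being suggested here might look roughly like
the following sketch; struct foo, the function name, and the fallback
step are made-up placeholders, not code from any filesystem:

/*
 * Illustrative caller-side retry loop: keep retrying the allocation
 * unless the task has been killed, with room for a smarter fallback
 * (drop locks, kick writeback, wait for an event) in between.
 */
struct foo *alloc_foo_killable(void)
{
	struct foo *p;

	while (!(p = kmalloc(sizeof(*p), GFP_NOFS))) {
		if (fatal_signal_pending(current))
			return NULL;	/* caller maps this to -ENOMEM/-EINTR */
		/* simple-minded backoff; a real caller could do better */
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}
	return p;
}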
* Re: [LSF/MM TOPIC] proposals for topics
From: Tetsuo Handa @ 2016-01-27 13:44 UTC
To: mhocko; +Cc: lsf-pc, linux-mm, linux-fsdevel

Michal Hocko wrote:
> On Tue 26-01-16 00:08:28, Tetsuo Handa wrote:
> [...]
> > If it turned out that we are using GFP_NOFS from LSM hooks
> > correctly, I'd expect such GFP_NOFS allocations to retry unless
> > SIGKILL is pending. Filesystems might be able to handle GFP_NOFS
> > allocation failures, but userspace might not be able to handle
> > system call failures caused by GFP_NOFS allocation failures;
> > OOM-unkillable processes might unexpectedly terminate as if they
> > were OOM-killed. Would you please add GFP_KILLABLE to the list of
> > topics?
>
> Are there so many places to justify a flag? Isn't it easier to check
> for fatal_signal_pending in the failure path and retry otherwise?
> This allows for a more flexible fallback strategy, e.g. drop the locks
> and retry again, sleep for a reasonable time, wait for some event etc.
> This sounds much more extensible than a single flag buried down in the
> allocator path.

If you allow any in-kernel code to directly call out_of_memory(), I'm
OK with that.

I consider that whether to invoke the OOM killer should not be
determined based on the ability to reclaim memory; it should be
determined based on the importance and/or purpose of that memory
allocation request.

We allocate memory on behalf of userspace processes. If a userspace
process asks for a page via a page fault, we are using __GFP_FS. If
in-kernel code does something on behalf of a userspace process, we
should use __GFP_FS.

Forcing in-kernel code to use !__GFP_FS allocation requests is a hack
for working around inconvenient circumstances in memory allocation
(memory reclaim deadlock) which are not the fault of userspace
processes.

Userspace controls oom_score_adj and makes a bet between processes. If
process A wins, the OOM killer kills process B, and process A gets
memory. If process B wins, the OOM killer kills process A, and process
B gets memory. Not invoking the OOM killer due to lack of __GFP_FS is
something like forcing processes to use oom_kill_allocating_task = 1.

Therefore, since __GFP_KILLABLE does not exist and out_of_memory() is
not exported, I'll change my !__GFP_FS allocation requests to
__GFP_NOFAIL (in order to allow processes to make a bet) if mm people
change small !__GFP_FS allocation requests to fail upon OOM. Note that
there is no need to retry such __GFP_NOFAIL allocation requests if
SIGKILL is pending, but __GFP_NOFAIL does not allow failing upon
SIGKILL. __GFP_KILLABLE (with the current "no-fail unless chosen by the
OOM killer" behavior) would handle it perfectly.
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Jan Kara @ 2016-01-27 14:33 UTC
To: Tetsuo Handa; +Cc: mhocko, linux-fsdevel, linux-mm, lsf-pc

On Wed 27-01-16 22:44:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Are there so many places to justify a flag? Isn't it easier to check
> > for fatal_signal_pending in the failure path and retry otherwise?
> > This allows for a more flexible fallback strategy, e.g. drop the
> > locks and retry again, sleep for a reasonable time, wait for some
> > event etc. This sounds much more extensible than a single flag
> > buried down in the allocator path.
>
> If you allow any in-kernel code to directly call out_of_memory(), I'm
> OK with that.
>
> I consider that whether to invoke the OOM killer should not be
> determined based on the ability to reclaim memory; it should be
> determined based on the importance and/or purpose of that memory
> allocation request.

Well, in my opinion that's fairly difficult to judge at the site doing
the memory allocation. E.g. is it better to loop in the allocator to be
able to satisfy an allocation request needed to do IO, or is it better
to fail the IO with an error, or is it better to invoke the OOM killer
to free some memory and then do the IO? Who knows... This is a policy
decision, and as such it is better done by the administrator, and there
should be one common place to tune such things, not call sites spread
around the kernel...

> We allocate memory on behalf of userspace processes. If a userspace
> process asks for a page via a page fault, we are using __GFP_FS. If
> in-kernel code does something on behalf of a userspace process, we
> should use __GFP_FS.
>
> Forcing in-kernel code to use !__GFP_FS allocation requests is a hack
> for working around inconvenient circumstances in memory allocation
> (memory reclaim deadlock) which are not the fault of userspace
> processes.

It is as if you said that using GFP_ATOMIC allocations is a hack for
device drivers to allocate in atomic context. It is a reality of kernel
programming that you sometimes have to do allocations in a restricted
context. One kind of restricted context is one where you cannot recurse
back into the filesystem to free memory. I see nothing hacky in it.

> Userspace controls oom_score_adj and makes a bet between processes.
> [...]
>
> Therefore, since __GFP_KILLABLE does not exist and out_of_memory() is
> not exported, I'll change my !__GFP_FS allocation requests to
> __GFP_NOFAIL (in order to allow processes to make a bet) if mm people
> change small !__GFP_FS allocation requests to fail upon OOM. Note that
> there is no need to retry such __GFP_NOFAIL allocation requests if
> SIGKILL is pending, but __GFP_NOFAIL does not allow failing upon
> SIGKILL. __GFP_KILLABLE (with the current "no-fail unless chosen by
> the OOM killer" behavior) would handle it perfectly.

So the GFP_KILLABLE with GFP_NOFAIL combination actually makes sense to
me. Although most of the places I'm aware of which need GFP_NOFAIL
wouldn't use GFP_KILLABLE either - they are places where we have two
options:

1) lose user data without a way to tell that back to the user
2) allocate more memory

And from these two options, looping trying option 2) and hoping that
someone will solve the problem for us is the best we can do.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
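For reference, the semantics being discussed could boil down to a check
like the following in the allocator's retry loop; __GFP_KILLABLE is
hypothetical, the helper name is made up, and the surrounding slowpath
context is omitted:

/*
 * Hypothetical __GFP_KILLABLE handling: behave like __GFP_NOFAIL
 * (keep retrying, allow the OOM killer to be invoked) unless the
 * allocating task itself has already received a fatal signal, in
 * which case the allocation is allowed to fail.
 */
static bool should_fail_killable(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_KILLABLE) &&
	       fatal_signal_pending(current);
}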
* Re: [LSF/MM TOPIC] proposals for topics
From: Johannes Weiner @ 2016-01-25 18:45 UTC
To: Michal Hocko; +Cc: lsf-pc, linux-mm, linux-fsdevel

Hi Michal,

On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
> Hi,
> I would like to propose the following topics (mainly for the MM track
> but some of them might be of interest for FS people as well):
>
> - gfp flags for allocation requests seem to be quite complicated and
>   used arbitrarily by many subsystems. GFP_REPEAT is one such example.
>   Half of the current usage is for low order allocation requests where
>   it is basically ignored. Moreover, the documentation claims that
>   such a request does _not_ retry endlessly, which is true only for
>   costly high order allocations. I think we should get rid of most of
>   the users of this flag (basically all low order ones) and then come
>   up with something like GFP_BEST_EFFORT which would work for all
>   orders consistently [1].

I think nobody would mind a patch that just cleans this stuff up. Do
you expect controversy there?

> - GFP_NOFS is another one which would be good to discuss. Its primary
>   use is to prevent reclaim recursion back into the FS. This makes
>   such an allocation context weaker, and historically we haven't
>   triggered the OOM killer but rather hopelessly retried the request,
>   relying on somebody else to make progress for us. There are two
>   issues here.
>   First, we shouldn't retry endlessly but rather fail the allocation
>   and allow the FS to handle the error. As per my experiments, most
>   filesystems cope with that quite reasonably. Btrfs unfortunately
>   handles many of those failures by BUG_ON.

Are there any new datapoints on how to deal with failing allocations?
IIRC the conclusion last time was that some filesystems simply can't
support this without a reservation system - which I don't believe
anybody is working on. Does it make sense to rehash this when nothing
really changed since last time?

> - The OOM killer has been discussed a lot throughout this year. We
>   discussed this topic last year at LSF and there has been quite some
>   progress since then. We have async memory tear down for the OOM
>   victim [2], which should help in many corner cases. We are still
>   waiting to make mmap_sem for write killable, which would help in
>   some other classes of corner cases. Whatever we do, however, will
>   not work in 100% of cases. So the primary question is how far we are
>   willing to go to support different corner cases. Do we want a global
>   panic_after_timeout knob, or to allow multiple OOM victims after a
>   timeout?

Yes, that sounds like a good topic to cover. I'm honestly surprised
that there is so much resistance to trying to make the OOM killer
deterministic, and patches that try to fix that are resisted while the
thing can still lock up quietly.

It would be good to take a step back and consider our priorities there,
think about what the ultimate goal of the OOM killer is, and then how
to make it operate smoothly without compromising that goal - not the
other way round.

> - sysrq+f to trigger the OOM killer follows some heuristics used by
>   the OOM killer invoked by the system, which means that it is
>   unreliable and it might end up killing no task at all without any
>   explanation why. The semantics of the knob do not seem to be clear
>   and it has even been suggested [3] to remove it altogether as a not
>   particularly useful debugging aid. Is this really a general
>   consensus?

I think it's an okay debugging aid, but I worry about it coming up so
much in discussions about how the OOM killer should behave. We should
never *require* manual intervention to put a machine back into a known
state after it ran out of memory.

> - One of the long lasting issues related to OOM handling is when to
>   actually declare OOM. There are workloads which might be thrashing
>   on the few last remaining pagecache pages or on swap, which makes
>   the system completely unusable for a considerable amount of time,
>   yet the OOM killer is not invoked. Can we finally do something about
>   that?

I'm working on this, but it's not an easy situation to detect.

We can't decide based on the amount of page cache, as you could have
very little of it and still be fine. Most of it could still be
used-once.

We can't decide based on the number or rate of (re)faults, because this
spikes during startup and workingset changes, or can even be sustained
when working with a data set that you'd never expect to fit into memory
in the first place, while still making acceptable progress.

The only thing that I could come up with as a meaningful metric here is
the share of actual walltime that is spent waiting on refetching stuff
from disk. If we know that in the last X seconds the whole system spent
more than, say, 95% of its time waiting on the disk to read recently
evicted data back into the cache, then it's time to kick the OOM
killer, as this state is likely not worth maintaining.

Such a "thrashing time" metric would be great to export to userspace in
general, as it can be useful in other situations, such as quickly
gauging how comfortable a workload is (inside a container) and how much
time is wasted due to underprovisioning of memory. Because it isn't
just the pathological cases: you might just wait a bit here and there,
and it could still add up to a sizable portion of a job's time.

If other people think this could be a useful thing to talk about, I'd
be happy to discuss it at the conference.
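One way to picture the metric Johannes describes: accumulate the time
tasks spend blocked on refaults (reads of recently evicted,
workingset-tracked pages) and compare it against wall time over a
sampling window. Everything below is a made-up sketch, not existing
kernel code:

/*
 * Made-up sketch of a "thrashing time" metric: refault_wait_ns would
 * be accumulated wherever a task blocks waiting on I/O for a page
 * that the workingset code recognizes as recently evicted.
 */
struct thrash_stats {
	u64 refault_wait_ns;	/* wall time blocked on refault I/O */
	u64 window_start_ns;	/* start of the sampling window */
};

static bool thrashing_excessive(struct thrash_stats *ts, u64 now_ns,
				unsigned int threshold_pct)
{
	u64 elapsed = now_ns - ts->window_start_ns;

	if (!elapsed)
		return false;
	/* e.g. threshold_pct == 95: >95% of the window spent refaulting */
	return ts->refault_wait_ns * 100 > elapsed * threshold_pct;
}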
* Re: [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-26 9:50 UTC
To: Johannes Weiner; +Cc: lsf-pc, linux-mm, linux-fsdevel

On Mon 25-01-16 13:45:59, Johannes Weiner wrote:
> Hi Michal,
>
> On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
> > Hi,
> > I would like to propose the following topics (mainly for the MM
> > track but some of them might be of interest for FS people as well):
> >
> > - gfp flags for allocation requests seem to be quite complicated and
> >   used arbitrarily by many subsystems. GFP_REPEAT is one such
> >   example. Half of the current usage is for low order allocation
> >   requests where it is basically ignored. Moreover, the
> >   documentation claims that such a request does _not_ retry
> >   endlessly, which is true only for costly high order allocations.
> >   I think we should get rid of most of the users of this flag
> >   (basically all low order ones) and then come up with something
> >   like GFP_BEST_EFFORT which would work for all orders consistently
> >   [1].
>
> I think nobody would mind a patch that just cleans this stuff up. Do
> you expect controversy there?

Well, I thought the same but the patches didn't get much traction. The
reason might be that people are too busy in general to look into
changes that are of no immediate benefit, so I thought that discussing
such a higher level topic at LSF might make sense. I really wish we
would rethink our current battery of GFP flags and try to come up with
something more consistent and ideally without the weight of the
historical tweaks.

> > - GFP_NOFS is another one which would be good to discuss. Its
> >   primary use is to prevent reclaim recursion back into the FS. This
> >   makes such an allocation context weaker, and historically we
> >   haven't triggered the OOM killer but rather hopelessly retried the
> >   request, relying on somebody else to make progress for us. There
> >   are two issues here.
> >   First, we shouldn't retry endlessly but rather fail the allocation
> >   and allow the FS to handle the error. As per my experiments, most
> >   filesystems cope with that quite reasonably. Btrfs unfortunately
> >   handles many of those failures by BUG_ON.
>
> Are there any new datapoints on how to deal with failing allocations?
> IIRC the conclusion last time was that some filesystems simply can't
> support this without a reservation system - which I don't believe
> anybody is working on. Does it make sense to rehash this when nothing
> really changed since last time?

There have been patches posted during the year to fortify those places
which cannot cope with allocation failures for ext[34], and testing has
shown that ext* resp. xfs are quite ready to see NOFS allocation
failures. It is merely Btrfs which is in the biggest trouble now, and
that is work in progress AFAIK. I am perfectly OK with discussing some
details with interested FS people during a BoF, for example.

> > - The OOM killer has been discussed a lot throughout this year. We
> >   discussed this topic last year at LSF and there has been quite
> >   some progress since then. We have async memory tear down for the
> >   OOM victim [2], which should help in many corner cases. We are
> >   still waiting to make mmap_sem for write killable, which would
> >   help in some other classes of corner cases. Whatever we do,
> >   however, will not work in 100% of cases. So the primary question
> >   is how far we are willing to go to support different corner cases.
> >   Do we want a global panic_after_timeout knob, or to allow multiple
> >   OOM victims after a timeout?
>
> Yes, that sounds like a good topic to cover. I'm honestly surprised
> that there is so much resistance to trying to make the OOM killer
> deterministic, and patches that try to fix that are resisted while the
> thing can still lock up quietly.

I guess the problem is what different parties see as the deterministic
behavior. Timeout based solutions suggested so far were either too
convoluted IMHO, not deterministic, or too simplistic to attract
general interest I guess.

> It would be good to take a step back and consider our priorities
> there, think about what the ultimate goal of the OOM killer is, and
> then how to make it operate smoothly without compromising that goal -
> not the other way round.

Agreed.

> > - sysrq+f to trigger the OOM killer follows some heuristics used by
> >   the OOM killer invoked by the system, which means that it is
> >   unreliable and it might end up killing no task at all without any
> >   explanation why. The semantics of the knob do not seem to be clear
> >   and it has even been suggested [3] to remove it altogether as a
> >   not particularly useful debugging aid. Is this really a general
> >   consensus?
>
> I think it's an okay debugging aid, but I worry about it coming up so
> much in discussions about how the OOM killer should behave. We should
> never *require* manual intervention to put a machine back into a known
> state after it ran out of memory.

My argument has been that this is more of an emergency brake when the
system cannot cope with the current load (not only after OOM) than a
debugging aid, but it seems that there is indeed no clear consensus on
this topic, so I think we should make it clear.

Thanks!
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] proposals for topics
From: Vlastimil Babka @ 2016-01-26 17:17 UTC
To: Michal Hocko, Johannes Weiner; +Cc: lsf-pc, linux-mm, linux-fsdevel

On 01/26/2016 10:50 AM, Michal Hocko wrote:
> On Mon 25-01-16 13:45:59, Johannes Weiner wrote:
>> Hi Michal,
>>
>> On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
>>> Hi,
>>> I would like to propose the following topics (mainly for the MM
>>> track but some of them might be of interest for FS people as well):
>>>
>>> - gfp flags for allocation requests seem to be quite complicated and
>>>   used arbitrarily by many subsystems. GFP_REPEAT is one such
>>>   example. [...]
>>
>> I think nobody would mind a patch that just cleans this stuff up. Do
>> you expect controversy there?
>
> Well, I thought the same but the patches didn't get much traction. The
> reason might be that people are too busy in general to look into
> changes that are of no immediate benefit, so I thought that discussing
> such a higher level topic at LSF might make sense. I really wish we
> would rethink our current battery of GFP flags and try to come up with
> something more consistent and ideally without the weight of the
> historical tweaks.

Agreed. An LSF discussion could help both with the traction and with
brainstorming a better defined/named set of flags for today's
__GFP_REPEAT, __GFP_NORETRY etc. So far it was just me and Michal on
the thread, and we share the same office...

>>> - GFP_NOFS is another one which would be good to discuss. Its
>>>   primary use is to prevent reclaim recursion back into the FS.
>>>   [...]
>>>   First, we shouldn't retry endlessly but rather fail the allocation
>>>   and allow the FS to handle the error. As per my experiments, most
>>>   filesystems cope with that quite reasonably. Btrfs unfortunately
>>>   handles many of those failures by BUG_ON.
>>
>> Are there any new datapoints on how to deal with failing allocations?
>> IIRC the conclusion last time was that some filesystems simply can't
>> support this without a reservation system - which I don't believe
>> anybody is working on. Does it make sense to rehash this when nothing
>> really changed since last time?
>
> There have been patches posted during the year to fortify those places
> which cannot cope with allocation failures for ext[34], and testing
> has shown that ext* resp. xfs are quite ready to see NOFS allocation
> failures.

Hmm, from last year I remember Dave Chinner saying there really are
some places that can't handle failure, period? That's why all the
discussions about reservations, and I would be surprised if all such
places were gone today? Which of course doesn't mean that there
couldn't be different NOFS places that can handle failures, which
however don't happen in the current implementation.

> It is merely Btrfs which is in the biggest trouble now, and that is
> work in progress AFAIK. I am perfectly OK with discussing some details
> with interested FS people during a BoF, for example.
>
>>> - The OOM killer has been discussed a lot throughout this year.
>>>   [...] So the primary question is how far we are willing to go to
>>>   support different corner cases. Do we want a global
>>>   panic_after_timeout knob, or to allow multiple OOM victims after a
>>>   timeout?
>>
>> Yes, that sounds like a good topic to cover. I'm honestly surprised
>> that there is so much resistance to trying to make the OOM killer
>> deterministic, and patches that try to fix that are resisted while
>> the thing can still lock up quietly.
>
> I guess the problem is what different parties see as the deterministic
> behavior. Timeout based solutions suggested so far were either too
> convoluted IMHO, not deterministic, or too simplistic to attract
> general interest I guess.

Yep, a good topic.

>> It would be good to take a step back and consider our priorities
>> there, think about what the ultimate goal of the OOM killer is, and
>> then how to make it operate smoothly without compromising that goal -
>> not the other way round.
>
> Agreed.
>
>>> - sysrq+f to trigger the OOM killer follows some heuristics used by
>>>   the OOM killer invoked by the system, which means that it is
>>>   unreliable and it might end up killing no task at all without any
>>>   explanation why. [...]
>>
>> I think it's an okay debugging aid, but I worry about it coming up so
>> much in discussions about how the OOM killer should behave. We should
>> never *require* manual intervention to put a machine back into a
>> known state after it ran out of memory.
>
> My argument has been that this is more of an emergency brake when the
> system cannot cope with the current load (not only after OOM) than a
> debugging aid, but it seems that there is indeed no clear consensus on
> this topic, so I think we should make it clear.

Right.

> Thanks!
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Jan Kara @ 2016-01-26 17:20 UTC
To: Vlastimil Babka; Cc: Michal Hocko, Johannes Weiner, linux-fsdevel, linux-mm, lsf-pc

On Tue 26-01-16 18:17:01, Vlastimil Babka wrote:
> >>> - GFP_NOFS is another one which would be good to discuss. Its
> >>>   primary use is to prevent reclaim recursion back into the FS.
> >>>   [...]
> >>>   First, we shouldn't retry endlessly but rather fail the
> >>>   allocation and allow the FS to handle the error. As per my
> >>>   experiments, most filesystems cope with that quite reasonably.
> >>>   Btrfs unfortunately handles many of those failures by BUG_ON.
> >>
> >> Are there any new datapoints on how to deal with failing
> >> allocations? IIRC the conclusion last time was that some
> >> filesystems simply can't support this without a reservation system
> >> - which I don't believe anybody is working on. Does it make sense
> >> to rehash this when nothing really changed since last time?
> >
> > There have been patches posted during the year to fortify those
> > places which cannot cope with allocation failures for ext[34], and
> > testing has shown that ext* resp. xfs are quite ready to see NOFS
> > allocation failures.
>
> Hmm, from last year I remember Dave Chinner saying there really are
> some places that can't handle failure, period? That's why all the
> discussions about reservations, and I would be surprised if all such
> places were gone today? Which of course doesn't mean that there
> couldn't be different NOFS places that can handle failures, which
> however don't happen in the current implementation.

Well, but we have GFP_NOFAIL (or an open-coded equivalent thereof) in
there. So yes, there are GFP_NOFAIL | GFP_NOFS allocations and the
allocator must deal with them somehow.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-27 9:08 UTC
To: Jan Kara; Cc: Vlastimil Babka, Johannes Weiner, linux-fsdevel, linux-mm, lsf-pc

On Tue 26-01-16 18:20:51, Jan Kara wrote:
> On Tue 26-01-16 18:17:01, Vlastimil Babka wrote:
[...]
> > Hmm, from last year I remember Dave Chinner saying there really are
> > some places that can't handle failure, period? That's why all the
> > discussions about reservations, and I would be surprised if all such
> > places were gone today? Which of course doesn't mean that there
> > couldn't be different NOFS places that can handle failures, which
> > however don't happen in the current implementation.
>
> Well, but we have GFP_NOFAIL (or an open-coded equivalent thereof) in
> there. So yes, there are GFP_NOFAIL | GFP_NOFS allocations and the
> allocator must deal with them somehow.

Yes, the allocator deals with them in two ways: a) it allows them to
trigger the OOM killer and b) it gives them access to memory reserves.
So while the reservation system sounds like a more robust plan long
term, we do have a way forward right now, and we can already
distinguish requests that must not fail from those that have a
fallback.
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] proposals for topics
From: Dave Chinner @ 2016-01-28 20:55 UTC
To: Michal Hocko; +Cc: Johannes Weiner, lsf-pc, linux-mm, linux-fsdevel

On Tue, Jan 26, 2016 at 10:50:23AM +0100, Michal Hocko wrote:
> On Mon 25-01-16 13:45:59, Johannes Weiner wrote:
> > Hi Michal,
> >
> > On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
> > > - GFP_NOFS is another one which would be good to discuss. [...]
> > >   First, we shouldn't retry endlessly but rather fail the
> > >   allocation and allow the FS to handle the error. As per my
> > >   experiments, most filesystems cope with that quite reasonably.
> > >   Btrfs unfortunately handles many of those failures by BUG_ON.
> >
> > Are there any new datapoints on how to deal with failing
> > allocations? IIRC the conclusion last time was that some filesystems
> > simply can't support this without a reservation system - which I
> > don't believe anybody is working on. Does it make sense to rehash
> > this when nothing really changed since last time?
>
> There have been patches posted during the year to fortify those places
> which cannot cope with allocation failures for ext[34], and testing
> has shown that ext* resp. xfs are quite ready to see NOFS allocation
> failures.

The XFS situation is completely unchanged from last year, and the fact
that you say it handles NOFS allocation failures just fine makes me
seriously question your testing methodology.

In XFS, *any* memory allocation failure during a transaction will
either cause a panic through a null pointer dereference (because we
don't check for allocation failure in most cases) or a filesystem
shutdown (in the cases where we do check). If you haven't seen these
behaviours, then you haven't been failing memory allocations during
filesystem modifications.

We need to fundamentally change error handling in transactions in XFS
to allow arbitrary memory allocations to fail. That is, we need to
implement a full transaction rollback capability so we can back out
changes made during the transaction before the error occurred. That's a
major amount of work, and I'm probably not going to do anything on this
in the next year as it's low priority, because what we have now works.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-28 22:04 UTC
To: Dave Chinner; +Cc: Johannes Weiner, lsf-pc, linux-mm, linux-fsdevel

On Fri 29-01-16 07:55:25, Dave Chinner wrote:
> On Tue, Jan 26, 2016 at 10:50:23AM +0100, Michal Hocko wrote:
[...]
> > There have been patches posted during the year to fortify those
> > places which cannot cope with allocation failures for ext[34], and
> > testing has shown that ext* resp. xfs are quite ready to see NOFS
> > allocation failures.
>
> The XFS situation is completely unchanged from last year, and the fact
> that you say it handles NOFS allocation failures just fine makes me
> seriously question your testing methodology.

I am certainly open to suggestions there. My testing managed to
identify some weaker points in ext[34] which led to RO remounts.
__GFP_NOFAIL as the current band aid worked for them. I wasn't able to
hit this with xfs.

> In XFS, *any* memory allocation failure during a transaction will
> either cause a panic through a null pointer dereference (because we
> don't check for allocation failure in most cases) or a filesystem
> shutdown (in the cases where we do check). If you haven't seen these
> behaviours, then you haven't been failing memory allocations during
> filesystem modifications.
>
> We need to fundamentally change error handling in transactions in XFS
> to allow arbitrary memory allocations to fail. That is, we need to
> implement a full transaction rollback capability so we can back out
> changes made during the transaction before the error occurred. That's
> a major amount of work, and I'm probably not going to do anything on
> this in the next year as it's low priority, because what we have now
> works.

I am quite confused now. I remember you were the one who complained
about the silent nofail behavior of the allocator, because that means
you cannot implement an appropriate fallback strategy. Please also note
that I am talking solely about GFP_NOFS allocations here. The allocator
really cannot do much other than hopelessly retry and rely on somebody
_else_ to make forward progress.

That being said, I do understand that allowing GFP_NOFS allocations to
fail is not an easy task and nothing to be done tomorrow or in a few
months, but I believe that a discussion with FS people about what
can/should be done in order to make this happen is valuable.

Thanks!
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] proposals for topics
From: Dave Chinner @ 2016-01-31 23:29 UTC
To: Michal Hocko; +Cc: Johannes Weiner, lsf-pc, linux-mm, linux-fsdevel

On Thu, Jan 28, 2016 at 11:04:23PM +0100, Michal Hocko wrote:
> On Fri 29-01-16 07:55:25, Dave Chinner wrote:
> > On Tue, Jan 26, 2016 at 10:50:23AM +0100, Michal Hocko wrote:
> [...]
> > > There have been patches posted during the year to fortify those
> > > places which cannot cope with allocation failures for ext[34], and
> > > testing has shown that ext* resp. xfs are quite ready to see NOFS
> > > allocation failures.
> >
> > The XFS situation is completely unchanged from last year, and the
> > fact that you say it handles NOFS allocation failures just fine
> > makes me seriously question your testing methodology.
>
> I am certainly open to suggestions there. My testing managed to
> identify some weaker points in ext[34] which led to RO remounts.
> __GFP_NOFAIL as the current band aid worked for them. I wasn't able to
> hit this with xfs.

I'd suggest that you turn on error injection to fail memory
allocations. See Documentation/fault-injection/fault-injection.txt and
start failing random slab allocations whilst running a workload that
creates/unlinks lots of files.

> > We need to fundamentally change error handling in transactions in
> > XFS to allow arbitrary memory allocations to fail. [...]
>
> I am quite confused now. I remember you were the one who complained
> about the silent nofail behavior of the allocator, because that means
> you cannot implement an appropriate fallback strategy.

I complained about the fact that the allocator did not behave as
documented (or expected) in that it didn't fail allocations we expected
it to fail.

> Please also note that I am talking solely about GFP_NOFS allocations
> here. The allocator really cannot do much other than hopelessly retry
> and rely on somebody _else_ to make forward progress.

Well, yes, that's why XFS has, for many years, counted retry attempts
and emitted warnings when it is struggling to make allocation progress
(in any context). :)

> That being said, I do understand that allowing GFP_NOFS allocations to
> fail is not an easy task and nothing to be done tomorrow or in a few
> months, but I believe that a discussion with FS people about what
> can/should be done in order to make this happen is valuable.

The discussion - from my perspective - is likely to be no different
from previous years. None of the proposals that FS people have come up
with to address the "need memory allocation guarantees" issue have got
any traction on the mm side. Unless there's something fundamentally new
from the MM side that provides filesystems with a replacement for
__GFP_NOFAIL type behaviour, I don't think further discussion is going
to change the status quo.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] proposals for topics
From: Vlastimil Babka @ 2016-02-01 12:24 UTC
To: Dave Chinner, Michal Hocko; Cc: Johannes Weiner, lsf-pc, linux-mm, linux-fsdevel

On 02/01/2016 12:29 AM, Dave Chinner wrote:
> On Thu, Jan 28, 2016 at 11:04:23PM +0100, Michal Hocko wrote:
>> On Fri 29-01-16 07:55:25, Dave Chinner wrote:
>>> On Tue, Jan 26, 2016 at 10:50:23AM +0100, Michal Hocko wrote:
>> [...]
>>>> There have been patches posted during the year to fortify those
>>>> places which cannot cope with allocation failures for ext[34], and
>>>> testing has shown that ext* resp. xfs are quite ready to see NOFS
>>>> allocation failures.
>>>
>>> The XFS situation is completely unchanged from last year, and the
>>> fact that you say it handles NOFS allocation failures just fine
>>> makes me seriously question your testing methodology.
>>
>> I am quite confused now. I remember you were the one who complained
>> about the silent nofail behavior of the allocator, because that means
>> you cannot implement an appropriate fallback strategy.
>
> I complained about the fact that the allocator did not behave as
> documented (or expected) in that it didn't fail allocations we
> expected it to fail.

Yes, I believe this is exactly what Michal was talking about in the
original e-mail:

> - GFP_NOFS is another one which would be good to discuss. Its primary
>   use is to prevent reclaim recursion back into the FS. This makes
>   such an allocation context weaker, and historically we haven't
>   triggered the OOM killer but rather hopelessly retried the request,
>   relying on somebody else to make progress for us. There are two
>   issues here.
>   First, we shouldn't retry endlessly but rather fail the allocation
>   and allow the FS to handle the error. As per my experiments, most
>   filesystems cope with that quite reasonably. Btrfs unfortunately
>   handles many of those failures by BUG_ON.

So this should address your complaint above.

>> That being said, I do understand that allowing GFP_NOFS allocations
>> to fail is not an easy task and nothing to be done tomorrow or in a
>> few months, but I believe that a discussion with FS people about what
>> can/should be done in order to make this happen is valuable.
>
> The discussion - from my perspective - is likely to be no different
> from previous years. None of the proposals that FS people have come up
> with to address the "need memory allocation guarantees" issue have got
> any traction on the mm side. Unless there's something fundamentally
> new from the MM side that provides filesystems with a replacement for
> __GFP_NOFAIL type behaviour, I don't think further discussion is going
> to change the status quo.

Yeah, the guaranteed reserves as discussed last year didn't happen so
far. But that's a separate issue from GFP_NOFS *without* __GFP_NOFAIL.
It just got mixed up in this thread.
* Re: [LSF/MM TOPIC] proposals for topics
From: Vlastimil Babka @ 2016-01-26 17:07 UTC
To: Johannes Weiner, Michal Hocko; +Cc: lsf-pc, linux-mm, linux-fsdevel

On 01/25/2016 07:45 PM, Johannes Weiner wrote:
>> - One of the long lasting issues related to OOM handling is when to
>>   actually declare OOM. There are workloads which might be thrashing
>>   on the few last remaining pagecache pages or on swap, which makes
>>   the system completely unusable for a considerable amount of time,
>>   yet the OOM killer is not invoked. Can we finally do something
>>   about that?
>
> I'm working on this, but it's not an easy situation to detect.
>
> We can't decide based on the amount of page cache, as you could have
> very little of it and still be fine. Most of it could still be
> used-once.
>
> We can't decide based on the number or rate of (re)faults, because
> this spikes during startup and workingset changes, or can even be
> sustained when working with a data set that you'd never expect to fit
> into memory in the first place, while still making acceptable
> progress.

I would hope that workingset could help distinguish workloads thrashing
due to low memory from those that can't fit there no matter what? Or
would it require tracking the lifetime of so many evicted pages that
the memory overhead of that would be infeasible?

> The only thing that I could come up with as a meaningful metric here
> is the share of actual walltime that is spent waiting on refetching
> stuff from disk. If we know that in the last X seconds the whole
> system spent more than, say, 95% of its time waiting on the disk to
> read recently evicted data back into the cache, then it's time to kick
> the OOM killer, as this state is likely not worth maintaining.
>
> Such a "thrashing time" metric would be great to export to userspace
> in general, as it can be useful in other situations, such as quickly
> gauging how comfortable a workload is (inside a container) and how
> much time is wasted due to underprovisioning of memory. Because it
> isn't just the pathological cases: you might just wait a bit here and
> there, and it could still add up to a sizable portion of a job's time.
>
> If other people think this could be a useful thing to talk about, I'd
> be happy to discuss it at the conference.

I think this discussion would be useful, yeah.
* Re: [LSF/MM TOPIC] proposals for topics
  2016-01-26 17:07 ` Vlastimil Babka
@ 2016-01-26 18:09 ` Johannes Weiner
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2016-01-26 18:09 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Michal Hocko, lsf-pc, linux-mm, linux-fsdevel

On Tue, Jan 26, 2016 at 06:07:52PM +0100, Vlastimil Babka wrote:
> On 01/25/2016 07:45 PM, Johannes Weiner wrote:
>>> - One of the long-lasting issues related to OOM handling is when to actually declare OOM. There are workloads which might be thrashing on the few last remaining pagecache pages or on swap, which makes the system completely unusable for a considerable amount of time, yet the OOM killer is not invoked. Can we finally do something about that?
>>
>> I'm working on this, but it's not an easy situation to detect.
>>
>> We can't decide based on the amount of page cache, as you could have very little of it and still be fine. Most of it could still be used-once.
>>
>> We can't decide based on the number or rate of (re)faults, because this spikes during startup and workingset changes, or can even be sustained when working with a data set that you'd never expect to fit into memory in the first place, while still making acceptable progress.
>
> I would hope that workingset should help distinguish workloads thrashing due to low memory from those that can't fit there no matter what? Or would it require tracking the lifetime of so many evicted pages that the memory overhead of that would be infeasible?

Yes, using the workingset code is exactly my plan. The only thing it requires on top is a time component. Then we can kick the OOM killer based on the share of time a workload (the system?) spends thrashing.
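A back-of-the-envelope sketch of such a time component, showing the shape of the idea rather than any actual implementation. Every identifier here (thrash_state, thrash_account(), thrash_should_oom(), the window length and threshold) is hypothetical; nothing like this existed in the kernel at the time of this thread.

#include <linux/ktime.h>

/* Hypothetical per-workload (or system-wide) bookkeeping. */
struct thrash_state {
	u64 window_start_ns;	/* start of the observation window */
	u64 thrashing_ns;	/* time spent waiting on refault I/O */
};

#define THRASH_WINDOW_NS	(10ULL * NSEC_PER_SEC)	/* "last X seconds" */
#define THRASH_OOM_PCT		95

/* Charge time spent blocked on reads that workingset flagged as refaults. */
static void thrash_account(struct thrash_state *ts, u64 wait_ns)
{
	ts->thrashing_ns += wait_ns;
}

/* Poll from reclaim: was >95% of the last window spent refaulting? */
static bool thrash_should_oom(struct thrash_state *ts, u64 now_ns)
{
	u64 window = now_ns - ts->window_start_ns;

	if (window < THRASH_WINDOW_NS)
		return false;

	if (ts->thrashing_ns * 100 > window * THRASH_OOM_PCT)
		return true;

	/* Healthy enough; start a new observation window. */
	ts->window_start_ns = now_ns;
	ts->thrashing_ns = 0;
	return false;
}

The same ratio, exported per container instead of fed to the OOM killer, would serve the userspace "how comfortable is this workload" gauge described above.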
* Re: [LSF/MM TOPIC] proposals for topics
  2016-01-25 18:45 ` Johannes Weiner
  2016-01-26  9:50 ` Michal Hocko
  2016-01-26 17:07 ` Vlastimil Babka
@ 2016-01-30 18:18 ` Greg Thelen
  2 siblings, 0 replies; 19+ messages in thread
From: Greg Thelen @ 2016-01-30 18:18 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Michal Hocko, lsf-pc, linux-mm@kvack.org, linux-fsdevel

On Mon, Jan 25, 2016 at 10:45 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hi Michal,
>
> On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
>> Hi,
>> I would like to propose the following topics (mainly for the MM track, but some of them might be of interest for FS people as well)
>> - gfp flags for allocation requests seem to be quite complicated and used arbitrarily by many subsystems. GFP_REPEAT is one such example. Half of the current usage is for low order allocation requests where it is basically ignored. Moreover, the documentation claims that such a request is _not_ retrying endlessly, which is true only for costly high order allocations. I think we should get rid of most of the users of this flag (basically all low order ones) and then come up with something like GFP_BEST_EFFORT which would work for all orders consistently [1]
>
> I think nobody would mind a patch that just cleans this stuff up. Do you expect controversy there?
>
>> - GFP_NOFS is another one which would be good to discuss. Its primary use is to prevent reclaim recursion back into the FS. This makes such an allocation context weaker, and historically we haven't triggered the OOM killer but rather hopelessly retried the request and relied on somebody else to make progress for us. There are two issues here.
>> First, we shouldn't retry endlessly but rather fail the allocation and allow the FS to handle the error. As per my experiments, most FSes cope with that quite reasonably. Btrfs unfortunately handles many of those failures by BUG_ON, which is really unfortunate.
>
> Are there any new datapoints on how to deal with failing allocations? IIRC the conclusion last time was that some filesystems simply can't support this without a reservation system - which I don't believe anybody is working on. Does it make sense to rehash this when nothing really changed since last time?
>
>> - OOM killer has been discussed a lot throughout this year. We discussed this topic last year at LSF and there has been quite some progress since then. We have async memory tear down for the OOM victim [2] which should help in many corner cases. We are still waiting to make mmap_sem for write killable, which would help in some other classes of corner cases. Whatever we do, however, will not work in 100% of cases. So the primary question is how far we are willing to go to support different corner cases. Do we want to have a panic_after_timeout global knob, or allow multiple OOM victims after a timeout?
>
> Yes, that sounds like a good topic to cover. I'm honestly surprised that there is so much resistance to trying to make the OOM killer deterministic, and that patches which try to fix that are resisted while the thing can still lock up quietly.
>
> It would be good to take a step back and consider our priorities there, think about what the ultimate goal of the OOM killer is, and then how to make it operate smoothly without compromising that goal - not the other way round.

A few thoughts on our current/future oom killer usage. We've been using the oom killer as an overcommit tie breaker.
Victim selection isn't always based on memory usage; instead, low priority jobs are the first victims. Thus a deterministic scoring system, independent of memory usage, has been useful - and one that's based on the memcg hierarchy. Because jobs are often defined at container boundaries, it's also expedient to oom kill all processes within a memcg. Killing processes isn't always enough to free memory, because tmpfs/hugetlbfs aren't direct oom victims. A combination of namespaces and kill-all-container-processes is promising, though, because dropping the last reference to a namespace can unmount its filesystems. But this doesn't help if refs to the filesystem exist outside of the namespace (e.g. fd's passed over unix sockets), so other ideas are floating around. (A sketch of one possible memcg-based victim selection loop follows this message.)

And thrash detection is also quite helpful to decide when oom killing is better than hammering reclaim for a really long time. Refaulting is one signal of when to oom kill, but another is that high priority tasks are only willing to spend X before oom killing a lower prio victim (sorry, X is vague because it hasn't been sorted out yet; it could be wallclock, cpu time, disk bandwidth, etc.).

>> - sysrq+f to trigger the oom killer follows some heuristics used by the OOM killer invoked by the system, which means that it is unreliable and it might decline to kill any task, without any explanation why. The semantics of the knob don't seem to be clear, and it has even been suggested [3] to remove it altogether as a not very useful debugging aid. Is this really a general consensus?
>
> I think it's an okay debugging aid, but I worry about it coming up so much in discussions about how the OOM killer should behave. We should never *require* manual intervention to put a machine back into a known state after it ran out of memory.
>
>> - One of the long-lasting issues related to OOM handling is when to actually declare OOM. There are workloads which might be thrashing on the few last remaining pagecache pages or on swap, which makes the system completely unusable for a considerable amount of time, yet the OOM killer is not invoked. Can we finally do something about that?
>
> I'm working on this, but it's not an easy situation to detect.
>
> We can't decide based on the amount of page cache, as you could have very little of it and still be fine. Most of it could still be used-once.
>
> We can't decide based on the number or rate of (re)faults, because this spikes during startup and workingset changes, or can even be sustained when working with a data set that you'd never expect to fit into memory in the first place, while still making acceptable progress.
>
> The only thing that I could come up with as a meaningful metric here is the share of actual walltime that is spent waiting on refetching stuff from disk. If we know that in the last X seconds, the whole system spent more than, say, 95% of its time waiting on the disk to read recently evicted data back into the cache, then it's time to kick the OOM killer, as this state is likely not worth maintaining.
>
> Such a "thrashing time" metric could be great to export to userspace in general, as it can be useful in other situations, such as quickly gauging how comfortable a workload is (inside a container) and how much time is wasted due to underprovisioning of memory. Because it isn't just the pathological cases: you might just wait a bit here and there, and it could still add up to a sizable portion of a job's time.
>
> If other people think this could be a useful thing to talk about, I'd be happy to discuss it at the conference.
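A minimal sketch of the deterministic, memcg-hierarchy-based victim selection described in the message above. The per-memcg oom_priority field is hypothetical (something userspace would set per job); for_each_mem_cgroup_tree() and page_counter_read() are modeled on existing kernel helpers, but this loop, and the css reference counting it glosses over, is illustration only.

/*
 * Pick a whole memcg as the OOM victim: lowest priority loses, and
 * memory usage only breaks ties.  A kill-all-tasks step would then be
 * applied to every process in the chosen group.
 */
static struct mem_cgroup *pick_victim_memcg(struct mem_cgroup *root)
{
	struct mem_cgroup *iter, *victim = NULL;

	for_each_mem_cgroup_tree(iter, root) {
		if (!victim ||
		    iter->oom_priority < victim->oom_priority ||
		    (iter->oom_priority == victim->oom_priority &&
		     page_counter_read(&iter->memory) >
		     page_counter_read(&victim->memory)))
			victim = iter;	/* css refcounting elided */
	}
	return victim;
}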
Thread overview: 19+ messages
2016-01-25 13:33 [LSF/MM TOPIC] proposals for topics Michal Hocko
2016-01-25 14:21 ` [Lsf-pc] " Jan Kara
2016-01-25 14:40   ` Michal Hocko
2016-01-25 15:08 ` Tetsuo Handa
2016-01-26  9:43   ` Michal Hocko
2016-01-27 13:44     ` Tetsuo Handa
2016-01-27 14:33       ` [Lsf-pc] " Jan Kara
2016-01-25 18:45 ` Johannes Weiner
2016-01-26  9:50   ` Michal Hocko
2016-01-26 17:17     ` Vlastimil Babka
2016-01-26 17:20     ` [Lsf-pc] " Jan Kara
2016-01-27  9:08       ` Michal Hocko
2016-01-28 20:55     ` Dave Chinner
2016-01-28 22:04       ` Michal Hocko
2016-01-31 23:29         ` Dave Chinner
2016-02-01 12:24           ` Vlastimil Babka
2016-01-26 17:07   ` Vlastimil Babka
2016-01-26 18:09     ` Johannes Weiner
2016-01-30 18:18   ` Greg Thelen