* [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-25 13:33 UTC
To: lsf-pc; +Cc: linux-mm, linux-fsdevel

Hi,
I would like to propose the following topics (mainly for the MM track
but some of them might be of interest for FS people as well):

- gfp flags for allocation requests seem to be quite complicated and
  used arbitrarily by many subsystems. GFP_REPEAT is one such example.
  Half of the current usage is for low order allocation requests where
  it is basically ignored. Moreover, the documentation claims that such
  a request does _not_ retry endlessly, which is true only for costly
  high order allocations. I think we should get rid of most of the
  users of this flag (basically all low order ones) and then come up
  with something like GFP_BEST_EFFORT which would work for all orders
  consistently [1].

- GFP_NOFS is another one which would be good to discuss. Its primary
  use is to prevent reclaim recursion back into the FS. This makes such
  an allocation context weaker, and historically we haven't triggered
  the OOM killer but rather hopelessly retried the request, relying on
  somebody else to make progress for us. There are two issues here.
  First, we shouldn't retry endlessly but rather fail the allocation
  and allow the FS to handle the error. As per my experiments, most
  filesystems cope with that quite reasonably. Btrfs unfortunately
  handles many of those failures by BUG_ON.
  Another issue is that GFP_NOFS is quite often used without any
  obvious reason. It is not clear which lock is held and could be taken
  from the reclaim path. Wouldn't it be much better if the no-recursion
  behavior was bound to the lock scope rather than the particular
  allocation request? We already have something like this for PM:
  pm_res{trict,tore}_gfp_mask resp. memalloc_noio_{save,restore}. It
  would be great if we could unify this and use the context based NOFS
  in the FS.

- The OOM killer has been discussed a lot throughout this year. We
  discussed this topic last year at LSF and there has been quite some
  progress since then. We have async memory tear down for the OOM
  victim [2], which should help in many corner cases. We are still
  waiting to make mmap_sem for write killable, which would help in some
  other classes of corner cases. Whatever we do, however, will not work
  in 100% of cases. So the primary question is how far we are willing
  to go to support different corner cases. Do we want a global
  panic_after_timeout knob, or to allow multiple OOM victims after a
  timeout?

- sysrq+f to trigger the OOM killer follows some heuristics used by the
  OOM killer invoked by the system, which means that it is unreliable
  and it might end up killing no task at all without any explanation
  why. The semantics of the knob do not seem to be clear and it has
  even been suggested [3] to remove it altogether as a not particularly
  useful debugging aid. Is this really a general consensus?

- One of the long lasting issues related to OOM handling is when to
  actually declare OOM. There are workloads which might be thrashing on
  the few last remaining pagecache pages or on swap, which makes the
  system completely unusable for a considerable amount of time, yet the
  OOM killer is not invoked. Can we finally do something about that?
[1] http://lkml.kernel.org/r/1446740160-29094-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
[3] http://lkml.kernel.org/r/alpine.DEB.2.10.1601141347220.16227@chino.kir.corp.google.com

--
Michal Hocko
SUSE Labs
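To make the scope-based NOFS idea above concrete, here is a minimal
sketch of what such helpers could look like, modeled directly on the
existing memalloc_noio_{save,restore} pair; the memalloc_nofs_* names
and the PF_MEMALLOC_NOFS task flag are hypothetical here, not existing
kernel API at the time of this discussion:

/*
 * Hypothetical scope-based NOFS helpers, modeled on
 * memalloc_noio_{save,restore}, as they might live in
 * include/linux/sched.h.  PF_MEMALLOC_NOFS is an assumed task flag.
 */
static inline unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

	current->flags |= PF_MEMALLOC_NOFS;
	return flags;
}

static inline void memalloc_nofs_restore(unsigned int flags)
{
	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
}

/*
 * The allocator and reclaim entry points would then mask __GFP_FS off
 * any allocation performed inside such a scope:
 */
static inline gfp_t current_gfp_context(gfp_t flags)
{
	if (unlikely(current->flags & PF_MEMALLOC_NOFS))
		flags &= ~__GFP_FS;
	return flags;
}

A filesystem would then bracket a transaction or other lock-holding
section with memalloc_nofs_save()/memalloc_nofs_restore() and could use
plain GFP_KERNEL for the allocations inside it.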
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Jan Kara @ 2016-01-25 14:21 UTC
To: Michal Hocko; +Cc: lsf-pc, linux-fsdevel, linux-mm

Hi!

On Mon 25-01-16 14:33:57, Michal Hocko wrote:
> - GFP_NOFS is another one which would be good to discuss. Its primary
>   use is to prevent reclaim recursion back into the FS. This makes
>   such an allocation context weaker, and historically we haven't
>   triggered the OOM killer but rather hopelessly retried the request,
>   relying on somebody else to make progress for us. There are two
>   issues here.
>   First, we shouldn't retry endlessly but rather fail the allocation
>   and allow the FS to handle the error. As per my experiments, most
>   filesystems cope with that quite reasonably. Btrfs unfortunately
>   handles many of those failures by BUG_ON.
>   Another issue is that GFP_NOFS is quite often used without any
>   obvious reason. It is not clear which lock is held and could be
>   taken from the reclaim path. Wouldn't it be much better if the
>   no-recursion behavior was bound to the lock scope rather than the
>   particular allocation request? We already have something like this
>   for PM: pm_res{trict,tore}_gfp_mask resp.
>   memalloc_noio_{save,restore}. It would be great if we could unify
>   this and use the context based NOFS in the FS.

I like the idea that we'd protect lock scopes from reclaim recursion,
but the effort to do so would be IMHO rather big. E.g. there are ~75
instances of GFP_NOFS allocations in the ext4/jbd2 codebase, and making
sure all are properly covered will take quite some auditing... I'm not
saying we shouldn't do something like this, just that you will have to
be good at selling the benefits :).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-25 14:40 UTC
To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Mon 25-01-16 15:21:39, Jan Kara wrote:
> Hi!
>
> On Mon 25-01-16 14:33:57, Michal Hocko wrote:
[...]
> > Another issue is that GFP_NOFS is quite often used without any
> > obvious reason. It is not clear which lock is held and could be
> > taken from the reclaim path. Wouldn't it be much better if the
> > no-recursion behavior was bound to the lock scope rather than the
> > particular allocation request? We already have something like this
> > for PM: pm_res{trict,tore}_gfp_mask resp.
> > memalloc_noio_{save,restore}. It would be great if we could unify
> > this and use the context based NOFS in the FS.
>
> I like the idea that we'd protect lock scopes from reclaim recursion,
> but the effort to do so would be IMHO rather big. E.g. there are ~75
> instances of GFP_NOFS allocations in the ext4/jbd2 codebase, and
> making sure all are properly covered will take quite some auditing...
> I'm not saying we shouldn't do something like this, just that you will
> have to be good at selling the benefits :).

My idea was that the first step would be using the helpers to mark the
scopes; other usage of ~__GFP_FS inside such a scope could then be
identified much more easily (e.g. by a debugging WARN_ON or something
like that). That can be done in the longer term. Then I would hope to
reduce the GFP_NOFS usage coming from mapping_gfp_mask. I realize this
is a lot of work, but I believe it will pay off long term. And
especially the first step shouldn't be that hard, because locks used
from the reclaim path shouldn't be that hard to identify. GFP_NOFS is a
mess these days and it is far from trivial to tell whether it should be
used or not in some paths.
--
Michal Hocko
SUSE Labs
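A rough illustration of the debugging aid mentioned above, reusing the
hypothetical PF_MEMALLOC_NOFS scope flag from the earlier sketch
(warn_redundant_nofs is a made-up name, not an existing function):

/*
 * Sketch only: warn when a call site still passes an explicit
 * GFP_NOFS inside a scope already marked with the hypothetical
 * PF_MEMALLOC_NOFS flag, so redundant annotations can be hunted
 * down and removed.  Only reclaiming allocations are interesting
 * here; GFP_ATOMIC and friends lack __GFP_FS for other reasons.
 */
static inline void warn_redundant_nofs(gfp_t gfp_mask)
{
	WARN_ON_ONCE((current->flags & PF_MEMALLOC_NOFS) &&
		     (gfp_mask & __GFP_DIRECT_RECLAIM) &&
		     !(gfp_mask & __GFP_FS));
}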
* Re: [LSF/MM TOPIC] proposals for topics
From: Tetsuo Handa @ 2016-01-25 15:08 UTC
To: Michal Hocko, lsf-pc; +Cc: linux-mm, linux-fsdevel

Michal Hocko wrote:
> Another issue is that GFP_NOFS is quite often used without any obvious
> reason. It is not clear which lock is held and could be taken from the
> reclaim path. Wouldn't it be much better if the no-recursion behavior
> was bound to the lock scope rather than the particular allocation
> request? We already have something like this for PM:
> pm_res{trict,tore}_gfp_mask resp. memalloc_noio_{save,restore}. It
> would be great if we could unify this and use the context based NOFS
> in the FS.

Yes, I do want it. I think some of the LSM hooks are called from
GFP_NOFS context, but it is too difficult for me to tell whether we are
using GFP_NOFS correctly.

> First, we shouldn't retry endlessly but rather fail the allocation and
> allow the FS to handle the error. As per my experiments, most
> filesystems cope with that quite reasonably. Btrfs unfortunately
> handles many of those failures by BUG_ON.

If it turned out that we are using GFP_NOFS from LSM hooks correctly,
I'd expect such GFP_NOFS allocations to retry unless SIGKILL is
pending. Filesystems might be able to handle GFP_NOFS allocation
failures, but userspace might not be able to handle system call
failures caused by GFP_NOFS allocation failures; OOM-unkillable
processes might unexpectedly terminate as if they were OOM-killed.
Would you please add GFP_KILLABLE to the list of topics?

> - The OOM killer has been discussed a lot throughout this year. We
>   discussed this topic last year at LSF and there has been quite some
>   progress since then. We have async memory tear down for the OOM
>   victim [2], which should help in many corner cases. We are still
>   waiting to make mmap_sem for write killable, which would help in
>   some other classes of corner cases. Whatever we do, however, will
>   not work in 100% of cases. So the primary question is how far we are
>   willing to go to support different corner cases. Do we want a global
>   panic_after_timeout knob, or to allow multiple OOM victims after a
>   timeout?

A sequence for handling any corner case (as long as the OOM killer is
invoked) was proposed at
http://lkml.kernel.org/r/201601222259.GJB90663.MLOJtFFOQFVHSO@I-love.SAKURA.ne.jp .

> - sysrq+f to trigger the OOM killer follows some heuristics used by
>   the OOM killer invoked by the system, which means that it is
>   unreliable and it might end up killing no task at all without any
>   explanation why. The semantics of the knob do not seem to be clear
>   and it has even been suggested [3] to remove it altogether as a not
>   particularly useful debugging aid. Is this really a general
>   consensus?

Even if we remove SysRq-f from future kernels, please give us a fix for
current kernels. ;-)
* Re: [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-26 9:43 UTC
To: Tetsuo Handa; +Cc: lsf-pc, linux-mm, linux-fsdevel

On Tue 26-01-16 00:08:28, Tetsuo Handa wrote:
[...]
> If it turned out that we are using GFP_NOFS from LSM hooks correctly,
> I'd expect such GFP_NOFS allocations to retry unless SIGKILL is
> pending. Filesystems might be able to handle GFP_NOFS allocation
> failures, but userspace might not be able to handle system call
> failures caused by GFP_NOFS allocation failures; OOM-unkillable
> processes might unexpectedly terminate as if they were OOM-killed.
> Would you please add GFP_KILLABLE to the list of topics?

Are there so many places to justify a flag? Isn't it easier to check
for fatal_signal_pending in the failure path and retry otherwise? This
allows for a more flexible fallback strategy, e.g. drop the locks and
retry again, sleep for a reasonable time, wait for some event etc. This
sounds much more extensible than a single flag buried down in the
allocator path.

Besides that, all allocations other than __GFP_NOFAIL and GFP_NOFS are
already killable - the first one by definition and the latter because
of the current implementation restrictions, which we can hopefully fix
long term.
--
Michal Hocko
SUSE Labs
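The caller-side pattern being suggested here might look roughly like
the following sketch; struct foo, the function name, and the fallback
step are made-up placeholders, not code from any filesystem:

/*
 * Illustrative caller-side retry loop: keep retrying the allocation
 * unless the task has been killed, with room for a smarter fallback
 * (drop locks, kick writeback, wait for an event) in between.
 */
struct foo *alloc_foo_killable(void)
{
	struct foo *p;

	while (!(p = kmalloc(sizeof(*p), GFP_NOFS))) {
		if (fatal_signal_pending(current))
			return NULL;	/* caller maps this to -ENOMEM/-EINTR */
		/* simple-minded backoff; a real caller could do better */
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}
	return p;
}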
* Re: [LSF/MM TOPIC] proposals for topics
From: Tetsuo Handa @ 2016-01-27 13:44 UTC
To: mhocko; +Cc: lsf-pc, linux-mm, linux-fsdevel

Michal Hocko wrote:
> On Tue 26-01-16 00:08:28, Tetsuo Handa wrote:
> [...]
> > If it turned out that we are using GFP_NOFS from LSM hooks
> > correctly, I'd expect such GFP_NOFS allocations to retry unless
> > SIGKILL is pending. Filesystems might be able to handle GFP_NOFS
> > allocation failures, but userspace might not be able to handle
> > system call failures caused by GFP_NOFS allocation failures;
> > OOM-unkillable processes might unexpectedly terminate as if they
> > were OOM-killed. Would you please add GFP_KILLABLE to the list of
> > topics?
>
> Are there so many places to justify a flag? Isn't it easier to check
> for fatal_signal_pending in the failure path and retry otherwise?
> This allows for a more flexible fallback strategy, e.g. drop the locks
> and retry again, sleep for a reasonable time, wait for some event etc.
> This sounds much more extensible than a single flag buried down in the
> allocator path.

If you allow any in-kernel code to directly call out_of_memory(), I'm
OK with that.

I consider that whether to invoke the OOM killer should not be
determined based on the ability to reclaim memory; it should be
determined based on the importance and/or purpose of that memory
allocation request.

We allocate memory on behalf of userspace processes. If a userspace
process asks for a page via a page fault, we are using __GFP_FS. If
in-kernel code does something on behalf of a userspace process, we
should use __GFP_FS.

Forcing in-kernel code to use !__GFP_FS allocation requests is a hack
for working around inconvenient circumstances in memory allocation
(memory reclaim deadlock) which are not the fault of userspace
processes.

Userspace controls oom_score_adj and makes a bet between processes. If
process A wins, the OOM killer kills process B, and process A gets
memory. If process B wins, the OOM killer kills process A, and process
B gets memory. Not invoking the OOM killer due to lack of __GFP_FS is
something like forcing processes to use oom_kill_allocating_task = 1.

Therefore, since __GFP_KILLABLE does not exist and out_of_memory() is
not exported, I'll change my !__GFP_FS allocation requests to
__GFP_NOFAIL (in order to allow processes to make a bet) if mm people
change small !__GFP_FS allocation requests to fail upon OOM. Note that
there is no need to retry such __GFP_NOFAIL allocation requests if
SIGKILL is pending, but __GFP_NOFAIL does not allow failing upon
SIGKILL. __GFP_KILLABLE (with the current "no-fail unless chosen by the
OOM killer" behavior) would handle it perfectly.
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Jan Kara @ 2016-01-27 14:33 UTC
To: Tetsuo Handa; +Cc: mhocko, linux-fsdevel, linux-mm, lsf-pc

On Wed 27-01-16 22:44:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Are there so many places to justify a flag? Isn't it easier to check
> > for fatal_signal_pending in the failure path and retry otherwise?
> > This allows for a more flexible fallback strategy, e.g. drop the
> > locks and retry again, sleep for a reasonable time, wait for some
> > event etc. This sounds much more extensible than a single flag
> > buried down in the allocator path.
>
> If you allow any in-kernel code to directly call out_of_memory(), I'm
> OK with that.
>
> I consider that whether to invoke the OOM killer should not be
> determined based on the ability to reclaim memory; it should be
> determined based on the importance and/or purpose of that memory
> allocation request.

Well, in my opinion that's fairly difficult to judge at the site doing
the memory allocation. E.g. is it better to loop in the allocator to be
able to satisfy an allocation request needed to do IO, or is it better
to fail the IO with an error, or is it better to invoke the OOM killer
to free some memory and then do the IO? Who knows... This is a policy
decision, and as such it is better done by the administrator, and there
should be one common place to tune such things, not call sites spread
around the kernel...

> We allocate memory on behalf of userspace processes. If a userspace
> process asks for a page via a page fault, we are using __GFP_FS. If
> in-kernel code does something on behalf of a userspace process, we
> should use __GFP_FS.
>
> Forcing in-kernel code to use !__GFP_FS allocation requests is a hack
> for working around inconvenient circumstances in memory allocation
> (memory reclaim deadlock) which are not the fault of userspace
> processes.

It is as if you said that using GFP_ATOMIC allocations is a hack for
device drivers to allocate in atomic context. It is a reality of kernel
programming that you sometimes have to do allocations in a restricted
context. One kind of restricted context is one where you cannot recurse
back into the filesystem to free memory. I see nothing hacky in it.

> Userspace controls oom_score_adj and makes a bet between processes.
> [...]
>
> Therefore, since __GFP_KILLABLE does not exist and out_of_memory() is
> not exported, I'll change my !__GFP_FS allocation requests to
> __GFP_NOFAIL (in order to allow processes to make a bet) if mm people
> change small !__GFP_FS allocation requests to fail upon OOM. Note that
> there is no need to retry such __GFP_NOFAIL allocation requests if
> SIGKILL is pending, but __GFP_NOFAIL does not allow failing upon
> SIGKILL. __GFP_KILLABLE (with the current "no-fail unless chosen by
> the OOM killer" behavior) would handle it perfectly.

So the GFP_KILLABLE with GFP_NOFAIL combination actually makes sense to
me. Although most of the places I'm aware of which need GFP_NOFAIL
wouldn't use GFP_KILLABLE either - they are places where we have two
options:

1) lose user data without a way to tell that back to the user
2) allocate more memory

And from these two options, looping trying option 2) and hoping that
someone will solve the problem for us is the best we can do.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
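For reference, the semantics being discussed could boil down to a check
like the following in the allocator's retry loop; __GFP_KILLABLE is
hypothetical, the helper name is made up, and the surrounding slowpath
context is omitted:

/*
 * Hypothetical __GFP_KILLABLE handling: behave like __GFP_NOFAIL
 * (keep retrying, allow the OOM killer to be invoked) unless the
 * allocating task itself has already received a fatal signal, in
 * which case the allocation is allowed to fail.
 */
static bool should_fail_killable(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_KILLABLE) &&
	       fatal_signal_pending(current);
}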
* Re: [LSF/MM TOPIC] proposals for topics
From: Johannes Weiner @ 2016-01-25 18:45 UTC
To: Michal Hocko; +Cc: lsf-pc, linux-mm, linux-fsdevel

Hi Michal,

On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
> Hi,
> I would like to propose the following topics (mainly for the MM track
> but some of them might be of interest for FS people as well):
>
> - gfp flags for allocation requests seem to be quite complicated and
>   used arbitrarily by many subsystems. GFP_REPEAT is one such example.
>   Half of the current usage is for low order allocation requests where
>   it is basically ignored. Moreover, the documentation claims that
>   such a request does _not_ retry endlessly, which is true only for
>   costly high order allocations. I think we should get rid of most of
>   the users of this flag (basically all low order ones) and then come
>   up with something like GFP_BEST_EFFORT which would work for all
>   orders consistently [1].

I think nobody would mind a patch that just cleans this stuff up. Do
you expect controversy there?

> - GFP_NOFS is another one which would be good to discuss. Its primary
>   use is to prevent reclaim recursion back into the FS. This makes
>   such an allocation context weaker, and historically we haven't
>   triggered the OOM killer but rather hopelessly retried the request,
>   relying on somebody else to make progress for us. There are two
>   issues here.
>   First, we shouldn't retry endlessly but rather fail the allocation
>   and allow the FS to handle the error. As per my experiments, most
>   filesystems cope with that quite reasonably. Btrfs unfortunately
>   handles many of those failures by BUG_ON.

Are there any new datapoints on how to deal with failing allocations?
IIRC the conclusion last time was that some filesystems simply can't
support this without a reservation system - which I don't believe
anybody is working on. Does it make sense to rehash this when nothing
really changed since last time?

> - The OOM killer has been discussed a lot throughout this year. We
>   discussed this topic last year at LSF and there has been quite some
>   progress since then. We have async memory tear down for the OOM
>   victim [2], which should help in many corner cases. We are still
>   waiting to make mmap_sem for write killable, which would help in
>   some other classes of corner cases. Whatever we do, however, will
>   not work in 100% of cases. So the primary question is how far we are
>   willing to go to support different corner cases. Do we want a global
>   panic_after_timeout knob, or to allow multiple OOM victims after a
>   timeout?

Yes, that sounds like a good topic to cover. I'm honestly surprised
that there is so much resistance to trying to make the OOM killer
deterministic, and patches that try to fix that are resisted while the
thing can still lock up quietly.

It would be good to take a step back and consider our priorities there,
think about what the ultimate goal of the OOM killer is, and then how
to make it operate smoothly without compromising that goal - not the
other way round.

> - sysrq+f to trigger the OOM killer follows some heuristics used by
>   the OOM killer invoked by the system, which means that it is
>   unreliable and it might end up killing no task at all without any
>   explanation why. The semantics of the knob do not seem to be clear
>   and it has even been suggested [3] to remove it altogether as a not
>   particularly useful debugging aid. Is this really a general
>   consensus?

I think it's an okay debugging aid, but I worry about it coming up so
much in discussions about how the OOM killer should behave. We should
never *require* manual intervention to put a machine back into a known
state after it ran out of memory.

> - One of the long lasting issues related to OOM handling is when to
>   actually declare OOM. There are workloads which might be thrashing
>   on the few last remaining pagecache pages or on swap, which makes
>   the system completely unusable for a considerable amount of time,
>   yet the OOM killer is not invoked. Can we finally do something about
>   that?

I'm working on this, but it's not an easy situation to detect.

We can't decide based on the amount of page cache, as you could have
very little of it and still be fine. Most of it could still be
used-once.

We can't decide based on the number or rate of (re)faults, because this
spikes during startup and workingset changes, or can even be sustained
when working with a data set that you'd never expect to fit into memory
in the first place, while still making acceptable progress.

The only thing that I could come up with as a meaningful metric here is
the share of actual walltime that is spent waiting on refetching stuff
from disk. If we know that in the last X seconds the whole system spent
more than, say, 95% of its time waiting on the disk to read recently
evicted data back into the cache, then it's time to kick the OOM
killer, as this state is likely not worth maintaining.

Such a "thrashing time" metric would be great to export to userspace in
general, as it can be useful in other situations, such as quickly
gauging how comfortable a workload is (inside a container) and how much
time is wasted due to underprovisioning of memory. Because it isn't
just the pathological cases: you might just wait a bit here and there,
and it could still add up to a sizable portion of a job's time.

If other people think this could be a useful thing to talk about, I'd
be happy to discuss it at the conference.
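One way to picture the metric Johannes describes: accumulate the time
tasks spend blocked on refaults (reads of recently evicted,
workingset-tracked pages) and compare it against wall time over a
sampling window. Everything below is a made-up sketch, not existing
kernel code:

/*
 * Made-up sketch of a "thrashing time" metric: refault_wait_ns would
 * be accumulated wherever a task blocks waiting on I/O for a page
 * that the workingset code recognizes as recently evicted.
 */
struct thrash_stats {
	u64 refault_wait_ns;	/* wall time blocked on refault I/O */
	u64 window_start_ns;	/* start of the sampling window */
};

static bool thrashing_excessive(struct thrash_stats *ts, u64 now_ns,
				unsigned int threshold_pct)
{
	u64 elapsed = now_ns - ts->window_start_ns;

	if (!elapsed)
		return false;
	/* e.g. threshold_pct == 95: >95% of the window spent refaulting */
	return ts->refault_wait_ns * 100 > elapsed * threshold_pct;
}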
* Re: [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-26 9:50 UTC
To: Johannes Weiner; +Cc: lsf-pc, linux-mm, linux-fsdevel

On Mon 25-01-16 13:45:59, Johannes Weiner wrote:
> Hi Michal,
>
> On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
> > Hi,
> > I would like to propose the following topics (mainly for the MM
> > track but some of them might be of interest for FS people as well):
> >
> > - gfp flags for allocation requests seem to be quite complicated and
> >   used arbitrarily by many subsystems. GFP_REPEAT is one such
> >   example. Half of the current usage is for low order allocation
> >   requests where it is basically ignored. Moreover, the
> >   documentation claims that such a request does _not_ retry
> >   endlessly, which is true only for costly high order allocations.
> >   I think we should get rid of most of the users of this flag
> >   (basically all low order ones) and then come up with something
> >   like GFP_BEST_EFFORT which would work for all orders consistently
> >   [1].
>
> I think nobody would mind a patch that just cleans this stuff up. Do
> you expect controversy there?

Well, I thought the same but the patches didn't get much traction. The
reason might be that people are too busy in general to look into
changes that are of no immediate benefit, so I thought that discussing
such a higher level topic at LSF might make sense. I really wish we
would rethink our current battery of GFP flags and try to come up with
something more consistent and ideally without the weight of the
historical tweaks.

> > - GFP_NOFS is another one which would be good to discuss. Its
> >   primary use is to prevent reclaim recursion back into the FS. This
> >   makes such an allocation context weaker, and historically we
> >   haven't triggered the OOM killer but rather hopelessly retried the
> >   request, relying on somebody else to make progress for us. There
> >   are two issues here.
> >   First, we shouldn't retry endlessly but rather fail the allocation
> >   and allow the FS to handle the error. As per my experiments, most
> >   filesystems cope with that quite reasonably. Btrfs unfortunately
> >   handles many of those failures by BUG_ON.
>
> Are there any new datapoints on how to deal with failing allocations?
> IIRC the conclusion last time was that some filesystems simply can't
> support this without a reservation system - which I don't believe
> anybody is working on. Does it make sense to rehash this when nothing
> really changed since last time?

There have been patches posted during the year to fortify those places
which cannot cope with allocation failures for ext[34], and testing has
shown that ext* resp. xfs are quite ready to see NOFS allocation
failures. It is merely Btrfs which is in the biggest trouble now, and
that is work in progress AFAIK. I am perfectly OK with discussing some
details with interested FS people during a BoF, for example.

> > - The OOM killer has been discussed a lot throughout this year. We
> >   discussed this topic last year at LSF and there has been quite
> >   some progress since then. We have async memory tear down for the
> >   OOM victim [2], which should help in many corner cases. We are
> >   still waiting to make mmap_sem for write killable, which would
> >   help in some other classes of corner cases. Whatever we do,
> >   however, will not work in 100% of cases. So the primary question
> >   is how far we are willing to go to support different corner cases.
> >   Do we want a global panic_after_timeout knob, or to allow multiple
> >   OOM victims after a timeout?
>
> Yes, that sounds like a good topic to cover. I'm honestly surprised
> that there is so much resistance to trying to make the OOM killer
> deterministic, and patches that try to fix that are resisted while the
> thing can still lock up quietly.

I guess the problem is what different parties see as the deterministic
behavior. Timeout based solutions suggested so far were either too
convoluted IMHO, not deterministic, or too simplistic to attract
general interest I guess.

> It would be good to take a step back and consider our priorities
> there, think about what the ultimate goal of the OOM killer is, and
> then how to make it operate smoothly without compromising that goal -
> not the other way round.

Agreed.

> > - sysrq+f to trigger the OOM killer follows some heuristics used by
> >   the OOM killer invoked by the system, which means that it is
> >   unreliable and it might end up killing no task at all without any
> >   explanation why. The semantics of the knob do not seem to be clear
> >   and it has even been suggested [3] to remove it altogether as a
> >   not particularly useful debugging aid. Is this really a general
> >   consensus?
>
> I think it's an okay debugging aid, but I worry about it coming up so
> much in discussions about how the OOM killer should behave. We should
> never *require* manual intervention to put a machine back into a known
> state after it ran out of memory.

My argument has been that this is more of an emergency brake when the
system cannot cope with the current load (not only after OOM) than a
debugging aid, but it seems that there is indeed no clear consensus on
this topic, so I think we should make it clear.

Thanks!
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] proposals for topics
From: Vlastimil Babka @ 2016-01-26 17:17 UTC
To: Michal Hocko, Johannes Weiner; +Cc: lsf-pc, linux-mm, linux-fsdevel

On 01/26/2016 10:50 AM, Michal Hocko wrote:
> On Mon 25-01-16 13:45:59, Johannes Weiner wrote:
>> Hi Michal,
>>
>> On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
>>> Hi,
>>> I would like to propose the following topics (mainly for the MM
>>> track but some of them might be of interest for FS people as well):
>>>
>>> - gfp flags for allocation requests seem to be quite complicated and
>>>   used arbitrarily by many subsystems. GFP_REPEAT is one such
>>>   example. [...]
>>
>> I think nobody would mind a patch that just cleans this stuff up. Do
>> you expect controversy there?
>
> Well, I thought the same but the patches didn't get much traction. The
> reason might be that people are too busy in general to look into
> changes that are of no immediate benefit, so I thought that discussing
> such a higher level topic at LSF might make sense. I really wish we
> would rethink our current battery of GFP flags and try to come up with
> something more consistent and ideally without the weight of the
> historical tweaks.

Agreed. An LSF discussion could help both with the traction and with
brainstorming a better defined/named set of flags for today's
__GFP_REPEAT, __GFP_NORETRY etc. So far it was just me and Michal on
the thread, and we share the same office...

>>> - GFP_NOFS is another one which would be good to discuss. Its
>>>   primary use is to prevent reclaim recursion back into the FS.
>>>   [...]
>>>   First, we shouldn't retry endlessly but rather fail the allocation
>>>   and allow the FS to handle the error. As per my experiments, most
>>>   filesystems cope with that quite reasonably. Btrfs unfortunately
>>>   handles many of those failures by BUG_ON.
>>
>> Are there any new datapoints on how to deal with failing allocations?
>> IIRC the conclusion last time was that some filesystems simply can't
>> support this without a reservation system - which I don't believe
>> anybody is working on. Does it make sense to rehash this when nothing
>> really changed since last time?
>
> There have been patches posted during the year to fortify those places
> which cannot cope with allocation failures for ext[34], and testing
> has shown that ext* resp. xfs are quite ready to see NOFS allocation
> failures.

Hmm, from last year I remember Dave Chinner saying there really are
some places that can't handle failure, period? That's why all the
discussions about reservations, and I would be surprised if all such
places were gone today? Which of course doesn't mean that there
couldn't be different NOFS places that can handle failures, which
however don't happen in the current implementation.

> It is merely Btrfs which is in the biggest trouble now, and that is
> work in progress AFAIK. I am perfectly OK with discussing some details
> with interested FS people during a BoF, for example.
>
>>> - The OOM killer has been discussed a lot throughout this year.
>>>   [...] So the primary question is how far we are willing to go to
>>>   support different corner cases. Do we want a global
>>>   panic_after_timeout knob, or to allow multiple OOM victims after a
>>>   timeout?
>>
>> Yes, that sounds like a good topic to cover. I'm honestly surprised
>> that there is so much resistance to trying to make the OOM killer
>> deterministic, and patches that try to fix that are resisted while
>> the thing can still lock up quietly.
>
> I guess the problem is what different parties see as the deterministic
> behavior. Timeout based solutions suggested so far were either too
> convoluted IMHO, not deterministic, or too simplistic to attract
> general interest I guess.

Yep, a good topic.

>> It would be good to take a step back and consider our priorities
>> there, think about what the ultimate goal of the OOM killer is, and
>> then how to make it operate smoothly without compromising that goal -
>> not the other way round.
>
> Agreed.
>
>>> - sysrq+f to trigger the OOM killer follows some heuristics used by
>>>   the OOM killer invoked by the system, which means that it is
>>>   unreliable and it might end up killing no task at all without any
>>>   explanation why. [...]
>>
>> I think it's an okay debugging aid, but I worry about it coming up so
>> much in discussions about how the OOM killer should behave. We should
>> never *require* manual intervention to put a machine back into a
>> known state after it ran out of memory.
>
> My argument has been that this is more of an emergency brake when the
> system cannot cope with the current load (not only after OOM) than a
> debugging aid, but it seems that there is indeed no clear consensus on
> this topic, so I think we should make it clear.

Right.

> Thanks!
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Jan Kara @ 2016-01-26 17:20 UTC
To: Vlastimil Babka; Cc: Michal Hocko, Johannes Weiner, linux-fsdevel, linux-mm, lsf-pc

On Tue 26-01-16 18:17:01, Vlastimil Babka wrote:
> >>> - GFP_NOFS is another one which would be good to discuss. Its
> >>>   primary use is to prevent reclaim recursion back into the FS.
> >>>   [...]
> >>>   First, we shouldn't retry endlessly but rather fail the
> >>>   allocation and allow the FS to handle the error. As per my
> >>>   experiments, most filesystems cope with that quite reasonably.
> >>>   Btrfs unfortunately handles many of those failures by BUG_ON.
> >>
> >> Are there any new datapoints on how to deal with failing
> >> allocations? IIRC the conclusion last time was that some
> >> filesystems simply can't support this without a reservation system
> >> - which I don't believe anybody is working on. Does it make sense
> >> to rehash this when nothing really changed since last time?
> >
> > There have been patches posted during the year to fortify those
> > places which cannot cope with allocation failures for ext[34], and
> > testing has shown that ext* resp. xfs are quite ready to see NOFS
> > allocation failures.
>
> Hmm, from last year I remember Dave Chinner saying there really are
> some places that can't handle failure, period? That's why all the
> discussions about reservations, and I would be surprised if all such
> places were gone today? Which of course doesn't mean that there
> couldn't be different NOFS places that can handle failures, which
> however don't happen in the current implementation.

Well, but we have GFP_NOFAIL (or an open-coded equivalent thereof) in
there. So yes, there are GFP_NOFAIL | GFP_NOFS allocations and the
allocator must deal with them somehow.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-27 9:08 UTC
To: Jan Kara; Cc: Vlastimil Babka, Johannes Weiner, linux-fsdevel, linux-mm, lsf-pc

On Tue 26-01-16 18:20:51, Jan Kara wrote:
> On Tue 26-01-16 18:17:01, Vlastimil Babka wrote:
[...]
> > Hmm, from last year I remember Dave Chinner saying there really are
> > some places that can't handle failure, period? That's why all the
> > discussions about reservations, and I would be surprised if all such
> > places were gone today? Which of course doesn't mean that there
> > couldn't be different NOFS places that can handle failures, which
> > however don't happen in the current implementation.
>
> Well, but we have GFP_NOFAIL (or an open-coded equivalent thereof) in
> there. So yes, there are GFP_NOFAIL | GFP_NOFS allocations and the
> allocator must deal with them somehow.

Yes, the allocator deals with them in two ways: a) it allows them to
trigger the OOM killer and b) it gives them access to memory reserves.
So while the reservation system sounds like a more robust plan long
term, we do have a way forward right now, and we can already
distinguish requests that must not fail from those that have a
fallback.
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] proposals for topics
From: Dave Chinner @ 2016-01-28 20:55 UTC
To: Michal Hocko; +Cc: Johannes Weiner, lsf-pc, linux-mm, linux-fsdevel

On Tue, Jan 26, 2016 at 10:50:23AM +0100, Michal Hocko wrote:
> On Mon 25-01-16 13:45:59, Johannes Weiner wrote:
> > Hi Michal,
> >
> > On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
> > > - GFP_NOFS is another one which would be good to discuss. [...]
> > >   First, we shouldn't retry endlessly but rather fail the
> > >   allocation and allow the FS to handle the error. As per my
> > >   experiments, most filesystems cope with that quite reasonably.
> > >   Btrfs unfortunately handles many of those failures by BUG_ON.
> >
> > Are there any new datapoints on how to deal with failing
> > allocations? IIRC the conclusion last time was that some filesystems
> > simply can't support this without a reservation system - which I
> > don't believe anybody is working on. Does it make sense to rehash
> > this when nothing really changed since last time?
>
> There have been patches posted during the year to fortify those places
> which cannot cope with allocation failures for ext[34], and testing
> has shown that ext* resp. xfs are quite ready to see NOFS allocation
> failures.

The XFS situation is completely unchanged from last year, and the fact
that you say it handles NOFS allocation failures just fine makes me
seriously question your testing methodology.

In XFS, *any* memory allocation failure during a transaction will
either cause a panic through a null pointer dereference (because we
don't check for allocation failure in most cases) or a filesystem
shutdown (in the cases where we do check). If you haven't seen these
behaviours, then you haven't been failing memory allocations during
filesystem modifications.

We need to fundamentally change error handling in transactions in XFS
to allow arbitrary memory allocations to fail. That is, we need to
implement a full transaction rollback capability so we can back out
changes made during the transaction before the error occurred. That's a
major amount of work, and I'm probably not going to do anything on this
in the next year as it's low priority, because what we have now works.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] proposals for topics
From: Michal Hocko @ 2016-01-28 22:04 UTC
To: Dave Chinner; +Cc: Johannes Weiner, lsf-pc, linux-mm, linux-fsdevel

On Fri 29-01-16 07:55:25, Dave Chinner wrote:
> On Tue, Jan 26, 2016 at 10:50:23AM +0100, Michal Hocko wrote:
[...]
> > There have been patches posted during the year to fortify those
> > places which cannot cope with allocation failures for ext[34], and
> > testing has shown that ext* resp. xfs are quite ready to see NOFS
> > allocation failures.
>
> The XFS situation is completely unchanged from last year, and the fact
> that you say it handles NOFS allocation failures just fine makes me
> seriously question your testing methodology.

I am certainly open to suggestions there. My testing managed to
identify some weaker points in ext[34] which led to RO remounts.
__GFP_NOFAIL as the current band aid worked for them. I wasn't able to
hit this with xfs.

> In XFS, *any* memory allocation failure during a transaction will
> either cause a panic through a null pointer dereference (because we
> don't check for allocation failure in most cases) or a filesystem
> shutdown (in the cases where we do check). If you haven't seen these
> behaviours, then you haven't been failing memory allocations during
> filesystem modifications.
>
> We need to fundamentally change error handling in transactions in XFS
> to allow arbitrary memory allocations to fail. That is, we need to
> implement a full transaction rollback capability so we can back out
> changes made during the transaction before the error occurred. That's
> a major amount of work, and I'm probably not going to do anything on
> this in the next year as it's low priority, because what we have now
> works.

I am quite confused now. I remember you were the one who complained
about the silent nofail behavior of the allocator, because that means
you cannot implement an appropriate fallback strategy. Please also note
that I am talking solely about GFP_NOFS allocations here. The allocator
really cannot do much other than hopelessly retry and rely on somebody
_else_ to make forward progress.

That being said, I do understand that allowing GFP_NOFS allocations to
fail is not an easy task and nothing to be done tomorrow or in a few
months, but I believe that a discussion with FS people about what
can/should be done in order to make this happen is valuable.

Thanks!
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] proposals for topics
From: Dave Chinner @ 2016-01-31 23:29 UTC
To: Michal Hocko; +Cc: Johannes Weiner, lsf-pc, linux-mm, linux-fsdevel

On Thu, Jan 28, 2016 at 11:04:23PM +0100, Michal Hocko wrote:
> On Fri 29-01-16 07:55:25, Dave Chinner wrote:
> > On Tue, Jan 26, 2016 at 10:50:23AM +0100, Michal Hocko wrote:
> [...]
> > > There have been patches posted during the year to fortify those
> > > places which cannot cope with allocation failures for ext[34], and
> > > testing has shown that ext* resp. xfs are quite ready to see NOFS
> > > allocation failures.
> >
> > The XFS situation is completely unchanged from last year, and the
> > fact that you say it handles NOFS allocation failures just fine
> > makes me seriously question your testing methodology.
>
> I am certainly open to suggestions there. My testing managed to
> identify some weaker points in ext[34] which led to RO remounts.
> __GFP_NOFAIL as the current band aid worked for them. I wasn't able to
> hit this with xfs.

I'd suggest that you turn on error injection to fail memory
allocations. See Documentation/fault-injection/fault-injection.txt and
start failing random slab allocations whilst running a workload that
creates/unlinks lots of files.

> > We need to fundamentally change error handling in transactions in
> > XFS to allow arbitrary memory allocations to fail. [...]
>
> I am quite confused now. I remember you were the one who complained
> about the silent nofail behavior of the allocator, because that means
> you cannot implement an appropriate fallback strategy.

I complained about the fact that the allocator did not behave as
documented (or expected) in that it didn't fail allocations we expected
it to fail.

> Please also note that I am talking solely about GFP_NOFS allocations
> here. The allocator really cannot do much other than hopelessly retry
> and rely on somebody _else_ to make forward progress.

Well, yes, that's why XFS has, for many years, counted retry attempts
and emitted warnings when it is struggling to make allocation progress
(in any context). :)

> That being said, I do understand that allowing GFP_NOFS allocations to
> fail is not an easy task and nothing to be done tomorrow or in a few
> months, but I believe that a discussion with FS people about what
> can/should be done in order to make this happen is valuable.

The discussion - from my perspective - is likely to be no different
from previous years. None of the proposals that FS people have come up
with to address the "need memory allocation guarantees" issue have got
any traction on the mm side. Unless there's something fundamentally new
from the MM side that provides filesystems with a replacement for
__GFP_NOFAIL type behaviour, I don't think further discussion is going
to change the status quo.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] proposals for topics
From: Vlastimil Babka @ 2016-02-01 12:24 UTC
To: Dave Chinner, Michal Hocko; Cc: Johannes Weiner, lsf-pc, linux-mm, linux-fsdevel

On 02/01/2016 12:29 AM, Dave Chinner wrote:
> On Thu, Jan 28, 2016 at 11:04:23PM +0100, Michal Hocko wrote:
>> On Fri 29-01-16 07:55:25, Dave Chinner wrote:
>>> On Tue, Jan 26, 2016 at 10:50:23AM +0100, Michal Hocko wrote:
>> [...]
>>>> There have been patches posted during the year to fortify those
>>>> places which cannot cope with allocation failures for ext[34], and
>>>> testing has shown that ext* resp. xfs are quite ready to see NOFS
>>>> allocation failures.
>>>
>>> The XFS situation is completely unchanged from last year, and the
>>> fact that you say it handles NOFS allocation failures just fine
>>> makes me seriously question your testing methodology.
>>
>> I am quite confused now. I remember you were the one who complained
>> about the silent nofail behavior of the allocator, because that means
>> you cannot implement an appropriate fallback strategy.
>
> I complained about the fact that the allocator did not behave as
> documented (or expected) in that it didn't fail allocations we
> expected it to fail.

Yes, I believe this is exactly what Michal was talking about in the
original e-mail:

> - GFP_NOFS is another one which would be good to discuss. Its primary
>   use is to prevent reclaim recursion back into the FS. This makes
>   such an allocation context weaker, and historically we haven't
>   triggered the OOM killer but rather hopelessly retried the request,
>   relying on somebody else to make progress for us. There are two
>   issues here.
>   First, we shouldn't retry endlessly but rather fail the allocation
>   and allow the FS to handle the error. As per my experiments, most
>   filesystems cope with that quite reasonably. Btrfs unfortunately
>   handles many of those failures by BUG_ON.

So this should address your complaint above.

>> That being said, I do understand that allowing GFP_NOFS allocations
>> to fail is not an easy task and nothing to be done tomorrow or in a
>> few months, but I believe that a discussion with FS people about what
>> can/should be done in order to make this happen is valuable.
>
> The discussion - from my perspective - is likely to be no different
> from previous years. None of the proposals that FS people have come up
> with to address the "need memory allocation guarantees" issue have got
> any traction on the mm side. Unless there's something fundamentally
> new from the MM side that provides filesystems with a replacement for
> __GFP_NOFAIL type behaviour, I don't think further discussion is going
> to change the status quo.

Yeah, the guaranteed reserves as discussed last year didn't happen so
far. But that's a separate issue from GFP_NOFS *without* __GFP_NOFAIL.
It just got mixed up in this thread.
* Re: [LSF/MM TOPIC] proposals for topics
From: Vlastimil Babka @ 2016-01-26 17:07 UTC
To: Johannes Weiner, Michal Hocko; +Cc: lsf-pc, linux-mm, linux-fsdevel

On 01/25/2016 07:45 PM, Johannes Weiner wrote:
>> - One of the long lasting issues related to OOM handling is when to
>>   actually declare OOM. There are workloads which might be thrashing
>>   on the few last remaining pagecache pages or on swap, which makes
>>   the system completely unusable for a considerable amount of time,
>>   yet the OOM killer is not invoked. Can we finally do something
>>   about that?
>
> I'm working on this, but it's not an easy situation to detect.
>
> We can't decide based on the amount of page cache, as you could have
> very little of it and still be fine. Most of it could still be
> used-once.
>
> We can't decide based on the number or rate of (re)faults, because
> this spikes during startup and workingset changes, or can even be
> sustained when working with a data set that you'd never expect to fit
> into memory in the first place, while still making acceptable
> progress.

I would hope that workingset could help distinguish workloads thrashing
due to low memory from those that can't fit there no matter what? Or
would it require tracking the lifetime of so many evicted pages that
the memory overhead of that would be infeasible?

> The only thing that I could come up with as a meaningful metric here
> is the share of actual walltime that is spent waiting on refetching
> stuff from disk. If we know that in the last X seconds the whole
> system spent more than, say, 95% of its time waiting on the disk to
> read recently evicted data back into the cache, then it's time to kick
> the OOM killer, as this state is likely not worth maintaining.
>
> Such a "thrashing time" metric would be great to export to userspace
> in general, as it can be useful in other situations, such as quickly
> gauging how comfortable a workload is (inside a container) and how
> much time is wasted due to underprovisioning of memory. Because it
> isn't just the pathological cases: you might just wait a bit here and
> there, and it could still add up to a sizable portion of a job's time.
>
> If other people think this could be a useful thing to talk about, I'd
> be happy to discuss it at the conference.

I think this discussion would be useful, yeah.
* Re: [LSF/MM TOPIC] proposals for topics
  2016-01-26 17:07 ` Vlastimil Babka
@ 2016-01-26 18:09 ` Johannes Weiner
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2016-01-26 18:09 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Michal Hocko, lsf-pc, linux-mm, linux-fsdevel

On Tue, Jan 26, 2016 at 06:07:52PM +0100, Vlastimil Babka wrote:
> On 01/25/2016 07:45 PM, Johannes Weiner wrote:
>>> - One of the long-lasting issues related to OOM handling is when to actually declare OOM. There are workloads which might be thrashing on the few last remaining pagecache pages or on swap, which makes the system completely unusable for a considerable amount of time, yet the OOM killer is not invoked. Can we finally do something about that?
>>
>> I'm working on this, but it's not an easy situation to detect.
>>
>> We can't decide based on the amount of page cache, as you could have very little of it and still be fine. Most of it could still be used-once.
>>
>> We can't decide based on the number or rate of (re)faults, because this spikes during startup and workingset changes, or can even be sustained when working with a data set that you'd never expect to fit into memory in the first place, while still making acceptable progress.
>
> I would hope that workingset should help distinguish workloads thrashing due to low memory from those that can't fit there no matter what? Or would it require tracking the lifetime of so many evicted pages that the memory overhead of that would be infeasible?

Yes, using the workingset code is exactly my plan. The only thing it requires on top is a time component. Then we can kick the OOM killer based on the share of time a workload (the system?) spends thrashing.
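A back-of-the-envelope sketch of such a time component, showing the shape of the idea rather than any actual implementation. Every identifier here (thrash_state, thrash_account(), thrash_should_oom(), the window length and threshold) is hypothetical; nothing like this existed in the kernel at the time of this thread.

#include <linux/ktime.h>

/* Hypothetical per-workload (or system-wide) bookkeeping. */
struct thrash_state {
	u64 window_start_ns;	/* start of the observation window */
	u64 thrashing_ns;	/* time spent waiting on refault I/O */
};

#define THRASH_WINDOW_NS	(10ULL * NSEC_PER_SEC)	/* "last X seconds" */
#define THRASH_OOM_PCT		95

/* Charge time spent blocked on reads that workingset flagged as refaults. */
static void thrash_account(struct thrash_state *ts, u64 wait_ns)
{
	ts->thrashing_ns += wait_ns;
}

/* Poll from reclaim: was >95% of the last window spent refaulting? */
static bool thrash_should_oom(struct thrash_state *ts, u64 now_ns)
{
	u64 window = now_ns - ts->window_start_ns;

	if (window < THRASH_WINDOW_NS)
		return false;

	if (ts->thrashing_ns * 100 > window * THRASH_OOM_PCT)
		return true;

	/* Healthy enough; start a new observation window. */
	ts->window_start_ns = now_ns;
	ts->thrashing_ns = 0;
	return false;
}

The same ratio, exported per container instead of fed to the OOM killer, would serve the userspace "how comfortable is this workload" gauge described above.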
* Re: [LSF/MM TOPIC] proposals for topics
  2016-01-25 18:45 ` Johannes Weiner
  2016-01-26  9:50 ` Michal Hocko
  2016-01-26 17:07 ` Vlastimil Babka
@ 2016-01-30 18:18 ` Greg Thelen
  2 siblings, 0 replies; 19+ messages in thread
From: Greg Thelen @ 2016-01-30 18:18 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Michal Hocko, lsf-pc, linux-mm@kvack.org, linux-fsdevel

On Mon, Jan 25, 2016 at 10:45 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hi Michal,
>
> On Mon, Jan 25, 2016 at 02:33:57PM +0100, Michal Hocko wrote:
>> Hi,
>> I would like to propose the following topics (mainly for the MM track, but some of them might be of interest for FS people as well)
>> - gfp flags for allocation requests seem to be quite complicated and used arbitrarily by many subsystems. GFP_REPEAT is one such example. Half of the current usage is for low order allocation requests where it is basically ignored. Moreover, the documentation claims that such a request is _not_ retrying endlessly, which is true only for costly high order allocations. I think we should get rid of most of the users of this flag (basically all low order ones) and then come up with something like GFP_BEST_EFFORT which would work for all orders consistently [1]
>
> I think nobody would mind a patch that just cleans this stuff up. Do you expect controversy there?
>
>> - GFP_NOFS is another one which would be good to discuss. Its primary use is to prevent reclaim recursion back into the FS. This makes such an allocation context weaker, and historically we haven't triggered the OOM killer but rather hopelessly retried the request and relied on somebody else to make progress for us. There are two issues here.
>> First, we shouldn't retry endlessly but rather fail the allocation and allow the FS to handle the error. As per my experiments, most FSes cope with that quite reasonably. Btrfs unfortunately handles many of those failures by BUG_ON, which is really unfortunate.
>
> Are there any new datapoints on how to deal with failing allocations? IIRC the conclusion last time was that some filesystems simply can't support this without a reservation system - which I don't believe anybody is working on. Does it make sense to rehash this when nothing really changed since last time?
>
>> - OOM killer has been discussed a lot throughout this year. We discussed this topic last year at LSF and there has been quite some progress since then. We have async memory tear down for the OOM victim [2] which should help in many corner cases. We are still waiting to make mmap_sem for write killable, which would help in some other classes of corner cases. Whatever we do, however, will not work in 100% of cases. So the primary question is how far we are willing to go to support different corner cases. Do we want to have a panic_after_timeout global knob, or allow multiple OOM victims after a timeout?
>
> Yes, that sounds like a good topic to cover. I'm honestly surprised that there is so much resistance to trying to make the OOM killer deterministic, and that patches which try to fix that are resisted while the thing can still lock up quietly.
>
> It would be good to take a step back and consider our priorities there, think about what the ultimate goal of the OOM killer is, and then how to make it operate smoothly without compromising that goal - not the other way round.

A few thoughts on our current/future oom killer usage. We've been using the oom killer as an overcommit tie breaker.
Victim selection isn't always based on memory usage; instead, low priority jobs are the first victims. Thus a deterministic scoring system, independent of memory usage, has been useful - and one that's based on the memcg hierarchy. Because jobs are often defined at container boundaries, it's also expedient to oom kill all processes within a memcg. Killing processes isn't always enough to free memory, because tmpfs/hugetlbfs aren't direct oom victims. A combination of namespaces and kill-all-container-processes is promising, though, because dropping the last reference to a namespace can unmount its filesystems. But this doesn't help if refs to the filesystem exist outside of the namespace (e.g. fd's passed over unix sockets), so other ideas are floating around. (A sketch of one possible memcg-based victim selection loop follows this message.)

And thrash detection is also quite helpful to decide when oom killing is better than hammering reclaim for a really long time. Refaulting is one signal of when to oom kill, but another is that high priority tasks are only willing to spend X before oom killing a lower prio victim (sorry, X is vague because it hasn't been sorted out yet; it could be wallclock, cpu time, disk bandwidth, etc.).

>> - sysrq+f to trigger the oom killer follows some heuristics used by the OOM killer invoked by the system, which means that it is unreliable and it might decline to kill any task, without any explanation why. The semantics of the knob don't seem to be clear, and it has even been suggested [3] to remove it altogether as a not very useful debugging aid. Is this really a general consensus?
>
> I think it's an okay debugging aid, but I worry about it coming up so much in discussions about how the OOM killer should behave. We should never *require* manual intervention to put a machine back into a known state after it ran out of memory.
>
>> - One of the long-lasting issues related to OOM handling is when to actually declare OOM. There are workloads which might be thrashing on the few last remaining pagecache pages or on swap, which makes the system completely unusable for a considerable amount of time, yet the OOM killer is not invoked. Can we finally do something about that?
>
> I'm working on this, but it's not an easy situation to detect.
>
> We can't decide based on the amount of page cache, as you could have very little of it and still be fine. Most of it could still be used-once.
>
> We can't decide based on the number or rate of (re)faults, because this spikes during startup and workingset changes, or can even be sustained when working with a data set that you'd never expect to fit into memory in the first place, while still making acceptable progress.
>
> The only thing that I could come up with as a meaningful metric here is the share of actual walltime that is spent waiting on refetching stuff from disk. If we know that in the last X seconds, the whole system spent more than, say, 95% of its time waiting on the disk to read recently evicted data back into the cache, then it's time to kick the OOM killer, as this state is likely not worth maintaining.
>
> Such a "thrashing time" metric could be great to export to userspace in general, as it can be useful in other situations, such as quickly gauging how comfortable a workload is (inside a container) and how much time is wasted due to underprovisioning of memory. Because it isn't just the pathological cases: you might just wait a bit here and there, and it could still add up to a sizable portion of a job's time.
>
> If other people think this could be a useful thing to talk about, I'd be happy to discuss it at the conference.
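A minimal sketch of the deterministic, memcg-hierarchy-based victim selection described in the message above. The per-memcg oom_priority field is hypothetical (something userspace would set per job); for_each_mem_cgroup_tree() and page_counter_read() are modeled on existing kernel helpers, but this loop, and the css reference counting it glosses over, is illustration only.

/*
 * Pick a whole memcg as the OOM victim: lowest priority loses, and
 * memory usage only breaks ties.  A kill-all-tasks step would then be
 * applied to every process in the chosen group.
 */
static struct mem_cgroup *pick_victim_memcg(struct mem_cgroup *root)
{
	struct mem_cgroup *iter, *victim = NULL;

	for_each_mem_cgroup_tree(iter, root) {
		if (!victim ||
		    iter->oom_priority < victim->oom_priority ||
		    (iter->oom_priority == victim->oom_priority &&
		     page_counter_read(&iter->memory) >
		     page_counter_read(&victim->memory)))
			victim = iter;	/* css refcounting elided */
	}
	return victim;
}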
Thread overview: 19+ messages
2016-01-25 13:33 [LSF/MM TOPIC] proposals for topics Michal Hocko
2016-01-25 14:21 ` [Lsf-pc] " Jan Kara
2016-01-25 14:40   ` Michal Hocko
2016-01-25 15:08 ` Tetsuo Handa
2016-01-26  9:43   ` Michal Hocko
2016-01-27 13:44     ` Tetsuo Handa
2016-01-27 14:33       ` [Lsf-pc] " Jan Kara
2016-01-25 18:45 ` Johannes Weiner
2016-01-26  9:50   ` Michal Hocko
2016-01-26 17:17     ` Vlastimil Babka
2016-01-26 17:20     ` [Lsf-pc] " Jan Kara
2016-01-27  9:08       ` Michal Hocko
2016-01-28 20:55     ` Dave Chinner
2016-01-28 22:04       ` Michal Hocko
2016-01-31 23:29         ` Dave Chinner
2016-02-01 12:24           ` Vlastimil Babka
2016-01-26 17:07   ` Vlastimil Babka
2016-01-26 18:09     ` Johannes Weiner
2016-01-30 18:18   ` Greg Thelen