* memcg: softlimit on internal nodes
@ 2013-04-20  0:26 Tejun Heo
  2013-04-20  0:42 ` Tejun Heo
  2013-04-20  3:16 ` Michal Hocko
  0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-20  0:26 UTC
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen

Hello, Michal and all.

Sorry about asking silly questions and leaving in the middle. I had a plane to catch which I just barely made. I thought about it on the way here and your proposal seems confused.

I think the crux of the confusion comes from the fact that you're essentially proposing flipping the meaning of the knob for internal nodes - it means minimum guaranteed allocation - that is, the shrinker won't bother the cgroup if the memory consumption is under the softlimit - and your proposal is to reverse that for cgroups with children so that it actually means "soft" limit - creating pressure if above the limit (IIUC, it isn't entirely that either, as the pressure is created iff the whole system is under memory pressure, right?).

Regardless of the direction of a configuration, a parent cgroup should gate that configuration in the same direction. ie. if it's a limit for a leaf node when reached, it also is a limit for the whole subtree for an internal cgroup. If it's a configuration which guarantees allocation (in the sense that it'll be excluded from memory reclaim while under the limit), the same: if the subtree is under the limit, reclaim shouldn't trigger.

For example, please consider the following hierarchy where s denotes the "softlimit" and h the hardlimit.

            A (h:8G s:4G)
           /             \
  B (h:5G s:1G)   C (h:5G s:1G)

For hard limit, nobody seems confused about how the internal limit should apply - if either B or C goes over 5G, the one going over that limit will be on the receiving end of the OOM killer.
Also, even if both B and C are individually under 5G, if the sum of the two goes over A's limit - 8G - the OOM killer will be activated on the subtree. It'd be a policy decision whether to kill tasks from A, B or C, but no matter what, the parent's limit will be enforced on the subtree. Note that this is a perfectly valid configuration. It is *not* an invalid configuration. It is exactly what the hierarchical configuration is supposed to do.

It must not be any different for "softlimit". If B or C are individually under 1G, they won't be targeted by the reclaimer, and even if B and C are over 1G, let's say 2G, as long as the sum is under A's "softlimit" - 4G - the reclaimer won't look at them. It is exactly the same as the hardlimit, just in the opposite direction.

Now, let's consider the following hierarchy just to be sure. Let's assume that A itself doesn't have any tasks for simplicity.

            A (h:16G s:4G)
           /              \
  B (h:7G s:5G)    C (h:7G s:5G)

For hardlimit, it is clear that A's limit won't do anything, no matter what B and C do. In exactly the same way, A's "softlimit" doesn't do anything regardless of what B and C do. Just like A's hardlimit doesn't impose any further restrictions on B and C, A's softlimit doesn't give any further guarantee to B and C. There's no difference at all.

Now, it's completely silly that "softlimit" is actually an allocation guarantee rather than an actual limit. I guess it's born out of similar confusion? Maybe originally the operation was a confused mix of the two and it moved closer to the guaranteeing behavior over time? Anyways, it's apparent why an actual soft limit - that is, something which creates reclaim pressure even when the system as a whole isn't under memory pressure - would be useful, and I'm actually kinda surprised that it doesn't already exist.
It isn't difficult to imagine use cases where the user doesn't want certain services/applications (say backup, torrent or a static http server serving large files) to consume huge amounts of memory without triggering the OOM killer. It is something which is fundamentally useful and I think is why people are confused and pulling the current "softlimit" towards something like that.

If such an actual soft limit is desired (I don't know, it just seems like a very fundamental / logical feature to me), please don't try to somehow overload "softlimit". They are two fundamentally different knobs, both make sense in their own ways, and when you stop confusing the two, there's nothing ambiguous about what each knob means in hierarchical situations. This goes the same for the "untrusted" flag Ying told me about, which seems like another confused way to overload two meanings onto "softlimit". Don't overload!

Now let's see if this gogo thing actually works.

Thanks.

--
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
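[Editor's illustration] The example hierarchy being argued about could be set up through the cgroup v1 memcg interface; a minimal sketch, assuming memcg is mounted at the standard location (group names and sizes follow the thread's A/B/C example; the mount path is an assumption):

```shell
# Illustrative setup for the A/B/C example (cgroup v1 memory controller).
# memory.limit_in_bytes is the hard limit (h:), memory.soft_limit_in_bytes
# is the "softlimit" (s:) whose hierarchical meaning the thread debates.
cd /sys/fs/cgroup/memory
mkdir -p A/B A/C
echo 1  > A/memory.use_hierarchy           # hierarchical accounting (v1)
echo 8G > A/memory.limit_in_bytes          # A  h:8G
echo 4G > A/memory.soft_limit_in_bytes     # A  s:4G
echo 5G > A/B/memory.limit_in_bytes        # B  h:5G
echo 1G > A/B/memory.soft_limit_in_bytes   # B  s:1G
echo 5G > A/C/memory.limit_in_bytes        # C  h:5G
echo 1G > A/C/memory.soft_limit_in_bytes   # C  s:1G
```

The disagreement in the rest of the thread is not about these files but about what the kernel should do when A's subtree exceeds s:4G.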
* Re: memcg: softlimit on internal nodes
  2013-04-20  0:26 memcg: softlimit on internal nodes Tejun Heo
@ 2013-04-20  0:42 ` Tejun Heo
  2013-04-20  3:35   ` Greg Thelen
  2013-04-20  3:16 ` Michal Hocko
  1 sibling, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2013-04-20  0:42 UTC
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen

On Fri, Apr 19, 2013 at 05:26:20PM -0700, Tejun Heo wrote:
> If such an actual soft limit is desired (I don't know, it just seems like
> a very fundamental / logical feature to me), please don't try to
> somehow overload "softlimit". They are two fundamentally different
> knobs, both make sense in their own ways, and when you stop confusing
> the two, there's nothing ambiguous about what each knob means in
> hierarchical situations. This goes the same for the "untrusted" flag
> Ying told me about, which seems like another confused way to overload two
> meanings onto "softlimit". Don't overload!

As for how actually to clean up this yet another mess in memcg, I don't know. Maybe introduce completely new knobs - say, oom_threshold, reclaim_threshold, and reclaim_trigger - and alias hardlimit to oom_threshold and softlimit to reclaim_trigger? BTW, "softlimit" should default to 0. Nothing else makes any sense. Maybe you can gate it with the "sane_behavior" flag or something. I don't know. It's your mess to clean up. :P

Thanks.

--
tejun
* Re: memcg: softlimit on internal nodes
  2013-04-20  0:42 ` Tejun Heo
@ 2013-04-20  3:35   ` Greg Thelen
  2013-04-21  1:53     ` Tejun Heo
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Thelen @ 2013-04-20  3:35 UTC
To: Tejun Heo
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm@kvack.org, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse

On Fri, Apr 19, 2013 at 5:42 PM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Apr 19, 2013 at 05:26:20PM -0700, Tejun Heo wrote:
>> If such an actual soft limit is desired (I don't know, it just seems like
>> a very fundamental / logical feature to me), please don't try to
>> somehow overload "softlimit". They are two fundamentally different
>> knobs, both make sense in their own ways, and when you stop confusing
>> the two, there's nothing ambiguous about what each knob means in
>> hierarchical situations. This goes the same for the "untrusted" flag
>> Ying told me about, which seems like another confused way to overload two
>> meanings onto "softlimit". Don't overload!
>
> As for how actually to clean up this yet another mess in memcg, I
> don't know. Maybe introduce completely new knobs - say,
> oom_threshold, reclaim_threshold, and reclaim_trigger - and alias
> hardlimit to oom_threshold and softlimit to reclaim_trigger? BTW,
> "softlimit" should default to 0. Nothing else makes any sense.

I agree that the hard limit could be called the oom_threshold.

The meaning of the term reclaim_threshold is not obvious to me. I'd prefer to call the soft limit a reclaim_target. System global pressure can steal memory from a cgroup until its usage drops to the soft limit (aka reclaim_target). Pressure will try to avoid stealing memory below the reclaim target. The soft limit (reclaim_target) is not checked until global pressure exists. Currently we do not have a knob to set a reclaim_threshold such that, when usage exceeds the reclaim_threshold, async reclaim is queued.
We are not discussing triggering anything when soft limit is exceeded.
* Re: memcg: softlimit on internal nodes
  2013-04-20  3:35   ` Greg Thelen
@ 2013-04-21  1:53     ` Tejun Heo
  0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-21  1:53 UTC
To: Greg Thelen
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm@kvack.org, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse

Hey, Greg.

On Fri, Apr 19, 2013 at 08:35:12PM -0700, Greg Thelen wrote:
> > As for how actually to clean up this yet another mess in memcg, I
> > don't know. Maybe introduce completely new knobs - say,
> > oom_threshold, reclaim_threshold, and reclaim_trigger - and alias
> > hardlimit to oom_threshold and softlimit to reclaim_trigger? BTW,
> > "softlimit" should default to 0. Nothing else makes any sense.
>
> I agree that the hard limit could be called the oom_threshold.
>
> The meaning of the term reclaim_threshold is not obvious to me. I'd
> prefer to call the soft limit a reclaim_target. System global
> pressure can steal memory from a cgroup until its usage drops to the
> soft limit (aka reclaim_target). Pressure will try to avoid stealing
> memory below the reclaim target. The soft limit (reclaim_target) is
> not checked until global pressure exists. Currently we do not have a
> knob to set a reclaim_threshold such that, when usage exceeds the
> reclaim_threshold, async reclaim is queued. We are not discussing
> triggering anything when soft limit is exceeded.

Yeah, reclaim_target seems like a better name for it.

Thanks.

--
tejun
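[Editor's illustration] The renaming in this subthread boils down to two knobs pointing in opposite directions: a ceiling (oom_threshold) whose violation is punished, and a floor (reclaim_target) below which reclaim tries not to push usage, checked only under global pressure. A toy model of that distinction, with the proposed names - this is a sketch of the semantics being discussed, not kernel code:

```python
def oom_candidate(usage, oom_threshold):
    """Ceiling: exceeding the hard limit makes the group an OOM target."""
    return usage > oom_threshold

def reclaimable(usage, reclaim_target, global_pressure):
    """Floor: under global pressure, reclaim may steal memory only while
    usage is still above the group's reclaim_target (the old softlimit).
    Without global pressure, the knob is not consulted at all."""
    return global_pressure and usage > reclaim_target

GB = 1 << 30
assert oom_candidate(9 * GB, 8 * GB)                      # over the ceiling
assert not oom_candidate(7 * GB, 8 * GB)                  # under it: fine
assert reclaimable(6 * GB, 0, True)                       # target 0: always fair game
assert not reclaimable(3 * GB, 4 * GB, True)              # protected below the target
assert not reclaimable(6 * GB, 4 * GB, False)             # no pressure, no reclaim
```

The `reclaim_target=0` default Tejun argues for corresponds to "no protection unless explicitly requested".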
* Re: memcg: softlimit on internal nodes
  2013-04-20  0:26 memcg: softlimit on internal nodes Tejun Heo
  2013-04-20  0:42 ` Tejun Heo
@ 2013-04-20  3:16 ` Michal Hocko
  2013-04-21  2:23   ` Tejun Heo
  1 sibling, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-20  3:16 UTC
To: Tejun Heo
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen

On Fri 19-04-13 17:26:20, Tejun Heo wrote:
> Hello, Michal and all.
>
> Sorry about asking silly questions and leaving in the middle. I had a
> plane to catch which I just barely made. I thought about it on the
> way here and your proposal seems confused.
>
> I think the crux of the confusion comes from the fact that you're
> essentially proposing flipping the meaning of the knob for internal
> nodes - it means minimum guaranteed allocation - that is, the shrinker
> won't bother the cgroup if the memory consumption is under the
> softlimit - and your proposal is to reverse that for cgroups with
> children so that it actually means "soft" limit - creating pressure if
> above the limit (IIUC, it isn't entirely that either, as the pressure
> is created iff the whole system is under memory pressure, right?).

No, one of the patches changes that and puts soft reclaim on the hard reclaim path as well - basically, try to reclaim over-soft-limit groups first and do not bother others if you can make your target. Please refer to the patchset for details (http://comments.gmane.org/gmane.linux.kernel.mm/97973)

> Regardless of the direction of a configuration, a parent cgroup should
> gate that configuration in the same direction. ie. if it's a limit
> for a leaf node when reached, it also is a limit for the whole
> subtree for an internal cgroup.

Agreed and that is exactly what I was saying and what the code does.
> If it's a configuration which guarantees allocation (in the sense that
> it'll be excluded from memory reclaim while under the limit), the same:
> if the subtree is under the limit, reclaim shouldn't trigger.
>
> For example, please consider the following hierarchy where s denotes
> the "softlimit" and h the hardlimit.
>
>             A (h:8G s:4G)
>            /             \
>   B (h:5G s:1G)   C (h:5G s:1G)
>
> For hard limit, nobody seems confused about how the internal limit should
> apply - if either B or C goes over 5G, the one going over that limit
> will be on the receiving end of the OOM killer.

Right

> Also, even if both B and C are individually under 5G, if the sum of
> the two goes over A's limit - 8G - the OOM killer will be activated on the
> subtree. It'd be a policy decision whether to kill tasks from A, B or
> C, but no matter what, the parent's limit will be enforced on the
> subtree. Note that this is a perfectly valid configuration.

Agreed.

> It is *not* an invalid configuration. It is exactly what the
> hierarchical configuration is supposed to do.
>
> It must not be any different for "softlimit". If B or C are
> individually under 1G, they won't be targeted by the reclaimer, and
> even if B and C are over 1G, let's say 2G, as long as the sum is under
> A's "softlimit" - 4G - the reclaimer won't look at them.

But we disagree on this one. If B and/or C are above their soft limit we do (soft) reclaim them. It is exactly the same thing as if they were hitting their hard limit (we just enforce the limit lazily).

You can look at the soft limit as a lazy limit which is enforced only if there is external pressure coming up the hierarchy - this can be either global memory pressure or a hard limit reached up the hierarchy. Does this make sense to you?

> It is exactly the same as hardlimit, just the opposite direction.
>
> Now, let's consider the following hierarchy just to be sure. Let's
> assume that A itself doesn't have any tasks for simplicity.
>
>             A (h:16G s:4G)
>            /              \
>   B (h:7G s:5G)    C (h:7G s:5G)
>
> For hardlimit, it is clear that A's limit won't do anything.

It _does_ if A has tasks which add pressure to B+C. Or even if you do not have any tasks, because A might hold some reparented pages from groups which are gone now.

> No matter what B and C do. In exactly the same way, A's "softlimit"
> doesn't do anything regardless of what B and C do.

And same here.

> Just like A's hardlimit doesn't impose any further restrictions on B
> and C, A's softlimit doesn't give any further guarantee to B and C.
> There's no difference at all.

If A hits its hard limit then we reclaim that subtree, so we _can_ and _do_ reclaim also from B and C. This is what the current code does and soft reclaim doesn't change that at all. The only thing it changes is that it tries to save groups below the limit from reclaiming.

> Now, it's completely silly that "softlimit" is actually an allocation
> guarantee rather than an actual limit. I guess it's born out of
> similar confusion? Maybe originally the operation was a confused mix
> of the two and it moved closer to the guaranteeing behavior over time?

I wouldn't call it silly. It actually makes a lot of sense if you look at it as a delayed limit which would allow you to allocate more if there is not any outside memory pressure.

> Anyways, it's apparent why an actual soft limit - that is, something which
> creates reclaim pressure even when the system as a whole isn't under
> memory pressure - would be useful, and I'm actually kinda surprised
> that it doesn't already exist. It isn't difficult to imagine use
> cases where the user doesn't want certain services/applications (say
> backup, torrent or a static http server serving large files) to
> consume huge amounts of memory without triggering the OOM killer. It is
> something which is fundamentally useful and I think is why people are
> confused and pulling the current "softlimit" towards something like
> that.

Actually the use case is this.
Say you have an important workload which shouldn't be influenced by other less important workloads (say backup for simplicity). You set up a soft limit for your important load to match its average working set. The backup doesn't need any hard limit, and its soft limit is set to 0 because a) you do not know how much it would need and b) you'd like to make it run as fast as possible.

Check what happens now. Backup uses all the remaining memory until the global reclaim starts. The global reclaim will start reclaiming the backup, or even your important workload if it consumed more than its soft limit (say after a peak load). As long as you can reclaim enough from the backup to satisfy the global memory pressure, you do not have to hit the important workload. Sounds like a huge win to me!

You can even look at the soft limit as an "intelligent" mlock which keeps the memory "locked" as long as you can keep handling the external memory pressure. This is new with this new re-implementation, because the original code uses the soft limit only as a hint who to reclaim first but doesn't consider it any further.

> If such an actual soft limit is desired (I don't know, it just seems like
> a very fundamental / logical feature to me), please don't try to
> somehow overload "softlimit". They are two fundamentally different
> knobs, both make sense in their own ways, and when you stop confusing
> the two, there's nothing ambiguous about what each knob means in
> hierarchical situations. This goes the same for the "untrusted" flag
> Ying told me about, which seems like another confused way to overload two
> meanings onto "softlimit". Don't overload!
>
> Now let's see if this gogo thing actually works.
>
> Thanks.
>
> --
> tejun

--
Michal Hocko
SUSE Labs
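[Editor's illustration] Michal's description - "try to reclaim over-soft-limit groups first and do not bother others if you can make your target" - amounts to a two-pass victim selection. A toy model of that selection rule (not the actual vmscan code; names and units are illustrative):

```python
def pick_victims(groups, needed):
    """groups: dict name -> (usage, soft_limit), in arbitrary units.
    Prefer groups above their soft limit; fall back to every group only
    if the preferred set cannot satisfy the reclaim target."""
    over = {n for n, (u, s) in groups.items() if u > s}
    # How much the soft pass could reclaim without touching anyone else.
    reclaimable_over = sum(u - s for n, (u, s) in groups.items() if n in over)
    if reclaimable_over >= needed:
        return sorted(over)       # soft pass makes the target: others untouched
    return sorted(groups)         # pressure remains: everyone becomes eligible

groups = {"backup": (6, 0), "important": (3, 4)}
assert pick_victims(groups, needed=2) == ["backup"]               # backup pays first
assert pick_victims(groups, needed=8) == ["backup", "important"]  # not enough: all eligible
```

This also captures his "lazy limit" framing: the soft limit is only enforced once external pressure (global or a parent's hard limit) arrives.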
* Re: memcg: softlimit on internal nodes
  2013-04-20  3:16 ` Michal Hocko
@ 2013-04-21  2:23   ` Tejun Heo
  2013-04-21  8:55     ` Michel Lespinasse
  2013-04-21 12:46     ` Michal Hocko
  0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-21  2:23 UTC
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen

Hello, Michal.

On Fri, Apr 19, 2013 at 08:16:11PM -0700, Michal Hocko wrote:
> > For example, please consider the following hierarchy where s denotes
> > the "softlimit" and h the hardlimit.
> >
> >             A (h:8G s:4G)
> >            /             \
> >   B (h:5G s:1G)   C (h:5G s:1G)
...
> > It must not be any different for "softlimit". If B or C are
> > individually under 1G, they won't be targeted by the reclaimer, and
> > even if B and C are over 1G, let's say 2G, as long as the sum is under
> > A's "softlimit" - 4G - the reclaimer won't look at them.
>
> But we disagree on this one. If B and/or C are above their soft limit
> we do (soft) reclaim them. It is exactly the same thing as if they were
> hitting their hard limit (we just enforce the limit lazily).
>
> You can look at the soft limit as a lazy limit which is enforced only if
> there is external pressure coming up the hierarchy - this can be
> either global memory pressure or a hard limit reached up the hierarchy.
> Does this make sense to you?

When flat, there's no confusion. The problem is that what you describe makes the meaning of softlimit different for internal nodes and leaf nodes. IIUC, it, at least currently, guarantees that reclaim won't happen for a cgroup under the limit. In a hierarchical setting, if A's subtree is under the limit, its subtree shouldn't be subject to reclaim. Again, you should be gating / stacking the limits as you go down the tree, and what you're saying breaks that fundamental hierarchy rule.

> > Now, let's consider the following hierarchy just to be sure. Let's
> > assume that A itself doesn't have any tasks for simplicity.
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> >             A (h:16G s:4G)
> >            /              \
> >   B (h:7G s:5G)    C (h:7G s:5G)
> >
> > For hardlimit, it is clear that A's limit won't do anything.
>
> It _does_ if A has tasks which add pressure to B+C. Or even if you do
> not have any tasks, because A might hold some reparented pages from
> groups which are gone now.

See the above. It's to discuss the semantics of limit hierarchy, so let's forget about A's internal usage for now.

> > Just like A's hardlimit doesn't impose any further restrictions on B
> > and C, A's softlimit doesn't give any further guarantee to B and C.
> > There's no difference at all.
>
> If A hits its hard limit then we reclaim that subtree, so we _can_ and
> _do_ reclaim also from B and C. This is what the current code does and
> soft reclaim doesn't change that at all. The only thing it changes is
> that it tries to save groups below the limit from reclaiming.

Hardlimit and softlimit are in the *opposite* directions and you're saying that softlimit in the parent working in the same direction as hardlimit is correct. Stop being so confused. Softlimit is in the opposite direction. An internal node's limit in a hierarchical setting should of course work in the opposite direction.

> > Now, it's completely silly that "softlimit" is actually an allocation
> > guarantee rather than an actual limit. I guess it's born out of
> > similar confusion? Maybe originally the operation was a confused mix
> > of the two and it moved closer to the guaranteeing behavior over time?
>
> I wouldn't call it silly. It actually makes a lot of sense if you look
> at it as a delayed limit which would allow you to allocate more if there
> is not any outside memory pressure.

It is silly because it *prevents* reclaim from happening if the cgroup is under the limit, which is *the* defining characteristic of the knob. Memory is by *default* allowed to be reclaimed.
How can being allowed to do what is allowed by default be a function of a knob? It seems like this confusion is leading you to think weird things about the meaning of the knob in a hierarchy. Stop thinking about it as a limit. It's a reclaim inhibitor.

> Actually the use case is this. Say you have an important workload which
> shouldn't be influenced by other less important workloads (say backup
> for simplicity). You set up a soft limit for your important load to
> match its average working set. The backup doesn't need any hard limit

Yes, guarantee.

> and its soft limit is set to 0 because a) you do not know how much it would
> need and b) you'd like to make it run as fast as possible. Check what happens
> now. Backup uses all the remaining memory until the global reclaim
> starts. The global reclaim will start reclaiming the backup, or even
> your important workload if it consumed more than its soft limit (say
> after a peak load). As long as you can reclaim enough from the backup to
> satisfy the global memory pressure, you do not have to hit the important
> workload. Sounds like a huge win to me!

I'm not saying the guarantee is useless. I'm saying its name is completely the opposite of what it does, and you, while knowing what it actually does in practice, are completely confused about what the knob semantically means.

> You can even look at the soft limit as an "intelligent" mlock which
> keeps the memory "locked" as long as you can keep handling the external
> memory pressure. This is new with this new re-implementation, because the
> original code uses the soft limit only as a hint who to reclaim first but
> doesn't consider it any further.

Now I'm confused. You're saying softlimit currently doesn't guarantee anything and what it means, even for a flat hierarchy, isn't clearly defined? If it can go either way and "softlimit" is being made an allocation guarantee rather than, say, "if there's any pressure, feel free to reclaim to this point (ie. prioritize reclaim to that point)", that doesn't sound like a good idea.

Really, don't mix "don't reclaim below this" and "this shouldn't need more than this; if under pressure, you can be aggressive about reclaiming this one down to this point". That's where all the confusions are coming from. They are two knobs in opposite directions and shouldn't be merged into a single knob.

Thanks.

--
tejun
* Re: memcg: softlimit on internal nodes
  2013-04-21  2:23   ` Tejun Heo
@ 2013-04-21  8:55     ` Michel Lespinasse
  2013-04-22  4:24       ` Tejun Heo
  2013-04-21 12:46     ` Michal Hocko
  1 sibling, 1 reply; 46+ messages in thread
From: Michel Lespinasse @ 2013-04-21  8:55 UTC
To: Tejun Heo
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen

Hi Tejun,

I don't remember exactly when you left - during the session I expressed to Michal that while I think his proposal is an improvement over the current situation, I think his handling of internal nodes is confus(ed/ing).

On Sat, Apr 20, 2013 at 7:23 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Michal.
>
> On Fri, Apr 19, 2013 at 08:16:11PM -0700, Michal Hocko wrote:
>> > For example, please consider the following hierarchy where s denotes
>> > the "softlimit" and h the hardlimit.
>> >
>> >             A (h:8G s:4G)
>> >            /             \
>> >   B (h:5G s:1G)   C (h:5G s:1G)
> ...
>> > It must not be any different for "softlimit". If B or C are
>> > individually under 1G, they won't be targeted by the reclaimer, and
>> > even if B and C are over 1G, let's say 2G, as long as the sum is under
>> > A's "softlimit" - 4G - the reclaimer won't look at them.

I completely agree with you here. This is important to ensure composability - someone that was using cgroups within a 4GB system can be moved to use cgroups within a hierarchy with a 4GB soft limit on the root, and still have its performance isolated from tasks running in other cgroups in the system.

>> > Now, let's consider the following hierarchy just to be sure. Let's
>> > assume that A itself doesn't have any tasks for simplicity.
>      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >
>> >             A (h:16G s:4G)
>> >            /              \
>> >   B (h:7G s:5G)    C (h:7G s:5G)
>> >
>> > For hardlimit, it is clear that A's limit won't do anything.

Now the above is a very interesting case.
One thing some people worry about is that B and C's configuration might be under a different administrator's control than A's. That is, we could have a situation where the machine's sysadmin set up A for someone else to play with, and that other person set up B and C within his cgroup. In this scenario, one of the issues has to be how we prevent B and C's configuration settings from reserving (or protecting from reclaim) more memory than the machine's admin intended when he configured A.

Michal's proposal resolves this by saying that A, B and C all become reclaimable as soon as A goes over its soft limit. Tejun's proposal (as I understand it) is that B and C are protected from reclaim until they grow to 5G each, as their soft limits indicate.

I have a third view, which I talked about during Michal's presentation. I think that when A's usage goes over 4G, we should be able to reclaim from A's subtree. If B or C's usage are above their soft limits, then we should reclaim from these cgroups; however, if both B and C have usage below their soft limits, then we are in a situation where the soft limits can't be obeyed, so we should ignore them and reclaim from both B and C instead.

The idea is that I think soft limits should follow these design principles:

- Soft limits are used to steer reclaim. We should try to avoid reclaiming from cgroups that are under their soft limits. However, soft limits can't completely prevent reclaim - if all cgroups are under their soft limits, then the soft limits become meaningless and all cgroups become eligible for being reclaimed from (this is a situation that the sysadmin can largely avoid by not over-committing the soft limits).

- A child cgroup should not be able to grab more resources than its parent (this is for the situation where the parent and child cgroups might be under separate administrative control). So when a parent cgroup hits its soft limit, the child cgroup soft limits should not be able to prevent us from reclaiming from that hierarchy. The child cgroup soft limits should still be obeyed to steer reclaim within the hierarchy when possible, though.

Regardless of these differences, I still want to stress that Michal's proposal is a clear improvement over what we have, so I see it as a large step in the right direction.

> Now I'm confused. You're saying softlimit currently doesn't guarantee
> anything and what it means, even for a flat hierarchy, isn't clearly
> defined?

The largest problem with softlimit today is that global reclaim doesn't take it into account at all... So yes, I would say that softlimit is very badly defined today (which may be why people have such trouble agreeing about what it should mean in the first place).

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
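[Editor's illustration] Michel's "third view" - obey child soft limits for steering, but never let them block reclaim once the parent is over its own soft limit - can be sketched as a selection rule. A toy model under the thread's A/B/C example (function and variable names are illustrative, not kernel code):

```python
def eligible_children(parent_usage, parent_soft, children):
    """children: dict name -> (usage, soft_limit), in GB for this example.
    Reclaim touches the subtree only when the parent exceeds its own soft
    limit; reclaim is then steered toward children over their soft limits,
    but if every child is under its limit the child limits are over-committed
    and all children become eligible."""
    if parent_usage <= parent_soft:
        return []                                   # subtree left alone
    over = sorted(n for n, (u, s) in children.items() if u > s)
    return over if over else sorted(children)

# A (s:4G) with children B and C (s:5G each), as in the quoted hierarchy:
kids = {"B": (3, 5), "C": (2, 5)}
assert eligible_children(5, 4, kids) == ["B", "C"]  # all under: limits ignored
kids = {"B": (6, 5), "C": (2, 5)}
assert eligible_children(8, 4, kids) == ["B"]       # steer to the over-limit child
assert eligible_children(3, 4, kids) == []          # parent under its soft limit
```

The first assertion is exactly the case where Michel departs from Tejun: child guarantees that sum past the parent's limit cannot be honored, so they stop constraining reclaim.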
* Re: memcg: softlimit on internal nodes
  2013-04-21  8:55     ` Michel Lespinasse
@ 2013-04-22  4:24       ` Tejun Heo
  2013-04-22  7:14         ` Michel Lespinasse
  2013-04-22 15:37         ` Michal Hocko
  0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-22  4:24 UTC
To: Michel Lespinasse
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen

Hey, Michel.

> I don't remember exactly when you left - during the session I
> expressed to Michal that while I think his proposal is an improvement
> over the current situation, I think his handling of internal nodes is
> confus(ed/ing).

I think I stayed until near the end of the hierarchy discussion and yeap, I heard you saying that.

> I completely agree with you here. This is important to ensure
> composability - someone that was using cgroups within a 4GB system can
> be moved to use cgroups within a hierarchy with a 4GB soft limit on
> the root, and still have its performance isolated from tasks running
> in other cgroups in the system.

And for basic sanity. As you look down through the hierarchy of nested cgroups, the pressure exerted by a limit can only be increased (IOW, the specificity of the control increases) as the level deepens, regardless of the direction of such pressure, which is the only logical thing to do for nested limits.

>> > Now, let's consider the following hierarchy just to be sure. Let's
>> > assume that A itself doesn't have any tasks for simplicity.
>      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >
>> >             A (h:16G s:4G)
>> >            /              \
>> >   B (h:7G s:5G)    C (h:7G s:5G)
>> >
>> > For hardlimit, it is clear that A's limit won't do anything.
>
> Now the above is a very interesting case.

It shouldn't be interesting at all. It should be exactly the same.
If "softlimit" means an actual soft limit prioritizing reclaim down to that point under pressure, it works in the same direction as hardlimit and the limits should behave the same. If "softlimit" means an allocation guarantee where a cgroup is exempt from reclaim while under the limit, a knob defining an allowance rather than a limit, the direction of specificity is flipped. While the direction is flipped, how it nests should be the same. Otherwise, it ends up breaking the very basics of nesting. Not a particularly bright idea.

> One thing some people worry about is that B and C's configuration > might be under a different administrator's control than A's. That is, > we could have a situation where the machine's sysadmin set up A for > someone else to play with, and that other person set up B and C within > his cgroup. In this scenario, one of the issues has to be how do we > prevent B and C's configuration settings from reserving (or protecting > from reclaim) more memory than the machine's admin intended when he > configured A.

Cgroup doesn't and will not support delegation of subtrees to different security domains. Please refer to the following thread.

http://thread.gmane.org/gmane.linux.kernel.cgroups/6638

In fact, I'm planning to disallow changing ownership of cgroup files when "sane_behavior" is specified. We're having a difficult time identifying our own asses as it is and I have no intention of adding the huge extra burden of security policing on top. Delegation, if necessary, will happen from userland.

> Michal's proposal resolves this by saying that A, B and C all become > reclaimable as soon as A goes over its soft limit.

This makes me doubly upset and reminds me strongly of the .use_hierarchy mess. It's so myopic in coming up with a solution for the problem immediately at hand that it ends up ignoring basic rules and implementing something which is fundamentally broken and confused. Don't twist basic nesting rules to accommodate a half-assed delegation mechanism.
It's never gonna work properly and we'll need a "really_sane_behavior" flag eventually to clean up the mess again, and we'll probably have to clarify that for memcg the 'c' stands for "confused" instead of "control".

And I don't even get the delegation argument. Isn't that already covered by hardlimit? Sure, the reclaimer won't look at it, but if you don't trust a cgroup it of course will be put under a certain hardlimit from the parent and smacked when it misbehaves. Hardlimit of course should have priority over allocation guarantee and the system wouldn't be in jeopardy due to a delegated cgroup misbehaving. If each knob is given a clear meaning, these things should come naturally. You just need a sane pecking order among the controls. It almost feels surreal that this is suggested as a rationale for creating this chimera of a knob. What the hell is going on here?

> I have a third view, which I talked about during Michal's > presentation. I think that when A's usage goes over 4G, we should be > able to reclaim from A's subtree. If B or C's usage are above their > soft limits, then we should reclaim from these cgroups; however if > both B and C have usage below their soft limits, then we are in a > situation where the soft limits can't be obeyed so we should ignore > them and reclaim from both B and C instead.

No, the config is valid and *exactly* the same as the hardlimit case. It's just in the opposite direction. Don't twist it. It's exactly the same mechanics. Flipping the direction should not change what nesting means. That's what you get and should get when cgroup nesting is used for something which "guarantees" rather than "limits". Whatever twist you think is a good idea for "softlimit", try to flip the direction and apply it the same to "hardlimit" and see how messed up it gets.

> Regardless of these differences, I still want to stress that > Michal's proposal is a clear improvement over what we have, so I see > it as a large step in the right direction.
I'm afraid I don't agree with that. If the current situation is ambiguous, moving to a definitely wrong state makes the situation worse, so we need to figure out what this thing actually means first, and it's not like it is a difficult choice to make. It's either an actual soft limit or an allocation guarantee. It cannot be some random combination of the two. Just pick one and stick with it.

>> Now I'm confused. You're saying softlimit currently doesn't guarantee >> anything and what it means, even for flat hierarchy, isn't clearly >> defined? > > The largest problem with softlimit today is that global reclaim > doesn't take it into account at all... So yes, I would say that > softlimit is very badly defined today (which may be why people have > such trouble agreeing about what it should mean in the first place).

So, in that case, let's please make "softlimit" an actual soft limit working in the same direction as hardlimit but in terms of reclaim pressure rather than OOM killing, and please don't tell me how "softlimit" working in the opposite direction of "hardlimit" actually makes sense in the wonderland of memcg. Please have at least some common sense. :( If people need "don't reclaim under this limit", IOW an allocation guarantee, please introduce another knob with a proper name and properly flipped hierarchy behavior. Thanks. -- tejun
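Tejun's nesting argument can be made concrete with a toy model. The two interpretations he separates gate a subtree in opposite directions but with the same nesting rule: a limit lets any ancestor add pressure, a guarantee lets any ancestor revoke the exemption. The function names and numbers below are illustrative only, not memcg semantics.

```python
# Toy model of the nesting argument. A node is pressured by an actual
# soft limit if it or any ancestor is over its limit; it is protected by
# an allocation guarantee only if it and all ancestors are under theirs.
# Same nesting rule, opposite direction.

def breaches(path):
    """path: list of (usage, limit) pairs from root to the node."""
    return [usage > limit for usage, limit in path]

def soft_limit_pressure(path):   # limit semantics: ancestors add pressure
    return any(breaches(path))

def guarantee_protects(path):    # guarantee semantics: ancestors revoke it
    return not any(breaches(path))

# A (usage 5G, s:4G) is over its limit; child B (1G, s:2G) is under its own.
path_to_B = [(5, 4), (1, 2)]
assert soft_limit_pressure(path_to_B)      # A's breach pressures the subtree
assert not guarantee_protects(path_to_B)   # A's breach revokes B's exemption
```

Either way, the parent's setting gates the subtree; only the direction of the gating flips, which is the consistency Tejun is asking for.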
* Re: memcg: softlimit on internal nodes 2013-04-22 4:24 ` Tejun Heo @ 2013-04-22 7:14 ` Michel Lespinasse 2013-04-22 14:48 ` Tejun Heo 2013-04-22 15:37 ` Michal Hocko 1 sibling, 1 reply; 46+ messages in thread From: Michel Lespinasse @ 2013-04-22 7:14 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen On Sun, Apr 21, 2013 at 9:24 PM, Tejun Heo <tj@kernel.org> wrote: > Hey, Michel. > >> I don't remember exactly when you left - during the session I >> expressed to Michal that while I think his proposal is an improvement >> over the current situation, I think his handling of internal nodes is >> confus(ed/ing). > > I think I stayed until near the end of the hierarchy discussion and > yeap I heard you saying that. All right. Too bad you had to leave - I think this is a discussion we really need to have, so it would have been the perfect occasion. >>> > Now, let's consider the following hierarchy just to be sure. Let's >>> > assume that A itself doesn't have any tasks for simplicity. >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >>> > >>> > A (h:16G s:4G) >>> > / \ >>> > / \ >>> > B (h:7G s:5G) C (h:7G s:5G) >>> > >>> > For hardlimit, it is clear that A's limit won't do anything. >> >> One thing some people worry about is that B and C's configuration >> might be under a different administrator's control than A's. That is, >> we could have a situation where the machine's sysadmin set up A for >> someone else to play with, and that other person set up B and C within >> his cgroup. In this scenario, one of the issues has to be how do we >> prevent B and C's configuration settings from reserving (or protecting >> from reclaim) more memory than the machine's admin intended when he >> configured A. > > Cgroup doesn't and will not support delegation of subtrees to > different security domains. Please refer to the following thread. 
> > http://thread.gmane.org/gmane.linux.kernel.cgroups/6638 Ah, good. This is news to me. To be clear, I don't care much for the delegation scenario myself, but it's always been mentioned as the reason I couldn't get what I want when we've talked about hierarchical soft limit behavior in the past. If the decision not to have subtree delegation sticks, I am perfectly happy with your proposal. > And I don't even get the delegation argument. Isn't that already > covered by hardlimit? Sure, reclaimer won't look at it but if you > don't trust a cgroup it of course will be put under certain hardlimit > from parent and smacked when it misbehaves. Hardlimit of course > should have priority over allocation guarantee and the system wouldn't > be in jeopardy due to a delegated cgroup misbehaving. If each knob is > given a clear meaning, these things should come naturally. You just > need a sane pecking order among the controls. It almost feels surreal > that this is suggested as a rationale for creating this chimera of a > knob. What the hell is going on here? People often overcommit the cgroup hard limits so that one cgroup can make use of a larger share of the machine when the other cgroups are idle. This works well only if you can depend on soft limits to steer reclaim when the other cgroups get active again. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 46+ messages in thread
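The overcommit pattern Michel describes can be sketched with made-up numbers: hard limits deliberately sum to more than the machine, while soft limits carve up the machine for the contended case. The figures below are invented for illustration.

```python
# Illustrative arithmetic for the overcommit pattern: B and C may each
# use the machine's slack while the other is idle, and soft limits say
# where reclaim should push them back once both are active.

machine = 16  # GB
hard = {"B": 12, "C": 12}   # sum 24 > 16: either group may grow alone
soft = {"B": 8, "C": 8}     # sum 16 == machine: targets under contention

assert sum(hard.values()) > machine    # overcommitted on purpose
assert sum(soft.values()) <= machine   # soft limits stay feasible

# When only B was active it grew toward its hard limit; when C wakes up,
# reclaim should push B back toward its soft limit, not its hard one.
usage = {"B": 12, "C": 6}
pressure_on = [g for g in usage if usage[g] > soft[g]]
assert pressure_on == ["B"]
```

This is exactly why hard limits alone cannot cover the delegation/steering case: with overcommitted hard limits, nobody is over a hard limit even when the machine as a whole is.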
* Re: memcg: softlimit on internal nodes 2013-04-22 7:14 ` Michel Lespinasse @ 2013-04-22 14:48 ` Tejun Heo 0 siblings, 0 replies; 46+ messages in thread From: Tejun Heo @ 2013-04-22 14:48 UTC (permalink / raw) To: Michel Lespinasse Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen Hello, again. On Mon, Apr 22, 2013 at 12:14:53AM -0700, Michel Lespinasse wrote: > > I think I stayed until near the end of the hierarchy discussion and > > yeap I heard you saying that. > > All right. Too bad you had to leave - I think this is a discussion we > really need to have, so it would have been the perfect occasion. Eh well, it would have been better if I stayed but I think it served its purpose. Conferences are great for raising awareness. I usually find actual follow-up discussions done better in mailing lists. > > Cgroup doesn't and will not support delegation of subtrees to > > different security domains. Please refer to the following thread. > > > > http://thread.gmane.org/gmane.linux.kernel.cgroups/6638 > > Ah, good. This is news to me. To be clear, I don't care much for the > delegation scenario myself, but it's always been mentioned as the > reason I couldn't get what I want when we've talked about hierarchical > soft limit behavior in the past. If the decision not to have subtree > delegation sticks, I am perfectly happy with your proposal. Oh, it's sticking. :) > > And I don't even get the delegation argument. Isn't that already > > covered by hardlimit? Sure, reclaimer won't look at it but if you > > don't trust a cgroup it of course will be put under certain hardlimit > > from parent and smacked when it misbehaves. Hardlimit of course > > should have priority over allocation guarantee and the system wouldn't > > be in jeopardy due to a delegated cgroup misbehaving. If each knob is > > given a clear meaning, these things should come naturally. 
You just > > need a sane pecking order among the controls. It almost feels surreal > > that this is suggested as a rationale for creating this chimera of a > > knob. What the hell is going on here? > > People often overcommit the cgroup hard limits so that one cgroup can > make use of a larger share of the machine when the other cgroups are > idle. > This works well only if you can depend on soft limits to steer reclaim > when the other cgroups get active again. And that's fine too. If you take a step back, it shouldn't be difficult to recognize that what you want is an actual soft limit at the parent level overriding the allocation guarantee (for the lack of a better name). Don't overload "alloc guarantee" with that extra meaning messing up its fundamental properties. Create a separate plane of control which is consistent within itself and give it priority over "alloc guarantee". You sure can discuss the details of the override - should it be round-robin or proportional to whatever or what, but that's a separate discussion and can be firmly labeled as implementation details rather than this twisting of the fundamental semantics of "softlimit". I really am not saying any of the use cases that have been described are invalid. They all sound pretty useful, but, to me, what seems to be recurring is that people want two separate features - actual soft limit and allocation guarantee, and for some reason that I can't understand, fail to recognize they're two very different controls and try to put both into this one poor knob. It's like trying to combine accelerator and (flipped) clutch on a manual car. Sure, it'll work fine while you're accelerating. Good luck while cruising or on a long downhill. You can try to tweak it all you want but things of course will get "interesting" and "questionable" as soon as the conditions change from the specific use cases which the specific tuning is made for. 
While car analogies can often be misleading, really, please stop trying to combine two completely separate controls into one knob. It won't and can't work and is totally stupid. Thanks. -- tejun
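The "sane pecking order" Tejun argues for, two separate knobs with the actual soft limit overriding the allocation guarantee, can be sketched as follows. This is a hypothetical pair of knobs that does not exist in memcg; the function and thresholds are invented for illustration.

```python
# Sketch of a pecking order among three hypothetical controls: hard
# limit (always enforced), actual soft limit (creates reclaim pressure),
# and allocation guarantee (exempts from reclaim), in that priority.

def reclaimable(usage, hard, soft, guarantee):
    if usage > hard:           # hard limit always wins: reclaim/OOM territory
        return True
    if usage > soft:           # actual soft limit: reclaim pressure applies
        return True
    return usage > guarantee   # otherwise the guarantee exempts the group

assert reclaimable(usage=9, hard=8, soft=6, guarantee=2)       # over hard
assert reclaimable(usage=7, hard=8, soft=6, guarantee=2)       # over soft
assert reclaimable(usage=7, hard=8, soft=6, guarantee=10)      # soft beats guarantee
assert not reclaimable(usage=1, hard=8, soft=6, guarantee=2)   # guaranteed
```

The third assertion is the key design point: even a generous guarantee cannot shield a group from an actual soft limit set above it in priority, so the two planes stay consistent instead of fighting inside one knob.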
* Re: memcg: softlimit on internal nodes 2013-04-22 4:24 ` Tejun Heo 2013-04-22 7:14 ` Michel Lespinasse @ 2013-04-22 15:37 ` Michal Hocko 2013-04-22 15:46 ` Tejun Heo 1 sibling, 1 reply; 46+ messages in thread From: Michal Hocko @ 2013-04-22 15:37 UTC (permalink / raw) To: Tejun Heo Cc: Michel Lespinasse, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen On Sun 21-04-13 21:24:45, Tejun Heo wrote: [...]

> Cgroup doesn't and will not support delegation of subtrees to > different security domains. Please refer to the following thread. > > http://thread.gmane.org/gmane.linux.kernel.cgroups/6638 > > In fact, I'm planning to disallow changing ownership of cgroup files > when "sane_behavior" is specified.

I would be wildly opposing this. Enabling a user to play on their own ground while the levels above enforce reasonable behavior is a very important use case.

> We're having difficult time identifying our own asses as it is and I > have no intention of adding the huge extra burden of security policing > on top. Delegation, if necessary, will happen from userland. > > Michal's proposal resolves this by saying that A, B and C all become > > reclaimable as soon as A goes over its soft limit. > > This makes me doubly upset and reminds me strongly of the > .use_hierarchy mess. It's so myopic in coming up with a solution for > the problem immediately at hand, it ends up ignoring basic rules and > implementing something which is fundamentally broken and confused.

Tejun, stop this, finally! The current soft limit, same as the reworked version, follows the basic nesting rule we use for the hard limit, which says that a parent's setting is always stricter than its children's. So if your parent says you are hitting the hard limit (resp. are over the soft limit) then children are reclaimed regardless of their hard/soft limit setting.

> Don't twist basic nesting rules to accommodate a half-assed delegation > mechanism.
> It's never gonna work properly and we'll need a > "really_sane_behavior" flag eventually to clean up the mess again, and > we'll probably have to clarify that for memcg the 'c' stands for > "confused" instead of "control". > > And I don't even get the delegation argument. Isn't that already > covered by hardlimit?

No it's not, because you want to overcommit the memory between different groups. And the soft limit is a way to handle memory pressure gracefully in contended situations.

> Sure, reclaimer won't look at it but if you don't trust a cgroup > it of course will be put under certain hardlimit from parent and > smacked when it misbehaves. Hardlimit of course should have priority > over allocation guarantee and the system wouldn't be in jeopardy due > to a delegated cgroup misbehaving. If each knob is given a clear > meaning, these things should come naturally. You just need a sane > pecking order among the controls. It almost feels surreal that this > is suggested as a rationale for creating this chimera of a knob. What > the hell is going on here?

It is you who is confused and refuses to open the damn documentation and read what the heck the soft limit is and what it is used for. Read the patch series I was talking about and you will hardly find anything regarding _guarantee_. [...] -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-22 15:37 ` Michal Hocko @ 2013-04-22 15:46 ` Tejun Heo 2013-04-22 15:54 ` Michal Hocko 0 siblings, 1 reply; 46+ messages in thread From: Tejun Heo @ 2013-04-22 15:46 UTC (permalink / raw) To: Michal Hocko Cc: Michel Lespinasse, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen Hey, Michal. On Mon, Apr 22, 2013 at 05:37:30PM +0200, Michal Hocko wrote:

> > In fact, I'm planning to disallow changing ownership of cgroup files > > when "sane_behavior" is specified. > > I would be wildly opposing this. Enabling a user to play on their own ground > while the levels above enforce reasonable behavior is a very > important use case.

We can continue this discussion on the original thread and I'm not too firm on this, not because it's a sane use case but because it is an extra measure preventing root from shooting its feet, which we traditionally allow. That said, really, no good can come from delegating a hierarchy to different security domains. It's already discouraged by the userland best practices doc. Just don't do it.

> Tejun, stop this, finally! The current soft limit, same as the reworked > version, follows the basic nesting rule we use for the hard limit, which > says that a parent's setting is always stricter than its children's. > So if your parent says you are hitting the hardlimit (resp. are over the soft > limit) then children are reclaimed regardless of their hard/soft limit > setting.

Okay, thanks for making it clear. Then, apparently, the fine folks at Google are hopelessly confused, because at least Greg and Ying told me something which is the complete opposite of what you're saying. You guys need to sort it out.

> It is you who is confused and refuses to open the damn documentation and > read what the heck the soft limit is and what it is used for. Read the patch > series I was talking about and you will hardly find anything regarding > _guarantee_.
Oh, if so, I'm happy. Sorry about being brash on the thread; however, please talk with the Google memcg people. They have a very different interpretation of what "softlimit" is and are using it according to that interpretation. If it *is* an actual soft limit, there is no inherent isolation coming from it and that should be clear to everyone. Thanks. -- tejun
* Re: memcg: softlimit on internal nodes 2013-04-22 15:46 ` Tejun Heo @ 2013-04-22 15:54 ` Michal Hocko 2013-04-22 16:01 ` Tejun Heo 2013-04-23 9:58 ` Michel Lespinasse 0 siblings, 2 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-22 15:54 UTC (permalink / raw) To: Tejun Heo Cc: Michel Lespinasse, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen On Mon 22-04-13 08:46:20, Tejun Heo wrote: > Hey, Michal. > > On Mon, Apr 22, 2013 at 05:37:30PM +0200, Michal Hocko wrote: > > > In fact, I'm planning to disallow changing ownership of cgroup files > > > when "sane_behavior" is specified. > > > > I would be wildly oposing this. Enabling user to play on its own ground > > while above levels of the groups enforce the reasonable behavior is very > > important use case. > > We can continue this discussion on the original thread and I'm not too > firm on this not because it's a sane use case but because it is an > extra measure preventing root from shooting its feet which we > traditionally allow. That said, really, no good can come from > delegating hierarchy to different security domains. It's already > discouraged by the userland best practices doc. Just don't do it. OK, I will go to the original mail thread and discuss my concerns there. > > Tejun, stop this, finally! Current soft limit same as the reworked > > version follow the basic nesting rule we use for the hard limit which > > says that parent setting is always more strict than its children. > > So if you parent says you are hitting the hardlimit (resp. over soft > > limit) then children are reclaimed regardless their hard/soft limit > > setting. > > Okay, thanks for making it clear. Then, apparently, the fine folks at > google are hopelessly confused because at least Greg and Ying told me > something which is the completely opposite of what you're saying. You > guys need to sort it out. 
> > > It is you who is confused and refuses to open the damn documentation and > > read what the heck the soft limit is and what it is used for. Read the patch > > series I was talking about and you will hardly find anything regarding > > _guarantee_. > > Oh, if so, I'm happy. Sorry about being brash on the thread; however, > please talk with the Google memcg people. They have a very different > interpretation of what "softlimit" is and are using it according to > that interpretation. If it *is* an actual soft limit, there is no > inherent isolation coming from it and that should be clear to > everyone.

We have discussed that for a long time. I will not speak for Greg & Ying, but from my POV we have agreed that the current implementation will work for them with some (minor) changes in their layout. As I have said already, with a careful configuration (i.e. setting the soft limit only where it matters - where it protects important memory, which is usually in the leaf nodes) you can actually achieve a _high_ probability of not being reclaimed after the rework, which was not possible before because of the implementation, which was ugly and smelled.

> > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-22 15:54 ` Michal Hocko @ 2013-04-22 16:01 ` Tejun Heo 2013-04-23 9:58 ` Michel Lespinasse 1 sibling, 0 replies; 46+ messages in thread From: Tejun Heo @ 2013-04-22 16:01 UTC (permalink / raw) To: Michal Hocko Cc: Michel Lespinasse, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen Hey, On Mon, Apr 22, 2013 at 05:54:54PM +0200, Michal Hocko wrote: > > Oh, if so, I'm happy. Sorry about being brash on the thread; however, > > please talk with google memcg people. They have very different > > interpretation of what "softlimit" is and are using it according to > > that interpretation. If it *is* an actual soft limit, there is no > > inherent isolation coming from it and that should be clear to > > everyone. > > We have discussed that for a long time. I will not speak for Greg & Ying > but from my POV we have agreed that the current implementation will work > for them with some (minor) changes in their layout. > As I have said already with a careful configuration (e.i. setting the > soft limit only where it matters - where it protects an important > memory which is usually in the leaf nodes) you can actually achieve > _high_ probability for not being reclaimed after the rework which was not > possible before because of the implementation which was ugly and > smelled. I don't know. I'm not sure this is a good idea. It's still encouraging abuse of the knob even if that's not the intention and once the usage sticks you end up with something you can't revert afterwards. I think it'd be better to make it *very* clear that "softlimit" can't be used for isolation in any reliable way. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: memcg: softlimit on internal nodes 2013-04-22 15:54 ` Michal Hocko 2013-04-22 16:01 ` Tejun Heo @ 2013-04-23 9:58 ` Michel Lespinasse 2013-04-23 10:17 ` Glauber Costa ` (2 more replies) 1 sibling, 3 replies; 46+ messages in thread From: Michel Lespinasse @ 2013-04-23 9:58 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote: > On Mon 22-04-13 08:46:20, Tejun Heo wrote: >> Oh, if so, I'm happy. Sorry about being brash on the thread; however, >> please talk with google memcg people. They have very different >> interpretation of what "softlimit" is and are using it according to >> that interpretation. If it *is* an actual soft limit, there is no >> inherent isolation coming from it and that should be clear to >> everyone. > > We have discussed that for a long time. I will not speak for Greg & Ying > but from my POV we have agreed that the current implementation will work > for them with some (minor) changes in their layout. > As I have said already with a careful configuration (e.i. setting the > soft limit only where it matters - where it protects an important > memory which is usually in the leaf nodes) I don't like your argument that soft limits work if you only set them on leaves. To me this is just a fancy way of saying that hierarchical soft limits don't work. Also it is somewhat problematic to assume that important memory can easily be placed in leaves. This is difficult to ensure when subcontainer destruction, for example, moves the memory back into the parent. > you can actually achieve > _high_ probability for not being reclaimed after the rework which was not > possible before because of the implementation which was ugly and > smelled. 
So, to be clear, what we (google MM people) want from soft limits is some form of protection against being reclaimed from when your cgroup (or its parent) is below the soft limit. I don't like to call it a guarantee either, because we understand that it comes with some limitations - for example, if all user pages on a given node are yours then allocations from that node might cause some of your pages to be reclaimed, even when you're under your soft limit. But we want some form of (weak) guarantee that can be made to work well enough in practice.

Before your change, soft limits didn't actually provide any such form of guarantee, weak or not, since global reclaim would ignore soft limits. With your proposal, soft limits at least do provide the weak guarantee that we want, when not using hierarchies. We see this as a very clear improvement over the previous situation, so we're very happy about your patchset!

However, your proposal takes that weak guarantee away as soon as one tries to use cgroup hierarchies with it, because it reclaims from every child cgroup as soon as the parent hits its soft limit. This is disappointing, and I have not heard why you want things to work that way. Is this an ease-of-implementation issue, or do you consider that requirement a bad idea? And if it's the latter, what's your counterpoint - is it related to delegation, or is it something else that I haven't heard of?

I don't think referring to the existing memcg documentation makes a strong point - the documentation never said that soft limits were not obeyed by global reclaim, and yet we both agree that it'd be preferable if they were. So I would like to hear your reasons (apart from referring to the existing documentation) for not allowing a parent cgroup to protect its children from reclaim when the total charge from that parent is under the parent's soft limit. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies.
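The "weak guarantee" Michel describes, protection while a cgroup and its ancestors are all under their soft limits, can be written down as a toy predicate. This is a sketch of the semantics being requested, not of any existing memcg behavior; `shielded` and the numbers are illustrative.

```python
# Toy version of the weak guarantee: a cgroup is shielded from reclaim
# while it and every ancestor are under their soft limits; once the
# parent goes over, the children become eligible again.

def shielded(path):
    """path: (usage, soft_limit) pairs from root to the cgroup."""
    return all(usage <= soft for usage, soft in path)

# A (s:4G) with child B (s:5G), per the thread's example hierarchy.
assert shielded([(3, 4), (2, 5)])      # A and B both under: B shielded
assert not shielded([(6, 4), (2, 5)])  # A over its soft limit: B exposed
```

The open dispute in the thread is only about the second case: whether an over-limit parent exposes *all* children at once (Michal's rework) or whether reclaim should first be steered toward children that are themselves over their soft limits (Michel's and Glauber's position).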
* Re: memcg: softlimit on internal nodes 2013-04-23 9:58 ` Michel Lespinasse @ 2013-04-23 10:17 ` Glauber Costa 2013-04-23 11:40 ` Michal Hocko 2013-04-23 11:32 ` Michal Hocko 2013-04-23 12:51 ` Michal Hocko 2 siblings, 1 reply; 46+ messages in thread From: Glauber Costa @ 2013-04-23 10:17 UTC (permalink / raw) To: Michel Lespinasse Cc: Michal Hocko, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Greg Thelen On 04/23/2013 01:58 PM, Michel Lespinasse wrote: > On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote: >> On Mon 22-04-13 08:46:20, Tejun Heo wrote: >>> Oh, if so, I'm happy. Sorry about being brash on the thread; however, >>> please talk with google memcg people. They have very different >>> interpretation of what "softlimit" is and are using it according to >>> that interpretation. If it *is* an actual soft limit, there is no >>> inherent isolation coming from it and that should be clear to >>> everyone. >> >> We have discussed that for a long time. I will not speak for Greg & Ying >> but from my POV we have agreed that the current implementation will work >> for them with some (minor) changes in their layout. >> As I have said already with a careful configuration (e.i. setting the >> soft limit only where it matters - where it protects an important >> memory which is usually in the leaf nodes) > > I don't like your argument that soft limits work if you only set them > on leaves. To me this is just a fancy way of saying that hierarchical > soft limits don't work. > > Also it is somewhat problematic to assume that important memory can > easily be placed in leaves. This is difficult to ensure when > subcontainer destruction, for example, moves the memory back into the > parent. > Michal, For the most part, I am siding with you in this discussion. But with this only-in-leaves thing, I am forced to flip (at least for this). 
You are right when you say that in a configuration with A being the parent of B and C, A being over its hard limit will affect reclaim in B and C, and soft limits should work the same. However, "will affect reclaim" is a bit vague. More specifically, if the sum of B and C's hard limits is smaller than or equal to A's hard limit, the only way for either B or C to trigger A's hard limit is for them, themselves, to go over their own hard limit. *This* is the case you are breaking when you try to establish a comparison between soft and hard limits - which is, per se, sane.

Translating this to soft limit terms: if the sum of B and C's soft limits is smaller than or equal to A's soft limit, and one of them is over its soft limit, that one should be reclaimed. The other should be left alone. I understand perfectly fine that the soft limit is a best effort, not a guarantee. But if we don't do that, I understand that we are doing effort, not best effort. This would only be attempted in our first pass. In the second pass, we reclaim from whoever.

It is also not that hard to do it: Flatten the tree into a list, with the leaves always being placed before the inner nodes. Start reclaiming from nodes over the soft limit, hierarchically. This means that whenever we reach an inner node and it is *still* over the soft limit, we are guaranteed to have scanned its children already. In the case I described, the child over its soft limit would have been reclaimed, without the well-behaved children being touched. Now all three are okay. If we reach an inner node and we still have a soft limit problem, then we are effectively talking about the case you have been describing. Reclaim from whoever you want.
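Glauber's ordering can be sketched in a few lines. This is a simplified static model (it orders candidates but does not actually reclaim pages, so an inner node's usage is not re-checked after its children shrink, as his full proposal would); the class and function names are invented for illustration.

```python
# Sketch of the two-pass, leaves-first ordering: flatten the tree in
# post-order so every child precedes its parent, reclaim first from
# nodes over their soft limit, and fall back to everyone else second.

class Node:
    def __init__(self, name, usage, soft, children=()):
        self.name, self.usage, self.soft = name, usage, soft
        self.children = list(children)

def flatten_leaves_first(node):
    out = []
    for child in node.children:
        out.extend(flatten_leaves_first(child))
    out.append(node)  # post-order: every child precedes its parent
    return out

def reclaim_order(root):
    order = flatten_leaves_first(root)
    first_pass = [n for n in order if n.usage > n.soft]
    second_pass = [n for n in order if n not in first_pass]
    return [n.name for n in first_pass + second_pass]

# A (usage 6G, s:4G) with B over its soft limit and C under it: B is
# reclaimed first, then A itself, and well-behaved C only as a fallback.
root = Node("A", 6, 4, [Node("B", 3, 1), Node("C", 1, 2)])
assert reclaim_order(root) == ["B", "A", "C"]
```

The post-order traversal is what provides the guarantee Glauber relies on: by the time the walk reaches an inner node, all of its children have already been considered.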
* Re: memcg: softlimit on internal nodes 2013-04-23 10:17 ` Glauber Costa @ 2013-04-23 11:40 ` Michal Hocko 2013-04-23 11:54 ` Glauber Costa 2013-04-23 12:51 ` Michel Lespinasse 0 siblings, 2 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 11:40 UTC (permalink / raw) To: Glauber Costa Cc: Michel Lespinasse, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Greg Thelen On Tue 23-04-13 14:17:22, Glauber Costa wrote: > On 04/23/2013 01:58 PM, Michel Lespinasse wrote: > > On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote: > >> On Mon 22-04-13 08:46:20, Tejun Heo wrote: > >>> Oh, if so, I'm happy. Sorry about being brash on the thread; however, > >>> please talk with google memcg people. They have very different > >>> interpretation of what "softlimit" is and are using it according to > >>> that interpretation. If it *is* an actual soft limit, there is no > >>> inherent isolation coming from it and that should be clear to > >>> everyone. > >> > >> We have discussed that for a long time. I will not speak for Greg & Ying > >> but from my POV we have agreed that the current implementation will work > >> for them with some (minor) changes in their layout. > >> As I have said already with a careful configuration (e.i. setting the > >> soft limit only where it matters - where it protects an important > >> memory which is usually in the leaf nodes) > > > > I don't like your argument that soft limits work if you only set them > > on leaves. To me this is just a fancy way of saying that hierarchical > > soft limits don't work. > > > > Also it is somewhat problematic to assume that important memory can > > easily be placed in leaves. This is difficult to ensure when > > subcontainer destruction, for example, moves the memory back into the > > parent. > > > > Michal, > > For the most part, I am siding with you in this discussion. 
> But with this only-in-leaves thing, I am forced to flip (at least for this). > > You are right when you say that in a configuration with A being parent > of B and C, A being over its hard limit will affect reclaim in B and C, > and soft limits should work the same. > > However, "will affect reclaim" is a big vague. More specifically, if the > sum of B and C's hard limit is smaller or equal A's hard limit, the only > way of either B or C to trigger A's hard limit is for them, themselves, > to go over their hard limit. Which is an expectation that you cannot guarantee. You can have B+C>A. > *This* is the case you you are breaking when you try to establish a > comparison between soft and hard limits - which is, per se, sane. > > Translating this to the soft limit speech, if the sum of B and C's soft > limit is smaller or equal A's soft limit, and one of them is over the > soft limit, that one should be reclaimed. The other should be left alone. And yet again. Nothing will prevent you from setting B+C>A. Sure, if you configure your hierarchy sanely then everything will just work. > I understand perfectly fine that soft limit is a best effort, not a > guarantee. But if we don't do that, I understand that we are doing > effort, not best effort. > > This would only be attempted in our first pass. In the second pass, we > reclaim from whoever. > > It is also not that hard to do it: Flatten the tree in a list, with the > leaves always being placed before the inner nodes. Glauber, I have already pointed out that bottom-up reclaim doesn't make much sense because there is a bigger chance that useful data is stored in the leaf nodes rather than in the inner nodes, which usually contain mostly reparented pages. > Start reclaiming from nodes over the soft limit, hierarchically. This > means that whenever we reach an inner node and it is *still* over > the soft limit, we are guaranteed to have scanned their children > already. 
In the case I described, the children over its soft limit > would have been reclaimed, without the well behaving children being > touched. Now all three are okay. > > If we reached an inner node and we still have a soft limit problem, then > we are effectively talking about the case you have been describing. > Reclaim from whoever you want. -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-23 11:40 ` Michal Hocko @ 2013-04-23 11:54 ` Glauber Costa 2013-04-23 12:51 ` Michel Lespinasse 1 sibling, 0 replies; 46+ messages in thread From: Glauber Costa @ 2013-04-23 11:54 UTC (permalink / raw) To: Michal Hocko Cc: Michel Lespinasse, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Greg Thelen On 04/23/2013 03:40 PM, Michal Hocko wrote: > On Tue 23-04-13 14:17:22, Glauber Costa wrote: >> On 04/23/2013 01:58 PM, Michel Lespinasse wrote: >>> On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote: >>>> On Mon 22-04-13 08:46:20, Tejun Heo wrote: >>>>> Oh, if so, I'm happy. Sorry about being brash on the thread; however, >>>>> please talk with google memcg people. They have very different >>>>> interpretation of what "softlimit" is and are using it according to >>>>> that interpretation. If it *is* an actual soft limit, there is no >>>>> inherent isolation coming from it and that should be clear to >>>>> everyone. >>>> >>>> We have discussed that for a long time. I will not speak for Greg & Ying >>>> but from my POV we have agreed that the current implementation will work >>>> for them with some (minor) changes in their layout. >>>> As I have said already with a careful configuration (e.i. setting the >>>> soft limit only where it matters - where it protects an important >>>> memory which is usually in the leaf nodes) >>> >>> I don't like your argument that soft limits work if you only set them >>> on leaves. To me this is just a fancy way of saying that hierarchical >>> soft limits don't work. >>> >>> Also it is somewhat problematic to assume that important memory can >>> easily be placed in leaves. This is difficult to ensure when >>> subcontainer destruction, for example, moves the memory back into the >>> parent. >>> >> >> Michal, >> >> For the most part, I am siding with you in this discussion. 
>> But with this only-in-leaves thing, I am forced to flip (at least for this). >> >> You are right when you say that in a configuration with A being parent >> of B and C, A being over its hard limit will affect reclaim in B and C, >> and soft limits should work the same. >> >> However, "will affect reclaim" is a big vague. More specifically, if the >> sum of B and C's hard limit is smaller or equal A's hard limit, the only >> way of either B or C to trigger A's hard limit is for them, themselves, >> to go over their hard limit. > > Which is an expectation that you cannot guarantee. You can have B+C>A. > You can, but you might not. While you are focusing on one set of setups, you are as a result ending up with a behavior that is not ideal for the other set of setups. I believe what I am proposing here will cover both of them. >> *This* is the case you you are breaking when you try to establish a >> comparison between soft and hard limits - which is, per se, sane. >> >> Translating this to the soft limit speech, if the sum of B and C's soft >> limit is smaller or equal A's soft limit, and one of them is over the >> soft limit, that one should be reclaimed. The other should be left alone. > > And yet again. Nothing will prevent you from setting B+C>A. Sure if you > configure your hierarchy sanely then everything will just work. > Same as above. >> I understand perfectly fine that soft limit is a best effort, not a >> guarantee. But if we don't do that, I understand that we are doing >> effort, not best effort. >> >> This would only be attempted in our first pass. In the second pass, we >> reclaim from whoever. >> >> It is also not that hard to do it: Flatten the tree in a list, with the >> leaves always being placed before the inner nodes. 
> > Glauber, I have already pointed out that bottom-up reclaim doesn't make > much sense because it is a bigger chance that useful data is stored in > the leaf nodes rather than inner nodes which usually contain mostly > reparented pages. > Read my proposal algorithm again. I will provide two examples below, one for each kind of setup. Tell me if and why you believe it won't work.

The tree is always B and C, with A as parent.

Algorithm: Flatten the tree as B, C, A. The order between B and C doesn't matter, but B and C always come before A. Walk the list as B, C, A. Reclaim hierarchically from all of them.

Setup 1: A.soft = 2G, B.soft = C.soft = 1G. B uses 1G, C uses 2G, and A uses 3G.
Scan B: not over its soft limit, skip.
Scan C: over its soft limit, reclaim. C now goes back to 1G. All is fine.
Scan A: A is now within its limit, skip.
If A had reparented charges, the whole subtree would still suffer reclaim.

Setup 2: A.soft = 2G, B.soft = C.soft = 4G. B uses 2G, C uses 2G, and A uses 4G.
Scan B: not over its soft limit, skip.
Scan C: not over its soft limit, skip.
Scan A: over its soft limit, reclaim. Since A has no charges of its own, reclaim B and C in whichever order, regardless of their soft limit setup. If A had charges, we would proceed the same way.

Setup 1 doesn't work with your proposal; Setup 2 does. I am offering something here that I believe works with both. BTW, this is what I described in the paragraph below: >> Start reclaiming from nodes over the soft limit, hierarchically. This >> means that whenever we reach an inner node and it is *still* over >> the soft limit, we are guaranteed to have scanned their children >> already. In the case I described, the children over its soft limit >> would have been reclaimed, without the well behaving children being >> touched. Now all three are okay. >> >> If we reached an inner node and we still have a soft limit problem, then >> we are effectively talking about the case you have been describing. >> Reclaim from whoever you want. 
For the record: I am totally fine if you say: "I don't want to pay the complexity now, what I am sending is already better than what we have". I stuck to this during the summit, and will say it again here. But what you are saying is that it wouldn't work, that soft limits should never attempt to reach that state, and you are pretty much building a wall around that case.
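Glauber's two setups can be checked with a small simulation. This is a hedged Python sketch of the proposal as described, not an implementation: reclaim is modelled as trimming a group's pages back toward its soft limit, leaves first, and the Group type and all names are made up for illustration:

```python
# Illustrative model only; nothing here mirrors kernel structures.
class Group:
    def __init__(self, name, soft, own=0, children=()):
        self.name, self.soft, self.own = name, soft, own
        self.children = list(children)

    def usage(self):
        # Hierarchical usage: a group's own pages plus all descendants'.
        return self.own + sum(c.usage() for c in self.children)

def leaves_first(g):
    # Post-order: leaves always precede inner nodes.
    order = []
    for c in g.children:
        order.extend(leaves_first(c))
    order.append(g)
    return order

def softlimit_pass(root):
    """First pass of the proposed scheme: skip groups under their soft
    limit; for any group over it, reclaim the excess from its subtree,
    leaves first."""
    reclaimed = []
    for g in leaves_first(root):
        excess = g.usage() - g.soft
        if excess <= 0:
            continue                    # under the soft limit: skip
        reclaimed.append(g.name)
        for victim in leaves_first(g):  # trim the subtree, leaves first
            take = min(victim.own, excess)
            victim.own -= take
            excess -= take
    return reclaimed

# Setup 1: A.soft=2G, B.soft=C.soft=1G; B uses 1G, C uses 2G, A uses 3G.
s1 = Group("A", 2, children=[Group("B", 1, own=1), Group("C", 1, own=2)])
print(softlimit_pass(s1))   # ['C'] -- only the offender is reclaimed

# Setup 2: A.soft=2G, B.soft=C.soft=4G; B and C use 2G each, A uses 4G.
s2 = Group("A", 2, children=[Group("B", 4, own=2), Group("C", 4, own=2)])
print(softlimit_pass(s2))   # ['A'] -- inner node over limit, reclaim subtree
```

Under these assumptions the pass reproduces both outcomes claimed in the message: in Setup 1 only C is touched, and in Setup 2 the subtree under A is reclaimed even though B and C are individually under their limits.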
* Re: memcg: softlimit on internal nodes 2013-04-23 11:40 ` Michal Hocko 2013-04-23 11:54 ` Glauber Costa @ 2013-04-23 12:51 ` Michel Lespinasse 2013-04-23 13:06 ` Michal Hocko 1 sibling, 1 reply; 46+ messages in thread From: Michel Lespinasse @ 2013-04-23 12:51 UTC (permalink / raw) To: Michal Hocko Cc: Glauber Costa, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Greg Thelen On Tue, Apr 23, 2013 at 4:40 AM, Michal Hocko <mhocko@suse.cz> wrote: > On Tue 23-04-13 14:17:22, Glauber Costa wrote: >> On 04/23/2013 01:58 PM, Michel Lespinasse wrote: >> > On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote: >> >> On Mon 22-04-13 08:46:20, Tejun Heo wrote: >> >>> Oh, if so, I'm happy. Sorry about being brash on the thread; however, >> >>> please talk with google memcg people. They have very different >> >>> interpretation of what "softlimit" is and are using it according to >> >>> that interpretation. If it *is* an actual soft limit, there is no >> >>> inherent isolation coming from it and that should be clear to >> >>> everyone. >> >> >> >> We have discussed that for a long time. I will not speak for Greg & Ying >> >> but from my POV we have agreed that the current implementation will work >> >> for them with some (minor) changes in their layout. >> >> As I have said already with a careful configuration (e.i. setting the >> >> soft limit only where it matters - where it protects an important >> >> memory which is usually in the leaf nodes) >> > >> > I don't like your argument that soft limits work if you only set them >> > on leaves. To me this is just a fancy way of saying that hierarchical >> > soft limits don't work. >> > >> > Also it is somewhat problematic to assume that important memory can >> > easily be placed in leaves. This is difficult to ensure when >> > subcontainer destruction, for example, moves the memory back into the >> > parent. 
>> > >> Michal, >> For the most part, I am siding with you in this discussion. >> But with this only-in-leaves thing, I am forced to flip (at least for this). >> >> You are right when you say that in a configuration with A being parent >> of B and C, A being over its hard limit will affect reclaim in B and C, >> and soft limits should work the same. >> >> However, "will affect reclaim" is a big vague. More specifically, if the >> sum of B and C's hard limit is smaller or equal A's hard limit, the only >> way of either B or C to trigger A's hard limit is for them, themselves, >> to go over their hard limit. > > Which is an expectation that you cannot guarantee. You can have B+C>A. > >> *This* is the case you you are breaking when you try to establish a >> comparison between soft and hard limits - which is, per se, sane. >> >> Translating this to the soft limit speech, if the sum of B and C's soft >> limit is smaller or equal A's soft limit, and one of them is over the >> soft limit, that one should be reclaimed. The other should be left alone. > > And yet again. Nothing will prevent you from setting B+C>A. Sure if you > configure your hierarchy sanely then everything will just work. Let's all stop using words such as "sanely" and "work" since we don't seem to agree on how they apply here :) The issue I see is that even when people configure soft limits B+C < A, your current proposal still doesn't "leave the other alone" as Glauber and I think we should. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies.
* Re: memcg: softlimit on internal nodes 2013-04-23 12:51 ` Michel Lespinasse @ 2013-04-23 13:06 ` Michal Hocko 2013-04-23 13:13 ` Glauber Costa 0 siblings, 1 reply; 46+ messages in thread From: Michal Hocko @ 2013-04-23 13:06 UTC (permalink / raw) To: Michel Lespinasse Cc: Glauber Costa, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Greg Thelen On Tue 23-04-13 05:51:36, Michel Lespinasse wrote: [...] > The issue I see is that even when people configure soft limits B+C < > A, your current proposal still doesn't "leave the other alone" as > Glauber and I think we should. If B+C < A then B resp. C get reclaimed only if A is over its limit, which means that it couldn't reclaim enough to get below the limit when we banged on it before B and C. We can update the implementation later to be more clever in situations like this, but this is not that easy because once we get away from the round robin over the tree we might end up having other issues - like unfairness, etc. That's why I wanted to have this as simple as possible. -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-23 13:06 ` Michal Hocko @ 2013-04-23 13:13 ` Glauber Costa 2013-04-23 13:28 ` Michal Hocko 0 siblings, 1 reply; 46+ messages in thread From: Glauber Costa @ 2013-04-23 13:13 UTC (permalink / raw) To: Michal Hocko Cc: Michel Lespinasse, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Greg Thelen On 04/23/2013 05:06 PM, Michal Hocko wrote: > On Tue 23-04-13 05:51:36, Michel Lespinasse wrote: > [...] >> The issue I see is that even when people configure soft limits B+C < >> A, your current proposal still doesn't "leave the other alone" as >> Glauber and I think we should. > > If B+C < A then B resp. C get reclaimed only if A is over the limit > which means that it couldn't reclaimed enough to get bellow the limit > when we bang on it before B and C. We can update the implementation > later to be more clever in situations like this but this is not that > easy because once we get away from the round robin over the tree then we > might end up having other issues - like unfairness etc... That's why I > wanted to have this as simple as possible. > Nobody is opposing this, Michal. What people are opposing is you saying that the children should be reclaimed *regardless* of their soft limit when the parent is over its soft limit. Someone, especially you, saying this highly threatens further development in this direction. It doesn't really matter whether your current set does this; everybody has already agreed that you are moving in a good direction. If you believe that it is desirable to protect the children from reclaim in situations in which the offender is only one of the children and can be easily identified, please state that clearly. Since nobody is really opposing your patchset, that is enough for the discussion to settle. 
(Can't say how others feel, but can say about myself, and guess about others)
* Re: memcg: softlimit on internal nodes 2013-04-23 13:13 ` Glauber Costa @ 2013-04-23 13:28 ` Michal Hocko 0 siblings, 0 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 13:28 UTC (permalink / raw) To: Glauber Costa Cc: Michel Lespinasse, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Greg Thelen On Tue 23-04-13 17:13:20, Glauber Costa wrote: > On 04/23/2013 05:06 PM, Michal Hocko wrote: > > On Tue 23-04-13 05:51:36, Michel Lespinasse wrote: > > [...] > >> The issue I see is that even when people configure soft limits B+C < > >> A, your current proposal still doesn't "leave the other alone" as > >> Glauber and I think we should. > > > > If B+C < A then B resp. C get reclaimed only if A is over the limit > > which means that it couldn't reclaimed enough to get bellow the limit > > when we bang on it before B and C. We can update the implementation > > later to be more clever in situations like this but this is not that > > easy because once we get away from the round robin over the tree then we > > might end up having other issues - like unfairness etc... That's why I > > wanted to have this as simple as possible. > > > Nobody is opposing this, Michal. > > What people are opposing is you saying that the children should be > reclaimed *regardless* of their softlimit when the parent is over their > soft limit. Someone, specially you, saying this, highly threatens > further development in this direction. OK, I am feeling like repeating myself. Anyway once more. I am _all_ for protecting children that are under their limit if that is _possible_[1]. We are not yet there though for generic configuration. That's why I was so careful about the wording and careful configuration at this stage. Is this sufficient for your concerns? I do not see any giant obstacles in the current implementation to allow this behavior. 
> It doesn't really matter if your current set is doing this, simply > everybody already agreed that you are moving in a good direction. > > If you believe that it is desired to protect the children from reclaim > in situation in which the offender is only one of the children and that > can be easily identified, please state that clearly. Clearly yes. --- [1] And to be even more clear, there are cases where this will never be possible. For example:

A (soft:0)
|
B (soft:MAX)

where a smart-ass B thinks that his group never gets reclaimed although he is the only source of the pressure. This is what I call an untrusted environment. -- Michal Hocko SUSE Labs
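The footnote's scenario can be made concrete with a few lines. In this illustrative Python sketch (names and numbers are invented), treating each child's soft limit as an absolute shield lets B escape soft-limit reclaim entirely, while gating the subtree by the parent's limit does not:

```python
INF = float("inf")

# Michal's footnote: A (soft:0) with a single child B (soft:MAX).
# Every page in the hierarchy is charged to B; A holds none itself.
B = {"name": "B", "own": 4, "soft": INF}
A = {"name": "A", "own": 0, "soft": 0, "children": [B]}

def reclaimable(parent, child_limit_is_shield):
    """Pages soft-limit reclaim may touch when the parent is over its
    limit.  If each child's soft limit is an absolute shield, a child
    that sets soft=MAX escapes entirely; if the parent's limit gates
    the whole subtree, it does not."""
    total = parent["own"] + sum(c["own"] for c in parent["children"])
    if total <= parent["soft"]:
        return 0                      # parent under its limit: leave alone
    if child_limit_is_shield:
        return parent["own"] + sum(c["own"] for c in parent["children"]
                                   if c["own"] > c["soft"])
    return total                      # parent's limit covers the subtree

print(reclaimable(A, child_limit_is_shield=True))   # 0 -- B escapes
print(reclaimable(A, child_limit_is_shield=False))  # 4
```

This is the "untrusted environment" in miniature: with an unconditional shield the sole source of pressure can never be reclaimed, which is why the parent's limit has to apply to the subtree.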
* Re: memcg: softlimit on internal nodes 2013-04-23 9:58 ` Michel Lespinasse 2013-04-23 10:17 ` Glauber Costa @ 2013-04-23 11:32 ` Michal Hocko 2013-04-23 12:45 ` Michel Lespinasse 2013-04-23 12:51 ` Michal Hocko 2 siblings, 1 reply; 46+ messages in thread From: Michal Hocko @ 2013-04-23 11:32 UTC (permalink / raw) To: Michel Lespinasse Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen On Tue 23-04-13 02:58:19, Michel Lespinasse wrote: > On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote: > > On Mon 22-04-13 08:46:20, Tejun Heo wrote: > >> Oh, if so, I'm happy. Sorry about being brash on the thread; however, > >> please talk with google memcg people. They have very different > >> interpretation of what "softlimit" is and are using it according to > >> that interpretation. If it *is* an actual soft limit, there is no > >> inherent isolation coming from it and that should be clear to > >> everyone. > > > > We have discussed that for a long time. I will not speak for Greg & Ying > > but from my POV we have agreed that the current implementation will work > > for them with some (minor) changes in their layout. > > As I have said already with a careful configuration (e.i. setting the > > soft limit only where it matters - where it protects an important > > memory which is usually in the leaf nodes) > > I don't like your argument that soft limits work if you only set them > on leaves. I didn't say that. Please read it again. "where it protects an important memory which is _usually_ in the leaf nodes". Intermediate nodes can of course contain some important memory as well, and you can well "protect" them by the soft limit; you just have to be very careful, because what you end up with is quite a complicated structure. You have a node that has some portion of its own memory mixed with reparented pages. 
You cannot distinguish those two, of course, so protection is somewhat harder to achieve. That is the reason why I encourage not using any limit on the intermediate node, which with my patchset means the node simply gets reclaimed. > To me this is just a fancy way of saying that hierarchical soft limits > don't work. It works the same as the hard limit, it just triggers later. > Also it is somewhat problematic to assume that important memory can > easily be placed in leaves. This is difficult to ensure when > subcontainer destruction, for example, moves the memory back into the > parent. Is the memory still important then? The workload which uses the memory is done. So this ends up being just cached data. > > you can actually achieve > > _high_ probability for not being reclaimed after the rework which was not > > possible before because of the implementation which was ugly and > > smelled. > > So, to be clear, what we (google MM people) want from soft limits is > some form of protection against being reclaimed from when your cgroup > (or its parent) is below the soft limit. > > I don't like to call it a guarantee either, because we understand that > it comes with some limitations - for example, if all user pages on a > given node are yours then allocations from that node might cause some > of your pages to be reclaimed, even when you're under your soft limit. > But we want some form of (weak) guarantee that can be made to work > good enough in practice. > > Before your change, soft limits didn't actually provide any such form > of guarantee, weak or not, since global reclaim would ignore soft > limits. > > With your proposal, soft limits at least do provide the weak guarantee > that we want, when not using hierarchies. We see this as a very clear > improvement over the previous situation, so we're very happy about > your patchset ! 
> > However, your proposal takes that weak guarantee away as soon as one > tries to use cgroup hierarchies with it, because it reclaims from > every child cgroup as soon as the parent hits its soft limit. This is > disappointing and also, I have not heard of why you want things to > work that way ? Sigh. Because if children didn't follow the parent's limit then they could easily escape from reclaim, pushing the pressure back to unrelated hierarchies in the tree, as the parent wouldn't be able to reclaim down to its limit. > Is this an ease of implementation issue or do you consider that > requirement as a bad idea ? And if it's the later, what's your > counterpoint, is it related to delegation or is it something else that > I haven't heard of ? The implementation can be improved and child groups might be reclaimed _only_ if the parent cannot satisfy its soft limit; this is not a target of the current re-implementation. The limit has to be preserved, though. -- Michal Hocko SUSE Labs
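The improvement Michal leaves open here, reclaiming child groups only when the parent cannot satisfy its soft limit from its own pages, could look like the following. This is a hedged Python sketch under that assumption; the Group type, names, and numbers are invented and children are modelled as leaves for brevity:

```python
# Illustrative model only; nothing here mirrors kernel structures.
class Group:
    def __init__(self, name, soft, own=0, children=()):
        self.name, self.soft, self.own = name, soft, own
        self.children = list(children)

    def usage(self):
        # Hierarchical usage: own pages plus all children's pages.
        return self.own + sum(c.usage() for c in self.children)

def reclaim_top_down(g):
    """Bring g under its soft limit: reclaim g's own pages (e.g.
    reparented cache) first, and touch children only if that was not
    enough to satisfy the parent's limit."""
    excess = g.usage() - g.soft
    if excess <= 0:
        return []               # under the limit: leave the subtree alone
    touched = [g.name]
    take = min(g.own, excess)   # the parent's own pages go first
    g.own, excess = g.own - take, excess - take
    for c in g.children:
        if excess == 0:
            break               # parent satisfied: remaining kids spared
        take = min(c.own, excess)
        c.own, excess = c.own - take, excess - take
        touched.append(c.name)
    return touched

# Parent's own (reparented) pages suffice: children are left alone.
a1 = Group("A", soft=4, own=3, children=[Group("B", 1, own=2),
                                         Group("C", 1, own=2)])
print(reclaim_top_down(a1))   # ['A']

# Parent alone cannot satisfy the limit: children get reclaimed too.
a2 = Group("A", soft=4, own=1, children=[Group("B", 4, own=3),
                                         Group("C", 4, own=3)])
print(reclaim_top_down(a2))   # ['A', 'B']
```

This keeps the parent's limit authoritative over the subtree while matching Michel's preference further down the thread that the parent's own cached data be reclaimed before the children are touched.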
* Re: memcg: softlimit on internal nodes 2013-04-23 11:32 ` Michal Hocko @ 2013-04-23 12:45 ` Michel Lespinasse 2013-04-23 12:59 ` Michal Hocko 0 siblings, 1 reply; 46+ messages in thread From: Michel Lespinasse @ 2013-04-23 12:45 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen On Tue, Apr 23, 2013 at 4:32 AM, Michal Hocko <mhocko@suse.cz> wrote: > On Tue 23-04-13 02:58:19, Michel Lespinasse wrote: >> On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote: >> > On Mon 22-04-13 08:46:20, Tejun Heo wrote: >> >> Oh, if so, I'm happy. Sorry about being brash on the thread; however, >> >> please talk with google memcg people. They have very different >> >> interpretation of what "softlimit" is and are using it according to >> >> that interpretation. If it *is* an actual soft limit, there is no >> >> inherent isolation coming from it and that should be clear to >> >> everyone. >> > >> > We have discussed that for a long time. I will not speak for Greg & Ying >> > but from my POV we have agreed that the current implementation will work >> > for them with some (minor) changes in their layout. >> > As I have said already with a careful configuration (e.i. setting the >> > soft limit only where it matters - where it protects an important >> > memory which is usually in the leaf nodes) >> >> I don't like your argument that soft limits work if you only set them >> on leaves. > > I didn't say that. Please read it again. "where it protects an important > memory which is _usaully_ in the leaf nodes". Intermediate nodes can of > course contain some important memory as well and you can well "protect" > them by the soft limit you just have to be very careful because what you > have in the result is quite complicated structure. You have a node that > has some portion of its own memory mixed with reparented pages. 
You > cannot distinguish those two of course so protection is somehow harder > to achieve. That is the reason why I encourage not using any limit on > the intermediate node which means reclaim the node with my patchset. > >> To me this is just a fancy way of saying that hierarchical soft limits >> don't work. > > It works same as the hard limit it just triggers later. > >> Also it is somewhat problematic to assume that important memory can >> easily be placed in leaves. This is difficult to ensure when >> subcontainer destruction, for example, moves the memory back into the >> parent. > > Is the memory still important then? The workload which uses the memory > is done. So this ends up being just a cached data. Well, even supposing the parent only holds non-important cached data and the leaves have important data... your proposal implies that soft limits on the leaves won't protect their data from reclaim, because the cached data in the parent might cause the parent to go over its own soft limit. If the leaves stay under their own soft limits, I would prefer that the parent's cached data gets reclaimed first. >> > you can actually achieve >> > _high_ probability for not being reclaimed after the rework which was not >> > possible before because of the implementation which was ugly and >> > smelled. >> >> So, to be clear, what we (google MM people) want from soft limits is >> some form of protection against being reclaimed from when your cgroup >> (or its parent) is below the soft limit. >> >> I don't like to call it a guarantee either, because we understand that >> it comes with some limitations - for example, if all user pages on a >> given node are yours then allocations from that node might cause some >> of your pages to be reclaimed, even when you're under your soft limit. >> But we want some form of (weak) guarantee that can be made to work >> good enough in practice. 
>> >> Before your change, soft limits didn't actually provide any such form >> of guarantee, weak or not, since global reclaim would ignore soft >> limits. >> >> With your proposal, soft limits at least do provide the weak guarantee >> that we want, when not using hierarchies. We see this as a very clear >> improvement over the previous situation, so we're very happy about >> your patchset ! >> >> However, your proposal takes that weak guarantee away as soon as one >> tries to use cgroup hierarchies with it, because it reclaims from >> every child cgroup as soon as the parent hits its soft limit. This is >> disappointing and also, I have not heard of why you want things to >> work that way ? > > Sigh. Because if children didn't follow parent's limit then they could > easily escape from the reclaim pushing back to an unrelated hierarchies > in the tree as the parent wouldn't be able to reclaim down to its limit. To clarify: do you see us having this problem without administrative delegation of the child cgroup configuration? >> Is this an ease of implementation issue or do you consider that >> requirement as a bad idea ? And if it's the later, what's your >> counterpoint, is it related to delegation or is it something else that >> I haven't heard of ? > > The implementation can be improved and child groups might be reclaimed > _only_ if parent cannot satisfy its soft limit this is not a target of > the current re-implementation though. The limit has to be preserved > though. I'm actually OK with doing things that way; it's only talk about disallowing these further steps that makes me very worried... -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies.
* Re: memcg: softlimit on internal nodes 2013-04-23 12:45 ` Michel Lespinasse @ 2013-04-23 12:59 ` Michal Hocko 0 siblings, 0 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 12:59 UTC (permalink / raw) To: Michel Lespinasse Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen On Tue 23-04-13 05:45:05, Michel Lespinasse wrote: > On Tue, Apr 23, 2013 at 4:32 AM, Michal Hocko <mhocko@suse.cz> wrote: > > On Tue 23-04-13 02:58:19, Michel Lespinasse wrote: > >> On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote: > >> > On Mon 22-04-13 08:46:20, Tejun Heo wrote: > >> >> Oh, if so, I'm happy. Sorry about being brash on the thread; however, > >> >> please talk with google memcg people. They have very different > >> >> interpretation of what "softlimit" is and are using it according to > >> >> that interpretation. If it *is* an actual soft limit, there is no > >> >> inherent isolation coming from it and that should be clear to > >> >> everyone. > >> > > >> > We have discussed that for a long time. I will not speak for Greg & Ying > >> > but from my POV we have agreed that the current implementation will work > >> > for them with some (minor) changes in their layout. > >> > As I have said already with a careful configuration (i.e. setting the > >> > soft limit only where it matters - where it protects an important > >> > memory which is usually in the leaf nodes) > >> > >> I don't like your argument that soft limits work if you only set them > >> on leaves. > > > > I didn't say that. Please read it again. "where it protects an important > > memory which is _usually_ in the leaf nodes". Intermediate nodes can of > > course contain some important memory as well and you can well "protect" > > them by the soft limit, you just have to be very careful because what you > > have in the result is quite a complicated structure.
You have a node that > > has some portion of its own memory mixed with reparented pages. You > > cannot distinguish those two of course so protection is somewhat harder > > to achieve. That is the reason why I encourage not using any limit on > > the intermediate node which means reclaim the node with my patchset. > > > >> To me this is just a fancy way of saying that hierarchical soft limits > >> don't work. > > > > It works the same as the hard limit; it just triggers later. > > > >> Also it is somewhat problematic to assume that important memory can > >> easily be placed in leaves. This is difficult to ensure when > >> subcontainer destruction, for example, moves the memory back into the > >> parent. > > > > Is the memory still important then? The workload which uses the memory > > is done. So this ends up being just cached data. > > Well, even supposing the parent only holds non-important cached data > and the leaves have important data... your proposal implies that soft > limits on the leaves won't protect their data from reclaim, because > the cached data in the parent might cause the parent to go over its > own soft limit. Parent would be visited first so it can reclaim from its pages first. Only then we traverse the tree down to children. Just out of curiosity, what is the point of setting the soft limit on that node in the first place? You want to use the soft limit for isolation but is there anything you want to isolate in that node? Moreover, does it really make sense to set the soft limit to less than Sum(children(soft_limit))? > If the leaves stay under their own soft limits, I would prefer that > the parent's cached data gets reclaimed first. > > >> > you can actually achieve > >> > _high_ probability for not being reclaimed after the rework which was not > >> > possible before because of the implementation which was ugly and > >> > smelled.
> >> So, to be clear, what we (google MM people) want from soft limits is > >> some form of protection against being reclaimed from when your cgroup > >> (or its parent) is below the soft limit. > >> > >> I don't like to call it a guarantee either, because we understand that > >> it comes with some limitations - for example, if all user pages on a > >> given node are yours then allocations from that node might cause some > >> of your pages to be reclaimed, even when you're under your soft limit. > >> But we want some form of (weak) guarantee that can be made to work > >> good enough in practice. > >> > >> Before your change, soft limits didn't actually provide any such form > >> of guarantee, weak or not, since global reclaim would ignore soft > >> limits. > >> > >> With your proposal, soft limits at least do provide the weak guarantee > >> that we want, when not using hierarchies. We see this as a very clear > >> improvement over the previous situation, so we're very happy about > >> your patchset ! > >> > >> However, your proposal takes that weak guarantee away as soon as one > >> tries to use cgroup hierarchies with it, because it reclaims from > >> every child cgroup as soon as the parent hits its soft limit. This is > >> disappointing and also, I have not heard of why you want things to > >> work that way ? > > > > Sigh. Because if children didn't follow parent's limit then they could > > easily escape from the reclaim pushing back to unrelated hierarchies > > in the tree as the parent wouldn't be able to reclaim down to its limit. > > To clarify: do you see us having this problem without administrative > delegation of the child cgroup configuration ? In the perfect world where the limits are set up reasonably there is no such issue. Parents would usually have a limit higher than the sum of their children's limits so children wouldn't need to reclaim just because their parent is over the limit.
> >> Is this an ease of implementation issue or do you consider that > >> requirement as a bad idea ? And if it's the latter, what's your > >> counterpoint, is it related to delegation or is it something else that > >> I haven't heard of ? > > > > The implementation can be improved and child groups might be reclaimed > > _only_ if parent cannot satisfy its soft limit; this is not a target of > > the current re-implementation though. The limit has to be preserved > > though. > > I'm actually OK with doing things that way; it's only talk about > disallowing these further steps that makes me very worried... What prevents us from enhancing reclaim further? -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-23 9:58 ` Michel Lespinasse 2013-04-23 10:17 ` Glauber Costa 2013-04-23 11:32 ` Michal Hocko @ 2013-04-23 12:51 ` Michal Hocko 2 siblings, 0 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 12:51 UTC (permalink / raw) To: Michel Lespinasse Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Greg Thelen On Tue 23-04-13 02:58:19, Michel Lespinasse wrote: [...] > However, your proposal takes that weak guarantee away as soon as one > tries to use cgroup hierarchies with it, because it reclaims from > every child cgroup as soon as the parent hits its soft limit. Reading this again I am really getting confused. The primary objection used to be that an under-soft-limit internal-node subtree shouldn't be reclaimed although there are children over their soft limits. Now we have moved to "an over-limit internal node shouldn't hammer its subtree"? -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-21 2:23 ` Tejun Heo 2013-04-21 8:55 ` Michel Lespinasse @ 2013-04-21 12:46 ` Michal Hocko 2013-04-22 4:39 ` Tejun Heo 1 sibling, 1 reply; 46+ messages in thread From: Michal Hocko @ 2013-04-21 12:46 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen [I am terribly jet lagged so I should probably postpone any serious thinking for a few days but let me try] On Sat 20-04-13 19:23:21, Tejun Heo wrote: > Hello, Michal. > > On Fri, Apr 19, 2013 at 08:16:11PM -0700, Michal Hocko wrote: > > > For example, please consider the following hierarchy where s denotes > > > the "softlimit" and h hardlimit. > > > > > > A (h:8G s:4G) > > > / \ > > > / \ > > > B (h:5G s:1G) C (h:5G s:1G) > ... > > > It must not be any different for "softlimit". If B or C are > > > individually under 1G, they won't be targeted by the reclaimer and > > > even if B and C are over 1G, let's say 2G, as long as the sum is under > > > A's "softlimit" - 4G, reclaimer won't look at them. > > > > But we disagree on this one. If B and/or C are above their soft limit > > we do (soft) reclaim them. It is exactly the same thing as if they were > > hitting their hard limit (we just enforce the limit lazily). > > > > You can look at the soft limit as a lazy limit which is enforced only if > > there is an external pressure coming up the hierarchy - this can be > > either global memory pressure or a hard limit reached up the hierarchy. > > Does this make sense to you? > > When flat, there's no confusion. The problem is that what you > describe makes the meaning of softlimit different for internal nodes > and leaf nodes. No, internal and leaf nodes behave the very same. Have a look at mem_cgroup_soft_reclaim_eligible. All the confusion comes probably from the understanding of the current semantics of what the soft limit is and what it should do after my patch.
The current implementation stores all subtrees that are over the soft limit in a tree sorted by how much they are exceeding the limit. Have a look at mem_cgroup_update_tree and its callers (namely down from __mem_cgroup_commit_charge). My patch _preserves_ this behavior; it just makes the code much saner and as a bonus it doesn't touch groups (not hierarchies) under the limit unless necessary which wasn't the case previously. So yes, I can understand why this is confusing for you. The soft limit semantic is different because the limit is/was considered only if it is/was in excess. Maybe I was using the word _guarantee_ too often to confuse you, I am sorry if this is the case. The guarantee part comes from the group point of view. So the original semantic of the hierarchical behavior is unchanged. What it means for an internal node to be under the soft limit for its subhierarchy is questionable and there are usecases where children groups might be under the control of different (even untrusted) administrators (think about containers) so the implementation is not straightforward. We certainly can do better than just reclaim everybody but this is a subject for later improvements. I will get to the rest of the email later. [...] -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-21 12:46 ` Michal Hocko @ 2013-04-22 4:39 ` Tejun Heo 2013-04-22 15:19 ` Michal Hocko 0 siblings, 1 reply; 46+ messages in thread From: Tejun Heo @ 2013-04-22 4:39 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen Hey, Michal. On Sun, Apr 21, 2013 at 02:46:06PM +0200, Michal Hocko wrote: > [I am terribly jet lagged so I should probably postpone any serious > thinking for a few days but let me try] Sorry about raising a flame war so soon after the conference week. None of these is really urgent, so please take your time. > The current implementation stores all subtrees that are over the soft > limit in a tree sorted by how much they are exceeding the limit. Have > a look at mem_cgroup_update_tree and its callers (namely down from > __mem_cgroup_commit_charge). My patch _preserves_ this behavior; it just > makes the code much saner and as a bonus it doesn't touch groups (not > hierarchies) under the limit unless necessary which wasn't the case > previously. What you describe is already confused. What does that knob mean then? Google folks seem to think it's an allocation guarantee but global reclaim is broken and breaches the configuration (which I suppose is arising from their usage of memcg) and I don't understand what your definition of the knob is apart from the description of what's implemented now, which apparently is causing horrible confusion on all the involved parties. > So yes, I can understand why this is confusing for you. The soft limit > semantic is different because the limit is/was considered only if it > is/was in excess. > > Maybe I was using the word _guarantee_ too often to confuse you, I am sorry > if this is the case. The guarantee part comes from the group point of > view. So the original semantic of the hierarchical behavior is > unchanged. I don't care what word you use.
There are two choices. Pick one and stick with it. Don't make it something which inhibits reclaim if under limit for leaf nodes but behaves somewhat differently if an ancestor is under pressure or whatever. Just pick one. It is either a reclaim inhibitor or an actual soft limit. > What it means for an internal node to be under the soft limit > for its subhierarchy is questionable and there are usecases where It's not frigging questionable. You're just horribly confused. Thanks. -- tejun
* Re: memcg: softlimit on internal nodes 2013-04-22 4:39 ` Tejun Heo @ 2013-04-22 15:19 ` Michal Hocko 2013-04-22 15:57 ` Tejun Heo 0 siblings, 1 reply; 46+ messages in thread From: Michal Hocko @ 2013-04-22 15:19 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen On Sun 21-04-13 21:39:39, Tejun Heo wrote: > Hey, Michal. > > On Sun, Apr 21, 2013 at 02:46:06PM +0200, Michal Hocko wrote: > > [I am terribly jet lagged so I should probably postpone any serious > > thinking for a few days but let me try] > > Sorry about raising a flame war so soon after the conference week. > None of these is really urgent, so please take your time. > > > The current implementation stores all subtrees that are over the soft > > limit in a tree sorted by how much they are exceeding the limit. Have > > a look at mem_cgroup_update_tree and its callers (namely down from > > __mem_cgroup_commit_charge). My patch _preserves_ this behavior; it just > > makes the code much saner and as a bonus it doesn't touch groups (not > > hierarchies) under the limit unless necessary which wasn't the case > > previously. > > What you describe is already confused. What does that knob mean then? Well, it would help to start with Documentation/cgroups/memory.txt " 7. Soft limits Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided a. There is no memory contention b. They do not exceed their hard limit When the system detects memory contention or low memory, control groups are pushed back to their soft limits. If the soft limit of each control group is very high, they are pushed back as much as possible to make sure that one control group does not starve the others of memory.
Please note that soft limits is a best-effort feature; it comes with no guarantees, but it does its best to make sure that when memory is heavily contended for, memory is allocated based on the soft limit hints/setup. Currently soft limit based reclaim is set up such that it gets invoked from balance_pgdat (kswapd). " As you can see there is no single mention about groups below their soft limits. All we are saying here is that those groups that are above will get reclaimed. > Google folks seem to think it's an allocation guarantee but global > reclaim is broken and breaches the configuration (which I suppose is > arising from their usage of memcg) and I don't understand what your > definition of the knob is apart from the description of what's > implemented now, which apparently is causing horrible confusion on all > the involved parties. OK, I guess I start understanding where all the confusion comes from. Let me stress again that the rework doesn't provide any guarantee. It just integrates the soft limit reclaim into the main reclaim routines, gets rid of a lot of code and last but not least gives a greater chance that under-soft-limit groups are not reclaimed unless really necessary. So please take these into consideration for future discussions. > > So yes, I can understand why this is confusing for you. The soft limit > > semantic is different because the limit is/was considered only if it > > is/was in excess. > > > > Maybe I was using the word _guarantee_ too often to confuse you, I am sorry > > if this is the case. The guarantee part comes from the group point of > > view. So the original semantic of the hierarchical behavior is > > unchanged. > > I don't care what word you use. There are two choices. Pick one and stick with it. Don't make it something which inhibits reclaim if under limit for leaf nodes but behaves somewhat differently if an ancestor is under pressure or whatever. Just pick one.
It is either a reclaim inhibitor or an actual soft limit. OK, I will not repeat the same mistake and let this frustrating discussion get to the "let's redo the soft limit reclaim again #1001" point again. No, this is not about guarantee. And _never_ will be! Full stop. We can try to be clever during the outside pressure and prefer reclaiming over-soft-limit groups first. Which we used to do and will do after the rework as well. As a side effect of that, a properly designed hierarchy with opt-in soft-limited groups can actually accomplish some isolation, which is a nice side effect but no _guarantee_. > > What it means for an internal node to be under the soft limit > > for its subhierarchy is questionable and there are usecases where > > It's not frigging questionable. You're just horribly confused. > > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-22 15:19 ` Michal Hocko @ 2013-04-22 15:57 ` Tejun Heo 2013-04-22 15:57 ` Tejun Heo 2013-04-22 16:20 ` Michal Hocko 0 siblings, 2 replies; 46+ messages in thread From: Tejun Heo @ 2013-04-22 15:57 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen On Mon, Apr 22, 2013 at 05:19:08PM +0200, Michal Hocko wrote: > We can try to be clever during the outside pressure and prefer > reclaiming over soft limit groups first. Which we used to do and will > do after rework as well. As a side effect of that a properly designed > hierarchy with opt-in soft limited groups can actually accomplish some > isolation is a nice side effect but no _guarantee_. Okay, so it *is* a soft limit. Good. If so, a subtree going over the limit of course forces reclaim on its children even though their individual configs aren't over limit. It's exactly the same as hardlimit. There doesn't need to be any difference and there's nothing questionable or interesting about it. Also, then, a cgroup which has been configured explicitly shouldn't be disadvantaged compared to a cgroup with a limit configured. ie. the current behavior of giving maximum to the knob on creation is the correct one. The knob should create *extra* pressure. It shouldn't lessen the pressure. When populated with other cgroups with limits configured, it would change the relative pressure felt by each but in general it's a limiting mechanism not an isolation one. I think the bulk of confusion is coming from this, so please make that abundantly clear. And, if people want a mechanism for isolation / lessening of pressure, which looks like a valid use case to me, add another knob for that which is prioritized under both hard and soft limits. That is the only sensible way to do it. Alright, no complaint anymore. Thanks.
-- tejun
* Re: memcg: softlimit on internal nodes 2013-04-22 15:57 ` Tejun Heo @ 2013-04-22 15:57 ` Tejun Heo 2013-04-22 16:20 ` Michal Hocko 1 sibling, 0 replies; 46+ messages in thread From: Tejun Heo @ 2013-04-22 15:57 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen On Mon, Apr 22, 2013 at 08:57:03AM -0700, Tejun Heo wrote: > On Mon, Apr 22, 2013 at 05:19:08PM +0200, Michal Hocko wrote: > > We can try to be clever during the outside pressure and prefer > > reclaiming over soft limit groups first. Which we used to do and will > > do after rework as well. As a side effect of that a properly designed > > hierarchy with opt-in soft limited groups can actually accomplish some > > isolation is a nice side effect but no _guarantee_. > > Okay, so it *is* a soft limit. Good. If so, a subtree going over the > limit of course forces reclaim on its children even though their > individual configs aren't over limit. It's exactly the same as > hardlimit. There doesn't need to be any difference and there's > nothing questionable or interesting about it. > > Also, then, a cgroup which has been configured explicitly shouldn't be ^ not > disadvantaged compared to a cgroup with a limit configured. ie. the > current behavior of giving maximum to the knob on creation is the > correct one. The knob should create *extra* pressure. It shouldn't > lessen the pressure. When populated with other cgroups with limits > configured, it would change the relative pressure felt by each but in > general it's a limiting mechanism not an isolation one. I think the > bulk of confusion is coming from this, so please make that abundantly > clear. > > And, if people want a mechanism for isolation / lessening of pressure, > which looks like a valid use case to me, add another knob for that > which is prioritized under both hard and soft limits.
That is the > only sensible way to do it. > > Alright, no complaint anymore. Thanks. > > -- > tejun -- tejun
* Re: memcg: softlimit on internal nodes 2013-04-22 15:57 ` Tejun Heo 2013-04-22 15:57 ` Tejun Heo @ 2013-04-22 16:20 ` Michal Hocko 2013-04-22 18:30 ` Tejun Heo 1 sibling, 1 reply; 46+ messages in thread From: Michal Hocko @ 2013-04-22 16:20 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen On Mon 22-04-13 08:57:03, Tejun Heo wrote: > On Mon, Apr 22, 2013 at 05:19:08PM +0200, Michal Hocko wrote: > > We can try to be clever during the outside pressure and prefer > > reclaiming over soft limit groups first. Which we used to do and will > > do after rework as well. As a side effect of that a properly designed > > hierarchy with opt-in soft limited groups can actually accomplish some > > isolation is a nice side effect but no _guarantee_. > > Okay, so it *is* a soft limit. Good. If so, a subtree going over the > limit of course forces reclaim on its children even though their > individual configs aren't over limit. It's exactly the same as > hardlimit. There doesn't need to be any difference and there's > nothing questionable or interesting about it. > > Also, then, a cgroup which has been configured explicitly shouldn't be > disadvantaged compared to a cgroup with a limit configured. ie. the > current behavior of giving maximum to the knob on creation is the > correct one. Although the default limit is correct it is impractical for use because it doesn't allow for "I behave, do not reclaim me if you can" cases. And we can implement such a behavior really easily with backward compatibility and new interfaces (aka reuse the soft limit for that). I am approaching this from a simple perspective. Reclaim from everybody who doesn't care about the soft limit (it hasn't been set for that group) or who is above the soft limit.
If that is sufficient to meet the reclaim target then there is no reason to touch groups that _do_ care about the soft limit and are under it. Although this doesn't give you any guarantee it can give a certain prioritization for groups in the overcommit situations and that is what soft limit was intended for from the very beginning. > The knob should create *extra* pressure. It shouldn't > lessen the pressure. When populated with other cgroups with limits > configured, it would change the relative pressure felt by each but in > general it's a limiting mechanism not an isolation one. I think the > bulk of confusion is coming from this, so please make that abundantly > clear. > > And, if people want a mechanism for isolation / lessening of pressure, > which looks like a valid use case to me, add another knob for that > which is prioritized under both hard and soft limits. That is the > only sensible way to do it. No, please no yet another knob. We have too many of them already. And even those that are here for a long time can be confusing as one can see. > Alright, no complaint anymore. Thanks. > > -- > tejun -- Michal Hocko SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-22 16:20 ` Michal Hocko @ 2013-04-22 18:30 ` Tejun Heo 2013-04-23 9:29 ` Michal Hocko ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: Tejun Heo @ 2013-04-22 18:30 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen Hey, On Mon, Apr 22, 2013 at 06:20:12PM +0200, Michal Hocko wrote: > Although the default limit is correct it is impractical for use > because it doesn't allow for "I behave, do not reclaim me if you can" > cases. And we can implement such a behavior really easily with backward > compatibility and new interfaces (aka reuse the soft limit for that). Okay, now we're back to square one and I'm reinstating all the mean things I said in this thread. :P No wonder everyone is so confused about this. Michal, you can't overload two controls which exert pressure in opposite directions onto a single knob and define a sane hierarchical behavior for it. You're making it a point control rather than a range one. Maybe you can define some twisted rules serving certain specific use case, but it's gonna be confusing / broken for different use cases. You're so confused that you don't even know you're confused. > I am approaching this from a simple perspective. Reclaim from everybody No, you're just thinking about two immediate problems you're given and trying to jam them into something you already have, not realizing those two can't be expressed with a single knob. > who doesn't care about the soft limit (it hasn't been set for that > group) or who is above the soft limit. If that is sufficient to meet the > reclaim target then there is no reason to touch groups that _do_ care > about soft limit and they are under.
Although this doesn't give you > any guarantee it can give a certain prioritization for groups in the > overcommit situations and that is what soft limit was intended for from > the very beginning. For $DEITY's sake, soft limit should exert reclaim pressure. That's it. If a group is over limit, it'll feel *extra* pressure until it's back to the limit. Once under the limit, it should be treated equally to any other tasks which are under the limit including the ones without any softlimit configured. It is not different from hardlimit. There's nothing "interesting" about it. Even for flat hierarchy, with your interpretation of the knob, it is impossible to say "I don't really care about this thing, if it goes over 30M, hammer on it", which is a completely reasonable thing to want. > > And, if people want a mechanism for isolation / lessening of pressure, > > which looks like a valid use case to me, add another knob for that > > which is prioritized under both hard and soft limits. That is the > > only sensible way to do it. > > No, please no yet another knob. We have too many of them already. And > even those that are here for a long time can be confusing as one can > see. Yes, sure, knobs are hard, let's combine two controls in the opposite directions into one. That is the crux of the confusion - trying to combine two things which can't and shouldn't be combined. Just forget about the other thing or separate it out. Please take a step back and look at it again. You're really epitomizing the confusion on this subject. Thanks. -- tejun
* Re: memcg: softlimit on internal nodes 2013-04-22 18:30 ` Tejun Heo @ 2013-04-23 9:29 ` Michal Hocko 2013-04-23 17:09 ` Tejun Heo 2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko 2013-04-24 21:45 ` memcg: softlimit on internal nodes Johannes Weiner 2 siblings, 1 reply; 46+ messages in thread From: Michal Hocko @ 2013-04-23 9:29 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen On Mon 22-04-13 11:30:20, Tejun Heo wrote: > Hey, > > On Mon, Apr 22, 2013 at 06:20:12PM +0200, Michal Hocko wrote: > > Although the default limit is correct it is impractical for use > > because it doesn't allow for "I behave, do not reclaim me if you can" > > cases. And we can implement such a behavior really easily with backward > > compatibility and new interfaces (aka reuse the soft limit for that). > > Okay, now we're back to square one and I'm reinstating all the mean > things I said in this thread. :P No wonder everyone is so confused > about this. Michal, you can't overload two controls which exert > pressure on the opposite direction onto a single knob and define a > sane hierarchical behavior for it. Ohh, well and we are back in the circle again. Nobody is proposing overloading soft reclaim for any bottom-up (if that is what you mean by your opposite direction) pressure handling. > You're making it a point control rather than range one. Be more specific here, please? > Maybe you can define some twisted rules serving certain specific use > case, but it's gonna be confusing / broken for different use cases. Tejun, your argumentation is really hand wavy here. Which use cases will be broken and which ones will be confusing? Name one for an illustration. > You're so confused that you don't even know you're confused. Yes, you keep repeating that. But you haven't pointed out any single confusing use case so far.
Please please stop this, it is not productive. We are still talking about using soft limit to control overcommit situation as gracefully as possible. I hope we are on the same page about that at least. I will post my series as a reply to this email so that we can get to a more specific discussion because this "you are so confused because something, something, something, dark..." is not funny, nor productive. > > I am approaching this from a simple perspective. Reclaim from everybody > > No, you're just thinking about two immediate problems you're given and > trying to jam them into something you already have not realizing those > two can't be expressed with a single knob. Yes, I am thinking in context of several use cases, all right. One of them is memory isolation via soft limit prioritization. Something that is possible already but it is major PITA to do right. What we have currently is optimized for "let's hammer something". Although useful, not a primary usecase according to my experiences. The primary motivation for the soft limit was to have something to control overcommit situations gracefully AFAIR and let's hammer something and hope it will work doesn't sound gracefully to me. > > who doesn't care about the soft limit (it hasn't been set for that > > group) or who is above the soft limit. If that is sufficient to meet the > > reclaim target then there is no reason to touch groups that _do_ care > > about soft limit and they are under. Although this doesn't give you > > any guarantee it can give a certain prioritization for groups in the > > overcommit situations and that is what soft limit was intended for from > > the very beginning. > > For $DEITY's sake, soft limit should exert reclaim pressure. That's > it. If a group is over limit, it'll feel *extra* pressure until it's > back to the limit. Once under the limit, it should be treated equally > to any other tasks which are under the limit And yet again agreed and nobody is claiming otherwise. 
Except that

> including the ones without any softlimit configured.

I haven't seen any specific argument why the default limit shouldn't
always allow reclaim. Having soft-unreclaimable groups by default makes
it hard to use soft limit reclaim for anything more interesting. See the
last patch in the series ("memcg: Ignore soft limit until it is
explicitly specified"). With this approach you end up setting a soft
limit for every single group (even those you do not care about) just to
make balancing work reasonably for all hierarchies.

Anyway, this is just one part of the series and it doesn't make sense to
postpone the whole work just for this. If _more people_ really think that
the default limit change is really _so_ confusing and unusable then I
will not push it over dead bodies, of course.

> It is not different from hardlimit. There's nothing "interesting"
> about it.
>
> Even for flat hierarchy, with your interpretation of the knob, it is
> impossible to say "I don't really care about this thing, if it goes
> over 30M, hammer on it", which is a completely reasonable thing to
> want.

Nothing prevents this setting. I am just claiming that this is not the
most interesting use case for the soft limit and I would like to
optimize for more interesting use cases.

The patch set will follow.

--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-23 9:29 ` Michal Hocko @ 2013-04-23 17:09 ` Tejun Heo 2013-04-26 11:51 ` Michal Hocko 0 siblings, 1 reply; 46+ messages in thread From: Tejun Heo @ 2013-04-23 17:09 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen

Hello, Michal.

On Tue, Apr 23, 2013 at 11:29:56AM +0200, Michal Hocko wrote:
> Ohh, well and we are back in the circle again. Nobody is proposing
> overloading soft reclaim for any bottom-up (if that is what you mean by
> your opposite direction) pressure handling.
>
> > You're making it a point control rather than range one.
>
> Be more specific here, please?
>
> > Maybe you can define some twisted rules serving certain specific use
> > case, but it's gonna be confusing / broken for different use cases.
>
> Tejun, your argumentation is really hand wavy here. Which use cases will
> be broken and which one will be confusing. Name one for an illustration.
>
> > You're so confused that you don't even know you're confused.
>
> Yes, you keep repeating that. But you haven't pointed out any single
> confusing use case so far. Please please stop this, it is not productive.
> We are still talking about using soft limit to control overcommit
> situation as gracefully as possible. I hope we are on the same page
> about that at least.

Hmmm... I think I was at least somewhat clear on my points. I'll try
again. Let's see if I can at least make you understand what my point
is. Maybe some diagrams will help.

Let's consider hardlimit first as there seems to be consensus on what
it means. By default, hardlimit is set at max and exerts pressure
downwards.

  <--------------------------------------------------------|
  0                                                        max

When you configure a hard limit, the diagram becomes

  <-----------------------------------------|
  0                                       limit             max

The configuration now became more specific, right?
Now let's say there's one parent and one child. The parent looks like
the above and the child like the below.

  <---------------------|
  0                   limit'                                max

When you combine the two, you get

  <---------------------|
  0                   limit'                                max

In fact, it doesn't matter whether parent is more limited or child is.
When composing multiple limits, the only logical thing to do is
calculating the intersection - ie. take the most specific of the
limits, which naturally doesn't violate both configurations. In
hierarchy setup, children need to be summed and all, so it becomes
different, but that's the principle. I hope you're with me up to this
point.

Now, let's think about the other direction. I don't care whether it's
strict guarantee, soft protection or just a gentle preferential
treatment. The focus is the direction of specificity. Please forget
about "softlimit" for now. Just think at the interface level. You
don't want to give protection by default, right? The specificity
increases along with the amount of memory to "protect". So, the
default looks like

  |-------------------------------------------------------->
  0                                                        max

When you configure a certain amount, it becomes

  |------------------------------------------->
  0                                          prot           max

The direction of specificity is self-evident from what the default
should be. Now, when you combine it with another such protection, say
prot'.

  |--------------------------->
  0                         prot'                           max

Regardless of what the nesting order is, what you should get is

  |--------------------------->
  0                         prot'                           max

It's exactly the same as limit. When you combine multiple of them, the
most specific one wins. This is the basics of composing multiple
ranges and it is the same principle that cgroup hierarchy limit
configuration follows. When you compose configurations across
hierarchy, you get the intersection.

Now, when you put both into a single configuration knob, a given
config would look like the following.
     specificity                    specificity
     of limit                       of protection
  <----------------|--------------------------------------->
  0             config                                     max

Now, if you try to combine it with another one - config'

                    specificity               specificity
                    of limit                  of protection
  <-------------------------------|------------------------>
  0                            config'                     max

The intersection is no longer clearly defined. If you choose config,
you violate the protection specificity of config'; if you choose
config', you violate the limit specificity of config. This is what I
meant by you're making it a point configuration rather than a range
one. A ranged config allows for well-defined composition through
intersection. People tend to do this intuitively which makes it easier
and more useful.

I don't really care all that much about memcg internals but I do care
about maintaining general sanity and consistency of cgroup control
knobs especially in hierarchical settings which we traditionally have
been horrible at, and I hope you at least can see the problem I'm
seeing as it's evident as fire from where I stand. It's breaking the
very basic principle which makes hierarchy sensible and useful.

The fact that you think "switching the default value to the other end"
is just a detail is very bothering because the default value is not
determined according to one's whim. It's determined by the direction
of specificity and in turn clearly marks and determines further
operations including how they are composed. This really illuminates
the intricate and fragile tweaks you're trying to perform in an
attempt to make the above point control suit the use cases that you
immediately face - you're choosing the direction of specificity that
the knob is gonna follow on an instance-by-instance basis - it's one
direction for default and leaves if parent is not over limit; however,
if it's over limit, you flip the direction, so that it somehow works
for the use cases that you have right now.
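The composition rule being argued for can be sketched directly. This is an illustrative model with made-up function names; it deliberately ignores the summing across siblings that the mail mentions, and the treatment of protection as "smallest value wins along the path" is my reading of the diagrams (a parent cannot guarantee a child more than the parent itself is guaranteed):

```python
def effective_limit(path_limits):
    # A limit permits the range [0, limit].  Intersecting those ranges
    # along the parent->child path keeps the smallest, i.e. most
    # specific, limit.
    return min(path_limits)

def effective_protection(path_protections):
    # A protection of `prot` shields the range [0, prot].  Intersecting
    # the shielded ranges along the path again keeps the smallest value.
    return min(path_protections)
```

Both directions compose cleanly on their own; the "point control" problem is that a single number cannot be fed to both functions at once, since one reads it as the top of an allowed range and the other as the top of a shielded range.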
Sure, there are cases where such a greedy engineering approach is
useful or at least cases where we just have to make do with that, but
this is nothing like that. It is a basic interface design which isn't
complicated or difficult in itself.

> Yes, I am thinking in context of several use cases, all right. One
> of them is memory isolation via soft limit prioritization. Something
> that is possible already but it is major PITA to do right. What we
> have currently is optimized for "let's hammer something". Although
> useful, not a primary usecase according to my experiences. The primary
> motivation for the soft limit was to have something to control
> overcommit situations gracefully AFAIR and let's hammer something and
> hope it will work doesn't sound gracefully to me.

As I've said multiple times now, I'm not saying any of the presented
use cases are invalid. They all look valid to me and I think it's
logical to support them; however, combining the two directions of
specificities into one knob can't be the solution.

Right now, both google and parallels want isolation, so that's the
direction they're pushing - the arrows which are headed to the right
of the screen. The problem becomes self-evident when you consider use
cases which will want the arrows heading to the left of the screen,
where over-provision of softlimit would be a natural thing to do just
as hardlimit is, and such use cases won't call for and most likely
will be hurt by reducing reclaim pressure when under limit.

Say, a server or mobile configuration where a couple of background
jobs - say, indexing and backup - are running, both of which may
create a sizable amount of dirty data. They need to be done but aren't
of high priority. Given the size of the machine and the type of the
batch tasks, you wanna give X amount of memory to the batch tasks but
want to make sure neither takes too much of it, so configure each to
have Y and Z, where Y < X, Z < X but Y + Z > X.
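Numerically, the over-provisioned batch setup just described looks like this (the concrete values are made up for illustration; X, Y, Z are the names from the mail):

```python
G = 1 << 30                # 1 GiB

X = 4 * G                  # limit on the batch subtree as a whole
Y = Z = 3 * G              # per-task limits: each task fits alone,
                           # but Y + Z > X, so not both at once

def batch_over_limit(usage_y, usage_z):
    # A task should feel pressure if it exceeds its own limit, or if
    # the subtree as a whole exceeds the parent's limit - even when
    # each task is individually under its own limit.
    over_own = usage_y > Y or usage_z > Z
    over_subtree = usage_y + usage_z > X
    return over_own or over_subtree
```

For example, 2 GiB + 3 GiB keeps both tasks under their own 3 GiB limits, yet the 5 GiB total still trips the 4 GiB subtree limit, which is exactly the over-provisioning being described.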
This is a reasonable configuration and when the system, as a whole,
gets put under memory pressure - say the user launches a memory hog
game - you first want the batch tasks to give away memory as fast as
possible until the composition of limits is met and then you want them
to feel the same pressure as everyone else.

You can't combine "soft limit prioritization" and "isolation" into the
same knob. Not because of implementation details but because they have
the opposite directions of specificity. They're two fundamentally
incompatible knobs.

> > including the ones without any softlimit configured.
>
> I haven't seen any specific argument why the default limit shouldn't
> allow to always reclaim.
> Having soft unreclaimable groups by default makes it hard to use soft
> limit reclaim for something more interesting. See the last patch
> in the series ("memcg: Ignore soft limit until it is explicitly
> specified"). With this approach you end up setting soft limit for every
> single group (even those you do not care about) just to make balancing
> work reasonably for all hierarchies.

I think, well at least hope, that it's clear by now, but the above is
exactly the kind of twisting and tweaking that I was talking about
above. You're flipping things at different places trying to somehow
meet the conflicting requirements which are currently put forth mostly
by people using it as an isolation mechanism.

> Anyway, this is just one part of the series and it doesn't make sense to
> postpone the whole work just for this. If _more people_ really think that
> the default limit change is really _so_ confusing and unusable then I
> will not push it over dead bodies of course.

So, here's my problem with the patchset. As sucky as the current
situation is, "softlimit" currently doesn't explicitly implement or
suggest isolation. People wanting isolation would of course want to
push it to do isolation.
They just want to get the functionality and the interface doesn't
matter all that much, which is fine and completely understandable, but
by pushing it towards isolation, you're cementing the duality of the
knob. Frankly, I don't care which direction "softlimit" chooses but
you can't put both "limit" and "protection" into the same knob. It's
fundamentally broken especially in hierarchies.

> Nothing prevents from this setting. I am just claiming that this is not
> the most interesting use case for the soft limit and I would like to
> optimize for more interesting use cases.

Michal, it really is not about optimizing for anything. It is the
basic semantics of the knob, which isn't part of what one may call
"implementation details". You can't "optimize" them.

Thanks.

--
tejun
* Re: memcg: softlimit on internal nodes 2013-04-23 17:09 ` Tejun Heo @ 2013-04-26 11:51 ` Michal Hocko 2013-04-26 18:37 ` Tejun Heo 0 siblings, 1 reply; 46+ messages in thread From: Michal Hocko @ 2013-04-26 11:51 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen On Tue 23-04-13 10:09:00, Tejun Heo wrote: > Hello, Michal. > > On Tue, Apr 23, 2013 at 11:29:56AM +0200, Michal Hocko wrote: > > Ohh, well and we are back in the circle again. Nobody is proposing > > overloading soft reclaim for any bottom-up (if that is what you mean by > > your opposite direction) pressure handling. > > > > > You're making it a point control rather than range one. > > > > Be more specific here, please? > > > > > Maybe you can define some twisted rules serving certain specific use > > > case, but it's gonna be confusing / broken for different use cases. > > > > Tejun, your argumentation is really hand wavy here. Which use cases will > > be broken and which one will be confusing. Name one for an illustration. > > > > > You're so confused that you don't even know you're confused. > > > > Yes, you keep repeating that. But you haven't pointed out any single > > confusing use case so far. Please please stop this, it is not productive. > > We are still talking about using soft limit to control overcommit > > situation as gracefully as possible. I hope we are on the same page > > about that at least. > > Hmmm... I think I was at least somewhat clear on my points. I'll try > again. Let's see if I can at least make you understand what my point > is. Maybe some diagrams will help. Maybe I should have been more explicit about this but _yes I do agree_ that a separate limit would work as well. I just do not want to introduce yet-another-limit unless it is _really_ necessary. We have up to 4 of them depending on the configuration which is a lot already. 
And the new knob would certainly become a guarantee, whatever words we
use, with more expectations than soft limit, and I am afraid that
won't be that easy (unless we provide a poison pill for emergency
cases).

My rework was based on the soft limit semantic which we had for quite
some time and tried to enhance it to be more useful. I do understand
your concerns about the cleanness of the interface; I just objected
that the new meaning doesn't add any guarantee. The implementation
just tries to be clever about whom to reclaim from to handle external
pressure (for which the soft limit has been introduced in the first
place) while using hints from the limit as much as possible.

Anyway, I will think about the pros and cons of the new limit. I think
we shouldn't block the first 3 patches in the series which keep the
current semantic and just change the internals to do the same thing.
Do you agree? We can discuss single vs. new knob in the meantime of
course.

[...]

Thanks!

--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes 2013-04-26 11:51 ` Michal Hocko @ 2013-04-26 18:37 ` Tejun Heo 2013-04-29 15:27 ` Michal Hocko 0 siblings, 1 reply; 46+ messages in thread From: Tejun Heo @ 2013-04-26 18:37 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen

Hey,

On Fri, Apr 26, 2013 at 01:51:20PM +0200, Michal Hocko wrote:
> Maybe I should have been more explicit about this but _yes I do agree_
> that a separate limit would work as well. I just do not want to

Heh, the point was more about what we shouldn't be doing, but, yeah,
it's good that we at least agree on something. :)

> Anyway, I will think about cons and pros of the new limit. I think we
> shouldn't block the first 3 patches in the series which keep the current
> semantic and just change the internals to do the same thing. Do you
> agree?

As the merge window is coming right up, if it isn't something super
urgent, can we please hold it off until after the merge window? It
would be really great if we can pin down the semantics of the knob
before doing anything. Please. I'll think / study more about it in
the coming weeks.

Thanks.

--
tejun
* Re: memcg: softlimit on internal nodes 2013-04-26 18:37 ` Tejun Heo @ 2013-04-29 15:27 ` Michal Hocko 0 siblings, 0 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-29 15:27 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen

On Fri 26-04-13 11:37:41, Tejun Heo wrote:
> Hey,
>
> On Fri, Apr 26, 2013 at 01:51:20PM +0200, Michal Hocko wrote:
> > Maybe I should have been more explicit about this but _yes I do agree_
> > that a separate limit would work as well. I just do not want to
>
> Heh, the point was more about what we shouldn't be doing, but, yeah,
> it's good that we at least agree on something. :)
>
> > Anyway, I will think about cons and pros of the new limit. I think we
> > shouldn't block the first 3 patches in the series which keep the current
> > semantic and just change the internals to do the same thing. Do you
> > agree?
>
> As the merge window is coming right up, if it isn't something super
> urgent, can we please hold it off until after the merge window? It
> would be really great if we can pin down the semantics of the knob
> before doing anything.

I think that merging it into 3.10 would be too ambitious but I think
this core code cleanup makes sense for future discussions so I would
like to post it for the -mm tree at least. The sooner, the better,
IMHO.

> Please. I'll think / study more about it in the coming weeks.
>
> Thanks.
>
> --
> tejun

--
Michal Hocko
SUSE Labs
* [RFC v2 0/4] soft limit rework 2013-04-22 18:30 ` Tejun Heo 2013-04-23 9:29 ` Michal Hocko @ 2013-04-23 9:33 ` Michal Hocko 2013-04-23 9:33 ` [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko ` (3 more replies) 2013-04-24 21:45 ` memcg: softlimit on internal nodes Johannes Weiner 2 siblings, 4 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw) To: linux-mm Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen

This is the second version of the patchset. There were some minor
cleanups since the last version and I have moved "memcg: Ignore soft
limit until it is explicitly specified" to the end of the series as it
seems to be more controversial than I thought.

The basic idea is quite simple. Pull soft reclaim into shrink_zone in
the first step and get rid of the previous soft reclaim
infrastructure. shrink_zone is done in two passes now. First it tries
to do the soft limit reclaim and it falls back to reclaim-all mode if
no group is over the limit or no pages have been scanned. The second
pass happens at the same priority so the only time we waste is the
memcg tree walk which shouldn't be a big deal [1]. There is certainly
room for improvements in that direction. But let's keep it simple for
now. As a bonus we will get rid of a _lot_ of code by this and soft
reclaim will not stand out like before. The cleanup is in a separate
patch because I felt it would be easier to review that way.

The second step is soft limit reclaim integration into targeted
reclaim. It should be rather straightforward. Soft limit has been used
only for the global reclaim so far but it makes sense for any kind of
pressure coming from up the hierarchy, including targeted reclaim.

The last step is somewhat more controversial as the discussions show.
I am redefining the meaning of the default soft limit value.
I've not chosen 0 as we discussed previously because I want to
preserve the hierarchical property of the soft limit (if a parent up
the hierarchy is over its limit then children are over as well - same
as with the hard limit) so I have kept the default untouched -
unlimited - but I have slightly changed the meaning of this value. I
interpret it as "user doesn't care about soft limit". More precisely,
the value is ignored unless it has been specified by the admin/user,
so such groups are eligible for soft reclaim even though they do not
reach the limit. Such groups do not force their children to be
reclaimed so we can look at them as neutral for the soft reclaim.

I will attach my testing results later on.

Shortlog says:

Michal Hocko (4):
      memcg: integrate soft reclaim tighter with zone shrinking code
      memcg: Get rid of soft-limit tree infrastructure
      vmscan, memcg: Do softlimit reclaim also for targeted reclaim
      memcg: Ignore soft limit until it is explicitly specified

And the diffstat:

 include/linux/memcontrol.h |  12 +-
 mm/memcontrol.c            | 438 +++++---------------------------------------
 mm/vmscan.c                |  62 ++++---
 3 files changed, 88 insertions(+), 424 deletions(-)

which sounds optimistic, doesn't it?

---
[1] I have tested this by creating a hierarchy 10 levels deep with 2
groups at each level - all of them below their soft limit and a single
group eligible for the reclaim running dd reading a lot of page cache.
The system time was within stdev compared to the previous
implementation.
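The eligibility rule the cover letter describes (including the patch 4 "ignore until explicitly specified" semantics) can be modelled in a few lines. This is a sketch of the described behaviour, not the kernel implementation; the class and function names are mine:

```python
class Group:
    """A memcg node; soft_limit=None models 'never set by admin/user'."""
    def __init__(self, parent=None, soft_limit=None, usage=0):
        self.parent, self.soft_limit, self.usage = parent, soft_limit, usage

def soft_reclaim_eligible(g):
    if g.soft_limit is None:
        return True             # "user doesn't care": always reclaimable
    node = g
    while node is not None:
        # Unset ancestors are neutral: they never force children to be
        # reclaimed; a set-and-exceeded limit anywhere up the path does.
        if node.soft_limit is not None and node.usage > node.soft_limit:
            return True
        node = node.parent
    return False
```

So a group with a configured limit is shielded only while both it and all of its limited ancestors stay under their limits, while unconfigured groups are always fair game.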
* [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko @ 2013-04-23 9:33 ` Michal Hocko 2013-04-23 9:33 ` [RFC v2 2/4] memcg: Get rid of soft-limit tree infrastructure Michal Hocko ` (2 subsequent siblings) 3 siblings, 0 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw) To: linux-mm Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen Memcg soft reclaim has been traditionally triggered from the global reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim then picked up a group which exceeds the soft limit the most and reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. The infrastructure requires per-node-zone trees which hold over-limit groups and keep them up-to-date (via memcg_check_events) which is not cost free. Although this overhead hasn't turned out to be a bottle neck the implementation is suboptimal because mem_cgroup_update_tree has no idea which zones consumed memory over the limit so we could easily end up having a group on a node-zone tree having only few pages from that node-zone. This patch doesn't try to fix node-zone trees management because it seems that integrating soft reclaim into zone shrinking sounds much easier and more appropriate for several reasons. First of all 0 priority reclaim was a crude hack which might lead to big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim should be applicable also to the targeted reclaim which is awkward right now without additional hacks. Last but not least the whole infrastructure eats quite some code. After this patch shrink_zone is done in 2 passes. 
First it tries to do the soft reclaim if appropriate (only for global reclaim for now to keep compatible with the original state) and fall back to ignoring soft limit if no group is eligible to soft reclaim or nothing has been scanned during the first pass. Only groups which are over their soft limit or any of their parents up the hierarchy is over the limit are considered eligible during the first pass. Soft limit tree which is not necessary anymore will be removed in the follow up patch to make this patch smaller and easier to review. Changes since v1 - __shrink_zone doesn't return the number of shrunk groups as nr_scanned test covers both no groups scanned and no pages from the required zone as pointed by Johannes Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 10 +-- mm/memcontrol.c | 161 ++++++-------------------------------------- mm/vmscan.c | 62 ++++++++++------- 3 files changed, 59 insertions(+), 174 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d6183f0..1833c95 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) { - return 0; + return false; } static inline void mem_cgroup_split_huge_fixup(struct page *head) diff --git 
a/mm/memcontrol.c b/mm/memcontrol.c index f608546..33424d8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2060,57 +2060,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) } #endif -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, - struct zone *zone, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - struct mem_cgroup *victim = NULL; - int total = 0; - int loop = 0; - unsigned long excess; - unsigned long nr_scanned; - struct mem_cgroup_reclaim_cookie reclaim = { - .zone = zone, - .priority = 0, - }; +/* + * A group is eligible for the soft limit reclaim if it is + * a) is over its soft limit + * b) any parent up the hierarchy is over its soft limit + */ +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +{ + struct mem_cgroup *parent = memcg; - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; - - while (1) { - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); - if (!victim) { - loop++; - if (loop >= 2) { - /* - * If we have not been able to reclaim - * anything, it might because there are - * no reclaimable pages under this hierarchy - */ - if (!total) - break; - /* - * We want to do more targeted reclaim. - * excess >> 2 is not to excessive so as to - * reclaim too much, nor too less that we keep - * coming back to reclaim from this cgroup - */ - if (total >= (excess >> 2) || - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) - break; - } - continue; - } - if (!mem_cgroup_reclaimable(victim, false)) - continue; - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, - zone, &nr_scanned); - *total_scanned += nr_scanned; - if (!res_counter_soft_limit_excess(&root_memcg->res)) - break; + if (res_counter_soft_limit_excess(&memcg->res)) + return true; + + /* + * If any parent up the hierarchy is over its soft limit then we + * have to obey and reclaim from this group as well. 
+ */ + while((parent = parent_mem_cgroup(parent))) { + if (res_counter_soft_limit_excess(&parent->res)) + return true; } - mem_cgroup_iter_break(root_memcg, victim); - return total; + + return false; } /* @@ -4724,98 +4695,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, return ret; } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - unsigned long nr_reclaimed = 0; - struct mem_cgroup_per_zone *mz, *next_mz = NULL; - unsigned long reclaimed; - int loop = 0; - struct mem_cgroup_tree_per_zone *mctz; - unsigned long long excess; - unsigned long nr_scanned; - - if (order > 0) - return 0; - - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); - /* - * This loop can run a while, specially if mem_cgroup's continuously - * keep exceeding their soft limit and putting the system under - * pressure - */ - do { - if (next_mz) - mz = next_mz; - else - mz = mem_cgroup_largest_soft_limit_node(mctz); - if (!mz) - break; - - nr_scanned = 0; - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, - gfp_mask, &nr_scanned); - nr_reclaimed += reclaimed; - *total_scanned += nr_scanned; - spin_lock(&mctz->lock); - - /* - * If we failed to reclaim anything from this memory cgroup - * it is time to move on to the next cgroup - */ - next_mz = NULL; - if (!reclaimed) { - do { - /* - * Loop until we find yet another one. - * - * By the time we get the soft_limit lock - * again, someone might have aded the - * group back on the RB tree. Iterate to - * make sure we get a different mem. 
- * mem_cgroup_largest_soft_limit_node returns - * NULL if no other cgroup is present on - * the tree - */ - next_mz = - __mem_cgroup_largest_soft_limit_node(mctz); - if (next_mz == mz) - css_put(&next_mz->memcg->css); - else /* next_mz == NULL or other memcg */ - break; - } while (1); - } - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); - excess = res_counter_soft_limit_excess(&mz->memcg->res); - /* - * One school of thought says that we should not add - * back the node to the tree if reclaim returns 0. - * But our reclaim could return 0, simply because due - * to priority we are exposing a smaller subset of - * memory to reclaim from. Consider this as a longer - * term TODO. - */ - /* If excess == 0, no tree ops */ - __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - css_put(&mz->memcg->css); - loop++; - /* - * Could not reclaim anything and there are no more - * mem cgroups to try or we seem to be looping without - * reclaiming anything. - */ - if (!nr_reclaimed && - (next_mz == NULL || - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) - break; - } while (!nr_reclaimed); - if (next_mz) - css_put(&next_mz->memcg->css); - return nr_reclaimed; -} - /** * mem_cgroup_force_empty_list - clears LRU of a group * @memcg: group to clear diff --git a/mm/vmscan.c b/mm/vmscan.c index df78d17..0d0c9e7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -138,11 +138,21 @@ static bool global_reclaim(struct scan_control *sc) { return !sc->target_mem_cgroup; } + +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) +{ + return global_reclaim(sc); +} #else static bool global_reclaim(struct scan_control *sc) { return true; } + +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) +{ + return false; +} #endif static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) @@ -1942,7 +1952,8 @@ static inline bool should_continue_reclaim(struct zone *zone, } } -static void shrink_zone(struct zone *zone, 
struct scan_control *sc) +static void +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) { unsigned long nr_reclaimed, nr_scanned; @@ -1961,6 +1972,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) do { struct lruvec *lruvec; + if (soft_reclaim && + !mem_cgroup_soft_reclaim_eligible(memcg)) { + memcg = mem_cgroup_iter(root, memcg, &reclaim); + continue; + } + lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -1986,6 +2003,24 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) sc->nr_scanned - nr_scanned, sc)); } + +static void shrink_zone(struct zone *zone, struct scan_control *sc) +{ + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); + unsigned long nr_scanned = sc->nr_scanned; + + __shrink_zone(zone, sc, do_soft_reclaim); + + /* + * No group is over the soft limit or those that are do not have + * pages in the zone we are reclaiming so we have to reclaim everybody + */ + if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) { + __shrink_zone(zone, sc, false); + return; + } +} + /* Returns true if compaction should go ahead for a high-order request */ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) { @@ -2047,8 +2082,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) { struct zoneref *z; struct zone *zone; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; bool aborted_reclaim = false; /* @@ -2088,18 +2121,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) continue; } } - /* - * This steals pages from memory cgroups over softlimit - * and returns the number of reclaimed pages and - * scanned pages. This works for global memory pressure - * and balancing, not for a memcg's limit. 
- */ - nr_soft_scanned = 0; - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - sc->order, sc->gfp_mask, - &nr_soft_scanned); - sc->nr_reclaimed += nr_soft_reclaimed; - sc->nr_scanned += nr_soft_scanned; /* need some check for avoid more shrink_zone() */ } @@ -2620,8 +2641,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int i; int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ struct reclaim_state *reclaim_state = current->reclaim_state; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .may_unmap = 1, @@ -2720,15 +2739,6 @@ loop_again: sc.nr_scanned = 0; - nr_soft_scanned = 0; - /* - * Call soft limit reclaim before calling shrink_zone. - */ - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - order, sc.gfp_mask, - &nr_soft_scanned); - sc.nr_reclaimed += nr_soft_reclaimed; - /* * We put equal pressure on every zone, unless * one zone has way too many pages free -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 46+ messages in thread
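[Editor's note: the eligibility rule this patch introduces — a group is soft-reclaimed if it, or any parent up the hierarchy, exceeds its soft limit — can be modeled in plain userspace C. The sketch below uses toy structs as stand-ins for struct mem_cgroup and res_counter_soft_limit_excess(); the names and fields are illustrative, not the kernel API.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup; units are arbitrary (think pages). */
struct memcg {
	unsigned long usage;		/* current charge */
	unsigned long soft_limit;
	struct memcg *parent;
};

/* Stand-in for res_counter_soft_limit_excess(). */
static unsigned long soft_limit_excess(const struct memcg *m)
{
	return m->usage > m->soft_limit ? m->usage - m->soft_limit : 0;
}

/*
 * Mirrors mem_cgroup_soft_reclaim_eligible() from the patch above:
 * a group is eligible if it is over its soft limit, or if any parent
 * up the hierarchy is over its soft limit.
 */
static int soft_reclaim_eligible(const struct memcg *memcg)
{
	const struct memcg *parent = memcg;

	if (soft_limit_excess(memcg))
		return 1;
	while ((parent = parent->parent))
		if (soft_limit_excess(parent))
			return 1;
	return 0;
}
```

With a parent A over its soft limit, a child B that is under its own limit is still eligible because it "has to obey" A's excess; once A drops back under its limit, B is skipped by the soft pass in __shrink_zone(), and if nothing was scanned, shrink_zone() falls back to the unconditional pass.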
* [RFC v2 2/4] memcg: Get rid of soft-limit tree infrastructure 2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko 2013-04-23 9:33 ` [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko @ 2013-04-23 9:33 ` Michal Hocko 2013-04-23 9:33 ` [RFC v2 3/4] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko 2013-04-23 9:33 ` [RFC v2 4/4] memcg: Ignore soft limit until it is explicitly specified Michal Hocko 3 siblings, 0 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw) To: linux-mm Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen Now that the soft limit is integrated directly into the reclaim path, the whole soft-limit tree infrastructure is not needed anymore. Rip it out. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 251 +------------------------------------------------------ 1 file changed, 1 insertion(+), 250 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 33424d8..d927e2e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -39,7 +39,6 @@ #include <linux/limits.h> #include <linux/export.h> #include <linux/mutex.h> -#include <linux/rbtree.h> #include <linux/slab.h> #include <linux/swap.h> #include <linux/swapops.h> @@ -136,7 +135,6 @@ static const char * const mem_cgroup_lru_names[] = { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, - MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; @@ -172,10 +170,6 @@ struct mem_cgroup_per_zone { struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; - struct rb_node tree_node; /* RB tree node */ - unsigned long long usage_in_excess;/* Set to the value by which */ - /* the soft limit is exceeded*/ - bool on_tree; struct mem_cgroup *memcg; /* Back pointer, we cannot */ /* use container_of */ }; @@ -188,26 +182,6 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[0]; }; -/* - * Cgroups above their limits are maintained in a RB-Tree, independent of - * their hierarchy representation - */ - -struct mem_cgroup_tree_per_zone { - struct rb_root rb_root; - spinlock_t lock; -}; - -struct mem_cgroup_tree_per_node { - struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES]; -}; - -struct mem_cgroup_tree { - struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; -}; - -static struct mem_cgroup_tree soft_limit_tree __read_mostly; - struct mem_cgroup_threshold { struct eventfd_ctx *eventfd; u64 threshold; @@ -528,7 +502,6 @@ static bool move_file(void) * limit reclaim to prevent infinite loops, if they ever occur. */ #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 -#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2 enum charge_type { MEM_CGROUP_CHARGE_TYPE_CACHE = 0, @@ -741,164 +714,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page) return mem_cgroup_zoneinfo(memcg, nid, zid); } -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_node_zone(int nid, int zid) -{ - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_from_page(struct page *page) -{ - int nid = page_to_nid(page); - int zid = page_zonenum(page); - - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static void -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz, - unsigned long long new_usage_in_excess) -{ - struct rb_node **p = &mctz->rb_root.rb_node; - struct rb_node *parent = NULL; - struct mem_cgroup_per_zone *mz_node; - - if (mz->on_tree) - return; - - mz->usage_in_excess = new_usage_in_excess; - if (!mz->usage_in_excess) - return; - while (*p) { - parent = *p; - mz_node = rb_entry(parent, struct mem_cgroup_per_zone, - tree_node); - if (mz->usage_in_excess < mz_node->usage_in_excess) - p = &(*p)->rb_left; - /* - * 
We can't avoid mem cgroups that are over their soft - * limit by the same amount - */ - else if (mz->usage_in_excess >= mz_node->usage_in_excess) - p = &(*p)->rb_right; - } - rb_link_node(&mz->tree_node, parent, p); - rb_insert_color(&mz->tree_node, &mctz->rb_root); - mz->on_tree = true; -} - -static void -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - if (!mz->on_tree) - return; - rb_erase(&mz->tree_node, &mctz->rb_root); - mz->on_tree = false; -} - -static void -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - spin_lock(&mctz->lock); - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - spin_unlock(&mctz->lock); -} - - -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) -{ - unsigned long long excess; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - int nid = page_to_nid(page); - int zid = page_zonenum(page); - mctz = soft_limit_tree_from_page(page); - - /* - * Necessary to update all ancestors when hierarchy is used. - * because their event counter is not touched. - */ - for (; memcg; memcg = parent_mem_cgroup(memcg)) { - mz = mem_cgroup_zoneinfo(memcg, nid, zid); - excess = res_counter_soft_limit_excess(&memcg->res); - /* - * We have to update the tree if mz is on RB-tree or - * mem is over its softlimit. - */ - if (excess || mz->on_tree) { - spin_lock(&mctz->lock); - /* if on-tree, remove it */ - if (mz->on_tree) - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - /* - * Insert again. mz->usage_in_excess will be updated. - * If excess is 0, no tree ops. 
- */ - __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - } - } -} - -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) -{ - int node, zone; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - - for_each_node(node) { - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - mz = mem_cgroup_zoneinfo(memcg, node, zone); - mctz = soft_limit_tree_node_zone(node, zone); - mem_cgroup_remove_exceeded(memcg, mz, mctz); - } - } -} - -static struct mem_cgroup_per_zone * -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct rb_node *rightmost = NULL; - struct mem_cgroup_per_zone *mz; - -retry: - mz = NULL; - rightmost = rb_last(&mctz->rb_root); - if (!rightmost) - goto done; /* Nothing to reclaim from */ - - mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node); - /* - * Remove the node now but someone else can add it back, - * we will to add it back at the end of reclaim to its correct - * position in the tree. - */ - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); - if (!res_counter_soft_limit_excess(&mz->memcg->res) || - !css_tryget(&mz->memcg->css)) - goto retry; -done: - return mz; -} - -static struct mem_cgroup_per_zone * -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct mem_cgroup_per_zone *mz; - - spin_lock(&mctz->lock); - mz = __mem_cgroup_largest_soft_limit_node(mctz); - spin_unlock(&mctz->lock); - return mz; -} - /* * Implementation Note: reading percpu statistics for memcg. 
* @@ -1052,9 +867,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; - case MEM_CGROUP_TARGET_SOFTLIMIT: - next = val + SOFTLIMIT_EVENTS_TARGET; - break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -1077,11 +889,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { - bool do_softlimit; bool do_numainfo __maybe_unused; - do_softlimit = mem_cgroup_event_ratelimit(memcg, - MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -1089,8 +898,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); - if (unlikely(do_softlimit)) - mem_cgroup_update_tree(memcg, page); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -1923,28 +1730,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg, return total; } -/** - * test_mem_cgroup_node_reclaimable - * @memcg: the target memcg - * @nid: the node ID to be checked. - * @noswap : specify true here if the user wants flle only information. - * - * This function returns whether the specified memcg contains any - * reclaimable pages on a node. Returns true if there are any reclaimable - * pages in the node. 
- */ -static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg, - int nid, bool noswap) -{ - if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE)) - return true; - if (noswap || !total_swap_pages) - return false; - if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON)) - return true; - return false; - -} #if MAX_NUMNODES > 1 /* @@ -2053,11 +1838,6 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) { return 0; } - -static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) -{ - return test_mem_cgroup_node_reclaimable(memcg, 0, noswap); -} #endif /* @@ -2932,9 +2712,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, unlock_page_cgroup(pc); /* - * "charge_statistics" updated event counter. Then, check it. - * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. - * if they exceeds softlimit. + * "charge_statistics" updated event counter. */ memcg_check_events(memcg, page); } @@ -6053,8 +5831,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) for (zone = 0; zone < MAX_NR_ZONES; zone++) { mz = &pn->zoneinfo[zone]; lruvec_init(&mz->lruvec); - mz->usage_in_excess = 0; - mz->on_tree = false; mz->memcg = memcg; } memcg->info.nodeinfo[node] = pn; @@ -6110,7 +5886,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) int node; size_t size = memcg_size(); - mem_cgroup_remove_from_trees(memcg); free_css_id(&mem_cgroup_subsys, &memcg->css); for_each_node(node) @@ -6192,29 +5967,6 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(parent_mem_cgroup); -static void __init mem_cgroup_soft_limit_tree_init(void) -{ - struct mem_cgroup_tree_per_node *rtpn; - struct mem_cgroup_tree_per_zone *rtpz; - int tmp, node, zone; - - for_each_node(node) { - tmp = node; - if (!node_state(node, N_NORMAL_MEMORY)) - tmp = -1; - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); - BUG_ON(!rtpn); - - soft_limit_tree.rb_tree_per_node[node] = rtpn; - - 
for (zone = 0; zone < MAX_NR_ZONES; zone++) { - rtpz = &rtpn->rb_tree_per_zone[zone]; - rtpz->rb_root = RB_ROOT; - spin_lock_init(&rtpz->lock); - } - } -} - static struct cgroup_subsys_state * __ref mem_cgroup_css_alloc(struct cgroup *cont) { @@ -6990,7 +6742,6 @@ static int __init mem_cgroup_init(void) { hotcpu_notifier(memcg_cpu_hotplug_callback, 0); enable_swap_cgroup(); - mem_cgroup_soft_limit_tree_init(); memcg_stock_init(); return 0; } -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC v2 3/4] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko 2013-04-23 9:33 ` [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-04-23 9:33 ` [RFC v2 2/4] memcg: Get rid of soft-limit tree infrastructure Michal Hocko @ 2013-04-23 9:33 ` Michal Hocko 2013-04-23 9:33 ` [RFC v2 4/4] memcg: Ignore soft limit until it is explicitly specified Michal Hocko 3 siblings, 0 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw) To: linux-mm Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen Soft reclaim has so far been done only for global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore, as the soft limit reclaim doesn't use any special code paths and is a part of the zone shrinking code, which is used by both global and targeted reclaims. From a semantic point of view it is even natural to consider the soft limit before touching all groups in a hierarchy tree which is hitting the hard limit, because the soft limit tells us where to push back when there is memory pressure. It is not important whether the pressure comes from the limit or from imbalanced zones. This patch simply enables soft reclaim unconditionally in mem_cgroup_should_soft_reclaim so it is enabled for both global and targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn about the root of the reclaim to know where to stop checking the soft limit state of parents up the hierarchy. Say we have A (over soft limit) \ B (below s.l., hit the hard limit) / \ C D (below s.l.)
B is the source of the outside memory pressure now for D, but we shouldn't soft reclaim it because it is behaving well under the B subtree and we can still reclaim from C (presumably it is over the limit). mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (the root of the memory pressure). Changes since v1 - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 6 ++++-- mm/memcontrol.c | 14 +++++++++----- mm/vmscan.c | 4 ++-- 3 files changed, 15 insertions(+), 9 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1833c95..80ed1b6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -179,7 +179,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -356,7 +357,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { return false; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d927e2e..14d3d23 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1841,11 +1841,13 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) #endif /* - * A group is eligible for the soft limit reclaim if it is - * a) is over its soft limit + * A group is eligible for the soft limit reclaim under the given root + * hierarchy if + * a) it is over its soft limit + * b) any parent up the hierarchy is over its soft limit */ -bool
mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; @@ -1853,12 +1855,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) return true; /* - * If any parent up the hierarchy is over its soft limit then we - * have to obey and reclaim from this group as well. + * If any parent up to the root in the hierarchy is over its soft limit + * then we have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) return true; + if (parent == root) + break; } return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index 0d0c9e7..471bf94 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -141,7 +141,7 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return global_reclaim(sc); + return true; } #else static bool global_reclaim(struct scan_control *sc) @@ -1973,7 +1973,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) struct lruvec *lruvec; if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg)) { + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 46+ messages in thread
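[Editor's note: the stop-at-root behaviour described for the A/B/C/D hierarchy above can be checked with a userspace sketch. As before, the structs are toy stand-ins for struct mem_cgroup, not kernel code.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup; units are arbitrary. */
struct memcg {
	unsigned long usage;
	unsigned long soft_limit;
	struct memcg *parent;
};

static unsigned long soft_limit_excess(const struct memcg *m)
{
	return m->usage > m->soft_limit ? m->usage - m->soft_limit : 0;
}

/*
 * Mirrors the v2 mem_cgroup_soft_reclaim_eligible(): climb the
 * hierarchy but stop at the root of the current reclaim, so that the
 * soft limit excess of an ancestor outside the pressured subtree does
 * not leak into a targeted (hard-limit) reclaim.
 */
static int soft_reclaim_eligible(const struct memcg *memcg,
				 const struct memcg *root)
{
	const struct memcg *parent = memcg;

	if (soft_limit_excess(memcg))
		return 1;
	while ((parent = parent->parent)) {
		if (soft_limit_excess(parent))
			return 1;
		if (parent == root)
			break;
	}
	return 0;
}
```

For the commit message's example: when B hits its hard limit and becomes the reclaim root, D (well-behaved under B) is left alone and C (over its soft limit) takes the pressure; under global reclaim, A's excess makes D eligible again.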
* [RFC v2 4/4] memcg: Ignore soft limit until it is explicitly specified 2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko ` (2 preceding siblings ...) 2013-04-23 9:33 ` [RFC v2 3/4] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko @ 2013-04-23 9:33 ` Michal Hocko 3 siblings, 0 replies; 46+ messages in thread From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw) To: linux-mm Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen The soft limit has traditionally been initialized to RESOURCE_MAX, which means that the group is soft-unlimited by default and so gets reclaimed only after all groups that set their limit are below their limits. While this scheme works, it is not ideal because it makes it hard to configure isolated workloads without setting a limit on basically all groups. Let's consider the following simple hierarchy __A_____ / \ \ A1....An C and let's assume we would like to keep C's working set intact as much as possible (with the soft limit set to the estimated working set size) so that A{i} groups do not interfere with it (A{i} might represent backup processes or other maintenance activities which can consume quite a lot of memory). If A{i} groups have a default soft limit then C would be preferred for reclaim until it eventually gets down to its soft limit, and only then would it be reclaimed again, together with the A{i}, as the memory pressure from A{i} grows. There are basically 2 options for handling A{i} groups: - distribute the hard limit to (A.limit - C.soft_limit) - set the soft limit to 0 The first option is impractical because it would throttle A{i} even though there is quite some idle memory lying around. The latter option would certainly work because A{i} would get reclaimed whenever there is pressure coming from A.
This, however, basically disables any soft limit settings down the A{i} hierarchies, which sounds unnecessarily strict (not to mention that we would have to set up a limit for every A{i}). Moreover, if A is the root memcg then there is no reasonable way to make it stop interfering with other loads, because setting the soft limit would kill the limits downwards and the hard limit cannot be set. Neither of the extremes - unlimited vs. 0 - is ideal, apparently. There is a compromise we can make, though. This patch doesn't change the default soft limit value. Instead, it distinguishes groups with the soft limit enabled - it has been set by a user - from those with it disabled, which is the default. Unlike groups with the limit set to 0, such groups do not propagate their reclaimable state down the hierarchy, so they act only for themselves. Getting back to the previous example: only C would get a limit from the admin, and the reclaim would reclaim all A{i}, and eventually C when it crosses its limit. This means that the soft limit is much easier to maintain now, because only those groups that are interesting (those for which the administrator knows how much pushback makes sense for graceful overcommit handling) need to be taken care of, and the rest of the groups are reclaimed proportionally. TODO: How do we present default unlimited vs. RESOURCE_MAX set by the user? One possible way could be returning -1 for RES_SOFT_LIMIT && !soft_limited. TODO: update doc Changes since v1 - return -1 when reading memory.soft_limit_in_bytes for unlimited groups. - reorganized checks in mem_cgroup_soft_reclaim_eligible to be more readable. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 32 +++++++++++++++++++++++++++----- 1 file changed, 27 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 14d3d23..03ddbcc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -266,6 +266,10 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree?
*/ bool use_hierarchy; + /* + * Is the group soft limited? + */ + bool soft_limited; unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ bool oom_lock; @@ -1843,14 +1847,20 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) /* * A group is eligible for the soft limit reclaim under the given root * hierarchy if - * a) it is over its soft limit - * b) any parent up the hierarchy is over its soft limit + * a) doesn't have any soft limit set + * b) is over its soft limit + * c) any parent up the hierarchy is over its soft limit */ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; + /* No specific soft limit set, eligible for soft reclaim */ + if (!memcg->soft_limited) + return true; + + /* Soft limit exceeded, eligible for soft reclaim */ if (res_counter_soft_limit_excess(&memcg->res)) return true; @@ -1859,7 +1869,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, * then we have to obey and reclaim from this group as well. 
*/ while((parent = parent_mem_cgroup(parent))) { - if (res_counter_soft_limit_excess(&parent->res)) + if (parent->soft_limited && + res_counter_soft_limit_excess(&parent->res)) return true; if (parent == root) break; @@ -4754,10 +4765,13 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft, switch (type) { case _MEM: - if (name == RES_USAGE) + if (name == RES_USAGE) { val = mem_cgroup_usage(memcg, false); - else + } else if (name == RES_SOFT_LIMIT && !memcg->soft_limited) { + return simple_read_from_buffer(buf, nbytes, ppos, "-1\n", 3); + } else { val = res_counter_read_u64(&memcg->res, name); + } break; case _MEMSWAP: if (name == RES_USAGE) @@ -5019,6 +5033,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, ret = res_counter_set_soft_limit(&memcg->res, val); else ret = -EINVAL; + + /* + * We could disable soft_limited when we get RESOURCE_MAX but + * then we have a little problem to distinguish the default + * unlimited and limitted but never soft reclaimed groups. + */ + if (!ret) + memcg->soft_limited = true; break; default: ret = -EINVAL; /* should be BUG() ? */ -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 46+ messages in thread
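[Editor's note: the effect of the soft_limited flag on eligibility can likewise be sketched in userspace — again with illustrative toy structs, not the kernel API. The key difference from a limit of 0 is that an unset limit makes the group itself always reclaimable without propagating anything down the hierarchy.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup with the new soft_limited flag. */
struct memcg {
	unsigned long usage;
	unsigned long soft_limit;
	int soft_limited;		/* set once a user writes a limit */
	struct memcg *parent;
};

static unsigned long soft_limit_excess(const struct memcg *m)
{
	return m->soft_limited && m->usage > m->soft_limit ?
		m->usage - m->soft_limit : 0;
}

/*
 * Mirrors the v2 eligibility rules: a group with no soft limit
 * configured is always fair game for soft reclaim, a group with a
 * limit is eligible only when it (or a soft-limited ancestor up to the
 * reclaim root) exceeds it.
 */
static int soft_reclaim_eligible(const struct memcg *memcg,
				 const struct memcg *root)
{
	const struct memcg *parent = memcg;

	if (!memcg->soft_limited)
		return 1;
	if (soft_limit_excess(memcg))
		return 1;
	while ((parent = parent->parent)) {
		if (parent->soft_limited && soft_limit_excess(parent))
			return 1;
		if (parent == root)
			break;
	}
	return 0;
}
```

This matches the A{i}/C scenario from the commit message: the A{i} keep their default (unset) limit and are reclaimed proportionally, while C is spared until it crosses the limit the admin configured for it.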
* Re: memcg: softlimit on internal nodes 2013-04-22 18:30 ` Tejun Heo 2013-04-23 9:29 ` Michal Hocko 2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko @ 2013-04-24 21:45 ` Johannes Weiner 2013-04-25 0:33 ` Tejun Heo 2 siblings, 1 reply; 46+ messages in thread From: Johannes Weiner @ 2013-04-24 21:45 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen On Mon, Apr 22, 2013 at 11:30:20AM -0700, Tejun Heo wrote: > Hey, > > On Mon, Apr 22, 2013 at 06:20:12PM +0200, Michal Hocko wrote: > > Although the default limit is correct it is impractical for use > > because it doesn't allow for "I behave do not reclaim me if you can" > > cases. And we can implement such a behavior really easily with backward > > compatibility and new interfaces (aka reuse the soft limit for that). > > Okay, now we're back to square one and I'm reinstating all the mean > things I said in this thread. :P No wonder everyone is so confused > about this. Michal, you can't overload two controls which exert > pressure on the opposite direction onto a single knob and define a > sane hierarchical behavior for it. You're making it a point control > rather than range one. Maybe you can define some twisted rules > serving certain specific use case, but it's gonna be confusing / > broken for different use cases. Historically soft limit meant prioritizing certain memcgs over others and the memcgs over their soft limit should experience relatively more reclaim pressure than the ones below their soft limit. Now, if we go and say you are only reclaimed when you exceed your soft limit we would retain the prioritization aspect. Groups in excess of their soft limits would still experience relatively more reclaim pressure than their well-behaved peers. But it would have the nice side effect of acting more or less like a guarantee as well. 
I don't think this approach is as unreasonable as you make it out to be, but it does make things more complicated. It could be argued that we should add a separate guarantee knob because two simple knobs might be better than a complicated one. The question is whether this solves Google's problem, though. Currently, when a memcg is selected for a certain type of reclaim, it and all its children are treated as one single leaf entity in the overall hierarchy: when a parent node hits its hard limit, we assume equal fault of every member in the hierarchy for that situation and, consequently, we reclaim all of them equally. We do the same thing for the soft limit: if the parent, whose memory consumption is defined as the sum of memory consumed by all members of the hierarchy, breaches the soft limit then all members are reclaimed equally because no single member is more at fault than the others. I would expect if we added a guarantee knob, this would also mean that no individual memcg can be treated as being within their guaranteed memory if the hierarchy as a whole is in excess of its guarantee. The root of the hierarchy represents the whole hierarchy. Its memory usage is the combined memory usage of all members. The limit set to the hierarchy root applies to the combined memory usage of the hierarchy. Breaching that limit has consequences for the hierarchy as a whole. Be it soft limit or guarantee. This is how hierarchies have always worked and it allows limits to be layered and apply depending on the source of pressure: root (physical memory = 32G) / \ A B (hard limit = 25G, guarantee = 16G) / \ / \ A1 A2 / B2 (guarantee = 10G) / B1 (guarantee = 15G) Remember that hard limits are usually overcommitted, so you allow B to use more of the fair share of memory when A does not need it, but you want to keep it capped to keep latency reasonable when A ramps up. 
As long as B is hitting its own hard limit, you value B1's and B2's guarantees in the context of pressure local to the hierarchy; in the context of B having 25G worth of memory; in the context of B1 competing with B2 over the memory allowed by B. However, as soon as global reclaim kicks in, the context changes and the priorities shift. Now, B does not have 25G anymore but only 16G *in its competition with A*. We absolutely do not want to respect the guarantees made to B1 and B2. Not only can they not be met anyway, but they are utterly meaningless at this point. They were set with 25G in mind. [ It may be conceivable that you want different guarantees for B1 and B2 depending on where the pressure comes from. One setting for when the 25G limit applies, one setting when the 32G physical memory limit applies. Basically, every group would need a vector of guarantee settings with one setting per ancestor. That being said, I absolutely disagree with the idea of trying to adhere to individual memcg guarantees in the first reclaim cycle, regardless of context and then just ignore them on the second pass. It's a horrible way to guess which context the admin had in mind. ] Now, there is of course the other scenario in which the current hierarchical limit application can get in your way: when you give intermediate nodes their own memory. Because then you may see the need to apply certain limits to that hierarchy root's local memory only instead of all memory in the hierarchy. But once we open that door, you might expect this to be an option for every limit, where even the hard limit of a hierarchy root only applies to that group's local memory instead of the whole hierarchy. I certainly do not want to apply hierarchy semantics for some limits and not for others. But Google has basically asked for hierarchical hard limits and local soft limits / guarantees. 
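The bracketed aside above about per-ancestor guarantee vectors could be sketched as follows (a purely hypothetical interface — the numbers for the global-pressure case are invented for illustration; only the 25G-column values come from the example):

```python
# Hypothetical "vector of guarantees": each group carries one guarantee
# per ancestor whose limit might be the effective constraint, and the
# applicable one is chosen by the source of pressure.
guarantees = {
    "B1": {"B": 15, "root": 9},   # 15G when B's 25G limit applies;
    "B2": {"B": 10, "root": 6},   # root-pressure values are made up
}

def effective_guarantee(group, constraining_ancestor):
    """Pick the guarantee written for the limit actually applying."""
    return guarantees[group][constraining_ancestor]

assert effective_guarantee("B1", "B") == 15     # B's hard limit applies
assert effective_guarantee("B1", "root") == 9   # global pressure applies
```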
In summary, we are now looking at both local and hierarchical limits times number of ancestors PER MEMCG to support all those use cases properly. So I'm asking what I already asked a year ago: are you guys sure you can not change your cgroup tree layout and that we have to solve it by adding new limit semantics?!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes 2013-04-24 21:45 ` memcg: softlimit on internal nodes Johannes Weiner @ 2013-04-25 0:33 ` Tejun Heo 2013-04-29 18:39 ` Johannes Weiner 0 siblings, 1 reply; 46+ messages in thread From: Tejun Heo @ 2013-04-25 0:33 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen Hello, Johannes. On Wed, Apr 24, 2013 at 05:45:31PM -0400, Johannes Weiner wrote: > Historically soft limit meant prioritizing certain memcgs over others > and the memcgs over their soft limit should experience relatively more > reclaim pressure than the ones below their soft limit. > > Now, if we go and say you are only reclaimed when you exceed your soft > limit we would retain the prioritization aspect. Groups in excess of > their soft limits would still experience relatively more reclaim > pressure than their well-behaved peers. But it would have the nice > side effect of acting more or less like a guarantee as well. But, at the same time, it has the not-so-nice side-effect of losing the ability to express negative prioritization. It isn't difficult to imagine use cases where the system doesn't want to partition the whole system into discrete cgroups but wants to limit the amount of resources consumed by low-priority workloads. Also, in the long-term, I really want cgroup to become something generally useful and automatically configurable (optional of course) by the base system according to the types of workloads. For something like that to be possible, the control knobs shouldn't be fiddly, complex, or require full partitioning of the system. > I don't think this approach is as unreasonable as you make it out to > be, but it does make things more complicated. It could be argued that > we should add a separate guarantee knob because two simple knobs might > be better than a complicated one. 
The problem that I see is that this is being done without clearing up the definition of the knob. The knob's role is being changed, or at least solidified, into something which makes it inconsistent with everything else in cgroup in a way which seems very reactive to me.

I can see such reactive customizations being useful in satisfying certain specific use cases - Google's primarily right now; however, it's likely to come back and bite us when we want to do something different or generic with cgroup. It's gonna be something which ends up being labeled as unusable in other types of setups (e.g. where not all workloads are put under active control or whatever) after causing a lot of head-scratching and not-particularly-happy moments. Cgroup as a whole strongly needs consistency across its control knobs for it to be generally useful.

Well, that and past frustrations over the interface and implementation of memcg, which seem to bear a lot of similarities with what's going on now, probably have made me go overboard. Sorry about that, but I really hope memcg does better.

...

> no single member is more at fault than the others. I would expect if
> we added a guarantee knob, this would also mean that no individual
> memcg can be treated as being within their guaranteed memory if the
> hierarchy as a whole is in excess of its guarantee.

I disagree here. It should be symmetrical to how hardlimit works. Let's say there's one parent - P - and child - C. For hardlimit, if P is over limit, it exerts pressure on its subtree regardless of C, and, if P is under limit, it doesn't affect C.

For guarantee / protection, it should work the same but in the opposite direction. If P is under limit, it should protect the subtree from reclaim regardless of C. If P is over limit, it shouldn't affect C.
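The symmetry argument can be written down directly (a model only, with hypothetical numbers, not kernel code): hard-limit pressure propagates down from any over-limit ancestor, and protection propagates down from any under-guarantee ancestor.

```python
def over_hard_limit(chain):
    """chain: list of (usage, hard_limit or None) from a cgroup up to
    the root. Pressure applies if ANY level is over its hard limit."""
    return any(l is not None and u > l for u, l in chain)

def protected(chain):
    """Mirror image: the cgroup is protected from reclaim if ANY level
    (itself or an ancestor) is still under its guarantee."""
    return any(g is not None and u <= g for u, g in chain)

# P over its 10G guarantee, but C (3G used) under its own 4G guarantee:
# per the symmetry rule, P being over doesn't affect C.
assert protected([(3, 4), (12, 10)])
# P under its 15G guarantee protects the subtree regardless of C.
assert protected([(7, 2), (12, 15)])
# Both over their guarantees: no protection left.
assert not protected([(7, 2), (12, 10)])
```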
As I draw in the other reply to Michal, each knob should be a starting point of a single range in the pre-defined direction and composition of those configurations across hierarchy should result in intersection of them. I can't see any reason to deviate from that here. IOW, protection control shouldn't care about generating memory pressure. That's the job of soft and hard limits, both of which should apparently override protection. That way, each control knob becomes fully consistent within itself across the hierarchy and the questions become those of how soft limit should override protection rather than the semantics of soft limit itself. > The root of the hierarchy represents the whole hierarchy. Its memory > usage is the combined memory usage of all members. The limit set to > the hierarchy root applies to the combined memory usage of the > hierarchy. Breaching that limit has consequences for the hierarchy as > a whole. Be it soft limit or guarantee. > > This is how hierarchies have always worked and it allows limits to be > layered and apply depending on the source of pressure: That's definitely true for soft and hard limits but flipped for guarantees and I think that's the primary source of confusion - guarantee being overloaded onto softlimit. > root (physical memory = 32G) > / \ > A B (hard limit = 25G, guarantee = 16G) > / \ / \ > A1 A2 / B2 (guarantee = 10G) > / > B1 (guarantee = 15G) > > Remember that hard limits are usually overcommitted, so you allow B to > use more of the fair share of memory when A does not need it, but you > want to keep it capped to keep latency reasonable when A ramps up. > > As long as B is hitting its own hard limit, you value B1's and B2's > guarantees in the context of pressure local to the hierarchy; in the > context of B having 25G worth of memory; in the context of B1 > competing with B2 over the memory allowed by B. > > However, as soon as global reclaim kicks in, the context changes and > the priorities shift. 
> Now, B does not have 25G anymore but only 16G *in its competition
> with A*. We absolutely do not want to respect the guarantees made to
> B1 and B2. Not only can they not be met anyway, but they are utterly
> meaningless at this point. They were set with 25G in mind.

I find the configuration confusing. What does it mean? Let's say B doesn't consume memory itself and B1 is inactive. Does that mean B2 is guaranteed up to 16G? Or is it that B2 is still guaranteed only up to 10G?

If the former, what if the intention was just to prevent B's total going past 16G and the configuration never meant to grant an extra 6G to B2?

The latter makes more sense as softlimit, but what happens when B itself consumes memory? Is B's internal consumption guaranteed any memory? If so, what if the internal usage is mostly uninteresting and the admin never meant it to get any guarantee, and it unnecessarily eats into B1's guarantee when it comes up? If not, what happens when B1 creates a sub-cgroup B11? Do all internal usages of B1 lose the guarantee?

If I'm not too confused, most of the confusion arises from the fact that guarantee's specificity is towards max (as evidenced by its default being zero) but composition through hierarchy happens in the other direction (i.e. a guarantee on an internal node exerts pressure towards zero on its subtree).

Doesn't something like the following suit what you had in mind better?

h: hardlimit, s: softlimit, g: guarantee

             root (physical memory = 32G)
            /    \
           A      B (h:25G, s:16G)
          / \    / \
        A1   A2 /   B2 (g:10G)
               /
             B1 (g:15G)

It doesn't solve any of the execution issues arising from having to enforce a 16G limit over 10G and 15G guarantees, but there is no room for misinterpreting the intention of the configuration. You could say that this is just a convenient case because it doesn't actually have nesting of the same params. Let's add one then.
             root (physical memory = 32G)
            /    \
           A      B (h:25G, s:16G, g:15G)
          / \    / \
        A1   A2 /   B2 (g:10G)
               /
             B1 (g:15G)

If we follow the rule of composition by intersection, the interpretation of B's guarantee is clear. If B's subtree is under 15G, regardless of the individual usages of B1 and B2, they shouldn't feel reclaim pressure. When B's subtree goes over 15G, B1 and B2 will have to fend for themselves. The ones which are over their own guarantee will feel the "normal" reclaim pressure; the others will continue to evade reclaim. When B's subtree goes over 16G, someone in B's subtree has to pay, preferably the ones not guaranteed anything first.

> [ It may be conceivable that you want different guarantees for B1 and
> B2 depending on where the pressure comes from. One setting for when
> the 25G limit applies, one setting when the 32G physical memory
> limit applies. Basically, every group would need a vector of
> guarantee settings with one setting per ancestor.

I don't get this. If a cgroup is under its guarantee limit and none of its parents are over their hard/soft limit, it shouldn't feel any pressure. If a cgroup is above its guarantee, it should feel the same pressure everyone else in that subtree is subject to. If any of the ancestors has triggered its soft / hard limit, it's gonna have to give up pages pretty quickly.

> That being said, I absolutely disagree with the idea of trying to
> adhere to individual memcg guarantees in the first reclaim cycle,
> regardless of context and then just ignore them on the second pass.
> It's a horrible way to guess which context the admin had in mind. ]

I think there needs to be a way to avoid penalizing sub-cgroups under their guarantee amount when there are siblings which can give out pages over guarantee. I don't think I'm following the "guessing the intention" part. Can you please elaborate?
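The staged behavior Tejun describes for B (g:15G, s:16G) can be modeled in a few lines (a sketch of the proposed semantics only, not kernel code; the pressure labels are made up for illustration):

```python
def pressure(b_subtree, child_usage, child_guarantee,
             b_guarantee=15, b_softlimit=16):
    """Reclaim pressure on one of B's children (sizes in GB): none
    while B's subtree is under B's guarantee; each child's own
    guarantee rules between 15G and 16G; over the soft limit, the
    groups over their own guarantee pay first."""
    if b_subtree <= b_guarantee:
        return "none"                     # whole subtree protected
    if child_usage <= child_guarantee:
        return "none" if b_subtree <= b_softlimit else "last-resort"
    return "normal" if b_subtree <= b_softlimit else "first"

assert pressure(14, 12, 10) == "none"      # B's subtree under 15G
assert pressure(15.5, 12, 10) == "normal"  # child over its own guarantee
assert pressure(15.5, 9, 10) == "none"     # child within its own guarantee
assert pressure(17, 12, 10) == "first"     # B's soft limit breached
```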
> Now, there is of course the other scenario in which the current > hierarchical limit application can get in your way: when you give > intermediate nodes their own memory. Because then you may see the > need to apply certain limits to that hierarchy root's local memory > only instead of all memory in the hierarchy. But once we open that > door, you might expect this to be an option for every limit, where > even the hard limit of a hierarchy root only applies to that group's > local memory instead of the whole hierarchy. I certainly do not want > to apply hierarchy semantics for some limits and not for others. But > Google has basically asked for hierarchical hard limits and local soft > limits / guarantees. So, proportional controllers need this. They need to be able to configure the amount the tasks belonging to an inner node can consume when competing against the children groups. It isn't a particularly pretty thing but a necessity given that we allow tasks and resource consumptions in inner nodes. I was wondering about this and asked Michal whether anybody wants something like that and IIRC his answer was negative. Can you please expand on what google asked for? Thanks. -- tejun
* Re: memcg: softlimit on internal nodes 2013-04-25 0:33 ` Tejun Heo @ 2013-04-29 18:39 ` Johannes Weiner 0 siblings, 0 replies; 46+ messages in thread From: Johannes Weiner @ 2013-04-29 18:39 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse, Greg Thelen On Wed, Apr 24, 2013 at 05:33:35PM -0700, Tejun Heo wrote: > Hello, Johannes. > > On Wed, Apr 24, 2013 at 05:45:31PM -0400, Johannes Weiner wrote: > > Historically soft limit meant prioritizing certain memcgs over others > > and the memcgs over their soft limit should experience relatively more > > reclaim pressure than the ones below their soft limit. > > > > Now, if we go and say you are only reclaimed when you exceed your soft > > limit we would retain the prioritization aspect. Groups in excess of > > their soft limits would still experience relatively more reclaim > > pressure than their well-behaved peers. But it would have the nice > > side effect of acting more or less like a guarantee as well. > > But, at the same time, it has the not-so-nice side-effect of losing > the ability to express negative prioritization. It isn't difficult to > imagine use cases where the system doesn't want to partition the whole > system into discrete cgroups but wants to limit the amount of > resources consumed by low-priority workloads. > > Also, in the long-term, I really want cgroup to become something > generally useful and automatically configurable (optional of course) > by the base system according to the types of workloads. For something > like that to be possible, the control knobs shouldn't be fiddly, > complex, or require full partitioning of the system. > > > I don't think this approach is as unreasonable as you make it out to > > be, but it does make things more complicated. 
It could be argued that > > we should add a separate guarantee knob because two simple knobs might > > be better than a complicated one. > > The problem that I see is that this is being done without clearing up > the definition of the knob. The knob's role is being changed or at > least solidified into something which makes it inconsistent with > everything else in cgroup in a way which seems very reactive to me. > > I can see such reactive customizations being useful in satisfying > certain specific use cases - google's primarily right now; however, > it's likely to come back and bite us when we want to do something > different or generic with cgroup. It's gonna be something which ends > up being labeled as unusuable in other types of setups (e.g. where not > all workloads are put under active control or whatever) after causing > a lot of head-scratching and not-particularly-happy moments. Cgroup > as a whole strongly needs consistency across its control knobs for it > to be generally useful. > > Well, that and past frustrations over interface and implementations of > memcg, which seems to bear a lot of similarities with what's going on > now, probably have made me go over-board. Sorry about that, but I > really hope memcg do better. I understand your frustration, I want to get it right as well before committing to anything. > > no single member is more at fault than the others. I would expect if > > we added a guarantee knob, this would also mean that no individual > > memcg can be treated as being within their guaranteed memory if the > > hierarchy as a whole is in excess of its guarantee. > > I disagree here. It should be symmetrical to how hardlimit works. > Let's say there's one parent - P - and child - C. For hardlimit, if P > is over limit, it exerts pressure on its subtree regardless of C, and, > if P is under limit, it doesn't affect C. > > For guarantee / protection, it should work the same but in the > opposite direction. 
If P is under limit, it should protect the > subtree from reclaim regardless of C. If P is over limit, it > shouldn't affect C. > > As I draw in the other reply to Michal, each knob should be a starting > point of a single range in the pre-defined direction and composition > of those configurations across hierarchy should result in intersection > of them. I can't see any reason to deviate from that here. > > IOW, protection control shouldn't care about generating memory > pressure. That's the job of soft and hard limits, both of which > should apparently override protection. That way, each control knob > becomes fully consistent within itself across the hierarchy and the > questions become those of how soft limit should override protection > rather than the semantics of soft limit itself. > > > The root of the hierarchy represents the whole hierarchy. Its memory > > usage is the combined memory usage of all members. The limit set to > > the hierarchy root applies to the combined memory usage of the > > hierarchy. Breaching that limit has consequences for the hierarchy as > > a whole. Be it soft limit or guarantee. > > > > This is how hierarchies have always worked and it allows limits to be > > layered and apply depending on the source of pressure: > > That's definitely true for soft and hard limits but flipped for > guarantees and I think that's the primary source of confusion - > guarantee being overloaded onto softlimit. > > > root (physical memory = 32G) > > / \ > > A B (hard limit = 25G, guarantee = 16G) > > / \ / \ > > A1 A2 / B2 (guarantee = 10G) > > / > > B1 (guarantee = 15G) > > > > Remember that hard limits are usually overcommitted, so you allow B to > > use more of the fair share of memory when A does not need it, but you > > want to keep it capped to keep latency reasonable when A ramps up. 
> > > > As long as B is hitting its own hard limit, you value B1's and B2's > > guarantees in the context of pressure local to the hierarchy; in the > > context of B having 25G worth of memory; in the context of B1 > > competing with B2 over the memory allowed by B. > > > > However, as soon as global reclaim kicks in, the context changes and > > the priorities shift. Now, B does not have 25G anymore but only 16G > > *in its competition with A*. We absolutely do not want to respect the > > guarantees made to B1 and B2. Not only can they not be met anyway, > > but they are utterly meaningless at this point. They were set with > > 25G in mind. > > I find the configuration confusing. What does it mean? Let's say B > doesn't consume memory itself and B1 is inactive. Does that mean B2 > is guaranteed upto 16G? Or is it that B2 is still guaranteed only > upto 10G? Both. Global memory pressure will leave B and all its children alone as long as their sum memory usage is below 16G. If B2 is the only memory user in there, it means that it won't be reclaimed until it uses 16G. However, I would not call it a guarantee of 16G from B2's point of view, because it does not control B1's usage. > If former, what if the intention was just to prevent B's total going > past 16G and the configuration never meant to grant extra 6G to B2? > > The latter makes more sense as softlimit, but what happens when B > itself consumes memory? Is B's internal consumption guaranteed any > memory? If so, what if the internal usage is mostly uninteresting and > the admin never meant them to get any guarantee and it unnecessarily > eats into B1's guarantee when it comes up? If not, what happens when > B1 creates a sub-cgroup B11? Do all internal usages of B1 lose the > guarantee? 
> > If I'm not too confused, most of the confusions arise from the fact > that guarantee's specificity is towards max (as evidenced by its > default being zero) but composition through hierarchy happening in the > other direction (ie. guarantee in internal node exerts pressure > towards zero on its subtree). > > Doesn't something like the following suit what you had in mind better? > > h: hardlimit, s: softlimit, g: guarantee > > root (physical memory = 32G) > / \ > A B (h:25G, s:16G) > / \ / \ > A1 A2 / B2 (g:10G) > / > B1 (g:15G) No, because I do not want B1 to be guaranteed half of memory in case of global memory pressure, only in the case where B has 25G available. Also, a soft limit does not guarantee that everything below B is left alone as long as it is within 16G of memory. > It doesn't solve any of the execution issues arising from having to > enforce 16G limit over 10G and 15G guarnatees but there is no room for > misinterpreting the intention of the configuration. You could say > that this is just a convenient case because it doesn't actually have > nesting of the same params. Let's add one then. > > root (physical memory = 32G) > / \ > A B (h:25G, s:16G g:15G) > / \ / \ > A1 A2 / B2 (g:10G) > / > B1 (g:15G) > > If we follow the rule of composition by intersection, the > interpretation of B's guarantee is clear. If B's subtree is under > 15G, regardless of individual usages of B1 and B2, they shouldn't feel > reclaim pressure. When B's subtree goes over 15G, B1 and B2 will have > to fend off for themselves. If the ones which are over their own > guarantee will feel the "normal" reclaim pressure; otherwise, they > will continue to evade reclaim. When B's subtree goes over 16G, > someone in B's subtree have to pay, preferably the ones not guaranteed > anything first. Yes, and that's the "intention guessing" that I do not agree with. The guarantees of B1 and B2 were written for the 25G available to B without global pressure. 
They mean "if B exceeds 25G, reclaim B2 if it exceeds 10G and reclaim B1 if it exceeds 15G". All of a sudden, your actual constraint is 16G. I don't want to use the guarantees that were meant for a different memory situation as a hint to decide which group should be reclaimed first. Either we have separate limits for the 25G situation and the 16G situation or we need to express guarantees as a percentage of available memory. > > [ It may be conceivable that you want different guarantees for B1 and > > B2 depending on where the pressure comes from. One setting for when > > the 25G limit applies, one setting when the 32G physical memory > > limit applies. Basically, every group would need a vector of > > guarantee settings with one setting per ancestor. > > I don't get this. If a cgroup is under the guarantee limit and none > of its parents are under hard/softlimit, it shouldn't feel any > pressure. If a cgroup ia above guarantee, it should feel the same > pressure everyone else in that subtree is subject to. If any of the > ancestors has triggered soft / hard limit, it's gonna have to give up > pages pretty quickly. > > > That being said, I absolutely disagree with the idea of trying to > > adhere to individual memcg guarantees in the first reclaim cycle, > > regardless of context and then just ignore them on the second pass. > > It's a horrible way to guess which context the admin had in mind. ] > > I think there needs to be a way to avoid penalizing sub-cgroups under > guarnatee amount when there are siblings which can give out pages over > guarantee. I don't think I'm following the "guessing the intention" > part. Can you please elaborate? Hope this is explained above. > > Now, there is of course the other scenario in which the current > > hierarchical limit application can get in your way: when you give > > intermediate nodes their own memory. 
Because then you may see the > > need to apply certain limits to that hierarchy root's local memory > > only instead of all memory in the hierarchy. But once we open that > > door, you might expect this to be an option for every limit, where > > even the hard limit of a hierarchy root only applies to that group's > > local memory instead of the whole hierarchy. I certainly do not want > > to apply hierarchy semantics for some limits and not for others. But > > Google has basically asked for hierarchical hard limits and local soft > > limits / guarantees. > > So, proportional controllers need this. They need to be able to > configure the amount the tasks belonging to an inner node can consume > when competing against the children groups. It isn't a particularly > pretty thing but a necessity given that we allow tasks and resource > consumptions in inner nodes. I was wondering about this and asked > Michal whether anybody wants something like that and IIRC his answer > was negative. Can you please expand on what google asked for?

My understanding is that they have groups of jobs:

        G1
       /|\
      / | \
    J1  J2  J3

When a job exits, its J group is removed and its leftover cache is reparented to the G group. Obviously, they want that cache to be reclaimed over currently used job memory, but if they set the soft limit in G1 to a very low value, it means that this low soft limit applies to the G1 hierarchy as a whole.

Michal's and my suggestion was that they instead move this cache over to another sibling group that is dedicated to collecting leftover cache, i.e. Jcache, and then set the soft limit of this group to 0. OR do not delete the job groups, set their soft limits to 0, and reap the groups once memory usage in them drops to 0 (easy to do with the events interface we have that wakes you up for memory watermark events).

Both solutions, to me, sound so much simpler than starting to recognize and provide exclusive limits for local memory usage of inner nodes.
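The second suggestion (keep the job groups, zero their soft limits, reap at zero usage) is simple enough to model in a few lines; this is plain Python standing in for cgroupfs operations, with made-up names, not the actual interface:

```python
class JobGroup:
    """Toy stand-in for a per-job memcg directory."""
    def __init__(self, name, usage):
        self.name, self.usage = name, usage
        self.soft_limit = 1 << 62          # effectively unlimited

jobs = {n: JobGroup(n, u) for n, u in [("J1", 5), ("J2", 0), ("J3", 7)]}

def on_job_exit(name):
    # Keep the group instead of deleting it (deletion would reparent
    # its leftover cache to G1); a soft limit of 0 makes its cache the
    # preferred reclaim target without touching G1's own limits.
    jobs[name].soft_limit = 0

def reap():
    # Stand-in for the watermark-event-driven cleanup described above:
    # remove exited groups once reclaim has emptied them.
    for n in [n for n, j in jobs.items()
              if j.soft_limit == 0 and j.usage == 0]:
        del jobs[n]

on_job_exit("J1")
reap()
assert "J1" in jobs        # still holds cache, not reaped yet
jobs["J1"].usage = 0       # pretend reclaim drained it
reap()
assert "J1" not in jobs and "J3" in jobs
```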
end of thread, other threads: [~2013-04-29 18:39 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed -- links below jump to the message on this page)

2013-04-20  0:26 memcg: softlimit on internal nodes Tejun Heo
2013-04-20  0:42 ` Tejun Heo
2013-04-20  3:35 ` Greg Thelen
2013-04-21  1:53 ` Tejun Heo
2013-04-20  3:16 ` Michal Hocko
2013-04-21  2:23 ` Tejun Heo
2013-04-21  8:55 ` Michel Lespinasse
2013-04-22  4:24 ` Tejun Heo
2013-04-22  7:14 ` Michel Lespinasse
2013-04-22 14:48 ` Tejun Heo
2013-04-22 15:37 ` Michal Hocko
2013-04-22 15:46 ` Tejun Heo
2013-04-22 15:54 ` Michal Hocko
2013-04-22 16:01 ` Tejun Heo
2013-04-23  9:58 ` Michel Lespinasse
2013-04-23 10:17 ` Glauber Costa
2013-04-23 11:40 ` Michal Hocko
2013-04-23 11:54 ` Glauber Costa
2013-04-23 12:51 ` Michel Lespinasse
2013-04-23 13:06 ` Michal Hocko
2013-04-23 13:13 ` Glauber Costa
2013-04-23 13:28 ` Michal Hocko
2013-04-23 11:32 ` Michal Hocko
2013-04-23 12:45 ` Michel Lespinasse
2013-04-23 12:59 ` Michal Hocko
2013-04-23 12:51 ` Michal Hocko
2013-04-21 12:46 ` Michal Hocko
2013-04-22  4:39 ` Tejun Heo
2013-04-22 15:19 ` Michal Hocko
2013-04-22 15:57 ` Tejun Heo
2013-04-22 15:57 ` Tejun Heo
2013-04-22 16:20 ` Michal Hocko
2013-04-22 18:30 ` Tejun Heo
2013-04-23  9:29 ` Michal Hocko
2013-04-23 17:09 ` Tejun Heo
2013-04-26 11:51 ` Michal Hocko
2013-04-26 18:37 ` Tejun Heo
2013-04-29 15:27 ` Michal Hocko
2013-04-23  9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko
2013-04-23  9:33 ` [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko
2013-04-23  9:33 ` [RFC v2 2/4] memcg: Get rid of soft-limit tree infrastructure Michal Hocko
2013-04-23  9:33 ` [RFC v2 3/4] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
2013-04-23  9:33 ` [RFC v2 4/4] memcg: Ignore soft limit until it is explicitly specified Michal Hocko
2013-04-24 21:45 ` memcg: softlimit on internal nodes Johannes Weiner
2013-04-25  0:33 ` Tejun Heo
2013-04-29 18:39 ` Johannes Weiner