* memcg: softlimit on internal nodes
@ 2013-04-20 0:26 Tejun Heo
2013-04-20 0:42 ` Tejun Heo
2013-04-20 3:16 ` Michal Hocko
0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-20 0:26 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Hello, Michal and all.
Sorry about asking silly questions and leaving in the middle. I had a
plane to catch which I just barely made. I thought about it on the
way here and your proposal seems confused.
I think the crux of the confusion comes from the fact that you're
essentially proposing flipping the meaning of the knob for internal
nodes - it means minimum guaranteed allocation - that is, the shrinker
won't bother the cgroup if the memory consumption is under the
softlimit - and your proposal is to reverse that for cgroups with
children so that it actually means "soft" limit - creating pressure if
above the limit (IIUC, it isn't entirely that either as the pressure
is created iff the whole system is under memory pressure, right?).
Regardless of the direction of a configuration, a parent cgroup should
gate that configuration in the same direction. ie. If it's a limit
for a leaf node when reached, it also is a limit for the whole
subtree for an internal cgroup. If it's a configuration which
guarantees allocation (in the sense that it'll be excluded from memory
reclaim while under the limit), the same applies: if the subtree is
under the limit, reclaim shouldn't trigger.
For example, please consider the following hierarchy where s denotes
the "softlimit" and h hardlimit.
A (h:8G s:4G)
/ \
/ \
B (h:5G s:1G) C (h:5G s:1G)
For hard limit, nobody seems confused how the internal limit should
apply - If either B or C goes over 5G, the one going over that limit
will be on the receiving end of OOM killer. Also, even if both B and
C are individually under 5G, if the sum of the two goes over A's limit
- 8G, OOM killer will be activated on the subtree. It'd be a policy
decision whether to kill tasks from A, B or C, but no matter what,
the parent's limit will be enforced in the subtree. Note that this is
a perfectly valid configuration. It is *not* an invalid
configuration. It is exactly what the hierarchical configuration is
supposed to do.
It must not be any different for "softlimit". If B or C are
individually under 1G, they won't be targeted by the reclaimer and
even if B and C are over 1G, let's say 2G, as long as the sum is under
A's "softlimit" - 4G, reclaimer won't look at them. It is exactly the
same as hardlimit, just the opposite direction.
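The gating rule argued for in the last two paragraphs can be sketched as follows. This is a toy Python model of the proposed semantics, not kernel code; the class and function names are made up for illustration:

```python
# Sketch of the hierarchical gating rule: a cgroup may be reclaimed only
# if it exceeds its own softlimit AND every ancestor's subtree usage
# exceeds that ancestor's softlimit (an under-limit ancestor shields
# its whole subtree, the same way a parent hardlimit caps it).

class Cgroup:
    def __init__(self, name, soft_limit, usage=0, parent=None):
        self.name = name
        self.soft_limit = soft_limit
        self.usage = usage          # the group's own charge
        self.parent = parent
        self.children = []
        if parent:
            parent.children.append(self)

    def subtree_usage(self):
        return self.usage + sum(c.subtree_usage() for c in self.children)

def reclaim_eligible(cg):
    """True only if no level of the hierarchy still protects cg."""
    if cg.usage <= cg.soft_limit:
        return False                # protected by its own softlimit
    node = cg.parent
    while node:
        if node.subtree_usage() <= node.soft_limit:
            return False            # shielded by an under-limit ancestor
        node = node.parent
    return True

G = 1 << 30
A = Cgroup("A", soft_limit=4 * G)
B = Cgroup("B", soft_limit=1 * G, usage=2 * G, parent=A)  # over its own limit
C = Cgroup("C", soft_limit=1 * G, usage=1 * G, parent=A)

# B exceeds its 1G limit, but the subtree (3G) is under A's 4G,
# so under these semantics the reclaimer still won't look at B.
print(reclaim_eligible(B))  # False
```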
Now, let's consider the following hierarchy just to be sure. Let's
assume that A itself doesn't have any tasks for simplicity.
A (h:16G s:4G)
/ \
/ \
B (h:7G s:5G) C (h:7G s:5G)
For hardlimit, it is clear that A's limit won't do anything. No
matter what B and C do. In exactly the same way, A's "softlimit"
doesn't do anything regardless of what B and C do. Just like A's
hardlimit doesn't impose any further restrictions on B and C, A's
softlimit doesn't give any further guarantee to B and C. There's no
difference at all.
Now, it's completely silly that "softlimit" is actually allocation
guarantee rather than an actual limit. I guess it's born out of
similar confusion? Maybe originally the operation was a confused mix
of the two and it moved closer to guaranteeing behavior over time?
Anyways, it's apparent why actual soft limit - that is something which
creates reclaim pressure even when the system as whole isn't under
memory pressure - would be useful, and I'm actually kinda surprised
that it doesn't already exist. It isn't difficult to imagine use
cases where the user doesn't want certain services/applications (say
backup, torrent or a static http server serving large files) to
consume huge amounts of memory, yet without triggering the OOM killer. It is
something which is fundamentally useful and I think is why people are
confused and pulling the current "softlimit" towards something like
that.
If such actual soft limit is desired (I don't know, it just seems like
a very fundamental / logical feature to me), please don't try to
somehow overload "softlimit". They are two fundamentally different
knobs, both make sense in their own ways, and when you stop confusing
the two, there's nothing ambiguous about what each knob means in
hierarchical situations. This goes the same for the "untrusted" flag
Ying told me, which seems like another confused way to overload two
meanings onto "softlimit". Don't overload!
Now let's see if this gogo thing actually works.
Thanks.
--
tejun
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
* Re: memcg: softlimit on internal nodes
2013-04-20 0:26 memcg: softlimit on internal nodes Tejun Heo
@ 2013-04-20 0:42 ` Tejun Heo
2013-04-20 3:35 ` Greg Thelen
2013-04-20 3:16 ` Michal Hocko
1 sibling, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2013-04-20 0:42 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Fri, Apr 19, 2013 at 05:26:20PM -0700, Tejun Heo wrote:
> If such actual soft limit is desired (I don't know, it just seems like
> a very fundamental / logical feature to me), please don't try to
> somehow overload "softlimit". They are two fundamentally different
> knobs, both make sense in their own ways, and when you stop confusing
> the two, there's nothing ambiguous about what each knob means in
> hierarchical situations. This goes the same for the "untrusted" flag
> Ying told me, which seems like another confused way to overload two
> meanings onto "softlimit". Don't overload!
As for how actually to clean up this yet another mess in memcg, I
don't know. Maybe introduce completely new knobs - say,
oom_threshold, reclaim_threshold, and reclaim_trigger - and alias
hardlimit to oom_threshold and softlimit to reclaim_trigger? BTW,
"softlimit" should default to 0. Nothing else makes any sense.
Maybe you can gate it with "sane_behavior" flag or something. I don't
know. It's your mess to clean up. :P
Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-20 0:26 memcg: softlimit on internal nodes Tejun Heo
2013-04-20 0:42 ` Tejun Heo
@ 2013-04-20 3:16 ` Michal Hocko
2013-04-21 2:23 ` Tejun Heo
1 sibling, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-20 3:16 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Fri 19-04-13 17:26:20, Tejun Heo wrote:
> Hello, Michal and all.
>
> Sorry about asking silly questions and leaving in the middle. I had a
> plane to catch which I just barely made. I thought about it on the
> way here and your proposal seems confused.
>
> I think the crux of the confusion comes from the fact that you're
> essentially proposing flipping the meaning of the knob for internal
> nodes - it means minimum guaranteed allocation - that is, the shrinker
> won't bother the cgroup if the memory consumption is under the
> softlimit - and your proposal is to reverse that for cgroups with
> children so that it actually means "soft" limit - creating pressure if
> above the limit (IIUC, it isn't entirely that either as the pressure
> is created iff the whole system is under memory pressure, right?).
No, one of the patches changes that and puts the soft reclaim into the
hard reclaim path as well - basically, try to reclaim over-soft-limit
groups first and do not bother others if you can make your target.
Please refer to the patchset for details
(http://comments.gmane.org/gmane.linux.kernel.mm/97973).
> Regardless of the direction of a configuration, a parent cgroup should
> gate that configuration in the same direction. ie. If it's a limit
> for a leaf node when reached, it also is an limit for the whole
> subtree for an internal cgroup.
Agreed and that is exactly what I was saying and what the code does.
> If it's a configuration which guarantees allocation (in the sense that
> it'll be excluded in memory reclaim if under limit), the same, if the
> subtree is under limit, reclaim shouldn't trigger.
>
> For example, please consider the following hierarchy where s denotes
> the "softlimit" and h hardlimit.
>
> A (h:8G s:4G)
> / \
> / \
> B (h:5G s:1G) C (h:5G s:1G)
>
> For hard limit, nobody seems confused how the internal limit should
> apply - If either B or C goes over 5G, the one going over that limit
> will be on the receiving end of OOM killer.
Right
> Also, even if both B and C are individually under 5G, if the sum of
> the two goes over A's limit - 8G, OOM killer will be activated on the
> subtree. It'd be a policy decision whether to kill tasks from A, B or
> C, but the no matter what the parent's limit will be enforced in the
> subtree. Note that this is a perfectly valid configuration.
Agreed.
> It is *not* an invalid configuration. It is exactly what the
> hierarchical configuration is supposed to do.
>
> It must not be any different for "softlimit". If B or C are
> individually under 1G, they won't be targeted by the reclaimer and
> even if B and C are over 1G, let's say 2G, as long as the sum is under
> A's "softlimit" - 4G, reclaimer won't look at them.
But we disagree on this one. If B and/or C are above their soft limit
we do (soft) reclaim them. It is exactly the same thing as if they were
hitting their hard limit (we just enforce the limit lazily).
You can look at the soft limit as a lazy limit which is enforced only
if there is external pressure coming up the hierarchy - this can be
either global memory pressure or a hard limit reached up the hierarchy.
Does this make sense to you?
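The lazy enforcement described above - reclaim over-soft-limit groups first, and bother the others only if the reclaim target can't be met from them - can be sketched as follows. This is an illustrative Python model of the behavior described in this mail, not the kernel implementation:

```python
# Two-pass soft reclaim: pass 1 takes only the excess of over-limit
# groups; pass 2 runs only if pressure remains, at which point the
# limits stop protecting anyone.

def soft_reclaim(groups, target):
    """groups: list of (name, usage, soft_limit); returns pages taken."""
    taken = {}
    remaining = target
    # Pass 1: only groups in excess of their soft limit.
    for name, usage, soft in groups:
        if remaining <= 0:
            break
        if usage > soft:
            grab = min(usage - soft, remaining)
            taken[name] = grab
            remaining -= grab
    # Pass 2: if pressure persists, everyone becomes eligible.
    for name, usage, soft in groups:
        if remaining <= 0:
            break
        left = usage - taken.get(name, 0)
        grab = min(left, remaining)
        if grab > 0:
            taken[name] = taken.get(name, 0) + grab
            remaining -= grab
    return taken

# A group sitting at its soft limit is spared while the over-limit
# group can still satisfy the pressure.
print(soft_reclaim([("backup", 6, 0), ("important", 4, 4)], 3))
# {'backup': 3}
```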
> It is exactly the same as hardlimit, just the opposite direction.
>
> Now, let's consider the following hierarchy just to be sure. Let's
> assume that A itself doesn't have any tasks for simplicity.
>
> A (h:16G s:4G)
> / \
> / \
> B (h:7G s:5G) C (h:7G s:5G)
>
> For hardlimit, it is clear that A's limit won't do anything.
It _does_ if A has tasks which add pressure to B+C. Or even if it does
not have any tasks, because A might hold some reparented pages from
groups which are gone now.
> No matter what B and C do. In exactly the same way, A's "softlimit"
> doesn't do anything regardless of what B and C do.
And same here.
> Just like A's hardlimit doesn't impose any further restrictions on B
> and C, A's softlimit doesn't give any further guarantee to B and C.
> There's no difference at all.
If A hits its hard limit then we reclaim that subtree so we _can_ and
_do_ reclaim also from B and C. This is what the current code does and
soft reclaim doesn't change that at all. The only thing it changes is
that it tries to save groups below the limit from being reclaimed.
> Now, it's completely silly that "softlimit" is actually allocation
> guarantee rather than an actual limit. I guess it's born out of
> similar confusion? Maybe originally the operation was a confused mix
> of the two and it moved closer to guaranteeing behavior over time?
I wouldn't call it silly. It actually makes a lot of sense if you look
at it as a delayed limit which allows you to allocate more if there
isn't any outside memory pressure.
> Anyways, it's apparent why actual soft limit - that is something which
> creates reclaim pressure even when the system as whole isn't under
> memory pressure - would be useful, and I'm actually kinda surprised
> that it doesn't already exist. It isn't difficult to imagine use
> cases where the user doesn't want certain services/applications (say
> backup, torrent or static http server serving large files) to not
> consume huge amount of memory without triggering OOM killer. It is
> something which is fundamentally useful and I think is why people are
> confused and pulling the current "softlimit" towards something like
> that.
Actually the use case is this. Say you have an important workload which
shouldn't be influenced by other less important workloads (say backup
for simplicity). You set up a soft limit for your important load to
match its average working set. The backup doesn't need any hard limit
and gets a soft limit of 0 because a) you do not know how much it would
need and b) you'd like to make it run as fast as possible. Check what
happens now. Backup uses all the remaining memory until global reclaim
starts. The global reclaim will start reclaiming the backup, or even
your important workload if it consumed more than its soft limit (say
after a peak load). As long as you can reclaim enough from the backup
to satisfy the global memory pressure, you do not have to hit the
important workload. Sounds like a huge win to me!
You can even look at the soft limit as an "intelligent" mlock which
keeps the memory "locked" as long as you can handle the external
memory pressure. This is new with this re-implementation, because the
original code uses the soft limit only as a hint about whom to reclaim
first but doesn't consider it any further.
> If such actual soft limit is desired (I don't know, it just seems like
> a very fundamental / logical feature to me), please don't try to
> somehow overload "softlimit". They are two fundamentally different
> knobs, both make sense in their own ways, and when you stop confusing
> the two, there's nothing ambiguous about what each knob means in
> hierarchical situations. This goes the same for the "untrusted" flag
> Ying told me, which seems like another confused way to overload two
> meanings onto "softlimit". Don't overload!
>
> Now let's see if this gogo thing actually works.
>
> Thanks.
>
> --
> tejun
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-20 0:42 ` Tejun Heo
@ 2013-04-20 3:35 ` Greg Thelen
2013-04-21 1:53 ` Tejun Heo
0 siblings, 1 reply; 46+ messages in thread
From: Greg Thelen @ 2013-04-20 3:35 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm@kvack.org, Hugh Dickins, Ying Han,
Glauber Costa, Michel Lespinasse
On Fri, Apr 19, 2013 at 5:42 PM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Apr 19, 2013 at 05:26:20PM -0700, Tejun Heo wrote:
>> If such actual soft limit is desired (I don't know, it just seems like
>> a very fundamental / logical feature to me), please don't try to
>> somehow overload "softlimit". They are two fundamentally different
>> knobs, both make sense in their own ways, and when you stop confusing
>> the two, there's nothing ambiguous about what each knob means in
>> hierarchical situations. This goes the same for the "untrusted" flag
>> Ying told me, which seems like another confused way to overload two
>> meanings onto "softlimit". Don't overload!
>
> As for how actually to clean up this yet another mess in memcg, I
> don't know. Maybe introduce completely new knobs - say,
> oom_threshold, reclaim_threshold, and reclaim_trigger - and alias
> hardlimit to oom_threshold and softlimit to reclaim_trigger? BTW,
> "softlimit" should default to 0. Nothing else makes any sense.
I agree that the hard limit could be called the oom_threshold.
The meaning of the term reclaim_threshold is not obvious to me. I'd
prefer to call the soft limit a reclaim_target. System global
pressure can steal memory from a cgroup until its usage drops to the
soft limit (aka reclaim_target). Pressure will try to avoid stealing
memory below the reclaim target. The soft limit (reclaim_target) is
not checked until global pressure exists. Currently we do not have a
knob to set a reclaim_threshold, such that when usage exceeds the
reclaim_threshold async reclaim is queued. We are not discussing
triggering anything when soft limit is exceeded.
* Re: memcg: softlimit on internal nodes
2013-04-20 3:35 ` Greg Thelen
@ 2013-04-21 1:53 ` Tejun Heo
0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-21 1:53 UTC (permalink / raw)
To: Greg Thelen
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm@kvack.org, Hugh Dickins, Ying Han,
Glauber Costa, Michel Lespinasse
Hey, Greg.
On Fri, Apr 19, 2013 at 08:35:12PM -0700, Greg Thelen wrote:
> > As for how actually to clean up this yet another mess in memcg, I
> > don't know. Maybe introduce completely new knobs - say,
> > oom_threshold, reclaim_threshold, and reclaim_trigger - and alias
> > hardlimit to oom_threshold and softlimit to reclaim_trigger? BTW,
> > "softlimit" should default to 0. Nothing else makes any sense.
>
> I agree that the hard limit could be called the oom_threshold.
>
> The meaning of the term reclaim_threshold is not obvious to me. I'd
> prefer to call the soft limit a reclaim_target. System global
> pressure can steal memory from a cgroup until its usage drops to the
> soft limit (aka reclaim_target). Pressure will try to avoid stealing
> memory below the reclaim target. The soft limit (reclaim_target) is
> not checked until global pressure exists. Currently we do not have a
> knob to set a reclaim_threshold, such that when usage exceeds the
> reclaim_threshold async reclaim is queued. We are not discussing
> triggering anything when soft limit is exceeded.
Yeah, reclaim_target seems like a better name for it.
Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-20 3:16 ` Michal Hocko
@ 2013-04-21 2:23 ` Tejun Heo
2013-04-21 8:55 ` Michel Lespinasse
2013-04-21 12:46 ` Michal Hocko
0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-21 2:23 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Hello, Michal.
On Fri, Apr 19, 2013 at 08:16:11PM -0700, Michal Hocko wrote:
> > For example, please consider the following hierarchy where s denotes
> > the "softlimit" and h hardlimit.
> >
> > A (h:8G s:4G)
> > / \
> > / \
> > B (h:5G s:1G) C (h:5G s:1G)
...
> > It must not be any different for "softlimit". If B or C are
> > individually under 1G, they won't be targeted by the reclaimer and
> > even if B and C are over 1G, let's say 2G, as long as the sum is under
> > A's "softlimit" - 4G, reclaimer won't look at them.
>
> But we disagree on this one. If B and/or C are above their soft limit
> we do (soft) reclaim them. It is exactly the same thing as if they were
> hitting their hard limit (we just enforce the limit lazily).
>
> You can look at the soft limit as a lazy limit which is enforced only if
> there is an external pressure coming up the hierarchy - this can be
> either global memory pressure or a hard limit reached up the hierarchy.
> Does this make sense to you?
When flat, there's no confusion. The problem is that what you
describe makes the meaning of softlimit different for internal nodes
and leaf nodes. IIUC, it, at least currently, guarantees that
reclaim won't happen for a cgroup under its limit. In a hierarchical
setting, if A's subtree is under A's limit, the subtree shouldn't be
subject to reclaim. Again, you should be gating / stacking the
limits as you go down the tree and what you're saying breaks that
fundamental hierarchy rule.
> > Now, let's consider the following hierarchy just to be sure. Let's
> > assume that A itself doesn't have any tasks for simplicity.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > A (h:16G s:4G)
> > / \
> > / \
> > B (h:7G s:5G) C (h:7G s:5G)
> >
> > For hardlimit, it is clear that A's limit won't do anything.
>
> It _does_ if A has tasks which add pressure to B+C. Or even if you do
> not have any tasks because A might hold some reparented pages from
> groups which are gone now.
See the above. It's to discuss the semantics of limit hierarchy, so
let's forget about A's internal usage for now.
> > Just like A's hardlimit doesn't impose any further restrictions on B
> > and C, A's softlimit doesn't give any further guarantee to B and C.
> > There's no difference at all.
>
> If A hits its hard limit then we reclaim that subtree so we _can_ and
> _do_ reclaim also from B and C. This is what the current code does and
> soft reclaim doesn't change that at all. The only thing it changes is
> that it tries to save groups below the limit from being reclaimed.
Hardlimit and softlimit are in the *opposite* directions and you're
saying that softlimit in parent working in the same direction as
hardlimit is correct. Stop being so confused. Softlimit is in the
opposite direction. Internal node limit in hierarchical setting
should of course work in the opposite direction.
> > Now, it's completely silly that "softlimit" is actually allocation
> > guarantee rather than an actual limit. I guess it's born out of
> > similar confusion? Maybe originally the operation was a confused mix
> > of the two and it moved closer to guaranteeing behavior over time?
>
> I wouldn't call it silly. It actually makes a lot of sense if you look
> at it as a delayed limit which would allow you to allocate more if there
> is not any outside memory pressure.
It is silly because it *prevents* reclaim from happening if the cgroup
is under the limit which is *the* defining characteristic of the knob.
Memory is by *default* allowed to be reclaimed. How can being allowed
to do what is allowed by default be a function of a knob? It seems
like this confusion is leading you to think weird things about the
meaning of the knob in hierarchy. Stop thinking about it as limit.
It's a reclaim inhibitor.
> Actually the use case is this. Say you have an important workload which
> shouldn't be influenced by other less important workloads (say backup
> for simplicity). You set up a soft limit for your important load to
> match its average working set. The backup doesn't need any hard limit
Yes, guarantee.
> and soft limit set to 0 because a) you do not know how much it would
> need and b) you'd like to make it run as fast as possible. Check what happens
> now. Backup uses all the remaining memory until the global reclaims
> starts. The global reclaim will start reclaiming the backup or even
> your important workload if it consumed more than its soft limit (say
> after a peak load). As far as you can reclaim from the backup enough to
> satisfy the global memory pressure you do not have to hit the important
> workload. Sounds like a huge win to me!
I'm not saying the guarantee is useless. I'm saying its name is
completely the opposite of what it does and you, while knowing what it
actually does in practice, are completely confused about what the knob
semantically means.
> You can even look at the soft limit as to an "intelligent" mlock which
> keeps the memory "locked" as far as you can keep handling the external
> memory pressure. This is new with this new re-implementation because the
> original code uses soft limit only as a hint who to reclaim first but
> doesn't consider it any further.
Now I'm confused. You're saying softlimit currently doesn't guarantee
anything and what it means, even for flat hierarchy, isn't clearly
defined? If it can go either way and "softlimit" is being made an
allocation guarantee rather than say "if there's any pressure, feel
free to reclaim to this point (ie. prioritize reclaim to that point)",
that doesn't sound like a good idea.
Really, don't mix "don't reclaim below this" and "this shouldn't need
more than this, if under pressure, you can be aggressive about
reclaiming this one down to this point". That's where all the
confusions are coming from. They are two knobs in the opposite
directions and shouldn't be merged into a single knob.
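The two opposite-direction knobs argued for here can be sketched as follows. This is a toy model; the parameter names `guarantee` and `soft_cap` are hypothetical, not existing memcg interfaces:

```python
# Two distinct knobs, per the argument above:
#   guarantee - "don't reclaim below this" (reclaim inhibitor)
#   soft_cap  - "shouldn't need more than this; reclaim aggressively
#               down to this point when over it"

def reclaimable_pages(usage, guarantee, soft_cap, under_pressure):
    """How much may the reclaimer take from this group right now?"""
    if not under_pressure:
        # With no outside pressure, only the excess over the soft cap
        # generates reclaim at all.
        return max(usage - soft_cap, 0)
    # Under pressure, reclaim freely, but never below the guarantee.
    return max(usage - guarantee, 0)

# 6G in use, 2G guaranteed, soft cap of 4G:
print(reclaimable_pages(6, 2, 4, under_pressure=False))  # 2 (down to the cap)
print(reclaimable_pages(6, 2, 4, under_pressure=True))   # 4 (down to the floor)
```

Note how neither knob has to change meaning between leaf and internal nodes: both gate monotonically down the tree, just in opposite directions.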
Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-21 2:23 ` Tejun Heo
@ 2013-04-21 8:55 ` Michel Lespinasse
2013-04-22 4:24 ` Tejun Heo
2013-04-21 12:46 ` Michal Hocko
1 sibling, 1 reply; 46+ messages in thread
From: Michel Lespinasse @ 2013-04-21 8:55 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
Hi Tejun,
I don't remember exactly when you left - during the session I
expressed to Michal that while I think his proposal is an improvement
over the current situation, I think his handling of internal nodes is
confus(ed/ing).
On Sat, Apr 20, 2013 at 7:23 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Michal.
>
> On Fri, Apr 19, 2013 at 08:16:11PM -0700, Michal Hocko wrote:
>> > For example, please consider the following hierarchy where s denotes
>> > the "softlimit" and h hardlimit.
>> >
>> > A (h:8G s:4G)
>> > / \
>> > / \
>> > B (h:5G s:1G) C (h:5G s:1G)
> ...
>> > It must not be any different for "softlimit". If B or C are
>> > individually under 1G, they won't be targeted by the reclaimer and
>> > even if B and C are over 1G, let's say 2G, as long as the sum is under
>> > A's "softlimit" - 4G, reclaimer won't look at them.
I completely agree with you here. This is important to ensure
composability - someone that was using cgroups within a 4GB system can
be moved to use cgroups within a hierarchy with a 4GB soft limit on
the root, and still have its performance isolated from tasks running
in other cgroups in the system.
>> > Now, let's consider the following hierarchy just to be sure. Let's
>> > assume that A itself doesn't have any tasks for simplicity.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >
>> > A (h:16G s:4G)
>> > / \
>> > / \
>> > B (h:7G s:5G) C (h:7G s:5G)
>> >
>> > For hardlimit, it is clear that A's limit won't do anything.
Now the above is a very interesting case.
One thing some people worry about is that B and C's configuration
might be under a different administrator's control than A's. That is,
we could have a situation where the machine's sysadmin set up A for
someone else to play with, and that other person set up B and C within
his cgroup. In this scenario, one of the issues has to be how we
prevent B and C's configuration settings from reserving (or protecting
from reclaim) more memory than the machine's admin intended when he
configured A.
Michal's proposal resolves this by saying that A,B and C all become
reclaimable as soon as A goes over its soft limit.
Tejun's proposal (as I understand it) is that B and C are protected
from reclaim until they grow to 5G each, as their soft limits indicate.
I have a third view, which I talked about during Michal's
presentation. I think that when A's usage goes over 4G, we should be
able to reclaim from A's subtree. If B or C's usage are above their
soft limits, then we should reclaim from these cgroups; however if
both B and C have usage below their soft limits, then we are in a
situation where the soft limits can't be obeyed so we should ignore
them and reclaim from both B and C instead.
The idea is that I think soft limits should follow these design principles:
- Soft limits are used to steer reclaim. We should try to avoid
reclaiming from cgroups that are under their soft limits. However,
soft limits can't completely prevent reclaim - if all cgroups are
under their soft limits, then the soft limits become meaningless and
all cgroups become eligible for being reclaimed from (this is a
situation that the sysadmin can largely avoid by not over-committing
the soft limits).
- A child cgroup should not be able to grab more resources than its
parent (this is for the situation where the parent and child cgroups
might be under separate administrative control). So when a parent
cgroup hits its soft limit, the child cgroup soft limits should not be
able to prevent us from reclaiming from that hierarchy. The child
cgroup soft limits should still be obeyed to steer reclaim within the
hierarchy when possible, though.
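The two design principles above can be sketched as follows. This is an illustrative Python model of the steering rule proposed in this mail (a flat child list and arbitrary units are assumptions):

```python
# When a parent exceeds its soft limit, steer reclaim to over-limit
# children first; if every child is under its soft limit, the limits
# are over-committed and all children become eligible rather than
# letting child limits block the parent's reclaim.

def pick_reclaim_victims(children):
    """children: list of (name, usage, soft_limit); returns victim names."""
    over = [name for name, usage, soft in children if usage > soft]
    if over:
        return over                      # steer reclaim to the offenders
    # Soft limits can't all be obeyed: ignore them instead of stalling.
    return [name for name, _, _ in children]

# B and C both under their 5G soft limits, yet the parent is over its
# own soft limit: reclaim proceeds from both.
print(pick_reclaim_victims([("B", 3, 5), ("C", 3, 5)]))  # ['B', 'C']
print(pick_reclaim_victims([("B", 6, 5), ("C", 3, 5)]))  # ['B']
```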
Regardless of these differences, I still want to stress that
Michal's proposal is a clear improvement over what we have, so I see
it as a large step in the right direction.
> Now I'm confused. You're saying softlimit currently doesn't guarantee
> anything and what it means, even for flat hierarchy, isn't clearly
> defined?
The largest problem with softlimit today is that global reclaim
doesn't take it into account at all... So yes, I would say that
softlimit is very badly defined today (which may be why people have
such trouble agreeing about what it should mean in the first place).
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: memcg: softlimit on internal nodes
2013-04-21 2:23 ` Tejun Heo
2013-04-21 8:55 ` Michel Lespinasse
@ 2013-04-21 12:46 ` Michal Hocko
2013-04-22 4:39 ` Tejun Heo
1 sibling, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-21 12:46 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
[I am terribly jet lagged so I should probably postpone any serious
thinking for a few days but let me try]
On Sat 20-04-13 19:23:21, Tejun Heo wrote:
> Hello, Michal.
>
> On Fri, Apr 19, 2013 at 08:16:11PM -0700, Michal Hocko wrote:
> > > For example, please consider the following hierarchy where s denotes
> > > the "softlimit" and h hardlimit.
> > >
> > > A (h:8G s:4G)
> > > / \
> > > / \
> > > B (h:5G s:1G) C (h:5G s:1G)
> ...
> > > It must not be any different for "softlimit". If B or C are
> > > individually under 1G, they won't be targeted by the reclaimer and
> > > even if B and C are over 1G, let's say 2G, as long as the sum is under
> > > A's "softlimit" - 4G, reclaimer won't look at them.
> >
> > But we disagree on this one. If B and/or C are above their soft limit
> > we do (soft) reclaim them. It is exactly the same thing as if they were
> > hitting their hard limit (we just enforce the limit lazily).
> >
> > You can look at the soft limit as a lazy limit which is enforced only if
> > there is an external pressure coming up the hierarchy - this can be
> > either global memory pressure or a hard limit reached up the hierarchy.
> > Does this make sense to you?
>
> When flat, there's no confusion. The problem is that what you
> describe makes the meaning of softlimit different for internal nodes
> and leaf nodes.
No, internal and leaf nodes behave exactly the same. Have a look at
mem_cgroup_soft_reclaim_eligible. All the confusion probably comes
from the understanding of the current semantics of the soft limit and
what it should do after my patch.
The current implementation stores all subtrees that are over the soft
limit in a tree sorted by how much they are exceeding the limit. Have
a look at mem_cgroup_update_tree and its callers (namely down from
__mem_cgroup_commit_charge). My patch _preserves_ this behavior; it just
makes the code much saner and, as a bonus, it doesn't touch groups (not
hierarchies) under the limit unless necessary, which wasn't the case
previously.
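The tracking described above can be modeled with a short sketch. This is a hypothetical Python model, not the kernel code (the real implementation keeps per-zone rb-trees updated via mem_cgroup_update_tree): only groups in excess of their soft limit are tracked, sorted by how far over they are, and reclaim targets the worst offender first.

```python
def excess(group):
    """How far a group's usage is above its soft limit (0 if under)."""
    return max(0, group["usage"] - group["soft_limit"])

def reclaim_order(groups):
    """Groups under their soft limit are not touched; the rest are
    reclaimed in order of decreasing excess."""
    over = [g for g in groups if excess(g) > 0]
    return sorted(over, key=excess, reverse=True)

# Invented example groups (sizes in bytes):
groups = [
    {"name": "B", "usage": 2 << 30, "soft_limit": 1 << 30},  # 1G over
    {"name": "C", "usage": 1 << 29, "soft_limit": 1 << 30},  # under, untouched
    {"name": "D", "usage": 4 << 30, "soft_limit": 1 << 30},  # 3G over
]
print([g["name"] for g in reclaim_order(groups)])  # ['D', 'B']
```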
So yes, I can understand why this is confusing for you. The soft limit
semantic is different because the limit is/was considered only if it
is/was in excess.
Maybe I was using the word _guarantee_ too often and confused you; I am
sorry if this is the case. The guarantee part comes from the group point of
view. So the original semantic of the hierarchical behavior is
unchanged.
What it means for an internal node to be under the soft limit for
its subhierarchy is questionable, and there are usecases where
children groups might be under the control of different (even untrusted)
administrators (think about containers), so the implementation is not
straightforward. We certainly can do better than just reclaim everybody,
but that is subject to later improvement.
I will get to the rest of the email later.
[...]
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-21 8:55 ` Michel Lespinasse
@ 2013-04-22 4:24 ` Tejun Heo
2013-04-22 7:14 ` Michel Lespinasse
2013-04-22 15:37 ` Michal Hocko
0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-22 4:24 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
Hey, Michel.
> I don't remember exactly when you left - during the session I
> expressed to Michal that while I think his proposal is an improvement
> over the current situation, I think his handling of internal nodes is
> confus(ed/ing).
I think I stayed until near the end of the hierarchy discussion and
yeap I heard you saying that.
> I completely agree with you here. This is important to ensure
> composability - someone that was using cgroups within a 4GB system can
> be moved to use cgroups within a hierarchy with a 4GB soft limit on
> the root, and still have its performance isolated from tasks running
> in other cgroups in the system.
And for basic sanity. As you look down through the hierarchy of
nested cgroups, the pressure exerted by a limit can only increase
(IOW, the specificity of the control increases) as the level deepens,
regardless of the direction of such pressure, which is the only
logical thing to do for nested limits.
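That monotonicity argument can be made concrete with a toy model (illustrative Python, not memcg code): walking down a root-to-leaf path, the effective limit at each node is the minimum over itself and its ancestors, so a child can only tighten, never loosen, what its parent imposes.

```python
def effective_limits(path_limits):
    """Effective limit at each node along a root-to-leaf path:
    the minimum of its own limit and all ancestors' limits.
    A deeper node can only tighten the constraint, never relax it."""
    eff = float("inf")
    out = []
    for limit in path_limits:
        eff = min(eff, limit)
        out.append(eff)
    return out

# A (8G) -> B (5G): B's tighter 5G limit is effective as-is.
print(effective_limits([8, 5]))  # [8, 5]
# A (4G) -> B (5G): B cannot loosen A's limit; it is effectively 4G.
print(effective_limits([4, 5]))  # [4, 4]
```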
>> > Now, let's consider the following hierarchy just to be sure. Let's
>> > assume that A itself doesn't have any tasks for simplicity.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >
>> > A (h:16G s:4G)
>> > / \
>> > / \
>> > B (h:7G s:5G) C (h:7G s:5G)
>> >
>> > For hardlimit, it is clear that A's limit won't do anything.
>
> Now the above is a very interesting case.
It shouldn't be interesting at all. It should be exactly the same.
If "softlimit" means actual soft limit prioritizing reclaim down to
that point under pressure, it works in the same direction as hardlimit
and the limits should behave the same.
If "softlimit" means allocation guarantee where a cgroup is exempt
from reclaim while under the limit, a knob defining allowance rather
than limit, the direction of specificity is flipped. While the
direction is flipped, how it behaves should be the same. Otherwise,
it ends up breaking the very basics of nesting. Not a particularly
bright idea.
> One thing some people worry about is that B and C's configuration
> might be under a different administrator's control than A's. That is,
> we could have a situation where the machine's sysadmin set up A for
> someone else to play with, and that other person set up B and C within
> his cgroup. In this scenario, one of the issues has to be how do we
> prevent B and C's configuration settings from reserving (or protecting
> from reclaim) more memory than the machine's admin intended when he
> configured A.
Cgroup doesn't and will not support delegation of subtrees to
different security domains. Please refer to the following thread.
http://thread.gmane.org/gmane.linux.kernel.cgroups/6638
In fact, I'm planning to disallow changing ownership of cgroup files
when "sane_behavior" is specified. We're having difficult time
identifying our own asses as it is and I have no intention of adding
the huge extra burden of security policing on top. Delegation, if
necessary, will happen from userland.
> Michal's proposal resolves this by saying that A,B and C all become
> reclaimable as soon as A goes over its soft limit.
This makes me doubly upset and reminds me strongly of the
.use_hierarchy mess. It's so myopic in coming up with a solution for
the problem immediately at hand, it ends up ignoring basic rules and
implementing something which is fundamentally broken and confused.
Don't twist basic nesting rules to accommodate a half-assed delegation
mechanism. It's never gonna work properly and we'll need
"really_sane_behavior" flag eventually to clean up the mess again, and
we'll probably have to clarify that for memcg the 'c' stands for
"confused" instead of "control".
And I don't even get the delegation argument. Isn't that already
covered by hardlimit? Sure, reclaimer won't look at it but if you
don't trust a cgroup it of course will be put under certain hardlimit
from parent and smacked when it misbehaves. Hardlimit of course
should have priority over allocation guarantee and the system wouldn't
be in jeopardy due to a delegated cgroup misbehaving. If each knob is
given a clear meaning, these things should come naturally. You just
need a sane pecking order among the controls. It almost feels surreal
that this is suggested as a rationale for creating this chimera of a
knob. What the hell is going on here?
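The pecking order argued for here could be sketched like this (a hypothetical model with invented names, not an existing memcg interface): the hard limit always trumps the allocation guarantee, so a delegated group can never exempt more memory from reclaim than its parent allows.

```python
def reclaim_allowed(usage, guarantee, hard_limit):
    """Sketch of a sane pecking order between the two knobs:
    - above the hard limit, reclaim is always allowed (hard limit wins);
    - below the hard limit, the group is exempt only while it stays
      under its allocation guarantee."""
    if usage > hard_limit:
        return True
    return usage > guarantee

# Under guarantee and under hard limit: exempt from reclaim.
assert reclaim_allowed(usage=3, guarantee=4, hard_limit=8) is False
# Over guarantee: fair game for the reclaimer.
assert reclaim_allowed(usage=6, guarantee=4, hard_limit=8) is True
# A misconfigured guarantee above the hard limit cannot protect the
# group: the hard limit has priority, so the system stays safe.
assert reclaim_allowed(usage=9, guarantee=10, hard_limit=8) is True
```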
> I have a third view, which I talked about during Michal's
> presentation. I think that when A's usage goes over 4G, we should be
> able to reclaim from A's subtree. If B or C's usage are above their
> soft limits, then we should reclaim from these cgroups; however if
> both B and C have usage below their soft limits, then we are in a
> situation where the soft limits can't be obeyed so we should ignore
> them and reclaim from both B and C instead.
No, the config is valid and *exactly* the same as hardlimit case.
It's just in the opposite direction. Don't twist it. It's exactly
the same mechanics. Flipping the direction should not change what
nesting means. That's what you get and should get when cgroup nesting
is used for something which "guarantees" rather than "limits".
Whatever twist you think is a good idea for "softlimit", try to flip
the direction and apply it the same to "hardlimit" and see how messed
up it gets.
> Regardless about these differences, I still want to stress out that
> Michal's proposal is a clear improvement over what we have, so I see
> it as a large step in the right direction.
I'm afraid I don't agree with that. If the current situation is
ambiguous, moving to a definitely wrong state makes the situation worse,
so we need to figure out what this thing actually means first, and
it's not like it is a difficult choice to make. It's either actual
soft limit or allocation guarantee. It cannot be some random
combination of the two. Just pick one and stick with it.
>> Now I'm confused. You're saying softlimit currently doesn't guarantee
>> anything and what it means, even for flat hierarchy, isn't clearly
>> defined?
>
> The largest problem with softlimit today is that global reclaim
> doesn't take it into account at all... So yes, I would say that
> softlimit is very badly defined today (which may be why people have
> such trouble agreeing about what it should mean in the first place).
So, in that case, let's please make "softlimit" an actual soft limit
working in the same direction as hardlimit but works in terms of
reclaim pressure rather than OOM killing, and please don't tell me how
"softlimit" working in the opposite direction of "hardlimit" actually
makes sense in the wonderland of memcg. Please have at least some
common sense. :(
If people need "don't reclaim under this limit", IOW allocation
guarantee, please introduce another knob with proper name and properly
flipped hierarchy behavior.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-21 12:46 ` Michal Hocko
@ 2013-04-22 4:39 ` Tejun Heo
2013-04-22 15:19 ` Michal Hocko
0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2013-04-22 4:39 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Hey, Michal.
On Sun, Apr 21, 2013 at 02:46:06PM +0200, Michal Hocko wrote:
> [I am terribly jet lagged, so I should probably postpone any serious
> thinking for a few days, but let me try]
Sorry about raising a flame war so soon after the conference week.
None of these is really urgent, so please take your time.
> The current implementation stores all subtrees that are over the soft
> limit in a tree sorted by how much they are exceeding the limit. Have
> a look at mem_cgroup_update_tree and its callers (namely down from
> __mem_cgroup_commit_charge). My patch _preserves_ this behavior; it just
> makes the code much saner and, as a bonus, it doesn't touch groups (not
> hierarchies) under the limit unless necessary, which wasn't the case
> previously.
What you describe is already confused. What does that knob mean then?
Google folks seem to think it's an allocation guarantee but global
reclaim is broken and breaches the configuration (which I suppose is
arising from their usage of memcg) and I don't understand what your
definition of the knob is apart from the description of what's
implemented now, which apparently is causing horrible confusion on all
the involved parties.
> So yes, I can understand why this is confusing for you. The soft limit
> semantic is different because the limit is/was considered only if it
> is/was in excess.
>
> Maybe I was using the word _guarantee_ too often and confused you; I am
> sorry if this is the case. The guarantee part comes from the group point of
> view. So the original semantic of the hierarchical behavior is
> unchanged.
I don't care what word you use. There are two choices. Pick one and
stick with it. Don't make it something which inhibits reclaim if
under limit for leaf nodes but behaves somewhat differently if an
ancestor is under pressure or whatever. Just pick one. It is either
a reclaim inhibitor or an actual soft limit.
> What it means for an internal node to be under the soft limit for
> its subhierarchy is questionable, and there are usecases where
It's not frigging questionable. You're just horribly confused.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-22 4:24 ` Tejun Heo
@ 2013-04-22 7:14 ` Michel Lespinasse
2013-04-22 14:48 ` Tejun Heo
2013-04-22 15:37 ` Michal Hocko
1 sibling, 1 reply; 46+ messages in thread
From: Michel Lespinasse @ 2013-04-22 7:14 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
On Sun, Apr 21, 2013 at 9:24 PM, Tejun Heo <tj@kernel.org> wrote:
> Hey, Michel.
>
>> I don't remember exactly when you left - during the session I
>> expressed to Michal that while I think his proposal is an improvement
>> over the current situation, I think his handling of internal nodes is
>> confus(ed/ing).
>
> I think I stayed until near the end of the hierarchy discussion and
> yeap I heard you saying that.
All right. Too bad you had to leave - I think this is a discussion we
really need to have, so it would have been the perfect occasion.
>>> > Now, let's consider the following hierarchy just to be sure. Let's
>>> > assume that A itself doesn't have any tasks for simplicity.
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> >
>>> > A (h:16G s:4G)
>>> > / \
>>> > / \
>>> > B (h:7G s:5G) C (h:7G s:5G)
>>> >
>>> > For hardlimit, it is clear that A's limit won't do anything.
>>
>> One thing some people worry about is that B and C's configuration
>> might be under a different administrator's control than A's. That is,
>> we could have a situation where the machine's sysadmin set up A for
>> someone else to play with, and that other person set up B and C within
>> his cgroup. In this scenario, one of the issues has to be how do we
>> prevent B and C's configuration settings from reserving (or protecting
>> from reclaim) more memory than the machine's admin intended when he
>> configured A.
>
> Cgroup doesn't and will not support delegation of subtrees to
> different security domains. Please refer to the following thread.
>
> http://thread.gmane.org/gmane.linux.kernel.cgroups/6638
Ah, good. This is news to me. To be clear, I don't care much for the
delegation scenario myself, but it's always been mentioned as the
reason I couldn't get what I want when we've talked about hierarchical
soft limit behavior in the past. If the decision not to have subtree
delegation sticks, I am perfectly happy with your proposal.
> And I don't even get the delegation argument. Isn't that already
> covered by hardlimit? Sure, reclaimer won't look at it but if you
> don't trust a cgroup it of course will be put under certain hardlimit
> from parent and smacked when it misbehaves. Hardlimit of course
> should have priority over allocation guarantee and the system wouldn't
> be in jeopardy due to a delegated cgroup misbehaving. If each knob is
> given a clear meaning, these things should come naturally. You just
> need a sane pecking order among the controls. It almost feels surreal
> that this is suggested as a rationale for creating this chimera of a
> knob. What the hell is going on here?
People often overcommit the cgroup hard limits so that one cgroup can
make use of a larger share of the machine when the other cgroups are
idle.
This works well only if you can depend on soft limits to steer reclaim
when the other cgroups get active again.
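The overcommit pattern described here can be illustrated numerically (a hypothetical Python sketch; the group names, limits, and usages are invented): the children's hard limits sum past the parent's, so one group can grow while the others are idle, and soft limits decide who gets pushed back when contention returns.

```python
G = 1 << 30  # 1 GiB

parent_hard = 8 * G
children = {
    "B": {"hard": 7 * G, "soft": 1 * G, "usage": 6 * G},  # grew while alone
    "C": {"hard": 7 * G, "soft": 1 * G, "usage": 3 * G},  # just became active
}

# Hard limits are deliberately overcommitted relative to the parent:
assert sum(c["hard"] for c in children.values()) > parent_hard

# When total usage exceeds what the parent allows, reclaim is steered
# toward whoever exceeds its soft limit the most.
def pressure_target(children):
    return max(children,
               key=lambda n: children[n]["usage"] - children[n]["soft"])

print(pressure_target(children))  # B
```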
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-22 7:14 ` Michel Lespinasse
@ 2013-04-22 14:48 ` Tejun Heo
0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-22 14:48 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Michal Hocko, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
Hello, again.
On Mon, Apr 22, 2013 at 12:14:53AM -0700, Michel Lespinasse wrote:
> > I think I stayed until near the end of the hierarchy discussion and
> > yeap I heard you saying that.
>
> All right. Too bad you had to leave - I think this is a discussion we
> really need to have, so it would have been the perfect occasion.
Eh well, it would have been better if I stayed but I think it served
its purpose. Conferences are great for raising awareness. I usually
find actual follow-up discussions done better in mailing lists.
> > Cgroup doesn't and will not support delegation of subtrees to
> > different security domains. Please refer to the following thread.
> >
> > http://thread.gmane.org/gmane.linux.kernel.cgroups/6638
>
> Ah, good. This is news to me. To be clear, I don't care much for the
> delegation scenario myself, but it's always been mentioned as the
> reason I couldn't get what I want when we've talked about hierarchical
> soft limit behavior in the past. If the decision not to have subtree
> delegation sticks, I am perfectly happy with your proposal.
Oh, it's sticking. :)
> > And I don't even get the delegation argument. Isn't that already
> > covered by hardlimit? Sure, reclaimer won't look at it but if you
> > don't trust a cgroup it of course will be put under certain hardlimit
> > from parent and smacked when it misbehaves. Hardlimit of course
> > should have priority over allocation guarantee and the system wouldn't
> > be in jeopardy due to a delegated cgroup misbehaving. If each knob is
> > given a clear meaning, these things should come naturally. You just
> > need a sane pecking order among the controls. It almost feels surreal
> > that this is suggested as a rationale for creating this chimera of a
> > knob. What the hell is going on here?
>
> People often overcommit the cgroup hard limits so that one cgroup can
> make use of a larger share of the machine when the other cgroups are
> idle.
> This works well only if you can depend on soft limits to steer reclaim
> when the other cgroups get active again.
And that's fine too. If you take a step back, it shouldn't be
difficult to recognize that what you want is an actual soft limit at
the parent level overriding the allocation guarantee (for the lack of
a better name). Don't overload "alloc guarantee" with that extra
meaning messing up its fundamental properties. Create a separate
plane of control which is consistent within itself and give it
priority over "alloc guarantee". You sure can discuss the details of
the override - should it be round-robin or proportional to whatever or
what, but that's a separate discussion and can be firmly labeled as
implementation details rather than this twisting of the fundamental
semantics of "softlimit".
I really am not saying any of the use cases that have been described
are invalid. They all sound pretty useful, but, to me, what seems to
be recurring is that people want two separate features - actual soft
limit and allocation guarantee, and for some reason that I can't
understand, fail to recognize they're two very different controls and
try to put both into this one poor knob.
It's like trying to combine accelerator and (flipped) clutch on a
manual car. Sure, it'll work fine while you're accelerating. Good
luck while cruising or on a long downhill. You can try to tweak it
all you want but things of course will get "interesting" and
"questionable" as soon as the conditions change from the specific use
cases which the specific tuning is made for.
While car analogies can often be misleading, really, please stop
trying to combine two completely separate controls into one knob. It
won't and can't work and is totally stupid.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-22 4:39 ` Tejun Heo
@ 2013-04-22 15:19 ` Michal Hocko
2013-04-22 15:57 ` Tejun Heo
0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-22 15:19 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Sun 21-04-13 21:39:39, Tejun Heo wrote:
> Hey, Michal.
>
> On Sun, Apr 21, 2013 at 02:46:06PM +0200, Michal Hocko wrote:
> > [I am terribly jet lagged, so I should probably postpone any serious
> > thinking for a few days, but let me try]
>
> Sorry about raising a flame war so soon after the conference week.
> None of these is really urgent, so please take your time.
>
> > The current implementation stores all subtrees that are over the soft
> > limit in a tree sorted by how much they are exceeding the limit. Have
> > a look at mem_cgroup_update_tree and its callers (namely down from
> > __mem_cgroup_commit_charge). My patch _preserves_ this behavior; it just
> > makes the code much saner and, as a bonus, it doesn't touch groups (not
> > hierarchies) under the limit unless necessary, which wasn't the case
> > previously.
>
> What you describe is already confused. What does that knob mean then?
Well, it would help to start with Documentation/cgroups/memory.txt
"
7. Soft limits
Soft limits allow for greater sharing of memory. The idea behind soft
limits is to allow control groups to use as much of the memory as
needed, provided
a. There is no memory contention
b. They do not exceed their hard limit
When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.
Please note that soft limits is a best-effort feature; it comes with
no guarantees, but it does its best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).
"
As you can see, there is not a single mention of groups below their soft
limits. All we are saying here is that those groups that are above will
get reclaimed.
> Google folks seem to think it's an allocation guarantee but global
> reclaim is broken and breaches the configuration (which I suppose is
> arising from their usage of memcg) and I don't understand what your
> definition of the knob is apart from the description of what's
> implemented now, which apparently is causing horrible confusion on all
> the involved parties.
OK, I guess I am starting to understand where all the confusion comes
from. Let me stress again that the rework doesn't provide any guarantee.
It just integrates the soft limit reclaim into the main reclaim routines,
gets rid of a lot of code and, last but not least, makes it more likely
that under-soft-limit groups are not reclaimed unless really necessary.
So please take these into consideration for the future discussions.
> > So yes, I can understand why this is confusing for you. The soft limit
> > semantic is different because the limit is/was considered only if it
> > is/was in excess.
> >
> > Maybe I was using the word _guarantee_ too often and confused you; I am
> > sorry if this is the case. The guarantee part comes from the group point of
> > view. So the original semantic of the hierarchical behavior is
> > unchanged.
>
> I don't care what word you use. There are two choices. Pick one and
> stick with it. Don't make it something which inhibits reclaim if
> under limit for leaf nodes but behaves somewhat differently if an
> ancestor is under pressure or whatever. Just pick one. It is either
> an reclaim inhibitor or actual soft limit.
OK, I will not repeat the same mistake and let this frustrating
discussion go on to the "let's redo the soft limit reclaim again #1001"
point again. No, this is not about a guarantee. And it _never_ will be!
Full stop.
We can try to be clever during outside pressure and prefer reclaiming
over-soft-limit groups first, which we used to do and will do after the
rework as well. That a properly designed hierarchy with opt-in soft
limited groups can actually accomplish some isolation is a nice side
effect, but no _guarantee_.
> > What it means for an internal node to be under the soft limit for
> > its subhierarchy is questionable, and there are usecases where
>
> It's not frigging questionable. You're just horribly confused.
>
> Thanks.
>
> --
> tejun
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-22 4:24 ` Tejun Heo
2013-04-22 7:14 ` Michel Lespinasse
@ 2013-04-22 15:37 ` Michal Hocko
2013-04-22 15:46 ` Tejun Heo
1 sibling, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-22 15:37 UTC (permalink / raw)
To: Tejun Heo
Cc: Michel Lespinasse, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Glauber Costa, Greg Thelen
On Sun 21-04-13 21:24:45, Tejun Heo wrote:
[...]
> Cgroup doesn't and will not support delegation of subtrees to
> different security domains. Please refer to the following thread.
>
> http://thread.gmane.org/gmane.linux.kernel.cgroups/6638
>
> In fact, I'm planning to disallow changing ownership of cgroup files
> when "sane_behavior" is specified.
I would be wildly opposing this. Enabling a user to play on its own
ground while upper levels of the hierarchy enforce reasonable behavior
is a very important use case.
> We're having difficult time identifying our own asses as it is and I
> have no intention of adding the huge extra burden of security policing
> on top. Delegation, if necessary, will happen from userland.
> > Michal's proposal resolves this by saying that A,B and C all become
> > reclaimable as soon as A goes over its soft limit.
>
> This makes me doubly upset and reminds me strongly of the
> .use_hierarchy mess. It's so myopic in coming up with a solution for
> the problem immediately at hand, it ends up ignoring basic rules and
> implementing something which is fundamentally broken and confused.
Tejun, stop this, finally! The current soft limit, same as the reworked
version, follows the basic nesting rule we use for the hard limit, which
says that the parent setting is always more strict than its children's.
So if your parent says you are hitting the hard limit (resp. are over
the soft limit), then children are reclaimed regardless of their
hard/soft limit setting.
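The nesting rule stated here (the behavior mem_cgroup_soft_reclaim_eligible is said to implement) can be sketched as an ancestor walk; this is an illustrative Python model, not the kernel code:

```python
def soft_reclaim_eligible(group):
    """A group is eligible for soft-limit reclaim if it, or any
    ancestor, is over its soft limit: the parent's setting is always
    at least as strict as its children's."""
    node = group
    while node is not None:
        if node["usage"] > node["soft_limit"]:
            return True
        node = node["parent"]
    return False

# Invented example: A is over its soft limit, B is under its own,
# yet B is still eligible because its parent's setting is stricter.
A = {"usage": 5, "soft_limit": 4, "parent": None}
B = {"usage": 1, "soft_limit": 2, "parent": A}
print(soft_reclaim_eligible(B))  # True
```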
> Don't twist basic nesting rules to accomodate half-assed delegation
> mechanism. It's never gonna work properly and we'll need
> "really_sane_behavior" flag eventually to clean up the mess again, and
> we'll probably have to clarify that for memcg the 'c' stands for
> "confused" instead of "control".
>
> And I don't even get the delegation argument. Isn't that already
> covered by hardlimit?
No, it's not, because you want to overcommit the memory between different
groups. And the soft limit is a way to handle memory pressure gracefully
in contended situations.
> Sure, reclaimer won't look at it but if you don't trust a cgroup
> it of course will be put under certain hardlimit from parent and
> smacked when it misbehaves. Hardlimit of course should have priority
> over allocation guarantee and the system wouldn't be in jeopardy due
> to a delegated cgroup misbehaving. If each knob is given a clear
> meaning, these things should come naturally. You just need a sane
> pecking order among the controls. It almost feels surreal that this
> is suggested as a rationale for creating this chimera of a knob. What
> the hell is going on here?
It is you who is confused and refuses to open the damn documentation
and read what the heck the soft limit is and what it is used for. Read
the patch series I was talking about and you will hardly find anything
regarding a _guarantee_.
[...]
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-22 15:37 ` Michal Hocko
@ 2013-04-22 15:46 ` Tejun Heo
2013-04-22 15:54 ` Michal Hocko
0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2013-04-22 15:46 UTC (permalink / raw)
To: Michal Hocko
Cc: Michel Lespinasse, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Glauber Costa, Greg Thelen
Hey, Michal.
On Mon, Apr 22, 2013 at 05:37:30PM +0200, Michal Hocko wrote:
> > In fact, I'm planning to disallow changing ownership of cgroup files
> > when "sane_behavior" is specified.
>
> I would be wildly opposing this. Enabling a user to play on its own
> ground while upper levels of the hierarchy enforce reasonable behavior
> is a very important use case.
We can continue this discussion on the original thread, and I'm not too
firm on this, not because it's a sane use case, but because it is an
extra measure preventing root from shooting itself in the foot, which we
traditionally allow. That said, really, no good can come from
delegating a hierarchy to different security domains. It's already
discouraged by the userland best practices doc. Just don't do it.
> Tejun, stop this, finally! The current soft limit, same as the reworked
> version, follows the basic nesting rule we use for the hard limit, which
> says that the parent setting is always more strict than its children's.
> So if your parent says you are hitting the hard limit (resp. are over
> the soft limit), then children are reclaimed regardless of their
> hard/soft limit setting.
Okay, thanks for making it clear. Then, apparently, the fine folks at
google are hopelessly confused because at least Greg and Ying told me
something which is the completely opposite of what you're saying. You
guys need to sort it out.
> It is you who is confused and refuses to open the damn documentation
> and read what the heck the soft limit is and what it is used for. Read
> the patch series I was talking about and you will hardly find anything
> regarding a _guarantee_.
Oh, if so, I'm happy. Sorry about being brash on the thread; however,
please talk with google memcg people. They have very different
interpretation of what "softlimit" is and are using it according to
that interpretation. If it *is* an actual soft limit, there is no
inherent isolation coming from it and that should be clear to
everyone.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-22 15:46 ` Tejun Heo
@ 2013-04-22 15:54 ` Michal Hocko
2013-04-22 16:01 ` Tejun Heo
2013-04-23 9:58 ` Michel Lespinasse
0 siblings, 2 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-22 15:54 UTC (permalink / raw)
To: Tejun Heo
Cc: Michel Lespinasse, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Glauber Costa, Greg Thelen
On Mon 22-04-13 08:46:20, Tejun Heo wrote:
> Hey, Michal.
>
> On Mon, Apr 22, 2013 at 05:37:30PM +0200, Michal Hocko wrote:
> > > In fact, I'm planning to disallow changing ownership of cgroup files
> > > when "sane_behavior" is specified.
> >
> > I would be wildly opposing this. Enabling a user to play on its own ground
> > while the levels above enforce reasonable behavior is a very
> > important use case.
>
> We can continue this discussion on the original thread and I'm not too
> firm on this not because it's a sane use case but because it is an
> extra measure preventing root from shooting itself in the foot, which we
> traditionally allow. That said, really, no good can come from
> delegating hierarchy to different security domains. It's already
> discouraged by the userland best practices doc. Just don't do it.
OK, I will go to the original mail thread and discuss my concerns there.
> > Tejun, stop this, finally! The current soft limit, same as the reworked
> > version, follows the basic nesting rule we use for the hard limit, which
> > says that the parent setting is always more strict than its children's.
> > So if your parent says you are hitting the hard limit (resp. are over the
> > soft limit) then children are reclaimed regardless of their hard/soft
> > limit setting.
>
> Okay, thanks for making it clear. Then, apparently, the fine folks at
> google are hopelessly confused because at least Greg and Ying told me
> something which is the complete opposite of what you're saying. You
> guys need to sort it out.
>
> > It is you being confused and refusing to open the damn documentation and
> > read what the heck soft limit is and what it is used for. Read the patch
> > series I was talking about and you will hardly find anything regarding
> > _guarantee_.
>
> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
> please talk with google memcg people. They have very different
> interpretation of what "softlimit" is and are using it according to
> that interpretation. If it *is* an actual soft limit, there is no
> inherent isolation coming from it and that should be clear to
> everyone.
We have discussed that for a long time. I will not speak for Greg & Ying
but from my POV we have agreed that the current implementation will work
for them with some (minor) changes in their layout.
As I have said already, with a careful configuration (i.e. setting the
soft limit only where it matters - where it protects important
memory, which is usually in the leaf nodes) you can actually achieve a
_high_ probability of not being reclaimed after the rework, which was not
possible before because of the implementation, which was ugly and
smelled.
>
> Thanks.
>
> --
> tejun
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-22 15:19 ` Michal Hocko
@ 2013-04-22 15:57 ` Tejun Heo
2013-04-22 15:57 ` Tejun Heo
2013-04-22 16:20 ` Michal Hocko
0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-22 15:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Mon, Apr 22, 2013 at 05:19:08PM +0200, Michal Hocko wrote:
> We can try to be clever under outside pressure and prefer
> reclaiming over-soft-limit groups first, which we used to do and will
> do after the rework as well. That a properly designed
> hierarchy with opt-in soft limited groups can actually accomplish some
> isolation is a nice side effect but no _guarantee_.
Okay, so it *is* a soft limit. Good. If so, a subtree going over the
limit of course forces reclaim on its children even though their
individual configs aren't over limit. It's exactly the same as
hardlimit. There doesn't need to be any difference and there's
nothing questionable or interesting about it.
Also, then, a cgroup which has not been configured explicitly shouldn't be
disadvantaged compared to a cgroup with a limit configured. ie. the
current behavior of giving maximum to the knob on creation is the
correct one. The knob should create *extra* pressure. It shouldn't
lessen the pressure. When populated with other cgroups with limits
configured, it would change the relative pressure felt by each but in
general it's a limiting mechanism not an isolation one. I think the
bulk of confusion is coming from this, so please make that abundantly
clear.
And, if people want a mechanism for isolation / lessening of pressure,
which looks like a valid use case to me, add another knob for that
which is prioritized under both hard and soft limits. That is the
only sensible way to do it.
Alright, no complaint anymore. Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-22 15:57 ` Tejun Heo
@ 2013-04-22 15:57 ` Tejun Heo
2013-04-22 16:20 ` Michal Hocko
1 sibling, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-22 15:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Mon, Apr 22, 2013 at 08:57:03AM -0700, Tejun Heo wrote:
> On Mon, Apr 22, 2013 at 05:19:08PM +0200, Michal Hocko wrote:
> > We can try to be clever under outside pressure and prefer
> > reclaiming over-soft-limit groups first, which we used to do and will
> > do after the rework as well. That a properly designed
> > hierarchy with opt-in soft limited groups can actually accomplish some
> > isolation is a nice side effect but no _guarantee_.
>
> Okay, so it *is* a soft limit. Good. If so, a subtree going over the
> limit of course forces reclaim on its children even though their
> individual configs aren't over limit. It's exactly the same as
> hardlimit. There doesn't need to be any difference and there's
> nothing questionable or interesting about it.
>
> Also, then, a cgroup which has been configured explicitly shouldn't be
^
not
> disadvantaged compared to a cgroup with a limit configured. ie. the
> current behavior of giving maximum to the knob on creation is the
> correct one. The knob should create *extra* pressure. It shouldn't
> lessen the pressure. When populated with other cgroups with limits
> configured, it would change the relative pressure felt by each but in
> general it's a limiting mechanism not an isolation one. I think the
> bulk of confusion is coming from this, so please make that abundantly
> clear.
>
> And, if people want a mechanism for isolation / lessening of pressure,
> which looks like a valid use case to me, add another knob for that
> which is prioritized under both hard and soft limits. That is the
> only sensible way to do it.
>
> Alright, no complaint anymore. Thanks.
>
> --
> tejun
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-22 15:54 ` Michal Hocko
@ 2013-04-22 16:01 ` Tejun Heo
2013-04-23 9:58 ` Michel Lespinasse
1 sibling, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-22 16:01 UTC (permalink / raw)
To: Michal Hocko
Cc: Michel Lespinasse, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Glauber Costa, Greg Thelen
Hey,
On Mon, Apr 22, 2013 at 05:54:54PM +0200, Michal Hocko wrote:
> > Oh, if so, I'm happy. Sorry about being brash on the thread; however,
> > please talk with google memcg people. They have very different
> > interpretation of what "softlimit" is and are using it according to
> > that interpretation. If it *is* an actual soft limit, there is no
> > inherent isolation coming from it and that should be clear to
> > everyone.
>
> We have discussed that for a long time. I will not speak for Greg & Ying
> but from my POV we have agreed that the current implementation will work
> for them with some (minor) changes in their layout.
> As I have said already, with a careful configuration (i.e. setting the
> soft limit only where it matters - where it protects important
> memory, which is usually in the leaf nodes) you can actually achieve a
> _high_ probability of not being reclaimed after the rework, which was not
> possible before because of the implementation, which was ugly and
> smelled.
I don't know. I'm not sure this is a good idea. It's still
encouraging abuse of the knob even if that's not the intention and
once the usage sticks you end up with something you can't revert
afterwards. I think it'd be better to make it *very* clear that
"softlimit" can't be used for isolation in any reliable way.
Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-22 15:57 ` Tejun Heo
2013-04-22 15:57 ` Tejun Heo
@ 2013-04-22 16:20 ` Michal Hocko
2013-04-22 18:30 ` Tejun Heo
1 sibling, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-22 16:20 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Mon 22-04-13 08:57:03, Tejun Heo wrote:
> On Mon, Apr 22, 2013 at 05:19:08PM +0200, Michal Hocko wrote:
> > We can try to be clever under outside pressure and prefer
> > reclaiming over-soft-limit groups first, which we used to do and will
> > do after the rework as well. That a properly designed
> > hierarchy with opt-in soft limited groups can actually accomplish some
> > isolation is a nice side effect but no _guarantee_.
>
> Okay, so it *is* a soft limit. Good. If so, a subtree going over the
> limit of course forces reclaim on its children even though their
> individual configs aren't over limit. It's exactly the same as
> hardlimit. There doesn't need to be any difference and there's
> nothing questionable or interesting about it.
>
> Also, then, a cgroup which has not been configured explicitly shouldn't be
> disadvantaged compared to a cgroup with a limit configured. ie. the
> current behavior of giving maximum to the knob on creation is the
> correct one.
Although the default limit is correct, it is impractical to use
because it doesn't allow for "I behave, do not reclaim me if you can"
cases. And we can implement such a behavior really easily, with backward
compatibility and without new interfaces (i.e. reuse the soft limit for that).
I am approaching this from a simple perspective. Reclaim from everybody
who doesn't care about the soft limit (it hasn't been set for that
group) or who is above the soft limit. If that is sufficient to meet the
reclaim target then there is no reason to touch groups that _do_ care
about the soft limit and are under it. Although this doesn't give you
any guarantee, it can give a certain prioritization for groups in
overcommit situations, and that is what the soft limit was intended for
from the very beginning.
> The knob should create *extra* pressure. It shouldn't
> lessen the pressure. When populated with other cgroups with limits
> configured, it would change the relative pressure felt by each but in
> general it's a limiting mechanism not an isolation one. I think the
> bulk of confusion is coming from this, so please make that abundantly
> clear.
>
> And, if people want a mechanism for isolation / lessening of pressure,
> which looks like a valid use case to me, add another knob for that
> which is prioritized under both hard and soft limits. That is the
> only sensible way to do it.
No, please, not yet another knob. We have too many of them already. And
even those that have been here for a long time can be confusing, as one
can see.
> Alright, no complaint anymore. Thanks.
>
> --
> tejun
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-22 16:20 ` Michal Hocko
@ 2013-04-22 18:30 ` Tejun Heo
2013-04-23 9:29 ` Michal Hocko
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Tejun Heo @ 2013-04-22 18:30 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Hey,
On Mon, Apr 22, 2013 at 06:20:12PM +0200, Michal Hocko wrote:
> Although the default limit is correct, it is impractical to use
> because it doesn't allow for "I behave, do not reclaim me if you can"
> cases. And we can implement such a behavior really easily, with backward
> compatibility and without new interfaces (i.e. reuse the soft limit for that).
Okay, now we're back to square one and I'm reinstating all the mean
things I said in this thread. :P No wonder everyone is so confused
about this. Michal, you can't overload two controls which exert
pressure in opposite directions onto a single knob and define a
sane hierarchical behavior for it. You're making it a point control
rather than a range one. Maybe you can define some twisted rules
serving a certain specific use case, but it's gonna be confusing /
broken for different use cases.
You're so confused that you don't even know you're confused.
> I am approaching this from a simple perspective. Reclaim from everybody
No, you're just thinking about two immediate problems you're given and
trying to jam them into something you already have, not realizing those
two can't be expressed with a single knob.
> who doesn't care about the soft limit (it hasn't been set for that
> group) or who is above the soft limit. If that is sufficient to meet the
> reclaim target then there is no reason to touch groups that _do_ care
> about the soft limit and are under it. Although this doesn't give you
> any guarantee, it can give a certain prioritization for groups in
> overcommit situations, and that is what the soft limit was intended for
> from the very beginning.
For $DEITY's sake, soft limit should exert reclaim pressure. That's
it. If a group is over limit, it'll feel *extra* pressure until it's
back to the limit. Once under the limit, it should be treated equally
to any other tasks which are under the limit including the ones
without any softlimit configured. It is not different from hardlimit.
There's nothing "interesting" about it.
Even for flat hierarchy, with your interpretation of the knob, it is
impossible to say "I don't really care about this thing, if it goes
over 30M, hammer on it", which is a completely reasonable thing to
want.
> > And, if people want a mechanism for isolation / lessening of pressure,
> > which looks like a valid use case to me, add another knob for that
> > which is prioritized under both hard and soft limits. That is the
> > only sensible way to do it.
>
> No, please, not yet another knob. We have too many of them already. And
> even those that have been here for a long time can be confusing, as one
> can see.
Yes, sure, knobs are hard, let's combine two controls in opposite
directions into one.
That is the crux of the confusion - trying to combine two things which
can't and shouldn't be combined. Just forget about the other thing or
separate it out. Please take a step back and look at it again.
You're really epitomizing the confusion on this subject.
Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-22 18:30 ` Tejun Heo
@ 2013-04-23 9:29 ` Michal Hocko
2013-04-23 17:09 ` Tejun Heo
2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko
2013-04-24 21:45 ` memcg: softlimit on internal nodes Johannes Weiner
2 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 9:29 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Mon 22-04-13 11:30:20, Tejun Heo wrote:
> Hey,
>
> On Mon, Apr 22, 2013 at 06:20:12PM +0200, Michal Hocko wrote:
> > Although the default limit is correct it is impractical for use
> > because it doesn't allow for "I behave do not reclaim me if you can"
> > cases. And we can implement such a behavior really easily with backward
> > compatibility and new interfaces (aka reuse the soft limit for that).
>
> Okay, now we're back to square one and I'm reinstating all the mean
> things I said in this thread. :P No wonder everyone is so confused
> about this. Michal, you can't overload two controls which exert
> pressure in opposite directions onto a single knob and define a
> sane hierarchical behavior for it.
Ohh, well, and we are going in circles again. Nobody is proposing to
overload soft reclaim for any bottom-up (if that is what you mean by
your opposite direction) pressure handling.
> You're making it a point control rather than a range one.
Be more specific here, please?
> Maybe you can define some twisted rules serving certain specific use
> case, but it's gonna be confusing / broken for different use cases.
Tejun, your argumentation is really hand-wavy here. Which use cases will
be broken, and which ones will be confusing? Name one for illustration.
> You're so confused that you don't even know you're confused.
Yes, you keep repeating that. But you haven't pointed out a single
confusing use case so far. Please, please, stop this; it is not productive.
We are still talking about using the soft limit to control overcommit
situations as gracefully as possible. I hope we are on the same page
about that at least.
I will post my series as a reply to this email so that we can get to
a more specific discussion, because this "you are so confused because
something, something, something, dark..." is neither funny nor productive.
> > I am approaching this from a simple perspective. Reclaim from everybody
>
> No, you're just thinking about two immediate problems you're given and
> trying to jam them into something you already have not realizing those
> two can't be expressed with a single knob.
Yes, I am thinking in the context of several use cases, all right. One
of them is memory isolation via soft limit prioritization. Something
that is possible already, but it is a major PITA to do right. What we
have currently is optimized for "let's hammer something". Although
useful, that is not the primary use case according to my experience. The
primary motivation for the soft limit was to have something to control
overcommit situations gracefully AFAIR, and "let's hammer something and
hope it will work" doesn't sound graceful to me.
> > who doesn't care about the soft limit (it hasn't been set for that
> > group) or who is above the soft limit. If that is sufficient to meet the
> > reclaim target then there is no reason to touch groups that _do_ care
> > about the soft limit and are under it. Although this doesn't give you
> > any guarantee, it can give a certain prioritization for groups in
> > overcommit situations, and that is what the soft limit was intended for
> > from the very beginning.
>
> For $DEITY's sake, soft limit should exert reclaim pressure. That's
> it. If a group is over limit, it'll feel *extra* pressure until it's
> back to the limit. Once under the limit, it should be treated equally
> to any other tasks which are under the limit
And yet again, agreed, and nobody is claiming otherwise. Except that
> including the ones without any softlimit configured.
I haven't seen any specific argument why the default limit shouldn't
always allow reclaim.
Having soft-unreclaimable groups by default makes it hard to use soft
limit reclaim for something more interesting. See the last patch
in the series ("memcg: Ignore soft limit until it is explicitly
specified"). Without it you end up setting the soft limit for every
single group (even those you do not care about) just to make balancing
work reasonably for all hierarchies.
Anyway, this is just one part of the series and it doesn't make sense to
postpone the whole work just for this. If _more people_ really think that
the default limit change is _so_ confusing and unusable then of course I
will not push it through over dead bodies.
> It is not different from hardlimit. There's nothing "interesting"
> about it.
>
> Even for flat hierarchy, with your interpretation of the knob, it is
> impossible to say "I don't really care about this thing, if it goes
> over 30M, hammer on it", which is a completely reasonable thing to
> want.
Nothing prevents this setting. I am just claiming that this is not
the most interesting use case for the soft limit and I would like to
optimize for more interesting use cases.
The patch set will follow.
--
Michal Hocko
SUSE Labs
* [RFC v2 0/4] soft limit rework
2013-04-22 18:30 ` Tejun Heo
2013-04-23 9:29 ` Michal Hocko
@ 2013-04-23 9:33 ` Michal Hocko
2013-04-23 9:33 ` [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko
` (3 more replies)
2013-04-24 21:45 ` memcg: softlimit on internal nodes Johannes Weiner
2 siblings, 4 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
This is the second version of the patchset. There were some minor
cleanups since the last version and I have moved "memcg: Ignore soft
limit until it is explicitly specified" to the end of the series as it
seems to be more controversial than I thought.
The basic idea is quite simple. Pull soft reclaim into shrink_zone in
the first step and get rid of the previous soft reclaim infrastructure.
shrink_zone is done in two passes now. First it tries to do the soft
limit reclaim and it falls back to reclaim-all-mode if no group is over
the limit or no pages have been scanned. The second pass happens at the
same priority so the only time we waste is the memcg tree walk which
shouldn't be a big deal [1]. There is certainly room for improvements in
that direction. But let's keep it simple for now.
As a bonus we will get rid of a _lot_ of code by this and soft reclaim
will not stand out like before. The cleanup is in a separate patch because
I felt it would be easier to review that way.
The second step is soft limit reclaim integration into targeted
reclaim. It should be rather straightforward. Soft limit has been used
only for the global reclaim so far, but it makes sense for any kind of
pressure coming from up the hierarchy, including targeted reclaim.
The last step is somewhat more controversial, as the discussions show. I
am redefining the meaning of the default soft limit value. I've not chosen
0 as we discussed previously because I want to preserve hierarchical
property of the soft limit (if a parent up the hierarchy is over its
limit then children are over as well - same as with the hard limit) so
I have kept the default untouched - unlimited - but I have slightly
changed the meaning of this value. I interpret it as "user doesn't
care about the soft limit". More precisely, the value is ignored unless it
has been specified by the admin/user, so such groups are eligible for soft
reclaim even though they do not reach the limit. Such groups do not
force their children to be reclaimed so we can look at them as neutral
for the soft reclaim.
I will attach my testing results later on.
Shortlog says:
Michal Hocko (4):
memcg: integrate soft reclaim tighter with zone shrinking code
memcg: Get rid of soft-limit tree infrastructure
vmscan, memcg: Do softlimit reclaim also for targeted reclaim
memcg: Ignore soft limit until it is explicitly specified
And the diffstat:
include/linux/memcontrol.h | 12 +-
mm/memcontrol.c | 438 +++++---------------------------------------
mm/vmscan.c | 62 ++++---
3 files changed, 88 insertions(+), 424 deletions(-)
which sounds optimistic, doesn't it?
---
[1] I have tested this by creating a hierarchy 10 levels deep with
2 groups at each level - all of them below their soft limit and a
single group eligible for the reclaim running dd reading a lot of page
cache. The system time was within stdev compared to the previous
implementation.
* [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code
2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko
@ 2013-04-23 9:33 ` Michal Hocko
2013-04-23 9:33 ` [RFC v2 2/4] memcg: Get rid of soft-limit tree infrastructure Michal Hocko
` (2 subsequent siblings)
3 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Memcg soft reclaim has been traditionally triggered from the global
reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
then picked up a group which exceeds the soft limit the most and
reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
The infrastructure requires per-node-zone trees which hold over-limit
groups and keep them up-to-date (via memcg_check_events) which is not
cost free. Although this overhead hasn't turned out to be a bottleneck,
the implementation is suboptimal because mem_cgroup_update_tree has no
idea which zones consumed memory over the limit, so we could easily end
up having a group on a node-zone tree with only a few pages from that
node-zone.
This patch doesn't try to fix node-zone tree management because
integrating soft reclaim into zone shrinking seems much
easier and more appropriate, for several reasons.
First of all 0 priority reclaim was a crude hack which might lead to
big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
of dirty/writeback pages).
Soft reclaim should be applicable also to the targeted reclaim which is
awkward right now without additional hacks.
Last but not least the whole infrastructure eats quite some code.
After this patch shrink_zone is done in 2 passes. First it tries to do the
soft reclaim if appropriate (only for global reclaim for now to keep
compatible with the original state) and falls back to ignoring the soft
limit if no group is eligible for soft reclaim or nothing has been scanned
during the first pass. Only groups which are over their soft limit, or
which have a parent up the hierarchy over its limit, are considered
eligible during the first pass.
The soft limit tree, which is not necessary anymore, will be removed in a
follow-up patch to keep this patch smaller and easier to review.
Changes since v1
- __shrink_zone doesn't return the number of shrunk groups as nr_scanned
test covers both no groups scanned and no pages from the required zone
as pointed out by Johannes
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
include/linux/memcontrol.h | 10 +--
mm/memcontrol.c | 161 ++++++--------------------------------------
mm/vmscan.c | 62 ++++++++++-------
3 files changed, 59 insertions(+), 174 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..1833c95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);
void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
}
static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
{
- return 0;
+ return false;
}
static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..33424d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2060,57 +2060,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
}
#endif
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
- struct zone *zone,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- struct mem_cgroup *victim = NULL;
- int total = 0;
- int loop = 0;
- unsigned long excess;
- unsigned long nr_scanned;
- struct mem_cgroup_reclaim_cookie reclaim = {
- .zone = zone,
- .priority = 0,
- };
+/*
+ * A group is eligible for soft limit reclaim if
+ * a) it is over its soft limit, or
+ * b) any parent up the hierarchy is over its soft limit
+ */
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *parent = memcg;
- excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
-
- while (1) {
- victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
- if (!victim) {
- loop++;
- if (loop >= 2) {
- /*
- * If we have not been able to reclaim
- * anything, it might because there are
- * no reclaimable pages under this hierarchy
- */
- if (!total)
- break;
- /*
- * We want to do more targeted reclaim.
- * excess >> 2 is not to excessive so as to
- * reclaim too much, nor too less that we keep
- * coming back to reclaim from this cgroup
- */
- if (total >= (excess >> 2) ||
- (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
- break;
- }
- continue;
- }
- if (!mem_cgroup_reclaimable(victim, false))
- continue;
- total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
- zone, &nr_scanned);
- *total_scanned += nr_scanned;
- if (!res_counter_soft_limit_excess(&root_memcg->res))
- break;
+ if (res_counter_soft_limit_excess(&memcg->res))
+ return true;
+
+ /*
+ * If any parent up the hierarchy is over its soft limit then we
+ * have to obey and reclaim from this group as well.
+ */
+ while ((parent = parent_mem_cgroup(parent))) {
+ if (res_counter_soft_limit_excess(&parent->res))
+ return true;
}
- mem_cgroup_iter_break(root_memcg, victim);
- return total;
+
+ return false;
}
/*
@@ -4724,98 +4695,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
return ret;
}
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- unsigned long nr_reclaimed = 0;
- struct mem_cgroup_per_zone *mz, *next_mz = NULL;
- unsigned long reclaimed;
- int loop = 0;
- struct mem_cgroup_tree_per_zone *mctz;
- unsigned long long excess;
- unsigned long nr_scanned;
-
- if (order > 0)
- return 0;
-
- mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
- /*
- * This loop can run a while, specially if mem_cgroup's continuously
- * keep exceeding their soft limit and putting the system under
- * pressure
- */
- do {
- if (next_mz)
- mz = next_mz;
- else
- mz = mem_cgroup_largest_soft_limit_node(mctz);
- if (!mz)
- break;
-
- nr_scanned = 0;
- reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
- gfp_mask, &nr_scanned);
- nr_reclaimed += reclaimed;
- *total_scanned += nr_scanned;
- spin_lock(&mctz->lock);
-
- /*
- * If we failed to reclaim anything from this memory cgroup
- * it is time to move on to the next cgroup
- */
- next_mz = NULL;
- if (!reclaimed) {
- do {
- /*
- * Loop until we find yet another one.
- *
- * By the time we get the soft_limit lock
- * again, someone might have aded the
- * group back on the RB tree. Iterate to
- * make sure we get a different mem.
- * mem_cgroup_largest_soft_limit_node returns
- * NULL if no other cgroup is present on
- * the tree
- */
- next_mz =
- __mem_cgroup_largest_soft_limit_node(mctz);
- if (next_mz == mz)
- css_put(&next_mz->memcg->css);
- else /* next_mz == NULL or other memcg */
- break;
- } while (1);
- }
- __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
- excess = res_counter_soft_limit_excess(&mz->memcg->res);
- /*
- * One school of thought says that we should not add
- * back the node to the tree if reclaim returns 0.
- * But our reclaim could return 0, simply because due
- * to priority we are exposing a smaller subset of
- * memory to reclaim from. Consider this as a longer
- * term TODO.
- */
- /* If excess == 0, no tree ops */
- __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- css_put(&mz->memcg->css);
- loop++;
- /*
- * Could not reclaim anything and there are no more
- * mem cgroups to try or we seem to be looping without
- * reclaiming anything.
- */
- if (!nr_reclaimed &&
- (next_mz == NULL ||
- loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
- break;
- } while (!nr_reclaimed);
- if (next_mz)
- css_put(&next_mz->memcg->css);
- return nr_reclaimed;
-}
-
/**
* mem_cgroup_force_empty_list - clears LRU of a group
* @memcg: group to clear
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..0d0c9e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -138,11 +138,21 @@ static bool global_reclaim(struct scan_control *sc)
{
return !sc->target_mem_cgroup;
}
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+ return global_reclaim(sc);
+}
#else
static bool global_reclaim(struct scan_control *sc)
{
return true;
}
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+ return false;
+}
#endif
static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -1942,7 +1952,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
}
}
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
+static void
+__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
{
unsigned long nr_reclaimed, nr_scanned;
@@ -1961,6 +1972,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
do {
struct lruvec *lruvec;
+ if (soft_reclaim &&
+ !mem_cgroup_soft_reclaim_eligible(memcg)) {
+ memcg = mem_cgroup_iter(root, memcg, &reclaim);
+ continue;
+ }
+
lruvec = mem_cgroup_zone_lruvec(zone, memcg);
shrink_lruvec(lruvec, sc);
@@ -1986,6 +2003,24 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
sc->nr_scanned - nr_scanned, sc));
}
+
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
+{
+ bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
+ unsigned long nr_scanned = sc->nr_scanned;
+
+ __shrink_zone(zone, sc, do_soft_reclaim);
+
+ /*
+ * No group is over the soft limit or those that are do not have
+ * pages in the zone we are reclaiming so we have to reclaim everybody
+ */
+ if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) {
+ __shrink_zone(zone, sc, false);
+ return;
+ }
+}
+
/* Returns true if compaction should go ahead for a high-order request */
static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
{
@@ -2047,8 +2082,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
- unsigned long nr_soft_reclaimed;
- unsigned long nr_soft_scanned;
bool aborted_reclaim = false;
/*
@@ -2088,18 +2121,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
continue;
}
}
- /*
- * This steals pages from memory cgroups over softlimit
- * and returns the number of reclaimed pages and
- * scanned pages. This works for global memory pressure
- * and balancing, not for a memcg's limit.
- */
- nr_soft_scanned = 0;
- nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
- sc->order, sc->gfp_mask,
- &nr_soft_scanned);
- sc->nr_reclaimed += nr_soft_reclaimed;
- sc->nr_scanned += nr_soft_scanned;
/* need some check for avoid more shrink_zone() */
}
@@ -2620,8 +2641,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
struct reclaim_state *reclaim_state = current->reclaim_state;
- unsigned long nr_soft_reclaimed;
- unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
@@ -2720,15 +2739,6 @@ loop_again:
sc.nr_scanned = 0;
- nr_soft_scanned = 0;
- /*
- * Call soft limit reclaim before calling shrink_zone.
- */
- nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
- order, sc.gfp_mask,
- &nr_soft_scanned);
- sc.nr_reclaimed += nr_soft_reclaimed;
-
/*
* We put equal pressure on every zone, unless
* one zone has way too many pages free
--
1.7.10.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC v2 2/4] memcg: Get rid of soft-limit tree infrastructure
2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko
2013-04-23 9:33 ` [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko
@ 2013-04-23 9:33 ` Michal Hocko
2013-04-23 9:33 ` [RFC v2 3/4] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
2013-04-23 9:33 ` [RFC v2 4/4] memcg: Ignore soft limit until it is explicitly specified Michal Hocko
3 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Now that the soft limit is integrated directly into the reclaim path,
the whole soft-limit tree infrastructure is not needed anymore. Rip it out.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
mm/memcontrol.c | 251 +------------------------------------------------------
1 file changed, 1 insertion(+), 250 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 33424d8..d927e2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -39,7 +39,6 @@
#include <linux/limits.h>
#include <linux/export.h>
#include <linux/mutex.h>
-#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -136,7 +135,6 @@ static const char * const mem_cgroup_lru_names[] = {
*/
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
- MEM_CGROUP_TARGET_SOFTLIMIT,
MEM_CGROUP_TARGET_NUMAINFO,
MEM_CGROUP_NTARGETS,
};
@@ -172,10 +170,6 @@ struct mem_cgroup_per_zone {
struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
- struct rb_node tree_node; /* RB tree node */
- unsigned long long usage_in_excess;/* Set to the value by which */
- /* the soft limit is exceeded*/
- bool on_tree;
struct mem_cgroup *memcg; /* Back pointer, we cannot */
/* use container_of */
};
@@ -188,26 +182,6 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[0];
};
-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
- struct rb_root rb_root;
- spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
- struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
- struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
struct mem_cgroup_threshold {
struct eventfd_ctx *eventfd;
u64 threshold;
@@ -528,7 +502,6 @@ static bool move_file(void)
* limit reclaim to prevent infinite loops, if they ever occur.
*/
#define MEM_CGROUP_MAX_RECLAIM_LOOPS 100
-#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -741,164 +714,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page)
return mem_cgroup_zoneinfo(memcg, nid, zid);
}
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
-
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz,
- unsigned long long new_usage_in_excess)
-{
- struct rb_node **p = &mctz->rb_root.rb_node;
- struct rb_node *parent = NULL;
- struct mem_cgroup_per_zone *mz_node;
-
- if (mz->on_tree)
- return;
-
- mz->usage_in_excess = new_usage_in_excess;
- if (!mz->usage_in_excess)
- return;
- while (*p) {
- parent = *p;
- mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
- tree_node);
- if (mz->usage_in_excess < mz_node->usage_in_excess)
- p = &(*p)->rb_left;
- /*
- * We can't avoid mem cgroups that are over their soft
- * limit by the same amount
- */
- else if (mz->usage_in_excess >= mz_node->usage_in_excess)
- p = &(*p)->rb_right;
- }
- rb_link_node(&mz->tree_node, parent, p);
- rb_insert_color(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- if (!mz->on_tree)
- return;
- rb_erase(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- spin_lock(&mctz->lock);
- __mem_cgroup_remove_exceeded(memcg, mz, mctz);
- spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
-{
- unsigned long long excess;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
- mctz = soft_limit_tree_from_page(page);
-
- /*
- * Necessary to update all ancestors when hierarchy is used.
- * because their event counter is not touched.
- */
- for (; memcg; memcg = parent_mem_cgroup(memcg)) {
- mz = mem_cgroup_zoneinfo(memcg, nid, zid);
- excess = res_counter_soft_limit_excess(&memcg->res);
- /*
- * We have to update the tree if mz is on RB-tree or
- * mem is over its softlimit.
- */
- if (excess || mz->on_tree) {
- spin_lock(&mctz->lock);
- /* if on-tree, remove it */
- if (mz->on_tree)
- __mem_cgroup_remove_exceeded(memcg, mz, mctz);
- /*
- * Insert again. mz->usage_in_excess will be updated.
- * If excess is 0, no tree ops.
- */
- __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- }
- }
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
-{
- int node, zone;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
-
- for_each_node(node) {
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- mz = mem_cgroup_zoneinfo(memcg, node, zone);
- mctz = soft_limit_tree_node_zone(node, zone);
- mem_cgroup_remove_exceeded(memcg, mz, mctz);
- }
- }
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct rb_node *rightmost = NULL;
- struct mem_cgroup_per_zone *mz;
-
-retry:
- mz = NULL;
- rightmost = rb_last(&mctz->rb_root);
- if (!rightmost)
- goto done; /* Nothing to reclaim from */
-
- mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
- /*
- * Remove the node now but someone else can add it back,
- * we will to add it back at the end of reclaim to its correct
- * position in the tree.
- */
- __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
- if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
- !css_tryget(&mz->memcg->css))
- goto retry;
-done:
- return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct mem_cgroup_per_zone *mz;
-
- spin_lock(&mctz->lock);
- mz = __mem_cgroup_largest_soft_limit_node(mctz);
- spin_unlock(&mctz->lock);
- return mz;
-}
-
/*
* Implementation Note: reading percpu statistics for memcg.
*
@@ -1052,9 +867,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
case MEM_CGROUP_TARGET_THRESH:
next = val + THRESHOLDS_EVENTS_TARGET;
break;
- case MEM_CGROUP_TARGET_SOFTLIMIT:
- next = val + SOFTLIMIT_EVENTS_TARGET;
- break;
case MEM_CGROUP_TARGET_NUMAINFO:
next = val + NUMAINFO_EVENTS_TARGET;
break;
@@ -1077,11 +889,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
/* threshold event is triggered in finer grain than soft limit */
if (unlikely(mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_THRESH))) {
- bool do_softlimit;
bool do_numainfo __maybe_unused;
- do_softlimit = mem_cgroup_event_ratelimit(memcg,
- MEM_CGROUP_TARGET_SOFTLIMIT);
#if MAX_NUMNODES > 1
do_numainfo = mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_NUMAINFO);
@@ -1089,8 +898,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
preempt_enable();
mem_cgroup_threshold(memcg);
- if (unlikely(do_softlimit))
- mem_cgroup_update_tree(memcg, page);
#if MAX_NUMNODES > 1
if (unlikely(do_numainfo))
atomic_inc(&memcg->numainfo_events);
@@ -1923,28 +1730,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
return total;
}
-/**
- * test_mem_cgroup_node_reclaimable
- * @memcg: the target memcg
- * @nid: the node ID to be checked.
- * @noswap : specify true here if the user wants flle only information.
- *
- * This function returns whether the specified memcg contains any
- * reclaimable pages on a node. Returns true if there are any reclaimable
- * pages in the node.
- */
-static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
- int nid, bool noswap)
-{
- if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
- return true;
- if (noswap || !total_swap_pages)
- return false;
- if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON))
- return true;
- return false;
-
-}
#if MAX_NUMNODES > 1
/*
@@ -2053,11 +1838,6 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
{
return 0;
}
-
-static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
-{
- return test_mem_cgroup_node_reclaimable(memcg, 0, noswap);
-}
#endif
/*
@@ -2932,9 +2712,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
unlock_page_cgroup(pc);
/*
- * "charge_statistics" updated event counter. Then, check it.
- * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
- * if they exceeds softlimit.
+ * "charge_statistics" updated event counter.
*/
memcg_check_events(memcg, page);
}
@@ -6053,8 +5831,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
lruvec_init(&mz->lruvec);
- mz->usage_in_excess = 0;
- mz->on_tree = false;
mz->memcg = memcg;
}
memcg->info.nodeinfo[node] = pn;
@@ -6110,7 +5886,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
int node;
size_t size = memcg_size();
- mem_cgroup_remove_from_trees(memcg);
free_css_id(&mem_cgroup_subsys, &memcg->css);
for_each_node(node)
@@ -6192,29 +5967,6 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
}
EXPORT_SYMBOL(parent_mem_cgroup);
-static void __init mem_cgroup_soft_limit_tree_init(void)
-{
- struct mem_cgroup_tree_per_node *rtpn;
- struct mem_cgroup_tree_per_zone *rtpz;
- int tmp, node, zone;
-
- for_each_node(node) {
- tmp = node;
- if (!node_state(node, N_NORMAL_MEMORY))
- tmp = -1;
- rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
- BUG_ON(!rtpn);
-
- soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- rtpz = &rtpn->rb_tree_per_zone[zone];
- rtpz->rb_root = RB_ROOT;
- spin_lock_init(&rtpz->lock);
- }
- }
-}
-
static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup *cont)
{
@@ -6990,7 +6742,6 @@ static int __init mem_cgroup_init(void)
{
hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
enable_swap_cgroup();
- mem_cgroup_soft_limit_tree_init();
memcg_stock_init();
return 0;
}
--
1.7.10.4
* [RFC v2 3/4] vmscan, memcg: Do softlimit reclaim also for targeted reclaim
2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko
2013-04-23 9:33 ` [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko
2013-04-23 9:33 ` [RFC v2 2/4] memcg: Get rid of soft-limit tree infrastructure Michal Hocko
@ 2013-04-23 9:33 ` Michal Hocko
2013-04-23 9:33 ` [RFC v2 4/4] memcg: Ignore soft limit until it is explicitly specified Michal Hocko
3 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Soft reclaim has been done only for global reclaim (both background
and direct). Since "memcg: integrate soft reclaim tighter with zone
shrinking code" there is no reason for this limitation anymore as the
soft limit reclaim doesn't use any special code paths and is a part
of the zone shrinking code, which is used by both global and
targeted reclaim.
From a semantic point of view it is even natural to consider the soft
limit before touching all groups in the hierarchy tree which is
hitting the hard limit, because the soft limit tells us where to push
back when there is memory pressure. It is not important whether the
pressure comes from the limit or from imbalanced zones.
This patch simply enables soft reclaim unconditionally in
mem_cgroup_should_soft_reclaim so it is enabled for both global and
targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
about the root of the reclaim to know where to stop checking soft limit
state of parents up the hierarchy.
Say we have
A (over soft limit)
\
B (below s.l., hit the hard limit)
/ \
C D (below s.l.)
B is now the source of the outside memory pressure for D, but we
shouldn't soft reclaim D because it is behaving well under the B
subtree and we can still reclaim from C (presumably it is over its
limit). mem_cgroup_soft_reclaim_eligible should therefore stop
climbing up the hierarchy at B (the root of the memory pressure).
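The walk described above can be modeled in userspace. The sketch below
is a hypothetical Python rendering of the mem_cgroup_soft_reclaim_eligible
semantics (the Cgroup class and field names are illustrative, not kernel
API): a group is eligible if it, or any ancestor up to and including the
reclaim root, exceeds its soft limit.

```python
class Cgroup:
    """Minimal stand-in for a memcg: usage, soft limit, parent pointer."""
    def __init__(self, name, usage, soft_limit, parent=None):
        self.name, self.usage = name, usage
        self.soft_limit, self.parent = soft_limit, parent

    def soft_limit_excess(self):
        # Models res_counter_soft_limit_excess(): 0 when below the limit.
        return max(0, self.usage - self.soft_limit)

def soft_reclaim_eligible(memcg, root):
    """Eligible if the group, or any ancestor up to and including the
    reclaim root, is over its soft limit. The walk checks the root's
    excess before stopping, as in the v2 patch."""
    if memcg.soft_limit_excess():
        return True
    parent = memcg.parent
    while parent is not None:
        if parent.soft_limit_excess():
            return True
        if parent is root:
            break
        parent = parent.parent
    return False

# The A/B/C/D example from the changelog: A is over its soft limit,
# B (the reclaim root, it hit its hard limit) and D are below theirs,
# C is over its own.
A = Cgroup("A", usage=10, soft_limit=5)
B = Cgroup("B", usage=4, soft_limit=6, parent=A)
C = Cgroup("C", usage=3, soft_limit=1, parent=B)
D = Cgroup("D", usage=2, soft_limit=4, parent=B)

print(soft_reclaim_eligible(C, root=B))  # True: C is over its own soft limit
print(soft_reclaim_eligible(D, root=B))  # False: the walk stops at B, never sees A
```

With root=A instead, D would become eligible again because A's excess is
then visible to the walk, which is exactly why the root parameter matters.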
Changes since v1
- add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
include/linux/memcontrol.h | 6 ++++--
mm/memcontrol.c | 14 +++++++++-----
mm/vmscan.c | 4 ++--
3 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1833c95..80ed1b6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,7 +179,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+ struct mem_cgroup *root);
void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -356,7 +357,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
}
static inline
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+ struct mem_cgroup *root)
{
return false;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d927e2e..14d3d23 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1841,11 +1841,13 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
#endif
/*
- * A group is eligible for the soft limit reclaim if it is
- * a) is over its soft limit
+ * A group is eligible for the soft limit reclaim under the given root
+ * hierarchy if
+ * a) it is over its soft limit
* b) any parent up the hierarchy is over its soft limit
*/
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+ struct mem_cgroup *root)
{
struct mem_cgroup *parent = memcg;
@@ -1853,12 +1855,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
return true;
/*
- * If any parent up the hierarchy is over its soft limit then we
- * have to obey and reclaim from this group as well.
+ * If any parent up to the root in the hierarchy is over its soft limit
+ * then we have to obey and reclaim from this group as well.
*/
while((parent = parent_mem_cgroup(parent))) {
if (res_counter_soft_limit_excess(&parent->res))
return true;
+ if (parent == root)
+ break;
}
return false;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0d0c9e7..471bf94 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -141,7 +141,7 @@ static bool global_reclaim(struct scan_control *sc)
static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
{
- return global_reclaim(sc);
+ return true;
}
#else
static bool global_reclaim(struct scan_control *sc)
@@ -1973,7 +1973,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
struct lruvec *lruvec;
if (soft_reclaim &&
- !mem_cgroup_soft_reclaim_eligible(memcg)) {
+ !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
memcg = mem_cgroup_iter(root, memcg, &reclaim);
continue;
}
--
1.7.10.4
* [RFC v2 4/4] memcg: Ignore soft limit until it is explicitly specified
2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko
` (2 preceding siblings ...)
2013-04-23 9:33 ` [RFC v2 3/4] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
@ 2013-04-23 9:33 ` Michal Hocko
3 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 9:33 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
The soft limit has been traditionally initialized to RESOURCE_MAX
which means that the group is soft unlimited by default and so it
gets reclaimed only after all groups that set their limit are below
their limits. While this scheme works it is not ideal because it
makes it hard to configure isolated workloads without setting a limit
on basically all groups. Let's consider the following simple hierarchy
__A_____
/ \ \
A1....An C
and let's assume we would like to keep C's working set intact as much
as possible (with soft limit set to the estimated working set size)
so that A{i} groups do not interfere with it (A{i} might represent
backup processes or other maintenance activities which can consume
quite a lot of memory). If the A{i} groups have the default soft
limit then C would be preferred for reclaim until it eventually drops
to its soft limit, and it would then be reclaimed again once the
memory pressure from A{i} grows large enough that the A{i} groups get
reclaimed as well.
There are basically 2 options how to handle A{i} groups:
- distribute hard limit to (A.limit - C.soft_limit)
- set soft limit to 0
The first option is impractical because it would throttle A{i} even
though there is quite some idle memory lying around. The latter
option would certainly work because A{i} would get reclaimed whenever
there is pressure coming from A. This however basically disables any
soft limit settings down the A{i} hierarchies, which sounds
unnecessarily strict (not to mention that we have to set up a limit
for every A{i}).
Moreover, if A is the root memcg then there is no reasonable way to
make it stop interfering with other loads because setting the soft
limit would kill the limits downwards and the hard limit cannot be set.
Neither extreme - unlimited vs. 0 - is ideal, apparently. There is a
compromise we can make, though. This patch doesn't change the default
soft limit value. Instead it distinguishes between groups with the
soft limit enabled - it has been set by a user - and disabled, which
is the default. Unlike groups with the limit set to 0, such groups do
not propagate their reclaimable state down the hierarchy, so they act
only for themselves.
Getting back to the previous example: only C would get a limit from
the admin, and reclaim would reclaim all A{i} groups, and C
eventually when it crosses its limit.
This means that the soft limit is much easier to maintain now because
only the interesting groups (those for which the administrator knows
how much pushback makes sense for graceful overcommit handling) need
to be taken care of, and the rest of the groups are reclaimed
proportionally.
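The semantics above can be sketched with the same kind of userspace
model. The Python below is hypothetical and illustrative (a soft_limit
of None stands in for the new "never set by the admin" state, and the
names are not kernel API): a group without an explicit limit is always
fair game for soft reclaim, and ancestors only propagate pressure when
their limit was explicitly set.

```python
class Cgroup:
    def __init__(self, name, usage, soft_limit=None, parent=None):
        # soft_limit=None models the default: not soft_limited.
        self.name, self.usage, self.parent = name, usage, parent
        self.soft_limited = soft_limit is not None
        self.soft_limit = soft_limit if soft_limit is not None else float("inf")

    def soft_limit_excess(self):
        return max(0, self.usage - self.soft_limit)

def soft_reclaim_eligible(memcg, root=None):
    if not memcg.soft_limited:          # a) no limit set: always reclaimable
        return True
    if memcg.soft_limit_excess():       # b) over its own soft limit
        return True
    parent = memcg.parent               # c) a soft-limited ancestor is over
    while parent is not None:
        if parent.soft_limited and parent.soft_limit_excess():
            return True
        if parent is root:
            break
        parent = parent.parent
    return False

# The A / A1..An / C example: only C gets an explicit soft limit.
A  = Cgroup("A", usage=100)                          # no limit set
A1 = Cgroup("A1", usage=60, parent=A)                # backup job, default
C  = Cgroup("C", usage=30, soft_limit=40, parent=A)  # protected working set

print(soft_reclaim_eligible(A1))  # True: default groups are reclaimed first
print(soft_reclaim_eligible(C))   # False: under its explicit limit, left alone
```

Raising C's usage above 40 makes it eligible again, matching the
"reclaimed eventually when it crosses its limit" behavior described
above.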
TODO: How do we present default unlimited vs. RESOURCE_MAX set by
the user? One possible way could be returning -1 for RES_SOFT_LIMIT &&
!soft_limited.
TODO: update doc
Changes since v1
- return -1 when reading memory.soft_limit_in_bytes for unlimited
groups.
- reorganized checks in mem_cgroup_soft_reclaim_eligible to be more
readable.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
mm/memcontrol.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14d3d23..03ddbcc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -266,6 +266,10 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
+ /*
+ * Is the group soft limited?
+ */
+ bool soft_limited;
unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */
bool oom_lock;
@@ -1843,14 +1847,20 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
/*
* A group is eligible for the soft limit reclaim under the given root
* hierarchy if
- * a) it is over its soft limit
- * b) any parent up the hierarchy is over its soft limit
+ * a) doesn't have any soft limit set
+ * b) is over its soft limit
+ * c) any parent up the hierarchy is over its soft limit
*/
bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
struct mem_cgroup *root)
{
struct mem_cgroup *parent = memcg;
+ /* No specific soft limit set, eligible for soft reclaim */
+ if (!memcg->soft_limited)
+ return true;
+
+ /* Soft limit exceeded, eligible for soft reclaim */
if (res_counter_soft_limit_excess(&memcg->res))
return true;
@@ -1859,7 +1869,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
* then we have to obey and reclaim from this group as well.
*/
while((parent = parent_mem_cgroup(parent))) {
- if (res_counter_soft_limit_excess(&parent->res))
+ if (parent->soft_limited &&
+ res_counter_soft_limit_excess(&parent->res))
return true;
if (parent == root)
break;
@@ -4754,10 +4765,13 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
switch (type) {
case _MEM:
- if (name == RES_USAGE)
+ if (name == RES_USAGE) {
val = mem_cgroup_usage(memcg, false);
- else
+ } else if (name == RES_SOFT_LIMIT && !memcg->soft_limited) {
+ return simple_read_from_buffer(buf, nbytes, ppos, "-1\n", 3);
+ } else {
val = res_counter_read_u64(&memcg->res, name);
+ }
break;
case _MEMSWAP:
if (name == RES_USAGE)
@@ -5019,6 +5033,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
ret = res_counter_set_soft_limit(&memcg->res, val);
else
ret = -EINVAL;
+
+ /*
+ * We could disable soft_limited when we get RESOURCE_MAX but
+ * then we have a little problem to distinguish the default
+ * unlimited and limited but never soft reclaimed groups.
+ */
+ if (!ret)
+ memcg->soft_limited = true;
break;
default:
ret = -EINVAL; /* should be BUG() ? */
--
1.7.10.4
* Re: memcg: softlimit on internal nodes
2013-04-22 15:54 ` Michal Hocko
2013-04-22 16:01 ` Tejun Heo
@ 2013-04-23 9:58 ` Michel Lespinasse
2013-04-23 10:17 ` Glauber Costa
` (2 more replies)
1 sibling, 3 replies; 46+ messages in thread
From: Michel Lespinasse @ 2013-04-23 9:58 UTC (permalink / raw)
To: Michal Hocko
Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Mon 22-04-13 08:46:20, Tejun Heo wrote:
>> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
>> please talk with google memcg people. They have very different
>> interpretation of what "softlimit" is and are using it according to
>> that interpretation. If it *is* an actual soft limit, there is no
>> inherent isolation coming from it and that should be clear to
>> everyone.
>
> We have discussed that for a long time. I will not speak for Greg & Ying
> but from my POV we have agreed that the current implementation will work
> for them with some (minor) changes in their layout.
> As I have said already with a careful configuration (i.e. setting the
> soft limit only where it matters - where it protects important
> memory which is usually in the leaf nodes)
I don't like your argument that soft limits work if you only set them
on leaves. To me this is just a fancy way of saying that hierarchical
soft limits don't work.
Also it is somewhat problematic to assume that important memory can
easily be placed in leaves. This is difficult to ensure when
subcontainer destruction, for example, moves the memory back into the
parent.
> you can actually achieve
> _high_ probability for not being reclaimed after the rework which was not
> possible before because of the implementation which was ugly and
> smelled.
So, to be clear, what we (google MM people) want from soft limits is
some form of protection against being reclaimed from when your cgroup
(or its parent) is below the soft limit.
I don't like to call it a guarantee either, because we understand that
it comes with some limitations - for example, if all user pages on a
given node are yours then allocations from that node might cause some
of your pages to be reclaimed, even when you're under your soft limit.
But we want some form of (weak) guarantee that can be made to work
well enough in practice.
Before your change, soft limits didn't actually provide any such form
of guarantee, weak or not, since global reclaim would ignore soft
limits.
With your proposal, soft limits at least do provide the weak guarantee
that we want, when not using hierarchies. We see this as a very clear
improvement over the previous situation, so we're very happy about
your patchset!
However, your proposal takes that weak guarantee away as soon as one
tries to use cgroup hierarchies with it, because it reclaims from
every child cgroup as soon as the parent hits its soft limit. This is
disappointing, and I also have not heard why you want things to work
that way. Is this an ease-of-implementation issue, or do you consider
the requirement itself a bad idea? If it's the latter, what's your
counterpoint: is it related to delegation, or is it something else
that I haven't heard of?
I don't think referring to the existing memcg documentation makes a
strong point - the documentation never said that soft limits were not
obeyed by global reclaim and yet we both agree that it'd be preferable
if they were. So I would like to hear your reasons (apart from
referring to the existing documentation) for not allowing a parent
cgroup to protect its children from reclaim when the parent's total
charge is under the parent's soft limit.
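The two semantics being contrasted can be written down as a pair of
predicates. This is a hypothetical sketch for illustration only; the
class, names, and numbers are invented and this is not actual memcg
code:

```python
# Hypothetical sketch contrasting the two soft-limit semantics under
# discussion; names and structure are invented, not actual memcg code.

class Cgroup:
    def __init__(self, name, soft_limit, usage, parent=None):
        self.name = name
        self.soft_limit = soft_limit  # in GB, for readability
        self.usage = usage            # hierarchical charge, in GB
        self.parent = parent

def over_limit(cg):
    return cg.usage > cg.soft_limit

def eligible_patchset(cg):
    """Behavior described above: a cgroup becomes fair game for
    soft-limit reclaim once it, or any ancestor, is over its limit."""
    node = cg
    while node is not None:
        if over_limit(node):
            return True
        node = node.parent
    return False

def eligible_wanted(cg):
    """Semantics asked for here: an ancestor whose total charge is
    under its own soft limit shields the subtree, so a cgroup is
    eligible only if it and every ancestor are over their limits."""
    node = cg
    while node is not None:
        if not over_limit(node):
            return False
        node = node.parent
    return True

# A is over its soft limit; its child B is well under its own.
A = Cgroup("A", soft_limit=4, usage=5)
B = Cgroup("B", soft_limit=1, usage=0.5, parent=A)

print(eligible_patchset(B))  # True: the parent being over exposes B
print(eligible_wanted(B))    # False: B is under its own limit
```

Under the first predicate B is reclaimed as soon as A goes over; under
the second, B's own soft limit actually protects it.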
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: memcg: softlimit on internal nodes
2013-04-23 9:58 ` Michel Lespinasse
@ 2013-04-23 10:17 ` Glauber Costa
2013-04-23 11:40 ` Michal Hocko
2013-04-23 11:32 ` Michal Hocko
2013-04-23 12:51 ` Michal Hocko
2 siblings, 1 reply; 46+ messages in thread
From: Glauber Costa @ 2013-04-23 10:17 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Michal Hocko, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Greg Thelen
On 04/23/2013 01:58 PM, Michel Lespinasse wrote:
> On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> On Mon 22-04-13 08:46:20, Tejun Heo wrote:
>>> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
>>> please talk with google memcg people. They have very different
>>> interpretation of what "softlimit" is and are using it according to
>>> that interpretation. If it *is* an actual soft limit, there is no
>>> inherent isolation coming from it and that should be clear to
>>> everyone.
>>
>> We have discussed that for a long time. I will not speak for Greg & Ying
>> but from my POV we have agreed that the current implementation will work
>> for them with some (minor) changes in their layout.
>> As I have said already with a careful configuration (e.i. setting the
>> soft limit only where it matters - where it protects an important
>> memory which is usually in the leaf nodes)
>
> I don't like your argument that soft limits work if you only set them
> on leaves. To me this is just a fancy way of saying that hierarchical
> soft limits don't work.
>
> Also it is somewhat problematic to assume that important memory can
> easily be placed in leaves. This is difficult to ensure when
> subcontainer destruction, for example, moves the memory back into the
> parent.
>
Michal,
For the most part, I am siding with you in this discussion.
But with this only-in-leaves thing, I am forced to flip (at least for this).
You are right when you say that in a configuration with A being parent
of B and C, A being over its hard limit will affect reclaim in B and C,
and soft limits should work the same.
However, "will affect reclaim" is a bit vague. More specifically, if
the sum of B's and C's hard limits is smaller than or equal to A's
hard limit, the only way for either B or C to trigger A's hard limit
is for them, themselves, to go over their own hard limit.
*This* is the case you are breaking when you try to establish a
comparison between soft and hard limits - which is, per se, sane.
Translating this to soft-limit terms: if the sum of B's and C's soft
limits is smaller than or equal to A's soft limit, and one of them is
over its soft limit, that one should be reclaimed. The other should be
left alone.
I understand perfectly well that the soft limit is a best effort, not
a guarantee. But if we don't do that, then we are merely making an
effort, not a best effort.
This would only be attempted in our first pass. In the second pass, we
reclaim from whoever.
It is also not that hard to do: flatten the tree into a list, with the
leaves always placed before the inner nodes. Start reclaiming from
nodes over their soft limit, hierarchically. This means that whenever
we reach an inner node and it is *still* over its soft limit, we are
guaranteed to have scanned its children already. In the case I
described, the child over its soft limit would have been reclaimed,
without the well-behaved child being touched. Now all three are okay.
If we reached an inner node and we still have a soft limit problem, then
we are effectively talking about the case you have been describing.
Reclaim from whoever you want.
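The flattening step described above amounts to a post-order walk. A
minimal sketch, in Python, purely as an illustration (the class and
names are invented; this is not the actual memcg code):

```python
# Hypothetical sketch of leaves-before-inner-nodes flattening; not
# actual memcg code.

class Cgroup:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def flatten_leaves_first(root):
    """Post-order walk: every child is emitted before its parent, so
    by the time an inner node is scanned its whole subtree has
    already been scanned."""
    order = []
    def walk(node):
        for child in node.children:
            walk(child)
        order.append(node)
    walk(root)
    return order

# The A/B/C hierarchy from the discussion:
root = Cgroup("A", children=[Cgroup("B"), Cgroup("C")])
print([cg.name for cg in flatten_leaves_first(root)])  # ['B', 'C', 'A']
```

With that ordering in hand, the first reclaim pass only touches
groups over their own soft limit, and the second pass (if still
needed) reclaims from whoever.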
* Re: memcg: softlimit on internal nodes
2013-04-23 9:58 ` Michel Lespinasse
2013-04-23 10:17 ` Glauber Costa
@ 2013-04-23 11:32 ` Michal Hocko
2013-04-23 12:45 ` Michel Lespinasse
2013-04-23 12:51 ` Michal Hocko
2 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 11:32 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
On Tue 23-04-13 02:58:19, Michel Lespinasse wrote:
> On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Mon 22-04-13 08:46:20, Tejun Heo wrote:
> >> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
> >> please talk with google memcg people. They have very different
> >> interpretation of what "softlimit" is and are using it according to
> >> that interpretation. If it *is* an actual soft limit, there is no
> >> inherent isolation coming from it and that should be clear to
> >> everyone.
> >
> > We have discussed that for a long time. I will not speak for Greg & Ying
> > but from my POV we have agreed that the current implementation will work
> > for them with some (minor) changes in their layout.
> > As I have said already with a careful configuration (e.i. setting the
> > soft limit only where it matters - where it protects an important
> > memory which is usually in the leaf nodes)
>
> I don't like your argument that soft limits work if you only set them
> on leaves.
I didn't say that. Please read it again: "where it protects an
important memory which is _usually_ in the leaf nodes". Intermediate
nodes can of course contain some important memory as well, and you can
well "protect" them by the soft limit; you just have to be very
careful, because what you end up with is quite a complicated
structure. You have a node that has some portion of its own memory
mixed with reparented pages. You cannot distinguish the two, of
course, so protection is somewhat harder to achieve. That is the
reason why I encourage not using any limit on the intermediate node,
which with my patchset means the node gets reclaimed.
> To me this is just a fancy way of saying that hierarchical soft limits
> don't work.
It works the same as the hard limit; it just triggers later.
> Also it is somewhat problematic to assume that important memory can
> easily be placed in leaves. This is difficult to ensure when
> subcontainer destruction, for example, moves the memory back into the
> parent.
Is the memory still important then? The workload which used the memory
is done, so this ends up being just cached data.
> > you can actually achieve
> > _high_ probability for not being reclaimed after the rework which was not
> > possible before because of the implementation which was ugly and
> > smelled.
>
> So, to be clear, what we (google MM people) want from soft limits is
> some form of protection against being reclaimed from when your cgroup
> (or its parent) is below the soft limit.
>
> I don't like to call it a guarantee either, because we understand that
> it comes with some limitations - for example, if all user pages on a
> given node are yours then allocations from that node might cause some
> of your pages to be reclaimed, even when you're under your soft limit.
> But we want some form of (weak) guarantee that can be made to work
> good enough in practice.
>
> Before your change, soft limits didn't actually provide any such form
> of guarantee, weak or not, since global reclaim would ignore soft
> limits.
>
> With your proposal, soft limits at least do provide the weak guarantee
> that we want, when not using hierarchies. We see this as a very clear
> improvement over the previous situation, so we're very happy about
> your patchset !
>
> However, your proposal takes that weak guarantee away as soon as one
> tries to use cgroup hierarchies with it, because it reclaims from
> every child cgroup as soon as the parent hits its soft limit. This is
> disappointing and also, I have not heard of why you want things to
> work that way ?
Sigh. Because if children didn't follow the parent's limit then they
could easily escape from reclaim, pushing it back to unrelated
hierarchies in the tree, as the parent wouldn't be able to reclaim
down to its limit.
> Is this an ease of implementation issue or do you consider that
> requirement as a bad idea ? And if it's the later, what's your
> counterpoint, is it related to delegation or is it something else that
> I haven't heard of ?
The implementation can be improved, and child groups might be
reclaimed _only_ if the parent cannot satisfy its soft limit; that is
not a goal of the current re-implementation, though. The limit itself
has to be preserved.
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-23 10:17 ` Glauber Costa
@ 2013-04-23 11:40 ` Michal Hocko
2013-04-23 11:54 ` Glauber Costa
2013-04-23 12:51 ` Michel Lespinasse
0 siblings, 2 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 11:40 UTC (permalink / raw)
To: Glauber Costa
Cc: Michel Lespinasse, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Greg Thelen
On Tue 23-04-13 14:17:22, Glauber Costa wrote:
> On 04/23/2013 01:58 PM, Michel Lespinasse wrote:
> > On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >> On Mon 22-04-13 08:46:20, Tejun Heo wrote:
> >>> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
> >>> please talk with google memcg people. They have very different
> >>> interpretation of what "softlimit" is and are using it according to
> >>> that interpretation. If it *is* an actual soft limit, there is no
> >>> inherent isolation coming from it and that should be clear to
> >>> everyone.
> >>
> >> We have discussed that for a long time. I will not speak for Greg & Ying
> >> but from my POV we have agreed that the current implementation will work
> >> for them with some (minor) changes in their layout.
> >> As I have said already with a careful configuration (e.i. setting the
> >> soft limit only where it matters - where it protects an important
> >> memory which is usually in the leaf nodes)
> >
> > I don't like your argument that soft limits work if you only set them
> > on leaves. To me this is just a fancy way of saying that hierarchical
> > soft limits don't work.
> >
> > Also it is somewhat problematic to assume that important memory can
> > easily be placed in leaves. This is difficult to ensure when
> > subcontainer destruction, for example, moves the memory back into the
> > parent.
> >
>
> Michal,
>
> For the most part, I am siding with you in this discussion.
> But with this only-in-leaves thing, I am forced to flip (at least for this).
>
> You are right when you say that in a configuration with A being parent
> of B and C, A being over its hard limit will affect reclaim in B and C,
> and soft limits should work the same.
>
> However, "will affect reclaim" is a big vague. More specifically, if the
> sum of B and C's hard limit is smaller or equal A's hard limit, the only
> way of either B or C to trigger A's hard limit is for them, themselves,
> to go over their hard limit.
Which is an expectation that you cannot guarantee. You can have B+C>A.
> *This* is the case you you are breaking when you try to establish a
> comparison between soft and hard limits - which is, per se, sane.
>
> Translating this to the soft limit speech, if the sum of B and C's soft
> limit is smaller or equal A's soft limit, and one of them is over the
> soft limit, that one should be reclaimed. The other should be left alone.
And yet again: nothing will prevent you from setting B+C>A. Sure, if
you configure your hierarchy sanely, then everything will just work.
> I understand perfectly fine that soft limit is a best effort, not a
> guarantee. But if we don't do that, I understand that we are doing
> effort, not best effort.
>
> This would only be attempted in our first pass. In the second pass, we
> reclaim from whoever.
>
> It is also not that hard to do it: Flatten the tree in a list, with the
> leaves always being placed before the inner nodes.
Glauber, I have already pointed out that bottom-up reclaim doesn't
make much sense, because there is a bigger chance that useful data is
stored in the leaf nodes rather than in the inner nodes, which usually
contain mostly reparented pages.
> Start reclaiming from nodes over the soft limit, hierarchically. This
> means that whenever we reach an inner node and it is *still* over
> the soft limit, we are guaranteed to have scanned their children
> already. In the case I described, the children over its soft limit
> would have been reclaimed, without the well behaving children being
> touched. Now all three are okay.
>
> If we reached an inner node and we still have a soft limit problem, then
> we are effectively talking about the case you have been describing.
> Reclaim from whoever you want.
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-23 11:40 ` Michal Hocko
@ 2013-04-23 11:54 ` Glauber Costa
2013-04-23 12:51 ` Michel Lespinasse
1 sibling, 0 replies; 46+ messages in thread
From: Glauber Costa @ 2013-04-23 11:54 UTC (permalink / raw)
To: Michal Hocko
Cc: Michel Lespinasse, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Greg Thelen
On 04/23/2013 03:40 PM, Michal Hocko wrote:
> On Tue 23-04-13 14:17:22, Glauber Costa wrote:
>> On 04/23/2013 01:58 PM, Michel Lespinasse wrote:
>>> On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote:
>>>> On Mon 22-04-13 08:46:20, Tejun Heo wrote:
>>>>> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
>>>>> please talk with google memcg people. They have very different
>>>>> interpretation of what "softlimit" is and are using it according to
>>>>> that interpretation. If it *is* an actual soft limit, there is no
>>>>> inherent isolation coming from it and that should be clear to
>>>>> everyone.
>>>>
>>>> We have discussed that for a long time. I will not speak for Greg & Ying
>>>> but from my POV we have agreed that the current implementation will work
>>>> for them with some (minor) changes in their layout.
>>>> As I have said already with a careful configuration (e.i. setting the
>>>> soft limit only where it matters - where it protects an important
>>>> memory which is usually in the leaf nodes)
>>>
>>> I don't like your argument that soft limits work if you only set them
>>> on leaves. To me this is just a fancy way of saying that hierarchical
>>> soft limits don't work.
>>>
>>> Also it is somewhat problematic to assume that important memory can
>>> easily be placed in leaves. This is difficult to ensure when
>>> subcontainer destruction, for example, moves the memory back into the
>>> parent.
>>>
>>
>> Michal,
>>
>> For the most part, I am siding with you in this discussion.
>> But with this only-in-leaves thing, I am forced to flip (at least for this).
>>
>> You are right when you say that in a configuration with A being parent
>> of B and C, A being over its hard limit will affect reclaim in B and C,
>> and soft limits should work the same.
>>
>> However, "will affect reclaim" is a big vague. More specifically, if the
>> sum of B and C's hard limit is smaller or equal A's hard limit, the only
>> way of either B or C to trigger A's hard limit is for them, themselves,
>> to go over their hard limit.
>
> Which is an expectation that you cannot guarantee. You can have B+C>A.
>
You can, but you might not. While you are focusing on one set of
setups, you are, as a result, ending up with behavior that is not
ideal for the other set of setups.
I believe what I am proposing here will cover both of them.
>> *This* is the case you you are breaking when you try to establish a
>> comparison between soft and hard limits - which is, per se, sane.
>>
>> Translating this to the soft limit speech, if the sum of B and C's soft
>> limit is smaller or equal A's soft limit, and one of them is over the
>> soft limit, that one should be reclaimed. The other should be left alone.
>
> And yet again. Nothing will prevent you from setting B+C>A. Sure if you
> configure your hierarchy sanely then everything will just work.
>
Same as above.
>> I understand perfectly fine that soft limit is a best effort, not a
>> guarantee. But if we don't do that, I understand that we are doing
>> effort, not best effort.
>>
>> This would only be attempted in our first pass. In the second pass, we
>> reclaim from whoever.
>>
>> It is also not that hard to do it: Flatten the tree in a list, with the
>> leaves always being placed before the inner nodes.
>
> Glauber, I have already pointed out that bottom-up reclaim doesn't make
> much sense because it is a bigger chance that useful data is stored in
> the leaf nodes rather than inner nodes which usually contain mostly
> reparented pages.
>
Read my proposed algorithm again. I will provide two examples below,
one for each kind of setup. Tell me if, and why, you believe it won't
work:
Tree is always B and C, having A as parent.
Algorithm: Flatten the tree as B, C, A. Order between B and C doesn't
matter, but B and C always come before A. Walk the list as B, C, A.
Reclaim hierarchically from all of them.
Setup 1: A.soft = 2G, B.soft = C.soft = 1G. B uses 1G, C uses 2G, and
A uses 3G.
Scan B: not over soft limit, skip
Scan C: over soft limit, reclaim. C now goes back to 1 G. All is fine
Scan A: A is now within limits, skip.
If A had reparented charges, the whole subtree would still suffer reclaim.
Setup 2: A.soft = 2G, B.soft = C.soft = 4G. B uses 2G, C uses 2G, and
A uses 4G.
Scan B: not over soft limit, skip
Scan C: not over soft limit, skip
Scan A: over soft limit, reclaim. Since A has no charges of its own,
reclaim from B and C in whichever order, regardless of their soft
limit setup. If A had its own charges, we would proceed the same way.
Setup 1 doesn't work with your proposal; Setup 2 does.
I am offering here something that I believe works with both.
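The two setups can be checked with a toy simulation. This is a
hypothetical illustration only, not memcg code: usage is hierarchical
(a parent's usage includes its children's charges), and "reclaim" here
simply trims a subtree down to its soft limit, own charges first, then
children in list order:

```python
# Toy simulation of Setup 1 and Setup 2 above; hypothetical
# illustration, not actual memcg code.

class Cgroup:
    def __init__(self, name, soft, own, parent=None):
        self.name, self.soft, self.own = name, soft, own
        self.children = []
        if parent:
            parent.children.append(self)

    def usage(self):  # hierarchical: own charges plus subtree
        return self.own + sum(c.usage() for c in self.children)

def trim(cg, amount):
    """Reclaim up to `amount` from cg's subtree: own charges first,
    then children in whichever order (their soft limits no longer
    shield them once this node is over its own limit)."""
    freed = min(cg.own, amount)
    cg.own -= freed
    for child in cg.children:
        if freed >= amount:
            break
        freed += trim(child, amount - freed)
    return freed

def first_pass(groups):
    """Scan leaves before inner nodes; only groups over their own
    soft limit are reclaimed in this pass."""
    touched = []
    for cg in groups:
        excess = cg.usage() - cg.soft
        if excess > 0:
            trim(cg, excess)
            touched.append(cg.name)
    return touched

# Setup 1: A.soft=2G, B.soft=C.soft=1G; B uses 1G, C uses 2G, A 3G.
A = Cgroup("A", soft=2, own=0)
B = Cgroup("B", soft=1, own=1, parent=A)
C = Cgroup("C", soft=1, own=2, parent=A)
print(first_pass([B, C, A]))  # ['C']: only C is reclaimed
print(A.usage())              # 2: A ends up within its limit

# Setup 2: A.soft=2G, B.soft=C.soft=4G; B and C use 2G each, A 4G.
A2 = Cgroup("A", soft=2, own=0)
B2 = Cgroup("B", soft=4, own=2, parent=A2)
C2 = Cgroup("C", soft=4, own=2, parent=A2)
print(first_pass([B2, C2, A2]))  # ['A']: reclaim reaches into B/C
print(A2.usage())                # 2
```

In Setup 1 the well-behaved sibling B is never touched; in Setup 2 the
over-limit inner node A reclaims from its subtree as in your proposal.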
BTW, this is what I described in the paragraph below:
>> Start reclaiming from nodes over the soft limit, hierarchically. This
>> means that whenever we reach an inner node and it is *still* over
>> the soft limit, we are guaranteed to have scanned their children
>> already. In the case I described, the children over its soft limit
>> would have been reclaimed, without the well behaving children being
>> touched. Now all three are okay.
>>
>> If we reached an inner node and we still have a soft limit problem, then
>> we are effectively talking about the case you have been describing.
>> Reclaim from whoever you want.
For the record: I am totally fine if you say "I don't want to pay the
complexity now; what I am sending is already better than what we
have". I stuck to this during the summit, and will say it again here.
But what you are saying is that it wouldn't work, that soft limits
should never attempt to reach that state, pretty much building a wall
around that case.
* Re: memcg: softlimit on internal nodes
2013-04-23 11:32 ` Michal Hocko
@ 2013-04-23 12:45 ` Michel Lespinasse
2013-04-23 12:59 ` Michal Hocko
0 siblings, 1 reply; 46+ messages in thread
From: Michel Lespinasse @ 2013-04-23 12:45 UTC (permalink / raw)
To: Michal Hocko
Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
On Tue, Apr 23, 2013 at 4:32 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 23-04-13 02:58:19, Michel Lespinasse wrote:
>> On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > On Mon 22-04-13 08:46:20, Tejun Heo wrote:
>> >> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
>> >> please talk with google memcg people. They have very different
>> >> interpretation of what "softlimit" is and are using it according to
>> >> that interpretation. If it *is* an actual soft limit, there is no
>> >> inherent isolation coming from it and that should be clear to
>> >> everyone.
>> >
>> > We have discussed that for a long time. I will not speak for Greg & Ying
>> > but from my POV we have agreed that the current implementation will work
>> > for them with some (minor) changes in their layout.
>> > As I have said already with a careful configuration (e.i. setting the
>> > soft limit only where it matters - where it protects an important
>> > memory which is usually in the leaf nodes)
>>
>> I don't like your argument that soft limits work if you only set them
>> on leaves.
>
> I didn't say that. Please read it again. "where it protects an important
> memory which is _usaully_ in the leaf nodes". Intermediate nodes can of
> course contain some important memory as well and you can well "protect"
> them by the soft limit you just have to be very careful because what you
> have in the result is quite complicated structure. You have a node that
> has some portion of its own memory mixed with reparented pages. You
> cannot distinguish those two of course so protection is somehow harder
> to achieve. That is the reason why I encourage not using any limit on
> the intermediate node which means reclaim the node with my patchset.
>
>> To me this is just a fancy way of saying that hierarchical soft limits
>> don't work.
>
> It works same as the hard limit it just triggers later.
>
>> Also it is somewhat problematic to assume that important memory can
>> easily be placed in leaves. This is difficult to ensure when
>> subcontainer destruction, for example, moves the memory back into the
>> parent.
>
> Is the memory still important then? The workload which uses the memory
> is done. So this ends up being just a cached data.
Well, even supposing the parent only holds non-important cached data
and the leaves have important data... your proposal implies that soft
limits on the leaves won't protect their data from reclaim, because
the cached data in the parent might cause the parent to go over its
own soft limit. If the leaves stay under their own soft limits, I
would prefer that the parent's cached data gets reclaimed first.
>> > you can actually achieve
>> > _high_ probability for not being reclaimed after the rework which was not
>> > possible before because of the implementation which was ugly and
>> > smelled.
>>
>> So, to be clear, what we (google MM people) want from soft limits is
>> some form of protection against being reclaimed from when your cgroup
>> (or its parent) is below the soft limit.
>>
>> I don't like to call it a guarantee either, because we understand that
>> it comes with some limitations - for example, if all user pages on a
>> given node are yours then allocations from that node might cause some
>> of your pages to be reclaimed, even when you're under your soft limit.
>> But we want some form of (weak) guarantee that can be made to work
>> good enough in practice.
>>
>> Before your change, soft limits didn't actually provide any such form
>> of guarantee, weak or not, since global reclaim would ignore soft
>> limits.
>>
>> With your proposal, soft limits at least do provide the weak guarantee
>> that we want, when not using hierarchies. We see this as a very clear
>> improvement over the previous situation, so we're very happy about
>> your patchset !
>>
>> However, your proposal takes that weak guarantee away as soon as one
>> tries to use cgroup hierarchies with it, because it reclaims from
>> every child cgroup as soon as the parent hits its soft limit. This is
>> disappointing and also, I have not heard of why you want things to
>> work that way ?
>
> Sigh. Because if children didn't follow parent's limit then they could
> easily escape from the reclaim pushing back to an unrelated hierarchies
> in the tree as the parent wouldn't be able to reclaim down to its limit.
To clarify: do you see us having this problem without administrative
delegation of the child cgroup configuration?
>> Is this an ease of implementation issue or do you consider that
>> requirement as a bad idea ? And if it's the later, what's your
>> counterpoint, is it related to delegation or is it something else that
>> I haven't heard of ?
>
> The implementation can be improved and child groups might be reclaimed
> _only_ if parent cannot satisfy its soft limit this is not a target of
> the current re-implementation though. The limit has to be preserved
> though.
I'm actually OK with doing things that way; it's only the talk about
disallowing these further steps that makes me very worried...
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: memcg: softlimit on internal nodes
2013-04-23 9:58 ` Michel Lespinasse
2013-04-23 10:17 ` Glauber Costa
2013-04-23 11:32 ` Michal Hocko
@ 2013-04-23 12:51 ` Michal Hocko
2 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 12:51 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
On Tue 23-04-13 02:58:19, Michel Lespinasse wrote:
[...]
> However, your proposal takes that weak guarantee away as soon as one
> tries to use cgroup hierarchies with it, because it reclaims from
> every child cgroup as soon as the parent hits its soft limit.
Reading this again, I am getting really confused. The primary
objection used to be that an under-soft-limit inner-node subtree
shouldn't be reclaimed even though there are children over their soft
limits. Have we now moved to: an over-limit inner node shouldn't
hammer its subtree?
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-23 11:40 ` Michal Hocko
2013-04-23 11:54 ` Glauber Costa
@ 2013-04-23 12:51 ` Michel Lespinasse
2013-04-23 13:06 ` Michal Hocko
1 sibling, 1 reply; 46+ messages in thread
From: Michel Lespinasse @ 2013-04-23 12:51 UTC (permalink / raw)
To: Michal Hocko
Cc: Glauber Costa, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Greg Thelen
On Tue, Apr 23, 2013 at 4:40 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 23-04-13 14:17:22, Glauber Costa wrote:
>> On 04/23/2013 01:58 PM, Michel Lespinasse wrote:
>> > On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> >> On Mon 22-04-13 08:46:20, Tejun Heo wrote:
>> >>> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
>> >>> please talk with google memcg people. They have very different
>> >>> interpretation of what "softlimit" is and are using it according to
>> >>> that interpretation. If it *is* an actual soft limit, there is no
>> >>> inherent isolation coming from it and that should be clear to
>> >>> everyone.
>> >>
>> >> We have discussed that for a long time. I will not speak for Greg & Ying
>> >> but from my POV we have agreed that the current implementation will work
>> >> for them with some (minor) changes in their layout.
>> >> As I have said already with a careful configuration (e.i. setting the
>> >> soft limit only where it matters - where it protects an important
>> >> memory which is usually in the leaf nodes)
>> >
>> > I don't like your argument that soft limits work if you only set them
>> > on leaves. To me this is just a fancy way of saying that hierarchical
>> > soft limits don't work.
>> >
>> > Also it is somewhat problematic to assume that important memory can
>> > easily be placed in leaves. This is difficult to ensure when
>> > subcontainer destruction, for example, moves the memory back into the
>> > parent.
>> >
>>
>> Michal,
>>
>> For the most part, I am siding with you in this discussion.
>> But with this only-in-leaves thing, I am forced to flip (at least for this).
>>
>> You are right when you say that in a configuration with A being parent
>> of B and C, A being over its hard limit will affect reclaim in B and C,
>> and soft limits should work the same.
>>
>> However, "will affect reclaim" is a big vague. More specifically, if the
>> sum of B and C's hard limit is smaller or equal A's hard limit, the only
>> way of either B or C to trigger A's hard limit is for them, themselves,
>> to go over their hard limit.
>
> Which is an expectation that you cannot guarantee. You can have B+C>A.
>
>> *This* is the case you you are breaking when you try to establish a
>> comparison between soft and hard limits - which is, per se, sane.
>>
>> Translating this to the soft limit speech, if the sum of B and C's soft
>> limit is smaller or equal A's soft limit, and one of them is over the
>> soft limit, that one should be reclaimed. The other should be left alone.
>
> And yet again. Nothing will prevent you from setting B+C>A. Sure if you
> configure your hierarchy sanely then everything will just work.
Let's all stop using words such as "sanely" and "work", since we don't
seem to agree on how they apply here :)
The issue I see is that even when people configure soft limits with
B+C < A, your current proposal still doesn't "leave the other alone",
as Glauber and I think it should.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: memcg: softlimit on internal nodes
2013-04-23 12:45 ` Michel Lespinasse
@ 2013-04-23 12:59 ` Michal Hocko
0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 12:59 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Tejun Heo, Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki,
cgroups, linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Greg Thelen
On Tue 23-04-13 05:45:05, Michel Lespinasse wrote:
> On Tue, Apr 23, 2013 at 4:32 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Tue 23-04-13 02:58:19, Michel Lespinasse wrote:
> >> On Mon, Apr 22, 2013 at 8:54 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >> > On Mon 22-04-13 08:46:20, Tejun Heo wrote:
> >> >> Oh, if so, I'm happy. Sorry about being brash on the thread; however,
> >> >> please talk with google memcg people. They have very different
> >> >> interpretation of what "softlimit" is and are using it according to
> >> >> that interpretation. If it *is* an actual soft limit, there is no
> >> >> inherent isolation coming from it and that should be clear to
> >> >> everyone.
> >> >
> >> > We have discussed that for a long time. I will not speak for Greg & Ying
> >> > but from my POV we have agreed that the current implementation will work
> >> > for them with some (minor) changes in their layout.
> >> > As I have said already with a careful configuration (i.e. setting the
> >> > soft limit only where it matters - where it protects an important
> >> > memory which is usually in the leaf nodes)
> >>
> >> I don't like your argument that soft limits work if you only set them
> >> on leaves.
> >
> > I didn't say that. Please read it again. "where it protects an important
> > memory which is _usually_ in the leaf nodes". Intermediate nodes can of
> > course contain some important memory as well and you can well "protect"
> them by the soft limit, you just have to be very careful because what
> you end up with is quite a complicated structure. You have a node that
> has some portion of its own memory mixed with reparented pages. You
> cannot distinguish those two of course, so protection is somewhat harder
> to achieve. That is the reason why I encourage not using any limit on
> the intermediate node, which with my patchset means the node is reclaimed.
> >
> >> To me this is just a fancy way of saying that hierarchical soft limits
> >> don't work.
> >
> > It works the same as the hard limit, it just triggers later.
> >
> >> Also it is somewhat problematic to assume that important memory can
> >> easily be placed in leaves. This is difficult to ensure when
> >> subcontainer destruction, for example, moves the memory back into the
> >> parent.
> >
> Is the memory still important then? The workload which used the memory
> is done. So this ends up being just cached data.
>
> Well, even supposing the parent only holds non-important cached data
> and the leaves have important data... your proposal implies that soft
> limits on the leaves won't protect their data from reclaim, because
> the cached data in the parent might cause the parent to go over its
> own soft limit.
The parent would be visited first so it can reclaim from its own pages.
Only then do we traverse the tree down to the children.
Just out of curiosity, what is the point of setting the soft limit on
that node in the first place? You want to use the soft limit for
isolation, but is there anything you want to isolate in that node?
Moreover, does it really make sense to set the soft limit to less than
Sum(children(soft_limit))?
> If the leaves stay under their own soft limits, I would prefer that
> the parent's cached data gets reclaimed first.
>
> >> > you can actually achieve
> >> > _high_ probability for not being reclaimed after the rework which was not
> >> > possible before because of the implementation which was ugly and
> >> > smelled.
> >>
> >> So, to be clear, what we (google MM people) want from soft limits is
> >> some form of protection against being reclaimed from when your cgroup
> >> (or its parent) is below the soft limit.
> >>
> >> I don't like to call it a guarantee either, because we understand that
> >> it comes with some limitations - for example, if all user pages on a
> >> given node are yours then allocations from that node might cause some
> >> of your pages to be reclaimed, even when you're under your soft limit.
> >> But we want some form of (weak) guarantee that can be made to work
> >> good enough in practice.
> >>
> >> Before your change, soft limits didn't actually provide any such form
> >> of guarantee, weak or not, since global reclaim would ignore soft
> >> limits.
> >>
> >> With your proposal, soft limits at least do provide the weak guarantee
> >> that we want, when not using hierarchies. We see this as a very clear
> >> improvement over the previous situation, so we're very happy about
> >> your patchset !
> >>
> >> However, your proposal takes that weak guarantee away as soon as one
> >> tries to use cgroup hierarchies with it, because it reclaims from
> >> every child cgroup as soon as the parent hits its soft limit. This is
> >> disappointing and also, I have not heard why you want things to
> >> work that way?
> >
> > Sigh. Because if children didn't follow the parent's limit then they
> > could easily escape from the reclaim, pushing it back to unrelated
> > hierarchies in the tree, as the parent wouldn't be able to reclaim
> > down to its limit.
>
> To clarify: do you see us having this problem without administrative
> delegation of the child cgroup configuration?
In the perfect world where the limits are set up reasonably there is no
such issue. Parents would usually have limit higher than sum of their
children limits so children wouldn't need to reclaim just because their
parent is over the limit.
> >> Is this an ease of implementation issue or do you consider that
> >> requirement a bad idea? And if it's the latter, what's your
> >> counterpoint - is it related to delegation or is it something else that
> >> I haven't heard of?
> >
> > The implementation can be improved and child groups might be reclaimed
> > _only_ if parent cannot satisfy its soft limit this is not a target of
> > the current re-implementation though. The limit has to be preserved
> > though.
>
> I'm actually OK with doing things that way; it's only talk about
> disallowing these further steps that makes me very worried...
What prevents us from enhancing reclaim further?
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-23 12:51 ` Michel Lespinasse
@ 2013-04-23 13:06 ` Michal Hocko
2013-04-23 13:13 ` Glauber Costa
0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 13:06 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Glauber Costa, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Greg Thelen
On Tue 23-04-13 05:51:36, Michel Lespinasse wrote:
[...]
> The issue I see is that even when people configure soft limits B+C <
> A, your current proposal still doesn't "leave the other alone" as
> Glauber and I think we should.
If B+C < A then B resp. C get reclaimed only if A is over the limit,
which means that it couldn't reclaim enough to get below the limit
when we bang on it before B and C. We can update the implementation
later to be more clever in situations like this, but this is not that
easy, because once we get away from the round robin over the tree we
might end up having other issues - like unfairness etc... That's why I
wanted to have this as simple as possible.
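The parent-gated behavior Michal describes here can be sketched as a toy model: a group is visited by soft reclaim if it, or any ancestor, is over its soft limit, and the parent is visited before its children. This is only an illustration of the rule as stated in the thread, not the actual memcg implementation; the `struct memcg` fields and `soft_reclaim_eligible()` are names invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a memcg: usage, soft limit, and a parent pointer. */
struct memcg {
	unsigned long usage;       /* current charge, in pages */
	unsigned long soft_limit;  /* soft limit, in pages */
	struct memcg *parent;      /* NULL for the root */
};

static bool over_soft_limit(const struct memcg *mg)
{
	return mg->usage > mg->soft_limit;
}

/*
 * Behavior as described in the thread: a group is eligible for soft
 * reclaim if it is over its own soft limit, or if any ancestor is over
 * its limit (children follow the parent's limit).  The parent is
 * walked before its children, so it gets to reclaim its own pages
 * first.
 */
static bool soft_reclaim_eligible(const struct memcg *mg)
{
	for (; mg; mg = mg->parent)
		if (over_soft_limit(mg))
			return true;
	return false;
}
```

With A over its limit, both A and an under-limit child B are eligible; once A drops below its limit, B is left alone - which is exactly the B+C < A case discussed above.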
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-23 13:06 ` Michal Hocko
@ 2013-04-23 13:13 ` Glauber Costa
2013-04-23 13:28 ` Michal Hocko
0 siblings, 1 reply; 46+ messages in thread
From: Glauber Costa @ 2013-04-23 13:13 UTC (permalink / raw)
To: Michal Hocko
Cc: Michel Lespinasse, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Greg Thelen
On 04/23/2013 05:06 PM, Michal Hocko wrote:
> On Tue 23-04-13 05:51:36, Michel Lespinasse wrote:
> [...]
>> The issue I see is that even when people configure soft limits B+C <
>> A, your current proposal still doesn't "leave the other alone" as
>> Glauber and I think we should.
>
> If B+C < A then B resp. C get reclaimed only if A is over the limit,
> which means that it couldn't reclaim enough to get below the limit
> when we bang on it before B and C. We can update the implementation
> later to be more clever in situations like this, but this is not that
> easy, because once we get away from the round robin over the tree we
> might end up having other issues - like unfairness etc... That's why I
> wanted to have this as simple as possible.
>
Nobody is opposing this, Michal.
What people are opposing is you saying that the children should be
reclaimed *regardless* of their soft limit when the parent is over its
soft limit. Someone, especially you, saying this highly threatens
further development in this direction.
It doesn't really matter that your current set is doing this; everybody
has already agreed that you are moving in a good direction.
If you believe that it is desired to protect the children from reclaim
in situations in which the offender is only one of the children and
can be easily identified, please state that clearly.
Since nobody is really opposing your patchset, that is enough for the
discussion to settle. (Can't say how others feel, but can say about
myself, and guess about others)
* Re: memcg: softlimit on internal nodes
2013-04-23 13:13 ` Glauber Costa
@ 2013-04-23 13:28 ` Michal Hocko
0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-23 13:28 UTC (permalink / raw)
To: Glauber Costa
Cc: Michel Lespinasse, Tejun Heo, Johannes Weiner, Balbir Singh,
KAMEZAWA Hiroyuki, cgroups, linux-mm, Hugh Dickins, Ying Han,
Greg Thelen
On Tue 23-04-13 17:13:20, Glauber Costa wrote:
> On 04/23/2013 05:06 PM, Michal Hocko wrote:
> > On Tue 23-04-13 05:51:36, Michel Lespinasse wrote:
> > [...]
> >> The issue I see is that even when people configure soft limits B+C <
> >> A, your current proposal still doesn't "leave the other alone" as
> >> Glauber and I think we should.
> >
> > If B+C < A then B resp. C get reclaimed only if A is over the limit,
> > which means that it couldn't reclaim enough to get below the limit
> > when we bang on it before B and C. We can update the implementation
> > later to be more clever in situations like this, but this is not that
> > easy, because once we get away from the round robin over the tree we
> > might end up having other issues - like unfairness etc... That's why I
> > wanted to have this as simple as possible.
> >
> Nobody is opposing this, Michal.
>
> What people are opposing is you saying that the children should be
> reclaimed *regardless* of their soft limit when the parent is over its
> soft limit. Someone, especially you, saying this highly threatens
> further development in this direction.
OK, I am feeling like repeating myself. Anyway once more. I am _all_ for
protecting children that are under their limit if that is _possible_[1].
We are not yet there though for generic configuration. That's why I was
so careful about the wording and careful configuration at this stage.
Is this sufficient for your concerns?
I do not see any giant obstacles in the current implementation to allow
this behavior.
> It doesn't really matter that your current set is doing this; everybody
> has already agreed that you are moving in a good direction.
>
> If you believe that it is desired to protect the children from reclaim
> in situations in which the offender is only one of the children and
> can be easily identified, please state that clearly.
Clearly yes.
---
[1] and to be even more clear there are cases where this will never be
possible. For an example:
A (soft:0)
|
B (soft:MAX)
where B, being a smart ass, thinks that his group never gets reclaimed
although he is the only source of the pressure. This is what I call an
untrusted environment.
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-23 9:29 ` Michal Hocko
@ 2013-04-23 17:09 ` Tejun Heo
2013-04-26 11:51 ` Michal Hocko
0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2013-04-23 17:09 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Hello, Michal.
On Tue, Apr 23, 2013 at 11:29:56AM +0200, Michal Hocko wrote:
> Ohh, well and we are back in the circle again. Nobody is proposing
> overloading soft reclaim for any bottom-up (if that is what you mean by
> your opposite direction) pressure handling.
>
> > You're making it a point control rather than range one.
>
> Be more specific here, please?
>
> > Maybe you can define some twisted rules serving certain specific use
> > case, but it's gonna be confusing / broken for different use cases.
>
> Tejun, your argumentation is really hand wavy here. Which use cases will
> be broken and which one will be confusing. Name one for an illustration.
>
> > You're so confused that you don't even know you're confused.
>
> Yes, you keep repeating that. But you haven't pointed out any single
> confusing use case so far. Please please stop this, it is not productive.
> We are still talking about using soft limit to control overcommit
> situation as gracefully as possible. I hope we are on the same page
> about that at least.
Hmmm... I think I was at least somewhat clear on my points. I'll try
again. Let's see if I can at least make you understand what my point
is. Maybe some diagrams will help.
Let's consider hardlimit first as there seems to be consensus on what
it means. By default, hardlimit is set at max and exerts pressure
downwards.
<--------------------------------------------------------|
0                                                        max
When you configure a hard limit, the diagram becomes.
<-----------------------------------------|
0                                      limit             max
The configuration now became more specific, right? Now let's say
there's one parent and one child. The parent looks like the above and
the child like the below.
<---------------------|
0                  limit'                                max
When you combine the two, you get
<---------------------|
0                  limit'                                max
In fact, it doesn't matter whether the parent or the child is more
limited. When composing multiple limits, the only logical thing to do
is to calculate the intersection - ie. take the most specific of the
limits, which naturally doesn't violate either configuration. In a
hierarchy setup, children need to be summed and all, so it becomes
different, but that's the principle. I hope you're with me up to this
point.
Now, let's think about the other direction. I don't care whether it's
strict guarantee, soft protection or just a gentle preferential
treatment. The focus is the direction of specificity. Please forget
about "softlimit" for now. Just think at the interface level. You
don't want to give protection by default, right? The specificity
increases along with the amount of memory to "protect". So, the
default looks like.
|-------------------------------------------------------->
0                                                        max
When you configure certain amount, it becomes
|------------------------------------------->
0                                         prot           max
The direction of specificity is self-evident from what the default
should be. Now, when you combine it with another such protection, say
prot'.
|--------------------------->
0                        prot'                           max
Regardless of what the nesting order is, what you should get is.
|--------------------------->
0                        prot'                           max
It's exactly the same as the limit. When you combine multiple of them,
the most specific one wins. This is the basis of composing multiple
ranges and it is the same principle that cgroup hierarchy limit
configuration follows. When you compose configurations across a
hierarchy, you get the intersection.
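Tejun's composition rule - the effective value is the intersection of the ranges along the path to the root, i.e. the most specific value wins - can be sketched as a toy model. This is an illustration of the principle only, not real cgroup code; `struct cgrp`, `effective_limit()` and `effective_protection()` are names invented for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: values range over [0, MAX]; MAX/0 mean "unset". */
#define VAL_MAX (~0UL)

struct cgrp {
	unsigned long hard_limit;  /* default VAL_MAX: no limit      */
	unsigned long protection;  /* default 0:       no protection */
	struct cgrp *parent;       /* NULL for the root              */
};

/*
 * Composing a limit along the path to the root intersects the ranges
 * [0, limit]: the smallest (most specific) limit on the path wins.
 */
static unsigned long effective_limit(const struct cgrp *cg)
{
	unsigned long lim = VAL_MAX;

	for (; cg; cg = cg->parent)
		if (cg->hard_limit < lim)
			lim = cg->hard_limit;
	return lim;
}

/*
 * The same applies to protection: intersecting [0, prot] ranges means
 * an unprotected (prot == 0) ancestor leaves the subtree unprotected.
 */
static unsigned long effective_protection(const struct cgrp *cg)
{
	unsigned long prot = VAL_MAX;

	for (; cg; cg = cg->parent)
		if (cg->protection < prot)
			prot = cg->protection;
	return prot;
}
```

Both knobs compose the same way - by intersection - which is exactly why folding the two opposite directions of specificity into one knob leaves the intersection undefined.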
Now, when you put both into a single configuration knob, a given
config would look like the following.
    specificity                     specificity
     of limit                      of protection
<----------------|--------------------------------------->
0             config                                     max
Now, if you try to combine it with another one - config'
    specificity                     specificity
     of limit                      of protection
<-------------------------------|------------------------>
0                            config'                     max
The intersection is no longer clearly defined. If you choose config,
you violate the protection specificity of config', if you choose
config', you violate the limit specificity of config. This is what I
meant by you're making it a point configuration rather than a range
one.
A ranged config allows for well-defined composition through
intersection. People tend to do this intuitively which makes it
easier and more useful.
I don't really care all that much about memcg internals but I do care
about maintaining general sanity and consistency of cgroup control
knobs especially in hierarchical settings which we traditionally have
been horrible at, and I hope you at least can see the problem I'm
seeing as it's evident as fire from where I stand. It's breaking the
very basic principle which makes hierarchy sensible and useful.
The fact that you think "switching the default value to the other end"
is just a detail is very bothering because the default value is not
determined according to one's whim. It's determined by the direction
of specificity and in turn clearly marks and determines further
operations including how they are composed.
This really illuminates the intricate and fragile tweaks you're
trying to perform in an attempt to make the above point control suit
the use cases that you immediately face - you're choosing the
direction of specificity that the knob is gonna follow on an
instance-by-instance basis - it's one direction for the default and for
leaves if the parent is not over its limit; however, if it's over the
limit, you flip the direction, so that it somehow works for the use
cases that you have right now. Sure, there are cases where such greedy engineering
approach is useful or at least cases where we just have to make do
with that, but this is nothing like that. It is a basic interface
design which isn't complicated or difficult in itself.
> Yes, I am thinking in context of several use cases, all right. One
> of them is memory isolation via soft limit prioritization. Something
> that is possible already but it is major PITA to do right. What we
> have currently is optimized for "let's hammer something". Although
> useful, not a primary usecase according to my experiences. The primary
> motivation for the soft limit was to have something to control
> overcommit situations gracefully AFAIR, and "let's hammer something and
> hope it will work" doesn't sound graceful to me.
As I've said multiple times now, I'm not saying any of the presented
use cases are invalid. They all look valid to me and I think it's
logical to support them; however, combining the two directions of
specificities into one knob can't be the solution. Right now, both
google and parallels want isolation, so that's the direction they're
pushing - the arrows which are headed to the right of the screen.
The problem becomes self-evident when you consider use cases which
will want the arrows heading to the left of the screen, where
over-provision of softlimit would be a natural thing to do just as
hardlimit is, and such use cases won't call for and most likely will
be hurt by reducing reclaim pressure when under limit.
Say, a server or mobile configuration where a couple background jobs -
say, indexing and back up - are running, both of which may create
sizable amount of dirty data. They need to be done but aren't of high
priority. Given the size of the machine and the type of the batch
tasks, you wanna give X amount of memory to the batch tasks but want
to make sure neither takes too much of it, so configure each to have Y
and Z, where Y < X, Z < X but Y + Z > X. This is a reasonable
configuration and when the system, as a whole, gets put under memory
pressure - say the user launches a memory hog game - you first want
the batch tasks to give away memory as fast as possible until the
composition of limits is met and then you want them to feel the same
pressure as everyone else.
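The batch-task scenario above reduces to a small predicate: reclaim pushes the subtree back until the composition of the limits is met, and only then does everyone feel the same pressure. The function below merely restates that composed condition; `composition_met()` is a hypothetical name for the example:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Tejun's example: parent capped at X, children at Y and Z, with
 * Y < X, Z < X but Y + Z > X (deliberate overcommit).  The composed
 * configuration is satisfied only when each child is within its own
 * cap AND the subtree as a whole is within the parent's cap.
 */
static bool composition_met(unsigned long use_y, unsigned long use_z,
			    unsigned long y, unsigned long z,
			    unsigned long x)
{
	return use_y <= y && use_z <= z && use_y + use_z <= x;
}
```

Note that with Y + Z > X, both children can be individually under their caps while the composition is still violated - which is the state reclaim first has to resolve.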
You can't combine "soft limit prioritization" and "isolation" into the
same knob. Not because of implementation details but because they
have the opposite directions of specificity. They're two
fundamentally incompatible knobs.
> > including the ones without any softlimit configured.
>
> I haven't seen any specific argument why the default limit shouldn't
> always allow reclaim.
> Having soft unreclaimable groups by default makes it hard to use soft
> limit reclaim for something more interesting. See the last patch
> in the series ("memcg: Ignore soft limit until it is explicitly
> specified"). With this approach you end up setting soft limit for every
> single group (even those you do not care about) just to make balancing
> work reasonably for all hierarchies.
I think, well at least hope, that it's clear by now, but the above is
exactly the kind of twisting and tweaking that I was talking about
above. You're flipping things at different places trying to somehow
meet the conflicting requirements which currently is put forth by
mostly people using it as an isolation mechanism.
> Anyway, this is just one part of the series and it doesn't make sense to
> postpone the whole work just for this. If _more people_ really think that
> the default limit change is really _so_ confusing and unusable then I
> will not push it over dead bodies of course.
So, here's my problem with the patchset. As sucky as the current
situation is, "softlimit" currently doesn't explicitly implement or
suggest isolation. People wanting isolation would of course want to
push it to do isolation. They just want to get the functionality and
the interface doesn't matter all that much, which is fine and completely
understandable, but by pushing it towards isolation, you're cementing
the duality of the knob. Frankly, I don't care which direction
"softlimit" chooses but you can't put both "limit" and "protection"
into the same knob. It's fundamentally broken especially in
hierarchies.
> Nothing prevents from this setting. I am just claiming that this is not
> the most interesting use case for the soft limit and I would like to
> optimize for more interesting use cases.
Michal, it really is not about optimizing for anything. It is the
basic semantics of the knob, which isn't part of what one may call
"implementation details". You can't "optimize" them.
Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-22 18:30 ` Tejun Heo
2013-04-23 9:29 ` Michal Hocko
2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko
@ 2013-04-24 21:45 ` Johannes Weiner
2013-04-25 0:33 ` Tejun Heo
2 siblings, 1 reply; 46+ messages in thread
From: Johannes Weiner @ 2013-04-24 21:45 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Hocko, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm,
Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse,
Greg Thelen
On Mon, Apr 22, 2013 at 11:30:20AM -0700, Tejun Heo wrote:
> Hey,
>
> On Mon, Apr 22, 2013 at 06:20:12PM +0200, Michal Hocko wrote:
> > Although the default limit is correct it is impractical for use
> > because it doesn't allow for "I behave, do not reclaim me if you can"
> > cases. And we can implement such a behavior really easily with backward
> > compatibility and new interfaces (aka reuse the soft limit for that).
>
> Okay, now we're back to square one and I'm reinstating all the mean
> things I said in this thread. :P No wonder everyone is so confused
> about this. Michal, you can't overload two controls which exert
> pressure on the opposite direction onto a single knob and define a
> sane hierarchical behavior for it. You're making it a point control
> rather than range one. Maybe you can define some twisted rules
> serving certain specific use case, but it's gonna be confusing /
> broken for different use cases.
Historically soft limit meant prioritizing certain memcgs over others
and the memcgs over their soft limit should experience relatively more
reclaim pressure than the ones below their soft limit.
Now, if we go and say you are only reclaimed when you exceed your soft
limit we would retain the prioritization aspect. Groups in excess of
their soft limits would still experience relatively more reclaim
pressure than their well-behaved peers. But it would have the nice
side effect of acting more or less like a guarantee as well.
I don't think this approach is as unreasonable as you make it out to
be, but it does make things more complicated. It could be argued that
we should add a separate guarantee knob because two simple knobs might
be better than a complicated one.
The question is whether this solves Google's problem, though.
Currently, when a memcg is selected for a certain type of reclaim, it
and all its children are treated as one single leaf entity in the
overall hierarchy: when a parent node hits its hard limit, we assume
equal fault of every member in the hierarchy for that situation and,
consequently, we reclaim all of them equally. We do the same thing
for the soft limit: if the parent, whose memory consumption is defined
as the sum of memory consumed by all members of the hierarchy,
breaches the soft limit then all members are reclaimed equally because
no single member is more at fault than the others. I would expect if
we added a guarantee knob, this would also mean that no individual
memcg can be treated as being within their guaranteed memory if the
hierarchy as a whole is in excess of its guarantee.
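Johannes' expected rule for a hypothetical guarantee knob - no individual memcg counts as within its guaranteed memory if the hierarchy as a whole exceeds its guarantee - can be sketched like this. Illustrative only, not existing memcg code; the struct fields and `within_guarantee()` are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model; usage is hierarchical, i.e. the sum over the subtree. */
struct memcg {
	unsigned long usage;      /* hierarchical usage, in pages  */
	unsigned long guarantee;  /* hypothetical guarantee knob   */
	struct memcg *parent;     /* NULL for the root             */
};

/*
 * A memcg is treated as within its guaranteed memory only if it and
 * every ancestor are below their respective guarantees: if the
 * hierarchy as a whole exceeds its guarantee, no member is protected.
 */
static bool within_guarantee(const struct memcg *mg)
{
	for (; mg; mg = mg->parent)
		if (mg->usage > mg->guarantee)
			return false;
	return true;
}
```

In the 32G example below, B1 being under its 15G guarantee would not protect it while B as a whole exceeds its 16G guarantee.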
The root of the hierarchy represents the whole hierarchy. Its memory
usage is the combined memory usage of all members. The limit set to
the hierarchy root applies to the combined memory usage of the
hierarchy. Breaching that limit has consequences for the hierarchy as
a whole. Be it soft limit or guarantee.
This is how hierarchies have always worked and it allows limits to be
layered and apply depending on the source of pressure:
           root (physical memory = 32G)
          /    \
         A      B (hard limit = 25G, guarantee = 16G)
        / \    / \
      A1   A2 /   B2 (guarantee = 10G)
             /
            B1 (guarantee = 15G)
Remember that hard limits are usually overcommitted, so you allow B to
use more of the fair share of memory when A does not need it, but you
want to keep it capped to keep latency reasonable when A ramps up.
As long as B is hitting its own hard limit, you value B1's and B2's
guarantees in the context of pressure local to the hierarchy; in the
context of B having 25G worth of memory; in the context of B1
competing with B2 over the memory allowed by B.
However, as soon as global reclaim kicks in, the context changes and
the priorities shift. Now, B does not have 25G anymore but only 16G
*in its competition with A*. We absolutely do not want to respect the
guarantees made to B1 and B2. Not only can they not be met anyway,
but they are utterly meaningless at this point. They were set with
25G in mind.
[ It may be conceivable that you want different guarantees for B1 and
B2 depending on where the pressure comes from. One setting for when
the 25G limit applies, one setting when the 32G physical memory
limit applies. Basically, every group would need a vector of
guarantee settings with one setting per ancestor.
That being said, I absolutely disagree with the idea of trying to
adhere to individual memcg guarantees in the first reclaim cycle,
regardless of context and then just ignore them on the second pass.
It's a horrible way to guess which context the admin had in mind. ]
Now, there is of course the other scenario in which the current
hierarchical limit application can get in your way: when you give
intermediate nodes their own memory. Because then you may see the
need to apply certain limits to that hierarchy root's local memory
only instead of all memory in the hierarchy. But once we open that
door, you might expect this to be an option for every limit, where
even the hard limit of a hierarchy root only applies to that group's
local memory instead of the whole hierarchy. I certainly do not want
to apply hierarchy semantics for some limits and not for others. But
Google has basically asked for hierarchical hard limits and local soft
limits / guarantees.
In summary, we are now looking at both local and hierarchical limits
times number of ancestors PER MEMCG to support all those use cases
properly.
So I'm asking what I already asked a year ago: are you guys sure you
can not change your cgroup tree layout and that we have to solve it by
adding new limit semantics?!
* Re: memcg: softlimit on internal nodes
2013-04-24 21:45 ` memcg: softlimit on internal nodes Johannes Weiner
@ 2013-04-25 0:33 ` Tejun Heo
2013-04-29 18:39 ` Johannes Weiner
0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2013-04-25 0:33 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm,
Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse,
Greg Thelen
Hello, Johannes.
On Wed, Apr 24, 2013 at 05:45:31PM -0400, Johannes Weiner wrote:
> Historically soft limit meant prioritizing certain memcgs over others
> and the memcgs over their soft limit should experience relatively more
> reclaim pressure than the ones below their soft limit.
>
> Now, if we go and say you are only reclaimed when you exceed your soft
> limit we would retain the prioritization aspect. Groups in excess of
> their soft limits would still experience relatively more reclaim
> pressure than their well-behaved peers. But it would have the nice
> side effect of acting more or less like a guarantee as well.
But, at the same time, it has the not-so-nice side-effect of losing
the ability to express negative prioritization. It isn't difficult to
imagine use cases where the system doesn't want to partition the whole
system into discrete cgroups but wants to limit the amount of
resources consumed by low-priority workloads.
Also, in the long-term, I really want cgroup to become something
generally useful and automatically configurable (optional of course)
by the base system according to the types of workloads. For something
like that to be possible, the control knobs shouldn't be fiddly,
complex, or require full partitioning of the system.
> I don't think this approach is as unreasonable as you make it out to
> be, but it does make things more complicated. It could be argued that
> we should add a separate guarantee knob because two simple knobs might
> be better than a complicated one.
The problem that I see is that this is being done without clearing up
the definition of the knob. The knob's role is being changed or at
least solidified into something which makes it inconsistent with
everything else in cgroup in a way which seems very reactive to me.
I can see such reactive customizations being useful in satisfying
certain specific use cases - google's primarily right now; however,
it's likely to come back and bite us when we want to do something
different or generic with cgroup. It's gonna be something which ends
up being labeled as unusable in other types of setups (e.g. where not
all workloads are put under active control or whatever) after causing
a lot of head-scratching and not-particularly-happy moments. Cgroup
as a whole strongly needs consistency across its control knobs for it
to be generally useful.
Well, that and past frustrations over interface and implementations of
memcg, which seems to bear a lot of similarities with what's going on
now, probably have made me go overboard. Sorry about that, but I
really hope memcg does better.
...
> no single member is more at fault than the others. I would expect if
> we added a guarantee knob, this would also mean that no individual
> memcg can be treated as being within their guaranteed memory if the
> hierarchy as a whole is in excess of its guarantee.
I disagree here. It should be symmetrical to how hardlimit works.
Let's say there's one parent - P - and child - C. For hardlimit, if P
is over limit, it exerts pressure on its subtree regardless of C, and,
if P is under limit, it doesn't affect C.
For guarantee / protection, it should work the same but in the
opposite direction. If P is under limit, it should protect the
subtree from reclaim regardless of C. If P is over limit, it
shouldn't affect C.
As I draw in the other reply to Michal, each knob should be a starting
point of a single range in the pre-defined direction and composition
of those configurations across hierarchy should result in intersection
of them. I can't see any reason to deviate from that here.
IOW, protection control shouldn't care about generating memory
pressure. That's the job of soft and hard limits, both of which
should apparently override protection. That way, each control knob
becomes fully consistent within itself across the hierarchy and the
questions become those of how soft limit should override protection
rather than the semantics of soft limit itself.
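The composition rule described above can be expressed as pure logic. The
following is only an illustration of the proposed semantics, not kernel
code; the class, function names, and numbers are all made up:

```python
# Sketch of composition by intersection: an over-hard-limit ancestor
# pressures its whole subtree (limits override protection), an
# under-guarantee ancestor protects its whole subtree, and otherwise
# the cgroup is subject to whatever global pressure exists.
class Cgroup:
    def __init__(self, name, usage=0, hard=float("inf"), guarantee=0,
                 parent=None):
        self.name, self.usage = name, usage
        self.hard, self.guarantee = hard, guarantee
        self.parent, self.children = parent, []
        if parent:
            parent.children.append(self)

    def subtree_usage(self):
        return self.usage + sum(c.subtree_usage() for c in self.children)


def ancestors(cg):
    while cg:
        yield cg
        cg = cg.parent


def reclaim_eligible(cg, global_pressure=False):
    # Hard limit: an over-limit ancestor pressures its whole subtree.
    if any(a.subtree_usage() > a.hard for a in ancestors(cg)):
        return True
    # Guarantee: an under-guarantee ancestor protects its whole subtree.
    if any(a.subtree_usage() <= a.guarantee for a in ancestors(cg)):
        return False
    return global_pressure


# Hypothetical numbers: P guarantees 15G to its subtree; C alone uses 10G.
P = Cgroup("P", guarantee=15)
C = Cgroup("C", usage=10, parent=P)
print(reclaim_eligible(C, global_pressure=True))   # protected by P
C.usage = 20
print(reclaim_eligible(C, global_pressure=True))   # over P's guarantee
```

Note how each knob stays a single range in one direction: hard limits only
ever add pressure, guarantees only ever remove it, and hierarchy composes
them by intersection.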
> The root of the hierarchy represents the whole hierarchy. Its memory
> usage is the combined memory usage of all members. The limit set to
> the hierarchy root applies to the combined memory usage of the
> hierarchy. Breaching that limit has consequences for the hierarchy as
> a whole. Be it soft limit or guarantee.
>
> This is how hierarchies have always worked and it allows limits to be
> layered and apply depending on the source of pressure:
That's definitely true for soft and hard limits but flipped for
guarantees and I think that's the primary source of confusion -
guarantee being overloaded onto softlimit.
> root (physical memory = 32G)
> / \
> A B (hard limit = 25G, guarantee = 16G)
> / \ / \
> A1 A2 / B2 (guarantee = 10G)
> /
> B1 (guarantee = 15G)
>
> Remember that hard limits are usually overcommitted, so you allow B to
> use more of the fair share of memory when A does not need it, but you
> want to keep it capped to keep latency reasonable when A ramps up.
>
> As long as B is hitting its own hard limit, you value B1's and B2's
> guarantees in the context of pressure local to the hierarchy; in the
> context of B having 25G worth of memory; in the context of B1
> competing with B2 over the memory allowed by B.
>
> However, as soon as global reclaim kicks in, the context changes and
> the priorities shift. Now, B does not have 25G anymore but only 16G
> *in its competition with A*. We absolutely do not want to respect the
> guarantees made to B1 and B2. Not only can they not be met anyway,
> but they are utterly meaningless at this point. They were set with
> 25G in mind.
I find the configuration confusing. What does it mean? Let's say B
doesn't consume memory itself and B1 is inactive. Does that mean B2
is guaranteed up to 16G? Or is it that B2 is still guaranteed only
up to 10G?
If former, what if the intention was just to prevent B's total going
past 16G and the configuration never meant to grant extra 6G to B2?
The latter makes more sense as softlimit, but what happens when B
itself consumes memory? Is B's internal consumption guaranteed any
memory? If so, what if the internal usage is mostly uninteresting and
the admin never meant them to get any guarantee and it unnecessarily
eats into B1's guarantee when it comes up? If not, what happens when
B1 creates a sub-cgroup B11? Do all internal usages of B1 lose the
guarantee?
If I'm not too confused, most of the confusion arises from the fact
that guarantee's specificity is towards max (as evidenced by its
default being zero) but composition through hierarchy happening in the
other direction (ie. guarantee in internal node exerts pressure
towards zero on its subtree).
Doesn't something like the following suit what you had in mind better?
h: hardlimit, s: softlimit, g: guarantee
root (physical memory = 32G)
/ \
A B (h:25G, s:16G)
/ \ / \
A1 A2 / B2 (g:10G)
/
B1 (g:15G)
It doesn't solve any of the execution issues arising from having to
enforce 16G limit over 10G and 15G guarantees but there is no room for
misinterpreting the intention of the configuration. You could say
that this is just a convenient case because it doesn't actually have
nesting of the same params. Let's add one then.
root (physical memory = 32G)
/ \
A B (h:25G, s:16G g:15G)
/ \ / \
A1 A2 / B2 (g:10G)
/
B1 (g:15G)
If we follow the rule of composition by intersection, the
interpretation of B's guarantee is clear. If B's subtree is under
15G, regardless of individual usages of B1 and B2, they shouldn't feel
reclaim pressure. When B's subtree goes over 15G, B1 and B2 will have
to fend for themselves. The ones which are over their own guarantee
will feel the "normal" reclaim pressure; the others will continue to
evade reclaim. When B's subtree goes over 16G, someone in B's subtree
has to pay, preferably the ones not guaranteed anything first.
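The three regimes in the B example can be checked numerically. This is a
hypothetical helper illustrating the intended semantics only, with the
limits from the diagram hard-coded and usages made up:

```python
# B (s:16G, g:15G) with children B1 (g:15G) and B2 (g:10G): under 15G
# nobody is pressured; between 15G and 16G only over-guarantee children
# pay; past 16G someone must pay, preferring the over-guarantee ones.
def pressured_children(b1_usage, b2_usage):
    B_GUARANTEE, B_SOFT = 15, 16
    B1_G, B2_G = 15, 10
    total = b1_usage + b2_usage
    if total <= B_GUARANTEE:          # whole subtree protected by B's 15G
        return []
    over = [name for name, usage, g in
            (("B1", b1_usage, B1_G), ("B2", b2_usage, B2_G))
            if usage > g]
    if total <= B_SOFT:               # only over-guarantee children pay
        return over
    # Past the soft limit someone must pay; fall back to everyone
    # if nobody is over its own guarantee.
    return over or ["B1", "B2"]

print(pressured_children(7, 7))    # subtree under 15G: nobody pays
print(pressured_children(4, 12))   # B2 is over its 10G guarantee
```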
> [ It may be conceivable that you want different guarantees for B1 and
> B2 depending on where the pressure comes from. One setting for when
> the 25G limit applies, one setting when the 32G physical memory
> limit applies. Basically, every group would need a vector of
> guarantee settings with one setting per ancestor.
I don't get this. If a cgroup is under the guarantee limit and none
of its parents are under hard/softlimit, it shouldn't feel any
pressure. If a cgroup is above guarantee, it should feel the same
pressure everyone else in that subtree is subject to. If any of the
ancestors has triggered soft / hard limit, it's gonna have to give up
pages pretty quickly.
> That being said, I absolutely disagree with the idea of trying to
> adhere to individual memcg guarantees in the first reclaim cycle,
> regardless of context and then just ignore them on the second pass.
> It's a horrible way to guess which context the admin had in mind. ]
I think there needs to be a way to avoid penalizing sub-cgroups under
guarantee amount when there are siblings which can give out pages over
guarantee. I don't think I'm following the "guessing the intention"
part. Can you please elaborate?
> Now, there is of course the other scenario in which the current
> hierarchical limit application can get in your way: when you give
> intermediate nodes their own memory. Because then you may see the
> need to apply certain limits to that hierarchy root's local memory
> only instead of all memory in the hierarchy. But once we open that
> door, you might expect this to be an option for every limit, where
> even the hard limit of a hierarchy root only applies to that group's
> local memory instead of the whole hierarchy. I certainly do not want
> to apply hierarchy semantics for some limits and not for others. But
> Google has basically asked for hierarchical hard limits and local soft
> limits / guarantees.
So, proportional controllers need this. They need to be able to
configure the amount the tasks belonging to an inner node can consume
when competing against the children groups. It isn't a particularly
pretty thing but a necessity given that we allow tasks and resource
consumptions in inner nodes. I was wondering about this and asked
Michal whether anybody wants something like that and IIRC his answer
was negative. Can you please expand on what google asked for?
Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-23 17:09 ` Tejun Heo
@ 2013-04-26 11:51 ` Michal Hocko
2013-04-26 18:37 ` Tejun Heo
0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2013-04-26 11:51 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Tue 23-04-13 10:09:00, Tejun Heo wrote:
> Hello, Michal.
>
> On Tue, Apr 23, 2013 at 11:29:56AM +0200, Michal Hocko wrote:
> > Ohh, well and we are back in the circle again. Nobody is proposing
> > overloading soft reclaim for any bottom-up (if that is what you mean by
> > your opposite direction) pressure handling.
> >
> > > You're making it a point control rather than range one.
> >
> > Be more specific here, please?
> >
> > > Maybe you can define some twisted rules serving certain specific use
> > > case, but it's gonna be confusing / broken for different use cases.
> >
> > Tejun, your argumentation is really hand wavy here. Which use cases will
> > be broken and which one will be confusing. Name one for an illustration.
> >
> > > You're so confused that you don't even know you're confused.
> >
> > Yes, you keep repeating that. But you haven't pointed out any single
> > confusing use case so far. Please please stop this, it is not productive.
> > We are still talking about using soft limit to control overcommit
> > situation as gracefully as possible. I hope we are on the same page
> > about that at least.
>
> Hmmm... I think I was at least somewhat clear on my points. I'll try
> again. Let's see if I can at least make you understand what my point
> is. Maybe some diagrams will help.
Maybe I should have been more explicit about this but _yes I do agree_
that a separate limit would work as well. I just do not want to
introduce yet-another-limit unless it is _really_ necessary. We have up
to 4 of them depending on the configuration which is a lot already. And
the new knob would certainly become a guarantee, whatever words we use
for it, with more expectations than the soft limit, and I am afraid
that won't be that easy (unless we provide a poison pill for emergency
cases).
My rework was based on the soft limit semantic which we had for quite
some time and tried to enhance it to be more useful. I do understand
your concerns about the cleanness of the interface; I just objected
that the new meaning doesn't add any guarantee. The implementation
just tries to be clever about whom to reclaim when handling external
pressure (for which the soft limit was introduced in the first place)
while using hints from the limit as much as possible.
Anyway, I will think about the pros and cons of the new limit. I think we
shouldn't block the first 3 patches in the series which keep the current
semantic and just change the internals to do the same thing. Do you
agree?
We can discuss single vs. new knob in the mean time of course.
[...]
Thanks!
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-26 11:51 ` Michal Hocko
@ 2013-04-26 18:37 ` Tejun Heo
2013-04-29 15:27 ` Michal Hocko
0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2013-04-26 18:37 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
Hey,
On Fri, Apr 26, 2013 at 01:51:20PM +0200, Michal Hocko wrote:
> Maybe I should have been more explicit about this but _yes I do agree_
> that a separate limit would work as well. I just do not want to
Heh, the point was more about what we shouldn't be doing, but, yeah,
it's good that we at least agree on something. :)
> Anyway, I will think about cons and pros of the new limit. I think we
> shouldn't block the first 3 patches in the series which keep the current
> semantic and just change the internals to do the same thing. Do you
> agree?
As the merge window is coming right up, if it isn't something super
urgent, can we please hold it off until after the merge window? It
would be really great if we can pin down the semantics of the knob
before doing anything. Please. I'll think / study more about it in
the coming weeks.
Thanks.
--
tejun
* Re: memcg: softlimit on internal nodes
2013-04-26 18:37 ` Tejun Heo
@ 2013-04-29 15:27 ` Michal Hocko
0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2013-04-29 15:27 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
linux-mm, Hugh Dickins, Ying Han, Glauber Costa,
Michel Lespinasse, Greg Thelen
On Fri 26-04-13 11:37:41, Tejun Heo wrote:
> Hey,
>
> On Fri, Apr 26, 2013 at 01:51:20PM +0200, Michal Hocko wrote:
> > Maybe I should have been more explicit about this but _yes I do agree_
> > that a separate limit would work as well. I just do not want to
>
> Heh, the point was more about what we shouldn't be doing, but, yeah,
> it's good that we at least agree on something. :)
>
> > Anyway, I will think about cons and pros of the new limit. I think we
> > shouldn't block the first 3 patches in the series which keep the current
> > semantic and just change the internals to do the same thing. Do you
> > agree?
>
> As the merge window is coming right up, if it isn't something super
> urgent, can we please hold it off until after the merge window? It
> would be really great if we can pin down the semantics of the knob
> before doing anything.
I think that merging it into 3.10 would be too ambitious but I think
this core code cleanup makes sense for future discussions so I would
like to post it for the -mm tree at least. The sooner the better,
IMHO.
> Please. I'll think / study more about it in the coming weeks.
>
> Thanks.
>
> --
> tejun
--
Michal Hocko
SUSE Labs
* Re: memcg: softlimit on internal nodes
2013-04-25 0:33 ` Tejun Heo
@ 2013-04-29 18:39 ` Johannes Weiner
0 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2013-04-29 18:39 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Hocko, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm,
Hugh Dickins, Ying Han, Glauber Costa, Michel Lespinasse,
Greg Thelen
On Wed, Apr 24, 2013 at 05:33:35PM -0700, Tejun Heo wrote:
> Hello, Johannes.
>
> On Wed, Apr 24, 2013 at 05:45:31PM -0400, Johannes Weiner wrote:
> > Historically soft limit meant prioritizing certain memcgs over others
> > and the memcgs over their soft limit should experience relatively more
> > reclaim pressure than the ones below their soft limit.
> >
> > Now, if we go and say you are only reclaimed when you exceed your soft
> > limit we would retain the prioritization aspect. Groups in excess of
> > their soft limits would still experience relatively more reclaim
> > pressure than their well-behaved peers. But it would have the nice
> > side effect of acting more or less like a guarantee as well.
>
> But, at the same time, it has the not-so-nice side-effect of losing
> the ability to express negative prioritization. It isn't difficult to
> imagine use cases where the system doesn't want to partition the whole
> system into discrete cgroups but wants to limit the amount of
> resources consumed by low-priority workloads.
>
> Also, in the long-term, I really want cgroup to become something
> generally useful and automatically configurable (optional of course)
> by the base system according to the types of workloads. For something
> like that to be possible, the control knobs shouldn't be fiddly,
> complex, or require full partitioning of the system.
>
> > I don't think this approach is as unreasonable as you make it out to
> > be, but it does make things more complicated. It could be argued that
> > we should add a separate guarantee knob because two simple knobs might
> > be better than a complicated one.
>
> The problem that I see is that this is being done without clearing up
> the definition of the knob. The knob's role is being changed or at
> least solidified into something which makes it inconsistent with
> everything else in cgroup in a way which seems very reactive to me.
>
> I can see such reactive customizations being useful in satisfying
> certain specific use cases - google's primarily right now; however,
> it's likely to come back and bite us when we want to do something
> different or generic with cgroup. It's gonna be something which ends
> up being labeled as unusable in other types of setups (e.g. where not
> all workloads are put under active control or whatever) after causing
> a lot of head-scratching and not-particularly-happy moments. Cgroup
> as a whole strongly needs consistency across its control knobs for it
> to be generally useful.
>
> Well, that and past frustrations over interface and implementations of
> memcg, which seems to bear a lot of similarities with what's going on
> now, probably have made me go over-board. Sorry about that, but I
> really hope memcg do better.
I understand your frustration; I want to get it right as well before
committing to anything.
> > no single member is more at fault than the others. I would expect if
> > we added a guarantee knob, this would also mean that no individual
> > memcg can be treated as being within their guaranteed memory if the
> > hierarchy as a whole is in excess of its guarantee.
>
> I disagree here. It should be symmetrical to how hardlimit works.
> Let's say there's one parent - P - and child - C. For hardlimit, if P
> is over limit, it exerts pressure on its subtree regardless of C, and,
> if P is under limit, it doesn't affect C.
>
> For guarantee / protection, it should work the same but in the
> opposite direction. If P is under limit, it should protect the
> subtree from reclaim regardless of C. If P is over limit, it
> shouldn't affect C.
>
> As I draw in the other reply to Michal, each knob should be a starting
> point of a single range in the pre-defined direction and composition
> of those configurations across hierarchy should result in intersection
> of them. I can't see any reason to deviate from that here.
>
> IOW, protection control shouldn't care about generating memory
> pressure. That's the job of soft and hard limits, both of which
> should apparently override protection. That way, each control knob
> becomes fully consistent within itself across the hierarchy and the
> questions become those of how soft limit should override protection
> rather than the semantics of soft limit itself.
>
> > The root of the hierarchy represents the whole hierarchy. Its memory
> > usage is the combined memory usage of all members. The limit set to
> > the hierarchy root applies to the combined memory usage of the
> > hierarchy. Breaching that limit has consequences for the hierarchy as
> > a whole. Be it soft limit or guarantee.
> >
> > This is how hierarchies have always worked and it allows limits to be
> > layered and apply depending on the source of pressure:
>
> That's definitely true for soft and hard limits but flipped for
> guarantees and I think that's the primary source of confusion -
> guarantee being overloaded onto softlimit.
>
> > root (physical memory = 32G)
> > / \
> > A B (hard limit = 25G, guarantee = 16G)
> > / \ / \
> > A1 A2 / B2 (guarantee = 10G)
> > /
> > B1 (guarantee = 15G)
> >
> > Remember that hard limits are usually overcommitted, so you allow B to
> > use more of the fair share of memory when A does not need it, but you
> > want to keep it capped to keep latency reasonable when A ramps up.
> >
> > As long as B is hitting its own hard limit, you value B1's and B2's
> > guarantees in the context of pressure local to the hierarchy; in the
> > context of B having 25G worth of memory; in the context of B1
> > competing with B2 over the memory allowed by B.
> >
> > However, as soon as global reclaim kicks in, the context changes and
> > the priorities shift. Now, B does not have 25G anymore but only 16G
> > *in its competition with A*. We absolutely do not want to respect the
> > guarantees made to B1 and B2. Not only can they not be met anyway,
> > but they are utterly meaningless at this point. They were set with
> > 25G in mind.
>
> I find the configuration confusing. What does it mean? Let's say B
> doesn't consume memory itself and B1 is inactive. Does that mean B2
> is guaranteed up to 16G? Or is it that B2 is still guaranteed only
> up to 10G?
Both.
Global memory pressure will leave B and all its children alone as long
as their sum memory usage is below 16G. If B2 is the only memory user
in there, it means that it won't be reclaimed until it uses 16G.
However, I would not call it a guarantee of 16G from B2's point of
view, because it does not control B1's usage.
> If former, what if the intention was just to prevent B's total going
> past 16G and the configuration never meant to grant extra 6G to B2?
>
> The latter makes more sense as softlimit, but what happens when B
> itself consumes memory? Is B's internal consumption guaranteed any
> memory? If so, what if the internal usage is mostly uninteresting and
> the admin never meant them to get any guarantee and it unnecessarily
> eats into B1's guarantee when it comes up? If not, what happens when
> B1 creates a sub-cgroup B11? Do all internal usages of B1 lose the
> guarantee?
>
> If I'm not too confused, most of the confusions arise from the fact
> that guarantee's specificity is towards max (as evidenced by its
> default being zero) but composition through hierarchy happening in the
> other direction (ie. guarantee in internal node exerts pressure
> towards zero on its subtree).
>
> Doesn't something like the following suit what you had in mind better?
>
> h: hardlimit, s: softlimit, g: guarantee
>
> root (physical memory = 32G)
> / \
> A B (h:25G, s:16G)
> / \ / \
> A1 A2 / B2 (g:10G)
> /
> B1 (g:15G)
No, because I do not want B1 to be guaranteed half of memory in case
of global memory pressure, only in the case where B has 25G available.
Also, a soft limit does not guarantee that everything below B is left
alone as long as it is within 16G of memory.
> It doesn't solve any of the execution issues arising from having to
> enforce 16G limit over 10G and 15G guarantees but there is no room for
> misinterpreting the intention of the configuration. You could say
> that this is just a convenient case because it doesn't actually have
> nesting of the same params. Let's add one then.
>
> root (physical memory = 32G)
> / \
> A B (h:25G, s:16G g:15G)
> / \ / \
> A1 A2 / B2 (g:10G)
> /
> B1 (g:15G)
>
> If we follow the rule of composition by intersection, the
> interpretation of B's guarantee is clear. If B's subtree is under
> 15G, regardless of individual usages of B1 and B2, they shouldn't feel
> reclaim pressure. When B's subtree goes over 15G, B1 and B2 will have
> to fend off for themselves. If the ones which are over their own
> guarantee will feel the "normal" reclaim pressure; otherwise, they
> will continue to evade reclaim. When B's subtree goes over 16G,
> someone in B's subtree have to pay, preferably the ones not guaranteed
> anything first.
Yes, and that's the "intention guessing" that I do not agree with.
The guarantees of B1 and B2 were written for the 25G available to B
without global pressure. They mean "if B exceeds 25G, reclaim B2 if
it exceeds 10G and reclaim B1 if it exceeds 15G".
All of a sudden, your actual constraint is 16G. I don't want to use
the guarantees that were meant for a different memory situation as a
hint to decide which group should be reclaimed first.
Either we have separate limits for the 25G situation and the 16G
situation or we need to express guarantees as a percentage of
available memory.
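The percentage idea is straightforward to sketch: if the guarantees had
been configured as fractions of B's available memory rather than absolute
bytes, they would rescale with the context. A hypothetical illustration
using the numbers from this thread (no such interface exists in memcg):

```python
# Fraction-based guarantees: B1 and B2 were configured against B's 25G
# hard limit; as fractions they rescale automatically when global
# pressure shrinks B's effective budget to 16G.
def scaled_guarantees(available, fractions):
    return {name: frac * available for name, frac in fractions.items()}

fractions = {"B1": 15 / 25, "B2": 10 / 25}   # from the 25G configuration
print(scaled_guarantees(25, fractions))  # local pressure: 15G and 10G
print(scaled_guarantees(16, fractions))  # global pressure: 9.6G and 6.4G
```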
> > [ It may be conceivable that you want different guarantees for B1 and
> > B2 depending on where the pressure comes from. One setting for when
> > the 25G limit applies, one setting when the 32G physical memory
> > limit applies. Basically, every group would need a vector of
> > guarantee settings with one setting per ancestor.
>
> I don't get this. If a cgroup is under the guarantee limit and none
> of its parents are under hard/softlimit, it shouldn't feel any
> pressure. If a cgroup is above guarantee, it should feel the same
> pressure everyone else in that subtree is subject to. If any of the
> ancestors has triggered soft / hard limit, it's gonna have to give up
> pages pretty quickly.
>
> > That being said, I absolutely disagree with the idea of trying to
> > adhere to individual memcg guarantees in the first reclaim cycle,
> > regardless of context and then just ignore them on the second pass.
> > It's a horrible way to guess which context the admin had in mind. ]
>
> I think there needs to be a way to avoid penalizing sub-cgroups under
> guarantee amount when there are siblings which can give out pages over
> guarantee. I don't think I'm following the "guessing the intention"
> part. Can you please elaborate?
Hope this is explained above.
> > Now, there is of course the other scenario in which the current
> > hierarchical limit application can get in your way: when you give
> > intermediate nodes their own memory. Because then you may see the
> > need to apply certain limits to that hierarchy root's local memory
> > only instead of all memory in the hierarchy. But once we open that
> > door, you might expect this to be an option for every limit, where
> > even the hard limit of a hierarchy root only applies to that group's
> > local memory instead of the whole hierarchy. I certainly do not want
> > to apply hierarchy semantics for some limits and not for others. But
> > Google has basically asked for hierarchical hard limits and local soft
> > limits / guarantees.
>
> So, proportional controllers need this. They need to be able to
> configure the amount the tasks belonging to an inner node can consume
> when competing against the children groups. It isn't a particularly
> pretty thing but a necessity given that we allow tasks and resource
> consumptions in inner nodes. I was wondering about this and asked
> Michal whether anybody wants something like that and IIRC his answer
> was negative. Can you please expand on what google asked for?
My understanding is that they have groups of jobs:
G1
/|\
/ | \
J1 J2 J3
When a job exits, its J group is removed and its leftover cache is
reparented to the G group. Obviously, they want that cache to be
reclaimed over currently used job memory, but if they set the soft
limit in G1 to a very low value, it means that this low soft limit
applies to the G1 hierarchy as a whole.
Michal's and my suggestion was that they instead move this cache over
to another sibling group dedicated to collecting leftover cache,
i.e. Jcache. Then set the soft limit of this group to 0.
OR do not delete the job groups, set their soft limits to 0, and reap
the groups once memory usage in them drops to 0 (easy to do with the
events interface we have that wakes you up for memory watermark events).
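For reference, the events interface mentioned here is the cgroup v1
memory-threshold mechanism: register an eventfd against
memory.usage_in_bytes via cgroup.event_control and it fires when usage
crosses the threshold. A rough sketch, assuming a cgroup v1 memory
controller mount and Python 3.10+ for os.eventfd; paths are illustrative:

```python
import os

def event_control_line(event_fd, usage_fd, threshold_bytes):
    # The registration string cgroup v1's cgroup.event_control expects:
    # "<event fd> <fd of memory.usage_in_bytes> <threshold>".
    return f"{event_fd} {usage_fd} {threshold_bytes}"

def watch_usage(cgroup_dir, threshold_bytes):
    # Register a threshold event; a read() on the returned eventfd
    # blocks until usage crosses the threshold, at which point a
    # management daemon could reap the now-empty group.
    efd = os.eventfd(0)
    ufd = os.open(os.path.join(cgroup_dir, "memory.usage_in_bytes"),
                  os.O_RDONLY)
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
        f.write(event_control_line(efd, ufd, threshold_bytes))
    return efd
```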
Both solutions, to me, sound so much simpler than starting to
recognize and provide exclusive limits for local memory usage of inner
nodes.
Thread overview: 46+ messages
2013-04-20 0:26 memcg: softlimit on internal nodes Tejun Heo
2013-04-20 0:42 ` Tejun Heo
2013-04-20 3:35 ` Greg Thelen
2013-04-21 1:53 ` Tejun Heo
2013-04-20 3:16 ` Michal Hocko
2013-04-21 2:23 ` Tejun Heo
2013-04-21 8:55 ` Michel Lespinasse
2013-04-22 4:24 ` Tejun Heo
2013-04-22 7:14 ` Michel Lespinasse
2013-04-22 14:48 ` Tejun Heo
2013-04-22 15:37 ` Michal Hocko
2013-04-22 15:46 ` Tejun Heo
2013-04-22 15:54 ` Michal Hocko
2013-04-22 16:01 ` Tejun Heo
2013-04-23 9:58 ` Michel Lespinasse
2013-04-23 10:17 ` Glauber Costa
2013-04-23 11:40 ` Michal Hocko
2013-04-23 11:54 ` Glauber Costa
2013-04-23 12:51 ` Michel Lespinasse
2013-04-23 13:06 ` Michal Hocko
2013-04-23 13:13 ` Glauber Costa
2013-04-23 13:28 ` Michal Hocko
2013-04-23 11:32 ` Michal Hocko
2013-04-23 12:45 ` Michel Lespinasse
2013-04-23 12:59 ` Michal Hocko
2013-04-23 12:51 ` Michal Hocko
2013-04-21 12:46 ` Michal Hocko
2013-04-22 4:39 ` Tejun Heo
2013-04-22 15:19 ` Michal Hocko
2013-04-22 15:57 ` Tejun Heo
2013-04-22 15:57 ` Tejun Heo
2013-04-22 16:20 ` Michal Hocko
2013-04-22 18:30 ` Tejun Heo
2013-04-23 9:29 ` Michal Hocko
2013-04-23 17:09 ` Tejun Heo
2013-04-26 11:51 ` Michal Hocko
2013-04-26 18:37 ` Tejun Heo
2013-04-29 15:27 ` Michal Hocko
2013-04-23 9:33 ` [RFC v2 0/4] soft limit rework Michal Hocko
2013-04-23 9:33 ` [RFC v2 1/4] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko
2013-04-23 9:33 ` [RFC v2 2/4] memcg: Get rid of soft-limit tree infrastructure Michal Hocko
2013-04-23 9:33 ` [RFC v2 3/4] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
2013-04-23 9:33 ` [RFC v2 4/4] memcg: Ignore soft limit until it is explicitly specified Michal Hocko
2013-04-24 21:45 ` memcg: softlimit on internal nodes Johannes Weiner
2013-04-25 0:33 ` Tejun Heo
2013-04-29 18:39 ` Johannes Weiner