From: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
To: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org,
Andrew Morton
<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Subject: Re: [PATCH v2] memcg: execute partial memcg freeing in mem_cgroup_destroy
Date: Fri, 17 Aug 2012 08:51:28 +0400 [thread overview]
Message-ID: <502DCDD0.3060502@parallels.com> (raw)
In-Reply-To: <xr93vcgiazok.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
On 08/17/2012 08:51 AM, Greg Thelen wrote:
> On Thu, Aug 16 2012, Glauber Costa wrote:
>
>> A lot of the initialization we do in mem_cgroup_create() is done with
>> softirqs enabled. This include grabbing a css id, which holds
>> &ss->id_lock->rlock, and the per-zone trees, which holds
>> rtpz->lock->rlock. All of those signal to the lockdep mechanism that
>> those locks can be used in SOFTIRQ-ON-W context. This means that the
>> freeing of memcg structure must happen in a compatible context,
>> otherwise we'll get a deadlock.
>>
>> The reference counting mechanism we use allows the memcg structure to be
>> freed later and outlive the actual memcg destruction from the
>> filesystem. However, we have little, if any, means to guarantee in which
>> context the last memcg_put will happen. The best we can do is test it
>> and try to make sure no invalid context releases are happening. But as
>> we add more code to memcg, the possible interactions grow in number and
>> expose more ways to get context conflicts.
>>
>> Greg Thelen reported a bug with that patchset applied that would trigger
>> if a task would hold a reference to a memcg through its kmem counter.
>> This would mean that killing that task would eventually get us to
>> __mem_cgroup_free() after dropping the last kernel page reference, in an
>> invalid IN-SOFTIRQ-W.
>>
>> Besides that, he raised the quite valid concern that keeping the full
>> memcg around for an unbounded period of time can eventually exhaust the
>> css_id space, and pin a lot of not needed memory. For instance, a
>> O(nr_cpus) percpu data for the stats is kept around, and we don't expect
>> to use it after the memcg is gone.
>>
>> Both those problems can be avoided by freeing as much as we can in
>> mem_cgroup_destroy(), and leaving only the memcg structure and the
>> static branches to be removed later. That freeing run on a predictable
>> context, getting rid of the softirq problem, and also reduces pressure
>> both on the css_id space and total dangling memory.
>
> Thank you for the patch. I think it is a step in the right direction,
> but I suspect a problem (noted below).
>
>> I consider this safe because all the page_cgroup references to user
>> pages are reparented to the imediate parent, so late uncharges won't
>> trigger the common uncharge paths with a destroyed memcg.
>>
>> Although we don't migrate kernel pages to parent, we also don't call the
>> common uncharge paths for those pages, rather uncharging the
>> res_counters directly. So we are safe on this side of the wall as well.
>>
>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> Reported-by: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
>> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
>> ---
>> mm/memcontrol.c | 25 ++++++++++++-------------
>> 1 file changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 9a82965..78cb394 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5169,18 +5169,9 @@ static void free_work(struct work_struct *work)
>> vfree(memcg);
>> }
>>
>> -static void free_rcu(struct rcu_head *rcu_head)
>> -{
>> - struct mem_cgroup *memcg;
>> -
>> - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
>> - INIT_WORK(&memcg->work_freeing, free_work);
>> - schedule_work(&memcg->work_freeing);
>> -}
>> -
>> /*
>> - * At destroying mem_cgroup, references from swap_cgroup can remain.
>> - * (scanning all at force_empty is too costly...)
>> + * At destroying mem_cgroup, references from swap_cgroup and other places can
>> + * remain. (scanning all at force_empty is too costly...)
>> *
>> * Instead of clearing all references at force_empty, we remember
>> * the number of reference from swap_cgroup and free mem_cgroup when
>> @@ -5188,6 +5179,14 @@ static void free_rcu(struct rcu_head *rcu_head)
>> *
>> * Removal of cgroup itself succeeds regardless of refs from swap.
>> */
>> +static void free_rcu(struct rcu_head *rcu_head)
>> +{
>> + struct mem_cgroup *memcg;
>> +
>> + memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
>> + INIT_WORK(&memcg->work_freeing, free_work);
>> + schedule_work(&memcg->work_freeing);
>> +}
>>
>> static void __mem_cgroup_free(struct mem_cgroup *memcg)
>> {
>> @@ -5200,7 +5199,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>> free_mem_cgroup_per_zone_info(memcg, node);
>>
>> free_percpu(memcg->stat);
>> - call_rcu(&memcg->rcu_freeing, free_rcu);
>> }
>>
>> static void mem_cgroup_get(struct mem_cgroup *memcg)
>> @@ -5212,7 +5210,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
>> {
>> if (atomic_sub_and_test(count, &memcg->refcnt)) {
>> struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>> - __mem_cgroup_free(memcg);
>> + call_rcu(&memcg->rcu_freeing, free_rcu);
>> if (parent)
>> mem_cgroup_put(parent);
>> }
>> @@ -5377,6 +5375,7 @@ static void mem_cgroup_destroy(struct cgroup *cont)
>>
>> kmem_cgroup_destroy(memcg);
>>
>> + __mem_cgroup_free(memcg);
>
> I suspect that this will free the css_id before all swap entries have
> dropped their references with mem_cgroup_uncharge_swap() ? I think we
> only want to call __mem_cgroup_free() once all non kernel page
> references have been released. This would include mem_cgroup_destroy()
> and any pending calls to mem_cgroup_uncharge_swap(). I'm not sure, but
> may be a second refcount or some optimization with the kmem_accounted
> bitmask can efficiently handle this.
>
Can we demonstrate that? I agree there might be a potential problem, and
that is why I sent this separately. But the impression I got after
testing and reading the code, was that the memcg information in
pc->mem_cgroup would be updated to the parent.
This means that any later call to uncharge or uncharge_swap would just
uncharge from the parent memcg and we'd have no problem.
WARNING: multiple messages have this Message-ID (diff)
From: Glauber Costa <glommer@parallels.com>
To: Greg Thelen <gthelen@google.com>
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org,
kamezawa.hiroyu@jp.fujitsu.com,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.cz>
Subject: Re: [PATCH v2] memcg: execute partial memcg freeing in mem_cgroup_destroy
Date: Fri, 17 Aug 2012 08:51:28 +0400 [thread overview]
Message-ID: <502DCDD0.3060502@parallels.com> (raw)
In-Reply-To: <xr93vcgiazok.fsf@gthelen.mtv.corp.google.com>
On 08/17/2012 08:51 AM, Greg Thelen wrote:
> On Thu, Aug 16 2012, Glauber Costa wrote:
>
>> A lot of the initialization we do in mem_cgroup_create() is done with
>> softirqs enabled. This include grabbing a css id, which holds
>> &ss->id_lock->rlock, and the per-zone trees, which holds
>> rtpz->lock->rlock. All of those signal to the lockdep mechanism that
>> those locks can be used in SOFTIRQ-ON-W context. This means that the
>> freeing of memcg structure must happen in a compatible context,
>> otherwise we'll get a deadlock.
>>
>> The reference counting mechanism we use allows the memcg structure to be
>> freed later and outlive the actual memcg destruction from the
>> filesystem. However, we have little, if any, means to guarantee in which
>> context the last memcg_put will happen. The best we can do is test it
>> and try to make sure no invalid context releases are happening. But as
>> we add more code to memcg, the possible interactions grow in number and
>> expose more ways to get context conflicts.
>>
>> Greg Thelen reported a bug with that patchset applied that would trigger
>> if a task would hold a reference to a memcg through its kmem counter.
>> This would mean that killing that task would eventually get us to
>> __mem_cgroup_free() after dropping the last kernel page reference, in an
>> invalid IN-SOFTIRQ-W.
>>
>> Besides that, he raised the quite valid concern that keeping the full
>> memcg around for an unbounded period of time can eventually exhaust the
>> css_id space, and pin a lot of not needed memory. For instance, a
>> O(nr_cpus) percpu data for the stats is kept around, and we don't expect
>> to use it after the memcg is gone.
>>
>> Both those problems can be avoided by freeing as much as we can in
>> mem_cgroup_destroy(), and leaving only the memcg structure and the
>> static branches to be removed later. That freeing run on a predictable
>> context, getting rid of the softirq problem, and also reduces pressure
>> both on the css_id space and total dangling memory.
>
> Thank you for the patch. I think it is a step in the right direction,
> but I suspect a problem (noted below).
>
>> I consider this safe because all the page_cgroup references to user
>> pages are reparented to the imediate parent, so late uncharges won't
>> trigger the common uncharge paths with a destroyed memcg.
>>
>> Although we don't migrate kernel pages to parent, we also don't call the
>> common uncharge paths for those pages, rather uncharging the
>> res_counters directly. So we are safe on this side of the wall as well.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Reported-by: Greg Thelen <gthelen@google.com>
>> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> ---
>> mm/memcontrol.c | 25 ++++++++++++-------------
>> 1 file changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 9a82965..78cb394 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5169,18 +5169,9 @@ static void free_work(struct work_struct *work)
>> vfree(memcg);
>> }
>>
>> -static void free_rcu(struct rcu_head *rcu_head)
>> -{
>> - struct mem_cgroup *memcg;
>> -
>> - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
>> - INIT_WORK(&memcg->work_freeing, free_work);
>> - schedule_work(&memcg->work_freeing);
>> -}
>> -
>> /*
>> - * At destroying mem_cgroup, references from swap_cgroup can remain.
>> - * (scanning all at force_empty is too costly...)
>> + * At destroying mem_cgroup, references from swap_cgroup and other places can
>> + * remain. (scanning all at force_empty is too costly...)
>> *
>> * Instead of clearing all references at force_empty, we remember
>> * the number of reference from swap_cgroup and free mem_cgroup when
>> @@ -5188,6 +5179,14 @@ static void free_rcu(struct rcu_head *rcu_head)
>> *
>> * Removal of cgroup itself succeeds regardless of refs from swap.
>> */
>> +static void free_rcu(struct rcu_head *rcu_head)
>> +{
>> + struct mem_cgroup *memcg;
>> +
>> + memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
>> + INIT_WORK(&memcg->work_freeing, free_work);
>> + schedule_work(&memcg->work_freeing);
>> +}
>>
>> static void __mem_cgroup_free(struct mem_cgroup *memcg)
>> {
>> @@ -5200,7 +5199,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>> free_mem_cgroup_per_zone_info(memcg, node);
>>
>> free_percpu(memcg->stat);
>> - call_rcu(&memcg->rcu_freeing, free_rcu);
>> }
>>
>> static void mem_cgroup_get(struct mem_cgroup *memcg)
>> @@ -5212,7 +5210,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
>> {
>> if (atomic_sub_and_test(count, &memcg->refcnt)) {
>> struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>> - __mem_cgroup_free(memcg);
>> + call_rcu(&memcg->rcu_freeing, free_rcu);
>> if (parent)
>> mem_cgroup_put(parent);
>> }
>> @@ -5377,6 +5375,7 @@ static void mem_cgroup_destroy(struct cgroup *cont)
>>
>> kmem_cgroup_destroy(memcg);
>>
>> + __mem_cgroup_free(memcg);
>
> I suspect that this will free the css_id before all swap entries have
> dropped their references with mem_cgroup_uncharge_swap() ? I think we
> only want to call __mem_cgroup_free() once all non kernel page
> references have been released. This would include mem_cgroup_destroy()
> and any pending calls to mem_cgroup_uncharge_swap(). I'm not sure, but
> may be a second refcount or some optimization with the kmem_accounted
> bitmask can efficiently handle this.
>
Can we demonstrate that? I agree there might be a potential problem, and
that is why I sent this separately. But the impression I got after
testing and reading the code, was that the memcg information in
pc->mem_cgroup would be updated to the parent.
This means that any later call to uncharge or uncharge_swap would just
uncharge from the parent memcg and we'd have no problem.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2012-08-17 4:51 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-08-16 11:01 [PATCH v2] memcg: execute partial memcg freeing in mem_cgroup_destroy Glauber Costa
[not found] ` <1345114903-20627-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-17 4:51 ` Greg Thelen
2012-08-17 4:51 ` Greg Thelen
[not found] ` <xr93vcgiazok.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
2012-08-17 4:51 ` Glauber Costa [this message]
2012-08-17 4:51 ` Glauber Costa
2012-08-17 6:37 ` Greg Thelen
[not found] ` <xr93a9xu9g7z.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
2012-08-17 6:36 ` Glauber Costa
2012-08-17 6:36 ` Glauber Costa
2012-08-21 13:34 ` Glauber Costa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=502DCDD0.3060502@parallels.com \
--to=glommer-bzqdu9zft3wakbo8gow8eq@public.gmane.org \
--cc=akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
--cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
--cc=hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org \
--cc=kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org \
--cc=linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org \
--cc=mhocko-AlSwsSmVLrQ@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.