From: Li Zefan
Subject: Re: Possible regression with cgroups in 3.11
Date: Fri, 8 Nov 2013 16:36:34 +0800
Message-ID: <527CA292.7090104@huawei.com>
References: <5270BFE7.4000602@huawei.com> <20131031130647.0ff6f2c7@gandalf.local.home> <20131031192732.2dbb14b3@gandalf.local.home> <5277932C.40400@huawei.com> <5278B3F1.9040502@huawei.com> <20131107235301.GB1092@cmpxchg.org> <20131108001437.GC1092@cmpxchg.org>
In-Reply-To: <20131108001437.GC1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
To: Johannes Weiner
Cc: Markus Blank-Burian, Steven Rostedt, Hugh Dickins, Michal Hocko, David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

>>>> kworker/3:5-7605 [003] .... 987.475678: mem_cgroup_reparent_charges: usage: 1568768
>>>> kworker/3:5-7605 [003] .... 987.478677: mem_cgroup_reparent_charges: usage: 1568768
>>>> kworker/3:5-7605 [003] .... 987.481675: mem_cgroup_reparent_charges: usage: 1568768
>>>
>>> So it's much more likely this is a memcg bug rather than a cgroup bug.
>>> I hope memcg guys could look into it, or you could do a git-bisect if
>>> you can reliably reproduce the bug.
>>
>> I think there is a problem with ref counting and memcg.
>>
>> The old scheme would wait with the charge reparenting until all
>> references were gone for good, whereas the new scheme has only an RCU
>> grace period between disabling tryget and offlining the css.
>> Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
>> the tryget and the res_counter charge that would make it visible to
>> offline_css(), which means that there is a possible race condition
>> between cgroup teardown and an ongoing charge:
>>
>> #0: destroy                     #1: charge
>>
>>                                 rcu_read_lock()
>>                                 css_tryget()
>>                                 rcu_read_unlock()
>> disable tryget()
>> call_rcu()
>>   offline_css()
>>     reparent_charges()
>>                                 res_counter_charge()
>>                                 css_put()
>>                                   css_free()
>>                                 pc->mem_cgroup = deadcg
>>                                 add page to lru
>>
>> If the res_counter is hierarchical, there is now a leaked charge from
>> the dead group in the parent counter with no corresponding page on the
>> LRU, which will lead to this endless loop when deleting the parent.
>
> Hugh, I bet your problem is actually the same thing, where
> reparent_charges is looping and the workqueue is not actually stuck,
> it's just that waiting for the CPU callbacks is the longest thing the
> loop does, so it's most likely to show up in the trace.
>
>> The race window can be seconds if the res_counter hits its limit and
>> page reclaim is entered between css_tryget() and the res_counter
>> charge succeeding.
>>
>> I thought about calling reparent_charges() again from css_free() at
>> first to catch any raced charges.  But that won't work if the last
>> reference is actually put by the charger, because then it'll drop into
>> the loop before putting the page on the LRU.
>
> Actually, it *should* work after all, because the final css_put()
> schedules css_free() in a workqueue, which will just block until the
> charge finishes and lru-links the page.  Not the most elegant
> behavior, but hey, neither is livelocking!
>
> So how about this?

Thanks for the analysis and fix! But how can we fix the bug by adding a
call to mem_cgroup_reparent_charges() in css_free() while we are
dead-looping in css_offline()? We won't call css_free() if css_offline()
hasn't finished.
> ---
> From: Johannes Weiner
> Subject: [patch] mm: memcg: reparent charges during css_free()
>
> Signed-off-by: Johannes Weiner
> Cc: stable-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org # 3.8+

FYI, I don't think Markus and Google ever experienced this issue before 3.11.

> ---
>  mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cc4f9cbe760e..3dce2b50891c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6341,7 +6341,34 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> +	/*
> +	 * XXX: css_offline() would be where we should reparent all
> +	 * memory to prepare the cgroup for destruction.  However,
> +	 * memcg does not do css_tryget() and res_counter charging
> +	 * under the same RCU lock region, which means that charging
> +	 * could race with offlining, potentially leaking charges and
> +	 * sending out pages with stale cgroup pointers:
> +	 *
> +	 * #0                        #1
> +	 *                           rcu_read_lock()
> +	 *                           css_tryget()
> +	 *                           rcu_read_unlock()
> +	 * disable css_tryget()
> +	 * call_rcu()
> +	 *   offline_css()
> +	 *     reparent_charges()
> +	 *                           res_counter_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * We still reparent most charges in offline_css() simply
> +	 * because we don't want all these pages stuck if a long-term
> +	 * reference like a swap entry is holding on to the cgroup
> +	 * past offlining, but make sure we catch any raced charges:
> +	 */
> +	mem_cgroup_reparent_charges(memcg);
>  	memcg_destroy_kmem(memcg);
>  	__mem_cgroup_free(memcg);
>  }