From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f49.google.com (mail-qa0-f49.google.com [209.85.216.49]) by kanga.kvack.org (Postfix) with ESMTP id 06A546B0032 for ; Fri, 9 Jan 2015 12:43:23 -0500 (EST) Received: by mail-qa0-f49.google.com with SMTP id dc16so7945621qab.8 for ; Fri, 09 Jan 2015 09:43:22 -0800 (PST) Received: from service87.mimecast.com (service87.mimecast.com. [91.220.42.44]) by mx.google.com with ESMTP id u9si11991811qab.87.2015.01.09.09.43.21 for ; Fri, 09 Jan 2015 09:43:22 -0800 (PST) Message-ID: <54B01335.4060901@arm.com> Date: Fri, 09 Jan 2015 17:43:17 +0000 From: "Suzuki K. Poulose" MIME-Version: 1.0 Subject: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo , Johannes Weiner Cc: linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Hi We have hit a hang on ARM64 defconfig, while running LTP tests on=20 3.19-rc3. We are in the process of a git bisect and will update the results as and when we find the commit. During the ksm ltp run, the test hangs trying to mount memcg with the=20 following strace output: mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") =3D ? ERESTARTNOINTR= =20 (To be restarted) mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") =3D ? ERESTARTNOINTR= =20 (To be restarted) [ ... repeated forever ... ] At this point, one can try mounting the memcg to verify the problem. # mount -t cgroup -o memory memcg memcg_dir --hangs-- Strangely, if we run the mount command from a cold boot (i.e. without=20 running LTP first), then it succeeds. Upon a quick look we are hitting the following code : kernel/cgroup.c: cgroup_mount() : 1779 for_each_subsys(ss, i) { 1780 if (!(opts.subsys_mask & (1 << i)) || 1781 ss->root =3D=3D &cgrp_dfl_root) 1782 continue; 1783 1784 if=20 (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { 1785 mutex_unlock(&cgroup_mutex); 1786 msleep(10); 1787 ret =3D restart_syscall(); <=3D=3D=3D=3D=3D 1788 goto out_free; 1789 } 1790 cgroup_put(&ss->root->cgrp); 1791 } with ss->root->cgrp.self.refct.percpu_count_ptr =3D=3D __PERCPU_REF_ATOMIC_= DEAD Any ideas? Thanks Suzuki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f45.google.com (mail-qa0-f45.google.com [209.85.216.45]) by kanga.kvack.org (Postfix) with ESMTP id 350556B0032 for ; Fri, 9 Jan 2015 16:46:53 -0500 (EST) Received: by mail-qa0-f45.google.com with SMTP id f12so8874286qad.4 for ; Fri, 09 Jan 2015 13:46:53 -0800 (PST) Received: from mail-qc0-x22a.google.com (mail-qc0-x22a.google.com. [2607:f8b0:400d:c01::22a]) by mx.google.com with ESMTPS id 39si13898199qgp.25.2015.01.09.13.46.51 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 09 Jan 2015 13:46:52 -0800 (PST) Received: by mail-qc0-f170.google.com with SMTP id x3so11220698qcv.1 for ; Fri, 09 Jan 2015 13:46:51 -0800 (PST) Date: Fri, 9 Jan 2015 16:46:49 -0500 From: Tejun Heo Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150109214649.GF2785@htj.dyndns.org> References: <54B01335.4060901@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54B01335.4060901@arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Suzuki K. Poulose" Cc: Johannes Weiner , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote: > We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-rc3. > We are > in the process of a git bisect and will update the results as and > when we find the commit. > > During the ksm ltp run, the test hangs trying to mount memcg with the > following strace > output: > > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To > be restarted) > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To > be restarted) > [ ... repeated forever ... ] > > At this point, one can try mounting the memcg to verify the problem. > # mount -t cgroup -o memory memcg memcg_dir > --hangs-- > > Strangely, if we run the mount command from a cold boot (i.e. without > running LTP first), > then it succeeds. I don't know what LTP is doing and this could actually be hitting on an actual bug but if it's trying to move memcg back from unified hierarchy to an old one, that might hang - it should prolly made to just fail at that point. Anyways, any chance you can find out what happened, in terms of cgroup mounting, to memcg upto that point? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id B805B6B0032 for ; Sat, 10 Jan 2015 03:55:40 -0500 (EST) Received: by mail-pa0-f51.google.com with SMTP id ey11so22968355pad.10 for ; Sat, 10 Jan 2015 00:55:40 -0800 (PST) Received: from mx2.parallels.com (mx2.parallels.com. [199.115.105.18]) by mx.google.com with ESMTPS id yr3si16241504pbb.248.2015.01.10.00.55.38 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 10 Jan 2015 00:55:39 -0800 (PST) Date: Sat, 10 Jan 2015 11:55:25 +0300 From: Vladimir Davydov Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150110085525.GD2110@esperanza> References: <54B01335.4060901@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <54B01335.4060901@arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Suzuki K. Poulose" Cc: Tejun Heo , Johannes Weiner , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote: > Hi > > We have hit a hang on ARM64 defconfig, while running LTP tests on > 3.19-rc3. We are > in the process of a git bisect and will update the results as and > when we find the commit. > > During the ksm ltp run, the test hangs trying to mount memcg with > the following strace > output: > > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? > ERESTARTNOINTR (To be restarted) > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? > ERESTARTNOINTR (To be restarted) > [ ... repeated forever ... ] > > At this point, one can try mounting the memcg to verify the problem. > # mount -t cgroup -o memory memcg memcg_dir > --hangs-- > > Strangely, if we run the mount command from a cold boot (i.e. > without running LTP first), > then it succeeds. > > Upon a quick look we are hitting the following code : > kernel/cgroup.c: cgroup_mount() : > > 1779 for_each_subsys(ss, i) { > 1780 if (!(opts.subsys_mask & (1 << i)) || > 1781 ss->root == &cgrp_dfl_root) > 1782 continue; > 1783 > 1784 if > (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { > 1785 mutex_unlock(&cgroup_mutex); > 1786 msleep(10); > 1787 ret = restart_syscall(); <===== > 1788 goto out_free; > 1789 } > 1790 cgroup_put(&ss->root->cgrp); > 1791 } > > with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD > > Any ideas? The problem is that the memory cgroup controller takes a css reference per each charged page and does not reparent charged pages on css offline, while cgroup_mount/cgroup_kill_sb expect all css references to offline cgroups to be gone soon, restarting the syscall if the ref count != 0. As a result, if you create a memory cgroup, charge some page cache to it, and then remove it, unmount/mount will hang forever. May be, we should kill the ref counter to the memory controller root in cgroup_kill_sb only if there is no children at all, neither online nor offline. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f171.google.com (mail-qc0-f171.google.com [209.85.216.171]) by kanga.kvack.org (Postfix) with ESMTP id CD7C46B0032 for ; Sat, 10 Jan 2015 16:43:21 -0500 (EST) Received: by mail-qc0-f171.google.com with SMTP id r5so13853396qcx.2 for ; Sat, 10 Jan 2015 13:43:21 -0800 (PST) Received: from mail-qg0-x22f.google.com (mail-qg0-x22f.google.com. [2607:f8b0:400d:c04::22f]) by mx.google.com with ESMTPS id ep4si17231848qcb.44.2015.01.10.13.43.20 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Sat, 10 Jan 2015 13:43:20 -0800 (PST) Received: by mail-qg0-f47.google.com with SMTP id q108so13502941qgd.6 for ; Sat, 10 Jan 2015 13:43:20 -0800 (PST) Date: Sat, 10 Jan 2015 16:43:16 -0500 From: Tejun Heo Subject: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150110214316.GF25319@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150110085525.GD2110@esperanza> Sender: owner-linux-mm@kvack.org List-ID: To: Vladimir Davydov Cc: "Suzuki K. Poulose" , Johannes Weiner , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Currently, if a hierarchy doesn't have any live children when it's unmounted, the hierarchy starts dying by killing its refcnt. The expectation is that even if there are lingering dead children which are lingering due to remaining references, they'll be put in a finite amount of time. When the children are finally released, the hierarchy is destroyed and all controllers bound to it also are released. However, for memcg, the premise that the lingering refs will be put in a finite amount time is not true. In the absense of memory pressure, dead memcg's may hang around indefinitely pinned by its pages. This unfortunately may lead to indefinite hang on the next mount attempt involving memcg as the mount logic waits for it to get released. While we can change hierarchy destruction logic such that a hierarchy is only destroyed when it's not mounted anywhere and all its children, live or dead, are gone, this makes whether the hierarchy gets destroyed or not to be determined by factors opaque to userland. Userland may or may not get a new hierarchy on the next mount attempt. Worse, if it explicitly wants to create a new hierarchy with different options or controller compositions involving memcg, it will fail in an essentially arbitrary manner. We want to guarantee that a hierarchy is destroyed once the conditions, unmounted and no visible children, are met. To aid it, this patch introduces a new callback cgroup_subsys->unbind() which is invoked right before the hierarchy a subsystem is bound to starts dying. memcg can implement this callback and initiate draining of remaining refs so that the hierarchy can eventually be released in a finite amount of time. Signed-off-by: Tejun Heo Cc: Li Zefan Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov --- Hello, > May be, we should kill the ref counter to the memory controller root in > cgroup_kill_sb only if there is no children at all, neither online nor > offline. Ah, thanks for the analysis, but I really wanna avoid making hierarchy destruction conditions opaque to userland. This is userland visible behavior. It shouldn't be determined by kernel internals invisible outside. This patch adds ss->unbind() which memcg can hook into to kick off draining of residual refs. If this would work, I'll add this patch to cgroup/for-3.19-fixes, possibly with stable cc'd. Thanks. Documentation/cgroups/cgroups.txt | 12 +++++++++++- include/linux/cgroup.h | 1 + kernel/cgroup.c | 14 ++++++++++++-- 3 files changed, 24 insertions(+), 3 deletions(-) --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -637,7 +637,7 @@ void exit(struct task_struct *task) Called during task exit. -void bind(struct cgroup *root) +void bind(struct cgroup_subsys_state *root_css) (cgroup_mutex held by caller) Called when a cgroup subsystem is rebound to a different hierarchy @@ -645,6 +645,16 @@ and root cgroup. Currently this will onl the default hierarchy (which never has sub-cgroups) and a hierarchy that is being created/destroyed (and hence has no sub-cgroups). +void unbind(struct cgroup_subsys_state *root_css) +(cgroup_mutex held by caller) + +Called when a cgroup subsys is unbound from its current hierarchy. The +subsystem must guarantee that all offline cgroups are going to be +released in a finite amount of time after this function is called. +Such draining can be asynchronous. The following binding of the +subsystem to a hierarchy will be delayed till the draining is +complete. + 4. Extended attribute usage =========================== --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -654,6 +654,7 @@ struct cgroup_subsys { struct cgroup_subsys_state *old_css, struct task_struct *task); void (*bind)(struct cgroup_subsys_state *root_css); + void (*unbind)(struct cgroup_subsys_state *root_css); int disabled; int early_init; --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1910,10 +1910,20 @@ static void cgroup_kill_sb(struct super_ * And don't kill the default root. */ if (css_has_online_children(&root->cgrp.self) || - root == &cgrp_dfl_root) + root == &cgrp_dfl_root) { cgroup_put(&root->cgrp); - else + } else { + struct cgroup_subsys *ss; + int i; + + mutex_lock(&cgroup_mutex); + for_each_subsys(ss, i) + if ((root->subsys_mask & (1UL << i)) && ss->unbind) + ss->unbind(cgroup_css(&root->cgrp, ss)); + mutex_unlock(&cgroup_mutex); + percpu_ref_kill(&root->cgrp.self.refcnt); + } kernfs_kill_sb(sb); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f182.google.com (mail-we0-f182.google.com [74.125.82.182]) by kanga.kvack.org (Postfix) with ESMTP id 9D7B86B0032 for ; Sun, 11 Jan 2015 15:55:58 -0500 (EST) Received: by mail-we0-f182.google.com with SMTP id w62so16021002wes.13 for ; Sun, 11 Jan 2015 12:55:58 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id hs6si31871436wjb.68.2015.01.11.12.55.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 11 Jan 2015 12:55:57 -0800 (PST) Date: Sun, 11 Jan 2015 15:55:43 -0500 From: Johannes Weiner Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150111205543.GA5480@phnom.home.cmpxchg.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150110214316.GF25319@htj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: Vladimir Davydov , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: > Currently, if a hierarchy doesn't have any live children when it's > unmounted, the hierarchy starts dying by killing its refcnt. The > expectation is that even if there are lingering dead children which > are lingering due to remaining references, they'll be put in a finite > amount of time. When the children are finally released, the hierarchy > is destroyed and all controllers bound to it also are released. > > However, for memcg, the premise that the lingering refs will be put in > a finite amount time is not true. In the absense of memory pressure, > dead memcg's may hang around indefinitely pinned by its pages. This > unfortunately may lead to indefinite hang on the next mount attempt > involving memcg as the mount logic waits for it to get released. > > While we can change hierarchy destruction logic such that a hierarchy > is only destroyed when it's not mounted anywhere and all its children, > live or dead, are gone, this makes whether the hierarchy gets > destroyed or not to be determined by factors opaque to userland. > Userland may or may not get a new hierarchy on the next mount attempt. > Worse, if it explicitly wants to create a new hierarchy with different > options or controller compositions involving memcg, it will fail in an > essentially arbitrary manner. > > We want to guarantee that a hierarchy is destroyed once the > conditions, unmounted and no visible children, are met. To aid it, > this patch introduces a new callback cgroup_subsys->unbind() which is > invoked right before the hierarchy a subsystem is bound to starts > dying. memcg can implement this callback and initiate draining of > remaining refs so that the hierarchy can eventually be released in a > finite amount of time. > > Signed-off-by: Tejun Heo > Cc: Li Zefan > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vladimir Davydov > --- > Hello, > > > May be, we should kill the ref counter to the memory controller root in > > cgroup_kill_sb only if there is no children at all, neither online nor > > offline. > > Ah, thanks for the analysis, but I really wanna avoid making hierarchy > destruction conditions opaque to userland. This is userland visible > behavior. It shouldn't be determined by kernel internals invisible > outside. This patch adds ss->unbind() which memcg can hook into to > kick off draining of residual refs. If this would work, I'll add this > patch to cgroup/for-3.19-fixes, possibly with stable cc'd. How about this ->unbind() for memcg? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f181.google.com (mail-pd0-f181.google.com [209.85.192.181]) by kanga.kvack.org (Postfix) with ESMTP id C1E126B0032 for ; Mon, 12 Jan 2015 03:01:25 -0500 (EST) Received: by mail-pd0-f181.google.com with SMTP id v10so29106897pde.12 for ; Mon, 12 Jan 2015 00:01:25 -0800 (PST) Received: from mx2.parallels.com (mx2.parallels.com. [199.115.105.18]) by mx.google.com with ESMTPS id va1si22220596pbc.211.2015.01.12.00.01.23 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 12 Jan 2015 00:01:24 -0800 (PST) Date: Mon, 12 Jan 2015 11:01:14 +0300 From: Vladimir Davydov Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150112080114.GE2110@esperanza> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20150111205543.GA5480@phnom.home.cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Tejun Heo , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon On Sun, Jan 11, 2015 at 03:55:43PM -0500, Johannes Weiner wrote: > On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: > > > May be, we should kill the ref counter to the memory controller root in > > > cgroup_kill_sb only if there is no children at all, neither online nor > > > offline. > > > > Ah, thanks for the analysis, but I really wanna avoid making hierarchy > > destruction conditions opaque to userland. This is userland visible > > behavior. It shouldn't be determined by kernel internals invisible > > outside. This patch adds ss->unbind() which memcg can hook into to > > kick off draining of residual refs. If this would work, I'll add this > > patch to cgroup/for-3.19-fixes, possibly with stable cc'd. > > How about this ->unbind() for memcg? > > From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Sun, 11 Jan 2015 10:29:05 -0500 > Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during > unbind > > Cgroup core assumes that any outstanding css references after > offlining are temporary in nature, and e.g. mount waits for them to > disappear and release the root cgroup. But leftover page cache and > swapout records in an offlined memcg are only dropped when the pages > get reclaimed under pressure or the swapped out pages get faulted in > from other cgroups, and so those cgroup operations can hang forever. > > Implement the ->unbind() callback to actively get rid of outstanding > references when cgroup core wants them gone. Swap out records are > deleted, such that the swap-in path will charge those pages to the > faulting task. Page cache pages are moved to the root memory cgroup. ... and kmem pages are ignored. I reckon we could reparent them (I submitted the patch set some time ago), but that's going to be tricky and will complicate regular kmem charge/uncharge paths, as well as list_lru_add/del. I don't think we can put up with it, provided we only want reparenting on unmount, do we not? Come to think of it, I wonder how many users actually want to mount different controllers subset after unmount. Because we could allow mounting the same subset perfectly well, even if it includes memcg. BTW, AFAIU in the unified hierarchy we won't have this problem at all, because by definition it mounts all controllers IIRC, so do we need to bother fixing this in such a complicated manner at all for the setup that's going to be deprecated anyway? Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f46.google.com (mail-qa0-f46.google.com [209.85.216.46]) by kanga.kvack.org (Postfix) with ESMTP id 609156B0032 for ; Mon, 12 Jan 2015 06:28:50 -0500 (EST) Received: by mail-qa0-f46.google.com with SMTP id j7so7193008qaq.5 for ; Mon, 12 Jan 2015 03:28:50 -0800 (PST) Received: from mail-qa0-x22c.google.com (mail-qa0-x22c.google.com. [2607:f8b0:400d:c00::22c]) by mx.google.com with ESMTPS id q16si22042294qan.37.2015.01.12.03.28.49 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 12 Jan 2015 03:28:49 -0800 (PST) Received: by mail-qa0-f44.google.com with SMTP id w8so7174396qac.3 for ; Mon, 12 Jan 2015 03:28:48 -0800 (PST) Date: Mon, 12 Jan 2015 06:28:45 -0500 From: Tejun Heo Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150112112845.GS25319@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> <20150112080114.GE2110@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150112080114.GE2110@esperanza> Sender: owner-linux-mm@kvack.org List-ID: To: Vladimir Davydov Cc: Johannes Weiner , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Hello, Vladimir. On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote: > Come to think of it, I wonder how many users actually want to mount > different controllers subset after unmount. Because we could allow It wouldn't be a common use case but, on the face of it, we still support it. If we collecctively decide that once a sub cgroup is created for any controller no further hierarchy configuration for that controller is allowed, that'd work too, but one way or the other, the behavior, I believe, should be well-defined. As it currently stands, the conditions and failure mode are opaque to userland, which is never a good thing. > mounting the same subset perfectly well, even if it includes memcg. BTW, > AFAIU in the unified hierarchy we won't have this problem at all, > because by definition it mounts all controllers IIRC, so do we need to > bother fixing this in such a complicated manner at all for the setup > that's going to be deprecated anyway? There will likely be a quite long transition period and if and when the old things can be removed, this added cleanup logic can go away with it. It depends on how complex the implementation would get but as long as it isn't too much and stays mostly isolated from the saner paths, I think it's probably the right thing to do. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f174.google.com (mail-pd0-f174.google.com [209.85.192.174]) by kanga.kvack.org (Postfix) with ESMTP id 18EA06B006E for ; Mon, 12 Jan 2015 08:00:07 -0500 (EST) Received: by mail-pd0-f174.google.com with SMTP id fp1so30325182pdb.5 for ; Mon, 12 Jan 2015 05:00:06 -0800 (PST) Received: from mx2.parallels.com (mx2.parallels.com. [199.115.105.18]) by mx.google.com with ESMTPS id ti8si23331879pbc.26.2015.01.12.05.00.04 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 12 Jan 2015 05:00:05 -0800 (PST) Date: Mon, 12 Jan 2015 15:59:56 +0300 From: Vladimir Davydov Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150112125956.GF2110@esperanza> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> <20150112080114.GE2110@esperanza> <20150112112845.GS25319@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20150112112845.GS25319@htj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: Johannes Weiner , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon On Mon, Jan 12, 2015 at 06:28:45AM -0500, Tejun Heo wrote: > On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote: > > Come to think of it, I wonder how many users actually want to mount > > different controllers subset after unmount. Because we could allow > > It wouldn't be a common use case but, on the face of it, we still > support it. If we collecctively decide that once a sub cgroup is > created for any controller no further hierarchy configuration for that > controller is allowed, that'd work too, but one way or the other, the > behavior, I believe, should be well-defined. As it currently stands, > the conditions and failure mode are opaque to userland, which is never > a good thing. > > > mounting the same subset perfectly well, even if it includes memcg. BTW, > > AFAIU in the unified hierarchy we won't have this problem at all, > > because by definition it mounts all controllers IIRC, so do we need to > > bother fixing this in such a complicated manner at all for the setup > > that's going to be deprecated anyway? > > There will likely be a quite long transition period and if and when > the old things can be removed, this added cleanup logic can go away > with it. It depends on how complex the implementation would get but > as long as it isn't too much and stays mostly isolated from the saner > paths, I think it's probably the right thing to do. We can't just move kmem objects from a per-memcg kmem_cache to the global one fixing page counters, because in contrast to page cache and swap we don't even track all kmem allocations. So we have to keep all per-memcg kmem_cache's somewhere after unmount until they can finally be destroyed, but the whole logic behind per-memcg kmem_cache's destruction is currently tightly interwoven with that of css's (we destroy kmem_cache's from css_free), and there won't be any css's after unmount. That said, it isn't possible to add a couple of isolated functions, which will live their own lives and can be easily removed once we've switched to the unified hierarchy. Quite the contrary, implementing of kmem reparenting would make me rethink and complicate kmemcg code all over the place. That's why I'm rather reluctant to do it. I haven't dug deep into the cgroup core, but may be we could detach the old root in cgroup_kill_sb() and leave it dangling until the last reference to it has gone? BTW, IIRC the problem always existed for kmem-active memory cgroups, because we never had kmem reparenting. May be, we could therefore just document somewhere that kmem accounting is highly discouraged to be used in the legacy hierarchy and merge these two patches as is to handle page cache and swap charges? We won't break anything, because it was always broken :-) Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f41.google.com (mail-qa0-f41.google.com [209.85.216.41]) by kanga.kvack.org (Postfix) with ESMTP id E7E9A6B006E for ; Mon, 12 Jan 2015 08:05:09 -0500 (EST) Received: by mail-qa0-f41.google.com with SMTP id bm13so7469969qab.0 for ; Mon, 12 Jan 2015 05:05:09 -0800 (PST) Received: from mail-qg0-x229.google.com (mail-qg0-x229.google.com. [2607:f8b0:400d:c04::229]) by mx.google.com with ESMTPS id r6si22234986qay.111.2015.01.12.05.05.08 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 12 Jan 2015 05:05:09 -0800 (PST) Received: by mail-qg0-f41.google.com with SMTP id e89so17337438qgf.0 for ; Mon, 12 Jan 2015 05:05:08 -0800 (PST) Date: Mon, 12 Jan 2015 08:05:02 -0500 From: Tejun Heo Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150112130502.GV25319@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> <20150112080114.GE2110@esperanza> <20150112112845.GS25319@htj.dyndns.org> <20150112125956.GF2110@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150112125956.GF2110@esperanza> Sender: owner-linux-mm@kvack.org List-ID: To: Vladimir Davydov Cc: Johannes Weiner , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon On Mon, Jan 12, 2015 at 03:59:56PM +0300, Vladimir Davydov wrote: > I haven't dug deep into the cgroup core, but may be we could detach the > old root in cgroup_kill_sb() and leave it dangling until the last > reference to it has gone? The root isn't the problem here. Individual controllers are as there's only one copy of each and we most likely don't want to carry over child csses from one hierarchy to the next as the controller may operate under a different set of rules. > BTW, IIRC the problem always existed for kmem-active memory cgroups, > because we never had kmem reparenting. May be, we could therefore just > document somewhere that kmem accounting is highly discouraged to be used > in the legacy hierarchy and merge these two patches as is to handle page > cache and swap charges? We won't break anything, because it was always > broken :-) If we're going that route, I think it'd be better to declare hierarchy lifetime rules as essentially opaque to userland and destroy hierarchies only when all its children, dead or alive, are gone. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f177.google.com (mail-qc0-f177.google.com [209.85.216.177]) by kanga.kvack.org (Postfix) with ESMTP id CFAC36B0032 for ; Mon, 12 Jan 2015 12:02:18 -0500 (EST) Received: by mail-qc0-f177.google.com with SMTP id x3so18869556qcv.8 for ; Mon, 12 Jan 2015 09:02:17 -0800 (PST) Received: from service88.mimecast.com (service88.mimecast.com. [195.130.217.12]) by mx.google.com with ESMTP id z20si23243645qax.23.2015.01.12.09.02.16 for ; Mon, 12 Jan 2015 09:02:16 -0800 (PST) Date: Mon, 12 Jan 2015 17:02:11 +0000 From: "Suzuki K. Poulose" Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150112170210.GA1288@e106634-lin.cambridge.arm.com> References: <54B01335.4060901@arm.com> <20150109214649.GF2785@htj.dyndns.org> MIME-Version: 1.0 In-Reply-To: <20150109214649.GF2785@htj.dyndns.org> Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-kernel@vger.kernel.org, vdavydov@parallels.com, hannes@cmpxchg.org, Will.Deacon@arm.com, linux-mm@kvack.org, suzuki.poulose@arm.com On Fri, Jan 09, 2015 at 09:46:49PM +0000, Tejun Heo wrote: > On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote: > > We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-= rc3. > > We are > > in the process of a git bisect and will update the results as and > > when we find the commit. > > > > During the ksm ltp run, the test hangs trying to mount memcg with the > > following strace > > output: > > > > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") =3D ? ERESTARTNOIN= TR (To > > be restarted) > > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") =3D ? ERESTARTNOIN= TR (To > > be restarted) > > [ ... repeated forever ... ] > > > > At this point, one can try mounting the memcg to verify the problem. > > # mount -t cgroup -o memory memcg memcg_dir > > --hangs-- > > > > Strangely, if we run the mount command from a cold boot (i.e. without > > running LTP first), > > then it succeeds. > > I don't know what LTP is doing and this could actually be hitting on > an actual bug but if it's trying to move memcg back from unified > hierarchy to an old one, that might hang - it should prolly made to > just fail at that point. Anyways, any chance you can find out what > happened, in terms of cgroup mounting, to memcg upto that point? > This is what the test(ksm03) does, roughly from strace : faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/", F_OK) =3D 0 faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/merge_across_nodes", F_OK) =3D -1 E= NOENT (No such file or directory) mkdirat(AT_FDCWD, "/dev/cgroup", 0777) =3D 0 mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") =3D 0 --- set memory limit. Create a new set /dev/cgroups/1 and moves test to tha= t group --- mkdirat(AT_FDCWD, "/dev/cgroup/1", 0777) =3D 0 openat(AT_FDCWD, "/dev/cgroup/1/memory.limit_in_bytes", O_WRONLY|O_CREAT|O_= TRUNC, 0666) =3D 3 fstat(3, {st_dev=3Dmakedev(0, 24), st_ino=3D41, st_mode=3DS_IFREG|0644, st_= nlink=3D1, st_uid=3D0, st_gid=3D0, st_blksize=3D4096, st_blocks=3D0, st_siz= e=3D0, st_atime=3D2015/01/12-15:10:13, st_mtime=3D2015/01/12-15:10:13, st_c= time=3D2015/01/12-15:10:13}) =3D 0 mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = =3D 0x7fb2903000 write(3, "1073741824", 10) =3D 10 close(3) =3D 0 munmap(0x7fb2903000, 65536) =3D 0 getpid() =3D 1324 openat(AT_FDCWD, "/dev/cgroup/1/tasks", O_WRONLY|O_CREAT|O_TRUNC, 0666) =3D= 3 fstat(3, {st_dev=3Dmakedev(0, 24), st_ino=3D37, st_mode=3DS_IFREG|0644, st_= nlink=3D1, st_uid=3D0, st_gid=3D0, st_blksize=3D4096, st_blocks=3D0, st_siz= e=3D0, st_atime=3D2015/01/12-15:10:13, st_mtime=3D2015/01/12-15:10:13, st_c= time=3D2015/01/12-15:10:13}) =3D 0 mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = =3D 0x7fb2903000 write(3, "1324", 4) =3D 4 close(3) =3D 0 munmap(0x7fb2903000, 65536) =3D 0 clone(child_stack=3D0, flags=3DCLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGC= HLD, child_tidptr=3D0x7fb2a7f0d0) =3D 1325 clone(child_stack=3D0, flags=3DCLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGC= HLD, child_tidptr=3D0x7fb2a7f0d0) =3D 1326 clone(child_stack=3D0, flags=3DCLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGC= HLD, child_tidptr=3D0x7fb2a7f0d0) =3D 1327 --- Creates 3 children, perform a lot of memory operations with shared page= s verify the ksm for activity and wait for children to exit --- wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) =3D=3D 0}], WSTOPPED|WCONTINUED,= NULL) =3D 1325 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) =3D=3D 0}], WSTOPPED|WCONTINUED,= NULL) =3D 1326 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) =3D=3D 0}], WSTOPPED|WCONTINUED,= NULL) =3D 1327 wait4(-1, 0x7fe5625f3c, WSTOPPED|WCONTINUED, NULL) =3D -1 ECHILD (No child = processes) --- cleanup: Move tasks under /dev/cgroups/1/ to /dev/cgroups/ and delete s= ubdir, umount cgroup --- faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/merge_across_nodes", F_OK) =3D -1 E= NOENT (No such file or directory) openat(AT_FDCWD, "/dev/cgroup/tasks", O_WRONLY) =3D 205 openat(AT_FDCWD, "/dev/cgroup/1/tasks", O_RDONLY) =3D 206 fstat(206, {st_dev=3Dmakedev(0, 24), st_ino=3D37, st_mode=3DS_IFREG|0644, s= t_nlink=3D1, st_uid=3D0, st_gid=3D0, st_blksize=3D4096, st_blocks=3D0, st_s= ize=3D0, st_atime=3D2015/01/12-15:10:13, st_mtime=3D2015/01/12-15:10:13, st= _ctime=3D2015/01/12-15:10:13}) =3D 0 mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = =3D 0x7fb1c53000 read(206, "1324\n", 4096) =3D 5 write(205, "1324", 4) =3D 4 read(206, "", 4096) =3D 0 close(205) =3D 0 close(206) =3D 0 munmap(0x7fb1c53000, 65536) =3D 0 unlinkat(AT_FDCWD, "/dev/cgroup/1", AT_REMOVEDIR) =3D 0 umount2("/dev/cgroup", 0) =3D 0 unlinkat(AT_FDCWD, "/dev/cgroup", AT_REMOVEDIR) =3D 0 exit_group(0) =3D ? The next invocation of the same test fails to mount the cgroup memory. Thanks Suzuki > Thanks. > > -- > tejun > -- IMPORTANT NOTICE: The contents of this email and any attachments are con= fidential and may also be privileged. If you are not the intended recipient= , please notify the sender immediately and do not disclose the contents to = any other person, use it for any purpose, or store or copy the information = in any medium. Thank you. ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Regist= ered in England & Wales, Company No: 2557590 ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, R= egistered in England & Wales, Company No: 2548782 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f182.google.com (mail-qc0-f182.google.com [209.85.216.182]) by kanga.kvack.org (Postfix) with ESMTP id 26EC26B0032 for ; Wed, 14 Jan 2015 06:16:35 -0500 (EST) Received: by mail-qc0-f182.google.com with SMTP id l6so2586543qcy.13 for ; Wed, 14 Jan 2015 03:16:34 -0800 (PST) Received: from service87.mimecast.com (service87.mimecast.com. [91.220.42.44]) by mx.google.com with ESMTP id a5si30653781qat.0.2015.01.14.03.16.33 for ; Wed, 14 Jan 2015 03:16:33 -0800 (PST) Message-ID: <54B6500D.6080206@arm.com> Date: Wed, 14 Jan 2015 11:16:29 +0000 From: "Suzuki K. Poulose" MIME-Version: 1.0 Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> In-Reply-To: <20150111205543.GA5480@phnom.home.cmpxchg.org> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner , Tejun Heo Cc: Vladimir Davydov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , Will Deacon On 11/01/15 20:55, Johannes Weiner wrote: > On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: >> Currently, if a hierarchy doesn't have any live children when it's >> unmounted, the hierarchy starts dying by killing its refcnt. The >> expectation is that even if there are lingering dead children which >> are lingering due to remaining references, they'll be put in a finite >> amount of time. When the children are finally released, the hierarchy >> is destroyed and all controllers bound to it also are released. >> >> However, for memcg, the premise that the lingering refs will be put in >> a finite amount time is not true. In the absense of memory pressure, >> dead memcg's may hang around indefinitely pinned by its pages. This >> unfortunately may lead to indefinite hang on the next mount attempt >> involving memcg as the mount logic waits for it to get released. >> >> While we can change hierarchy destruction logic such that a hierarchy >> is only destroyed when it's not mounted anywhere and all its children, >> live or dead, are gone, this makes whether the hierarchy gets >> destroyed or not to be determined by factors opaque to userland. >> Userland may or may not get a new hierarchy on the next mount attempt. >> Worse, if it explicitly wants to create a new hierarchy with different >> options or controller compositions involving memcg, it will fail in an >> essentially arbitrary manner. >> >> We want to guarantee that a hierarchy is destroyed once the >> conditions, unmounted and no visible children, are met. To aid it, >> this patch introduces a new callback cgroup_subsys->unbind() which is >> invoked right before the hierarchy a subsystem is bound to starts >> dying. memcg can implement this callback and initiate draining of >> remaining refs so that the hierarchy can eventually be released in a >> finite amount of time. >> >> Signed-off-by: Tejun Heo >> Cc: Li Zefan >> Cc: Johannes Weiner >> Cc: Michal Hocko >> Cc: Vladimir Davydov >> --- >> Hello, >> >>> May be, we should kill the ref counter to the memory controller root in >>> cgroup_kill_sb only if there is no children at all, neither online nor >>> offline. >> >> Ah, thanks for the analysis, but I really wanna avoid making hierarchy >> destruction conditions opaque to userland. This is userland visible >> behavior. It shouldn't be determined by kernel internals invisible >> outside. This patch adds ss->unbind() which memcg can hook into to >> kick off draining of residual refs. If this would work, I'll add this >> patch to cgroup/for-3.19-fixes, possibly with stable cc'd. > > How about this ->unbind() for memcg? > > From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Sun, 11 Jan 2015 10:29:05 -0500 > Subject: [patch] mm: memcontrol: zap outstanding cache/swap references du= ring > unbind > This patch doesn't cleanly apply on 3.19-rc4 for me (hunks in=20 mm/memcontrol.c). I have manually applied it. With these two patches in, I am still getting the failure. Also, the=20 kworker thread is taking up 100% time (unbind_work) and continues to do=20 so even after 6minutes. Is there something I missed ? Thanks Suzuki > Cgroup core assumes that any outstanding css references after > offlining are temporary in nature, and e.g. mount waits for them to > disappear and release the root cgroup. But leftover page cache and > swapout records in an offlined memcg are only dropped when the pages > get reclaimed under pressure or the swapped out pages get faulted in > from other cgroups, and so those cgroup operations can hang forever. > > Implement the ->unbind() callback to actively get rid of outstanding > references when cgroup core wants them gone. Swap out records are > deleted, such that the swap-in path will charge those pages to the > faulting task. Page cache pages are moved to the root memory cgroup. > > Signed-off-by: Johannes Weiner > --- > include/linux/swap_cgroup.h | 6 +++ > mm/memcontrol.c | 126 +++++++++++++++++++++++++++++++++++++= +++++++ > mm/swap_cgroup.c | 38 +++++++++++++ > 3 files changed, 170 insertions(+) > > diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h > index 145306bdc92f..ffe0866d2997 100644 > --- a/include/linux/swap_cgroup.h > +++ b/include/linux/swap_cgroup.h > @@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent= , > =09=09=09=09=09unsigned short old, unsigned short new); > extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned shor= t id); > extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); > +extern unsigned long swap_cgroup_zap_records(unsigned short id); > extern int swap_cgroup_swapon(int type, unsigned long max_pages); > extern void swap_cgroup_swapoff(int type); > > @@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) > =09return 0; > } > > +static inline unsigned long swap_cgroup_zap_records(unsigned short id) > +{ > +=09return 0; > +} > + > static inline int > swap_cgroup_swapon(int type, unsigned long max_pages) > { > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 692e96407627..40c426add613 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5197,6 +5197,131 @@ static void mem_cgroup_bind(struct cgroup_subsys_= state *root_css) > =09=09mem_cgroup_from_css(root_css)->use_hierarchy =3D true; > } > > +static void unbind_lru_list(struct mem_cgroup *memcg, > +=09=09=09 struct zone *zone, enum lru_list lru) > +{ > +=09struct lruvec *lruvec =3D mem_cgroup_zone_lruvec(zone, memcg); > +=09struct list_head *list =3D &lruvec->lists[lru]; > + > +=09while (!list_empty(list)) { > +=09=09unsigned int nr_pages; > +=09=09unsigned long flags; > +=09=09struct page *page; > + > +=09=09spin_lock_irqsave(&zone->lru_lock, flags); > +=09=09if (list_empty(list)) { > +=09=09=09spin_unlock_irqrestore(&zone->lru_lock, flags); > +=09=09=09break; > +=09=09} > +=09=09page =3D list_last_entry(list, struct page, lru); > +=09=09if (!get_page_unless_zero(page)) { > +=09=09=09list_move(&page->lru, list); > +=09=09=09spin_unlock_irqrestore(&zone->lru_lock, flags); > +=09=09=09continue; > +=09=09} > +=09=09BUG_ON(!PageLRU(page)); > +=09=09ClearPageLRU(page); > +=09=09del_page_from_lru_list(page, lruvec, lru); > +=09=09spin_unlock_irqrestore(&zone->lru_lock, flags); > + > +=09=09compound_lock(page); > +=09=09nr_pages =3D hpage_nr_pages(page); > + > +=09=09if (!mem_cgroup_move_account(page, nr_pages, > +=09=09=09=09=09 memcg, root_mem_cgroup)) { > +=09=09=09/* > +=09=09=09 * root_mem_cgroup page counters are not used, > +=09=09=09 * otherwise we'd have to charge them here. > +=09=09=09 */ > +=09=09=09page_counter_uncharge(&memcg->memory, nr_pages); > +=09=09=09if (do_swap_account) > +=09=09=09=09page_counter_uncharge(&memcg->memsw, nr_pages); > +=09=09=09css_put_many(&memcg->css, nr_pages); > +=09=09} > + > +=09=09compound_unlock(page); > + > +=09=09putback_lru_page(page); > +=09} > +} > + > +static void unbind_work_fn(struct work_struct *work) > +{ > +=09struct cgroup_subsys_state *css; > +retry: > +=09drain_all_stock(root_mem_cgroup); > + > +=09rcu_read_lock(); > +=09css_for_each_child(css, &root_mem_cgroup->css) { > +=09=09struct mem_cgroup *memcg =3D mem_cgroup_from_css(css); > + > +=09=09/* Drop references from swap-out records */ > +=09=09if (do_swap_account) { > +=09=09=09long zapped; > + > +=09=09=09zapped =3D swap_cgroup_zap_records(memcg->css.id); > +=09=09=09page_counter_uncharge(&memcg->memsw, zapped); > +=09=09=09css_put_many(&memcg->css, zapped); > +=09=09} > + > +=09=09/* Drop references from leftover LRU pages */ > +=09=09css_get(css); > +=09=09rcu_read_unlock(); > +=09=09atomic_inc(&memcg->moving_account); > +=09=09synchronize_rcu(); > +=09=09while (page_counter_read(&memcg->memory) - > +=09=09 page_counter_read(&memcg->kmem) > 0) { > +=09=09=09struct zone *zone; > +=09=09=09enum lru_list lru; > + > +=09=09=09lru_add_drain_all(); > + > +=09=09=09for_each_zone(zone) > +=09=09=09=09for_each_lru(lru) > +=09=09=09=09=09unbind_lru_list(memcg, zone, lru); > + > +=09=09=09cond_resched(); > +=09=09} > +=09=09atomic_dec(&memcg->moving_account); > +=09=09rcu_read_lock(); > +=09=09css_put(css); > +=09} > +=09rcu_read_unlock(); > +=09/* > +=09 * Swap-in is racy: > +=09 * > +=09 * #0 #1 > +=09 * lookup_swap_cgroup_id() > +=09 * rcu_read_lock() > +=09 * mem_cgroup_lookup() > +=09 * css_tryget_online() > +=09 * rcu_read_unlock() > +=09 * cgroup_kill_sb() > +=09 * !css_has_online_children() > +=09 * ->unbind() > +=09 * page_counter_try_charge() > +=09 * css_put() > +=09 * css_free() > +=09 * pc->mem_cgroup =3D dead memcg > +=09 * add page to lru > +=09 * > +=09 * Loop until until all references established from previously > +=09 * existing swap-out records have been transferred to pages on > +=09 * the LRU and then uncharged from there. > +=09 */ > +=09if (!list_empty(&root_mem_cgroup->css.children)) { > +=09=09msleep(10); > +=09=09goto retry; > +=09} > +} > + > +static DECLARE_WORK(unbind_work, unbind_work_fn); > + > +static void mem_cgroup_unbind(struct cgroup_subsys_state *root_css) > +{ > +=09schedule_work(&unbind_work); > +} > + > static u64 memory_current_read(struct cgroup_subsys_state *css, > =09=09=09 struct cftype *cft) > { > @@ -5360,6 +5485,7 @@ struct cgroup_subsys memory_cgrp_subsys =3D { > =09.cancel_attach =3D mem_cgroup_cancel_attach, > =09.attach =3D mem_cgroup_move_task, > =09.bind =3D mem_cgroup_bind, > +=09.unbind =3D mem_cgroup_unbind, > =09.dfl_cftypes =3D memory_files, > =09.legacy_cftypes =3D mem_cgroup_legacy_files, > =09.early_init =3D 0, > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c > index b5f7f24b8dd1..665923a558c4 100644 > --- a/mm/swap_cgroup.c > +++ b/mm/swap_cgroup.c > @@ -140,6 +140,44 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent= ) > =09return lookup_swap_cgroup(ent, NULL)->id; > } > > +/** > + * swap_cgroup_zap_records - delete all swapout records of one cgroup > + * @id: memcg id > + * > + * Returns the number of deleted records. > + */ > +unsigned long swap_cgroup_zap_records(unsigned short id) > +{ > +=09unsigned long zapped =3D 0; > +=09unsigned int type; > + > +=09for (type =3D 0; type < MAX_SWAPFILES; type++) { > +=09=09struct swap_cgroup_ctrl *ctrl; > +=09=09unsigned long flags; > +=09=09unsigned int page; > + > +=09=09ctrl =3D &swap_cgroup_ctrl[type]; > +=09=09spin_lock_irqsave(&ctrl->lock, flags); > +=09=09for (page =3D 0; page < ctrl->length; page++) { > +=09=09=09struct swap_cgroup *base; > +=09=09=09pgoff_t offset; > + > +=09=09=09base =3D page_address(ctrl->map[page]); > +=09=09=09for (offset =3D 0; offset < SC_PER_PAGE; offset++) { > +=09=09=09=09struct swap_cgroup *sc; > + > +=09=09=09=09sc =3D base + offset; > +=09=09=09=09if (sc->id =3D=3D id) { > +=09=09=09=09=09sc->id =3D 0; > +=09=09=09=09=09zapped++; > +=09=09=09=09} > +=09=09=09} > +=09=09} > +=09=09spin_unlock_irqrestore(&ctrl->lock, flags); > +=09} > +=09return zapped; > +} > + > int swap_cgroup_swapon(int type, unsigned long max_pages) > { > =09void *array; > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f181.google.com (mail-we0-f181.google.com [74.125.82.181]) by kanga.kvack.org (Postfix) with ESMTP id DB4236B0038 for ; Thu, 15 Jan 2015 12:26:55 -0500 (EST) Received: by mail-we0-f181.google.com with SMTP id q58so16010334wes.12 for ; Thu, 15 Jan 2015 09:26:55 -0800 (PST) Received: from mail-wg0-x231.google.com (mail-wg0-x231.google.com. [2a00:1450:400c:c00::231]) by mx.google.com with ESMTPS id w4si4022281wju.37.2015.01.15.09.26.55 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 15 Jan 2015 09:26:55 -0800 (PST) Received: by mail-wg0-f49.google.com with SMTP id n12so16259495wgh.8 for ; Thu, 15 Jan 2015 09:26:55 -0800 (PST) Date: Thu, 15 Jan 2015 18:26:52 +0100 From: Michal Hocko Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150115172652.GF7008@dhcp22.suse.cz> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150110214316.GF25319@htj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: Vladimir Davydov , "Suzuki K. Poulose" , Johannes Weiner , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon On Sat 10-01-15 16:43:16, Tejun Heo wrote: > Currently, if a hierarchy doesn't have any live children when it's > unmounted, the hierarchy starts dying by killing its refcnt. The > expectation is that even if there are lingering dead children which > are lingering due to remaining references, they'll be put in a finite > amount of time. When the children are finally released, the hierarchy > is destroyed and all controllers bound to it also are released. > > However, for memcg, the premise that the lingering refs will be put in > a finite amount time is not true. In the absense of memory pressure, > dead memcg's may hang around indefinitely pinned by its pages. This > unfortunately may lead to indefinite hang on the next mount attempt > involving memcg as the mount logic waits for it to get released. > > While we can change hierarchy destruction logic such that a hierarchy > is only destroyed when it's not mounted anywhere and all its children, > live or dead, are gone, this makes whether the hierarchy gets > destroyed or not to be determined by factors opaque to userland. > Userland may or may not get a new hierarchy on the next mount attempt. > Worse, if it explicitly wants to create a new hierarchy with different > options or controller compositions involving memcg, it will fail in an > essentially arbitrary manner. > > We want to guarantee that a hierarchy is destroyed once the > conditions, unmounted and no visible children, are met. To aid it, > this patch introduces a new callback cgroup_subsys->unbind() which is > invoked right before the hierarchy a subsystem is bound to starts > dying. memcg can implement this callback and initiate draining of > remaining refs so that the hierarchy can eventually be released in a > finite amount of time. > > Signed-off-by: Tejun Heo > Cc: Li Zefan > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vladimir Davydov Ohh, I have missed this one as I wasn't on the CC list. FWIW this approach makes sense to me. I just think that we should have a way to fail. E.g. kmem pages are impossible to reclaim because there might be some objects lingering somewhere not bound to a task context and reparenting is hard as Vladimir has pointed out several times already. Normal LRU pages should be reclaimable or reparented to the root easily. I cannot judge the implementation but I agree with the fact that memcg controller should be the one to take an action. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f42.google.com (mail-wg0-f42.google.com [74.125.82.42]) by kanga.kvack.org (Postfix) with ESMTP id F2B376B0032 for ; Thu, 15 Jan 2015 12:56:14 -0500 (EST) Received: by mail-wg0-f42.google.com with SMTP id k14so16416175wgh.1 for ; Thu, 15 Jan 2015 09:56:14 -0800 (PST) Received: from mail-wi0-x235.google.com (mail-wi0-x235.google.com. [2a00:1450:400c:c05::235]) by mx.google.com with ESMTPS id a5si11740133wix.30.2015.01.15.09.56.13 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 15 Jan 2015 09:56:13 -0800 (PST) Received: by mail-wi0-f181.google.com with SMTP id hi2so19528934wib.2 for ; Thu, 15 Jan 2015 09:56:13 -0800 (PST) Date: Thu, 15 Jan 2015 18:56:09 +0100 From: Michal Hocko Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150115175609.GG7008@dhcp22.suse.cz> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150111205543.GA5480@phnom.home.cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Tejun Heo , Vladimir Davydov , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon On Sun 11-01-15 15:55:43, Johannes Weiner wrote: > From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Sun, 11 Jan 2015 10:29:05 -0500 > Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during > unbind > > Cgroup core assumes that any outstanding css references after > offlining are temporary in nature, and e.g. mount waits for them to > disappear and release the root cgroup. But leftover page cache and > swapout records in an offlined memcg are only dropped when the pages > get reclaimed under pressure or the swapped out pages get faulted in > from other cgroups, and so those cgroup operations can hang forever. > > Implement the ->unbind() callback to actively get rid of outstanding > references when cgroup core wants them gone. Swap out records are > deleted, such that the swap-in path will charge those pages to the > faulting task. OK, that makes sense to me. > Page cache pages are moved to the root memory cgroup. OK, this is better than reclaiming them. [...] > +static void unbind_lru_list(struct mem_cgroup *memcg, > + struct zone *zone, enum lru_list lru) > +{ > + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); > + struct list_head *list = &lruvec->lists[lru]; > + > + while (!list_empty(list)) { > + unsigned int nr_pages; > + unsigned long flags; > + struct page *page; > + > + spin_lock_irqsave(&zone->lru_lock, flags); > + if (list_empty(list)) { > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + break; > + } > + page = list_last_entry(list, struct page, lru); taking lru_lock for each page calls for troubles. The lock would bounce like crazy. It shouldn't be a big problem to list_move to a local list and then work on that one without the lock. Those pages wouldn't be visible for the reclaim but that would be only temporary. Or if that is not acceptable then just batch at least some number of pages (popular SWAP_CLUSTER_MAX). > + if (!get_page_unless_zero(page)) { > + list_move(&page->lru, list); > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + continue; > + } > + BUG_ON(!PageLRU(page)); > + ClearPageLRU(page); > + del_page_from_lru_list(page, lruvec, lru); > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + > + compound_lock(page); > + nr_pages = hpage_nr_pages(page); > + > + if (!mem_cgroup_move_account(page, nr_pages, > + memcg, root_mem_cgroup)) { > + /* > + * root_mem_cgroup page counters are not used, > + * otherwise we'd have to charge them here. > + */ > + page_counter_uncharge(&memcg->memory, nr_pages); > + if (do_swap_account) > + page_counter_uncharge(&memcg->memsw, nr_pages); > + css_put_many(&memcg->css, nr_pages); > + } > + > + compound_unlock(page); > + > + putback_lru_page(page); > + } > +} > + > +static void unbind_work_fn(struct work_struct *work) > +{ > + struct cgroup_subsys_state *css; > +retry: > + drain_all_stock(root_mem_cgroup); > + > + rcu_read_lock(); > + css_for_each_child(css, &root_mem_cgroup->css) { > + struct mem_cgroup *memcg = mem_cgroup_from_css(css); > + > + /* Drop references from swap-out records */ > + if (do_swap_account) { > + long zapped; > + > + zapped = swap_cgroup_zap_records(memcg->css.id); > + page_counter_uncharge(&memcg->memsw, zapped); > + css_put_many(&memcg->css, zapped); > + } > + > + /* Drop references from leftover LRU pages */ > + css_get(css); > + rcu_read_unlock(); > + atomic_inc(&memcg->moving_account); > + synchronize_rcu(); Why do we need this? Who can migrate to/from offline memcgs? > + while (page_counter_read(&memcg->memory) - > + page_counter_read(&memcg->kmem) > 0) { > + struct zone *zone; > + enum lru_list lru; > + > + lru_add_drain_all(); > + > + for_each_zone(zone) > + for_each_lru(lru) > + unbind_lru_list(memcg, zone, lru); > + > + cond_resched(); > + } > + atomic_dec(&memcg->moving_account); > + rcu_read_lock(); > + css_put(css); > + } > + rcu_read_unlock(); > + /* > + * Swap-in is racy: > + * > + * #0 #1 > + * lookup_swap_cgroup_id() > + * rcu_read_lock() > + * mem_cgroup_lookup() > + * css_tryget_online() > + * rcu_read_unlock() > + * cgroup_kill_sb() > + * !css_has_online_children() > + * ->unbind() > + * page_counter_try_charge() > + * css_put() > + * css_free() > + * pc->mem_cgroup = dead memcg > + * add page to lru > + * > + * Loop until until all references established from previously > + * existing swap-out records have been transferred to pages on > + * the LRU and then uncharged from there. > + */ > + if (!list_empty(&root_mem_cgroup->css.children)) { But what if kmem pages pin the memcg? We would loop for ever. Or am I missing something? > + msleep(10); > + goto retry; > + } > +} [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f47.google.com (mail-qa0-f47.google.com [209.85.216.47]) by kanga.kvack.org (Postfix) with ESMTP id D79646B0032 for ; Mon, 19 Jan 2015 07:51:31 -0500 (EST) Received: by mail-qa0-f47.google.com with SMTP id n8so23720696qaq.6 for ; Mon, 19 Jan 2015 04:51:31 -0800 (PST) Received: from service87.mimecast.com (service87.mimecast.com. [91.220.42.44]) by mx.google.com with ESMTP id f3si17052709qaq.129.2015.01.19.04.51.29 for ; Mon, 19 Jan 2015 04:51:30 -0800 (PST) Message-ID: <54BCFDCF.9090603@arm.com> Date: Mon, 19 Jan 2015 12:51:27 +0000 From: "Suzuki K. Poulose" MIME-Version: 1.0 Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> In-Reply-To: <20150110085525.GD2110@esperanza> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Vladimir Davydov Cc: Tejun Heo , Johannes Weiner , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , Will Deacon , mhocko@suse.cz, akpm@linux-foundation.org On 10/01/15 08:55, Vladimir Davydov wrote: > On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote: >> Hi >> >> We have hit a hang on ARM64 defconfig, while running LTP tests on >> 3.19-rc3. We are >> in the process of a git bisect and will update the results as and >> when we find the commit. >> >> During the ksm ltp run, the test hangs trying to mount memcg with >> the following strace >> output: >> >> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") =3D ? >> ERESTARTNOINTR (To be restarted) >> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") =3D ? >> ERESTARTNOINTR (To be restarted) >> [ ... repeated forever ... ] >> >> At this point, one can try mounting the memcg to verify the problem. >> # mount -t cgroup -o memory memcg memcg_dir >> --hangs-- >> >> Strangely, if we run the mount command from a cold boot (i.e. >> without running LTP first), >> then it succeeds. >> >> Upon a quick look we are hitting the following code : >> kernel/cgroup.c: cgroup_mount() : >> >> 1779 for_each_subsys(ss, i) { >> 1780 if (!(opts.subsys_mask & (1 << i)) || >> 1781 ss->root =3D=3D &cgrp_dfl_root) >> 1782 continue; >> 1783 >> 1784 if >> (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { >> 1785 mutex_unlock(&cgroup_mutex); >> 1786 msleep(10); >> 1787 ret =3D restart_syscall(); <=3D=3D=3D=3D=3D >> 1788 goto out_free; >> 1789 } >> 1790 cgroup_put(&ss->root->cgrp); >> 1791 } >> >> with ss->root->cgrp.self.refct.percpu_count_ptr =3D=3D __PERCPU_REF_ATOM= IC_DEAD >> >> Any ideas? > > The problem is that the memory cgroup controller takes a css reference > per each charged page and does not reparent charged pages on css > offline, while cgroup_mount/cgroup_kill_sb expect all css references to > offline cgroups to be gone soon, restarting the syscall if the ref count > !=3D 0. As a result, if you create a memory cgroup, charge some page cach= e > to it, and then remove it, unmount/mount will hang forever. > > May be, we should kill the ref counter to the memory controller root in > cgroup_kill_sb only if there is no children at all, neither online nor > offline. > Still reproducible on 3.19-rc5 with the same setup. From git bisect, the=20 last good commit is : commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735 Author: Pranith Kumar Date: Wed Dec 10 15:42:28 2014 -0800 slab: replace smp_read_barrier_depends() with lockless_dereference() Thanks Suzuki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f175.google.com (mail-pd0-f175.google.com [209.85.192.175]) by kanga.kvack.org (Postfix) with ESMTP id E34F26B0032 for ; Wed, 21 Jan 2015 11:40:16 -0500 (EST) Received: by mail-pd0-f175.google.com with SMTP id fl12so17790717pdb.6 for ; Wed, 21 Jan 2015 08:40:16 -0800 (PST) Received: from foss-mx-na.foss.arm.com (foss-mx-na.foss.arm.com. [217.140.108.86]) by mx.google.com with ESMTP id nq15si4287798pdb.212.2015.01.21.08.40.14 for ; Wed, 21 Jan 2015 08:40:15 -0800 (PST) Date: Wed, 21 Jan 2015 16:39:55 +0000 From: Will Deacon Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150121163955.GM4549@arm.com> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54BCFDCF.9090603@arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Suzuki K. Poulose" Cc: Vladimir Davydov , Tejun Heo , Johannes Weiner , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote: > On 10/01/15 08:55, Vladimir Davydov wrote: > > The problem is that the memory cgroup controller takes a css reference > > per each charged page and does not reparent charged pages on css > > offline, while cgroup_mount/cgroup_kill_sb expect all css references to > > offline cgroups to be gone soon, restarting the syscall if the ref count > > != 0. As a result, if you create a memory cgroup, charge some page cache > > to it, and then remove it, unmount/mount will hang forever. > > > > May be, we should kill the ref counter to the memory controller root in > > cgroup_kill_sb only if there is no children at all, neither online nor > > offline. > > > > Still reproducible on 3.19-rc5 with the same setup. Yeah, I'm seeing the same failure on my setup too. > From git bisect, the last good commit is : > > commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735 > Author: Pranith Kumar > Date: Wed Dec 10 15:42:28 2014 -0800 > > slab: replace smp_read_barrier_depends() with lockless_dereference() So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") as the offending commit. Will -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f50.google.com (mail-wg0-f50.google.com [74.125.82.50]) by kanga.kvack.org (Postfix) with ESMTP id C5A6C6B0032 for ; Thu, 22 Jan 2015 08:46:06 -0500 (EST) Received: by mail-wg0-f50.google.com with SMTP id b13so1792570wgh.9 for ; Thu, 22 Jan 2015 05:46:06 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id bt4si10019787wib.2.2015.01.22.05.46.03 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 22 Jan 2015 05:46:03 -0800 (PST) Date: Thu, 22 Jan 2015 08:45:50 -0500 From: Johannes Weiner Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150122134550.GA13876@phnom.home.cmpxchg.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150121163955.GM4549@arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Will Deacon Cc: "Suzuki K. Poulose" , Vladimir Davydov , Tejun Heo , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" On Wed, Jan 21, 2015 at 04:39:55PM +0000, Will Deacon wrote: > On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote: > > On 10/01/15 08:55, Vladimir Davydov wrote: > > > The problem is that the memory cgroup controller takes a css reference > > > per each charged page and does not reparent charged pages on css > > > offline, while cgroup_mount/cgroup_kill_sb expect all css references to > > > offline cgroups to be gone soon, restarting the syscall if the ref count > > > != 0. As a result, if you create a memory cgroup, charge some page cache > > > to it, and then remove it, unmount/mount will hang forever. > > > > > > May be, we should kill the ref counter to the memory controller root in > > > cgroup_kill_sb only if there is no children at all, neither online nor > > > offline. > > > > > > > Still reproducible on 3.19-rc5 with the same setup. > > Yeah, I'm seeing the same failure on my setup too. > > > From git bisect, the last good commit is : > > > > commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735 > > Author: Pranith Kumar > > Date: Wed Dec 10 15:42:28 2014 -0800 > > > > slab: replace smp_read_barrier_depends() with lockless_dereference() > > So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") > as the offending commit. With b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"), page cache can pin an old css and its ancestors indefinitely, making that hang in a second mount() very likely. However, swap entries have also been doing that for quite a while now, and as Vladimir pointed out, the same is true for kernel memory. This latest change just makes this existing bug easier to trigger. I think we have to update the lifetime rules to reflect reality here: memory and swap lifetime is indefinite, so once the memory controller is used, it has state that is independent from whether its mounted or not. We can support an identical remount, but have to fail mounting with new parameters that would change the behavior of the controller. Suzuki, Will, could you give the following patch a shot? Tejun, would that route be acceptable to you? Thanks --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f173.google.com (mail-qc0-f173.google.com [209.85.216.173]) by kanga.kvack.org (Postfix) with ESMTP id C14216B0032 for ; Thu, 22 Jan 2015 09:34:58 -0500 (EST) Received: by mail-qc0-f173.google.com with SMTP id m20so1445144qcx.4 for ; Thu, 22 Jan 2015 06:34:58 -0800 (PST) Received: from mail-qg0-x232.google.com (mail-qg0-x232.google.com. [2607:f8b0:400d:c04::232]) by mx.google.com with ESMTPS id q16si9088682qam.36.2015.01.22.06.34.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 22 Jan 2015 06:34:57 -0800 (PST) Received: by mail-qg0-f50.google.com with SMTP id f51so1390107qge.9 for ; Thu, 22 Jan 2015 06:34:57 -0800 (PST) Date: Thu, 22 Jan 2015 09:34:54 -0500 From: Tejun Heo Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150122143454.GA4507@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> <20150122134550.GA13876@phnom.home.cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150122134550.GA13876@phnom.home.cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Will Deacon , "Suzuki K. Poulose" , Vladimir Davydov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" Hello, On Thu, Jan 22, 2015 at 08:45:50AM -0500, Johannes Weiner wrote: > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index bb263d0caab3..9a09308c8066 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, > goto out_unlock; > } > > - if (root->flags ^ opts.flags) > - pr_warn("new mount options do not match the existing superblock, will be ignored\n"); > + if (root->flags ^ opts.flags) { > + pr_warn("new mount options do not match the existing superblock\n"); > + ret = -EBUSY; > + goto out_unlock; > + } Do we really need the above chunk? > @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb) > * > * And don't kill the default root. > */ > - if (css_has_online_children(&root->cgrp.self) || > + if (!list_empty(&root->cgrp.self.children) || > root == &cgrp_dfl_root) > cgroup_put(&root->cgrp); I tried to do something a bit more advanced so that eventual async release of dying children, if they happen, can also release the hierarchy but I don't think it really matters unless we can forcefully drain. So, shouldn't just the above part be enough? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f172.google.com (mail-we0-f172.google.com [74.125.82.172]) by kanga.kvack.org (Postfix) with ESMTP id BD88E6B0070 for ; Thu, 22 Jan 2015 10:19:52 -0500 (EST) Received: by mail-we0-f172.google.com with SMTP id k11so2361690wes.3 for ; Thu, 22 Jan 2015 07:19:52 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id cj10si5170189wid.85.2015.01.22.07.19.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 22 Jan 2015 07:19:51 -0800 (PST) Date: Thu, 22 Jan 2015 10:19:43 -0500 From: Johannes Weiner Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150122151943.GA27368@phnom.home.cmpxchg.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> <20150122134550.GA13876@phnom.home.cmpxchg.org> <20150122143454.GA4507@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150122143454.GA4507@htj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: Will Deacon , "Suzuki K. Poulose" , Vladimir Davydov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" Hi, On Thu, Jan 22, 2015 at 09:34:54AM -0500, Tejun Heo wrote: > On Thu, Jan 22, 2015 at 08:45:50AM -0500, Johannes Weiner wrote: > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > > index bb263d0caab3..9a09308c8066 100644 > > --- a/kernel/cgroup.c > > +++ b/kernel/cgroup.c > > @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, > > goto out_unlock; > > } > > > > - if (root->flags ^ opts.flags) > > - pr_warn("new mount options do not match the existing superblock, will be ignored\n"); > > + if (root->flags ^ opts.flags) { > > + pr_warn("new mount options do not match the existing superblock\n"); > > + ret = -EBUSY; > > + goto out_unlock; > > + } > > Do we really need the above chunk? Inform and ignore or fail hard? I guess we can drop this hunk and keep with the current behavior. > > @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb) > > * > > * And don't kill the default root. > > */ > > - if (css_has_online_children(&root->cgrp.self) || > > + if (!list_empty(&root->cgrp.self.children) || > > root == &cgrp_dfl_root) > > cgroup_put(&root->cgrp); > > I tried to do something a bit more advanced so that eventual async > release of dying children, if they happen, can also release the > hierarchy but I don't think it really matters unless we can forcefully > drain. So, shouldn't just the above part be enough? Yep, I'd be fine with that. --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f52.google.com (mail-qg0-f52.google.com [209.85.192.52]) by kanga.kvack.org (Postfix) with ESMTP id E94EB6B0070 for ; Thu, 22 Jan 2015 10:28:21 -0500 (EST) Received: by mail-qg0-f52.google.com with SMTP id z107so1641381qgd.11 for ; Thu, 22 Jan 2015 07:28:21 -0800 (PST) Received: from mail-qc0-x232.google.com (mail-qc0-x232.google.com. [2607:f8b0:400d:c01::232]) by mx.google.com with ESMTPS id 61si4967479qgx.12.2015.01.22.07.28.21 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 22 Jan 2015 07:28:21 -0800 (PST) Received: by mail-qc0-f178.google.com with SMTP id b13so1710871qcw.9 for ; Thu, 22 Jan 2015 07:28:21 -0800 (PST) Date: Thu, 22 Jan 2015 10:28:17 -0500 From: Tejun Heo Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150122152817.GD4507@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> <20150122134550.GA13876@phnom.home.cmpxchg.org> <20150122143454.GA4507@htj.dyndns.org> <20150122151943.GA27368@phnom.home.cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150122151943.GA27368@phnom.home.cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Will Deacon , "Suzuki K. Poulose" , Vladimir Davydov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" On Thu, Jan 22, 2015 at 10:19:43AM -0500, Johannes Weiner wrote: > From 3d7ae5aeb16ce6118d8bff17194e791339a1f06c Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Thu, 22 Jan 2015 08:16:31 -0500 > Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller > lifetime > > Since b2052564e66d ("mm: memcontrol: continue cache reclaim from > offlined groups"), re-mounting the memory controller after using it is > very likely to hang. > > The cgroup core assumes that any remaining references after deleting a > cgroup are temporary in nature, and synchroneously waits for them, but > the above-mentioned commit has left-over page cache pin its css until > it is reclaimed naturally. That being said, swap entries and charged > kernel memory have been doing the same indefinite pinning forever, the > bug is just more likely to trigger with left-over page cache. > > Reparenting kernel memory is highly impractical, which leaves changing > the cgroup assumptions to reflect this: once a controller has been > mounted and used, it has internal state that is independent from mount > and cgroup lifetime. It can be unmounted and remounted, but it can't > be reconfigured during subsequent mounts. > > Don't offline the controller root as long as there are any children, > dead or alive. A remount will no longer wait for these old references > to drain, it will simply mount the persistent controller state again. > > Reported-by: "Suzuki K. Poulose" > Reported-by: Will Deacon > Signed-off-by: Johannes Weiner Applied to cgroup/for-3.19-fixes. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f48.google.com (mail-qa0-f48.google.com [209.85.216.48]) by kanga.kvack.org (Postfix) with ESMTP id 3234D6B0032 for ; Fri, 23 Jan 2015 10:00:03 -0500 (EST) Received: by mail-qa0-f48.google.com with SMTP id v8so6063323qal.7 for ; Fri, 23 Jan 2015 07:00:03 -0800 (PST) Received: from service87.mimecast.com (service87.mimecast.com. [91.220.42.44]) by mx.google.com with ESMTP id z1si2269095qar.33.2015.01.23.07.00.01 for ; Fri, 23 Jan 2015 07:00:02 -0800 (PST) Message-ID: <54C261F0.9070606@arm.com> Date: Fri, 23 Jan 2015 15:00:00 +0000 From: "Suzuki K. Poulose" MIME-Version: 1.0 Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> <20150122134550.GA13876@phnom.home.cmpxchg.org> In-Reply-To: <20150122134550.GA13876@phnom.home.cmpxchg.org> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner , Will Deacon Cc: Vladimir Davydov , Tejun Heo , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" On 22/01/15 13:45, Johannes Weiner wrote: > On Wed, Jan 21, 2015 at 04:39:55PM +0000, Will Deacon wrote: >> On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote: >>> On 10/01/15 08:55, Vladimir Davydov wrote: >>>> The problem is that the memory cgroup controller takes a css reference >>>> per each charged page and does not reparent charged pages on css >>>> offline, while cgroup_mount/cgroup_kill_sb expect all css references t= o >>>> offline cgroups to be gone soon, restarting the syscall if the ref cou= nt >>>> !=3D 0. As a result, if you create a memory cgroup, charge some page c= ache >>>> to it, and then remove it, unmount/mount will hang forever. >>>> >>>> May be, we should kill the ref counter to the memory controller root i= n >>>> cgroup_kill_sb only if there is no children at all, neither online nor >>>> offline. >>>> >>> >>> Still reproducible on 3.19-rc5 with the same setup. >> >> Yeah, I'm seeing the same failure on my setup too. >> >>> From git bisect, the last good commit is : >>> >>> commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735 >>> Author: Pranith Kumar >>> Date: Wed Dec 10 15:42:28 2014 -0800 >>> >>> slab: replace smp_read_barrier_depends() with lockless_dereferenc= e() >> >> So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"= ) >> as the offending commit. > > With b2052564e66d ("mm: memcontrol: continue cache reclaim from > offlined groups"), page cache can pin an old css and its ancestors > indefinitely, making that hang in a second mount() very likely. > > However, swap entries have also been doing that for quite a while now, > and as Vladimir pointed out, the same is true for kernel memory. This > latest change just makes this existing bug easier to trigger. > > I think we have to update the lifetime rules to reflect reality here: > memory and swap lifetime is indefinite, so once the memory controller > is used, it has state that is independent from whether its mounted or > not. We can support an identical remount, but have to fail mounting > with new parameters that would change the behavior of the controller. > > Suzuki, Will, could you give the following patch a shot? > > Tejun, would that route be acceptable to you? > > Thanks > > --- > From c5e88d02d185c52748df664aa30a2c5f8949b0f7 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Thu, 22 Jan 2015 08:16:31 -0500 > Subject: [patch] kernel: cgroup: prevent mount hang due to memory control= ler > lifetime > > > Don't offline the controller root as long as there are any children, > dead or alive. A remount will no longer wait for these old references > to drain, it will simply mount the persistent controller state again. > > Reported-by: "Suzuki K. Poulose" > Reported-by: Will Deacon > Signed-off-by: Johannes Weiner This one fixes the issue. Tested-by : Suzuki K. Poulose Thanks Suzuki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758066AbbAIRnY (ORCPT ); Fri, 9 Jan 2015 12:43:24 -0500 Received: from service87.mimecast.com ([91.220.42.44]:58995 "EHLO service87.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753969AbbAIRnW convert rfc822-to-8bit (ORCPT ); Fri, 9 Jan 2015 12:43:22 -0500 Message-ID: <54B01335.4060901@arm.com> Date: Fri, 09 Jan 2015 17:43:17 +0000 From: "Suzuki K. Poulose" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Tejun Heo , Johannes Weiner CC: linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Subject: [Regression] 3.19-rc3 : memcg: Hang in mount memcg X-OriginalArrivalTime: 09 Jan 2015 17:43:18.0220 (UTC) FILETIME=[C0BB50C0:01D02C33] X-MC-Unique: 115010917432000201 Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-rc3. We are in the process of a git bisect and will update the results as and when we find the commit. During the ksm ltp run, the test hangs trying to mount memcg with the following strace output: mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To be restarted) mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To be restarted) [ ... repeated forever ... ] At this point, one can try mounting the memcg to verify the problem. # mount -t cgroup -o memory memcg memcg_dir --hangs-- Strangely, if we run the mount command from a cold boot (i.e. without running LTP first), then it succeeds. Upon a quick look we are hitting the following code : kernel/cgroup.c: cgroup_mount() : 1779 for_each_subsys(ss, i) { 1780 if (!(opts.subsys_mask & (1 << i)) || 1781 ss->root == &cgrp_dfl_root) 1782 continue; 1783 1784 if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { 1785 mutex_unlock(&cgroup_mutex); 1786 msleep(10); 1787 ret = restart_syscall(); <===== 1788 goto out_free; 1789 } 1790 cgroup_put(&ss->root->cgrp); 1791 } with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD Any ideas? Thanks Suzuki From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752492AbbAIVqx (ORCPT ); Fri, 9 Jan 2015 16:46:53 -0500 Received: from mail-qa0-f51.google.com ([209.85.216.51]:43682 "EHLO mail-qa0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751213AbbAIVqw (ORCPT ); Fri, 9 Jan 2015 16:46:52 -0500 Date: Fri, 9 Jan 2015 16:46:49 -0500 From: Tejun Heo To: "Suzuki K. Poulose" Cc: Johannes Weiner , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150109214649.GF2785@htj.dyndns.org> References: <54B01335.4060901@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54B01335.4060901@arm.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote: > We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-rc3. > We are > in the process of a git bisect and will update the results as and > when we find the commit. > > During the ksm ltp run, the test hangs trying to mount memcg with the > following strace > output: > > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To > be restarted) > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To > be restarted) > [ ... repeated forever ... ] > > At this point, one can try mounting the memcg to verify the problem. > # mount -t cgroup -o memory memcg memcg_dir > --hangs-- > > Strangely, if we run the mount command from a cold boot (i.e. without > running LTP first), > then it succeeds. I don't know what LTP is doing and this could actually be hitting on an actual bug but if it's trying to move memcg back from unified hierarchy to an old one, that might hang - it should prolly made to just fail at that point. Anyways, any chance you can find out what happened, in terms of cgroup mounting, to memcg upto that point? Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754503AbbAJIzj (ORCPT ); Sat, 10 Jan 2015 03:55:39 -0500 Received: from mx2.parallels.com ([199.115.105.18]:45073 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751044AbbAJIzi (ORCPT ); Sat, 10 Jan 2015 03:55:38 -0500 Date: Sat, 10 Jan 2015 11:55:25 +0300 From: Vladimir Davydov To: "Suzuki K. Poulose" CC: Tejun Heo , Johannes Weiner , , "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150110085525.GD2110@esperanza> References: <54B01335.4060901@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <54B01335.4060901@arm.com> X-Originating-IP: [81.5.99.36] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote: > Hi > > We have hit a hang on ARM64 defconfig, while running LTP tests on > 3.19-rc3. We are > in the process of a git bisect and will update the results as and > when we find the commit. > > During the ksm ltp run, the test hangs trying to mount memcg with > the following strace > output: > > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? > ERESTARTNOINTR (To be restarted) > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? > ERESTARTNOINTR (To be restarted) > [ ... repeated forever ... ] > > At this point, one can try mounting the memcg to verify the problem. > # mount -t cgroup -o memory memcg memcg_dir > --hangs-- > > Strangely, if we run the mount command from a cold boot (i.e. > without running LTP first), > then it succeeds. > > Upon a quick look we are hitting the following code : > kernel/cgroup.c: cgroup_mount() : > > 1779 for_each_subsys(ss, i) { > 1780 if (!(opts.subsys_mask & (1 << i)) || > 1781 ss->root == &cgrp_dfl_root) > 1782 continue; > 1783 > 1784 if > (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { > 1785 mutex_unlock(&cgroup_mutex); > 1786 msleep(10); > 1787 ret = restart_syscall(); <===== > 1788 goto out_free; > 1789 } > 1790 cgroup_put(&ss->root->cgrp); > 1791 } > > with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD > > Any ideas? The problem is that the memory cgroup controller takes a css reference per each charged page and does not reparent charged pages on css offline, while cgroup_mount/cgroup_kill_sb expect all css references to offline cgroups to be gone soon, restarting the syscall if the ref count != 0. As a result, if you create a memory cgroup, charge some page cache to it, and then remove it, unmount/mount will hang forever. May be, we should kill the ref counter to the memory controller root in cgroup_kill_sb only if there is no children at all, neither online nor offline. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754900AbbAJVnW (ORCPT ); Sat, 10 Jan 2015 16:43:22 -0500 Received: from mail-qa0-f48.google.com ([209.85.216.48]:43871 "EHLO mail-qa0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751335AbbAJVnU (ORCPT ); Sat, 10 Jan 2015 16:43:20 -0500 Date: Sat, 10 Jan 2015 16:43:16 -0500 From: Tejun Heo To: Vladimir Davydov Cc: "Suzuki K. Poulose" , Johannes Weiner , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Subject: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150110214316.GF25319@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150110085525.GD2110@esperanza> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently, if a hierarchy doesn't have any live children when it's unmounted, the hierarchy starts dying by killing its refcnt. The expectation is that even if there are lingering dead children which are lingering due to remaining references, they'll be put in a finite amount of time. When the children are finally released, the hierarchy is destroyed and all controllers bound to it also are released. However, for memcg, the premise that the lingering refs will be put in a finite amount time is not true. In the absense of memory pressure, dead memcg's may hang around indefinitely pinned by its pages. This unfortunately may lead to indefinite hang on the next mount attempt involving memcg as the mount logic waits for it to get released. While we can change hierarchy destruction logic such that a hierarchy is only destroyed when it's not mounted anywhere and all its children, live or dead, are gone, this makes whether the hierarchy gets destroyed or not to be determined by factors opaque to userland. Userland may or may not get a new hierarchy on the next mount attempt. Worse, if it explicitly wants to create a new hierarchy with different options or controller compositions involving memcg, it will fail in an essentially arbitrary manner. We want to guarantee that a hierarchy is destroyed once the conditions, unmounted and no visible children, are met. To aid it, this patch introduces a new callback cgroup_subsys->unbind() which is invoked right before the hierarchy a subsystem is bound to starts dying. memcg can implement this callback and initiate draining of remaining refs so that the hierarchy can eventually be released in a finite amount of time. Signed-off-by: Tejun Heo Cc: Li Zefan Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov --- Hello, > May be, we should kill the ref counter to the memory controller root in > cgroup_kill_sb only if there is no children at all, neither online nor > offline. Ah, thanks for the analysis, but I really wanna avoid making hierarchy destruction conditions opaque to userland. This is userland visible behavior. It shouldn't be determined by kernel internals invisible outside. This patch adds ss->unbind() which memcg can hook into to kick off draining of residual refs. If this would work, I'll add this patch to cgroup/for-3.19-fixes, possibly with stable cc'd. Thanks. Documentation/cgroups/cgroups.txt | 12 +++++++++++- include/linux/cgroup.h | 1 + kernel/cgroup.c | 14 ++++++++++++-- 3 files changed, 24 insertions(+), 3 deletions(-) --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -637,7 +637,7 @@ void exit(struct task_struct *task) Called during task exit. -void bind(struct cgroup *root) +void bind(struct cgroup_subsys_state *root_css) (cgroup_mutex held by caller) Called when a cgroup subsystem is rebound to a different hierarchy @@ -645,6 +645,16 @@ and root cgroup. Currently this will onl the default hierarchy (which never has sub-cgroups) and a hierarchy that is being created/destroyed (and hence has no sub-cgroups). +void unbind(struct cgroup_subsys_state *root_css) +(cgroup_mutex held by caller) + +Called when a cgroup subsys is unbound from its current hierarchy. The +subsystem must guarantee that all offline cgroups are going to be +released in a finite amount of time after this function is called. +Such draining can be asynchronous. The following binding of the +subsystem to a hierarchy will be delayed till the draining is +complete. + 4. Extended attribute usage =========================== --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -654,6 +654,7 @@ struct cgroup_subsys { struct cgroup_subsys_state *old_css, struct task_struct *task); void (*bind)(struct cgroup_subsys_state *root_css); + void (*unbind)(struct cgroup_subsys_state *root_css); int disabled; int early_init; --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1910,10 +1910,20 @@ static void cgroup_kill_sb(struct super_ * And don't kill the default root. */ if (css_has_online_children(&root->cgrp.self) || - root == &cgrp_dfl_root) + root == &cgrp_dfl_root) { cgroup_put(&root->cgrp); - else + } else { + struct cgroup_subsys *ss; + int i; + + mutex_lock(&cgroup_mutex); + for_each_subsys(ss, i) + if ((root->subsys_mask & (1UL << i)) && ss->unbind) + ss->unbind(cgroup_css(&root->cgrp, ss)); + mutex_unlock(&cgroup_mutex); + percpu_ref_kill(&root->cgrp.self.refcnt); + } kernfs_kill_sb(sb); } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752222AbbAKU4A (ORCPT ); Sun, 11 Jan 2015 15:56:00 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:57106 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751333AbbAKUz7 (ORCPT ); Sun, 11 Jan 2015 15:55:59 -0500 Date: Sun, 11 Jan 2015 15:55:43 -0500 From: Johannes Weiner To: Tejun Heo Cc: Vladimir Davydov , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150111205543.GA5480@phnom.home.cmpxchg.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150110214316.GF25319@htj.dyndns.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: > Currently, if a hierarchy doesn't have any live children when it's > unmounted, the hierarchy starts dying by killing its refcnt. The > expectation is that even if there are lingering dead children which > are lingering due to remaining references, they'll be put in a finite > amount of time. When the children are finally released, the hierarchy > is destroyed and all controllers bound to it also are released. > > However, for memcg, the premise that the lingering refs will be put in > a finite amount time is not true. In the absense of memory pressure, > dead memcg's may hang around indefinitely pinned by its pages. This > unfortunately may lead to indefinite hang on the next mount attempt > involving memcg as the mount logic waits for it to get released. > > While we can change hierarchy destruction logic such that a hierarchy > is only destroyed when it's not mounted anywhere and all its children, > live or dead, are gone, this makes whether the hierarchy gets > destroyed or not to be determined by factors opaque to userland. > Userland may or may not get a new hierarchy on the next mount attempt. > Worse, if it explicitly wants to create a new hierarchy with different > options or controller compositions involving memcg, it will fail in an > essentially arbitrary manner. > > We want to guarantee that a hierarchy is destroyed once the > conditions, unmounted and no visible children, are met. To aid it, > this patch introduces a new callback cgroup_subsys->unbind() which is > invoked right before the hierarchy a subsystem is bound to starts > dying. memcg can implement this callback and initiate draining of > remaining refs so that the hierarchy can eventually be released in a > finite amount of time. > > Signed-off-by: Tejun Heo > Cc: Li Zefan > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vladimir Davydov > --- > Hello, > > > May be, we should kill the ref counter to the memory controller root in > > cgroup_kill_sb only if there is no children at all, neither online nor > > offline. > > Ah, thanks for the analysis, but I really wanna avoid making hierarchy > destruction conditions opaque to userland. This is userland visible > behavior. It shouldn't be determined by kernel internals invisible > outside. This patch adds ss->unbind() which memcg can hook into to > kick off draining of residual refs. If this would work, I'll add this > patch to cgroup/for-3.19-fixes, possibly with stable cc'd. How about this ->unbind() for memcg? >>From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Sun, 11 Jan 2015 10:29:05 -0500 Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during unbind Cgroup core assumes that any outstanding css references after offlining are temporary in nature, and e.g. mount waits for them to disappear and release the root cgroup. But leftover page cache and swapout records in an offlined memcg are only dropped when the pages get reclaimed under pressure or the swapped out pages get faulted in from other cgroups, and so those cgroup operations can hang forever. Implement the ->unbind() callback to actively get rid of outstanding references when cgroup core wants them gone. Swap out records are deleted, such that the swap-in path will charge those pages to the faulting task. Page cache pages are moved to the root memory cgroup. Signed-off-by: Johannes Weiner --- include/linux/swap_cgroup.h | 6 +++ mm/memcontrol.c | 126 ++++++++++++++++++++++++++++++++++++++++++++ mm/swap_cgroup.c | 38 +++++++++++++ 3 files changed, 170 insertions(+) diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h index 145306bdc92f..ffe0866d2997 100644 --- a/include/linux/swap_cgroup.h +++ b/include/linux/swap_cgroup.h @@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, unsigned short old, unsigned short new); extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); +extern unsigned long swap_cgroup_zap_records(unsigned short id); extern int swap_cgroup_swapon(int type, unsigned long max_pages); extern void swap_cgroup_swapoff(int type); @@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) return 0; } +static inline unsigned long swap_cgroup_zap_records(unsigned short id) +{ + return 0; +} + static inline int swap_cgroup_swapon(int type, unsigned long max_pages) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 692e96407627..40c426add613 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5197,6 +5197,131 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) mem_cgroup_from_css(root_css)->use_hierarchy = true; } +static void unbind_lru_list(struct mem_cgroup *memcg, + struct zone *zone, enum lru_list lru) +{ + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); + struct list_head *list = &lruvec->lists[lru]; + + while (!list_empty(list)) { + unsigned int nr_pages; + unsigned long flags; + struct page *page; + + spin_lock_irqsave(&zone->lru_lock, flags); + if (list_empty(list)) { + spin_unlock_irqrestore(&zone->lru_lock, flags); + break; + } + page = list_last_entry(list, struct page, lru); + if (!get_page_unless_zero(page)) { + list_move(&page->lru, list); + spin_unlock_irqrestore(&zone->lru_lock, flags); + continue; + } + BUG_ON(!PageLRU(page)); + ClearPageLRU(page); + del_page_from_lru_list(page, lruvec, lru); + spin_unlock_irqrestore(&zone->lru_lock, flags); + + compound_lock(page); + nr_pages = hpage_nr_pages(page); + + if (!mem_cgroup_move_account(page, nr_pages, + memcg, root_mem_cgroup)) { + /* + * root_mem_cgroup page counters are not used, + * otherwise we'd have to charge them here. + */ + page_counter_uncharge(&memcg->memory, nr_pages); + if (do_swap_account) + page_counter_uncharge(&memcg->memsw, nr_pages); + css_put_many(&memcg->css, nr_pages); + } + + compound_unlock(page); + + putback_lru_page(page); + } +} + +static void unbind_work_fn(struct work_struct *work) +{ + struct cgroup_subsys_state *css; +retry: + drain_all_stock(root_mem_cgroup); + + rcu_read_lock(); + css_for_each_child(css, &root_mem_cgroup->css) { + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + /* Drop references from swap-out records */ + if (do_swap_account) { + long zapped; + + zapped = swap_cgroup_zap_records(memcg->css.id); + page_counter_uncharge(&memcg->memsw, zapped); + css_put_many(&memcg->css, zapped); + } + + /* Drop references from leftover LRU pages */ + css_get(css); + rcu_read_unlock(); + atomic_inc(&memcg->moving_account); + synchronize_rcu(); + while (page_counter_read(&memcg->memory) - + page_counter_read(&memcg->kmem) > 0) { + struct zone *zone; + enum lru_list lru; + + lru_add_drain_all(); + + for_each_zone(zone) + for_each_lru(lru) + unbind_lru_list(memcg, zone, lru); + + cond_resched(); + } + atomic_dec(&memcg->moving_account); + rcu_read_lock(); + css_put(css); + } + rcu_read_unlock(); + /* + * Swap-in is racy: + * + * #0 #1 + * lookup_swap_cgroup_id() + * rcu_read_lock() + * mem_cgroup_lookup() + * css_tryget_online() + * rcu_read_unlock() + * cgroup_kill_sb() + * !css_has_online_children() + * ->unbind() + * page_counter_try_charge() + * css_put() + * css_free() + * pc->mem_cgroup = dead memcg + * add page to lru + * + * Loop until until all references established from previously + * existing swap-out records have been transferred to pages on + * the LRU and then uncharged from there. + */ + if (!list_empty(&root_mem_cgroup->css.children)) { + msleep(10); + goto retry; + } +} + +static DECLARE_WORK(unbind_work, unbind_work_fn); + +static void mem_cgroup_unbind(struct cgroup_subsys_state *root_css) +{ + schedule_work(&unbind_work); +} + static u64 memory_current_read(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -5360,6 +5485,7 @@ struct cgroup_subsys memory_cgrp_subsys = { .cancel_attach = mem_cgroup_cancel_attach, .attach = mem_cgroup_move_task, .bind = mem_cgroup_bind, + .unbind = mem_cgroup_unbind, .dfl_cftypes = memory_files, .legacy_cftypes = mem_cgroup_legacy_files, .early_init = 0, diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c index b5f7f24b8dd1..665923a558c4 100644 --- a/mm/swap_cgroup.c +++ b/mm/swap_cgroup.c @@ -140,6 +140,44 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) return lookup_swap_cgroup(ent, NULL)->id; } +/** + * swap_cgroup_zap_records - delete all swapout records of one cgroup + * @id: memcg id + * + * Returns the number of deleted records. + */ +unsigned long swap_cgroup_zap_records(unsigned short id) +{ + unsigned long zapped = 0; + unsigned int type; + + for (type = 0; type < MAX_SWAPFILES; type++) { + struct swap_cgroup_ctrl *ctrl; + unsigned long flags; + unsigned int page; + + ctrl = &swap_cgroup_ctrl[type]; + spin_lock_irqsave(&ctrl->lock, flags); + for (page = 0; page < ctrl->length; page++) { + struct swap_cgroup *base; + pgoff_t offset; + + base = page_address(ctrl->map[page]); + for (offset = 0; offset < SC_PER_PAGE; offset++) { + struct swap_cgroup *sc; + + sc = base + offset; + if (sc->id == id) { + sc->id = 0; + zapped++; + } + } + } + spin_unlock_irqrestore(&ctrl->lock, flags); + } + return zapped; +} + int swap_cgroup_swapon(int type, unsigned long max_pages) { void *array; -- 2.2.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751572AbbALIBY (ORCPT ); Mon, 12 Jan 2015 03:01:24 -0500 Received: from mx2.parallels.com ([199.115.105.18]:57192 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750991AbbALIBX (ORCPT ); Mon, 12 Jan 2015 03:01:23 -0500 Date: Mon, 12 Jan 2015 11:01:14 +0300 From: Vladimir Davydov To: Johannes Weiner CC: Tejun Heo , "Suzuki K. Poulose" , , "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150112080114.GE2110@esperanza> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20150111205543.GA5480@phnom.home.cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Jan 11, 2015 at 03:55:43PM -0500, Johannes Weiner wrote: > On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: > > > May be, we should kill the ref counter to the memory controller root in > > > cgroup_kill_sb only if there is no children at all, neither online nor > > > offline. > > > > Ah, thanks for the analysis, but I really wanna avoid making hierarchy > > destruction conditions opaque to userland. This is userland visible > > behavior. It shouldn't be determined by kernel internals invisible > > outside. This patch adds ss->unbind() which memcg can hook into to > > kick off draining of residual refs. If this would work, I'll add this > > patch to cgroup/for-3.19-fixes, possibly with stable cc'd. > > How about this ->unbind() for memcg? > > From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Sun, 11 Jan 2015 10:29:05 -0500 > Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during > unbind > > Cgroup core assumes that any outstanding css references after > offlining are temporary in nature, and e.g. mount waits for them to > disappear and release the root cgroup. But leftover page cache and > swapout records in an offlined memcg are only dropped when the pages > get reclaimed under pressure or the swapped out pages get faulted in > from other cgroups, and so those cgroup operations can hang forever. > > Implement the ->unbind() callback to actively get rid of outstanding > references when cgroup core wants them gone. Swap out records are > deleted, such that the swap-in path will charge those pages to the > faulting task. Page cache pages are moved to the root memory cgroup. ... and kmem pages are ignored. I reckon we could reparent them (I submitted the patch set some time ago), but that's going to be tricky and will complicate regular kmem charge/uncharge paths, as well as list_lru_add/del. I don't think we can put up with it, provided we only want reparenting on unmount, do we not? Come to think of it, I wonder how many users actually want to mount different controllers subset after unmount. Because we could allow mounting the same subset perfectly well, even if it includes memcg. BTW, AFAIU in the unified hierarchy we won't have this problem at all, because by definition it mounts all controllers IIRC, so do we need to bother fixing this in such a complicated manner at all for the setup that's going to be deprecated anyway? Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752007AbbALL2u (ORCPT ); Mon, 12 Jan 2015 06:28:50 -0500 Received: from mail-qg0-f51.google.com ([209.85.192.51]:46397 "EHLO mail-qg0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750725AbbALL2t (ORCPT ); Mon, 12 Jan 2015 06:28:49 -0500 Date: Mon, 12 Jan 2015 06:28:45 -0500 From: Tejun Heo To: Vladimir Davydov Cc: Johannes Weiner , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150112112845.GS25319@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> <20150112080114.GE2110@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150112080114.GE2110@esperanza> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Vladimir. On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote: > Come to think of it, I wonder how many users actually want to mount > different controllers subset after unmount. Because we could allow It wouldn't be a common use case but, on the face of it, we still support it. If we collecctively decide that once a sub cgroup is created for any controller no further hierarchy configuration for that controller is allowed, that'd work too, but one way or the other, the behavior, I believe, should be well-defined. As it currently stands, the conditions and failure mode are opaque to userland, which is never a good thing. > mounting the same subset perfectly well, even if it includes memcg. BTW, > AFAIU in the unified hierarchy we won't have this problem at all, > because by definition it mounts all controllers IIRC, so do we need to > bother fixing this in such a complicated manner at all for the setup > that's going to be deprecated anyway? There will likely be a quite long transition period and if and when the old things can be removed, this added cleanup logic can go away with it. It depends on how complex the implementation would get but as long as it isn't too much and stays mostly isolated from the saner paths, I think it's probably the right thing to do. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752319AbbALNAH (ORCPT ); Mon, 12 Jan 2015 08:00:07 -0500 Received: from mx2.parallels.com ([199.115.105.18]:59201 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750991AbbALNAF (ORCPT ); Mon, 12 Jan 2015 08:00:05 -0500 Date: Mon, 12 Jan 2015 15:59:56 +0300 From: Vladimir Davydov To: Tejun Heo CC: Johannes Weiner , "Suzuki K. Poulose" , , "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150112125956.GF2110@esperanza> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> <20150112080114.GE2110@esperanza> <20150112112845.GS25319@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20150112112845.GS25319@htj.dyndns.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 12, 2015 at 06:28:45AM -0500, Tejun Heo wrote: > On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote: > > Come to think of it, I wonder how many users actually want to mount > > different controllers subset after unmount. Because we could allow > > It wouldn't be a common use case but, on the face of it, we still > support it. If we collecctively decide that once a sub cgroup is > created for any controller no further hierarchy configuration for that > controller is allowed, that'd work too, but one way or the other, the > behavior, I believe, should be well-defined. As it currently stands, > the conditions and failure mode are opaque to userland, which is never > a good thing. > > > mounting the same subset perfectly well, even if it includes memcg. BTW, > > AFAIU in the unified hierarchy we won't have this problem at all, > > because by definition it mounts all controllers IIRC, so do we need to > > bother fixing this in such a complicated manner at all for the setup > > that's going to be deprecated anyway? > > There will likely be a quite long transition period and if and when > the old things can be removed, this added cleanup logic can go away > with it. It depends on how complex the implementation would get but > as long as it isn't too much and stays mostly isolated from the saner > paths, I think it's probably the right thing to do. We can't just move kmem objects from a per-memcg kmem_cache to the global one fixing page counters, because in contrast to page cache and swap we don't even track all kmem allocations. So we have to keep all per-memcg kmem_cache's somewhere after unmount until they can finally be destroyed, but the whole logic behind per-memcg kmem_cache's destruction is currently tightly interwoven with that of css's (we destroy kmem_cache's from css_free), and there won't be any css's after unmount. That said, it isn't possible to add a couple of isolated functions, which will live their own lives and can be easily removed once we've switched to the unified hierarchy. Quite the contrary, implementing of kmem reparenting would make me rethink and complicate kmemcg code all over the place. That's why I'm rather reluctant to do it. I haven't dug deep into the cgroup core, but may be we could detach the old root in cgroup_kill_sb() and leave it dangling until the last reference to it has gone? BTW, IIRC the problem always existed for kmem-active memory cgroups, because we never had kmem reparenting. May be, we could therefore just document somewhere that kmem accounting is highly discouraged to be used in the legacy hierarchy and merge these two patches as is to handle page cache and swap charges? We won't break anything, because it was always broken :-) Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752858AbbALNFh (ORCPT ); Mon, 12 Jan 2015 08:05:37 -0500 Received: from mail-qa0-f53.google.com ([209.85.216.53]:46877 "EHLO mail-qa0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752306AbbALNFJ (ORCPT ); Mon, 12 Jan 2015 08:05:09 -0500 Date: Mon, 12 Jan 2015 08:05:02 -0500 From: Tejun Heo To: Vladimir Davydov Cc: Johannes Weiner , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150112130502.GV25319@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> <20150112080114.GE2110@esperanza> <20150112112845.GS25319@htj.dyndns.org> <20150112125956.GF2110@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150112125956.GF2110@esperanza> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 12, 2015 at 03:59:56PM +0300, Vladimir Davydov wrote: > I haven't dug deep into the cgroup core, but may be we could detach the > old root in cgroup_kill_sb() and leave it dangling until the last > reference to it has gone? The root isn't the problem here. Individual controllers are as there's only one copy of each and we most likely don't want to carry over child csses from one hierarchy to the next as the controller may operate under a different set of rules. > BTW, IIRC the problem always existed for kmem-active memory cgroups, > because we never had kmem reparenting. May be, we could therefore just > document somewhere that kmem accounting is highly discouraged to be used > in the legacy hierarchy and merge these two patches as is to handle page > cache and swap charges? We won't break anything, because it was always > broken :-) If we're going that route, I think it'd be better to declare hierarchy lifetime rules as essentially opaque to userland and destroy hierarchies only when all its children, dead or alive, are gone. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751352AbbALRCS (ORCPT ); Mon, 12 Jan 2015 12:02:18 -0500 Received: from service88.mimecast.com ([195.130.217.12]:47236 "EHLO service88.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751149AbbALRCR convert rfc822-to-8bit (ORCPT ); Mon, 12 Jan 2015 12:02:17 -0500 Date: Mon, 12 Jan 2015 17:02:11 +0000 From: "Suzuki K. Poulose" To: Tejun Heo CC: linux-kernel@vger.kernel.org, vdavydov@parallels.com, hannes@cmpxchg.org, Will.Deacon@arm.com, linux-mm@kvack.org, suzuki.poulose@arm.com Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150112170210.GA1288@e106634-lin.cambridge.arm.com> References: <54B01335.4060901@arm.com> <20150109214649.GF2785@htj.dyndns.org> MIME-Version: 1.0 In-Reply-To: <20150109214649.GF2785@htj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-MC-Unique: 115011217021500802 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: 8BIT Content-Disposition: inline Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 09, 2015 at 09:46:49PM +0000, Tejun Heo wrote: > On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote: > > We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-rc3. > > We are > > in the process of a git bisect and will update the results as and > > when we find the commit. > > > > During the ksm ltp run, the test hangs trying to mount memcg with the > > following strace > > output: > > > > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To > > be restarted) > > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To > > be restarted) > > [ ... repeated forever ... ] > > > > At this point, one can try mounting the memcg to verify the problem. > > # mount -t cgroup -o memory memcg memcg_dir > > --hangs-- > > > > Strangely, if we run the mount command from a cold boot (i.e. without > > running LTP first), > > then it succeeds. > > I don't know what LTP is doing and this could actually be hitting on > an actual bug but if it's trying to move memcg back from unified > hierarchy to an old one, that might hang - it should prolly made to > just fail at that point. Anyways, any chance you can find out what > happened, in terms of cgroup mounting, to memcg upto that point? > This is what the test(ksm03) does, roughly from strace : faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/", F_OK) = 0 faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/merge_across_nodes", F_OK) = -1 ENOENT (No such file or directory) mkdirat(AT_FDCWD, "/dev/cgroup", 0777) = 0 mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = 0 --- set memory limit. Create a new set /dev/cgroups/1 and moves test to that group --- mkdirat(AT_FDCWD, "/dev/cgroup/1", 0777) = 0 openat(AT_FDCWD, "/dev/cgroup/1/memory.limit_in_bytes", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 fstat(3, {st_dev=makedev(0, 24), st_ino=41, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0 mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb2903000 write(3, "1073741824", 10) = 10 close(3) = 0 munmap(0x7fb2903000, 65536) = 0 getpid() = 1324 openat(AT_FDCWD, "/dev/cgroup/1/tasks", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 fstat(3, {st_dev=makedev(0, 24), st_ino=37, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0 mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb2903000 write(3, "1324", 4) = 4 close(3) = 0 munmap(0x7fb2903000, 65536) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1325 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1326 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1327 --- Creates 3 children, perform a lot of memory operations with shared pages verify the ksm for activity and wait for children to exit --- wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1325 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1326 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1327 wait4(-1, 0x7fe5625f3c, WSTOPPED|WCONTINUED, NULL) = -1 ECHILD (No child processes) --- cleanup: Move tasks under /dev/cgroups/1/ to /dev/cgroups/ and delete subdir, umount cgroup --- faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/merge_across_nodes", F_OK) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/dev/cgroup/tasks", O_WRONLY) = 205 openat(AT_FDCWD, "/dev/cgroup/1/tasks", O_RDONLY) = 206 fstat(206, {st_dev=makedev(0, 24), st_ino=37, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0 mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb1c53000 read(206, "1324\n", 4096) = 5 write(205, "1324", 4) = 4 read(206, "", 4096) = 0 close(205) = 0 close(206) = 0 munmap(0x7fb1c53000, 65536) = 0 unlinkat(AT_FDCWD, "/dev/cgroup/1", AT_REMOVEDIR) = 0 umount2("/dev/cgroup", 0) = 0 unlinkat(AT_FDCWD, "/dev/cgroup", AT_REMOVEDIR) = 0 exit_group(0) = ? The next invocation of the same test fails to mount the cgroup memory. Thanks Suzuki > Thanks. > > -- > tejun > -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590 ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751896AbbANLQg (ORCPT ); Wed, 14 Jan 2015 06:16:36 -0500 Received: from service87.mimecast.com ([91.220.42.44]:50593 "EHLO service87.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751139AbbANLQe convert rfc822-to-8bit (ORCPT ); Wed, 14 Jan 2015 06:16:34 -0500 Message-ID: <54B6500D.6080206@arm.com> Date: Wed, 14 Jan 2015 11:16:29 +0000 From: "Suzuki K. Poulose" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Johannes Weiner , Tejun Heo CC: Vladimir Davydov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> In-Reply-To: <20150111205543.GA5480@phnom.home.cmpxchg.org> X-OriginalArrivalTime: 14 Jan 2015 11:16:30.0089 (UTC) FILETIME=[8BAC4F90:01D02FEB] X-MC-Unique: 115011411163207401 Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/01/15 20:55, Johannes Weiner wrote: > On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: >> Currently, if a hierarchy doesn't have any live children when it's >> unmounted, the hierarchy starts dying by killing its refcnt. The >> expectation is that even if there are lingering dead children which >> are lingering due to remaining references, they'll be put in a finite >> amount of time. When the children are finally released, the hierarchy >> is destroyed and all controllers bound to it also are released. >> >> However, for memcg, the premise that the lingering refs will be put in >> a finite amount time is not true. In the absense of memory pressure, >> dead memcg's may hang around indefinitely pinned by its pages. This >> unfortunately may lead to indefinite hang on the next mount attempt >> involving memcg as the mount logic waits for it to get released. >> >> While we can change hierarchy destruction logic such that a hierarchy >> is only destroyed when it's not mounted anywhere and all its children, >> live or dead, are gone, this makes whether the hierarchy gets >> destroyed or not to be determined by factors opaque to userland. >> Userland may or may not get a new hierarchy on the next mount attempt. >> Worse, if it explicitly wants to create a new hierarchy with different >> options or controller compositions involving memcg, it will fail in an >> essentially arbitrary manner. >> >> We want to guarantee that a hierarchy is destroyed once the >> conditions, unmounted and no visible children, are met. To aid it, >> this patch introduces a new callback cgroup_subsys->unbind() which is >> invoked right before the hierarchy a subsystem is bound to starts >> dying. memcg can implement this callback and initiate draining of >> remaining refs so that the hierarchy can eventually be released in a >> finite amount of time. >> >> Signed-off-by: Tejun Heo >> Cc: Li Zefan >> Cc: Johannes Weiner >> Cc: Michal Hocko >> Cc: Vladimir Davydov >> --- >> Hello, >> >>> May be, we should kill the ref counter to the memory controller root in >>> cgroup_kill_sb only if there is no children at all, neither online nor >>> offline. >> >> Ah, thanks for the analysis, but I really wanna avoid making hierarchy >> destruction conditions opaque to userland. This is userland visible >> behavior. It shouldn't be determined by kernel internals invisible >> outside. This patch adds ss->unbind() which memcg can hook into to >> kick off draining of residual refs. If this would work, I'll add this >> patch to cgroup/for-3.19-fixes, possibly with stable cc'd. > > How about this ->unbind() for memcg? > > From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Sun, 11 Jan 2015 10:29:05 -0500 > Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during > unbind > This patch doesn't cleanly apply on 3.19-rc4 for me (hunks in mm/memcontrol.c). I have manually applied it. With these two patches in, I am still getting the failure. Also, the kworker thread is taking up 100% time (unbind_work) and continues to do so even after 6minutes. Is there something I missed ? Thanks Suzuki > Cgroup core assumes that any outstanding css references after > offlining are temporary in nature, and e.g. mount waits for them to > disappear and release the root cgroup. But leftover page cache and > swapout records in an offlined memcg are only dropped when the pages > get reclaimed under pressure or the swapped out pages get faulted in > from other cgroups, and so those cgroup operations can hang forever. > > Implement the ->unbind() callback to actively get rid of outstanding > references when cgroup core wants them gone. Swap out records are > deleted, such that the swap-in path will charge those pages to the > faulting task. Page cache pages are moved to the root memory cgroup. > > Signed-off-by: Johannes Weiner > --- > include/linux/swap_cgroup.h | 6 +++ > mm/memcontrol.c | 126 ++++++++++++++++++++++++++++++++++++++++++++ > mm/swap_cgroup.c | 38 +++++++++++++ > 3 files changed, 170 insertions(+) > > diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h > index 145306bdc92f..ffe0866d2997 100644 > --- a/include/linux/swap_cgroup.h > +++ b/include/linux/swap_cgroup.h > @@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, > unsigned short old, unsigned short new); > extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); > extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); > +extern unsigned long swap_cgroup_zap_records(unsigned short id); > extern int swap_cgroup_swapon(int type, unsigned long max_pages); > extern void swap_cgroup_swapoff(int type); > > @@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) > return 0; > } > > +static inline unsigned long swap_cgroup_zap_records(unsigned short id) > +{ > + return 0; > +} > + > static inline int > swap_cgroup_swapon(int type, unsigned long max_pages) > { > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 692e96407627..40c426add613 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5197,6 +5197,131 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) > mem_cgroup_from_css(root_css)->use_hierarchy = true; > } > > +static void unbind_lru_list(struct mem_cgroup *memcg, > + struct zone *zone, enum lru_list lru) > +{ > + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); > + struct list_head *list = &lruvec->lists[lru]; > + > + while (!list_empty(list)) { > + unsigned int nr_pages; > + unsigned long flags; > + struct page *page; > + > + spin_lock_irqsave(&zone->lru_lock, flags); > + if (list_empty(list)) { > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + break; > + } > + page = list_last_entry(list, struct page, lru); > + if (!get_page_unless_zero(page)) { > + list_move(&page->lru, list); > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + continue; > + } > + BUG_ON(!PageLRU(page)); > + ClearPageLRU(page); > + del_page_from_lru_list(page, lruvec, lru); > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + > + compound_lock(page); > + nr_pages = hpage_nr_pages(page); > + > + if (!mem_cgroup_move_account(page, nr_pages, > + memcg, root_mem_cgroup)) { > + /* > + * root_mem_cgroup page counters are not used, > + * otherwise we'd have to charge them here. > + */ > + page_counter_uncharge(&memcg->memory, nr_pages); > + if (do_swap_account) > + page_counter_uncharge(&memcg->memsw, nr_pages); > + css_put_many(&memcg->css, nr_pages); > + } > + > + compound_unlock(page); > + > + putback_lru_page(page); > + } > +} > + > +static void unbind_work_fn(struct work_struct *work) > +{ > + struct cgroup_subsys_state *css; > +retry: > + drain_all_stock(root_mem_cgroup); > + > + rcu_read_lock(); > + css_for_each_child(css, &root_mem_cgroup->css) { > + struct mem_cgroup *memcg = mem_cgroup_from_css(css); > + > + /* Drop references from swap-out records */ > + if (do_swap_account) { > + long zapped; > + > + zapped = swap_cgroup_zap_records(memcg->css.id); > + page_counter_uncharge(&memcg->memsw, zapped); > + css_put_many(&memcg->css, zapped); > + } > + > + /* Drop references from leftover LRU pages */ > + css_get(css); > + rcu_read_unlock(); > + atomic_inc(&memcg->moving_account); > + synchronize_rcu(); > + while (page_counter_read(&memcg->memory) - > + page_counter_read(&memcg->kmem) > 0) { > + struct zone *zone; > + enum lru_list lru; > + > + lru_add_drain_all(); > + > + for_each_zone(zone) > + for_each_lru(lru) > + unbind_lru_list(memcg, zone, lru); > + > + cond_resched(); > + } > + atomic_dec(&memcg->moving_account); > + rcu_read_lock(); > + css_put(css); > + } > + rcu_read_unlock(); > + /* > + * Swap-in is racy: > + * > + * #0 #1 > + * lookup_swap_cgroup_id() > + * rcu_read_lock() > + * mem_cgroup_lookup() > + * css_tryget_online() > + * rcu_read_unlock() > + * cgroup_kill_sb() > + * !css_has_online_children() > + * ->unbind() > + * page_counter_try_charge() > + * css_put() > + * css_free() > + * pc->mem_cgroup = dead memcg > + * add page to lru > + * > + * Loop until until all references established from previously > + * existing swap-out records have been transferred to pages on > + * the LRU and then uncharged from there. > + */ > + if (!list_empty(&root_mem_cgroup->css.children)) { > + msleep(10); > + goto retry; > + } > +} > + > +static DECLARE_WORK(unbind_work, unbind_work_fn); > + > +static void mem_cgroup_unbind(struct cgroup_subsys_state *root_css) > +{ > + schedule_work(&unbind_work); > +} > + > static u64 memory_current_read(struct cgroup_subsys_state *css, > struct cftype *cft) > { > @@ -5360,6 +5485,7 @@ struct cgroup_subsys memory_cgrp_subsys = { > .cancel_attach = mem_cgroup_cancel_attach, > .attach = mem_cgroup_move_task, > .bind = mem_cgroup_bind, > + .unbind = mem_cgroup_unbind, > .dfl_cftypes = memory_files, > .legacy_cftypes = mem_cgroup_legacy_files, > .early_init = 0, > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c > index b5f7f24b8dd1..665923a558c4 100644 > --- a/mm/swap_cgroup.c > +++ b/mm/swap_cgroup.c > @@ -140,6 +140,44 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) > return lookup_swap_cgroup(ent, NULL)->id; > } > > +/** > + * swap_cgroup_zap_records - delete all swapout records of one cgroup > + * @id: memcg id > + * > + * Returns the number of deleted records. > + */ > +unsigned long swap_cgroup_zap_records(unsigned short id) > +{ > + unsigned long zapped = 0; > + unsigned int type; > + > + for (type = 0; type < MAX_SWAPFILES; type++) { > + struct swap_cgroup_ctrl *ctrl; > + unsigned long flags; > + unsigned int page; > + > + ctrl = &swap_cgroup_ctrl[type]; > + spin_lock_irqsave(&ctrl->lock, flags); > + for (page = 0; page < ctrl->length; page++) { > + struct swap_cgroup *base; > + pgoff_t offset; > + > + base = page_address(ctrl->map[page]); > + for (offset = 0; offset < SC_PER_PAGE; offset++) { > + struct swap_cgroup *sc; > + > + sc = base + offset; > + if (sc->id == id) { > + sc->id = 0; > + zapped++; > + } > + } > + } > + spin_unlock_irqrestore(&ctrl->lock, flags); > + } > + return zapped; > +} > + > int swap_cgroup_swapon(int type, unsigned long max_pages) > { > void *array; > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754790AbbAOR05 (ORCPT ); Thu, 15 Jan 2015 12:26:57 -0500 Received: from mail-we0-f176.google.com ([74.125.82.176]:35674 "EHLO mail-we0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752254AbbAOR04 (ORCPT ); Thu, 15 Jan 2015 12:26:56 -0500 Date: Thu, 15 Jan 2015 18:26:52 +0100 From: Michal Hocko To: Tejun Heo Cc: Vladimir Davydov , "Suzuki K. Poulose" , Johannes Weiner , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150115172652.GF7008@dhcp22.suse.cz> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150110214316.GF25319@htj.dyndns.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat 10-01-15 16:43:16, Tejun Heo wrote: > Currently, if a hierarchy doesn't have any live children when it's > unmounted, the hierarchy starts dying by killing its refcnt. The > expectation is that even if there are lingering dead children which > are lingering due to remaining references, they'll be put in a finite > amount of time. When the children are finally released, the hierarchy > is destroyed and all controllers bound to it also are released. > > However, for memcg, the premise that the lingering refs will be put in > a finite amount time is not true. In the absense of memory pressure, > dead memcg's may hang around indefinitely pinned by its pages. This > unfortunately may lead to indefinite hang on the next mount attempt > involving memcg as the mount logic waits for it to get released. > > While we can change hierarchy destruction logic such that a hierarchy > is only destroyed when it's not mounted anywhere and all its children, > live or dead, are gone, this makes whether the hierarchy gets > destroyed or not to be determined by factors opaque to userland. > Userland may or may not get a new hierarchy on the next mount attempt. > Worse, if it explicitly wants to create a new hierarchy with different > options or controller compositions involving memcg, it will fail in an > essentially arbitrary manner. > > We want to guarantee that a hierarchy is destroyed once the > conditions, unmounted and no visible children, are met. To aid it, > this patch introduces a new callback cgroup_subsys->unbind() which is > invoked right before the hierarchy a subsystem is bound to starts > dying. memcg can implement this callback and initiate draining of > remaining refs so that the hierarchy can eventually be released in a > finite amount of time. > > Signed-off-by: Tejun Heo > Cc: Li Zefan > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vladimir Davydov Ohh, I have missed this one as I wasn't on the CC list. FWIW this approach makes sense to me. I just think that we should have a way to fail. E.g. kmem pages are impossible to reclaim because there might be some objects lingering somewhere not bound to a task context and reparenting is hard as Vladimir has pointed out several times already. Normal LRU pages should be reclaimable or reparented to the root easily. I cannot judge the implementation but I agree with the fact that memcg controller should be the one to take an action. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753170AbbAOR4Q (ORCPT ); Thu, 15 Jan 2015 12:56:16 -0500 Received: from mail-wg0-f51.google.com ([74.125.82.51]:47450 "EHLO mail-wg0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751147AbbAOR4O (ORCPT ); Thu, 15 Jan 2015 12:56:14 -0500 Date: Thu, 15 Jan 2015 18:56:09 +0100 From: Michal Hocko To: Johannes Weiner Cc: Tejun Heo , Vladimir Davydov , "Suzuki K. Poulose" , linux-mm@kvack.org, "linux-kernel@vger.kernel.org" , Will Deacon Subject: Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Message-ID: <20150115175609.GG7008@dhcp22.suse.cz> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <20150110214316.GF25319@htj.dyndns.org> <20150111205543.GA5480@phnom.home.cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150111205543.GA5480@phnom.home.cmpxchg.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 11-01-15 15:55:43, Johannes Weiner wrote: > From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Sun, 11 Jan 2015 10:29:05 -0500 > Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during > unbind > > Cgroup core assumes that any outstanding css references after > offlining are temporary in nature, and e.g. mount waits for them to > disappear and release the root cgroup. But leftover page cache and > swapout records in an offlined memcg are only dropped when the pages > get reclaimed under pressure or the swapped out pages get faulted in > from other cgroups, and so those cgroup operations can hang forever. > > Implement the ->unbind() callback to actively get rid of outstanding > references when cgroup core wants them gone. Swap out records are > deleted, such that the swap-in path will charge those pages to the > faulting task. OK, that makes sense to me. > Page cache pages are moved to the root memory cgroup. OK, this is better than reclaiming them. [...] > +static void unbind_lru_list(struct mem_cgroup *memcg, > + struct zone *zone, enum lru_list lru) > +{ > + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); > + struct list_head *list = &lruvec->lists[lru]; > + > + while (!list_empty(list)) { > + unsigned int nr_pages; > + unsigned long flags; > + struct page *page; > + > + spin_lock_irqsave(&zone->lru_lock, flags); > + if (list_empty(list)) { > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + break; > + } > + page = list_last_entry(list, struct page, lru); taking lru_lock for each page calls for troubles. The lock would bounce like crazy. It shouldn't be a big problem to list_move to a local list and then work on that one without the lock. Those pages wouldn't be visible for the reclaim but that would be only temporary. Or if that is not acceptable then just batch at least some number of pages (popular SWAP_CLUSTER_MAX). > + if (!get_page_unless_zero(page)) { > + list_move(&page->lru, list); > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + continue; > + } > + BUG_ON(!PageLRU(page)); > + ClearPageLRU(page); > + del_page_from_lru_list(page, lruvec, lru); > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + > + compound_lock(page); > + nr_pages = hpage_nr_pages(page); > + > + if (!mem_cgroup_move_account(page, nr_pages, > + memcg, root_mem_cgroup)) { > + /* > + * root_mem_cgroup page counters are not used, > + * otherwise we'd have to charge them here. > + */ > + page_counter_uncharge(&memcg->memory, nr_pages); > + if (do_swap_account) > + page_counter_uncharge(&memcg->memsw, nr_pages); > + css_put_many(&memcg->css, nr_pages); > + } > + > + compound_unlock(page); > + > + putback_lru_page(page); > + } > +} > + > +static void unbind_work_fn(struct work_struct *work) > +{ > + struct cgroup_subsys_state *css; > +retry: > + drain_all_stock(root_mem_cgroup); > + > + rcu_read_lock(); > + css_for_each_child(css, &root_mem_cgroup->css) { > + struct mem_cgroup *memcg = mem_cgroup_from_css(css); > + > + /* Drop references from swap-out records */ > + if (do_swap_account) { > + long zapped; > + > + zapped = swap_cgroup_zap_records(memcg->css.id); > + page_counter_uncharge(&memcg->memsw, zapped); > + css_put_many(&memcg->css, zapped); > + } > + > + /* Drop references from leftover LRU pages */ > + css_get(css); > + rcu_read_unlock(); > + atomic_inc(&memcg->moving_account); > + synchronize_rcu(); Why do we need this? Who can migrate to/from offline memcgs? > + while (page_counter_read(&memcg->memory) - > + page_counter_read(&memcg->kmem) > 0) { > + struct zone *zone; > + enum lru_list lru; > + > + lru_add_drain_all(); > + > + for_each_zone(zone) > + for_each_lru(lru) > + unbind_lru_list(memcg, zone, lru); > + > + cond_resched(); > + } > + atomic_dec(&memcg->moving_account); > + rcu_read_lock(); > + css_put(css); > + } > + rcu_read_unlock(); > + /* > + * Swap-in is racy: > + * > + * #0 #1 > + * lookup_swap_cgroup_id() > + * rcu_read_lock() > + * mem_cgroup_lookup() > + * css_tryget_online() > + * rcu_read_unlock() > + * cgroup_kill_sb() > + * !css_has_online_children() > + * ->unbind() > + * page_counter_try_charge() > + * css_put() > + * css_free() > + * pc->mem_cgroup = dead memcg > + * add page to lru > + * > + * Loop until until all references established from previously > + * existing swap-out records have been transferred to pages on > + * the LRU and then uncharged from there. > + */ > + if (!list_empty(&root_mem_cgroup->css.children)) { But what if kmem pages pin the memcg? We would loop for ever. Or am I missing something? > + msleep(10); > + goto retry; > + } > +} [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752350AbbASMvb (ORCPT ); Mon, 19 Jan 2015 07:51:31 -0500 Received: from service87.mimecast.com ([91.220.42.44]:46526 "EHLO service87.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752273AbbASMva convert rfc822-to-8bit (ORCPT ); Mon, 19 Jan 2015 07:51:30 -0500 Message-ID: <54BCFDCF.9090603@arm.com> Date: Mon, 19 Jan 2015 12:51:27 +0000 From: "Suzuki K. Poulose" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Vladimir Davydov CC: Tejun Heo , Johannes Weiner , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , Will Deacon , mhocko@suse.cz, akpm@linux-foundation.org Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> In-Reply-To: <20150110085525.GD2110@esperanza> X-OriginalArrivalTime: 19 Jan 2015 12:51:27.0250 (UTC) FILETIME=[A382D720:01D033E6] X-MC-Unique: 115011912512800901 Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/01/15 08:55, Vladimir Davydov wrote: > On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote: >> Hi >> >> We have hit a hang on ARM64 defconfig, while running LTP tests on >> 3.19-rc3. We are >> in the process of a git bisect and will update the results as and >> when we find the commit. >> >> During the ksm ltp run, the test hangs trying to mount memcg with >> the following strace >> output: >> >> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? >> ERESTARTNOINTR (To be restarted) >> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? >> ERESTARTNOINTR (To be restarted) >> [ ... repeated forever ... ] >> >> At this point, one can try mounting the memcg to verify the problem. >> # mount -t cgroup -o memory memcg memcg_dir >> --hangs-- >> >> Strangely, if we run the mount command from a cold boot (i.e. >> without running LTP first), >> then it succeeds. >> >> Upon a quick look we are hitting the following code : >> kernel/cgroup.c: cgroup_mount() : >> >> 1779 for_each_subsys(ss, i) { >> 1780 if (!(opts.subsys_mask & (1 << i)) || >> 1781 ss->root == &cgrp_dfl_root) >> 1782 continue; >> 1783 >> 1784 if >> (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { >> 1785 mutex_unlock(&cgroup_mutex); >> 1786 msleep(10); >> 1787 ret = restart_syscall(); <===== >> 1788 goto out_free; >> 1789 } >> 1790 cgroup_put(&ss->root->cgrp); >> 1791 } >> >> with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD >> >> Any ideas? > > The problem is that the memory cgroup controller takes a css reference > per each charged page and does not reparent charged pages on css > offline, while cgroup_mount/cgroup_kill_sb expect all css references to > offline cgroups to be gone soon, restarting the syscall if the ref count > != 0. As a result, if you create a memory cgroup, charge some page cache > to it, and then remove it, unmount/mount will hang forever. > > May be, we should kill the ref counter to the memory controller root in > cgroup_kill_sb only if there is no children at all, neither online nor > offline. > Still reproducible on 3.19-rc5 with the same setup. From git bisect, the last good commit is : commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735 Author: Pranith Kumar Date: Wed Dec 10 15:42:28 2014 -0800 slab: replace smp_read_barrier_depends() with lockless_dereference() Thanks Suzuki From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753092AbbAUQkK (ORCPT ); Wed, 21 Jan 2015 11:40:10 -0500 Received: from foss-mx-na.foss.arm.com ([217.140.108.86]:40259 "EHLO foss-mx-na.foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751758AbbAUQkF (ORCPT ); Wed, 21 Jan 2015 11:40:05 -0500 Date: Wed, 21 Jan 2015 16:39:55 +0000 From: Will Deacon To: "Suzuki K. Poulose" Cc: Vladimir Davydov , Tejun Heo , Johannes Weiner , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150121163955.GM4549@arm.com> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54BCFDCF.9090603@arm.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote: > On 10/01/15 08:55, Vladimir Davydov wrote: > > The problem is that the memory cgroup controller takes a css reference > > per each charged page and does not reparent charged pages on css > > offline, while cgroup_mount/cgroup_kill_sb expect all css references to > > offline cgroups to be gone soon, restarting the syscall if the ref count > > != 0. As a result, if you create a memory cgroup, charge some page cache > > to it, and then remove it, unmount/mount will hang forever. > > > > May be, we should kill the ref counter to the memory controller root in > > cgroup_kill_sb only if there is no children at all, neither online nor > > offline. > > > > Still reproducible on 3.19-rc5 with the same setup. Yeah, I'm seeing the same failure on my setup too. > From git bisect, the last good commit is : > > commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735 > Author: Pranith Kumar > Date: Wed Dec 10 15:42:28 2014 -0800 > > slab: replace smp_read_barrier_depends() with lockless_dereference() So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") as the offending commit. Will From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752436AbbAVNqI (ORCPT ); Thu, 22 Jan 2015 08:46:08 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:48812 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751299AbbAVNqG (ORCPT ); Thu, 22 Jan 2015 08:46:06 -0500 Date: Thu, 22 Jan 2015 08:45:50 -0500 From: Johannes Weiner To: Will Deacon Cc: "Suzuki K. Poulose" , Vladimir Davydov , Tejun Heo , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150122134550.GA13876@phnom.home.cmpxchg.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150121163955.GM4549@arm.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 21, 2015 at 04:39:55PM +0000, Will Deacon wrote: > On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote: > > On 10/01/15 08:55, Vladimir Davydov wrote: > > > The problem is that the memory cgroup controller takes a css reference > > > per each charged page and does not reparent charged pages on css > > > offline, while cgroup_mount/cgroup_kill_sb expect all css references to > > > offline cgroups to be gone soon, restarting the syscall if the ref count > > > != 0. As a result, if you create a memory cgroup, charge some page cache > > > to it, and then remove it, unmount/mount will hang forever. > > > > > > May be, we should kill the ref counter to the memory controller root in > > > cgroup_kill_sb only if there is no children at all, neither online nor > > > offline. > > > > > > > Still reproducible on 3.19-rc5 with the same setup. > > Yeah, I'm seeing the same failure on my setup too. > > > From git bisect, the last good commit is : > > > > commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735 > > Author: Pranith Kumar > > Date: Wed Dec 10 15:42:28 2014 -0800 > > > > slab: replace smp_read_barrier_depends() with lockless_dereference() > > So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") > as the offending commit. With b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"), page cache can pin an old css and its ancestors indefinitely, making that hang in a second mount() very likely. However, swap entries have also been doing that for quite a while now, and as Vladimir pointed out, the same is true for kernel memory. This latest change just makes this existing bug easier to trigger. I think we have to update the lifetime rules to reflect reality here: memory and swap lifetime is indefinite, so once the memory controller is used, it has state that is independent from whether its mounted or not. We can support an identical remount, but have to fail mounting with new parameters that would change the behavior of the controller. Suzuki, Will, could you give the following patch a shot? Tejun, would that route be acceptable to you? Thanks --- >>From c5e88d02d185c52748df664aa30a2c5f8949b0f7 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 22 Jan 2015 08:16:31 -0500 Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller lifetime Since b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"), re-mounting the memory controller after using it is very likely to hang. The cgroup core assumes that any remaining references after deleting a cgroup are temporary in nature, and synchroneously waits for them, but the above-mentioned commit has left-over page cache pin its css until it is reclaimed naturally. That being said, swap entries and charged kernel memory have been doing the same indefinite pinning forever, the bug is just more likely to trigger with left-over page cache. Reparenting kernel memory is highly impractical, which leaves changing the cgroup assumptions to reflect this: once a controller has been mounted and used, it has internal state that is independent from mount and cgroup lifetime. It can be unmounted and remounted, but it can't be reconfigured during subsequent mounts. Don't offline the controller root as long as there are any children, dead or alive. A remount will no longer wait for these old references to drain, it will simply mount the persistent controller state again. Reported-by: "Suzuki K. Poulose" Reported-by: Will Deacon Signed-off-by: Johannes Weiner --- kernel/cgroup.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index bb263d0caab3..9a09308c8066 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, goto out_unlock; } - if (root->flags ^ opts.flags) - pr_warn("new mount options do not match the existing superblock, will be ignored\n"); + if (root->flags ^ opts.flags) { + pr_warn("new mount options do not match the existing superblock\n"); + ret = -EBUSY; + goto out_unlock; + } /* * We want to reuse @root whose lifetime is governed by its @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb) * * And don't kill the default root. */ - if (css_has_online_children(&root->cgrp.self) || + if (!list_empty(&root->cgrp.self.children) || root == &cgrp_dfl_root) cgroup_put(&root->cgrp); else -- 2.2.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754580AbbAVOfA (ORCPT ); Thu, 22 Jan 2015 09:35:00 -0500 Received: from mail-qa0-f51.google.com ([209.85.216.51]:60009 "EHLO mail-qa0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751602AbbAVOe6 (ORCPT ); Thu, 22 Jan 2015 09:34:58 -0500 Date: Thu, 22 Jan 2015 09:34:54 -0500 From: Tejun Heo To: Johannes Weiner Cc: Will Deacon , "Suzuki K. Poulose" , Vladimir Davydov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150122143454.GA4507@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> <20150122134550.GA13876@phnom.home.cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150122134550.GA13876@phnom.home.cmpxchg.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Thu, Jan 22, 2015 at 08:45:50AM -0500, Johannes Weiner wrote: > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index bb263d0caab3..9a09308c8066 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, > goto out_unlock; > } > > - if (root->flags ^ opts.flags) > - pr_warn("new mount options do not match the existing superblock, will be ignored\n"); > + if (root->flags ^ opts.flags) { > + pr_warn("new mount options do not match the existing superblock\n"); > + ret = -EBUSY; > + goto out_unlock; > + } Do we really need the above chunk? > @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb) > * > * And don't kill the default root. > */ > - if (css_has_online_children(&root->cgrp.self) || > + if (!list_empty(&root->cgrp.self.children) || > root == &cgrp_dfl_root) > cgroup_put(&root->cgrp); I tried to do something a bit more advanced so that eventual async release of dying children, if they happen, can also release the hierarchy but I don't think it really matters unless we can forcefully drain. So, shouldn't just the above part be enough? Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752908AbbAVPTz (ORCPT ); Thu, 22 Jan 2015 10:19:55 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:48826 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751584AbbAVPTx (ORCPT ); Thu, 22 Jan 2015 10:19:53 -0500 Date: Thu, 22 Jan 2015 10:19:43 -0500 From: Johannes Weiner To: Tejun Heo Cc: Will Deacon , "Suzuki K. Poulose" , Vladimir Davydov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150122151943.GA27368@phnom.home.cmpxchg.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> <20150122134550.GA13876@phnom.home.cmpxchg.org> <20150122143454.GA4507@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150122143454.GA4507@htj.dyndns.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On Thu, Jan 22, 2015 at 09:34:54AM -0500, Tejun Heo wrote: > On Thu, Jan 22, 2015 at 08:45:50AM -0500, Johannes Weiner wrote: > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > > index bb263d0caab3..9a09308c8066 100644 > > --- a/kernel/cgroup.c > > +++ b/kernel/cgroup.c > > @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, > > goto out_unlock; > > } > > > > - if (root->flags ^ opts.flags) > > - pr_warn("new mount options do not match the existing superblock, will be ignored\n"); > > + if (root->flags ^ opts.flags) { > > + pr_warn("new mount options do not match the existing superblock\n"); > > + ret = -EBUSY; > > + goto out_unlock; > > + } > > Do we really need the above chunk? Inform and ignore or fail hard? I guess we can drop this hunk and keep with the current behavior. > > @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb) > > * > > * And don't kill the default root. > > */ > > - if (css_has_online_children(&root->cgrp.self) || > > + if (!list_empty(&root->cgrp.self.children) || > > root == &cgrp_dfl_root) > > cgroup_put(&root->cgrp); > > I tried to do something a bit more advanced so that eventual async > release of dying children, if they happen, can also release the > hierarchy but I don't think it really matters unless we can forcefully > drain. So, shouldn't just the above part be enough? Yep, I'd be fine with that. --- >>From 3d7ae5aeb16ce6118d8bff17194e791339a1f06c Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 22 Jan 2015 08:16:31 -0500 Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller lifetime Since b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"), re-mounting the memory controller after using it is very likely to hang. The cgroup core assumes that any remaining references after deleting a cgroup are temporary in nature, and synchroneously waits for them, but the above-mentioned commit has left-over page cache pin its css until it is reclaimed naturally. That being said, swap entries and charged kernel memory have been doing the same indefinite pinning forever, the bug is just more likely to trigger with left-over page cache. Reparenting kernel memory is highly impractical, which leaves changing the cgroup assumptions to reflect this: once a controller has been mounted and used, it has internal state that is independent from mount and cgroup lifetime. It can be unmounted and remounted, but it can't be reconfigured during subsequent mounts. Don't offline the controller root as long as there are any children, dead or alive. A remount will no longer wait for these old references to drain, it will simply mount the persistent controller state again. Reported-by: "Suzuki K. Poulose" Reported-by: Will Deacon Signed-off-by: Johannes Weiner --- kernel/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index bb263d0caab3..04cfe8ace520 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1909,7 +1909,7 @@ static void cgroup_kill_sb(struct super_block *sb) * * And don't kill the default root. */ - if (css_has_online_children(&root->cgrp.self) || + if (!list_empty(&root->cgrp.self.children) || root == &cgrp_dfl_root) cgroup_put(&root->cgrp); else -- 2.2.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753012AbbAVP21 (ORCPT ); Thu, 22 Jan 2015 10:28:27 -0500 Received: from mail-qc0-f171.google.com ([209.85.216.171]:58493 "EHLO mail-qc0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750754AbbAVP2V (ORCPT ); Thu, 22 Jan 2015 10:28:21 -0500 Date: Thu, 22 Jan 2015 10:28:17 -0500 From: Tejun Heo To: Johannes Weiner Cc: Will Deacon , "Suzuki K. Poulose" , Vladimir Davydov , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg Message-ID: <20150122152817.GD4507@htj.dyndns.org> References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> <20150122134550.GA13876@phnom.home.cmpxchg.org> <20150122143454.GA4507@htj.dyndns.org> <20150122151943.GA27368@phnom.home.cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150122151943.GA27368@phnom.home.cmpxchg.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 22, 2015 at 10:19:43AM -0500, Johannes Weiner wrote: > From 3d7ae5aeb16ce6118d8bff17194e791339a1f06c Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Thu, 22 Jan 2015 08:16:31 -0500 > Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller > lifetime > > Since b2052564e66d ("mm: memcontrol: continue cache reclaim from > offlined groups"), re-mounting the memory controller after using it is > very likely to hang. > > The cgroup core assumes that any remaining references after deleting a > cgroup are temporary in nature, and synchroneously waits for them, but > the above-mentioned commit has left-over page cache pin its css until > it is reclaimed naturally. That being said, swap entries and charged > kernel memory have been doing the same indefinite pinning forever, the > bug is just more likely to trigger with left-over page cache. > > Reparenting kernel memory is highly impractical, which leaves changing > the cgroup assumptions to reflect this: once a controller has been > mounted and used, it has internal state that is independent from mount > and cgroup lifetime. It can be unmounted and remounted, but it can't > be reconfigured during subsequent mounts. > > Don't offline the controller root as long as there are any children, > dead or alive. A remount will no longer wait for these old references > to drain, it will simply mount the persistent controller state again. > > Reported-by: "Suzuki K. Poulose" > Reported-by: Will Deacon > Signed-off-by: Johannes Weiner Applied to cgroup/for-3.19-fixes. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755424AbbAWPAG (ORCPT ); Fri, 23 Jan 2015 10:00:06 -0500 Received: from service87.mimecast.com ([91.220.42.44]:58907 "EHLO service87.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750827AbbAWPAD convert rfc822-to-8bit (ORCPT ); Fri, 23 Jan 2015 10:00:03 -0500 Message-ID: <54C261F0.9070606@arm.com> Date: Fri, 23 Jan 2015 15:00:00 +0000 From: "Suzuki K. Poulose" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Johannes Weiner , Will Deacon CC: Vladimir Davydov , Tejun Heo , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "mhocko@suse.cz" , "akpm@linux-foundation.org" Subject: Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg References: <54B01335.4060901@arm.com> <20150110085525.GD2110@esperanza> <54BCFDCF.9090603@arm.com> <20150121163955.GM4549@arm.com> <20150122134550.GA13876@phnom.home.cmpxchg.org> In-Reply-To: <20150122134550.GA13876@phnom.home.cmpxchg.org> X-OriginalArrivalTime: 23 Jan 2015 15:00:01.0078 (UTC) FILETIME=[42F69560:01D0371D] X-MC-Unique: 115012315000100201 Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 22/01/15 13:45, Johannes Weiner wrote: > On Wed, Jan 21, 2015 at 04:39:55PM +0000, Will Deacon wrote: >> On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote: >>> On 10/01/15 08:55, Vladimir Davydov wrote: >>>> The problem is that the memory cgroup controller takes a css reference >>>> per each charged page and does not reparent charged pages on css >>>> offline, while cgroup_mount/cgroup_kill_sb expect all css references to >>>> offline cgroups to be gone soon, restarting the syscall if the ref count >>>> != 0. As a result, if you create a memory cgroup, charge some page cache >>>> to it, and then remove it, unmount/mount will hang forever. >>>> >>>> May be, we should kill the ref counter to the memory controller root in >>>> cgroup_kill_sb only if there is no children at all, neither online nor >>>> offline. >>>> >>> >>> Still reproducible on 3.19-rc5 with the same setup. >> >> Yeah, I'm seeing the same failure on my setup too. >> >>> From git bisect, the last good commit is : >>> >>> commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735 >>> Author: Pranith Kumar >>> Date: Wed Dec 10 15:42:28 2014 -0800 >>> >>> slab: replace smp_read_barrier_depends() with lockless_dereference() >> >> So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") >> as the offending commit. > > With b2052564e66d ("mm: memcontrol: continue cache reclaim from > offlined groups"), page cache can pin an old css and its ancestors > indefinitely, making that hang in a second mount() very likely. > > However, swap entries have also been doing that for quite a while now, > and as Vladimir pointed out, the same is true for kernel memory. This > latest change just makes this existing bug easier to trigger. > > I think we have to update the lifetime rules to reflect reality here: > memory and swap lifetime is indefinite, so once the memory controller > is used, it has state that is independent from whether its mounted or > not. We can support an identical remount, but have to fail mounting > with new parameters that would change the behavior of the controller. > > Suzuki, Will, could you give the following patch a shot? > > Tejun, would that route be acceptable to you? > > Thanks > > --- > From c5e88d02d185c52748df664aa30a2c5f8949b0f7 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Thu, 22 Jan 2015 08:16:31 -0500 > Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller > lifetime > > > Don't offline the controller root as long as there are any children, > dead or alive. A remount will no longer wait for these old references > to drain, it will simply mount the persistent controller state again. > > Reported-by: "Suzuki K. Poulose" > Reported-by: Will Deacon > Signed-off-by: Johannes Weiner This one fixes the issue. Tested-by : Suzuki K. Poulose Thanks Suzuki