[PATCH 5/5] cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
To: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: [PATCH 5/5] cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
Date: Thu, 12 Jun 2014 14:33:05 +0800	[thread overview]
Message-ID: <539949A1.90301@huawei.com> (raw)
In-Reply-To: <53994943.60703-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.

Run two instances of this script concurrently:

    for ((; ;))
    {
    	mount -t cgroup -o cpuacct xxx /cgroup
    	umount /cgroup
    }

After a while, I saw two mount processes were stuck at retrying, because
they were waiting for a subsystem to become free, but the root associated
with this subsystem never got freed.

This can happen, if thread A is in the process of killing superblock but
hasn't called percpu_ref_kill(), and at this time thread B is mounting
the same cgroup root and finds the root in the root list and performs
percpu_ref_try_get().

To fix this, we increase the refcnt of the superblock instead of increasing
the percpu refcnt of cgroup root.

Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---

A better fix is welcome!

---
 kernel/cgroup.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bd37e8d..94e1814 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1654,7 +1654,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	struct dentry *dentry;
 	int ret;
 	int i;
-	bool new_sb;
+	bool sb_pinned = false;
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1735,19 +1735,21 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		}
 
 		/*
-		 * A root's lifetime is governed by its root cgroup.
-		 * tryget_live failure indicate that the root is being
-		 * destroyed.  Wait for destruction to complete so that the
-		 * subsystems are free.  We can use wait_queue for the wait
-		 * but this path is super cold.  Let's just sleep for a bit
-		 * and retry.
+		 * This may fail for two reasons:
+		 * - A concurrent mount is in process. We wait for that mount
+		     to complete.
+		 * - The superblock is being destroyed. We wait for the
+		 *   desctruction to complete so that the subsystems are free.
+		 * We can use wait_queue for the wait but this path is super
+		 * cold.  Let's just sleep for a bit and retry.
 		 */
-		if (!percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
+		if (!kernfs_pin_sb(root->kf_root, NULL)) {
 			mutex_unlock(&cgroup_mutex);
 			msleep(10);
 			ret = restart_syscall();
 			goto out_free;
 		}
+		sb_pinned = true;
 
 		ret = 0;
 		goto out_unlock;
@@ -1784,8 +1786,10 @@ out_free:
 	if (ret)
 		return ERR_PTR(ret);
 
-	dentry = kernfs_mount(fs_type, flags, root->kf_root, &new_sb);
-	if (IS_ERR(dentry) || !new_sb)
+	dentry = kernfs_mount(fs_type, flags, root->kf_root, NULL);
+	if (sb_pinned)
+		kernfs_drop_sb(root->kf_root, NULL);
+	if (!sb_pinned && IS_ERR(dentry))
 		cgroup_put(&root->cgrp);
 	return dentry;
 }
-- 
1.8.0.2

WARNING: multiple messages have this Message-ID (diff)

From: Li Zefan <lizefan@huawei.com>
To: Tejun Heo <tj@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>, Cgroups <cgroups@vger.kernel.org>
Subject: [PATCH 5/5] cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
Date: Thu, 12 Jun 2014 14:33:05 +0800	[thread overview]
Message-ID: <539949A1.90301@huawei.com> (raw)
In-Reply-To: <53994943.60703@huawei.com>

We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.

Run two instances of this script concurrently:

    for ((; ;))
    {
    	mount -t cgroup -o cpuacct xxx /cgroup
    	umount /cgroup
    }

After a while, I saw two mount processes were stuck at retrying, because
they were waiting for a subsystem to become free, but the root associated
with this subsystem never got freed.

This can happen, if thread A is in the process of killing superblock but
hasn't called percpu_ref_kill(), and at this time thread B is mounting
the same cgroup root and finds the root in the root list and performs
percpu_ref_try_get().

To fix this, we increase the refcnt of the superblock instead of increasing
the percpu refcnt of cgroup root.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---

A better fix is welcome!

---
 kernel/cgroup.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bd37e8d..94e1814 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1654,7 +1654,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	struct dentry *dentry;
 	int ret;
 	int i;
-	bool new_sb;
+	bool sb_pinned = false;
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1735,19 +1735,21 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		}
 
 		/*
-		 * A root's lifetime is governed by its root cgroup.
-		 * tryget_live failure indicate that the root is being
-		 * destroyed.  Wait for destruction to complete so that the
-		 * subsystems are free.  We can use wait_queue for the wait
-		 * but this path is super cold.  Let's just sleep for a bit
-		 * and retry.
+		 * This may fail for two reasons:
+		 * - A concurrent mount is in process. We wait for that mount
+		     to complete.
+		 * - The superblock is being destroyed. We wait for the
+		 *   desctruction to complete so that the subsystems are free.
+		 * We can use wait_queue for the wait but this path is super
+		 * cold.  Let's just sleep for a bit and retry.
 		 */
-		if (!percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
+		if (!kernfs_pin_sb(root->kf_root, NULL)) {
 			mutex_unlock(&cgroup_mutex);
 			msleep(10);
 			ret = restart_syscall();
 			goto out_free;
 		}
+		sb_pinned = true;
 
 		ret = 0;
 		goto out_unlock;
@@ -1784,8 +1786,10 @@ out_free:
 	if (ret)
 		return ERR_PTR(ret);
 
-	dentry = kernfs_mount(fs_type, flags, root->kf_root, &new_sb);
-	if (IS_ERR(dentry) || !new_sb)
+	dentry = kernfs_mount(fs_type, flags, root->kf_root, NULL);
+	if (sb_pinned)
+		kernfs_drop_sb(root->kf_root, NULL);
+	if (!sb_pinned && IS_ERR(dentry))
 		cgroup_put(&root->cgrp);
 	return dentry;
 }
-- 
1.8.0.2

next prev parent reply	other threads:[~2014-06-12  6:33 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-12  6:31 [PATCH 1/5] cgroup: fix broken css_has_online_children() Li Zefan
2014-06-12  6:31 ` Li Zefan
     [not found] ` <53994943.60703-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2014-06-12  6:31   ` [PATCH 2/5] percpu-ref: introduce percpu_ref_alive() Li Zefan
2014-06-12  6:31     ` Li Zefan
2014-06-12  6:33   ` Li Zefan [this message]
2014-06-12  6:33     ` [PATCH 5/5] cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() Li Zefan
     [not found]     ` <539949A1.90301-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2014-06-20 19:35       ` Tejun Heo
2014-06-20 19:35         ` Tejun Heo
     [not found]         ` <20140620193521.GB28324-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2014-06-24  1:22           ` Li Zefan
2014-06-24  1:22             ` Li Zefan
     [not found]             ` <53A8D2B8.4080107-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2014-06-24 21:01               ` Tejun Heo
2014-06-24 21:01                 ` Tejun Heo
     [not found]                 ` <20140624210119.GC14909-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-25  1:56                   ` Li Zefan
2014-06-25  1:56                     ` Li Zefan
2014-06-25 15:00                     ` Tejun Heo
     [not found]                       ` <20140625150053.GE26883-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-27  6:32                         ` Li Zefan
2014-06-27  6:32                           ` Li Zefan
     [not found]                           ` <53AD1001.4090405-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2014-06-27 15:00                             ` Tejun Heo
2014-06-27 15:00                               ` Tejun Heo
2014-06-17 19:26   ` [PATCH 1/5] cgroup: fix broken css_has_online_children() Tejun Heo
2014-06-17 19:26     ` Tejun Heo
2014-06-12  6:32 ` [PATCH 3/5] cgroup: fix mount failure in a corner case Li Zefan
     [not found]   ` <5399496D.6060003-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2014-06-20 19:10     ` Tejun Heo
2014-06-20 19:10       ` Tejun Heo
2014-06-24  1:15       ` Li Zefan
2014-06-12  6:32 ` [PATCH 4/5] kernfs: introduce kernfs_pin_sb() and kernfs_drop_sb() Li Zefan

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:bd37e8d dfblob:94e1814 dfblob:bd37e8d dfblob:94e1814 )
 OR (
bs:"[PATCH 5/5] cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=539949A1.90301@huawei.com \
    --to=lizefan-hv44wf8li93qt0dzr+alfa@public.gmane.org \
    --cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.