[PATCH -next] cgroup: fix uaf when proc_cpuset

public inbox for cgroups@vger.kernel.org

* [PATCH -next] cgroup: fix uaf when proc_cpuset_show
@ 2024-06-22 11:38 Chen Ridong
  2024-06-22 13:45 ` Markus Elfring
  2024-06-22 15:05 ` Waiman Long
  0 siblings, 2 replies; 16+ messages in thread
From: Chen Ridong @ 2024-06-22 11:38 UTC (permalink / raw)
  To: tj, lizefan.x, hannes, longman; +Cc: bpf, cgroups, linux-kernel

We found a refcount UAF bug as follows:

BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
Read of size 8 at addr ffff8882a4b242b8 by task atop/19903

CPU: 27 PID: 19903 Comm: atop Kdump: loaded Tainted: GF
Call Trace:
 dump_stack+0x7d/0xa7
 print_address_description.constprop.0+0x19/0x170
 ? cgroup_path_ns+0x112/0x150
 __kasan_report.cold+0x6c/0x84
 ? print_unreferenced+0x390/0x3b0
 ? cgroup_path_ns+0x112/0x150
 kasan_report+0x3a/0x50
 cgroup_path_ns+0x112/0x150
 proc_cpuset_show+0x164/0x530
 proc_single_show+0x10f/0x1c0
 seq_read_iter+0x405/0x1020
 ? aa_path_link+0x2e0/0x2e0
 seq_read+0x324/0x500
 ? seq_read_iter+0x1020/0x1020
 ? common_file_perm+0x2a1/0x4a0
 ? fsnotify_unmount_inodes+0x380/0x380
 ? bpf_lsm_file_permission_wrapper+0xa/0x30
 ? security_file_permission+0x53/0x460
 vfs_read+0x122/0x420
 ksys_read+0xed/0x1c0
 ? __ia32_sys_pwrite64+0x1e0/0x1e0
 ? __audit_syscall_exit+0x741/0xa70
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x67/0xcc

This is also reported by: https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

This can be reproduced as follows:
1. Add an mdelay(1000) before acquiring the cgroup_lock in the
   cgroup_path_ns function.
2. Run $cat /proc/<pid>/cpuset repeatedly.
3. Run $mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/ and
   $umount /sys/fs/cgroup/cpuset/ repeatedly.

The race that causes this bug is shown below:

(umount)		|	(cat /proc/<pid>/cpuset)
css_release		|	proc_cpuset_show
css_release_work_fn	|	css = task_get_css(tsk, cpuset_cgrp_id);
css_free_rwork_fn	|	cgroup_path_ns(css->cgroup, ...);
cgroup_destroy_root	|	mutex_lock(&cgroup_mutex);
rebind_subsystems	|
cgroup_free_root 	|
			|	// cgrp was freed, UAF
			|	cgroup_path_ns_locked(cgrp,..);

When the cpuset is initialized, the root node top_cpuset.css.cgroup
points to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation
allocates a cgroup_root, and top_cpuset.css.cgroup then points to the
allocated &cgroup_root.cgrp. When the umount operation is executed,
top_cpuset.css.cgroup is rebound to &cgrp_dfl_root.cgrp.

The problem is that when rebinding to cgrp_dfl_root, a reader may still
hold a cached pointer to the cgroup_root that was allocated when setting
up the root for cgroup v1. Dereferencing that pointer after the
cgroup_root is freed is a use-after-free (UAF). The descendant cgroups
of cgroup v1 can only be freed after their css is released. However, the
css of the root is never released, yet the cgroup_root is freed when it
is unmounted. This means that obtaining a reference to the css of the
root does not guarantee that css.cgroup->root will not be freed.

To solve this issue, take a cgroup reference in the proc_cpuset_show
function to ensure that css.cgroup->root is not freed prematurely. This
is a temporary solution. Let's see if anyone has a better solution.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/cgroup/cpuset.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c12b9fdb22a4..782eaf807173 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -5045,6 +5045,7 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
 	char *buf;
 	struct cgroup_subsys_state *css;
 	int retval;
+	struct cgroup *root_cgroup = NULL;

 	retval = -ENOMEM;
 	buf = kmalloc(PATH_MAX, GFP_KERNEL);
@@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
 		goto out;

 	css = task_get_css(tsk, cpuset_cgrp_id);
+	rcu_read_lock();
+	/*
+	 * When the cpuset subsystem is mounted on the legacy hierarchy,
+	 * the top_cpuset.css->cgroup does not hold a reference count of
+	 * cgroup_root.cgroup. This makes accessing css->cgroup very
+	 * dangerous because when the cpuset subsystem is remounted to the
+	 * default hierarchy, the cgroup_root.cgroup that css->cgroup points
+	 * to will be released, leading to a UAF issue. To avoid this problem,
+	 * get the reference count of top_cpuset.css->cgroup first.
+	 *
+	 * This is ugly!!
+	 */
+	if (css == &top_cpuset.css) {
+		cgroup_get(css->cgroup);
+		root_cgroup = css->cgroup;
+	}
+	rcu_read_unlock();
 	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
 				current->nsproxy->cgroup_ns);
 	css_put(css);
+	if (root_cgroup)
+		cgroup_put(root_cgroup);
 	if (retval == -E2BIG)
 		retval = -ENAMETOOLONG;
 	if (retval < 0)
-- 
2.34.1

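A minimal userspace sketch of the lifetime problem described in the message above, assuming the mount holds the only reference on the v1 root. The types and helpers (model_cgroup, model_css, model_umount, model_reader_survives) are hypothetical stand-ins rather than kernel APIs, and a freed flag replaces a real free() so that the outcome can be inspected safely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct cgroup and struct cgroup_subsys_state. */
struct model_cgroup {
	int refcnt;	/* references pinning this cgroup */
	bool freed;	/* set instead of calling free(), so it can be inspected */
};

struct model_css {
	int refcnt;			/* css reference count */
	struct model_cgroup *cgroup;	/* borrowed pointer: holds no reference */
};

static void model_cgroup_get(struct model_cgroup *cgrp)
{
	cgrp->refcnt++;
}

static void model_cgroup_put(struct model_cgroup *cgrp)
{
	assert(cgrp->refcnt > 0);
	if (--cgrp->refcnt == 0)
		cgrp->freed = true;
}

/* Models umount: rebind the css to the default root, drop the mount's ref. */
static void model_umount(struct model_css *css, struct model_cgroup *v1_root,
			 struct model_cgroup *dfl_root)
{
	css->cgroup = dfl_root;
	model_cgroup_put(v1_root);
}

/*
 * Returns true if the cgroup pointer sampled by the reader is still alive
 * after a concurrent umount. pin_cgroup selects whether the reader takes
 * its own cgroup reference before the umount runs.
 */
static bool model_reader_survives(bool pin_cgroup)
{
	struct model_cgroup dfl_root = { .refcnt = 1 };
	struct model_cgroup v1_root = { .refcnt = 1 };	/* held by the mount */
	struct model_css top = { .refcnt = 1, .cgroup = &v1_root };
	struct model_cgroup *seen;
	bool alive;

	top.refcnt++;		/* like task_get_css(): pins the css only */
	seen = top.cgroup;	/* reader samples css->cgroup */
	if (pin_cgroup)
		model_cgroup_get(seen);

	model_umount(&top, &v1_root, &dfl_root);	/* the racing umount */

	alive = !seen->freed;	/* where cgroup_path_ns() would dereference */
	if (pin_cgroup)
		model_cgroup_put(seen);
	top.refcnt--;		/* like css_put() */
	return alive;
}
```

In this model, model_reader_survives(false) reports that the sampled cgroup was freed while the css reference was still held, and model_reader_survives(true) reports that it stayed alive. The model does not capture the window between task_get_css() and cgroup_get(), which the posted patch wraps in rcu_read_lock()/rcu_read_unlock().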

* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-22 11:38 [PATCH -next] cgroup: fix uaf when proc_cpuset_show Chen Ridong
@ 2024-06-22 13:45 ` Markus Elfring
  2024-06-24  3:34   ` chenridong
  2024-06-22 15:05 ` Waiman Long
  1 sibling, 1 reply; 16+ messages in thread
From: Markus Elfring @ 2024-06-22 13:45 UTC (permalink / raw)
  To: Chen Ridong, cgroups, bpf, Johannes Weiner, Tejun Heo,
	Waiman Long, Zefan Li
  Cc: LKML

> We found a refcount UAF bug as follows:
>
> BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
…

What do you think about using a summary phrase like “Avoid use-after-free
in proc_cpuset_show()”?


> This is also reported by: https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

Would you like to add any tags (like “Fixes”) accordingly?


…
> +++ b/kernel/cgroup/cpuset.c
…
> @@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
>  		goto out;
>
>  	css = task_get_css(tsk, cpuset_cgrp_id);
> +	rcu_read_lock();
…
> +	rcu_read_unlock();
>  	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>  				current->nsproxy->cgroup_ns);
…

Would you be interested in using a statement like “guard(rcu_read_lock)();”?
https://elixir.bootlin.com/linux/v6.10-rc4/source/include/linux/cleanup.h#L133

Regards,
Markus


* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-22 11:38 [PATCH -next] cgroup: fix uaf when proc_cpuset_show Chen Ridong
  2024-06-22 13:45 ` Markus Elfring
@ 2024-06-22 15:05 ` Waiman Long
  2024-06-22 20:04   ` [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show() Markus Elfring
  2024-06-24  2:59   ` [PATCH -next] cgroup: fix uaf when proc_cpuset_show chenridong
  1 sibling, 2 replies; 16+ messages in thread
From: Waiman Long @ 2024-06-22 15:05 UTC (permalink / raw)
  To: Chen Ridong, tj, lizefan.x, hannes; +Cc: bpf, cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4819 bytes --]


On 6/22/24 07:38, Chen Ridong wrote:
> We found a refcount UAF bug as follows:
>
> BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
> Read of size 8 at addr ffff8882a4b242b8 by task atop/19903
>
> CPU: 27 PID: 19903 Comm: atop Kdump: loaded Tainted: GF
> Call Trace:
>   dump_stack+0x7d/0xa7
>   print_address_description.constprop.0+0x19/0x170
>   ? cgroup_path_ns+0x112/0x150
>   __kasan_report.cold+0x6c/0x84
>   ? print_unreferenced+0x390/0x3b0
>   ? cgroup_path_ns+0x112/0x150
>   kasan_report+0x3a/0x50
>   cgroup_path_ns+0x112/0x150
>   proc_cpuset_show+0x164/0x530
>   proc_single_show+0x10f/0x1c0
>   seq_read_iter+0x405/0x1020
>   ? aa_path_link+0x2e0/0x2e0
>   seq_read+0x324/0x500
>   ? seq_read_iter+0x1020/0x1020
>   ? common_file_perm+0x2a1/0x4a0
>   ? fsnotify_unmount_inodes+0x380/0x380
>   ? bpf_lsm_file_permission_wrapper+0xa/0x30
>   ? security_file_permission+0x53/0x460
>   vfs_read+0x122/0x420
>   ksys_read+0xed/0x1c0
>   ? __ia32_sys_pwrite64+0x1e0/0x1e0
>   ? __audit_syscall_exit+0x741/0xa70
>   do_syscall_64+0x33/0x40
>   entry_SYSCALL_64_after_hwframe+0x67/0xcc
>
> This is also reported by: https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
>
> This can be reproduced by the following methods:
> 1.add an mdelay(1000) before acquiring the cgroup_lock In the
>   cgroup_path_ns function.
> 2.$cat /proc/<pid>/cpuset   repeatly.
> 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
> $umount /sys/fs/cgroup/cpuset/   repeatly.
>
> The race that cause this bug can be shown as below:
>
> (umount)		|	(cat /proc/<pid>/cpuset)
> css_release		|	proc_cpuset_show
> css_release_work_fn	|	css = task_get_css(tsk, cpuset_cgrp_id);
> css_free_rwork_fn	|	cgroup_path_ns(css->cgroup, ...);
> cgroup_destroy_root	|	mutex_lock(&cgroup_mutex);
> rebind_subsystems	|
> cgroup_free_root 	|
> 			|	// cgrp was freed, UAF
> 			|	cgroup_path_ns_locked(cgrp,..);
>
> When the cpuset is initialized, the root node top_cpuset.css.cgrp
> will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will
> allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated
> &cgroup_root.cgrp. When the umount operation is executed,
> top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.
>
> The problem is that when rebinding to cgrp_dfl_root, there are cases
> where the cgroup_root allocated by setting up the root for cgroup v1
> is cached. This could lead to a Use-After-Free (UAF) if it is
> subsequently freed. The descendant cgroups of cgroup v1 can only be
> freed after the css is released. However, the css of the root will never
> be released, yet the cgroup_root should be freed when it is unmounted.
> This means that obtaining a reference to the css of the root does
> not guarantee that css.cgrp->root will not be freed.
>
> To solve this issue, we have added a cgroup reference count in
> the proc_cpuset_show function to ensure that css.cgrp->root will not
> be freed prematurely. This is a temporary solution. Let's see if anyone
> has a better solution.
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>   kernel/cgroup/cpuset.c | 20 ++++++++++++++++++++
>   1 file changed, 20 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index c12b9fdb22a4..782eaf807173 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -5045,6 +5045,7 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
>   	char *buf;
>   	struct cgroup_subsys_state *css;
>   	int retval;
> +	struct cgroup *root_cgroup = NULL;
>   
>   	retval = -ENOMEM;
>   	buf = kmalloc(PATH_MAX, GFP_KERNEL);
> @@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
>   		goto out;
>   
>   	css = task_get_css(tsk, cpuset_cgrp_id);
> +	rcu_read_lock();
> +	/*
> +	 * When the cpuset subsystem is mounted on the legacy hierarchy,
> +	 * the top_cpuset.css->cgroup does not hold a reference count of
> +	 * cgroup_root.cgroup. This makes accessing css->cgroup very
> +	 * dangerous because when the cpuset subsystem is remounted to the
> +	 * default hierarchy, the cgroup_root.cgroup that css->cgroup points
> +	 * to will be released, leading to a UAF issue. To avoid this problem,
> +	 * get the reference count of top_cpuset.css->cgroup first.
> +	 *
> +	 * This is ugly!!
> +	 */
> +	if (css == &top_cpuset.css) {
> +		cgroup_get(css->cgroup);
> +		root_cgroup = css->cgroup;
> +	}
> +	rcu_read_unlock();
>   	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>   				current->nsproxy->cgroup_ns);
>   	css_put(css);
> +	if (root_cgroup)
> +		cgroup_put(root_cgroup);
>   	if (retval == -E2BIG)
>   		retval = -ENAMETOOLONG;
>   	if (retval < 0)

Thanks for reporting this UAF bug. Could you try the attached patch to 
see if it can fix the issue?

Cheers,
Longman

[-- Attachment #2: 0001-cgroup-cpuset-Prevent-UAF-in-proc_cpuset_show.patch --]
[-- Type: text/x-patch, Size: 3357 bytes --]

From 11036d027cc1f3dd0a6045794fb87711c840f426 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Sat, 22 Jun 2024 10:25:15 -0400
Subject: [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show()

A UAF can happen when /proc/<pid>/cpuset is read, as reported in [1].

When the cpuset is initialized, the root node top_cpuset.css.cgroup
points to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation
allocates a cgroup_root, and top_cpuset.css.cgroup then points to the
allocated &cgroup_root.cgrp. When the umount operation is executed,
top_cpuset.css.cgroup is rebound to &cgrp_dfl_root.cgrp.

The problem is that when rebinding to cgrp_dfl_root, a reader may still
hold a cached pointer to the cgroup_root that was allocated when setting
up the root for cgroup v1. Dereferencing that pointer after the
cgroup_root is freed is a use-after-free (UAF). The descendant cgroups
of cgroup v1 can only be freed after their css is released. However, the
css of the root is never released, yet the cgroup_root is freed when it
is unmounted. This means that obtaining a reference to the css of the
root does not guarantee that css.cgroup->root will not be freed.

Fix this problem by taking a reference to the v1 cgroup root in
cpuset_bind() and releasing it in the next cpuset_bind() call. The
top_cpuset will always be bound to either cgrp_dfl_root or the
allocated v1 cgroup root. So top_cpuset will always be remounted back
to cgrp_dfl_root whenever a v1 cpuset mount is released.

Access to css->cgroup in proc_cpuset_show() is now protected under
the cpuset_mutex to make sure that a UAF access to css->cgroup is
not possible.

[1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

Reported-by: Chen Ridong <chenridong@huawei.com>
Closes: https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c12b9fdb22a4..8155ad9ff927 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4143,9 +4143,20 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 	free_cpuset(cs);
 }
 
+/*
+ * With a cgroup v1 mount, root_css.cgroup can be freed. We need to take a
+ * reference to it to avoid UAF as proc_cpuset_show() may access the content
+ * of this cgroup.
+ */
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
+	static struct cgroup *v1_cgroup_root;
+
 	mutex_lock(&cpuset_mutex);
+	if (v1_cgroup_root) {
+		cgroup_put(v1_cgroup_root);
+		v1_cgroup_root = NULL;
+	}
 	spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
@@ -4159,6 +4170,10 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 	}
 
 	spin_unlock_irq(&callback_lock);
+	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+		v1_cgroup_root = root_css->cgroup;
+		cgroup_get(v1_cgroup_root);
+	}
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -5051,10 +5066,12 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
 	if (!buf)
 		goto out;
 
+	mutex_lock(&cpuset_mutex);
 	css = task_get_css(tsk, cpuset_cgrp_id);
 	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
 				current->nsproxy->cgroup_ns);
 	css_put(css);
+	mutex_unlock(&cpuset_mutex);
 	if (retval == -E2BIG)
 		retval = -ENAMETOOLONG;
 	if (retval < 0)
-- 
2.39.3


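The pin-at-bind idea in the attached patch can be illustrated with a small userspace sketch. Everything here (pin_cgroup, pin_get, pin_put, pin_bind, pinned_v1_root) is a hypothetical model and not the kernel code; it only shows the bookkeeping pattern of dropping the previous pin and taking a new one when the binding is to a v1 root. Locking is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted cgroup. */
struct pin_cgroup {
	int refcnt;
};

static void pin_get(struct pin_cgroup *cgrp)
{
	cgrp->refcnt++;
}

static void pin_put(struct pin_cgroup *cgrp)
{
	assert(cgrp->refcnt > 0);
	cgrp->refcnt--;
}

/* Reference held on behalf of the current v1 binding, or NULL. */
static struct pin_cgroup *pinned_v1_root;

/*
 * Models the bind callback: release the pin taken by the previous call,
 * then pin the new root only when it is not the default hierarchy.
 */
static void pin_bind(struct pin_cgroup *root, bool on_dfl)
{
	if (pinned_v1_root) {
		pin_put(pinned_v1_root);
		pinned_v1_root = NULL;
	}
	if (!on_dfl) {
		pinned_v1_root = root;
		pin_get(pinned_v1_root);
	}
}
```

Binding to a v1 root raises its count by one, and the following bind to the default root drops it again, so at most one extra reference is outstanding at any time. Whether the second bind is ever reached while the pin is held is a separate question that this sketch does not answer.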

* Re: [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show()
  2024-06-22 15:05 ` Waiman Long
@ 2024-06-22 20:04   ` Markus Elfring
  2024-06-22 20:12     ` Waiman Long
  2024-06-24  2:59   ` [PATCH -next] cgroup: fix uaf when proc_cpuset_show chenridong
  1 sibling, 1 reply; 16+ messages in thread
From: Markus Elfring @ 2024-06-22 20:04 UTC (permalink / raw)
  To: Waiman Long, Chen Ridong, cgroups, bpf, Johannes Weiner,
	Tejun Heo, Zefan Li
  Cc: LKML

…
> +++ b/kernel/cgroup/cpuset.c
…
> @@ -5051,10 +5066,12 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
>  	if (!buf)
>  		goto out;
>
> +	mutex_lock(&cpuset_mutex);
>  	css = task_get_css(tsk, cpuset_cgrp_id);
>  	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>  				current->nsproxy->cgroup_ns);
>  	css_put(css);
> +	mutex_unlock(&cpuset_mutex);
…

Under which circumstances would you be interested in using a statement
like “guard(mutex)(&cpuset_mutex);”?
https://elixir.bootlin.com/linux/v6.10-rc4/source/include/linux/mutex.h#L196

Regards,
Markus


* Re: [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show()
  2024-06-22 20:04   ` [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show() Markus Elfring
@ 2024-06-22 20:12     ` Waiman Long
  2024-06-23  6:18       ` Markus Elfring
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2024-06-22 20:12 UTC (permalink / raw)
  To: Markus Elfring, Chen Ridong, cgroups, bpf, Johannes Weiner,
	Tejun Heo, Zefan Li
  Cc: LKML

On 6/22/24 16:04, Markus Elfring wrote:
> …
>> +++ b/kernel/cgroup/cpuset.c
> …
>> @@ -5051,10 +5066,12 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
>>   	if (!buf)
>>   		goto out;
>>
>> +	mutex_lock(&cpuset_mutex);
>>   	css = task_get_css(tsk, cpuset_cgrp_id);
>>   	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>   				current->nsproxy->cgroup_ns);
>>   	css_put(css);
>> +	mutex_unlock(&cpuset_mutex);
> …
>
> Under which circumstances would you become interested to apply a statement
> like “guard(mutex)(&cpuset_mutex);”?
> https://elixir.bootlin.com/linux/v6.10-rc4/source/include/linux/mutex.h#L196

A mutex guard will be more appropriate if there is an error exit case
that needs to be handled. Otherwise, the simple lock/unlock is more
straightforward and easier to understand.

Cheers,
Longman



* Re: [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show()
  2024-06-22 20:12     ` Waiman Long
@ 2024-06-23  6:18       ` Markus Elfring
  2024-06-23 16:28         ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Markus Elfring @ 2024-06-23  6:18 UTC (permalink / raw)
  To: Waiman Long, Chen Ridong, cgroups, bpf, Johannes Weiner,
	Tejun Heo, Zefan Li
  Cc: LKML, Julia Lawall, Peter Zijlstra

>> …
>>> +++ b/kernel/cgroup/cpuset.c
>> …
>>> @@ -5051,10 +5066,12 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
>>>       if (!buf)
>>>           goto out;
>>>
>>> +    mutex_lock(&cpuset_mutex);
>>>       css = task_get_css(tsk, cpuset_cgrp_id);
>>>       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>>                   current->nsproxy->cgroup_ns);
>>>       css_put(css);
>>> +    mutex_unlock(&cpuset_mutex);
>> …
>>
>> Under which circumstances would you become interested to apply a statement
>> like “guard(mutex)(&cpuset_mutex);”?
>> https://elixir.bootlin.com/linux/v6.10-rc4/source/include/linux/mutex.h#L196
>
> A mutex guard will be more appropriate if there is an error exit case that needs to be handled.

Lock guards can help to shorten and improve the source code a bit
further, can't they?


> Otherwise, it is more straight forward and easier to understand with the simple lock/unlock.

Could this reluctance to change be reconsidered?

Regards,
Markus


* Re: [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show()
  2024-06-23  6:18       ` Markus Elfring
@ 2024-06-23 16:28         ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2024-06-23 16:28 UTC (permalink / raw)
  To: Markus Elfring, Chen Ridong, cgroups, bpf, Johannes Weiner,
	Tejun Heo, Zefan Li
  Cc: LKML, Julia Lawall, Peter Zijlstra


On 6/23/24 02:18, Markus Elfring wrote:
>>> …
>>>> +++ b/kernel/cgroup/cpuset.c
>>> …
>>>> @@ -5051,10 +5066,12 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
>>>>        if (!buf)
>>>>            goto out;
>>>>
>>>> +    mutex_lock(&cpuset_mutex);
>>>>        css = task_get_css(tsk, cpuset_cgrp_id);
>>>>        retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>>>                    current->nsproxy->cgroup_ns);
>>>>        css_put(css);
>>>> +    mutex_unlock(&cpuset_mutex);
>>> …
>>>
>>> Under which circumstances would you become interested to apply a statement
>>> like “guard(mutex)(&cpuset_mutex);”?
>>> https://elixir.bootlin.com/linux/v6.10-rc4/source/include/linux/mutex.h#L196
>> A mutex guard will be more appropriate if there is an error exit case that needs to be handled.
> Lock guards can help to reduce and improve source code another bit,
> can't they?

For a simple lock critical section, there isn't much difference in
terms of readability between using a lock guard and a normal lock/unlock
call. If there are multiple exit points in the critical section, a lock
guard can help to simplify the code. For those situations, I will
certainly try to use a lock guard.

Another reason that I go with the normal lock/unlock is that none of the
cpuset_mutex lock/unlock sites in cpuset.c uses a lock guard yet, and
there is no good reason to introduce something different from the other
call sites.

Cheers,
Longman



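The trade-off discussed in this sub-thread can be sketched in userspace C. The example below is not the kernel's guard() macro; DEMO_GUARD and the demo_* helpers are hypothetical, and the scope-based unlock relies on the GCC/Clang cleanup attribute. The demo_held flag exists only so that the lock state can be checked after each call.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_held;	/* 1 while demo_lock is held; inspection aid only */

static void demo_acquire(void)
{
	pthread_mutex_lock(&demo_lock);
	demo_held = 1;
}

static void demo_release(void)
{
	assert(demo_held);
	demo_held = 0;
	pthread_mutex_unlock(&demo_lock);
}

/* Cleanup handler: the compiler passes a pointer to the guarded variable. */
static void demo_guard_exit(int *token)
{
	(void)token;
	demo_release();
}

/* Hypothetical userspace analogue of a scope-based lock guard. */
#define DEMO_GUARD() \
	int demo_guard __attribute__((cleanup(demo_guard_exit), unused)) = \
		(demo_acquire(), 0)

/* Single exit point: explicit lock/unlock is short and easy to follow. */
static int demo_simple(int v)
{
	int ret;

	demo_acquire();
	ret = v + 1;
	demo_release();
	return ret;
}

/* Several exit points: the guard releases the lock on every return path. */
static int demo_multi_exit(int v)
{
	DEMO_GUARD();

	if (v < 0)
		return -EINVAL;
	if (v == 0)
		return -ENOENT;
	return v + 1;
}
```

With a single exit point, demo_simple() gains nothing from a guard. In demo_multi_exit(), the explicit form would need an unlock before each return (or a goto label), which is the case where a guard pays off.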

* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-22 15:05 ` Waiman Long
  2024-06-22 20:04   ` [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show() Markus Elfring
@ 2024-06-24  2:59   ` chenridong
  2024-06-24 23:59     ` Waiman Long
  1 sibling, 1 reply; 16+ messages in thread
From: chenridong @ 2024-06-24  2:59 UTC (permalink / raw)
  To: Waiman Long, tj, lizefan.x, hannes; +Cc: bpf, cgroups, linux-kernel


On 2024/6/22 23:05, Waiman Long wrote:
>
> On 6/22/24 07:38, Chen Ridong wrote:
>> We found a refcount UAF bug as follows:
>>
>> BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
>> Read of size 8 at addr ffff8882a4b242b8 by task atop/19903
>>
>> CPU: 27 PID: 19903 Comm: atop Kdump: loaded Tainted: GF
>> Call Trace:
>>   dump_stack+0x7d/0xa7
>>   print_address_description.constprop.0+0x19/0x170
>>   ? cgroup_path_ns+0x112/0x150
>>   __kasan_report.cold+0x6c/0x84
>>   ? print_unreferenced+0x390/0x3b0
>>   ? cgroup_path_ns+0x112/0x150
>>   kasan_report+0x3a/0x50
>>   cgroup_path_ns+0x112/0x150
>>   proc_cpuset_show+0x164/0x530
>>   proc_single_show+0x10f/0x1c0
>>   seq_read_iter+0x405/0x1020
>>   ? aa_path_link+0x2e0/0x2e0
>>   seq_read+0x324/0x500
>>   ? seq_read_iter+0x1020/0x1020
>>   ? common_file_perm+0x2a1/0x4a0
>>   ? fsnotify_unmount_inodes+0x380/0x380
>>   ? bpf_lsm_file_permission_wrapper+0xa/0x30
>>   ? security_file_permission+0x53/0x460
>>   vfs_read+0x122/0x420
>>   ksys_read+0xed/0x1c0
>>   ? __ia32_sys_pwrite64+0x1e0/0x1e0
>>   ? __audit_syscall_exit+0x741/0xa70
>>   do_syscall_64+0x33/0x40
>>   entry_SYSCALL_64_after_hwframe+0x67/0xcc
>>
>> This is also reported by: 
>> https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
>>
>> This can be reproduced by the following methods:
>> 1.add an mdelay(1000) before acquiring the cgroup_lock In the
>>   cgroup_path_ns function.
>> 2.$cat /proc/<pid>/cpuset   repeatly.
>> 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
>> $umount /sys/fs/cgroup/cpuset/   repeatly.
>>
>> The race that cause this bug can be shown as below:
>>
>> (umount)        |    (cat /proc/<pid>/cpuset)
>> css_release        |    proc_cpuset_show
>> css_release_work_fn    |    css = task_get_css(tsk, cpuset_cgrp_id);
>> css_free_rwork_fn    |    cgroup_path_ns(css->cgroup, ...);
>> cgroup_destroy_root    |    mutex_lock(&cgroup_mutex);
>> rebind_subsystems    |
>> cgroup_free_root     |
>>             |    // cgrp was freed, UAF
>>             |    cgroup_path_ns_locked(cgrp,..);
>>
>> When the cpuset is initialized, the root node top_cpuset.css.cgrp
>> will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation 
>> will
>> allocate cgroup_root, and top_cpuset.css.cgrp will point to the 
>> allocated
>> &cgroup_root.cgrp. When the umount operation is executed,
>> top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.
>>
>> The problem is that when rebinding to cgrp_dfl_root, there are cases
>> where the cgroup_root allocated by setting up the root for cgroup v1
>> is cached. This could lead to a Use-After-Free (UAF) if it is
>> subsequently freed. The descendant cgroups of cgroup v1 can only be
>> freed after the css is released. However, the css of the root will never
>> be released, yet the cgroup_root should be freed when it is unmounted.
>> This means that obtaining a reference to the css of the root does
>> not guarantee that css.cgrp->root will not be freed.
>>
>> To solve this issue, we have added a cgroup reference count in
>> the proc_cpuset_show function to ensure that css.cgrp->root will not
>> be freed prematurely. This is a temporary solution. Let's see if anyone
>> has a better solution.
>>
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>> ---
>>   kernel/cgroup/cpuset.c | 20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index c12b9fdb22a4..782eaf807173 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -5045,6 +5045,7 @@ int proc_cpuset_show(struct seq_file *m, struct 
>> pid_namespace *ns,
>>       char *buf;
>>       struct cgroup_subsys_state *css;
>>       int retval;
>> +    struct cgroup *root_cgroup = NULL;
>>         retval = -ENOMEM;
>>       buf = kmalloc(PATH_MAX, GFP_KERNEL);
>> @@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, 
>> struct pid_namespace *ns,
>>           goto out;
>>         css = task_get_css(tsk, cpuset_cgrp_id);
>> +    rcu_read_lock();
>> +    /*
>> +     * When the cpuset subsystem is mounted on the legacy hierarchy,
>> +     * the top_cpuset.css->cgroup does not hold a reference count of
>> +     * cgroup_root.cgroup. This makes accessing css->cgroup very
>> +     * dangerous because when the cpuset subsystem is remounted to the
>> +     * default hierarchy, the cgroup_root.cgroup that css->cgroup 
>> points
>> +     * to will be released, leading to a UAF issue. To avoid this 
>> problem,
>> +     * get the reference count of top_cpuset.css->cgroup first.
>> +     *
>> +     * This is ugly!!
>> +     */
>> +    if (css == &top_cpuset.css) {
>> +        cgroup_get(css->cgroup);
>> +        root_cgroup = css->cgroup;
>> +    }
>> +    rcu_read_unlock();
>>       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>                   current->nsproxy->cgroup_ns);
>>       css_put(css);
>> +    if (root_cgroup)
>> +        cgroup_put(root_cgroup);
>>       if (retval == -E2BIG)
>>           retval = -ENAMETOOLONG;
>>       if (retval < 0)
>
> Thanks for reporting this UAF bug. Could you try the attached patch to 
> see if it can fix the issue?
>

+/*
+ * With a cgroup v1 mount, root_css.cgroup can be freed. We need to take a
+ * reference to it to avoid UAF as proc_cpuset_show() may access the content
+ * of this cgroup.
+ */
  static void cpuset_bind(struct cgroup_subsys_state *root_css)
  {
+    static struct cgroup *v1_cgroup_root;
+
      mutex_lock(&cpuset_mutex);
+    if (v1_cgroup_root) {
+        cgroup_put(v1_cgroup_root);
+        v1_cgroup_root = NULL;
+    }
      spin_lock_irq(&callback_lock);

      if (is_in_v2_mode()) {
@@ -4159,6 +4170,10 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
      }

      spin_unlock_irq(&callback_lock);
+    if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+        v1_cgroup_root = root_css->cgroup;
+        cgroup_get(v1_cgroup_root);
+    }
      mutex_unlock(&cpuset_mutex);
  }

Thanks for your suggestion. If we take a reference in the rebind path
(where ->bind() is called), the cgroup_root allocated when setting up
the root for cgroup v1 can never be released, because its reference
count will never drop to zero.

We have already tried similar methods to fix this issue; however, doing
so causes the issue described above.


Ridong


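The objection above can be modeled with a short userspace sketch. It assumes, as in the race diagram from the original report, that the rebind to the default hierarchy only runs from the destroy path, and that the destroy path only runs once the root's reference count reaches zero. All names (lc_root, lc_bind, lc_put, lc_root_destroyed_after_umount) are hypothetical and simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a v1 cgroup root. */
struct lc_root {
	int refcnt;
	bool destroyed;
};

/* Root pinned by the most recent bind, or NULL. */
static struct lc_root *lc_pinned;

static void lc_put(struct lc_root *root);

/* Models ->bind(): drop the old pin; pin v1_root if one is given. */
static void lc_bind(struct lc_root *v1_root)
{
	struct lc_root *old = lc_pinned;

	lc_pinned = NULL;
	if (v1_root) {
		v1_root->refcnt++;
		lc_pinned = v1_root;
	}
	if (old)
		lc_put(old);
}

/*
 * Models the release path: the root is destroyed, and the subsystem is
 * rebound to the default hierarchy, only when the count reaches zero.
 */
static void lc_put(struct lc_root *root)
{
	assert(root->refcnt > 0);
	if (--root->refcnt == 0 && !root->destroyed) {
		root->destroyed = true;
		lc_bind(NULL);	/* rebind to the default hierarchy */
	}
}

/* Returns true if the v1 root gets destroyed once umount drops its ref. */
static bool lc_root_destroyed_after_umount(bool pin_at_bind)
{
	struct lc_root v1 = { .refcnt = 1 };	/* reference held by the mount */
	bool destroyed;

	lc_pinned = NULL;
	if (pin_at_bind)
		lc_bind(&v1);	/* mount-time bind takes the extra pin */
	lc_put(&v1);		/* umount drops the mount's reference */
	destroyed = v1.destroyed;
	lc_pinned = NULL;	/* do not keep a pointer to a stack object */
	return destroyed;
}
```

Without the pin, the umount drops the count to zero and the root is destroyed. With the pin, the count stays at one, so the destroy path that would release the pin is never entered and the root is never destroyed in this model.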

* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-22 13:45 ` Markus Elfring
@ 2024-06-24  3:34   ` chenridong
  0 siblings, 0 replies; 16+ messages in thread
From: chenridong @ 2024-06-24  3:34 UTC (permalink / raw)
  To: Markus Elfring, cgroups, bpf, Johannes Weiner, Tejun Heo,
	Waiman Long, Zefan Li
  Cc: LKML


On 2024/6/22 21:45, Markus Elfring wrote:
>> We found a refcount UAF bug as follows:
>>
>> BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
> …
>
> How do you think about to use a summary phrase like “Avoid use-after-free
> in proc_cpuset_show()”?
>
>
>> This is also reported by: https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
> Would you like to add any tags (like “Fixes”) accordingly?
>
Thank you for the review; I will do that.
> …
>> +++ b/kernel/cgroup/cpuset.c
> …
>> @@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
>>   		goto out;
>>
>>   	css = task_get_css(tsk, cpuset_cgrp_id);
>> +	rcu_read_lock();
> …
>> +	rcu_read_unlock();
>>   	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>   				current->nsproxy->cgroup_ns);
> …
>
> Would you become interested to apply a statement like “guard(rcu_read_lock)();”?
> https://elixir.bootlin.com/linux/v6.10-rc4/source/include/linux/cleanup.h#L133
>
> Regards,
> Markus

We hope somebody has a better solution.

Regards

Ridong





* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-24  2:59   ` [PATCH -next] cgroup: fix uaf when proc_cpuset_show chenridong
@ 2024-06-24 23:59     ` Waiman Long
  2024-06-25  1:46       ` chenridong
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2024-06-24 23:59 UTC (permalink / raw)
  To: chenridong, tj, lizefan.x, hannes; +Cc: bpf, cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7183 bytes --]

On 6/23/24 22:59, chenridong wrote:
>
> On 2024/6/22 23:05, Waiman Long wrote:
>>
>> On 6/22/24 07:38, Chen Ridong wrote:
>>> We found a refcount UAF bug as follows:
>>>
>>> BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
>>> Read of size 8 at addr ffff8882a4b242b8 by task atop/19903
>>>
>>> CPU: 27 PID: 19903 Comm: atop Kdump: loaded Tainted: GF
>>> Call Trace:
>>>   dump_stack+0x7d/0xa7
>>>   print_address_description.constprop.0+0x19/0x170
>>>   ? cgroup_path_ns+0x112/0x150
>>>   __kasan_report.cold+0x6c/0x84
>>>   ? print_unreferenced+0x390/0x3b0
>>>   ? cgroup_path_ns+0x112/0x150
>>>   kasan_report+0x3a/0x50
>>>   cgroup_path_ns+0x112/0x150
>>>   proc_cpuset_show+0x164/0x530
>>>   proc_single_show+0x10f/0x1c0
>>>   seq_read_iter+0x405/0x1020
>>>   ? aa_path_link+0x2e0/0x2e0
>>>   seq_read+0x324/0x500
>>>   ? seq_read_iter+0x1020/0x1020
>>>   ? common_file_perm+0x2a1/0x4a0
>>>   ? fsnotify_unmount_inodes+0x380/0x380
>>>   ? bpf_lsm_file_permission_wrapper+0xa/0x30
>>>   ? security_file_permission+0x53/0x460
>>>   vfs_read+0x122/0x420
>>>   ksys_read+0xed/0x1c0
>>>   ? __ia32_sys_pwrite64+0x1e0/0x1e0
>>>   ? __audit_syscall_exit+0x741/0xa70
>>>   do_syscall_64+0x33/0x40
>>>   entry_SYSCALL_64_after_hwframe+0x67/0xcc
>>>
>>> This is also reported by: 
>>> https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
>>>
>>> This can be reproduced by the following methods:
>>> 1.add an mdelay(1000) before acquiring the cgroup_lock In the
>>>   cgroup_path_ns function.
>>> 2.$cat /proc/<pid>/cpuset   repeatly.
>>> 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
>>> $umount /sys/fs/cgroup/cpuset/   repeatly.
>>>
>>> The race that cause this bug can be shown as below:
>>>
>>> (umount)        |    (cat /proc/<pid>/cpuset)
>>> css_release        |    proc_cpuset_show
>>> css_release_work_fn    |    css = task_get_css(tsk, cpuset_cgrp_id);
>>> css_free_rwork_fn    |    cgroup_path_ns(css->cgroup, ...);
>>> cgroup_destroy_root    |    mutex_lock(&cgroup_mutex);
>>> rebind_subsystems    |
>>> cgroup_free_root     |
>>>             |    // cgrp was freed, UAF
>>>             |    cgroup_path_ns_locked(cgrp,..);
>>>
>>> When the cpuset is initialized, the root node top_cpuset.css.cgrp
>>> will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation 
>>> will
>>> allocate cgroup_root, and top_cpuset.css.cgrp will point to the 
>>> allocated
>>> &cgroup_root.cgrp. When the umount operation is executed,
>>> top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.
>>>
>>> The problem is that when rebinding to cgrp_dfl_root, there are cases
>>> where the cgroup_root allocated by setting up the root for cgroup v1
>>> is cached. This could lead to a Use-After-Free (UAF) if it is
>>> subsequently freed. The descendant cgroups of cgroup v1 can only be
>>> freed after the css is released. However, the css of the root will 
>>> never
>>> be released, yet the cgroup_root should be freed when it is unmounted.
>>> This means that obtaining a reference to the css of the root does
>>> not guarantee that css.cgrp->root will not be freed.
>>>
>>> To solve this issue, we have added a cgroup reference count in
>>> the proc_cpuset_show function to ensure that css.cgrp->root will not
>>> be freed prematurely. This is a temporary solution. Let's see if anyone
>>> has a better solution.
>>>
>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>> ---
>>>   kernel/cgroup/cpuset.c | 20 ++++++++++++++++++++
>>>   1 file changed, 20 insertions(+)
>>>
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> index c12b9fdb22a4..782eaf807173 100644
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -5045,6 +5045,7 @@ int proc_cpuset_show(struct seq_file *m, 
>>> struct pid_namespace *ns,
>>>       char *buf;
>>>       struct cgroup_subsys_state *css;
>>>       int retval;
>>> +    struct cgroup *root_cgroup = NULL;
>>>         retval = -ENOMEM;
>>>       buf = kmalloc(PATH_MAX, GFP_KERNEL);
>>> @@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, 
>>> struct pid_namespace *ns,
>>>           goto out;
>>>         css = task_get_css(tsk, cpuset_cgrp_id);
>>> +    rcu_read_lock();
>>> +    /*
>>> +     * When the cpuset subsystem is mounted on the legacy hierarchy,
>>> +     * the top_cpuset.css->cgroup does not hold a reference count of
>>> +     * cgroup_root.cgroup. This makes accessing css->cgroup very
>>> +     * dangerous because when the cpuset subsystem is remounted to the
>>> +     * default hierarchy, the cgroup_root.cgroup that css->cgroup 
>>> points
>>> +     * to will be released, leading to a UAF issue. To avoid this 
>>> problem,
>>> +     * get the reference count of top_cpuset.css->cgroup first.
>>> +     *
>>> +     * This is ugly!!
>>> +     */
>>> +    if (css == &top_cpuset.css) {
>>> +        cgroup_get(css->cgroup);
>>> +        root_cgroup = css->cgroup;
>>> +    }
>>> +    rcu_read_unlock();
>>>       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>>                   current->nsproxy->cgroup_ns);
>>>       css_put(css);
>>> +    if (root_cgroup)
>>> +        cgroup_put(root_cgroup);
>>>       if (retval == -E2BIG)
>>>           retval = -ENAMETOOLONG;
>>>       if (retval < 0)
>>
>> Thanks for reporting this UAF bug. Could you try the attached patch 
>> to see if it can fix the issue?
>>
>
> +/*
> + * With a cgroup v1 mount, root_css.cgroup can be freed. We need to 
> take a
> + * reference to it to avoid UAF as proc_cpuset_show() may access the 
> content
> + * of this cgroup.
> + */
>  static void cpuset_bind(struct cgroup_subsys_state *root_css)
>  {
> +    static struct cgroup *v1_cgroup_root;
> +
>      mutex_lock(&cpuset_mutex);
> +    if (v1_cgroup_root) {
> +        cgroup_put(v1_cgroup_root);
> +        v1_cgroup_root = NULL;
> +    }
>      spin_lock_irq(&callback_lock);
>
>      if (is_in_v2_mode()) {
> @@ -4159,6 +4170,10 @@ static void cpuset_bind(struct 
> cgroup_subsys_state *root_css)
>      }
>
>      spin_unlock_irq(&callback_lock);
> +    if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
> +        v1_cgroup_root = root_css->cgroup;
> +        cgroup_get(v1_cgroup_root);
> +    }
>      mutex_unlock(&cpuset_mutex);
>  }
>
> Thanks for your suggestion. If we take a reference at rebind(call 
> ->bind()) function, cgroup_root allocated when setting up root for 
> cgroup v1 can never be released, because the reference count will 
> never be reduced to zero.
>
> We have already tried similar methods to fix this issue, however doing 
> so causes another issue as mentioned previously.

You are right. Taking the reference in cpuset_bind() will prevent 
cgroup_destroy_root() from being called. I had overlooked that.
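
The circular dependency can be shown with a small userspace model. This is
only a sketch under stated assumptions: toy_root, toy_bind(), toy_put() and
toy_mount_umount() are hypothetical stand-ins, not kernel APIs. It assumes,
as discussed above, that destruction starts only when the last reference is
dropped and that the pin taken in ->bind() is only dropped by a later
->bind() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a v1 cgroup_root; not the kernel structure. */
struct toy_root {
	int refcnt;
	bool destroyed;
};

/* Models the static v1_cgroup_root pointer from the first proposed patch. */
static struct toy_root *toy_pinned;

static void toy_put(struct toy_root *root);

/* Models ->bind(): drop the old pin, optionally pin the new root. */
static void toy_bind(struct toy_root *new_root, bool pin)
{
	if (toy_pinned) {
		struct toy_root *old = toy_pinned;

		toy_pinned = NULL;
		toy_put(old);
	}
	if (pin && new_root) {
		new_root->refcnt++;
		toy_pinned = new_root;
	}
}

/* Models cgroup_destroy_root(): rebind to the default root, then free. */
static void toy_destroy(struct toy_root *root)
{
	toy_bind(NULL, false);
	root->destroyed = true;
}

/* Destruction only starts once the last reference is dropped. */
static void toy_put(struct toy_root *root)
{
	assert(root->refcnt > 0);
	if (--root->refcnt == 0)
		toy_destroy(root);
}

/* Mount then umount one v1 root; report whether it was ever destroyed. */
static bool toy_mount_umount(bool pin_in_bind)
{
	struct toy_root root = { .refcnt = 1, .destroyed = false };

	toy_pinned = NULL;
	toy_bind(&root, pin_in_bind);	/* mount: bind to the v1 root */
	toy_put(&root);			/* umount: drop the mount reference */
	return root.destroyed;
}
```

In this model toy_mount_umount(false) reaches toy_destroy(), while
toy_mount_umount(true) leaves the root alive with one reference that nothing
will ever drop.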

Now I have an even simpler fix. Could you try the attached v2 patch to 
verify if that can fix the problem?

Thanks,
Longman

[-- Attachment #2: v2-0001-cgroup-cpuset-Prevent-UAF-in-proc_cpuset_show.patch --]
[-- Type: text/x-patch, Size: 2216 bytes --]

From 2996235545433ce25e917af11f4985d7b6880764 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Mon, 24 Jun 2024 19:53:32 -0400
Subject: [PATCH v2] cgroup/cpuset: Prevent UAF in proc_cpuset_show()

The unmounting of a cpuset cgroup filesystem will lead to a call to
cpuset_bind() to rebind it back to &cgrp_dfl_root.cgrp via the following
call sequence.

  cgroup_destroy_root()
  --> rebind_subsystems()
  --> cpuset_bind()

The call to cpuset_bind() is done after setting top_cpuset.css.cgroup
to the &cgrp_dfl_root.cgrp. The allocated v1 cgroup root will be freed
after the completion of the cpuset_bind() call and other miscellaneous
cleanups.

Fix this potential UAF problem by putting the access and parsing
of top_cpuset.css.cgroup under cpuset_mutex to synchronize with
cpuset_bind() of the unmount operation. If the cpuset_mutex is acquired
after cpuset_bind(), top_cpuset.css.cgroup is guaranteed to point to
cgrp_dfl_root.cgrp. If it is acquired before cpuset_bind(), the allocated
v1 cgroup root cannot be freed until after the cpuset_mutex is released.

A similar UAF problem in proc_cpuset_show() had been reported before in
[1].

[1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

Reported-by: Chen Ridong <chenridong@huawei.com>
Closes: https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c12b9fdb22a4..953150a06d81 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -5051,10 +5051,17 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
 	if (!buf)
 		goto out;
 
+	/*
+	 * Access to css->cgroup is guarded by cpuset_mutex to synchronize
+	 * with the cpuset_bind() call of a racing v1 cgroup root unmount
+	 * operation to prevent UAF.
+	 */
+	mutex_lock(&cpuset_mutex);
 	css = task_get_css(tsk, cpuset_cgrp_id);
 	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
 				current->nsproxy->cgroup_ns);
 	css_put(css);
+	mutex_unlock(&cpuset_mutex);
 	if (retval == -E2BIG)
 		retval = -ENAMETOOLONG;
 	if (retval < 0)
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-24 23:59     ` Waiman Long
@ 2024-06-25  1:46       ` chenridong
  2024-06-25  2:40         ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: chenridong @ 2024-06-25  1:46 UTC (permalink / raw)
  To: Waiman Long, tj, lizefan.x, hannes; +Cc: bpf, cgroups, linux-kernel


On 2024/6/25 7:59, Waiman Long wrote:
> On 6/23/24 22:59, chenridong wrote:
>>
>> On 2024/6/22 23:05, Waiman Long wrote:
>>>
>>> On 6/22/24 07:38, Chen Ridong wrote:
>>>> We found a refcount UAF bug as follows:
>>>>
>>>> BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
>>>> Read of size 8 at addr ffff8882a4b242b8 by task atop/19903
>>>>
>>>> CPU: 27 PID: 19903 Comm: atop Kdump: loaded Tainted: GF
>>>> Call Trace:
>>>>   dump_stack+0x7d/0xa7
>>>>   print_address_description.constprop.0+0x19/0x170
>>>>   ? cgroup_path_ns+0x112/0x150
>>>>   __kasan_report.cold+0x6c/0x84
>>>>   ? print_unreferenced+0x390/0x3b0
>>>>   ? cgroup_path_ns+0x112/0x150
>>>>   kasan_report+0x3a/0x50
>>>>   cgroup_path_ns+0x112/0x150
>>>>   proc_cpuset_show+0x164/0x530
>>>>   proc_single_show+0x10f/0x1c0
>>>>   seq_read_iter+0x405/0x1020
>>>>   ? aa_path_link+0x2e0/0x2e0
>>>>   seq_read+0x324/0x500
>>>>   ? seq_read_iter+0x1020/0x1020
>>>>   ? common_file_perm+0x2a1/0x4a0
>>>>   ? fsnotify_unmount_inodes+0x380/0x380
>>>>   ? bpf_lsm_file_permission_wrapper+0xa/0x30
>>>>   ? security_file_permission+0x53/0x460
>>>>   vfs_read+0x122/0x420
>>>>   ksys_read+0xed/0x1c0
>>>>   ? __ia32_sys_pwrite64+0x1e0/0x1e0
>>>>   ? __audit_syscall_exit+0x741/0xa70
>>>>   do_syscall_64+0x33/0x40
>>>>   entry_SYSCALL_64_after_hwframe+0x67/0xcc
>>>>
>>>> This is also reported by: 
>>>> https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
>>>>
>>>> This can be reproduced by the following methods:
>>>> 1.add an mdelay(1000) before acquiring the cgroup_lock In the
>>>>   cgroup_path_ns function.
>>>> 2.$cat /proc/<pid>/cpuset   repeatly.
>>>> 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
>>>> $umount /sys/fs/cgroup/cpuset/   repeatly.
>>>>
>>>> The race that cause this bug can be shown as below:
>>>>
>>>> (umount)        |    (cat /proc/<pid>/cpuset)
>>>> css_release        |    proc_cpuset_show
>>>> css_release_work_fn    |    css = task_get_css(tsk, cpuset_cgrp_id);
>>>> css_free_rwork_fn    |    cgroup_path_ns(css->cgroup, ...);
>>>> cgroup_destroy_root    |    mutex_lock(&cgroup_mutex);
>>>> rebind_subsystems    |
>>>> cgroup_free_root     |
>>>>             |    // cgrp was freed, UAF
>>>>             |    cgroup_path_ns_locked(cgrp,..);
>>>>
>>>> When the cpuset is initialized, the root node top_cpuset.css.cgrp
>>>> will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount 
>>>> operation will
>>>> allocate cgroup_root, and top_cpuset.css.cgrp will point to the 
>>>> allocated
>>>> &cgroup_root.cgrp. When the umount operation is executed,
>>>> top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.
>>>>
>>>> The problem is that when rebinding to cgrp_dfl_root, there are cases
>>>> where the cgroup_root allocated by setting up the root for cgroup v1
>>>> is cached. This could lead to a Use-After-Free (UAF) if it is
>>>> subsequently freed. The descendant cgroups of cgroup v1 can only be
>>>> freed after the css is released. However, the css of the root will 
>>>> never
>>>> be released, yet the cgroup_root should be freed when it is unmounted.
>>>> This means that obtaining a reference to the css of the root does
>>>> not guarantee that css.cgrp->root will not be freed.
>>>>
>>>> To solve this issue, we have added a cgroup reference count in
>>>> the proc_cpuset_show function to ensure that css.cgrp->root will not
>>>> be freed prematurely. This is a temporary solution. Let's see if 
>>>> anyone
>>>> has a better solution.
>>>>
>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>>> ---
>>>>   kernel/cgroup/cpuset.c | 20 ++++++++++++++++++++
>>>>   1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>> index c12b9fdb22a4..782eaf807173 100644
>>>> --- a/kernel/cgroup/cpuset.c
>>>> +++ b/kernel/cgroup/cpuset.c
>>>> @@ -5045,6 +5045,7 @@ int proc_cpuset_show(struct seq_file *m, 
>>>> struct pid_namespace *ns,
>>>>       char *buf;
>>>>       struct cgroup_subsys_state *css;
>>>>       int retval;
>>>> +    struct cgroup *root_cgroup = NULL;
>>>>         retval = -ENOMEM;
>>>>       buf = kmalloc(PATH_MAX, GFP_KERNEL);
>>>> @@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, 
>>>> struct pid_namespace *ns,
>>>>           goto out;
>>>>         css = task_get_css(tsk, cpuset_cgrp_id);
>>>> +    rcu_read_lock();
>>>> +    /*
>>>> +     * When the cpuset subsystem is mounted on the legacy hierarchy,
>>>> +     * the top_cpuset.css->cgroup does not hold a reference count of
>>>> +     * cgroup_root.cgroup. This makes accessing css->cgroup very
>>>> +     * dangerous because when the cpuset subsystem is remounted to 
>>>> the
>>>> +     * default hierarchy, the cgroup_root.cgroup that css->cgroup 
>>>> points
>>>> +     * to will be released, leading to a UAF issue. To avoid this 
>>>> problem,
>>>> +     * get the reference count of top_cpuset.css->cgroup first.
>>>> +     *
>>>> +     * This is ugly!!
>>>> +     */
>>>> +    if (css == &top_cpuset.css) {
>>>> +        cgroup_get(css->cgroup);
>>>> +        root_cgroup = css->cgroup;
>>>> +    }
>>>> +    rcu_read_unlock();
>>>>       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>>>                   current->nsproxy->cgroup_ns);
>>>>       css_put(css);
>>>> +    if (root_cgroup)
>>>> +        cgroup_put(root_cgroup);
>>>>       if (retval == -E2BIG)
>>>>           retval = -ENAMETOOLONG;
>>>>       if (retval < 0)
>>>
>>> Thanks for reporting this UAF bug. Could you try the attached patch 
>>> to see if it can fix the issue?
>>>
>>
>> +/*
>> + * With a cgroup v1 mount, root_css.cgroup can be freed. We need to 
>> take a
>> + * reference to it to avoid UAF as proc_cpuset_show() may access the 
>> content
>> + * of this cgroup.
>> + */
>>  static void cpuset_bind(struct cgroup_subsys_state *root_css)
>>  {
>> +    static struct cgroup *v1_cgroup_root;
>> +
>>      mutex_lock(&cpuset_mutex);
>> +    if (v1_cgroup_root) {
>> +        cgroup_put(v1_cgroup_root);
>> +        v1_cgroup_root = NULL;
>> +    }
>>      spin_lock_irq(&callback_lock);
>>
>>      if (is_in_v2_mode()) {
>> @@ -4159,6 +4170,10 @@ static void cpuset_bind(struct 
>> cgroup_subsys_state *root_css)
>>      }
>>
>>      spin_unlock_irq(&callback_lock);
>> +    if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
>> +        v1_cgroup_root = root_css->cgroup;
>> +        cgroup_get(v1_cgroup_root);
>> +    }
>>      mutex_unlock(&cpuset_mutex);
>>  }
>>
>> Thanks for your suggestion. If we take a reference at rebind(call 
>> ->bind()) function, cgroup_root allocated when setting up root for 
>> cgroup v1 can never be released, because the reference count will 
>> never be reduced to zero.
>>
>> We have already tried similar methods to fix this issue, however 
>> doing so causes another issue as mentioned previously.
>
> You are right. Taking the reference in cpuset_bind() will prevent 
> cgroup_destroy_root() from being called. I had overlooked that.
>
> Now I have an even simpler fix. Could you try the attached v2 patch to 
> verify if that can fix the problem?
>
> Thanks,
> Longman

Thank you for your reply. The v2 patch will lead to an ABBA deadlock:

(cat /proc/<pid>/cpuset)        | (rebind_subsystems)
                                | lockdep_assert_held(&cgroup_mutex);
mutex_lock(&cpuset_mutex);      |
cgroup_path_ns                  |   ->bind()
                                |     mutex_lock(&cpuset_mutex);
mutex_lock(&cgroup_mutex);      |
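
The inversion can be checked with a toy lock-order tracker. This is a
minimal sketch in the spirit of lockdep, not lockdep itself; toy_acquire()
and the two path functions are hypothetical names, and the lock sequences are
the ones from the table above.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_lock { TOY_CGROUP_MUTEX, TOY_CPUSET_MUTEX, TOY_NR_LOCKS };

/* order_seen[a][b] is true once some path took b while holding a. */
static bool order_seen[TOY_NR_LOCKS][TOY_NR_LOCKS];

/* Locks held by one execution path. */
struct toy_ctx {
	bool held[TOY_NR_LOCKS];
};

/*
 * Record that "lock" is taken in "ctx". Returns false if another path
 * already took the two locks in the opposite order (AB vs. BA).
 */
static bool toy_acquire(struct toy_ctx *ctx, enum toy_lock lock)
{
	bool ok = true;

	for (int h = 0; h < TOY_NR_LOCKS; h++) {
		if (!ctx->held[h])
			continue;
		if (order_seen[lock][h])
			ok = false;	/* inversion: h was taken under lock */
		order_seen[h][lock] = true;
	}
	ctx->held[lock] = true;
	return ok;
}

/* umount side: cgroup_mutex is held, then ->bind() takes cpuset_mutex. */
static bool toy_umount_path(void)
{
	struct toy_ctx ctx = { .held = { false } };
	bool first = toy_acquire(&ctx, TOY_CGROUP_MUTEX);
	bool second = toy_acquire(&ctx, TOY_CPUSET_MUTEX);

	return first && second;
}

/* v2 read side: cpuset_mutex first, then cgroup_path_ns takes cgroup_mutex. */
static bool toy_v2_read_path(void)
{
	struct toy_ctx ctx = { .held = { false } };
	bool first = toy_acquire(&ctx, TOY_CPUSET_MUTEX);
	bool second = toy_acquire(&ctx, TOY_CGROUP_MUTEX);

	return first && second;
}
```

Running toy_umount_path() first records the cgroup_mutex -> cpuset_mutex
order; toy_v2_read_path() then reports the opposite order as an inversion.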


Regards,
Ridong



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-25  1:46       ` chenridong
@ 2024-06-25  2:40         ` Waiman Long
  2024-06-25  3:12           ` chenridong
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2024-06-25  2:40 UTC (permalink / raw)
  To: chenridong, tj, lizefan.x, hannes; +Cc: bpf, cgroups, linux-kernel

On 6/24/24 21:46, chenridong wrote:
>
> On 2024/6/25 7:59, Waiman Long wrote:
>> On 6/23/24 22:59, chenridong wrote:
>>>
>>> On 2024/6/22 23:05, Waiman Long wrote:
>>>>
>>>> On 6/22/24 07:38, Chen Ridong wrote:
>>>>> We found a refcount UAF bug as follows:
>>>>>
>>>>> BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
>>>>> Read of size 8 at addr ffff8882a4b242b8 by task atop/19903
>>>>>
>>>>> CPU: 27 PID: 19903 Comm: atop Kdump: loaded Tainted: GF
>>>>> Call Trace:
>>>>>   dump_stack+0x7d/0xa7
>>>>>   print_address_description.constprop.0+0x19/0x170
>>>>>   ? cgroup_path_ns+0x112/0x150
>>>>>   __kasan_report.cold+0x6c/0x84
>>>>>   ? print_unreferenced+0x390/0x3b0
>>>>>   ? cgroup_path_ns+0x112/0x150
>>>>>   kasan_report+0x3a/0x50
>>>>>   cgroup_path_ns+0x112/0x150
>>>>>   proc_cpuset_show+0x164/0x530
>>>>>   proc_single_show+0x10f/0x1c0
>>>>>   seq_read_iter+0x405/0x1020
>>>>>   ? aa_path_link+0x2e0/0x2e0
>>>>>   seq_read+0x324/0x500
>>>>>   ? seq_read_iter+0x1020/0x1020
>>>>>   ? common_file_perm+0x2a1/0x4a0
>>>>>   ? fsnotify_unmount_inodes+0x380/0x380
>>>>>   ? bpf_lsm_file_permission_wrapper+0xa/0x30
>>>>>   ? security_file_permission+0x53/0x460
>>>>>   vfs_read+0x122/0x420
>>>>>   ksys_read+0xed/0x1c0
>>>>>   ? __ia32_sys_pwrite64+0x1e0/0x1e0
>>>>>   ? __audit_syscall_exit+0x741/0xa70
>>>>>   do_syscall_64+0x33/0x40
>>>>>   entry_SYSCALL_64_after_hwframe+0x67/0xcc
>>>>>
>>>>> This is also reported by: 
>>>>> https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
>>>>>
>>>>> This can be reproduced by the following methods:
>>>>> 1.add an mdelay(1000) before acquiring the cgroup_lock In the
>>>>>   cgroup_path_ns function.
>>>>> 2.$cat /proc/<pid>/cpuset   repeatly.
>>>>> 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
>>>>> $umount /sys/fs/cgroup/cpuset/   repeatly.
>>>>>
>>>>> The race that cause this bug can be shown as below:
>>>>>
>>>>> (umount)        |    (cat /proc/<pid>/cpuset)
>>>>> css_release        |    proc_cpuset_show
>>>>> css_release_work_fn    |    css = task_get_css(tsk, cpuset_cgrp_id);
>>>>> css_free_rwork_fn    |    cgroup_path_ns(css->cgroup, ...);
>>>>> cgroup_destroy_root    |    mutex_lock(&cgroup_mutex);
>>>>> rebind_subsystems    |
>>>>> cgroup_free_root     |
>>>>>             |    // cgrp was freed, UAF
>>>>>             |    cgroup_path_ns_locked(cgrp,..);
>>>>>
>>>>> When the cpuset is initialized, the root node top_cpuset.css.cgrp
>>>>> will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount 
>>>>> operation will
>>>>> allocate cgroup_root, and top_cpuset.css.cgrp will point to the 
>>>>> allocated
>>>>> &cgroup_root.cgrp. When the umount operation is executed,
>>>>> top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.
>>>>>
>>>>> The problem is that when rebinding to cgrp_dfl_root, there are cases
>>>>> where the cgroup_root allocated by setting up the root for cgroup v1
>>>>> is cached. This could lead to a Use-After-Free (UAF) if it is
>>>>> subsequently freed. The descendant cgroups of cgroup v1 can only be
>>>>> freed after the css is released. However, the css of the root will 
>>>>> never
>>>>> be released, yet the cgroup_root should be freed when it is 
>>>>> unmounted.
>>>>> This means that obtaining a reference to the css of the root does
>>>>> not guarantee that css.cgrp->root will not be freed.
>>>>>
>>>>> To solve this issue, we have added a cgroup reference count in
>>>>> the proc_cpuset_show function to ensure that css.cgrp->root will not
>>>>> be freed prematurely. This is a temporary solution. Let's see if 
>>>>> anyone
>>>>> has a better solution.
>>>>>
>>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>>>> ---
>>>>>   kernel/cgroup/cpuset.c | 20 ++++++++++++++++++++
>>>>>   1 file changed, 20 insertions(+)
>>>>>
>>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>>> index c12b9fdb22a4..782eaf807173 100644
>>>>> --- a/kernel/cgroup/cpuset.c
>>>>> +++ b/kernel/cgroup/cpuset.c
>>>>> @@ -5045,6 +5045,7 @@ int proc_cpuset_show(struct seq_file *m, 
>>>>> struct pid_namespace *ns,
>>>>>       char *buf;
>>>>>       struct cgroup_subsys_state *css;
>>>>>       int retval;
>>>>> +    struct cgroup *root_cgroup = NULL;
>>>>>         retval = -ENOMEM;
>>>>>       buf = kmalloc(PATH_MAX, GFP_KERNEL);
>>>>> @@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, 
>>>>> struct pid_namespace *ns,
>>>>>           goto out;
>>>>>         css = task_get_css(tsk, cpuset_cgrp_id);
>>>>> +    rcu_read_lock();
>>>>> +    /*
>>>>> +     * When the cpuset subsystem is mounted on the legacy hierarchy,
>>>>> +     * the top_cpuset.css->cgroup does not hold a reference count of
>>>>> +     * cgroup_root.cgroup. This makes accessing css->cgroup very
>>>>> +     * dangerous because when the cpuset subsystem is remounted 
>>>>> to the
>>>>> +     * default hierarchy, the cgroup_root.cgroup that css->cgroup 
>>>>> points
>>>>> +     * to will be released, leading to a UAF issue. To avoid this 
>>>>> problem,
>>>>> +     * get the reference count of top_cpuset.css->cgroup first.
>>>>> +     *
>>>>> +     * This is ugly!!
>>>>> +     */
>>>>> +    if (css == &top_cpuset.css) {
>>>>> +        cgroup_get(css->cgroup);
>>>>> +        root_cgroup = css->cgroup;
>>>>> +    }
>>>>> +    rcu_read_unlock();
>>>>>       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>>>>                   current->nsproxy->cgroup_ns);
>>>>>       css_put(css);
>>>>> +    if (root_cgroup)
>>>>> +        cgroup_put(root_cgroup);
>>>>>       if (retval == -E2BIG)
>>>>>           retval = -ENAMETOOLONG;
>>>>>       if (retval < 0)
>>>>
>>>> Thanks for reporting this UAF bug. Could you try the attached patch 
>>>> to see if it can fix the issue?
>>>>
>>>
>>> +/*
>>> + * With a cgroup v1 mount, root_css.cgroup can be freed. We need to 
>>> take a
>>> + * reference to it to avoid UAF as proc_cpuset_show() may access 
>>> the content
>>> + * of this cgroup.
>>> + */
>>>  static void cpuset_bind(struct cgroup_subsys_state *root_css)
>>>  {
>>> +    static struct cgroup *v1_cgroup_root;
>>> +
>>>      mutex_lock(&cpuset_mutex);
>>> +    if (v1_cgroup_root) {
>>> +        cgroup_put(v1_cgroup_root);
>>> +        v1_cgroup_root = NULL;
>>> +    }
>>>      spin_lock_irq(&callback_lock);
>>>
>>>      if (is_in_v2_mode()) {
>>> @@ -4159,6 +4170,10 @@ static void cpuset_bind(struct 
>>> cgroup_subsys_state *root_css)
>>>      }
>>>
>>>      spin_unlock_irq(&callback_lock);
>>> +    if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
>>> +        v1_cgroup_root = root_css->cgroup;
>>> +        cgroup_get(v1_cgroup_root);
>>> +    }
>>>      mutex_unlock(&cpuset_mutex);
>>>  }
>>>
>>> Thanks for your suggestion. If we take a reference at rebind(call 
>>> ->bind()) function, cgroup_root allocated when setting up root for 
>>> cgroup v1 can never be released, because the reference count will 
>>> never be reduced to zero.
>>>
>>> We have already tried similar methods to fix this issue, however 
>>> doing so causes another issue as mentioned previously.
>>
>> You are right. Taking the reference in cpuset_bind() will prevent 
>> cgroup_destroy_root() from being called. I had overlooked that.
>>
>> Now I have an even simpler fix. Could you try the attached v2 patch 
>> to verify if that can fix the problem?
>>
>> Thanks,
>> Longman
>
> Thanks you for your reply, v2 patch will lead to  ABBA deadlock.
>
> (cat /proc/<pid>/cpuset)        | (rebind_subsystems)
>                                 | lockdep_assert_held(&cgroup_mutex);
> mutex_lock(&cpuset_mutex);      |
> cgroup_path_ns                  |   ->bind()
>                                 |     mutex_lock(&cpuset_mutex);
> mutex_lock(&cgroup_mutex);      |

Bummer.

Another alternative is to create a cgroup_path_ns() variant that accepts 
a "struct cgroup **pcgrp" and retrieves the actual cgroup pointer inside 
its critical section. That should also fix the UAF.
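
A userspace sketch of that idea follows. It is a single-threaded simulation
under stated assumptions, not a kernel patch: toy_cgroup, toy_path_direct(),
toy_path_indirect() and the race_hook parameter are hypothetical, the hook
stands in for a umount that wins the lock first, and a NULL path stands in
for freed memory. It also assumes the rebind updates the pointer under the
same lock the reader takes.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct toy_cgroup {
	const char *path;	/* NULL models a freed cgroup */
};

static struct toy_cgroup toy_dfl_root;	/* models cgrp_dfl_root.cgrp */
static struct toy_cgroup toy_v1_root;	/* models the allocated v1 root */
static struct toy_cgroup *toy_top_cgroup;	/* models top_cpuset.css.cgroup */
static pthread_mutex_t toy_cgroup_mutex = PTHREAD_MUTEX_INITIALIZER;
static char toy_buf[16];

/* Put the model back into the "v1 hierarchy mounted" state. */
static void toy_reset(void)
{
	toy_dfl_root.path = "/";
	toy_v1_root.path = "/v1";
	toy_top_cgroup = &toy_v1_root;
	toy_buf[0] = '\0';
}

/* Models umount: rebind to the default root and free the v1 root. */
static void toy_umount(void)
{
	pthread_mutex_lock(&toy_cgroup_mutex);
	toy_top_cgroup = &toy_dfl_root;
	toy_v1_root.path = NULL;
	pthread_mutex_unlock(&toy_cgroup_mutex);
}

/* Copy the path; -1 means the model "touched freed memory". */
static int toy_copy_path(const struct toy_cgroup *cgrp)
{
	if (!cgrp->path)
		return -1;
	snprintf(toy_buf, sizeof(toy_buf), "%s", cgrp->path);
	return 0;
}

/* Current shape: the caller samples the pointer before the lock is taken. */
static int toy_path_direct(struct toy_cgroup *cgrp, void (*race_hook)(void))
{
	int ret;

	if (race_hook)
		race_hook();	/* racing umount runs before we get the lock */
	pthread_mutex_lock(&toy_cgroup_mutex);
	ret = toy_copy_path(cgrp);
	pthread_mutex_unlock(&toy_cgroup_mutex);
	return ret;
}

/* Proposed shape: the pointer is only read inside the critical section. */
static int toy_path_indirect(struct toy_cgroup **pcgrp, void (*race_hook)(void))
{
	int ret;

	if (race_hook)
		race_hook();
	pthread_mutex_lock(&toy_cgroup_mutex);
	ret = toy_copy_path(*pcgrp);
	pthread_mutex_unlock(&toy_cgroup_mutex);
	return ret;
}
```

With toy_umount passed as the hook, toy_path_direct(toy_top_cgroup, ...)
ends up on the stale v1 root, while toy_path_indirect(&toy_top_cgroup, ...)
sees the default root because the pointer is read after the rebind.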

Cheers,
Longman


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-25  2:40         ` Waiman Long
@ 2024-06-25  3:12           ` chenridong
  2024-06-25 10:10             ` Michal Koutný
  0 siblings, 1 reply; 16+ messages in thread
From: chenridong @ 2024-06-25  3:12 UTC (permalink / raw)
  To: Waiman Long, tj, lizefan.x, hannes; +Cc: bpf, cgroups, linux-kernel


On 2024/6/25 10:40, Waiman Long wrote:
> On 6/24/24 21:46, chenridong wrote:
>>
>> On 2024/6/25 7:59, Waiman Long wrote:
>>> On 6/23/24 22:59, chenridong wrote:
>>>>
>>>> On 2024/6/22 23:05, Waiman Long wrote:
>>>>>
>>>>> On 6/22/24 07:38, Chen Ridong wrote:
>>>>>> We found a refcount UAF bug as follows:
>>>>>>
>>>>>> BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
>>>>>> Read of size 8 at addr ffff8882a4b242b8 by task atop/19903
>>>>>>
>>>>>> CPU: 27 PID: 19903 Comm: atop Kdump: loaded Tainted: GF
>>>>>> Call Trace:
>>>>>>   dump_stack+0x7d/0xa7
>>>>>>   print_address_description.constprop.0+0x19/0x170
>>>>>>   ? cgroup_path_ns+0x112/0x150
>>>>>>   __kasan_report.cold+0x6c/0x84
>>>>>>   ? print_unreferenced+0x390/0x3b0
>>>>>>   ? cgroup_path_ns+0x112/0x150
>>>>>>   kasan_report+0x3a/0x50
>>>>>>   cgroup_path_ns+0x112/0x150
>>>>>>   proc_cpuset_show+0x164/0x530
>>>>>>   proc_single_show+0x10f/0x1c0
>>>>>>   seq_read_iter+0x405/0x1020
>>>>>>   ? aa_path_link+0x2e0/0x2e0
>>>>>>   seq_read+0x324/0x500
>>>>>>   ? seq_read_iter+0x1020/0x1020
>>>>>>   ? common_file_perm+0x2a1/0x4a0
>>>>>>   ? fsnotify_unmount_inodes+0x380/0x380
>>>>>>   ? bpf_lsm_file_permission_wrapper+0xa/0x30
>>>>>>   ? security_file_permission+0x53/0x460
>>>>>>   vfs_read+0x122/0x420
>>>>>>   ksys_read+0xed/0x1c0
>>>>>>   ? __ia32_sys_pwrite64+0x1e0/0x1e0
>>>>>>   ? __audit_syscall_exit+0x741/0xa70
>>>>>>   do_syscall_64+0x33/0x40
>>>>>>   entry_SYSCALL_64_after_hwframe+0x67/0xcc
>>>>>>
>>>>>> This is also reported by: 
>>>>>> https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
>>>>>>
>>>>>> This can be reproduced by the following methods:
>>>>>> 1.add an mdelay(1000) before acquiring the cgroup_lock In the
>>>>>>   cgroup_path_ns function.
>>>>>> 2.$cat /proc/<pid>/cpuset   repeatly.
>>>>>> 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
>>>>>> $umount /sys/fs/cgroup/cpuset/   repeatly.
>>>>>>
>>>>>> The race that cause this bug can be shown as below:
>>>>>>
>>>>>> (umount)        |    (cat /proc/<pid>/cpuset)
>>>>>> css_release        |    proc_cpuset_show
>>>>>> css_release_work_fn    |    css = task_get_css(tsk, cpuset_cgrp_id);
>>>>>> css_free_rwork_fn    |    cgroup_path_ns(css->cgroup, ...);
>>>>>> cgroup_destroy_root    | mutex_lock(&cgroup_mutex);
>>>>>> rebind_subsystems    |
>>>>>> cgroup_free_root     |
>>>>>>             |    // cgrp was freed, UAF
>>>>>>             |    cgroup_path_ns_locked(cgrp,..);
>>>>>>
>>>>>> When the cpuset is initialized, the root node top_cpuset.css.cgrp
>>>>>> will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount 
>>>>>> operation will
>>>>>> allocate cgroup_root, and top_cpuset.css.cgrp will point to the 
>>>>>> allocated
>>>>>> &cgroup_root.cgrp. When the umount operation is executed,
>>>>>> top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.
>>>>>>
>>>>>> The problem is that when rebinding to cgrp_dfl_root, there are cases
>>>>>> where the cgroup_root allocated by setting up the root for cgroup v1
>>>>>> is cached. This could lead to a Use-After-Free (UAF) if it is
>>>>>> subsequently freed. The descendant cgroups of cgroup v1 can only be
>>>>>> freed after the css is released. However, the css of the root 
>>>>>> will never
>>>>>> be released, yet the cgroup_root should be freed when it is 
>>>>>> unmounted.
>>>>>> This means that obtaining a reference to the css of the root does
>>>>>> not guarantee that css.cgrp->root will not be freed.
>>>>>>
>>>>>> To solve this issue, we have added a cgroup reference count in
>>>>>> the proc_cpuset_show function to ensure that css.cgrp->root will not
>>>>>> be freed prematurely. This is a temporary solution. Let's see if 
>>>>>> anyone
>>>>>> has a better solution.
>>>>>>
>>>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>>>>> ---
>>>>>>   kernel/cgroup/cpuset.c | 20 ++++++++++++++++++++
>>>>>>   1 file changed, 20 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>>>> index c12b9fdb22a4..782eaf807173 100644
>>>>>> --- a/kernel/cgroup/cpuset.c
>>>>>> +++ b/kernel/cgroup/cpuset.c
>>>>>> @@ -5045,6 +5045,7 @@ int proc_cpuset_show(struct seq_file *m, 
>>>>>> struct pid_namespace *ns,
>>>>>>       char *buf;
>>>>>>       struct cgroup_subsys_state *css;
>>>>>>       int retval;
>>>>>> +    struct cgroup *root_cgroup = NULL;
>>>>>>         retval = -ENOMEM;
>>>>>>       buf = kmalloc(PATH_MAX, GFP_KERNEL);
>>>>>> @@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, 
>>>>>> struct pid_namespace *ns,
>>>>>>           goto out;
>>>>>>         css = task_get_css(tsk, cpuset_cgrp_id);
>>>>>> +    rcu_read_lock();
>>>>>> +    /*
>>>>>> +     * When the cpuset subsystem is mounted on the legacy 
>>>>>> hierarchy,
>>>>>> +     * the top_cpuset.css->cgroup does not hold a reference 
>>>>>> count of
>>>>>> +     * cgroup_root.cgroup. This makes accessing css->cgroup very
>>>>>> +     * dangerous because when the cpuset subsystem is remounted 
>>>>>> to the
>>>>>> +     * default hierarchy, the cgroup_root.cgroup that 
>>>>>> css->cgroup points
>>>>>> +     * to will be released, leading to a UAF issue. To avoid 
>>>>>> this problem,
>>>>>> +     * get the reference count of top_cpuset.css->cgroup first.
>>>>>> +     *
>>>>>> +     * This is ugly!!
>>>>>> +     */
>>>>>> +    if (css == &top_cpuset.css) {
>>>>>> +        cgroup_get(css->cgroup);
>>>>>> +        root_cgroup = css->cgroup;
>>>>>> +    }
>>>>>> +    rcu_read_unlock();
>>>>>>       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>>>>>>                   current->nsproxy->cgroup_ns);
>>>>>>       css_put(css);
>>>>>> +    if (root_cgroup)
>>>>>> +        cgroup_put(root_cgroup);
>>>>>>       if (retval == -E2BIG)
>>>>>>           retval = -ENAMETOOLONG;
>>>>>>       if (retval < 0)
>>>>>
>>>>> Thanks for reporting this UAF bug. Could you try the attached 
>>>>> patch to see if it can fix the issue?
>>>>>
>>>>
>>>> +/*
>>>> + * With a cgroup v1 mount, root_css.cgroup can be freed. We need 
>>>> to take a
>>>> + * reference to it to avoid UAF as proc_cpuset_show() may access 
>>>> the content
>>>> + * of this cgroup.
>>>> + */
>>>>  static void cpuset_bind(struct cgroup_subsys_state *root_css)
>>>>  {
>>>> +    static struct cgroup *v1_cgroup_root;
>>>> +
>>>>      mutex_lock(&cpuset_mutex);
>>>> +    if (v1_cgroup_root) {
>>>> +        cgroup_put(v1_cgroup_root);
>>>> +        v1_cgroup_root = NULL;
>>>> +    }
>>>>      spin_lock_irq(&callback_lock);
>>>>
>>>>      if (is_in_v2_mode()) {
>>>> @@ -4159,6 +4170,10 @@ static void cpuset_bind(struct 
>>>> cgroup_subsys_state *root_css)
>>>>      }
>>>>
>>>>      spin_unlock_irq(&callback_lock);
>>>> +    if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
>>>> +        v1_cgroup_root = root_css->cgroup;
>>>> +        cgroup_get(v1_cgroup_root);
>>>> +    }
>>>>      mutex_unlock(&cpuset_mutex);
>>>>  }
>>>>
>>>> Thanks for your suggestion. If we take a reference at rebind(call 
>>>> ->bind()) function, cgroup_root allocated when setting up root for 
>>>> cgroup v1 can never be released, because the reference count will 
>>>> never be reduced to zero.
>>>>
>>>> We have already tried similar methods to fix this issue, however 
>>>> doing so causes another issue as mentioned previously.
>>>
>>> You are right. Taking the reference in cpuset_bind() will prevent 
>>> cgroup_destroy_root() from being called. I had overlooked that.
>>>
>>> Now I have an even simpler fix. Could you try the attached v2 patch 
>>> to verify if that can fix the problem?
>>>
>>> Thanks,
>>> Longman
>>
>> Thanks you for your reply, v2 patch will lead to  ABBA deadlock.
>>
>> (cat /proc/<pid>/cpuset)        | (rebind_subsystems)
>>                                 | lockdep_assert_held(&cgroup_mutex);
>> mutex_lock(&cpuset_mutex);      |
>> cgroup_path_ns                  |   ->bind()
>>                                 |     mutex_lock(&cpuset_mutex);
>> mutex_lock(&cgroup_mutex);      |
>
> Bummer.
>
> Another alternative is to create a cgroup_path_ns() variant that 
> accepts a "struct cgroup **pcgrp" and retrieve the actual cgroup 
> pointer inside its critical section. That should also fix the UAF.
>
> Cheers,
> Longman
>
>
I am wondering whether the cgroup framework itself should provide a way to 
fix this issue, since other subsystems may have the same underlying problem: 
the root css is never released, but css->cgroup is.

Regards,
Ridong
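
The ABBA deadlock quoted above comes down to two paths taking the same two
locks in opposite order. A minimal userspace sketch of that ordering check,
loosely in the spirit of lockdep, might look like the following. Everything
here (the enum names, acquire(), run_paths()) is hypothetical and only models
the order of acquisition; it is not kernel code and takes no real locks.

```c
#include <assert.h>
#include <string.h>

/* Toy lock-order tracker. Hypothetical names; models ordering only. */
enum { CPUSET_MUTEX, CGROUP_MUTEX, NR_LOCKS };

/* order_seen[a][b] != 0 means: some path took b while already holding a. */
static int order_seen[NR_LOCKS][NR_LOCKS];

struct ctx {
	int held[NR_LOCKS];	/* locks currently held by this path */
};

/* Record an acquisition; return 1 if it inverts a previously seen order. */
static int acquire(struct ctx *c, int lock)
{
	int inversion = 0;

	assert(lock >= 0 && lock < NR_LOCKS);
	for (int h = 0; h < NR_LOCKS; h++) {
		if (!c->held[h])
			continue;
		if (order_seen[lock][h])	/* someone took h under lock */
			inversion = 1;
		order_seen[h][lock] = 1;
	}
	c->held[lock] = 1;
	return inversion;
}

/* Run two paths, each taking two locks in the given order. */
static int run_paths(const int a[2], const int b[2])
{
	struct ctx ca, cb;
	int bad = 0;

	memset(order_seen, 0, sizeof(order_seen));
	memset(&ca, 0, sizeof(ca));
	memset(&cb, 0, sizeof(cb));

	bad |= acquire(&ca, a[0]);
	bad |= acquire(&ca, a[1]);
	bad |= acquire(&cb, b[0]);
	bad |= acquire(&cb, b[1]);
	return bad;
}
```

Feeding it {CPUSET_MUTEX, CGROUP_MUTEX} for the read path and
{CGROUP_MUTEX, CPUSET_MUTEX} for the rebind path reports an inversion, while
two paths that agree on the order do not.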



* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-25  3:12           ` chenridong
@ 2024-06-25 10:10             ` Michal Koutný
       [not found]               ` <920bbfaa-bb76-4aa1-bd07-9a552e3bfdf2@huawei.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Koutný @ 2024-06-25 10:10 UTC (permalink / raw)
  To: chenridong; +Cc: Waiman Long, tj, lizefan.x, hannes, bpf, cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

Hello.

On Tue, Jun 25, 2024 at 11:12:20AM GMT, chenridong <chenridong@huawei.com> wrote:
> I am considering whether the cgroup framework has a method to fix this
> issue, as other subsystems may also have the same underlying problem.
> Since the root css will not be released, but the css->cgrp will be
> released.

<del>First part is already done in
	d23b5c5777158 ("cgroup: Make operations on the cgroup root_list RCU safe")
second part is that</del>
you need to take RCU read lock and check for NULL, similar to
	9067d90006df0 ("cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()")

Does that make sense to you?

A Fixes: tag would be nice, it seems at least
	a79a908fd2b08 ("cgroup: introduce cgroup namespaces")
played some role. (Here the RCU lock is not for the cgroup_roots list but to
preserve the root cgrp itself against css_free_rwork_fn/cgroup_destroy_root.)

HTH,
Michal
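
The "take RCU read lock and check for NULL" idea can be illustrated with a
toy, single-threaded model: the writer unpublishes the pointer first and
defers the free until no read-side section is active, and the reader bails
out when it finds NULL. This is only a sketch under those assumptions. It is
NOT real RCU, and every name in it (toy_cgroup, toy_read_lock(),
toy_destroy_root(), toy_show()) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of deferred freeing. Hypothetical names; not kernel code. */
struct toy_cgroup {
	char path[32];
};

static struct toy_cgroup *root_cgrp;	/* published pointer, NULL once gone */
static struct toy_cgroup *pending_free;	/* waits for readers to finish */
static int readers;			/* active read-side sections */

static void toy_read_lock(void)
{
	readers++;
}

static void toy_read_unlock(void)
{
	assert(readers > 0);
	/* The last reader leaving ends the toy "grace period". */
	if (--readers == 0 && pending_free) {
		free(pending_free);
		pending_free = NULL;
	}
}

static void toy_mount(const char *path)
{
	root_cgrp = calloc(1, sizeof(*root_cgrp));
	assert(root_cgrp);
	strncpy(root_cgrp->path, path, sizeof(root_cgrp->path) - 1);
}

/* umount side: unpublish, then free only when no reader can see it. */
static void toy_destroy_root(void)
{
	struct toy_cgroup *old = root_cgrp;

	root_cgrp = NULL;
	if (!old)
		return;
	if (readers == 0) {
		free(old);
	} else {
		assert(!pending_free);	/* the toy tracks one object only */
		pending_free = old;
	}
}

/* Reader side: 0 and a copied path on success, -1 if the root is gone. */
static int toy_show(char *buf, size_t len)
{
	struct toy_cgroup *cgrp;
	int ret = -1;

	assert(len > 0);
	toy_read_lock();
	cgrp = root_cgrp;
	if (cgrp) {			/* the NULL check */
		strncpy(buf, cgrp->path, len - 1);
		buf[len - 1] = '\0';
		ret = 0;
	}
	toy_read_unlock();
	return ret;
}
```

A reader that loaded the pointer before toy_destroy_root() ran can still
dereference it until it calls toy_read_unlock(); a reader that arrives later
sees NULL and returns an error instead of touching freed memory.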



* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
       [not found]               ` <920bbfaa-bb76-4aa1-bd07-9a552e3bfdf2@huawei.com>
@ 2024-06-25 14:16                 ` Waiman Long
  2024-06-25 14:29                   ` chenridong
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2024-06-25 14:16 UTC (permalink / raw)
  To: chenridong, Michal Koutný
  Cc: tj, lizefan.x, hannes, bpf, cgroups, linux-kernel

On 6/25/24 10:11, chenridong wrote:
>
>
> On 2024/6/25 18:10, Michal Koutný wrote:
>> Hello.
>>
>> On Tue, Jun 25, 2024 at 11:12:20AM GMT, chenridong<chenridong@huawei.com>  wrote:
>>> I am considering whether the cgroup framework has a method to fix this
>>> issue, as other subsystems may also have the same underlying problem.
>>> Since the root css will not be released, but the css->cgrp will be
>>> released.
>> <del>First part is already done in
>> 	d23b5c5777158 ("cgroup: Make operations on the cgroup root_list RCU safe")
>> second part is that</del>
>> you need to take RCU read lock and check for NULL, similar to
>> 	9067d90006df0 ("cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()")
>>
>> Does that make sense to you?
>>
>> A Fixes: tag would be nice, it seems at least
>> 	a79a908fd2b08 ("cgroup: introduce cgroup namespaces")
>> played some role. (Here the RCU lock is not for cgroup_roots list but to
>> preserve the root cgrp itself css_free_rwork_fn/cgroup_destroy_root.
>>
>> HTH,
>> Michal
>
> Thank you, Michal, that is a good idea. Do you mean as below?
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>
> index c12b9fdb22a4..2ce0542067f1 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -5051,10 +5051,17 @@ int proc_cpuset_show(struct seq_file *m, 
> struct pid_namespace *ns,
>         if (!buf)
>                 goto out;
>
> +       rcu_read_lock();
> +       spin_lock_irq(&css_set_lock);
>         css = task_get_css(tsk, cpuset_cgrp_id);
> -       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
> - current->nsproxy->cgroup_ns);
> +
> +       retval = cgroup_path_ns_locked(css->cgroup, buf, PATH_MAX,
> +               current->nsproxy->cgroup_ns);
>         css_put(css);
> +
> +       spin_unlock_irq(&css_set_lock);
> +       cgroup_unlock();
> +
>         if (retval == -E2BIG)
>                 retval = -ENAMETOOLONG;
>
>         if (retval < 0)
>
That should work. However, I would suggest that you move task_get_css() 
and css_put() outside of the critical section. task_get_css() is a 
while loop that may take a while to execute, and you don't want to run it 
with interrupts disabled.

Cheers,
Longman



* Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show
  2024-06-25 14:16                 ` Waiman Long
@ 2024-06-25 14:29                   ` chenridong
  0 siblings, 0 replies; 16+ messages in thread
From: chenridong @ 2024-06-25 14:29 UTC (permalink / raw)
  To: Waiman Long, Michal Koutný
  Cc: tj, lizefan.x, hannes, bpf, cgroups, linux-kernel


On 2024/6/25 22:16, Waiman Long wrote:
> On 6/25/24 10:11, chenridong wrote:
>>
>>
>> On 2024/6/25 18:10, Michal Koutný wrote:
>>> Hello.
>>>
>>> On Tue, Jun 25, 2024 at 11:12:20AM GMT, 
>>> chenridong<chenridong@huawei.com>  wrote:
>>>> I am considering whether the cgroup framework has a method to fix this
>>>> issue, as other subsystems may also have the same underlying problem.
>>>> Since the root css will not be released, but the css->cgrp will be
>>>> released.
>>> <del>First part is already done in
>>>     d23b5c5777158 ("cgroup: Make operations on the cgroup root_list 
>>> RCU safe")
>>> second part is that</del>
>>> you need to take RCU read lock and check for NULL, similar to
>>>     9067d90006df0 ("cgroup: Eliminate the need for cgroup_mutex in 
>>> proc_cgroup_show()")
>>>
>>> Does that make sense to you?
>>>
>>> A Fixes: tag would be nice, it seems at least
>>>     a79a908fd2b08 ("cgroup: introduce cgroup namespaces")
>>> played some role. (Here the RCU lock is not for cgroup_roots list 
>>> but to
>>> preserve the root cgrp itself css_free_rwork_fn/cgroup_destroy_root.
>>>
>>> HTH,
>>> Michal
>>
>> Thank you, Michal, that is a good idea. Do you mean as below?
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>
>> index c12b9fdb22a4..2ce0542067f1 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -5051,10 +5051,17 @@ int proc_cpuset_show(struct seq_file *m, 
>> struct pid_namespace *ns,
>>         if (!buf)
>>                 goto out;
>>
>> +       rcu_read_lock();
>> +       spin_lock_irq(&css_set_lock);
>>         css = task_get_css(tsk, cpuset_cgrp_id);
>> -       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
>> - current->nsproxy->cgroup_ns);
>> +
>> +       retval = cgroup_path_ns_locked(css->cgroup, buf, PATH_MAX,
>> +               current->nsproxy->cgroup_ns);
>>         css_put(css);
>> +
>> +       spin_unlock_irq(&css_set_lock);
>> +       cgroup_unlock();
>> +
>>         if (retval == -E2BIG)
>>                 retval = -ENAMETOOLONG;
>>
>>         if (retval < 0)
>>
> That should work. However, I would suggest that you take 
> task_get_css() and css_put() outside of the critical section. The 
> task_get_css() is a while loop that may take a while to execute and 
> you don't want run it with interrupt disabled.
>
> Cheers,
> Longman
>
>
>
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -5050,11 +5050,18 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
         buf = kmalloc(PATH_MAX, GFP_KERNEL);
         if (!buf)
                 goto out;
-
         css = task_get_css(tsk, cpuset_cgrp_id);
-       retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
-                               current->nsproxy->cgroup_ns);
+
+       rcu_read_lock();
+       spin_lock_irq(&css_set_lock);
+
+       retval = cgroup_path_ns_locked(css->cgroup, buf, PATH_MAX,
+               current->nsproxy->cgroup_ns);
+
+       spin_unlock_irq(&css_set_lock);
+       rcu_read_unlock();
         css_put(css);
+
         if (retval == -E2BIG)
                 retval = -ENAMETOOLONG;

         if (retval < 0)


Yeah, that looks good. I will test it for a while and send a new patch 
if no other problems occur.

Thank you.

Regards,
Ridong




end of thread, other threads:[~2024-06-25 14:29 UTC | newest]

Thread overview: 16+ messages
2024-06-22 11:38 [PATCH -next] cgroup: fix uaf when proc_cpuset_show Chen Ridong
2024-06-22 13:45 ` Markus Elfring
2024-06-24  3:34   ` chenridong
2024-06-22 15:05 ` Waiman Long
2024-06-22 20:04   ` [PATCH] cgroup/cpuset: Prevent UAF in proc_cpuset_show() Markus Elfring
2024-06-22 20:12     ` Waiman Long
2024-06-23  6:18       ` Markus Elfring
2024-06-23 16:28         ` Waiman Long
2024-06-24  2:59   ` [PATCH -next] cgroup: fix uaf when proc_cpuset_show chenridong
2024-06-24 23:59     ` Waiman Long
2024-06-25  1:46       ` chenridong
2024-06-25  2:40         ` Waiman Long
2024-06-25  3:12           ` chenridong
2024-06-25 10:10             ` Michal Koutný
     [not found]               ` <920bbfaa-bb76-4aa1-bd07-9a552e3bfdf2@huawei.com>
2024-06-25 14:16                 ` Waiman Long
2024-06-25 14:29                   ` chenridong
