public inbox for linux-kernel@vger.kernel.org
* [PATCH] cgroup(fix critical bug): new handling for tasks file
@ 2008-08-26  1:47 Lai Jiangshan
  2008-08-26  2:29 ` Paul Menage
  0 siblings, 1 reply; 13+ messages in thread
From: Lai Jiangshan @ 2008-08-26  1:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Menage, Linux Kernel Mailing List, Linux Containers

In the original code, we kmalloc() a lot of memory when cgroup.tasks is
opened. _This memory is allocated by kmalloc()_, so it is not accounted
to the opening process or user. This is a critical bug.

An unprivileged user (attacker) may use all of his available processes
to enlarge the cgroup.tasks file, and all of his available open files
to open cgroup.tasks many, many times. In this way he can _steal_ a lot
of memory.
If the amount of memory he steals exceeds the system's memory,
the system will oops.

This script shows how much memory is used when a file is opened 1000 times:
--------------------
#!/bin/sh

file=$1

use_mem=$(free -k | awk '{if(NR==2) print $3}')
i=24
while ((i<1024))
do
	eval "exec $i<$file"
	((i++))
done
use_mem_after_open=$(free -k | awk '{if(NR==2) print $3}')
echo Use about $((use_mem_after_open - use_mem))K physical memories \
	for open $file 1000 times
-----------------------
$ ./memory.sh /proc/cpuinfo
Use about 156K physical memories for open /proc/cpuinfo 1000 times
$ ./memory.sh /proc/cpuinfo
Use about 0K physical memories for open /proc/cpuinfo 1000 times
$ ./memory.sh /proc/cpuinfo
Use about -36K physical memories for open /proc/cpuinfo 1000 times
$ ./memory.sh /proc/cpuinfo
Use about 0K physical memories for open /proc/cpuinfo 1000 times
$ ./memory.sh /proc/cpuinfo
Use about 56K physical memories for open /proc/cpuinfo 1000 times
----------------------
$ wc -l /dev/cpuset/tasks
1163
$ ./memory.sh /dev/cpuset/tasks
Use about 7896K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 8008K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 8104K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 7756K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 8104K physical memories for open /dev/cpuset/tasks 1000 times
-----------------------

This patch fixes the bug, and also saves memory when the tasks file is
opened twice or more: all open files that refer to the same cgroup's
tasks file share one pids array.

Some semantics change:
1) the cgroup's tasks file is now nonseekable.
2) dependent auto-updating:
   when the tasks file is opened again (by this process or another), the
   unread part of the file (the unread pids) is regenerated.
   2.1) a task that leaves this cgroup after the open but before the read
	may not be read.
   2.2) a task that enters this cgroup after the open but before the read
	may be read.

2.1) differs from the original semantics, where the pids of all tasks in
the cgroup at open time were read, even for a task that had already left
by the time of the read.

testing after fix:

$ wc -l /dev/cpuset/tasks
1056 /dev/cpuset/tasks
$ ./memory.sh /dev/cpuset/tasks
Use about 112K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 120K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 52K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 16K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 32K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 8K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 120K physical memories for open /dev/cpuset/tasks 1000 times

These tests also pass:
$ dd  ibs=1  if=/dev/cpuset/tasks
$ dd  ibs=2  if=/dev/cpuset/tasks
$ dd  ibs=3  if=/dev/cpuset/tasks
...

and tests with complex open/read sequences (multi-threaded) pass as well.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c98dd7c..5e88178 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -15,6 +15,7 @@
 #include <linux/rcupdate.h>
 #include <linux/cgroupstats.h>
 #include <linux/prio_heap.h>
+#include <linux/rwsem.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -137,6 +138,12 @@ struct cgroup {
 	 * release_list_lock
 	 */
 	struct list_head release_list;
+
+	/* tasks file handling */
+	struct rw_semaphore pids_mutex;
+	int pids_use_count;
+	int pids_length;
+	pid_t *tasks_pids;
 };
 
 /* A css_set is a structure holding pointers to a set of
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 13932ab..3c87f20 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1997,15 +1997,13 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
  * but we cannot guarantee that the information we produce is correct
  * unless we produce it entirely atomically.
  *
- * Upon tasks file open(), a struct ctr_struct is allocated, that
- * will have a pointer to an array (also allocated here).  The struct
+ * Upon tasks file open(), a struct ctr_struct is allocated.  The struct
  * ctr_struct * is stored in file->private_data.  Its resources will
- * be freed by release() when the file is closed.  The array is used
- * to sprintf the PIDs and then used by read().
+ * be freed by release() when the file is closed.
  */
 struct ctr_struct {
-	char *buf;
-	int bufsz;
+	pid_t last_pid; /* last read pid for tasks file */
+	int unconsumed; /* bytes unconsumed for last_pid */
 };
 
 /*
@@ -2089,21 +2087,6 @@ static int cmppid(const void *a, const void *b)
 }
 
 /*
- * Convert array 'a' of 'npids' pid_t's to a string of newline separated
- * decimal pids in 'buf'.  Don't write more than 'sz' chars, but return
- * count 'cnt' of how many chars would be written if buf were large enough.
- */
-static int pid_array_to_buf(char *buf, int sz, pid_t *a, int npids)
-{
-	int cnt = 0;
-	int i;
-
-	for (i = 0; i < npids; i++)
-		cnt += snprintf(buf + cnt, max(sz - cnt, 0), "%d\n", a[i]);
-	return cnt;
-}
-
-/*
  * Handle an open on 'tasks' file.  Prepare a buffer listing the
  * process id's of tasks currently attached to the cgroup being opened.
  *
@@ -2115,7 +2098,6 @@ static int cgroup_tasks_open(struct inode *unused, struct file *file)
 	struct ctr_struct *ctr;
 	pid_t *pidarray;
 	int npids;
-	char c;
 
 	if (!(file->f_mode & FMODE_READ))
 		return 0;
@@ -2123,6 +2105,8 @@ static int cgroup_tasks_open(struct inode *unused, struct file *file)
 	ctr = kmalloc(sizeof(*ctr), GFP_KERNEL);
 	if (!ctr)
 		goto err0;
+	ctr->last_pid = 0;
+	ctr->unconsumed = 0;
 
 	/*
 	 * If cgroup gets more users after we read count, we won't have
@@ -2139,47 +2123,120 @@ static int cgroup_tasks_open(struct inode *unused, struct file *file)
 		npids = pid_array_load(pidarray, npids, cgrp);
 		sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);
 
-		/* Call pid_array_to_buf() twice, first just to get bufsz */
-		ctr->bufsz = pid_array_to_buf(&c, sizeof(c), pidarray, npids) + 1;
-		ctr->buf = kmalloc(ctr->bufsz, GFP_KERNEL);
-		if (!ctr->buf)
-			goto err2;
-		ctr->bufsz = pid_array_to_buf(ctr->buf, ctr->bufsz, pidarray, npids);
-
-		kfree(pidarray);
-	} else {
-		ctr->buf = NULL;
-		ctr->bufsz = 0;
+		down_write(&cgrp->pids_mutex);
+		if (cgrp->tasks_pids)
+			kfree(cgrp->tasks_pids);
+		cgrp->tasks_pids = pidarray;
+		cgrp->pids_length = npids;
+		cgrp->pids_use_count++;
+		up_write(&cgrp->pids_mutex);
 	}
 	file->private_data = ctr;
+	nonseekable_open(unused, file);
 	return 0;
 
-err2:
-	kfree(pidarray);
 err1:
 	kfree(ctr);
 err0:
 	return -ENOMEM;
 }
 
+#define PIDBUF_LEN 128
 static ssize_t cgroup_tasks_read(struct cgroup *cgrp,
 				    struct cftype *cft,
 				    struct file *file, char __user *buf,
 				    size_t nbytes, loff_t *ppos)
 {
 	struct ctr_struct *ctr = file->private_data;
+	pid_t pid = ctr->last_pid;
+	int unconsumed = ctr->unconsumed;
+
+	int index = 0, end;
+	char pidbuf[PIDBUF_LEN];
+	int buf_count, total_count = 0;
+
+	down_read(&cgrp->pids_mutex);
+
+	/* write the unconsumed part of last pid to buf */
+	if (unconsumed > 0) {
+		int byte_start;
+		buf_count = snprintf(pidbuf, PIDBUF_LEN, "%d\n", pid);
+		byte_start = buf_count - unconsumed;
+		total_count = min(unconsumed, (int)nbytes);
+		if (copy_to_user(buf, pidbuf + byte_start, total_count)) {
+			total_count = -EFAULT;
+			goto done;
+		}
+		if (total_count == nbytes) {
+			ctr->unconsumed = unconsumed - total_count;
+			goto done;
+		}
+	}
 
-	return simple_read_from_buffer(buf, nbytes, ppos, ctr->buf, ctr->bufsz);
+	/*
+	 * binary search (upper-bound version):
+	 * find minimum index that cgrp->tasks_pids[index] > pid or
+	 * index = cgrp->pids_length
+	 */
+	end = cgrp->pids_length;
+	while (index < end) {
+		int mid = (index + end) / 2;
+		if (cgrp->tasks_pids[mid] <= pid)
+			index = mid + 1;
+		else
+			end = mid;
+	}
+
+	/* write pids to buf */
+	pid = cgrp->tasks_pids[cgrp->pids_length - 1];
+	unconsumed = 0;
+	while (index < cgrp->pids_length && total_count < nbytes) {
+		buf_count = 0;
+		while (index < cgrp->pids_length) {
+			int len = snprintf(pidbuf + buf_count,
+					PIDBUF_LEN - buf_count, "%d\n",
+					cgrp->tasks_pids[index]);
+			if ((buf_count + len) > PIDBUF_LEN)
+				break;
+			buf_count += len;
+			if (total_count + buf_count >= nbytes) {
+				pid = cgrp->tasks_pids[index];
+				unconsumed = total_count + buf_count - nbytes;
+				buf_count = nbytes - total_count;
+				break;
+			}
+			index++;
+		}
+		if (copy_to_user(buf + total_count, pidbuf, buf_count)) {
+			total_count = -EFAULT;
+			goto done;
+		}
+		total_count += buf_count;
+	}
+	ctr->last_pid = pid;
+	ctr->unconsumed = unconsumed;
+
+done:
+	up_read(&cgrp->pids_mutex);
+	return total_count;
 }
 
 static int cgroup_tasks_release(struct inode *unused_inode,
 					struct file *file)
 {
+	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
 	struct ctr_struct *ctr;
 
 	if (file->f_mode & FMODE_READ) {
+		down_write(&cgrp->pids_mutex);
+		if (!--cgrp->pids_use_count) {
+			kfree(cgrp->tasks_pids);
+			cgrp->tasks_pids = NULL;
+			cgrp->pids_length = 0;
+		}
+		up_write(&cgrp->pids_mutex);
+
 		ctr = file->private_data;
-		kfree(ctr->buf);
 		kfree(ctr);
 	}
 	return 0;
@@ -2304,6 +2361,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 	INIT_LIST_HEAD(&cgrp->children);
 	INIT_LIST_HEAD(&cgrp->css_sets);
 	INIT_LIST_HEAD(&cgrp->release_list);
+	init_rwsem(&cgrp->pids_mutex);
 
 	cgrp->parent = parent;
 	cgrp->root = parent->root;


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-08-26  1:47 [PATCH] cgroup(fix critical bug): new handling for tasks file Lai Jiangshan
@ 2008-08-26  2:29 ` Paul Menage
  2008-08-26  5:22   ` Lai Jiangshan
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Menage @ 2008-08-26  2:29 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

On Mon, Aug 25, 2008 at 6:47 PM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> In the original code, we kmalloc() a lot of memory when cgroup.tasks is
> opened. _This memory is allocated by kmalloc()_, so it is not accounted
> to the opening process or user. This is a critical bug.

It's true that this does represent a way for an untrusted user to
consume a big chunk of memory on a machine. It's been around since
cpusets was introduced. I'm not sure that it's more serious than say,
a fork bomb or allocating lots of memory via lots of sockets with huge
network buffers. I don't see how this can cause an oops - slab.c and
slub.c both appear to just return NULL in that case, from a quick
inspection.

But improving the current code has been on my todo list for a while.
My thought on the best way to fix it is to use a proper seq_file -
that would remove the issues that you have currently with the
"unconsumed" portion of the output.

Using a seq_file iterator with your single pid array per cgroup might
work quite nicely - but isn't there still a problem that with a really
large cgroup we can overflow the kmalloc() limit for a single pid
array?

My original thoughts for solving this involved using a prio_heap to
ensure that every task in the cgroup was seen once:

1) similar to cgroup_scan_tasks, fill a prio_heap with the first N (by
task start time) tasks in the cgroup
2) sort these tasks by pid
3) report these by seq_file
4) once all the pids in the heap have been reported, update the
minimum task start time to be greater than the last task start time in
the reported group, and goto 1

This way we never have to allocate more than a single prio_heap per
open file; the downside is that we have more overhead when reporting
on really large cgroups since we have to make multiple scans of the
cgroup's task list.
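The four steps above can be simulated in userspace C. This is only a
sketch of the scheme under stated assumptions: qsort() stands in for the
prio_heap, `struct task` and all function names are hypothetical, and
start times are assumed distinct. Note that the output is sorted by pid
within each batch, but not across batches.

```c
#include <stdlib.h>

struct task { long start_time; int pid; };

static int by_start(const void *a, const void *b)
{
	const struct task *x = a, *y = b;
	return (x->start_time > y->start_time) - (x->start_time < y->start_time);
}

static int by_pid(const void *a, const void *b)
{
	return ((const struct task *)a)->pid - ((const struct task *)b)->pid;
}

/* Simulate the scheme: repeatedly take the `batch` tasks with the
 * smallest start times above a rising threshold, sort that batch by
 * pid, and report it.  out[] collects what a seq_file would emit.
 * Returns the number of pids emitted; each task is seen exactly once. */
static int report_in_batches(const struct task *tasks, int ntasks,
			     int batch, int *out)
{
	struct task *scratch = malloc(ntasks * sizeof(*scratch));
	long min_start = -1;
	int emitted = 0;

	for (;;) {
		int n = 0, take, i;

		/* step 1: gather tasks started after the threshold */
		for (i = 0; i < ntasks; i++)
			if (tasks[i].start_time > min_start)
				scratch[n++] = tasks[i];
		if (n == 0)
			break;
		qsort(scratch, n, sizeof(*scratch), by_start);
		take = n < batch ? n : batch;

		/* step 4: raise the threshold past this batch */
		min_start = scratch[take - 1].start_time;

		/* steps 2-3: sort the batch by pid and report it */
		qsort(scratch, take, sizeof(*scratch), by_pid);
		for (i = 0; i < take; i++)
			out[emitted++] = scratch[i].pid;
	}
	free(scratch);
	return emitted;
}
```

The fixed `batch` size bounds the per-open allocation regardless of
cgroup size, at the cost of rescanning the task list once per batch.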

Paul


* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-08-26  2:29 ` Paul Menage
@ 2008-08-26  5:22   ` Lai Jiangshan
  2008-08-26 22:40     ` Paul Menage
  2008-08-26 22:44     ` Paul Menage
  0 siblings, 2 replies; 13+ messages in thread
From: Lai Jiangshan @ 2008-08-26  5:22 UTC (permalink / raw)
  To: Paul Menage; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

Paul Menage wrote:
> On Mon, Aug 25, 2008 at 6:47 PM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>> In the original code, we kmalloc() a lot of memory when cgroup.tasks is
>> opened. _This memory is allocated by kmalloc()_, so it is not accounted
>> to the opening process or user. This is a critical bug.
> 
> It's true that this does represent a way for an untrusted user to
> consume a big chunk of memory on a machine. It's been around since
> cpusets was introduced. I'm not sure that it's more serious than say,
> a fork bomb or allocating lots of memory via lots of sockets with huge
IMO this is very different. The number of created processes is accounted,
and the number of open sockets is accounted; the kernel (the OOM killer,
...) or the system administrator (using ulimit or other tools) can
control those behaviors. But a big chunk of kmalloc()ed memory is not
accounted.

The evil here is precisely that this big chunk of memory is accounted to
nobody.

> network buffers. I don't see how this can cause an oops - slab.c and
> slub.c both appear to just return NULL in that case, from a quick
> inspection.
In my tests I tried my best and still failed to rescue the system in this
condition. (Though admittedly my skill is limited enough that perhaps I
could not rescue any system.)

By "oops" I mean the system becomes very hard to rescue. The kernel does
not know which process should be killed, because there is no accounting;
the system administrator may know which process to kill (using lsof or
other tools), but has no memory left to do anything about it.

Also, in one of my tests the system stopped responding after some printk
messages from another subsystem were output. I took that for an oops.

> 
> But improving the current code has been on my todo list for a while.
> My thought on the best way to fix it is to use a proper seq_file -
> that would remove the issues that you have currently with the
> "unconsumed" portion of the output.

But to my knowledge we always have to handle an "unconsumed portion" for
a nonseekable file in one way or another. In my binary-search-based
solution I have to handle the unconsumed part of a pid, which brings in
some complexity. That is indeed a disadvantage, as you pointed out.

> 
> Using a seq_file iterator with your single pid array per cgroup might
> work quite nicely - but isn't there still a problem that with a really
> large cgroup we can overflow the kmalloc() limit for a single pid
> array?
> 
> My original thoughts for solving this involved using a prio_heap to
> ensure that every task in the cgroup was seen once:
> 
> 1) similar to cgroup_scan_tasks, fill a prio_heap with the first N (by
> task start time) tasks in the cgroup
> 2) sort these tasks by pid
> 3) report these by seq_file
> 4) once all the pids in the heap have been reported, update the
> minimum task start time to be greater than the last task start time in
> the reported group, and goto 1
> 
> This way we never have to allocate more than a single prio_heap per
> open file; the downside is that we have more overhead when reporting
> on really large cgroups since we have to make multiple scans of the
> cgroup's task list.
> 
It is more complicated than necessary and changes too much code, IMO.
That approach is also nonseekable (because of the multiple scans), and
the output of the tasks file is not sorted (again because of the
multiple scans).

The steps in my solution:
1) deal with the unconsumed part of the pid that was just read
2) binary search
3) deal with the unconsumed part of the file (the unconsumed pids)

Step 2 ensures that every task in the cgroup is seen exactly once and
that we never need to scan again, so the output of the tasks file is
sorted.
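The search in step 2 is a standard upper-bound binary search over the
sorted pid array; here is a userspace C sketch (the helper name is
hypothetical) of resuming after the last pid already handed to userspace:

```c
/* Upper-bound binary search: return the smallest index i such that
 * pids[i] > last_pid, or len if every pid is <= last_pid.  This is
 * where a read resumes after the last pid reported to userspace. */
static int next_pid_index(const int *pids, int len, int last_pid)
{
	int lo = 0, hi = len;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (pids[mid] <= last_pid)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

Resuming by pid value rather than by byte offset is what makes the
shared array safe to refresh between reads: pids that left before the
resume point are simply skipped.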


Thank you for reviewing and comments.

                      Lai




* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-08-26  5:22   ` Lai Jiangshan
@ 2008-08-26 22:40     ` Paul Menage
  2008-08-26 22:44     ` Paul Menage
  1 sibling, 0 replies; 13+ messages in thread
From: Paul Menage @ 2008-08-26 22:40 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

On Mon, Aug 25, 2008 at 10:22 PM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>> But improving the current code has been on my todo list for a while.
>> My thought on the best way to fix it is to use a proper seq_file -
>> that would remove the issues that you have currently with the
>> "unconsumed" portion of the output.
>
> But for my poor knowledge we always have to handle "unconsumed portion"
> for nonseekable file in all different way. In my binary search based
> solving, I have to handle unconsumed part for a pid and bring in
> some complicated. This is really a disadvantage as you pointed out.

I meant that the seq_file code could handle the "unconsumed" portion
for you, rather than you having to manage it explicitly.

Paul


* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-08-26  5:22   ` Lai Jiangshan
  2008-08-26 22:40     ` Paul Menage
@ 2008-08-26 22:44     ` Paul Menage
  2008-08-27  4:29       ` Lai Jiangshan
  1 sibling, 1 reply; 13+ messages in thread
From: Paul Menage @ 2008-08-26 22:44 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

On Mon, Aug 25, 2008 at 10:22 PM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> It's more complicated than necessary and changes too much code, IMO.

What about the problem that maintaining a single pid array can still
fail for a really large cgroup? I guess we could just say "don't
create such large cgroups" but someone's bound to want to do that.
Perhaps use an array of pages rather than a single large kmalloc?

Paul


* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-08-26 22:44     ` Paul Menage
@ 2008-08-27  4:29       ` Lai Jiangshan
  2008-08-27 12:36         ` Paul Menage
  0 siblings, 1 reply; 13+ messages in thread
From: Lai Jiangshan @ 2008-08-27  4:29 UTC (permalink / raw)
  To: Paul Menage; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

Paul Menage wrote:
> On Mon, Aug 25, 2008 at 10:22 PM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>> It's more complicated than necessary and changes too much code, IMO.
> 
> What about the problem that maintaining a single pid array can still
> fail for a really large cgroup? I guess we could just say "don't
> create such large cgroups" but someone's bound to want to do that.
> Perhaps use an array of pages rather than a single large kmalloc?
> 

Actually, I had planned to write such a patch:
[RFC PATCH] cgroup,cpuset: use alternative malloc instead of kmalloc

The main idea: when the allocation size is >= PAGE_SIZE, use vmalloc()
instead. This reduces the pressure when few contiguous pages are
available. The alternative malloc would be used for cgroup_tasks_open()
and update_tasks_nodemask().


vmalloc() can also allocate more memory than kmalloc() can; is vmalloc()
enough? If not, I think using an array of pages is the best choice.

[Several subsystems already use an alternative malloc, kernel/relay.c
for example; relay.c also uses an array of pages for its relay buffer.]


   Lai



* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-08-27  4:29       ` Lai Jiangshan
@ 2008-08-27 12:36         ` Paul Menage
  2008-08-28 12:09           ` Lai Jiangshan
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Menage @ 2008-08-27 12:36 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

On Tue, Aug 26, 2008 at 9:29 PM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>
> And vmalloc can malloc larger memory than kmalloc, is vmalloc() enough?
> If not, I think using an array of pages is the best choice.

vmalloc would be simpler, certainly, but it has a higher overhead. And
since we're dealing with arrays of integers here, it's not too hard to
manage multiple arrays. Oh, except for sorting them, which would be
more of a pain. So yes, maybe vmalloc() would be a better choice at
first.

Paul


* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-08-27 12:36         ` Paul Menage
@ 2008-08-28 12:09           ` Lai Jiangshan
  2008-09-05  5:34             ` Paul Menage
  0 siblings, 1 reply; 13+ messages in thread
From: Lai Jiangshan @ 2008-08-28 12:09 UTC (permalink / raw)
  To: Paul Menage; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

Paul Menage wrote:
> On Tue, Aug 26, 2008 at 9:29 PM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>> And vmalloc can malloc larger memory than kmalloc, is vmalloc() enough?
>> If not, I think using an array of pages is the best choice.
> 
> vmalloc would be simpler, certainly, but it has a higher overhead. And
> since we're dealing with arrays of integers here, it's not too hard to
> manage multiple arrays. Oh, except for sorting them, which would be
> more of a pain. So yes, maybe vmalloc() would be a better choice at
> first.
> 
Yep, those are hard. Which method is your favorite?

My original purpose was to fix the bug I described.
This bug and the problem of offering a big enough array for a huge
cgroup are orthogonal!

Could you consider/test whether it really is a bug as I described (and
whether it is as critical as I claimed; maybe I was too nervous)?
And this is also a problem: opening cgroup.tasks twice or more wastes
a lot of _physical_ memory.

    Thanks! Lai



* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-08-28 12:09           ` Lai Jiangshan
@ 2008-09-05  5:34             ` Paul Menage
  2008-09-08  8:19               ` Lai Jiangshan
  2008-09-08 21:16               ` Andrew Morton
  0 siblings, 2 replies; 13+ messages in thread
From: Paul Menage @ 2008-09-05  5:34 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

Hi Lai,

Sorry for the delay, I've been away on vacation.

Lai Jiangshan wrote:
> 
> My original purpose was to fix the bug I described.
> This bug and the problem of offering a big enough array for a huge
> cgroup are orthogonal!
> 

You're right. So solving them separately seems fine.

It's definitely a bug that can usefully be fixed, but I think it's not quite as critical as you suggest.

I think the simplest approach is to use your basic idea of a shared pid array, but using a standard seq_file rather than hand-coded logic for handling partial reads. Something like the patch below.

Other possible extensions might include:
- delay generating the pid array until the first read
- allocate an array of single pages rather than a single kmalloc() region

Paul



Convert cgroup tasks file to use a seq_file with shared pid array

Rather than pre-generating the entire text for the "tasks" file each
time the file is opened, we instead just generate/update the array of
process ids and use a seq_file to report these to userspace. All open
file handles on the same "tasks" file can share a pid array, which may
be updated any time that no thread is actively reading the array. By
sharing the array, the potential for userspace to DoS the system by
opening many handles on the same "tasks" file is removed.

[ Based on a patch by Lai Jiangshan, extended to use seq_file ]

Signed-off-by: Paul Menage <menage@google.com>

---
 include/linux/cgroup.h |   10 ++
 kernel/cgroup.c        |  226 ++++++++++++++++++++++++++++++-------------------
 2 files changed, 151 insertions(+), 85 deletions(-)

Index: pids-2.6.27-rc1-mm1/include/linux/cgroup.h
===================================================================
--- pids-2.6.27-rc1-mm1.orig/include/linux/cgroup.h
+++ pids-2.6.27-rc1-mm1/include/linux/cgroup.h
@@ -15,6 +15,7 @@
 #include <linux/rcupdate.h>
 #include <linux/cgroupstats.h>
 #include <linux/prio_heap.h>
+#include <linux/rwsem.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -137,6 +138,15 @@ struct cgroup {
 	 * release_list_lock
 	 */
 	struct list_head release_list;
+
+	/* pids_mutex protects the fields below */
+	struct rw_semaphore pids_mutex;
+	/* Array of process ids in the cgroup */
+	pid_t *tasks_pids;
+	/* How many files are using the current tasks_pids array */
+	int pids_use_count;
+	/* Length of the current tasks_pids array */
+	int pids_length;
 };
 
 /* A css_set is a structure holding pointers to a set of
Index: pids-2.6.27-rc1-mm1/kernel/cgroup.c
===================================================================
--- pids-2.6.27-rc1-mm1.orig/kernel/cgroup.c
+++ pids-2.6.27-rc1-mm1/kernel/cgroup.c
@@ -870,6 +870,14 @@ static struct super_operations cgroup_op
 	.remount_fs = cgroup_remount,
 };
 
+static void init_cgroup_housekeeping(struct cgroup *cgrp)
+{
+	INIT_LIST_HEAD(&cgrp->sibling);
+	INIT_LIST_HEAD(&cgrp->children);
+	INIT_LIST_HEAD(&cgrp->css_sets);
+	INIT_LIST_HEAD(&cgrp->release_list);
+	init_rwsem(&cgrp->pids_mutex);
+}
 static void init_cgroup_root(struct cgroupfs_root *root)
 {
 	struct cgroup *cgrp = &root->top_cgroup;
@@ -878,10 +886,7 @@ static void init_cgroup_root(struct cgro
 	root->number_of_cgroups = 1;
 	cgrp->root = root;
 	cgrp->top_cgroup = cgrp;
-	INIT_LIST_HEAD(&cgrp->sibling);
-	INIT_LIST_HEAD(&cgrp->children);
-	INIT_LIST_HEAD(&cgrp->css_sets);
-	INIT_LIST_HEAD(&cgrp->release_list);
+	init_cgroup_housekeeping(cgrp);
 }
 
 static int cgroup_test_super(struct super_block *sb, void *data)
@@ -1997,16 +2002,7 @@ int cgroup_scan_tasks(struct cgroup_scan
  * but we cannot guarantee that the information we produce is correct
  * unless we produce it entirely atomically.
  *
- * Upon tasks file open(), a struct ctr_struct is allocated, that
- * will have a pointer to an array (also allocated here).  The struct
- * ctr_struct * is stored in file->private_data.  Its resources will
- * be freed by release() when the file is closed.  The array is used
- * to sprintf the PIDs and then used by read().
  */
-struct ctr_struct {
-	char *buf;
-	int bufsz;
-};
 
 /*
  * Load into 'pidarray' up to 'npids' of the tasks using cgroup
@@ -2088,42 +2084,132 @@ static int cmppid(const void *a, const v
 	return *(pid_t *)a - *(pid_t *)b;
 }
 
+
 /*
- * Convert array 'a' of 'npids' pid_t's to a string of newline separated
- * decimal pids in 'buf'.  Don't write more than 'sz' chars, but return
- * count 'cnt' of how many chars would be written if buf were large enough.
+ * seq_file methods for the "tasks" file. The seq_file position is the
+ * next pid to display; the seq_file iterator is a pointer to the pid
+ * in the cgroup->tasks_pids array.
  */
-static int pid_array_to_buf(char *buf, int sz, pid_t *a, int npids)
+
+static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
 {
-	int cnt = 0;
-	int i;
+	/*
+	 * Initially we receive a position value that corresponds to
+	 * one more than the last pid shown (or 0 on the first call or
+	 * after a seek to the start). Use a binary-search to find the
+	 * next pid to display, if any
+	 */
+	struct cgroup *cgrp = s->private;
+	int index = 0, pid = *pos;
+	int *iter;
+
+	down_read(&cgrp->pids_mutex);
+	if (pid) {
+		int end = cgrp->pids_length;
+		int i;
+		while (index < end) {
+			int mid = (index + end) / 2;
+			if (cgrp->tasks_pids[mid] == pid) {
+				index = mid;
+				break;
+			} else if (cgrp->tasks_pids[mid] <= pid)
+				index = mid + 1;
+			else
+				end = mid;
+		}
+	}
+	/* If we're off the end of the array, we're done */
+	if (index >= cgrp->pids_length)
+		return NULL;
+	/* Update the abstract position to be the actual pid that we found */
+	iter = cgrp->tasks_pids + index;
+	*pos = *iter;
+	return iter;
+}
+
+static void cgroup_tasks_stop(struct seq_file *s, void *v)
+{
+	struct cgroup *cgrp = s->private;
+	up_read(&cgrp->pids_mutex);
+}
+
+static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
+{
+	struct cgroup *cgrp = s->private;
+	int *p = v;
+	int *end = cgrp->tasks_pids + cgrp->pids_length;
+
+	/*
+	 * Advance to the next pid in the array. If this goes off the
+	 * end, we're done
+	 */
+	p++;
+	if (p >= end) {
+		return NULL;
+	} else {
+		*pos = *p;
+		return p;
+	}
+}
+
+static int cgroup_tasks_show(struct seq_file *s, void *v)
+{
+	return seq_printf(s, "%d\n", *(int *)v);
+}
 
-	for (i = 0; i < npids; i++)
-		cnt += snprintf(buf + cnt, max(sz - cnt, 0), "%d\n", a[i]);
-	return cnt;
+static struct seq_operations cgroup_tasks_seq_operations = {
+	.start = cgroup_tasks_start,
+	.stop = cgroup_tasks_stop,
+	.next = cgroup_tasks_next,
+	.show = cgroup_tasks_show,
+};
+
+static void release_cgroup_pid_array(struct cgroup *cgrp)
+{
+	down_write(&cgrp->pids_mutex);
+	BUG_ON(!cgrp->pids_use_count);
+	if (!--cgrp->pids_use_count) {
+		kfree(cgrp->tasks_pids);
+		cgrp->tasks_pids = NULL;
+		cgrp->pids_length = 0;
+	}
+	up_write(&cgrp->pids_mutex);
+}
+
+static int cgroup_tasks_release(struct inode *inode, struct file *file)
+{
+	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
+
+	if (!(file->f_mode & FMODE_READ))
+		return 0;
+
+	release_cgroup_pid_array(cgrp);
+	return seq_release(inode, file);
 }
 
+static struct file_operations cgroup_tasks_operations = {
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.write = cgroup_file_write,
+	.release = cgroup_tasks_release,
+};
+
 /*
- * Handle an open on 'tasks' file.  Prepare a buffer listing the
+ * Handle an open on 'tasks' file.  Prepare an array containing the
  * process id's of tasks currently attached to the cgroup being opened.
- *
- * Does not require any specific cgroup mutexes, and does not take any.
  */
+
 static int cgroup_tasks_open(struct inode *unused, struct file *file)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-	struct ctr_struct *ctr;
 	pid_t *pidarray;
 	int npids;
-	char c;
+	int retval;
 
+	/* Nothing to do for write-only files */
 	if (!(file->f_mode & FMODE_READ))
 		return 0;
 
-	ctr = kmalloc(sizeof(*ctr), GFP_KERNEL);
-	if (!ctr)
-		goto err0;
-
 	/*
 	 * If cgroup gets more users after we read count, we won't have
 	 * enough space - tough.  This race is indistinguishable to the
@@ -2131,57 +2217,31 @@ static int cgroup_tasks_open(struct inod
 	 * show up until sometime later on.
 	 */
 	npids = cgroup_task_count(cgrp);
-	if (npids) {
-		pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
-		if (!pidarray)
-			goto err1;
-
-		npids = pid_array_load(pidarray, npids, cgrp);
-		sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);
-
-		/* Call pid_array_to_buf() twice, first just to get bufsz */
-		ctr->bufsz = pid_array_to_buf(&c, sizeof(c), pidarray, npids) + 1;
-		ctr->buf = kmalloc(ctr->bufsz, GFP_KERNEL);
-		if (!ctr->buf)
-			goto err2;
-		ctr->bufsz = pid_array_to_buf(ctr->buf, ctr->bufsz, pidarray, npids);
-
-		kfree(pidarray);
-	} else {
-		ctr->buf = NULL;
-		ctr->bufsz = 0;
-	}
-	file->private_data = ctr;
-	return 0;
-
-err2:
-	kfree(pidarray);
-err1:
-	kfree(ctr);
-err0:
-	return -ENOMEM;
-}
-
-static ssize_t cgroup_tasks_read(struct cgroup *cgrp,
-				    struct cftype *cft,
-				    struct file *file, char __user *buf,
-				    size_t nbytes, loff_t *ppos)
-{
-	struct ctr_struct *ctr = file->private_data;
-
-	return simple_read_from_buffer(buf, nbytes, ppos, ctr->buf, ctr->bufsz);
-}
-
-static int cgroup_tasks_release(struct inode *unused_inode,
-					struct file *file)
-{
-	struct ctr_struct *ctr;
+	pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
+	if (!pidarray)
+		return -ENOMEM;
+	npids = pid_array_load(pidarray, npids, cgrp);
+	sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);
 
-	if (file->f_mode & FMODE_READ) {
-		ctr = file->private_data;
-		kfree(ctr->buf);
-		kfree(ctr);
+	/*
+	 * Store the array in the cgroup, freeing the old
+	 * array if necessary
+	 */
+	down_write(&cgrp->pids_mutex);
+	kfree(cgrp->tasks_pids);
+	cgrp->tasks_pids = pidarray;
+	cgrp->pids_length = npids;
+	cgrp->pids_use_count++;
+	up_write(&cgrp->pids_mutex);
+
+	file->f_op = &cgroup_tasks_operations;
+
+	retval = seq_open(file, &cgroup_tasks_seq_operations);
+	if (retval) {
+		release_cgroup_pid_array(cgrp);
+		return retval;
 	}
+	((struct seq_file *)file->private_data)->private = cgrp;
 	return 0;
 }
 
@@ -2210,7 +2270,6 @@ static struct cftype files[] = {
 	{
 		.name = "tasks",
 		.open = cgroup_tasks_open,
-		.read = cgroup_tasks_read,
 		.write_u64 = cgroup_tasks_write,
 		.release = cgroup_tasks_release,
 		.private = FILE_TASKLIST,
@@ -2300,10 +2359,7 @@ static long cgroup_create(struct cgroup 
 
 	mutex_lock(&cgroup_mutex);
 
-	INIT_LIST_HEAD(&cgrp->sibling);
-	INIT_LIST_HEAD(&cgrp->children);
-	INIT_LIST_HEAD(&cgrp->css_sets);
-	INIT_LIST_HEAD(&cgrp->release_list);
+	init_cgroup_housekeeping(cgrp);
 
 	cgrp->parent = parent;
 	cgrp->root = parent->root;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-09-05  5:34             ` Paul Menage
@ 2008-09-08  8:19               ` Lai Jiangshan
  2008-09-08 15:42                 ` Paul Menage
  2008-09-08 21:16               ` Andrew Morton
  1 sibling, 1 reply; 13+ messages in thread
From: Lai Jiangshan @ 2008-09-08  8:19 UTC (permalink / raw)
  To: Paul Menage; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

It's great, thanks!

	Lai Jiangshan.


Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Paul Menage wrote:

> - allocate an array of single pages rather than a single kmalloc() region

I will send a patch for it in a few days.

> +
> +static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
>  {
> -	int cnt = 0;
> -	int i;
> +	/*
> +	 * Initially we receive a position value that corresponds to
> +	 * one more than the last pid shown (or 0 on the first call or
> +	 * after a seek to the start). Use a binary-search to find the
> +	 * next pid to display, if any
> +	 */
> +	struct cgroup *cgrp = s->private;
> +	int index = 0, pid = *pos;
> +	int *iter;

Use pid_t instead of int here (and in the other functions as well).

> +
> +	down_read(&cgrp->pids_mutex);
> +	if (pid) {
> +		int end = cgrp->pids_length;
> +		int i;

The declaration "int i;" is unused.

> +		while (index < end) {
> +			int mid = (index + end) / 2;
> +			if (cgrp->tasks_pids[mid] == pid) {
> +				index = mid;
> +				break;
> +			} else if (cgrp->tasks_pids[mid] <= pid)

(cgrp->tasks_pids[mid] <= pid) ===> (cgrp->tasks_pids[mid] < pid)

> +				index = mid + 1;
> +			else
> +				end = mid;
> +		}
> +	}


* Re: Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-09-08  8:19               ` Lai Jiangshan
@ 2008-09-08 15:42                 ` Paul Menage
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Menage @ 2008-09-08 15:42 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Andrew Morton, Linux Kernel Mailing List, Linux Containers

On Mon, Sep 8, 2008 at 1:19 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>
>> +             while (index < end) {
>> +                     int mid = (index + end) / 2;
>> +                     if (cgrp->tasks_pids[mid] == pid) {
>> +                             index = mid;
>> +                             break;
>> +                     } else if (cgrp->tasks_pids[mid] <= pid)
>
> (cgrp->tasks_pids[mid] <= pid) ===> (cgrp->tasks_pids[mid] < pid)

Given the "if" test directly above, those two are equivalent.

Paul


* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-09-05  5:34             ` Paul Menage
  2008-09-08  8:19               ` Lai Jiangshan
@ 2008-09-08 21:16               ` Andrew Morton
  2008-09-08 22:05                 ` Paul Menage
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2008-09-08 21:16 UTC (permalink / raw)
  To: Paul Menage; +Cc: laijs, linux-kernel, containers

On Thu, 04 Sep 2008 22:34:47 -0700
Paul Menage <menage@google.com> wrote:

>  	npids = cgroup_task_count(cgrp);
> +	pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);

kmalloc becomes more unreliable above 32 kbytes and 100% unreliable
above MAX_ORDER.



* Re: [PATCH] cgroup(fix critical bug): new handling for tasks file
  2008-09-08 21:16               ` Andrew Morton
@ 2008-09-08 22:05                 ` Paul Menage
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Menage @ 2008-09-08 22:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: laijs, linux-kernel, containers

On Mon, Sep 8, 2008 at 2:16 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 04 Sep 2008 22:34:47 -0700
> Paul Menage <menage@google.com> wrote:
>
>>       npids = cgroup_task_count(cgrp);
>> +     pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
>
> kmalloc becomes more unreliable above 32 kbytes and 100% unreliable
> above MAX_ORDER.

Agreed, but that's something to be fixed in a different patch - the
existing cgroups code (and cpusets originally) has had this kind of
kmalloc call in it.

I think it should be reasonably straightforward to replace it with an
array of page allocations.

Paul


Thread overview: 13+ messages
2008-08-26  1:47 [PATCH] cgroup(fix critical bug): new handling for tasks file Lai Jiangshan
2008-08-26  2:29 ` Paul Menage
2008-08-26  5:22   ` Lai Jiangshan
2008-08-26 22:40     ` Paul Menage
2008-08-26 22:44     ` Paul Menage
2008-08-27  4:29       ` Lai Jiangshan
2008-08-27 12:36         ` Paul Menage
2008-08-28 12:09           ` Lai Jiangshan
2008-09-05  5:34             ` Paul Menage
2008-09-08  8:19               ` Lai Jiangshan
2008-09-08 15:42                 ` Paul Menage
2008-09-08 21:16               ` Andrew Morton
2008-09-08 22:05                 ` Paul Menage
