public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v5
@ 2011-12-28 12:14 Cyrill Gorcunov
  2012-01-15 18:07 ` Oleg Nesterov
  0 siblings, 1 reply; 5+ messages in thread
From: Cyrill Gorcunov @ 2011-12-28 12:14 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
	Tejun Heo, Andrew Vagin, Vasiliy Kulikov, Oleg Nesterov

When we do checkpoint of a task we need to know the list of children
the task has but there is no easy way to make a reverse parent->children
chain from an arbitrary <pid> (while a parent pid is provided in "PPid"
field of /proc/<pid>/status).

So instead of walking over all pids in the system creating the big process
tree just to figure out which children a task has -- we add the explicit
/proc/<pid>/task/<tid>/children entry, because the kernel already has
this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

v2:
 - Kame suggested to use a separated /proc/<pid>/children entry
   instead of poking /proc/<pid>/status
 - Andew suggested to use rcu facility instead of locking
   tasklist_lock
 - Tejun pointed that non-seekable seq file might not be
   enough for tasks with large number of children

v3:
 - To be on a safe side use %lu format for pid_t printing

v4:
 - New line get printed when sequence ends not at seq->stop,
   a nit pointed by Tejun
 - Documentation update
 - tasklist_lock is back, Oleg pointed that ->children list
   is actually not rcu-safe

v5:
 - Oleg suggested to make /proc/<pid>/task/<tid>/children
   instead of global /proc/<pid>/children, which eliminates
   hardness related to threads and children migration, and
   allows patch to be a way simplier.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 Documentation/filesystems/proc.txt |   20 ++++
 fs/proc/array.c                    |  162 +++++++++++++++++++++++++++++++++++++
 fs/proc/base.c                     |    1 
 fs/proc/internal.h                 |    1 
 4 files changed, 183 insertions(+), 1 deletion(-)

Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,7 +40,7 @@ Table of Contents
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
   3.5	/proc/<pid>/mountinfo - Information about mounts
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
-
+  3.7	/proc/<pid>/task/<tid>/children - Information about task children
 
 ------------------------------------------------------------------------------
 Preface
@@ -1545,3 +1545,21 @@ a task to set its own or one of its thre
 is limited in size compared to the cmdline value, so writing anything longer
 then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
 comm value.
+
+3.7	/proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be shown here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain descendants.
+
+Because this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results, which means if a child is exiting
+it might or might not be counted. The same applies to freshly created
+children -- they might or might not be counted.
+
+If one needs precise pids -- the task and children should be stopped
+or frozen.
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,165 @@ int proc_pid_statm(struct seq_file *m, s
 
 	return 0;
 }
+
+struct proc_pid_children_iter {
+	struct pid_namespace *pid_ns;
+	struct pid *pid_start;
+};
+
+static struct pid *
+get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
+{
+	struct task_struct *start, *task;
+	struct pid *pid = NULL;
+
+	read_lock(&tasklist_lock);
+
+	start = pid_task(iter->pid_start, PIDTYPE_PID);
+	if (!start)
+		goto out;
+
+	/*
+	 * Lets try to continue searching this would speed
+	 * search significantly.
+	 */
+	if (pid_prev) {
+		task = pid_task(pid_prev, PIDTYPE_PID);
+		if (task && task->real_parent == start &&
+		    !(list_empty(&task->sibling))) {
+			/*
+			 * OK, ltes try the fastpath, we might
+			 * miss some freshly created children
+			 * here, but it was never promised to be
+			 * accurate.
+			 *
+			 * Also note if we have not enough rights
+			 * to access the next children pid we simply
+			 * fall into slow-search version.
+			 */
+			if (!list_is_last(&task->sibling, &start->children)) {
+				task = list_first_entry(&task->sibling,
+							struct task_struct, sibling);
+				if (ptrace_may_access(task, PTRACE_MODE_READ)) {
+					pid = get_pid(task_pid(task));
+					goto out;
+				}
+			} else
+				goto out;
+		}
+	}
+
+	list_for_each_entry(task, &start->children, sibling) {
+		if (pos-- == 0) {
+			if (ptrace_may_access(task, PTRACE_MODE_READ)) {
+				pid = get_pid(task_pid(task));
+				goto out;
+			} else {
+				/* Maybe we success with the next children */
+				pos++;
+			}
+		}
+	}
+out:
+	read_unlock(&tasklist_lock);
+	return pid;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+	struct proc_pid_children_iter *iter = seq->private;
+	unsigned long pid = (unsigned long)pid_nr_ns(v, iter->pid_ns);
+
+	return seq_printf(seq, " %lu", pid);
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	return get_children_pid(seq->private, NULL, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct proc_pid_children_iter *iter = seq->private;
+	struct pid *pid = NULL;
+
+	pid = get_children_pid(iter, v, *pos + 1);
+	if (!pid)
+		seq_printf(seq, "\n");
+	put_pid(v);
+
+	++*pos;
+	return pid;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+	put_pid(v);
+}
+
+static const struct seq_operations children_seq_ops = {
+	.start	= children_seq_start,
+	.next	= children_seq_next,
+	.stop	= children_seq_stop,
+	.show	= children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+	struct proc_pid_children_iter *iter = NULL;
+	struct task_struct *task = NULL;
+	int ret = 0;
+
+	task = get_proc_task(inode);
+	if (!task) {
+		ret = -ENOENT;
+		goto err;
+	}
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		ret = -EACCES;
+		goto err;
+	}
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	ret = seq_open(file, &children_seq_ops);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = iter;
+
+		iter->pid_start = get_pid(task_pid(task));
+		iter->pid_ns = inode->i_sb->s_fs_info;
+	}
+
+err:
+	if (task)
+		put_task_struct(task);
+	if (ret)
+		kfree(iter);
+
+	return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = file->private_data;
+	struct proc_pid_children_iter *iter = m->private;
+
+	put_pid(iter->pid_start);
+	kfree(iter);
+
+	seq_release(inode, file);
+	return 0;
+}
+
+const struct file_operations proc_tid_children_operations = {
+	.open    = children_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3497,6 +3497,7 @@ static const struct pid_entry tid_base_s
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
 	REG("maps",      S_IRUGO, proc_maps_operations),
+	REG("children",  S_IRUGO, proc_tid_children_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
 #endif
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
 				struct pid *pid, struct task_struct *task);
 extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
 
+extern const struct file_operations proc_tid_children_operations;
 extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v5
  2011-12-28 12:14 [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v5 Cyrill Gorcunov
@ 2012-01-15 18:07 ` Oleg Nesterov
  2012-01-15 18:31   ` Cyrill Gorcunov
  0 siblings, 1 reply; 5+ messages in thread
From: Oleg Nesterov @ 2012-01-15 18:07 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On 12/28, Cyrill Gorcunov wrote:
>
> When we do checkpoint of a task we need to know the list of children
> the task has but there is no easy way to make a reverse parent->children
> chain from an arbitrary <pid> (while a parent pid is provided in "PPid"
> field of /proc/<pid>/status).

Looks correct at first glance... But I'll try to recheck. I guess you need
to resend anyway, I bet nobody can recall this patch ;)

However I do not understand the ptrace_may_access() check at all.

> +static struct pid *
> +get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
> +{
> +	struct task_struct *start, *task;
> +	struct pid *pid = NULL;
> +
> +	read_lock(&tasklist_lock);
> +
> +	start = pid_task(iter->pid_start, PIDTYPE_PID);
> +	if (!start)
> +		goto out;
> +
> +	/*
> +	 * Lets try to continue searching this would speed
> +	 * search significantly.
> +	 */
> +	if (pid_prev) {
> +		task = pid_task(pid_prev, PIDTYPE_PID);
> +		if (task && task->real_parent == start &&
> +		    !(list_empty(&task->sibling))) {
> +			/*
> +			 * OK, ltes try the fastpath, we might
> +			 * miss some freshly created children
> +			 * here, but it was never promised to be
> +			 * accurate.
> +			 *
> +			 * Also note if we have not enough rights
> +			 * to access the next children pid we simply
> +			 * fall into slow-search version.
> +			 */

Why we should try the slow-search path if ptrace_may_access() fails?

> +			if (!list_is_last(&task->sibling, &start->children)) {
> +				task = list_first_entry(&task->sibling,
> +							struct task_struct, sibling);
> +				if (ptrace_may_access(task, PTRACE_MODE_READ)) {
> +					pid = get_pid(task_pid(task));
> +					goto out;
> +				}
> +			} else
> +				goto out;
> +		}
> +	}

Well, this is cosmetic, but imho

			if (list_is_last(...))
				goto out;

			task = list_first_entry(...);
			...

looks better.


> +	list_for_each_entry(task, &start->children, sibling) {
> +		if (pos-- == 0) {
> +			if (ptrace_may_access(task, PTRACE_MODE_READ)) {
> +				pid = get_pid(task_pid(task));
> +				goto out;
> +			} else {
> +				/* Maybe we success with the next children */
> +				pos++;

Again. I simply can't understand what ptrace_may_access() actually
means. Why do we use the possible child, not parent?

IOW. I have no idea if we really need any security check at all.
You can find the children pids without this patch anyway via.
grep PPid /proc/*/status.

But if you want ptrace_may_access/whatever, you should check
ptrace_may_access(start), no?

Oleg.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v5
  2012-01-15 18:07 ` Oleg Nesterov
@ 2012-01-15 18:31   ` Cyrill Gorcunov
  2012-01-15 18:48     ` Oleg Nesterov
  0 siblings, 1 reply; 5+ messages in thread
From: Cyrill Gorcunov @ 2012-01-15 18:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On Sun, Jan 15, 2012 at 07:07:21PM +0100, Oleg Nesterov wrote:
> On 12/28, Cyrill Gorcunov wrote:
> >
> > When we do checkpoint of a task we need to know the list of children
> > the task has but there is no easy way to make a reverse parent->children
> > chain from an arbitrary <pid> (while a parent pid is provided in "PPid"
> > field of /proc/<pid>/status).
> 
> Looks correct at first glance... But I'll try to recheck. I guess you need
> to resend anyway, I bet nobody can recall this patch ;)
> 

Sure ;)

> However I do not understand the ptrace_may_access() check at all.
> 
...

> Well, this is cosmetic, but imho
> 
> 			if (list_is_last(...))
> 				goto out;
> 
> 			task = list_first_entry(...);
> 			...
> 
> looks better.
> 

ok

> 
> > +	list_for_each_entry(task, &start->children, sibling) {
> > +		if (pos-- == 0) {
> > +			if (ptrace_may_access(task, PTRACE_MODE_READ)) {
> > +				pid = get_pid(task_pid(task));
> > +				goto out;
> > +			} else {
> > +				/* Maybe we success with the next children */
> > +				pos++;
> 
> Again. I simply can't understand what ptrace_may_access() actually
> means. Why do we use the possible child, not parent?
> 
> IOW. I have no idea if we really need any security check at all.
> You can find the children pids without this patch anyway via.
> grep PPid /proc/*/status.
> 

OK, I see. I am actually not sure which behaviour should be there.
What should we do if say we have a task with a number of children,
which changed permissions of own and some of children. Look what I mean.

We have say tid A, which has children B C D, and when we read
/proc/pid/task/tid/children we should see "B C D" here. But
what if say A started with roots rights, then changed own permission
so everyone could read this /proc/pid/task/<A>/children, but
left C with root permissions only. So should we list C here?
Or such scenario is impossible at all?

> But if you want ptrace_may_access/whatever, you should check
> ptrace_may_access(start), no?
> 

	Cyrill

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v5
  2012-01-15 18:31   ` Cyrill Gorcunov
@ 2012-01-15 18:48     ` Oleg Nesterov
  2012-01-15 18:57       ` Cyrill Gorcunov
  0 siblings, 1 reply; 5+ messages in thread
From: Oleg Nesterov @ 2012-01-15 18:48 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On 01/15, Cyrill Gorcunov wrote:
>
> On Sun, Jan 15, 2012 at 07:07:21PM +0100, Oleg Nesterov wrote:
> >
> > Again. I simply can't understand what ptrace_may_access() actually
> > means. Why do we use the possible child, not parent?
> >
> > IOW. I have no idea if we really need any security check at all.
> > You can find the children pids without this patch anyway via.
> > grep PPid /proc/*/status.
> >
>
> OK, I see. I am actually not sure which behaviour should be there.
> What should we do if say we have a task with a number of children,
> which changed permissions of own and some of children. Look what I mean.
>
> We have say tid A, which has children B C D, and when we read
> /proc/pid/task/tid/children we should see "B C D" here. But
> what if say A started with roots rights, then changed own permission
> so everyone could read this /proc/pid/task/<A>/children, but
> left C with root permissions only. So should we list C here?

Why not? What is the problem to know the pid of this child? And once
again, you can list all children without this patch anyway.

Perhaps I missed something, but I simply don't understand the problem.

Oleg.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v5
  2012-01-15 18:48     ` Oleg Nesterov
@ 2012-01-15 18:57       ` Cyrill Gorcunov
  0 siblings, 0 replies; 5+ messages in thread
From: Cyrill Gorcunov @ 2012-01-15 18:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On Sun, Jan 15, 2012 at 07:48:06PM +0100, Oleg Nesterov wrote:
> 
> Why not? What is the problem to know the pid of this child? And once
> again, you can list all children without this patch anyway.
> 
> Perhaps I missed something, but I simply don't understand the problem.
> 

Dunno, that's why  I asked. But indeed I somehow forgot that I can
find all children anyway. I'll drop this snippets. Thanks!

	Cyrill

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2012-01-15 18:57 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-28 12:14 [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v5 Cyrill Gorcunov
2012-01-15 18:07 ` Oleg Nesterov
2012-01-15 18:31   ` Cyrill Gorcunov
2012-01-15 18:48     ` Oleg Nesterov
2012-01-15 18:57       ` Cyrill Gorcunov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox