* [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
@ 2011-12-06 18:10 Cyrill Gorcunov
2011-12-06 22:33 ` Andrew Morton
2011-12-07 18:53 ` Oleg Nesterov
0 siblings, 2 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-06 18:10 UTC (permalink / raw)
To: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo
Cc: Vasiliy Kulikov, Andrew Vagin, Oleg Nesterov, LKML
There is no easy way to make a reverse parent->children chain
from the task status, in turn children->parent provided with "PPid"
field.
So instead of walking over all pids in system to figure out what
children the task have -- we add explicit /proc/<pid>/children entry,
since kernel already knows this kind of information but it was not
yet exported.
v2:
- Kame suggested to use a separated /proc/<pid>/children entry
instead of poking /proc/<pid>/status
- Andew suggested to use rcu facility instead of locking
tasklist_lock
- Tejun pointed that non-seekable seq file might not be
enough for tasks with large number of children
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
---
Please review. I hope I didn't miss anything here.
fs/proc/array.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/proc/base.c | 1
fs/proc/internal.h | 1
3 files changed, 68 insertions(+)
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,69 @@ int proc_pid_statm(struct seq_file *m, s
return 0;
}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+ struct task_struct *task = container_of(v, struct task_struct, sibling);
+ return seq_printf(seq, " %d", pid_vnr(task_pid(task)));
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct task_struct *task;
+
+ rcu_read_lock();
+ task = seq->private;
+ if (task)
+ return seq_list_start(&task->children, *pos);
+
+ return NULL;
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct task_struct *task = seq->private;
+ if (task)
+ return seq_list_next(v, &task->children, pos);
+ return NULL;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+ rcu_read_unlock();
+ seq_printf(seq, "\n");
+}
+
+static const struct seq_operations children_seq_ops = {
+ .start = children_seq_start,
+ .next = children_seq_next,
+ .stop = children_seq_stop,
+ .show = children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open(file, &children_seq_ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = (void *)get_proc_task(inode);
+ }
+
+ return ret;
+}
+int children_seq_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ if (m->private)
+ put_task_struct(m->private);
+
+ seq_release(inode, file);
+ return 0;
+}
+
+const struct file_operations proc_pid_children_operations = {
+ .open = children_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3204,6 +3204,7 @@ static const struct pid_entry tgid_base_
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
+ REG("children", S_IRUGO, proc_pid_children_operations),
REG("maps", S_IRUGO, proc_maps_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
struct pid *pid, struct task_struct *task);
extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
+extern const struct file_operations proc_pid_children_operations;
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-06 18:10 [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2 Cyrill Gorcunov
@ 2011-12-06 22:33 ` Andrew Morton
2011-12-06 22:35 ` Cyrill Gorcunov
2011-12-07 18:53 ` Oleg Nesterov
1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2011-12-06 22:33 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo,
Vasiliy Kulikov, Andrew Vagin, Oleg Nesterov, LKML
On Tue, 6 Dec 2011 22:10:26 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> There is no easy way to make a reverse parent->children chain
> from the task status, in turn children->parent provided with "PPid"
> field.
>
> So instead of walking over all pids in system to figure out what
> children the task have -- we add explicit /proc/<pid>/children entry,
> since kernel already knows this kind of information but it was not
> yet exported.
>
>
> ...
>
> +static int children_seq_show(struct seq_file *seq, void *v)
> +{
> + struct task_struct *task = container_of(v, struct task_struct, sibling);
> + return seq_printf(seq, " %d", pid_vnr(task_pid(task)));
> +}
Printing a pid_t with %d is a bit risky. At present all architectures
appear to use int for pid_t so it's not broken (yet!). I suppose we
can leave this code as-is for now - if some 32-bit arch turns up with a
64-bit pid_t, we'll want that warning to come out.
> +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> + struct task_struct *task;
> +
> + rcu_read_lock();
> + task = seq->private;
> + if (task)
> + return seq_list_start(&task->children, *pos);
> +
> + return NULL;
> +}
> +
> +static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> + struct task_struct *task = seq->private;
> + if (task)
> + return seq_list_next(v, &task->children, pos);
> + return NULL;
> +}
> +
> +static void children_seq_stop(struct seq_file *seq, void *v)
> +{
> + rcu_read_unlock();
> + seq_printf(seq, "\n");
> +}
So as I understand it, they way this works is that when the seqfile
buffer fills up we will drop the rcu lock, will flush the buffer and
then we reacquire the lock and the seq_file core will iterate over the
children list until we end up the same number of items from the head,
and populating the output buffer will resume.
This means that if the children list changes during the flush, the data
which userspace reads will be wrong. Of course, inaccuracy is
unavoidable when returning kernel-internal data such as this to
userspace.
I suggest that the changelog should explain these races and should also
explain how you propose that the client software will be robust against
such races.
It's also worth adding a warning about this into the documentation for
this procfs file. Documentation which this patch forgot to provide,
btw!
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-06 22:33 ` Andrew Morton
@ 2011-12-06 22:35 ` Cyrill Gorcunov
2011-12-07 9:41 ` Cyrill Gorcunov
0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-06 22:35 UTC (permalink / raw)
To: Andrew Morton
Cc: Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo,
Vasiliy Kulikov, Andrew Vagin, Oleg Nesterov, LKML
On Tue, Dec 06, 2011 at 02:33:13PM -0800, Andrew Morton wrote:
...
>
> It's also worth adding a warning about this into the documentation for
> this procfs file. Documentation which this patch forgot to provide,
> btw!
>
>
Yes, thanks for comments Andrew, I already got the feeling that I've
missed documentation update from the start_brk /proc patch ;)
I'll address all issues, thanks!
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-06 22:35 ` Cyrill Gorcunov
@ 2011-12-07 9:41 ` Cyrill Gorcunov
2011-12-07 18:27 ` Tejun Heo
0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-07 9:41 UTC (permalink / raw)
To: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, Oleg Nesterov, LKML
On Wed, Dec 07, 2011 at 02:35:58AM +0400, Cyrill Gorcunov wrote:
> On Tue, Dec 06, 2011 at 02:33:13PM -0800, Andrew Morton wrote:
> ...
> >
> > It's also worth adding a warning about this into the documentation for
> > this procfs file. Documentation which this patch forgot to provide,
> > btw!
> >
> >
>
> Yes, thanks for comments Andrew, I already got the feeling that I've
> missed documentation update from the start_brk /proc patch ;)
>
> I'll address all issues, thanks!
>
I hope this one addresses all nits.
Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v3
There is no easy way to make a reverse parent->children chain
from a task status, in turn children->parent provided with "PPid"
field.
So instead of walking over all pids in system to figure out what
children a task have -- we add explicit /proc/<pid>/children entry,
since kernel already has this kind of information but it is not
yet exported.
Note that filling user-space buffer is done in iterative manner
where each iteration is protected via RCU read lock, which means
if between iterations a new task is added it might be missed in
output and to have a consistent image of which children are really
present in system the task should be either frozen or stopped
together with all children.
v2:
- Kame suggested to use a separated /proc/<pid>/children entry
instead of poking /proc/<pid>/status
- Andew suggested to use rcu facility instead of locking
tasklist_lock
- Tejun pointed that non-seekable seq file might not be
enough for tasks with large number of children
v3:
- To be on a safe side use %lu format for pid_t printing
- Update changelog and documentation
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
---
Documentation/filesystems/proc.txt | 15 ++++++++
fs/proc/array.c | 66 +++++++++++++++++++++++++++++++++++++
fs/proc/base.c | 1
fs/proc/internal.h | 1
4 files changed, 83 insertions(+)
Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
3.5 /proc/<pid>/mountinfo - Information about mounts
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
+ 3.7 /proc/<pid>/children - Information about task children
------------------------------------------------------------------------------
@@ -1545,3 +1546,17 @@ a task to set its own or one of its thre
is limited in size compared to the cmdline value, so writing anything longer
then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
comm value.
+
+3.7 /proc/<pid>/children - Information about task children
+--------------------------------------------------------------
+This file provides a fast way to retrieve children pids of the
+task pointed by <pid>. The format is a stream of pids separated
+by space with a new line at the end. If a task has no children at
+all -- only a new line returned.
+
+Note that filling user-space buffer is done in iterative manner where
+each iteration is protected via RCU read lock, which means if between
+iterations a new task is added it might be missed in output and
+to have a consistent image of which children are really present in
+system the task should be either frozen or stopped together with
+all children.
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,69 @@ int proc_pid_statm(struct seq_file *m, s
return 0;
}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+ struct task_struct *task = container_of(v, struct task_struct, sibling);
+ return seq_printf(seq, " %lu", (unsigned long)pid_vnr(task_pid(task)));
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct task_struct *task;
+
+ rcu_read_lock();
+ task = seq->private;
+ if (task)
+ return seq_list_start(&task->children, *pos);
+
+ return NULL;
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct task_struct *task = seq->private;
+ if (task)
+ return seq_list_next(v, &task->children, pos);
+ return NULL;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+ rcu_read_unlock();
+ seq_printf(seq, "\n");
+}
+
+static const struct seq_operations children_seq_ops = {
+ .start = children_seq_start,
+ .next = children_seq_next,
+ .stop = children_seq_stop,
+ .show = children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open(file, &children_seq_ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = (void *)get_proc_task(inode);
+ }
+
+ return ret;
+}
+int children_seq_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ if (m->private)
+ put_task_struct(m->private);
+
+ seq_release(inode, file);
+ return 0;
+}
+
+const struct file_operations proc_pid_children_operations = {
+ .open = children_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3204,6 +3204,7 @@ static const struct pid_entry tgid_base_
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
+ REG("children", S_IRUGO, proc_pid_children_operations),
REG("maps", S_IRUGO, proc_maps_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
struct pid *pid, struct task_struct *task);
extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
+extern const struct file_operations proc_pid_children_operations;
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-07 9:41 ` Cyrill Gorcunov
@ 2011-12-07 18:27 ` Tejun Heo
2011-12-07 18:43 ` Cyrill Gorcunov
0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2011-12-07 18:27 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Vasiliy Kulikov, Andrew Vagin, Oleg Nesterov, LKML
Hello,
On Wed, Dec 07, 2011 at 01:41:44PM +0400, Cyrill Gorcunov wrote:
> +static void children_seq_stop(struct seq_file *seq, void *v)
> +{
> + rcu_read_unlock();
> + seq_printf(seq, "\n");
> +}
Wouldn't this add newline on page boundary? Newline should be printed
at the end of the whole sequence but seq_stop() is called at buffer
boundaries.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-07 18:27 ` Tejun Heo
@ 2011-12-07 18:43 ` Cyrill Gorcunov
0 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-07 18:43 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Vasiliy Kulikov, Andrew Vagin, Oleg Nesterov, LKML
On Wed, Dec 07, 2011 at 10:27:26AM -0800, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 07, 2011 at 01:41:44PM +0400, Cyrill Gorcunov wrote:
> > +static void children_seq_stop(struct seq_file *seq, void *v)
> > +{
> > + rcu_read_unlock();
> > + seq_printf(seq, "\n");
> > +}
>
> Wouldn't this add newline on page boundary? Newline should be printed
> at the end of the whole sequence but seq_stop() is called at buffer
> boundaries.
>
Yeah, good point. I'll update! Thanks Tejun (somehow missed this
call in traverse()).
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-06 18:10 [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2 Cyrill Gorcunov
2011-12-06 22:33 ` Andrew Morton
@ 2011-12-07 18:53 ` Oleg Nesterov
2011-12-07 19:03 ` Cyrill Gorcunov
1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2011-12-07 18:53 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
Hi Cyrill,
Sorry, I didn't read this patch yet, but
On 12/06, Cyrill Gorcunov wrote:
>
> +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> + struct task_struct *task;
> +
> + rcu_read_lock();
> + task = seq->private;
> + if (task)
> + return seq_list_start(&task->children, *pos);
This looks "obviously wrong".
We can not trust ->children->next after rcu_read_unlock(). Another
rcu_read_lock() can't help.
Once again, I can be easily wrong, need to read the patch first.
Oleg.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-07 18:53 ` Oleg Nesterov
@ 2011-12-07 19:03 ` Cyrill Gorcunov
2011-12-07 19:19 ` Cyrill Gorcunov
2011-12-08 16:35 ` Oleg Nesterov
0 siblings, 2 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-07 19:03 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On Wed, Dec 07, 2011 at 07:53:43PM +0100, Oleg Nesterov wrote:
> Hi Cyrill,
>
> Sorry, I didn't read this patch yet, but
>
> On 12/06, Cyrill Gorcunov wrote:
> >
> > +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct task_struct *task;
> > +
> > + rcu_read_lock();
> > + task = seq->private;
> > + if (task)
> > + return seq_list_start(&task->children, *pos);
>
> This looks "obviously wrong".
>
> We can not trust ->children->next after rcu_read_unlock(). Another
> rcu_read_lock() can't help.
>
> Once again, I can be easily wrong, need to read the patch first.
>
Wait, Oleg, I might be wrong as well, but it's now as
children_seq_open
get_proc_task (so ref to task increased)
the children_seq_start/children_seq_stop works
in iteration and every new iteration seq_list_next
walks over the whole children list from the list
head under rcu lock, so even if task is removed
or added the link should exsist until rcu is unlocked
and sync'ed no?
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-07 19:03 ` Cyrill Gorcunov
@ 2011-12-07 19:19 ` Cyrill Gorcunov
2011-12-07 20:34 ` Cyrill Gorcunov
2011-12-08 16:35 ` Oleg Nesterov
1 sibling, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-07 19:19 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On Wed, Dec 07, 2011 at 11:03:40PM +0400, Cyrill Gorcunov wrote:
> On Wed, Dec 07, 2011 at 07:53:43PM +0100, Oleg Nesterov wrote:
> > Hi Cyrill,
> >
> > Sorry, I didn't read this patch yet, but
> >
> > On 12/06, Cyrill Gorcunov wrote:
> > >
> > > +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> > > +{
> > > + struct task_struct *task;
> > > +
> > > + rcu_read_lock();
> > > + task = seq->private;
> > > + if (task)
> > > + return seq_list_start(&task->children, *pos);
> >
> > This looks "obviously wrong".
> >
> > We can not trust ->children->next after rcu_read_unlock(). Another
> > rcu_read_lock() can't help.
> >
> > Once again, I can be easily wrong, need to read the patch first.
> >
>
> Wait, Oleg, I might be wrong as well, but it's now as
>
> children_seq_open
> get_proc_task (so ref to task increased)
>
> the children_seq_start/children_seq_stop works
> in iteration and every new iteration seq_list_next
> walks over the whole children list from the list
> head under rcu lock, so even if task is removed
> or added the link should exsist until rcu is unlocked
> and sync'ed no?
>
On the other hands some if (task) tests are redundant
and might be dropped since we have a reference to a
task until seq-file is not released. I'll update it
and shrink a patch some more.
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-07 19:19 ` Cyrill Gorcunov
@ 2011-12-07 20:34 ` Cyrill Gorcunov
2011-12-07 21:52 ` Cyrill Gorcunov
0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-07 20:34 UTC (permalink / raw)
To: Oleg Nesterov, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
KAMEZAWA Hiroyuki, Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On Wed, Dec 07, 2011 at 11:19:16PM +0400, Cyrill Gorcunov wrote:
...
> > >
> > > This looks "obviously wrong".
> > >
> > > We can not trust ->children->next after rcu_read_unlock(). Another
> > > rcu_read_lock() can't help.
> > >
> > > Once again, I can be easily wrong, need to read the patch first.
> > >
> >
> > Wait, Oleg, I might be wrong as well, but it's now as
> >
> > children_seq_open
> > get_proc_task (so ref to task increased)
> >
> > the children_seq_start/children_seq_stop works
> > in iteration and every new iteration seq_list_next
> > walks over the whole children list from the list
> > head under rcu lock, so even if task is removed
> > or added the link should exsist until rcu is unlocked
> > and sync'ed no?
> >
>
> On the other hands some if (task) tests are redundant
> and might be dropped since we have a reference to a
> task until seq-file is not released. I'll update it
> and shrink a patch some more.
>
So while adding new task to a list is not a problem (the reader
will simply not notice it) removing task from a list is a bit
tricky, but as far as I see switch_task_namespaces (from exit_notify)
uses synchronize_rcu as well as release_task calls for call_rcu
for put_task_struct) so I think we will not get any wrong dereference
in
static int children_seq_show(struct seq_file *seq, void *v)
{
struct task_struct *task = container_of(v, struct task_struct, sibling);
return seq_printf(seq, " %lu", (unsigned long)pid_vnr(task_pid(task)));
}
while we're in rcu reader section. I might be wrong of course, so
please verify this claim.
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-07 20:34 ` Cyrill Gorcunov
@ 2011-12-07 21:52 ` Cyrill Gorcunov
0 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-07 21:52 UTC (permalink / raw)
To: Oleg Nesterov, Andrew Morton, Tejun Heo
Cc: Vasiliy Kulikov, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Andrew Vagin, LKML
On Thu, Dec 08, 2011 at 12:34:50AM +0400, Cyrill Gorcunov wrote:
...
>
> So while adding new task to a list is not a problem (the reader
> will simply not notice it) removing task from a list is a bit
> tricky, but as far as I see switch_task_namespaces (from exit_notify)
> uses synchronize_rcu as well as release_task calls for call_rcu
> for put_task_struct) so I think we will not get any wrong dereference
> in
>
> static int children_seq_show(struct seq_file *seq, void *v)
> {
> struct task_struct *task = container_of(v, struct task_struct, sibling);
> return seq_printf(seq, " %lu", (unsigned long)pid_vnr(task_pid(task)));
> }
>
> while we're in rcu reader section. I might be wrong of course, so
> please verify this claim.
>
An updated version is below (I've updated changelog and docs as well).
Please review if you get a chance. Complains are welcome of course.
Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v4
There is no easy way to make a reverse parent->children chain
from arbitrary <pid> (while parent pid is provided in "PPid"
field of /proc/<pid>/status).
So instead of walking over all pids in the system to figure out which
children a task have -- we add explicit /proc/<pid>/children entry,
because kernel already has this kind of information but it is not
yet exported. This is a first level children, not the whole process
tree, neither the process threads are identified with this interface.
Note that filling user-space buffer is done in iterative manner
where each iteration is protected via RCU read lock, which means
if between iterations a new task is added it might be missed in
output and to have a consistent image of which children are really
present in system the task should be either frozen or stopped
together with all children.
v2:
- Kame suggested to use a separated /proc/<pid>/children entry
instead of poking /proc/<pid>/status
- Andew suggested to use rcu facility instead of locking
tasklist_lock
- Tejun pointed that non-seekable seq file might not be
enough for tasks with large number of children
v3:
- To be on a safe side use %lu format for pid_t printing
v4:
- New line get printed when sequence ends not at seq->stop,
nit pointed by Tejun
- documentation update
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
---
Documentation/filesystems/proc.txt | 19 +++++++++
fs/proc/array.c | 72 +++++++++++++++++++++++++++++++++++++
fs/proc/base.c | 1
fs/proc/internal.h | 1
4 files changed, 93 insertions(+)
Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
3.5 /proc/<pid>/mountinfo - Information about mounts
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
+ 3.7 /proc/<pid>/children - Information about task children
------------------------------------------------------------------------------
@@ -1545,3 +1546,21 @@ a task to set its own or one of its thre
is limited in size compared to the cmdline value, so writing anything longer
then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
comm value.
+
+3.7 /proc/<pid>/children - Information about task children
+--------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of the task pointed by <pid>. The format is a stream of pids separated
+by space with a new line at the end. If a task has no children at
+all -- only a new line returned. Note the "first level" here -- if
+a task child has own children they will not be printed there, one
+need to read /proc/<children-pid>/children to obtain such pids.
+The same applies to threads -- they are not counted via this interface,
+one need to read /proc/<pid>/status to find all threads the task created.
+
+Note that filling user-space buffer is done in iterative manner where
+each iteration is protected via RCU read lock, which means if between
+iterations a new task is added it might be missed in output and
+to have a consistent image of which children are really present in
+system the task should be either frozen or stopped together with
+all children.
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,75 @@ int proc_pid_statm(struct seq_file *m, s
return 0;
}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+ struct task_struct *task = container_of(v, struct task_struct, sibling);
+ return seq_printf(seq, " %lu", (unsigned long)pid_vnr(task_pid(task)));
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct task_struct *task = seq->private;
+
+ rcu_read_lock();
+ return seq_list_start(&task->children, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct task_struct *task = seq->private;
+ struct list_head *head;
+
+ head = seq_list_next(v, &task->children, pos);
+ if (!head)
+ seq_printf(seq, "\n");
+ return head;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+ rcu_read_unlock();
+}
+
+static const struct seq_operations children_seq_ops = {
+ .start = children_seq_start,
+ .next = children_seq_next,
+ .stop = children_seq_stop,
+ .show = children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+ struct task_struct *task;
+ int ret;
+
+ task = get_proc_task(inode);
+ if (!task)
+ return -ENOENT;
+
+ ret = seq_open(file, &children_seq_ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = task;
+ } else
+ put_task_struct(task);
+
+ return ret;
+}
+int children_seq_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ if (m->private)
+ put_task_struct(m->private);
+
+ seq_release(inode, file);
+ return 0;
+}
+
+const struct file_operations proc_pid_children_operations = {
+ .open = children_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3204,6 +3204,7 @@ static const struct pid_entry tgid_base_
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
+ REG("children", S_IRUGO, proc_pid_children_operations),
REG("maps", S_IRUGO, proc_maps_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
struct pid *pid, struct task_struct *task);
extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
+extern const struct file_operations proc_pid_children_operations;
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-07 19:03 ` Cyrill Gorcunov
2011-12-07 19:19 ` Cyrill Gorcunov
@ 2011-12-08 16:35 ` Oleg Nesterov
2011-12-08 16:50 ` Cyrill Gorcunov
2011-12-08 21:28 ` Cyrill Gorcunov
1 sibling, 2 replies; 20+ messages in thread
From: Oleg Nesterov @ 2011-12-08 16:35 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On 12/07, Cyrill Gorcunov wrote:
>
> On Wed, Dec 07, 2011 at 07:53:43PM +0100, Oleg Nesterov wrote:
> > Hi Cyrill,
> >
> > Sorry, I didn't read this patch yet, but
> >
> > On 12/06, Cyrill Gorcunov wrote:
> > >
> > > +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> > > +{
> > > + struct task_struct *task;
> > > +
> > > + rcu_read_lock();
> > > + task = seq->private;
> > > + if (task)
> > > + return seq_list_start(&task->children, *pos);
> >
> > This looks "obviously wrong".
> >
> > We can not trust ->children->next after rcu_read_unlock(). Another
> > rcu_read_lock() can't help.
> >
> > Once again, I can be easily wrong, need to read the patch first.
> >
>
> Wait, Oleg, I might be wrong as well, but it's now a
>
> children_seq_open
> get_proc_task (so ref to task increased)
Yes. task_struct itself can't go away.
> the children_seq_start/children_seq_stop work
> in iteration and every new iteration seq_list_next
> walks over the whole children list from the list
> head under rcu lock,
Yep, I misread this code, I though it does _next.
However, ->children list is not rcu-safe, this means that even
list_for_each() itself is not safe. Either you need tasklist or
we can probably make it rcu-safe...
As for /proc/pid/children, personally I think it is very useful.
But note that it obviously reports the children per-thread, while
in general this is the per-process thing. Not sure this really
makes sense, but perhaps /proc/pid/children and
/proc/pid/task/tid/children should act differently. Like, say,
proc_tid_stat/proc_tgid_stat. I won't insist.
Oleg.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-08 16:35 ` Oleg Nesterov
@ 2011-12-08 16:50 ` Cyrill Gorcunov
2011-12-08 21:28 ` Cyrill Gorcunov
1 sibling, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-08 16:50 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On Thu, Dec 08, 2011 at 05:35:35PM +0100, Oleg Nesterov wrote:
> >
> > Wait, Oleg, I might be wrong as well, but it's now a
> >
> > children_seq_open
> > get_proc_task (so ref to task increased)
>
> Yes. task_struct itself can't go away.
>
> > the children_seq_start/children_seq_stop work
> > in iteration and every new iteration seq_list_next
> > walks over the whole children list from the list
> > head under rcu lock,
>
> Yep, I misread this code, I though it does _next.
>
> However, ->children list is not rcu-safe, this means that even
> list_for_each() itself is not safe. Either you need tasklist or
> we can probably make it rcu-safe...
>
Seems I'll have to use tasklist_lock read-lock (atually it was
there in previous versions of patch but patch was not implementing
start/stop concept so I've been advised to use rcu read locks
instead).
> As for /proc/pid/children, personally I think it is very useful.
> But note that it obviously reports the children per-thread, while
> in general this is the per-process thing. Not sure this really
Yeah, Kosaki pointed me that I missed children from another threads.
> makes sense, but perhaps /proc/pid/children and
> /proc/pid/task/tid/children should act differently. Like, say,
> proc_tid_stat/proc_tgid_stat. I won't insist.
>
At moment I thought only about top level here, ie /proc/pid/children,
but I think once I finish we can extend the patch ;)
Thanks for comments!
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-08 16:35 ` Oleg Nesterov
2011-12-08 16:50 ` Cyrill Gorcunov
@ 2011-12-08 21:28 ` Cyrill Gorcunov
2011-12-08 21:54 ` Andrew Morton
1 sibling, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-08 21:28 UTC (permalink / raw)
To: Oleg Nesterov, Andrew Morton
Cc: Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo,
Vasiliy Kulikov, Andrew Vagin, LKML
On Thu, Dec 08, 2011 at 05:35:35PM +0100, Oleg Nesterov wrote:
...
>
> However, ->children list is not rcu-safe, this means that even
> list_for_each() itself is not safe. Either you need tasklist or
> we can probably make it rcu-safe...
>
Andrew, Oleg, does the below one look more less fine? Note the
tasklist_lock is back and it worries me a bit since I imagine
one could be endlessly reading some /proc/<pid>/children file
increasing contention over this lock on the whole system
(regardless the fact that it's take for read only).
Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v4
There is no easy way to make a reverse parent->children chain
from arbitrary <pid> (while parent pid is provided in "PPid"
field of /proc/<pid>/status).
So instead of walking over all pids in the system to figure out which
children a task have -- we add explicit /proc/<pid>/children entry,
because kernel already has this kind of information but it is not
yet exported. This is a first level children, not the whole process
tree, neither the process threads are identified with this interface.
v2:
- Kame suggested to use a separated /proc/<pid>/children entry
instead of poking /proc/<pid>/status
- Andew suggested to use rcu facility instead of locking
tasklist_lock
- Tejun pointed that non-seekable seq file might not be
enough for tasks with large number of children
v3:
- To be on a safe side use %lu format for pid_t printing
v4:
- New line get printed when sequence ends not at seq->stop,
a nit pointed by Tejun
- Documentation update
- tasklist_lock is back, Oleg pointed that ->children list
is actually not rcu-safe
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
---
Documentation/filesystems/proc.txt | 11 +++
fs/proc/array.c | 127 +++++++++++++++++++++++++++++++++++++
fs/proc/base.c | 1
fs/proc/internal.h | 6 +
4 files changed, 145 insertions(+)
Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
3.5 /proc/<pid>/mountinfo - Information about mounts
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
+ 3.7 /proc/<pid>/children - Information about task children
------------------------------------------------------------------------------
@@ -1545,3 +1546,13 @@ a task to set its own or one of its thre
is limited in size compared to the cmdline value, so writing anything longer
then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
comm value.
+
+3.7 /proc/<pid>/children - Information about task children
+--------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>. The format is a stream of pids separated
+by space with a new line at the end. If a task has no children at
+all -- only a new line returned. Note the "first level" here -- if
+a task child has own children they will not be printed there, one
+need to read /proc/<children-pid>/children to obtain such pids.
+The same applies to threads -- they are not counted here.
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,130 @@ int proc_pid_statm(struct seq_file *m, s
return 0;
}
+
+static struct list_head *
+children_get_at(struct proc_pid_children_iter *iter, loff_t pos)
+{
+ struct task_struct *t = iter->leader;
+ do {
+ struct list_head *p;
+ list_for_each(p, &t->children) {
+ if (pos-- == 0) {
+ iter->last = t;
+ return p;
+ }
+ }
+ } while_each_thread(iter->leader, t);
+
+ /* To make sure it's never in unknown state */
+ iter->last = NULL;
+
+ return NULL;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+ struct task_struct *task = container_of(v, struct task_struct, sibling);
+ return seq_printf(seq, " %lu", (unsigned long)pid_vnr(task_pid(task)));
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct proc_pid_children_iter *iter = seq->private;
+
+ read_lock(&tasklist_lock);
+ return children_get_at(iter, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct proc_pid_children_iter *iter = seq->private;
+ struct list_head *next = NULL;
+
+ if (iter->last) {
+ if (list_is_last(v, &iter->last->children)) {
+ while_each_thread(iter->leader, iter->last) {
+ if (!list_empty(&iter->last->children)) {
+ next = iter->last->children.next;
+ goto found;
+ }
+ }
+ iter->last = NULL;
+ } else
+ next = ((struct list_head *)v)->next;
+ }
+
+found:
+ ++*pos;
+ if (!next)
+ seq_printf(seq, "\n");
+ return next;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+ struct proc_pid_children_iter *iter = seq->private;
+
+ iter->last = NULL;
+ read_unlock(&tasklist_lock);
+}
+
+static const struct seq_operations children_seq_ops = {
+ .start = children_seq_start,
+ .next = children_seq_next,
+ .stop = children_seq_stop,
+ .show = children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+ struct proc_pid_children_iter *iter = NULL;
+ struct task_struct *task = NULL;
+ int ret;
+
+ ret = -ENOMEM;
+ iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ goto err;
+
+ ret = -ENOENT;
+ task = get_proc_task(inode);
+ if (!task)
+ goto err;
+
+ ret = seq_open(file, &children_seq_ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = iter;
+ iter->leader = task;
+ iter->last = task;
+ }
+
+err:
+ if (ret) {
+ if (task)
+ put_task_struct(task);
+ kfree(iter);
+ }
+
+ return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ struct proc_pid_children_iter *iter = m->private;
+
+ put_task_struct(iter->leader);
+ kfree(iter);
+ seq_release(inode, file);
+
+ return 0;
+}
+
+const struct file_operations proc_pid_children_operations = {
+ .open = children_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3204,6 +3204,7 @@ static const struct pid_entry tgid_base_
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
+ REG("children", S_IRUGO, proc_pid_children_operations),
REG("maps", S_IRUGO, proc_maps_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,12 @@ extern int proc_pid_statm(struct seq_fil
struct pid *pid, struct task_struct *task);
extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
+struct proc_pid_children_iter {
+ struct task_struct *leader;
+ struct task_struct *last;
+};
+
+extern const struct file_operations proc_pid_children_operations;
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-08 21:28 ` Cyrill Gorcunov
@ 2011-12-08 21:54 ` Andrew Morton
2011-12-08 22:21 ` Cyrill Gorcunov
2011-12-09 15:30 ` Oleg Nesterov
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2011-12-08 21:54 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: Oleg Nesterov, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On Fri, 9 Dec 2011 01:28:53 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Thu, Dec 08, 2011 at 05:35:35PM +0100, Oleg Nesterov wrote:
> ...
> >
> > However, ->children list is not rcu-safe, this means that even
> > list_for_each() itself is not safe. Either you need tasklist or
> > we can probably make it rcu-safe...
> >
>
> Andrew, Oleg, does the below one look more less fine? Note the
> tasklist_lock is back and it worries me a bit since I imagine
> one could be endlessly reading some /proc/<pid>/children file
> increasing contention over this lock on the whole system
> (regardless the fact that it's take for read only).
It is a potential problem, from the lock-hold point of view and
also it can cause large scheduling latencies. What's involved in
making ->children an rcu-protected list?
> ---
> From: Cyrill Gorcunov <gorcunov@openvz.org>
> Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v4
>
> There is no easy way to make a reverse parent->children chain
> from arbitrary <pid> (while parent pid is provided in "PPid"
> field of /proc/<pid>/status).
>
> So instead of walking over all pids in the system to figure out which
> children a task have -- we add explicit /proc/<pid>/children entry,
> because kernel already has this kind of information but it is not
> yet exported. This is a first level children, not the whole process
> tree, neither the process threads are identified with this interface.
The changelog doesn't explain why we want the patch, so there's no
reason to merge it! Something to do with c/r, yes?
If so, I guess the feature could/should be configurable. Probably with
a CONFIG_PROC_CHILDREN which is selected by CONFIG_CHECKPOINT_RESTORE.
Which is all getting a bit over the top, but I suppose we must do it.
Also, neither the changelog not the documentation mention the
loss-of-data problem which might occur if/when the lock is dropped.
The code now appears to be kinda-duplicating functionality which the
seq_file library provides. Shouldn't this have been
changelogged/commented? If it was, I wouldn't need to ask the next
question.
Why is it kinda-duplicating seq_file functionality? Can we strengthen
the seq_file code so this is unnecessary?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-08 21:54 ` Andrew Morton
@ 2011-12-08 22:21 ` Cyrill Gorcunov
2011-12-09 15:30 ` Oleg Nesterov
1 sibling, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-08 22:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On Thu, Dec 08, 2011 at 01:54:30PM -0800, Andrew Morton wrote:
> On Fri, 9 Dec 2011 01:28:53 +0400
> Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>
> > On Thu, Dec 08, 2011 at 05:35:35PM +0100, Oleg Nesterov wrote:
> > ...
> > >
> > > However, ->children list is not rcu-safe, this means that even
> > > list_for_each() itself is not safe. Either you need tasklist or
> > > we can probably make it rcu-safe...
> > >
> >
> > Andrew, Oleg, does the below one look more less fine? Note the
> > tasklist_lock is back and it worries me a bit since I imagine
> > one could be endlessly reading some /proc/<pid>/children file
> > increasing contention over this lock on the whole system
> > (regardless the fact that it's take for read only).
>
> It is a potential problem, from the lock-hold point of view and
> also it can cause large scheduling latencies. What's involved in
> making ->children an rcu-protected list?
>
I think better to poke Oleg about it ;) Oleg?
> > ---
> > From: Cyrill Gorcunov <gorcunov@openvz.org>
> > Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v4
> >
> > There is no easy way to make a reverse parent->children chain
> > from arbitrary <pid> (while parent pid is provided in "PPid"
> > field of /proc/<pid>/status).
> >
> > So instead of walking over all pids in the system to figure out which
> > children a task have -- we add explicit /proc/<pid>/children entry,
> > because kernel already has this kind of information but it is not
> > yet exported. This is a first level children, not the whole process
> > tree, neither the process threads are identified with this interface.
>
> The changelog doesn't explain why we want the patch, so there's no
> reason to merge it! Something to do with c/r, yes?
I though that "there is no easy way to make a reverse parent->children
chain" explains the problem -- ie when we do checkpoint of some task
we need to dump all its children and there is no easy and fast way to
find such topology. Parent of child can always be found from /proc/pid/status
but not the reverse.
>
> If so, I guess the feature could/should be configurable. Probably with
> a CONFIG_PROC_CHILDREN which is selected by CONFIG_CHECKPOINT_RESTORE.
> Which is all getting a bit over the top, but I suppose we must do it.
>
Yes, I can put it under CONFIG_CHECKPOINT_RESTORE, I thought to introduce
this config variable in prctl patch. But I think all might end up in mess,
so I guess better to make some patch series instead
- Introduction of CONFIG_CHECKPOINT_RESTORE variable
- /proc/pid/child patch
- prctl patch
- finally make map_files/ patch (which is in -mm already) being
CONFIG_CHECKPOINT_RESTORE dependant as well
Sounds good in general?
Btw, any hint which Kconfig is a better place for CONFIG_CHECKPOINT_RESTORE,
definitely not Kconfig.cpu, nor the mm/Kconfig. Not sure which one should
be taken. I guess it should be somewhere on top level and at moment x86
dependant as well.
> Also, neither the changelog not the documentation mention the
> loss-of-data problem which might occur if/when the lock is dropped.
>
Indeed. The number set is only valid while read, nothing else is guaranteed
after that (I dropped those part from doc which mentioned that task should
be ither frozen or stopped to get more-less consistent results). I'll
bring it back, sorry!
> The code now appears to be kinda-duplicating functionality which the
> seq_file library provides. Shouldn't this have been
> changelogged/commented? If it was, I wouldn't need to ask the next
> question.
>
> Why is it kinda-duplicating seq_file functionality? Can we strengthen
> the seq_file code so this is unnecessary?
>
If no threads were involved then seq_list_start/next would be enough but
I need next_thread() helper here which doesn't allow me to use the former
seq engine helpers. Andrew, I fear changing seq code bring even more
complexity. At moment all helpers we use are grouped in a couple of code
lines, i think it's easier to read.
Actually I had some crazy idea to be able to pass a "hint" to seq_file.c
about how many memory we need to keep all pids printed, so seq_file would
kmalloc/vmalloc memory for us at once and we simply fill it all in one
pass but finally I got impression that it will bring more complexity
and except us noone in kernel would need this functionality.
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-08 21:54 ` Andrew Morton
2011-12-08 22:21 ` Cyrill Gorcunov
@ 2011-12-09 15:30 ` Oleg Nesterov
2011-12-09 15:49 ` Cyrill Gorcunov
1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2011-12-09 15:30 UTC (permalink / raw)
To: Andrew Morton
Cc: Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On 12/08, Andrew Morton wrote:
>
> On Fri, 9 Dec 2011 01:28:53 +0400
> Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>
> > On Thu, Dec 08, 2011 at 05:35:35PM +0100, Oleg Nesterov wrote:
> > ...
> > >
> > > However, ->children list is not rcu-safe, this means that even
> > > list_for_each() itself is not safe. Either you need tasklist or
> > > we can probably make it rcu-safe...
> > >
> >
> > Andrew, Oleg, does the below one look more less fine? Note the
> > tasklist_lock is back and it worries me a bit since I imagine
> > one could be endlessly reading some /proc/<pid>/children file
> > increasing contention over this lock on the whole system
> > (regardless the fact that it's take for read only).
>
> It is a potential problem, from the lock-hold point of view and
> also it can cause large scheduling latencies. What's involved in
> making ->children an rcu-protected list?
At first glance, this doesn't look trivial... forget_original_parent()
abuses ->sibling.
But yes, it is not really nice to hold tasklist_lock here. May be
we can change this code so that every iteration records the reported
task_struct and then tries to continue. This means we should verify
that ->real_parent is still the same under tasklist, but at least
this way we do not hold it throughout.
> > From: Cyrill Gorcunov <gorcunov@openvz.org>
> > Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v4
> >
> > There is no easy way to make a reverse parent->children chain
> > from arbitrary <pid> (while parent pid is provided in "PPid"
> > field of /proc/<pid>/status).
> >
> > So instead of walking over all pids in the system to figure out which
> > children a task have -- we add explicit /proc/<pid>/children entry,
> > because kernel already has this kind of information but it is not
> > yet exported. This is a first level children, not the whole process
> > tree, neither the process threads are identified with this interface.
>
> The changelog doesn't explain why we want the patch, so there's no
> reason to merge it! Something to do with c/r, yes?
>
> If so, I guess the feature could/should be configurable. Probably with
> a CONFIG_PROC_CHILDREN which is selected by CONFIG_CHECKPOINT_RESTORE.
> Which is all getting a bit over the top, but I suppose we must do it.
Heh. This is the rare case when I personally like the new feature ;)
I mean, it looks "obviously useful" to me. If nothing else, it can
help to debug the problems. Probably the tools like pstree can use it.
Personally I'd even prefer /proc/pid/children/ directory (like
/proc/pid/task), but I guess this needs much more complications.
Oleg.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-09 15:30 ` Oleg Nesterov
@ 2011-12-09 15:49 ` Cyrill Gorcunov
2011-12-09 16:55 ` Oleg Nesterov
0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-09 15:49 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On Fri, Dec 09, 2011 at 04:30:09PM +0100, Oleg Nesterov wrote:
...
> >
> > It is a potential problem, from the lock-hold point of view and
> > also it can cause large scheduling latencies. What's involved in
> > making ->children an rcu-protected list?
>
> At first glance, this doesn't look trivial... forget_original_parent()
> abuses ->sibling.
>
But wait, forget_original_parent may move task out of the original
list and put it in dead_children list, right? Then release_task
may call delayed_put_task_struct under as call_rcu which should
wait until we finish reading in our rcu section, isn't it?
Even if we get inconsistent picture of children (ie we get pid
of shildren which just getting dead) I think it's fine.
> But yes, it is not really nice to hold tasklist_lock here. May be
> we can change this code so that every iteration records the reported
> task_struct and then tries to continue. This means we should verify
> that ->real_parent is still the same under tasklist, but at least
> this way we do not hold it throughout.
>
And if real_parent is changed, I should simply skip such task and
continue, right?
> > If so, I guess the feature could/should be configurable. Probably with
> > a CONFIG_PROC_CHILDREN which is selected by CONFIG_CHECKPOINT_RESTORE.
> > Which is all getting a bit over the top, but I suppose we must do it.
>
> Heh. This is the rare case when I personally like the new feature ;)
> I mean, it looks "obviously useful" to me. If nothing else, it can
> help to debug the problems. Probably the tools like pstree can use it.
>
Well, I'm personally fine with either approach. We could make it
CONFIG_CHECKPOINT_RESTORE dependant and make CONFIG_CHECKPOINT_RESTORE
being Y on default, then if someone not needed this facility he simply
turn it off, no?
> Personally I'd even prefer /proc/pid/children/ directory (like
> /proc/pid/task), but I guess this needs much more complications.
>
I fear so. Oleg, if it possible I would like to bring in as minimum
code as I can ;)
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-09 15:49 ` Cyrill Gorcunov
@ 2011-12-09 16:55 ` Oleg Nesterov
2011-12-09 17:11 ` Cyrill Gorcunov
0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2011-12-09 16:55 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On 12/09, Cyrill Gorcunov wrote:
>
> On Fri, Dec 09, 2011 at 04:30:09PM +0100, Oleg Nesterov wrote:
> ...
> > >
> > > It is a potential problem, from the lock-hold point of view and
> > > also it can cause large scheduling latencies. What's involved in
> > > making ->children an rcu-protected list?
> >
> > At first glance, this doesn't look trivial... forget_original_parent()
> > abuses ->sibling.
> >
>
> But wait, forget_original_parent may move task out of the original
> list and put it in dead_children list, right? Then release_task
> may call delayed_put_task_struct under as call_rcu which should
> wait until we finish reading in our rcu section, isn't it?
>
> Even if we get inconsistent picture of children (ie we get pid
> of shildren which just getting dead) I think it's fine.
The problen is (one of the problems, in fact), list_move() changes
->next.
To simplify, let's talk about children_seq_start() your patch adds.
Suppose that we make ->children/sibling "rcu-safe" (we replace
__unhash_process()->list_del_init() with list_del_rcu, and so on).
Now, children_seq_start() does:
rcu_read_lock();
list_for_each(child, task->children)
...;
Suppose that this task exits and reparents the child we are looking at.
Once reparent_leader() moves it into another list, list_for_each()
can never stop.
In short. forget_original_parent() changes the _head_ of the list,
in some sense.
> > But yes, it is not really nice to hold tasklist_lock here. May be
> > we can change this code so that every iteration records the reported
> > task_struct and then tries to continue. This means we should verify
> > that ->real_parent is still the same under tasklist, but at least
> > this way we do not hold it throughout.
> >
>
> And if real_parent is changed,
... or if list_empty(->sibling) == T
> I should simply skip such task and
> continue, right?
I think you should restart in this (hopefully unlikely) case. Yes, this
means the potentional data loss, but I guess we can't avoid this in any
case. For example, with the current patch children_seq_start() can miss
the _live_ thread if the already reported process was reaped.
> > Personally I'd even prefer /proc/pid/children/ directory (like
> > /proc/pid/task), but I guess this needs much more complications.
> >
>
> I fear so. Oleg, if it possible I would like to bring in as minimum
> code as I can ;)
Yes, yes, I agree.
Oleg.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2
2011-12-09 16:55 ` Oleg Nesterov
@ 2011-12-09 17:11 ` Cyrill Gorcunov
0 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2011-12-09 17:11 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Pavel Emelyanov, Serge Hallyn, KAMEZAWA Hiroyuki,
Tejun Heo, Vasiliy Kulikov, Andrew Vagin, LKML
On Fri, Dec 09, 2011 at 05:55:36PM +0100, Oleg Nesterov wrote:
...
>
> To simplify, let's talk about children_seq_start() your patch adds.
> Suppose that we make ->children/sibling "rcu-safe" (we replace
> __unhash_process()->list_del_init() with list_del_rcu, and so on).
>
> Now, children_seq_start() does:
>
> rcu_read_lock();
>
> list_for_each(child, task->children)
> ...;
>
> Suppose that this task exits and reparents the child we are looking at.
> Once reparent_leader() moves it into another list, list_for_each()
> can never stop.
>
> In short. forget_original_parent() changes the _head_ of the list,
> in some sense.
>
Yeah, I missed this, my bad! Thanks for explanation, Oleg. I'll
post updated version once I finish it.
Cyrill
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2011-12-09 17:11 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-06 18:10 [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v2 Cyrill Gorcunov
2011-12-06 22:33 ` Andrew Morton
2011-12-06 22:35 ` Cyrill Gorcunov
2011-12-07 9:41 ` Cyrill Gorcunov
2011-12-07 18:27 ` Tejun Heo
2011-12-07 18:43 ` Cyrill Gorcunov
2011-12-07 18:53 ` Oleg Nesterov
2011-12-07 19:03 ` Cyrill Gorcunov
2011-12-07 19:19 ` Cyrill Gorcunov
2011-12-07 20:34 ` Cyrill Gorcunov
2011-12-07 21:52 ` Cyrill Gorcunov
2011-12-08 16:35 ` Oleg Nesterov
2011-12-08 16:50 ` Cyrill Gorcunov
2011-12-08 21:28 ` Cyrill Gorcunov
2011-12-08 21:54 ` Andrew Morton
2011-12-08 22:21 ` Cyrill Gorcunov
2011-12-09 15:30 ` Oleg Nesterov
2011-12-09 15:49 ` Cyrill Gorcunov
2011-12-09 16:55 ` Oleg Nesterov
2011-12-09 17:11 ` Cyrill Gorcunov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox