From: Cyrill Gorcunov <gorcunov@gmail.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
LKML <linux-kernel@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Pavel Emelyanov <xemul@parallels.com>,
Serge Hallyn <serge.hallyn@canonical.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Tejun Heo <tj@kernel.org>, Andrew Vagin <avagin@openvz.org>,
Vasiliy Kulikov <segoon@openwall.com>
Subject: Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
Date: Wed, 18 Jan 2012 18:21:56 +0400 [thread overview]
Message-ID: <20120118142156.GR1968@moon> (raw)
In-Reply-To: <20120118135809.GA10105@redhat.com>
On Wed, Jan 18, 2012 at 02:58:09PM +0100, Oleg Nesterov wrote:
> On 01/17, KOSAKI Motohiro wrote:
> >
> > (1/16/12 10:32 AM), Cyrill Gorcunov wrote:
> >> When we do checkpoint of a task we need to know the list of children
> >> the task, has but there is no easy and fast way to generate reverse
> >> parent->children chain from arbitrary<pid> (while a parent pid is
> >> provided in "PPid" field of /proc/<pid>/status).
> >>
> >> So instead of walking over all pids in the system (creating one big process
> >> tree in memory, just to figure out which children a task has) -- we add
> >> explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
> >> this kind of information but it is not yet exported.
> >
> > I doubt this is good idea. It move some complexity to userland, but not reduce.
> > Again, if we add this interface, it should help pstree like process traversal
> > tools. Bare task hierarchy shouldn't be exposed userland. I believe users need
> > sub process, not sub threads.
>
> IOW, you mean that the reading from 'children' should list all children
> of the whole thread group. Yes, we discussed this before a bit.
>
> In this case this file should live in /proc/pid/, not in /proc/pid/task/tid.
> But this doesn't allow to restore the hierarchy correctly, you need to know
> the parent thread. So I think /proc/pid/tasks/tid/children is still needed.
> And I think it does help pstree-like apps, just they need to look into task/.
>
> As for /proc/pid/children, may be we can add it later. But this is not that
> simple. The problem is the threaded reparenting, we do not want to list the
> same child twice.
>
Yes, /proc/pid/children was initial target but then I fall back to hardness
and though implementing more simplier solution is better approach.
So Oleg, I think you meant something like below? Comment is moved down an
list_empty over siblings remans, right?
Btw, should it be under CONFIG_CHECKPOINT_RESTORE or we could live with it
in general?
Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
When we do checkpoint of a task we need to know the list of children
the task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).
So instead of walking over all pids in the system (creating one big process
tree in memory, just to figure out which children a task has) -- we add
explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
this kind of information but it is not yet exported.
This is a first level children, not the whole process tree.
v2:
- Kame suggested to use a separated /proc/<pid>/children entry
instead of poking /proc/<pid>/status
- Andew suggested to use rcu facility instead of locking
tasklist_lock
- Tejun pointed that non-seekable seq file might not be
enough for tasks with large number of children
v3:
- To be on a safe side use %lu format for pid_t printing
v4:
- New line get printed when sequence ends not at seq->stop,
a nit pointed by Tejun
- Documentation update
- tasklist_lock is back, Oleg pointed that ->children list
is actually not rcu-safe
v5:
- Oleg suggested to make /proc/<pid>/task/<tid>/children
instead of global /proc/<pid>/children, which eliminates
hardness related to threads and children migration, and
allows patch to be a way simplier.
v6:
- Drop ptrace_may_access tests, pids are can be found anyway
so nothing to protect here.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
Documentation/filesystems/proc.txt | 18 ++++
fs/proc/array.c | 147 +++++++++++++++++++++++++++++++++++++
fs/proc/base.c | 1
fs/proc/internal.h | 1
4 files changed, 167 insertions(+)
Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
3.5 /proc/<pid>/mountinfo - Information about mounts
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
+ 3.7 /proc/<pid>/task/<tid>/children - Information about task children
4 Configuring procfs
4.1 Mount options
@@ -1549,6 +1550,23 @@ then the kernel's TASK_COMM_LEN (current
comm value.
+3.7 /proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results, which means if there a new child
+created while we read the "children" file -- it might or might not be
+counted. So one need to either stop or freeze processes if precise
+results are needed.
+
+
------------------------------------------------------------------------------
Configuring procfs
------------------------------------------------------------------------------
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,150 @@ int proc_pid_statm(struct seq_file *m, s
return 0;
}
+
+struct proc_pid_children_iter {
+ struct pid_namespace *pid_ns;
+ struct pid *pid_start;
+};
+
+static struct pid *
+get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
+{
+ struct task_struct *start, *task;
+ struct pid *pid = NULL;
+
+ read_lock(&tasklist_lock);
+
+ start = pid_task(iter->pid_start, PIDTYPE_PID);
+ if (!start)
+ goto out;
+
+ /*
+ * Lets try to continue searching first, this gives
+ * us significant speedup on children-rich processes.
+ */
+ if (pid_prev) {
+ task = pid_task(pid_prev, PIDTYPE_PID);
+ if (task && task->real_parent == start &&
+ !(list_empty(&task->sibling))) {
+ if (list_is_last(&task->sibling, &start->children))
+ goto out;
+ task = list_first_entry(&task->sibling,
+ struct task_struct, sibling);
+ pid = get_pid(task_pid(task));
+ goto out;
+ }
+ }
+
+ /*
+ * Slow search case
+ *
+ * We might miss some freshly created children
+ * here, but it was never promised to be
+ * accurate.
+ */
+ list_for_each_entry(task, &start->children, sibling) {
+ if (pos-- == 0) {
+ pid = get_pid(task_pid(task));
+ break;
+ }
+ }
+
+out:
+ read_unlock(&tasklist_lock);
+ return pid;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+ struct proc_pid_children_iter *iter = seq->private;
+ unsigned long pid = (unsigned long)pid_nr_ns(v, iter->pid_ns);
+
+ return seq_printf(seq, " %lu", pid);
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ return get_children_pid(seq->private, NULL, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct proc_pid_children_iter *iter = seq->private;
+ struct pid *pid = NULL;
+
+ pid = get_children_pid(iter, v, *pos + 1);
+ if (!pid)
+ seq_printf(seq, "\n");
+ put_pid(v);
+
+ ++*pos;
+ return pid;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+ put_pid(v);
+}
+
+static const struct seq_operations children_seq_ops = {
+ .start = children_seq_start,
+ .next = children_seq_next,
+ .stop = children_seq_stop,
+ .show = children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+ struct proc_pid_children_iter *iter = NULL;
+ struct task_struct *task = NULL;
+ int ret = 0;
+
+ task = get_proc_task(inode);
+ if (!task) {
+ ret = -ENOENT;
+ goto err;
+ }
+
+ iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ ret = seq_open(file, &children_seq_ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = iter;
+
+ iter->pid_start = get_pid(task_pid(task));
+ iter->pid_ns = inode->i_sb->s_fs_info;
+ }
+
+err:
+ if (task)
+ put_task_struct(task);
+ if (ret)
+ kfree(iter);
+
+ return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ struct proc_pid_children_iter *iter = m->private;
+
+ put_pid(iter->pid_start);
+ kfree(iter);
+
+ seq_release(inode, file);
+ return 0;
+}
+
+const struct file_operations proc_tid_children_operations = {
+ .open = children_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3454,6 +3454,7 @@ static const struct pid_entry tid_base_s
ONE("stat", S_IRUGO, proc_tid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
REG("maps", S_IRUGO, proc_maps_operations),
+ REG("children", S_IRUGO, proc_tid_children_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
#endif
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
struct pid *pid, struct task_struct *task);
extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
+extern const struct file_operations proc_tid_children_operations;
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
next prev parent reply other threads:[~2012-01-18 14:22 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-01-16 15:32 [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6 Cyrill Gorcunov
2012-01-16 16:11 ` Oleg Nesterov
2012-01-16 16:20 ` Cyrill Gorcunov
2012-01-17 17:40 ` Oleg Nesterov
2012-01-17 17:57 ` Cyrill Gorcunov
2012-01-17 18:14 ` Oleg Nesterov
2012-01-17 18:30 ` Cyrill Gorcunov
2012-01-17 21:38 ` KOSAKI Motohiro
2012-01-18 9:43 ` Cyrill Gorcunov
2012-01-18 13:58 ` Oleg Nesterov
2012-01-18 14:21 ` Cyrill Gorcunov [this message]
2012-01-18 14:36 ` Oleg Nesterov
2012-01-18 18:23 ` Cyrill Gorcunov
2012-01-18 19:07 ` Cyrill Gorcunov
2012-01-19 14:10 ` Oleg Nesterov
2012-01-19 14:47 ` Cyrill Gorcunov
2012-01-19 15:55 ` Oleg Nesterov
2012-01-19 16:27 ` Cyrill Gorcunov
2012-01-19 16:44 ` Oleg Nesterov
2012-01-19 17:07 ` Cyrill Gorcunov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120118142156.GR1968@moon \
--to=gorcunov@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=avagin@openvz.org \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=kosaki.motohiro@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=oleg@redhat.com \
--cc=segoon@openwall.com \
--cc=serge.hallyn@canonical.com \
--cc=tj@kernel.org \
--cc=xemul@parallels.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.