* [RFC][PATCH] ps command race fix take 4 [4/4] proc root open/release/llseek
@ 2006-08-25 9:29 KAMEZAWA Hiroyuki
2006-09-04 23:13 ` [PATCH] proc: readdir race fix Eric W. Biederman
0 siblings, 1 reply; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-08-25 9:29 UTC (permalink / raw)
To: LKML; +Cc: ebiederm, kamezawa.hiroyu, Andrew Morton, saito.tadashi, ak
Implement the open/release/llseek ops needed by the new proc_pid_readdir().
llseek()'s offset is specified in bytes, but /proc root doesn't handle
file->f_pos as bytes, so llseek is disabled for /proc for now.
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
fs/proc/root.c | 34 ++++++++++++++++++++++++++++++++++
1 files changed, 34 insertions(+)
Index: linux-2.6.18-rc4/fs/proc/root.c
===================================================================
--- linux-2.6.18-rc4.orig/fs/proc/root.c
+++ linux-2.6.18-rc4/fs/proc/root.c
@@ -16,6 +16,7 @@
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/smp_lock.h>
+#include <linux/adaptive_pointer.h>
#include "internal.h"
@@ -118,14 +119,47 @@ static int proc_root_readdir(struct file
return ret;
}
+static int proc_root_open(struct inode *inode, struct file *filp)
+{
+	struct adaptive_pointer *ap;
+
+	ap = kzalloc(sizeof(*ap), GFP_KERNEL);
+	if (!ap)
+		return -ENOMEM;
+	init_adaptive_pointer(ap);
+	filp->private_data = ap;
+	return 0;
+}
+
+static int proc_root_release(struct inode *inode, struct file *filp)
+{
+ struct adaptive_pointer *ap;
+ ap = filp->private_data;
+ rcu_read_lock();
+ __ap_get_remove_pointer(ap);
+ rcu_read_unlock();
+ kfree(ap);
+ filp->private_data = NULL;
+ return 0;
+}
+
+static loff_t proc_root_llseek(struct file *file, loff_t off, int origin)
+{
+	/* off is in bytes, but we don't use file->f_pos as bytes... */
+	return -ENOTSUPP;
+}
+
/*
* The root /proc directory is special, as it has the
* <pid> directories. Thus we don't use the generic
* directory handling functions for that..
*/
static struct file_operations proc_root_operations = {
+ .open = proc_root_open,
.read = generic_read_dir,
.readdir = proc_root_readdir,
+ .release = proc_root_release,
+ .llseek = proc_root_llseek,
};
/*
^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] proc: readdir race fix.
  2006-08-25  9:29 [RFC][PATCH] ps command race fix take 4 [4/4] proc root open/release/llseek KAMEZAWA Hiroyuki
@ 2006-09-04 23:13 ` Eric W. Biederman
  2006-09-05  1:30   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Eric W. Biederman @ 2006-09-04 23:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, saito.tadashi, ak, KAMEZAWA Hiroyuki, Oleg Nesterov,
      Jean Delvare

The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of
system calls.  For this race to trigger, the process that
proc_pid_readdir stops at must exit before readdir is called again.

This can cause ps to fail to report processes, and it is in violation
of posix guarantees and normal application expectations with respect
to readdir.

Currently there is no way to work around this problem in user space
short of providing a gargantuan buffer to user space so the directory
read all happens in one system call.

This patch implements the normal directory semantics for proc, which
guarantee that a directory entry that is neither created nor destroyed
while reading the directory will be returned.  For directories that
are either created or destroyed during the readdir you may or may not
see them.  Furthermore you may seek to a directory offset you have
previously seen.

These are the guarantees that ext[23] provides, that posix requires,
and more importantly that user space expects.  Plus it is a simple
semantic to implement a reliable service with: it is just a matter of
calling readdir a second time if you are wondering whether something
new has shown up.

These better semantics are implemented by scanning through the pids in
numerical order and by making the file offset a pid plus a fixed
offset.
The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm.  Given that a
typical cache line is 64 bytes and thus covers space for 64*8 == 512
pids, there are only 64 cache lines for the entire 32K pid space.  A
typical system will have 100 pids or more, so this is actually fewer
cache lines than we would have to touch to scan a linked list, and the
worst case of having to scan the entire pid bitmap is pretty
reasonable.

If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.

In addition this takes no additional locks and is actually less code
than what we are doing now.

This patch is against 2.6.18-rc6 and it should be relatively
straightforward to backport to older kernels as well.

Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
providing the first fix, pointing this out and working on it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/base.c      |   93 +++++++++++++--------------------------------
 include/linux/pid.h |    1 +
 kernel/pid.c        |   37 ++++++++++++++++++++
 3 files changed, 62 insertions(+), 69 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index fe8d55f..7b93454 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2141,72 +2141,32 @@ out_no_task:
 }

 /*
- * Find the first tgid to return to user space.
+ * Find the first task with tgid >= tgid
  *
- * Usually this is just whatever follows &init_task, but if the users
- * buffer was too small to hold the full list or there was a seek into
- * the middle of the directory we have more work to do.
- *
- * In the case of a short read we start with find_task_by_pid.
- *
- * In the case of a seek we start with &init_task and walk nr
- * threads past it.
*/ -static struct task_struct *first_tgid(int tgid, unsigned int nr) +static struct task_struct *next_tgid(unsigned int tgid) { - struct task_struct *pos; - rcu_read_lock(); - if (tgid && nr) { - pos = find_task_by_pid(tgid); - if (pos && thread_group_leader(pos)) - goto found; - } - /* If nr exceeds the number of processes get out quickly */ - pos = NULL; - if (nr && nr >= nr_processes()) - goto done; - - /* If we haven't found our starting place yet start with - * the init_task and walk nr tasks forward. - */ - for (pos = next_task(&init_task); nr > 0; --nr) { - pos = next_task(pos); - if (pos == &init_task) { - pos = NULL; - goto done; - } - } -found: - get_task_struct(pos); -done: - rcu_read_unlock(); - return pos; -} + struct task_struct *task; + struct pid *pid; -/* - * Find the next task in the task list. - * Return NULL if we loop or there is any error. - * - * The reference to the input task_struct is released. - */ -static struct task_struct *next_tgid(struct task_struct *start) -{ - struct task_struct *pos; + task = NULL; rcu_read_lock(); - pos = start; - if (pid_alive(start)) - pos = next_task(start); - if (pid_alive(pos) && (pos != &init_task)) { - get_task_struct(pos); - goto done; +retry: + pid = find_next_pid(tgid); + if (pid) { + tgid = pid->nr + 1; + task = pid_task(pid, PIDTYPE_PID); + if (!task || !thread_group_leader(task)) + goto retry; + get_task_struct(task); } - pos = NULL; -done: rcu_read_unlock(); - put_task_struct(start); - return pos; + return task; + } +#define TGID_OFFSET ((FIRST_PROCESS_ENTRY + 1) - 1) + /* for the /proc/ directory itself, after non-process stuff has been done */ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir) { @@ -2222,29 +2182,24 @@ int proc_pid_readdir(struct file * filp, filp->f_pos++; nr++; } - nr -= 1; - /* f_version caches the tgid value that the last readdir call couldn't - * return. lseek aka telldir automagically resets f_version to 0. 
- */ - tgid = filp->f_version; - filp->f_version = 0; - for (task = first_tgid(tgid, nr); + tgid = filp->f_pos - TGID_OFFSET; + for (task = next_tgid(tgid); task; - task = next_tgid(task), filp->f_pos++) { + task = next_tgid(tgid + 1)) { int len; ino_t ino; tgid = task->pid; + filp->f_pos = tgid + TGID_OFFSET; len = snprintf(buf, sizeof(buf), "%d", tgid); ino = fake_ino(tgid, PROC_TGID_INO); if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) { - /* returning this tgid failed, save it as the first - * pid for the next readir call */ - filp->f_version = tgid; put_task_struct(task); - break; + goto out; } } + filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET; +out: return 0; } diff --git a/include/linux/pid.h b/include/linux/pid.h index 29960b0..a24db52 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -87,6 +87,7 @@ extern struct pid *FASTCALL(find_pid(int * Lookup a PID in the hash table, and return with it's count elevated. */ extern struct pid *find_get_pid(int nr); +extern struct pid *find_next_pid(int nr); extern struct pid *alloc_pid(void); extern void FASTCALL(free_pid(struct pid *pid)); diff --git a/kernel/pid.c b/kernel/pid.c index 93e212f..53d4159 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -145,6 +145,23 @@ static int alloc_pidmap(void) return -1; } +static int next_pidmap(int last) +{ + int offset; + pidmap_t *map; + + offset = (last + 1) & BITS_PER_PAGE_MASK; + map = &pidmap_array[(last + 1)/BITS_PER_PAGE]; + for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) { + if (unlikely(!map->page)) + continue; + offset = find_next_bit((map)->page, BITS_PER_PAGE, offset); + if (offset < BITS_PER_PAGE) + return mk_pid(map, offset); + } + return -1; +} + fastcall void put_pid(struct pid *pid) { if (!pid) @@ -297,6 +314,26 @@ struct pid *find_get_pid(pid_t nr) } /* + * Used by proc to find the pid with the first + * pid that is greater than or equal to number. + * + * If there is a pid at nr this function is exactly the same as find_pid. 
+ */ +struct pid *find_next_pid(int nr) +{ + struct pid *next; + + next = find_pid(nr); + while (!next) { + nr = next_pidmap(nr); + if (nr <= 0) + break; + next = find_pid(nr); + } + return next; +} + +/* * The pid hash table is scaled according to the amount of memory in the * machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or * more. -- ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix.
  2006-09-04 23:13 ` [PATCH] proc: readdir race fix Eric W. Biederman
@ 2006-09-05  1:30   ` KAMEZAWA Hiroyuki
  2006-09-05  2:30     ` Eric W. Biederman
  2006-09-05  2:41     ` KAMEZAWA Hiroyuki
  2006-09-05  2:26   ` KAMEZAWA Hiroyuki
  2006-09-05  5:26   ` [PATCH] proc: readdir race fix KAMEZAWA Hiroyuki
  2 siblings, 2 replies; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-09-05  1:30 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, akpm, saito.tadashi, ak, oleg, jdelvare

Hi,

On Mon, 04 Sep 2006 17:13:10 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:
> These better semantics are implemented by scanning through the pids in
> numerical order and by making the file offset a pid plus a fixed
> offset.

I think this is a very sane/solid approach.  Maybe this is the way to go.
I'll test and ack later, thank you.

> The pid scan happens on the pid bitmap, which when you look at it is
> remarkably efficient for a brute force algorithm.  Given that a
> typical cache line is 64 bytes and thus covers space for 64*8 == 512
> pids, there are only 64 cache lines for the entire 32K pid space.  A
> typical system will have 100 pids or more, so this is actually fewer
> cache lines than we would have to touch to scan a linked list, and the
> worst case of having to scan the entire pid bitmap is pretty
> reasonable.

I agree with you, but...  Because this approach has to access *all* task
structs in the system, and has to scan the pidhash many times, I'm not
sure this scan & lookup is good for a future implementation.  But this
patch is obviously better than the current implementation.

> /*
> + * Used by proc to find the pid with the first
> + * pid that is greater than or equal to number.
> + *
> + * If there is a pid at nr this function is exactly the same as find_pid.
> + */
> +struct pid *find_next_pid(int nr)
> +{

How about find_first_used_pid(int nr) ?

-Kame

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix. 2006-09-05 1:30 ` KAMEZAWA Hiroyuki @ 2006-09-05 2:30 ` Eric W. Biederman 2006-09-05 2:41 ` KAMEZAWA Hiroyuki 1 sibling, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2006-09-05 2:30 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, akpm, saito.tadashi, ak, oleg, jdelvare KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes: > Hi, > > On Mon, 04 Sep 2006 17:13:10 -0600 > ebiederm@xmission.com (Eric W. Biederman) wrote: >> These better semantics are implemented by scanning through the >> pids in numerical order and by making the file offset a pid >> plus a fixed offset. > I think this is very sane/solid approach. > Maybe this is the way to go. I'll test and ack later, thank you. > >> The pid scan happens on the pid bitmap, which when you look at it is >> remarkably efficient for a brute force algorithm. Given that a typical >> cache line is 64 bytes and thus covers space for 64*8 == 200 pids. There >> are only 40 cache lines for the entire 32K pid space. A typical system >> will have 100 pids or more so this is actually fewer cache lines we have >> to look at to scan a linked list, and the worst case of having to scan >> the entire pid bitmap is pretty reasonable. > > I agree with you but.. > Becasue this approach has to access *all* task structs in a system, > and have to scan pidhash many times. I'm not sure that this scan & lookup > is good for future implementation. But this patch is obviously better than > current implementation. Quite possibly. But where this approach falls down in not in the basics but in the data structures. And it isn't that hard to fix the data structures. Just something I don't want to do the first pass. >> /* >> + * Used by proc to find the pid with the first >> + * pid that is greater than or equal to number. >> + * >> + * If there is a pid at nr this function is exactly the same as find_pid. 
>> + */ >> +struct pid *find_next_pid(int nr) >> +{ > > How about find_first_used_pid(int nr) ? Maybe. The best name I can think of is: find_greater_or_equal_pid(int nr) and that is a little clumsy. On the other hand it isn't confusing what the function does. Any other suggestions? Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix.
  2006-09-05  1:30   ` KAMEZAWA Hiroyuki
  2006-09-05  2:30     ` Eric W. Biederman
@ 2006-09-05  2:41     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-09-05  2:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: ebiederm, linux-kernel, akpm, saito.tadashi, ak, oleg, jdelvare

On Tue, 5 Sep 2006 10:30:10 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Hi,
>
> On Mon, 04 Sep 2006 17:13:10 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
> > These better semantics are implemented by scanning through the pids in
> > numerical order and by making the file offset a pid plus a fixed
> > offset.
> I think this is a very sane/solid approach.  Maybe this is the way to go.
> I'll test and ack later, thank you.

Just a memo: one interesting aspect of this patch is..

- The default result of ps with the current (old) kernel is naturally
  sorted by start time, so the ps command itself comes at the bottom of
  the result, as the newest process.
- The default result of ps with this patch is sorted by pid, regardless
  of start time.

==
kamezawa  3770  0.0  0.0  5156   996 tty1   S+  11:17 0:00 /bin/bash ./testp
kawamura  3989  0.0  0.0  2168   832 pts/3  S+  11:32 0:00 sh -c (cd /usr/sh
kawamura  3990  0.0  0.0  2168   408 pts/3  S+  11:32 0:00 sh -c (cd /usr/sh
kawamura  4002  0.0  0.0  1924   680 pts/3  S+  11:32 0:00 /usr/bin/less -iR
root     10772  0.0  0.0  6912  2188 ?      Ss  11:31 0:00 sshd: kawamura [p
kamezawa 12341  0.6  0.0  5336  1476 tty2   Ss  11:17 0:06 -bash
kamezawa 15308  0.0  0.0  4788   544 tty1   S+  11:32 0:00 sleep 1
kamezawa 15315  0.0  0.0  2380   756 pts/0  R+  11:32 0:00 ps aux
kamezawa 17322  0.0  0.0  5332  1472 tty3   Ss+ 11:18 0:00 -bash
root     22194  0.0  0.0  1760   744 ?      Ss  11:20 0:00 in.telnetd: awork
root     22198  0.0  0.0  2992  1476 ?      Ss  11:20 0:00 login -- kamezawa
kawamura 24526  0.0  0.0  6912  1544 ?      S   11:31 0:00 sshd: kawamura@pt
kawamura 24529  0.0  0.0  5336  1456 pts/3  Ss  11:31 0:00 -bash
satoshi  26239  2.2  0.3 40292 13696 ?      Sl  11:30 0:02 /usr/bin/gnome-te
==

But 'ps --sort start_time' is available.

-Kame

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix.
  2006-09-04 23:13 ` [PATCH] proc: readdir race fix Eric W. Biederman
  2006-09-05  1:30   ` KAMEZAWA Hiroyuki
@ 2006-09-05  2:26   ` KAMEZAWA Hiroyuki
  2006-09-05  2:54     ` Eric W. Biederman
  2006-09-05  3:07     ` Eric W. Biederman
  1 sibling, 2 replies; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-09-05  2:26 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, akpm, saito.tadashi, ak, oleg, jdelvare

On Mon, 04 Sep 2006 17:13:10 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

Hi, I hit the OOM killer because of a memory leak of task structs.
Patch is attached.

-Kame

The task struct should always be put.

 fs/proc/base.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.18-psfix/fs/proc/base.c
===================================================================
--- linux-2.6.18-psfix.orig/fs/proc/base.c	2006-09-05 10:42:40.000000000 +0900
+++ linux-2.6.18-psfix/fs/proc/base.c	2006-09-05 11:11:42.000000000 +0900
@@ -2193,8 +2193,8 @@
 		filp->f_pos = tgid + TGID_OFFSET;
 		len = snprintf(buf, sizeof(buf), "%d", tgid);
 		ino = fake_ino(tgid, PROC_TGID_INO);
+		put_task_struct(task);
 		if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) {
-			put_task_struct(task);
 			goto out;
 		}
 	}

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix.
  2006-09-05  2:26   ` KAMEZAWA Hiroyuki
@ 2006-09-05  2:54     ` Eric W. Biederman
  2006-09-05  3:07     ` Eric W. Biederman
  1 sibling, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2006-09-05  2:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, akpm, saito.tadashi, ak, oleg, jdelvare

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Mon, 04 Sep 2006 17:13:10 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
> Hi, I hit the OOM killer because of a memory leak of task structs.
> Patch is attached.
> -Kame
>
> The task struct should always be put.

Good catch.  I actually have a pending cleanup that works better if the
task struct is held across the filldir, so the incremental patch should
look like the one below.  I also noticed a benign typo in TGID_OFFSET
(a subtract-one that I shouldn't have).

Complete patch in just a moment.

Eric

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b7650b9..03f6680 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2186,7 +2186,7 @@ int proc_pid_readdir(struct file * filp,
 	tgid = filp->f_pos - TGID_OFFSET;
 	for (task = next_tgid(tgid);
 	     task;
-	     task = next_tgid(tgid + 1)) {
+	     put_task_struct(task), task = next_tgid(tgid + 1)) {
 		int len;
 		ino_t ino;
 		tgid = task->pid;

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* [PATCH] proc: readdir race fix 2006-09-05 2:26 ` KAMEZAWA Hiroyuki 2006-09-05 2:54 ` Eric W. Biederman @ 2006-09-05 3:07 ` Eric W. Biederman 2006-09-05 5:39 ` KAMEZAWA Hiroyuki 2006-09-05 10:10 ` Oleg Nesterov 1 sibling, 2 replies; 25+ messages in thread From: Eric W. Biederman @ 2006-09-05 3:07 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: linux-kernel, akpm, ak, oleg, jdelvare, Albert Cahalan, Paul Jackson The problem: An opendir, readdir, closedir sequence can fail to report process ids that are continually in use throughout the sequence of system calls. For this race to trigger the process that proc_pid_readdir stops at must exit before readdir is called again. This can cause ps to fail to report processes, and it is in violation of posix guarantees and normal application expectations with respect to readdir. Currently there is no way to work around this problem in user space short of providing a gargantuan buffer to user space so the directory read all happens in on system call. This patch implements the normal directory semantics for proc, that guarantee that a directory entry that is neither created nor destroyed while reading the directory entry will be returned. For directory that are either created or destroyed during the readdir you may or may not see them. Furthermore you may seek to a directory offset you have previously seen. These are the guarantee that ext[23] provides and that posix requires, and more importantly that user space expects. Plus it is a simple semantic to implement reliable service. It is just a matter of calling readdir a second time if you are wondering if something new has show up. These better semantics are implemented by scanning through the pids in numerical order and by making the file offset a pid plus a fixed offset. The pid scan happens on the pid bitmap, which when you look at it is remarkably efficient for a brute force algorithm. Given that a typical cache line is 64 bytes and thus covers space for 64*8 == 200 pids. 
There are only 40 cache lines for the entire 32K pid space. A typical system will have 100 pids or more so this is actually fewer cache lines we have to look at to scan a linked list, and the worst case of having to scan the entire pid bitmap is pretty reasonable. If we need something more efficient we can go to a more efficient data structure for indexing the pids, but for now what we have should be sufficient. In addition this takes no additional locks and is actually less code than what we are doing now. This patch is against 2.6.18-rc6 and it should be relatively straight forward to backport to older kernels as well. Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for providing the first fix, pointing this out and working on it. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- fs/proc/base.c | 93 +++++++++++++-------------------------------------- include/linux/pid.h | 1 + kernel/pid.c | 37 ++++++++++++++++++++ 3 files changed, 62 insertions(+), 69 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index fe8d55f..03f6680 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2141,72 +2141,32 @@ out_no_task: } /* - * Find the first tgid to return to user space. + * Find the first task with tgid >= tgid * - * Usually this is just whatever follows &init_task, but if the users - * buffer was too small to hold the full list or there was a seek into - * the middle of the directory we have more work to do. - * - * In the case of a short read we start with find_task_by_pid. - * - * In the case of a seek we start with &init_task and walk nr - * threads past it. 
*/ -static struct task_struct *first_tgid(int tgid, unsigned int nr) +static struct task_struct *next_tgid(unsigned int tgid) { - struct task_struct *pos; - rcu_read_lock(); - if (tgid && nr) { - pos = find_task_by_pid(tgid); - if (pos && thread_group_leader(pos)) - goto found; - } - /* If nr exceeds the number of processes get out quickly */ - pos = NULL; - if (nr && nr >= nr_processes()) - goto done; - - /* If we haven't found our starting place yet start with - * the init_task and walk nr tasks forward. - */ - for (pos = next_task(&init_task); nr > 0; --nr) { - pos = next_task(pos); - if (pos == &init_task) { - pos = NULL; - goto done; - } - } -found: - get_task_struct(pos); -done: - rcu_read_unlock(); - return pos; -} + struct task_struct *task; + struct pid *pid; -/* - * Find the next task in the task list. - * Return NULL if we loop or there is any error. - * - * The reference to the input task_struct is released. - */ -static struct task_struct *next_tgid(struct task_struct *start) -{ - struct task_struct *pos; + task = NULL; rcu_read_lock(); - pos = start; - if (pid_alive(start)) - pos = next_task(start); - if (pid_alive(pos) && (pos != &init_task)) { - get_task_struct(pos); - goto done; +retry: + pid = find_next_pid(tgid); + if (pid) { + tgid = pid->nr + 1; + task = pid_task(pid, PIDTYPE_PID); + if (!task || !thread_group_leader(task)) + goto retry; + get_task_struct(task); } - pos = NULL; -done: rcu_read_unlock(); - put_task_struct(start); - return pos; + return task; + } +#define TGID_OFFSET (FIRST_PROCESS_ENTRY + (1 /* /proc/self */)) + /* for the /proc/ directory itself, after non-process stuff has been done */ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir) { @@ -2222,29 +2182,24 @@ int proc_pid_readdir(struct file * filp, filp->f_pos++; nr++; } - nr -= 1; - /* f_version caches the tgid value that the last readdir call couldn't - * return. lseek aka telldir automagically resets f_version to 0. 
- */ - tgid = filp->f_version; - filp->f_version = 0; - for (task = first_tgid(tgid, nr); + tgid = filp->f_pos - TGID_OFFSET; + for (task = next_tgid(tgid); task; - task = next_tgid(task), filp->f_pos++) { + put_task_struct(task), task = next_tgid(tgid + 1)) { int len; ino_t ino; tgid = task->pid; + filp->f_pos = tgid + TGID_OFFSET; len = snprintf(buf, sizeof(buf), "%d", tgid); ino = fake_ino(tgid, PROC_TGID_INO); if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) { - /* returning this tgid failed, save it as the first - * pid for the next readir call */ - filp->f_version = tgid; put_task_struct(task); - break; + goto out; } } + filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET; +out: return 0; } diff --git a/include/linux/pid.h b/include/linux/pid.h index 29960b0..a24db52 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -87,6 +87,7 @@ extern struct pid *FASTCALL(find_pid(int * Lookup a PID in the hash table, and return with it's count elevated. */ extern struct pid *find_get_pid(int nr); +extern struct pid *find_next_pid(int nr); extern struct pid *alloc_pid(void); extern void FASTCALL(free_pid(struct pid *pid)); diff --git a/kernel/pid.c b/kernel/pid.c index 93e212f..53d4159 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -145,6 +145,23 @@ static int alloc_pidmap(void) return -1; } +static int next_pidmap(int last) +{ + int offset; + pidmap_t *map; + + offset = (last + 1) & BITS_PER_PAGE_MASK; + map = &pidmap_array[(last + 1)/BITS_PER_PAGE]; + for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) { + if (unlikely(!map->page)) + continue; + offset = find_next_bit((map)->page, BITS_PER_PAGE, offset); + if (offset < BITS_PER_PAGE) + return mk_pid(map, offset); + } + return -1; +} + fastcall void put_pid(struct pid *pid) { if (!pid) @@ -297,6 +314,26 @@ struct pid *find_get_pid(pid_t nr) } /* + * Used by proc to find the pid with the first + * pid that is greater than or equal to number. 
+ * + * If there is a pid at nr this function is exactly the same as find_pid. + */ +struct pid *find_next_pid(int nr) +{ + struct pid *next; + + next = find_pid(nr); + while (!next) { + nr = next_pidmap(nr); + if (nr <= 0) + break; + next = find_pid(nr); + } + return next; +} + +/* * The pid hash table is scaled according to the amount of memory in the * machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or * more. -- ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix
  2006-09-05  3:07     ` Eric W. Biederman
@ 2006-09-05  5:39       ` KAMEZAWA Hiroyuki
  2006-09-05 10:10       ` Oleg Nesterov
  1 sibling, 0 replies; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-09-05  5:39 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, akpm, ak, oleg, jdelvare, acahalan, pj

On Mon, 04 Sep 2006 21:07:29 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:
> Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
> providing the first fix, pointing this out and working on it.
>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Worked well with my (short) test set, and looks good.  Thank you.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix 2006-09-05 3:07 ` Eric W. Biederman 2006-09-05 5:39 ` KAMEZAWA Hiroyuki @ 2006-09-05 10:10 ` Oleg Nesterov 2006-09-05 11:36 ` Eric W. Biederman 2006-09-05 14:52 ` [PATCH] proc: readdir race fix (take 3) Eric W. Biederman 1 sibling, 2 replies; 25+ messages in thread From: Oleg Nesterov @ 2006-09-05 10:10 UTC (permalink / raw) To: Eric W. Biederman Cc: KAMEZAWA Hiroyuki, linux-kernel, akpm, ak, jdelvare, Albert Cahalan, Paul Jackson On 09/04, Eric W. Biederman wrote: > > -static struct task_struct *next_tgid(struct task_struct *start) > -{ > - struct task_struct *pos; > + task = NULL; > rcu_read_lock(); > - pos = start; > - if (pid_alive(start)) > - pos = next_task(start); > - if (pid_alive(pos) && (pos != &init_task)) { > - get_task_struct(pos); > - goto done; > +retry: > + pid = find_next_pid(tgid); > + if (pid) { > + tgid = pid->nr + 1; > + task = pid_task(pid, PIDTYPE_PID); > + if (!task || !thread_group_leader(task)) ^^^^^^^^^^^^^^^^^^^ There is a window while de_thread() switches leadership, so next_tgid() may skip a task doing exec. What do you think about // needs a comment if (!task || task->pid != task->tgid) goto retry; instead? Currently first_tgid() has the same (very minor) problem. > + goto retry; > + get_task_struct(task); > } > - pos = NULL; > -done: > rcu_read_unlock(); > - put_task_struct(start); > - return pos; > + return task; > + > } Emply line before '}' > +struct pid *find_next_pid(int nr) > +{ > + struct pid *next; > + > + next = find_pid(nr); > + while (!next) { > + nr = next_pidmap(nr); > + if (nr <= 0) > + break; > + next = find_pid(nr); > + } > + return next; > +} This is strange that we are doing find_pid() before and at the end of loop, I'd suggest this code: struct pid *find_next_pid(int nr) { struct pid *pid; do { pid = find_pid(nr); if (pid != NULL) break; nr = next_pidmap(nr); } while (nr > 0); return pid; } Imho, a bit easier to read. Oleg. 
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix 2006-09-05 10:10 ` Oleg Nesterov @ 2006-09-05 11:36 ` Eric W. Biederman 2006-09-05 14:52 ` [PATCH] proc: readdir race fix (take 3) Eric W. Biederman 1 sibling, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2006-09-05 11:36 UTC (permalink / raw) To: Oleg Nesterov Cc: KAMEZAWA Hiroyuki, linux-kernel, akpm, ak, jdelvare, Albert Cahalan, Paul Jackson Oleg Nesterov <oleg@tv-sign.ru> writes: > On 09/04, Eric W. Biederman wrote: >> >> -static struct task_struct *next_tgid(struct task_struct *start) >> -{ >> - struct task_struct *pos; >> + task = NULL; >> rcu_read_lock(); >> - pos = start; >> - if (pid_alive(start)) >> - pos = next_task(start); >> - if (pid_alive(pos) && (pos != &init_task)) { >> - get_task_struct(pos); >> - goto done; >> +retry: >> + pid = find_next_pid(tgid); >> + if (pid) { >> + tgid = pid->nr + 1; >> + task = pid_task(pid, PIDTYPE_PID); >> + if (!task || !thread_group_leader(task)) > ^^^^^^^^^^^^^^^^^^^ > There is a window while de_thread() switches leadership, so next_tgid() > may skip a task doing exec. What do you think about > > // needs a comment > if (!task || task->pid != task->tgid) > goto retry; > > instead? Currently first_tgid() has the same (very minor) problem. I see the problem, and your test will certainly alleviate the symptom. You are making the test has this process ever been a thread group leader. I guess alleviating the symptom is all that is necessary there. Grumble. I hate that entire pid transfer case, too bad glibc cares. If I could in the fix for this I would like to add a clean concept that we are testing for wrapped in a helper function. Otherwise even with a big fat comment this will be easy to break next time someone refactors the code. 
>> + goto retry; >> + get_task_struct(task); >> } >> - pos = NULL; >> -done: >> rcu_read_unlock(); >> - put_task_struct(start); >> - return pos; >> + return task; >> + >> } > > Emply line before '}' > >> +struct pid *find_next_pid(int nr) >> +{ >> + struct pid *next; >> + >> + next = find_pid(nr); >> + while (!next) { >> + nr = next_pidmap(nr); >> + if (nr <= 0) >> + break; >> + next = find_pid(nr); >> + } >> + return next; >> +} > > This is strange that we are doing find_pid() before and at the end of loop, > I'd suggest this code: > > struct pid *find_next_pid(int nr) > { > struct pid *pid; > > do { > pid = find_pid(nr); > if (pid != NULL) > break; > nr = next_pidmap(nr); > } while (nr > 0); > > return pid; > } > > Imho, a bit easier to read. It is at least not worse, so it is probably worth doing. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH] proc: readdir race fix (take 3)
  2006-09-05 10:10   ` Oleg Nesterov
  2006-09-05 11:36   ` Eric W. Biederman
@ 2006-09-05 14:52   ` Eric W. Biederman
  2006-09-06  9:01     ` Jean Delvare
  1 sibling, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2006-09-05 14:52 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, KAMEZAWA Hiroyuki, linux-kernel, ak, jdelvare, Albert Cahalan, Paul Jackson

The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of
system calls.  For this race to trigger, the process that
proc_pid_readdir stops at must exit before readdir is called again.

This can cause ps to fail to report processes, and it is in violation
of posix guarantees and normal application expectations with respect
to readdir.

Currently there is no way to work around this problem in user space,
short of providing a gargantuan buffer to user space so the directory
read all happens in one system call.

This patch implements the normal directory semantics for proc, which
guarantee that a directory entry that is neither created nor destroyed
while reading the directory will be returned.  Directories that are
created or destroyed during the readdir may or may not be seen.
Furthermore you may seek to a directory offset you have previously
seen.

These are the guarantees that ext[23] provides and that posix requires,
and more importantly that user space expects.  Plus it is a simple
semantic on which to implement a reliable service: it is just a matter
of calling readdir a second time if you are wondering whether something
new has shown up.

These better semantics are implemented by scanning through the pids in
numerical order and by making the file offset a pid plus a fixed
offset.

The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm.  Given that a typical
cache line is 64 bytes and thus covers space for 64*8 == 512 pids,
there are only 64 cache lines for the entire 32K pid space.  A typical
system will have 100 pids or more, so this is actually fewer cache
lines than we would have to touch to scan a linked list, and the worst
case of having to scan the entire pid bitmap is pretty reasonable.

If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.

In addition this takes no additional locks and is actually less code
than what we are doing now.

Also another very subtle bug in this area has been fixed.  It is
possible to catch a task in the middle of de_thread, where a thread is
assuming the pid of its thread group leader.  This patch carefully
handles that case, so that if we hit it we don't fail to return the pid
that is undergoing the de_thread dance.

This patch is against 2.6.18-rc6 and it should be relatively
straightforward to backport to older kernels as well.

Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
providing the first fix, pointing this out and working on it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/base.c        |  104 ++++++++++++++++---------------------------------
 include/linux/pid.h   |    1 
 include/linux/sched.h |   11 +++++
 kernel/pid.c          |   36 +++++++++++++++++
 4 files changed, 83 insertions(+), 69 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index fe8d55f..28e56c3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2141,72 +2141,43 @@ out_no_task:
 }
 
 /*
- * Find the first tgid to return to user space.
+ * Find the first task with tgid >= tgid
  *
- * Usually this is just whatever follows &init_task, but if the users
- * buffer was too small to hold the full list or there was a seek into
- * the middle of the directory we have more work to do.
- *
- * In the case of a short read we start with find_task_by_pid.
- *
- * In the case of a seek we start with &init_task and walk nr
- * threads past it.
  */
-static struct task_struct *first_tgid(int tgid, unsigned int nr)
+static struct task_struct *next_tgid(unsigned int tgid)
 {
-	struct task_struct *pos;
-	rcu_read_lock();
-	if (tgid && nr) {
-		pos = find_task_by_pid(tgid);
-		if (pos && thread_group_leader(pos))
-			goto found;
-	}
-	/* If nr exceeds the number of processes get out quickly */
-	pos = NULL;
-	if (nr && nr >= nr_processes())
-		goto done;
-
-	/* If we haven't found our starting place yet start with
-	 * the init_task and walk nr tasks forward.
-	 */
-	for (pos = next_task(&init_task); nr > 0; --nr) {
-		pos = next_task(pos);
-		if (pos == &init_task) {
-			pos = NULL;
-			goto done;
-		}
-	}
-found:
-	get_task_struct(pos);
-done:
-	rcu_read_unlock();
-	return pos;
-}
+	struct task_struct *task;
+	struct pid *pid;
 
-/*
- * Find the next task in the task list.
- * Return NULL if we loop or there is any error.
- *
- * The reference to the input task_struct is released.
- */
-static struct task_struct *next_tgid(struct task_struct *start)
-{
-	struct task_struct *pos;
+	task = NULL;
 	rcu_read_lock();
-	pos = start;
-	if (pid_alive(start))
-		pos = next_task(start);
-	if (pid_alive(pos) && (pos != &init_task)) {
-		get_task_struct(pos);
-		goto done;
+retry:
+	pid = find_ge_pid(tgid);
+	if (pid) {
+		tgid = pid->nr + 1;
+		task = pid_task(pid, PIDTYPE_PID);
+		/* What we want to know is whether the pid we have
+		 * found is the pid of a thread_group_leader.  Testing
+		 * for task being a thread_group_leader is the obvious
+		 * thing to do, but there is a window when it fails,
+		 * due to the pid transfer logic in de_thread.
+		 *
+		 * So we perform the straightforward test of seeing
+		 * if the pid we have found is the pid of a thread
+		 * group leader, and don't worry if the task we have
+		 * found doesn't happen to be a thread group leader.
+		 * As we don't care in the case of readdir.
+		 */
+		if (!task || !has_group_leader_pid(task))
+			goto retry;
+		get_task_struct(task);
 	}
-	pos = NULL;
-done:
 	rcu_read_unlock();
-	put_task_struct(start);
-	return pos;
+	return task;
 }
 
+#define TGID_OFFSET (FIRST_PROCESS_ENTRY + (1 /* /proc/self */))
+
 /* for the /proc/ directory itself, after non-process stuff has been done */
 int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 {
@@ -2222,29 +2193,24 @@ int proc_pid_readdir(struct file * filp,
 		filp->f_pos++;
 		nr++;
 	}
-	nr -= 1;
-	/* f_version caches the tgid value that the last readdir call couldn't
-	 * return. lseek aka telldir automagically resets f_version to 0.
-	 */
-	tgid = filp->f_version;
-	filp->f_version = 0;
-	for (task = first_tgid(tgid, nr);
+	tgid = filp->f_pos - TGID_OFFSET;
+	for (task = next_tgid(tgid);
 	     task;
-	     task = next_tgid(task), filp->f_pos++) {
+	     put_task_struct(task), task = next_tgid(tgid + 1)) {
 		int len;
 		ino_t ino;
 		tgid = task->pid;
+		filp->f_pos = tgid + TGID_OFFSET;
 		len = snprintf(buf, sizeof(buf), "%d", tgid);
 		ino = fake_ino(tgid, PROC_TGID_INO);
 		if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) {
-			/* returning this tgid failed, save it as the first
-			 * pid for the next readir call */
-			filp->f_version = tgid;
 			put_task_struct(task);
-			break;
+			goto out;
 		}
 	}
+	filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET;
+out:
 	return 0;
 }
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 29960b0..525af8b 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -87,6 +87,7 @@ extern struct pid *FASTCALL(find_pid(int
  * Lookup a PID in the hash table, and return with it's count elevated.
  */
 extern struct pid *find_get_pid(int nr);
+extern struct pid *find_ge_pid(int nr);
 
 extern struct pid *alloc_pid(void);
 extern void FASTCALL(free_pid(struct pid *pid));
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..a5dea85 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,6 +1332,17 @@ #define while_each_thread(g, t) \
 /* de_thread depends on thread_group_leader not being a pid based check */
 #define thread_group_leader(p)	(p == p->group_leader)
 
+/* Due to the insanities of de_thread it is possible for a process
+ * to have the pid of the thread group leader without actually being
+ * the thread group leader.  For iteration through the pids in proc
+ * all we care about is that we have a task with the appropriate
+ * pid, we don't actually care if we have the right task.
+ */
+static inline int has_group_leader_pid(struct task_struct *p)
+{
+	return p->pid == p->tgid;
+}
+
 static inline struct task_struct *next_thread(const struct task_struct *p)
 {
 	return list_entry(rcu_dereference(p->thread_group.next),
diff --git a/kernel/pid.c b/kernel/pid.c
index 93e212f..f396796 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -145,6 +145,23 @@ static int alloc_pidmap(void)
 	return -1;
 }
 
+static int next_pidmap(int last)
+{
+	int offset;
+	pidmap_t *map;
+
+	offset = (last + 1) & BITS_PER_PAGE_MASK;
+	map = &pidmap_array[(last + 1)/BITS_PER_PAGE];
+	for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) {
+		if (unlikely(!map->page))
+			continue;
+		offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
+		if (offset < BITS_PER_PAGE)
+			return mk_pid(map, offset);
+	}
+	return -1;
+}
+
 fastcall void put_pid(struct pid *pid)
 {
 	if (!pid)
@@ -297,6 +314,25 @@ struct pid *find_get_pid(pid_t nr)
 }
 
 /*
+ * Used by proc to find the first pid that is greater than or equal to nr.
+ *
+ * If there is a pid at nr this function is exactly the same as find_pid.
+ */
+struct pid *find_ge_pid(int nr)
+{
+	struct pid *pid;
+
+	do {
+		pid = find_pid(nr);
+		if (pid)
+			break;
+		nr = next_pidmap(nr);
+	} while (nr > 0);
+
+	return pid;
+}
+
+/*
  * The pid hash table is scaled according to the amount of memory in the
  * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
  * more.
-- 

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix (take 3)
  2006-09-05 14:52 ` [PATCH] proc: readdir race fix (take 3) Eric W. Biederman
@ 2006-09-06  9:01   ` Jean Delvare
  2006-09-06 21:12     ` Jean Delvare
  0 siblings, 1 reply; 25+ messages in thread
From: Jean Delvare @ 2006-09-06 9:01 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki, linux-kernel, ak, Albert Cahalan, Paul Jackson

On Tuesday 5 September 2006 16:52, Eric W. Biederman wrote:
> The problem: An opendir, readdir, closedir sequence can fail to report process ids that are continually in use throughout the sequence of system calls.  For this race to trigger, the process that proc_pid_readdir stops at must exit before readdir is called again.
>
> This can cause ps to fail to report processes, and it is in violation of posix guarantees and normal application expectations with respect to readdir.
>
> Currently there is no way to work around this problem in user space, short of providing a gargantuan buffer to user space so the directory read all happens in one system call.
>
> This patch implements the normal directory semantics for proc, which guarantee that a directory entry that is neither created nor destroyed while reading the directory will be returned.  Directories that are created or destroyed during the readdir may or may not be seen.  Furthermore you may seek to a directory offset you have previously seen.
>
> These are the guarantees that ext[23] provides and that posix requires, and more importantly that user space expects.  Plus it is a simple semantic on which to implement a reliable service: it is just a matter of calling readdir a second time if you are wondering whether something new has shown up.
>
> These better semantics are implemented by scanning through the pids in numerical order and by making the file offset a pid plus a fixed offset.
>
> The pid scan happens on the pid bitmap, which when you look at it is remarkably efficient for a brute force algorithm.  Given that a typical cache line is 64 bytes and thus covers space for 64*8 == 512 pids, there are only 64 cache lines for the entire 32K pid space.  A typical system will have 100 pids or more, so this is actually fewer cache lines than we would have to touch to scan a linked list, and the worst case of having to scan the entire pid bitmap is pretty reasonable.
>
> If we need something more efficient we can go to a more efficient data structure for indexing the pids, but for now what we have should be sufficient.
>
> In addition this takes no additional locks and is actually less code than what we are doing now.
>
> Also another very subtle bug in this area has been fixed.  It is possible to catch a task in the middle of de_thread, where a thread is assuming the pid of its thread group leader.  This patch carefully handles that case, so that if we hit it we don't fail to return the pid that is undergoing the de_thread dance.
>
> This patch is against 2.6.18-rc6 and it should be relatively straightforward to backport to older kernels as well.
>
> Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for providing the first fix, pointing this out and working on it.

Eric, Kame, thanks a lot for working on this. I'll be giving some good testing to this patch today, and will return back to you when I'm done.

-- 
Jean Delvare

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix (take 3) 2006-09-06 9:01 ` Jean Delvare @ 2006-09-06 21:12 ` Jean Delvare 2006-09-06 22:25 ` Oleg Nesterov 2006-09-06 22:43 ` [PATCH] proc: readdir race fix (take 3) Eric W. Biederman 0 siblings, 2 replies; 25+ messages in thread From: Jean Delvare @ 2006-09-06 21:12 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki, linux-kernel, ak On Wednesday 6 September 2006 11:01, Jean Delvare wrote: > Eric, Kame, thanks a lot for working on this. I'll be giving some good > testing to this patch today, and will return back to you when I'm done. The original issue is indeed fixed, but there's a problem with the patch. When stressing /proc (to verify the bug was fixed), my test machine ended up crashing. Here are the 2 traces I found in the logs: Sep 6 12:06:00 arrakis kernel: BUG: warning at kernel/fork.c:113/__put_task_struct() Sep 6 12:06:00 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100 Sep 6 12:06:00 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150 Sep 6 12:06:00 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 12:06:00 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 12:06:00 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb Sep 6 12:06:05 arrakis kernel: BUG: warning at kernel/fork.c:113/__put_task_struct() Sep 6 12:06:05 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100 Sep 6 12:06:05 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150 Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 12:06:05 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 12:06:05 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 12:06:05 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb Sep 6 
12:06:05 arrakis kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004 Sep 6 12:06:05 arrakis kernel: printing eip: Sep 6 12:06:05 arrakis kernel: c0115ede Sep 6 12:06:05 arrakis kernel: *pde = 00000000 Sep 6 12:06:05 arrakis kernel: Oops: 0002 [#1] Sep 6 12:06:05 arrakis kernel: PREEMPT Sep 6 12:06:05 arrakis kernel: Modules linked in: button battery ac thermal processor cpufreq_powersave speedstep_ich speedstep_lib freq_table snd_pcm_oss snd_mixer_oss smbfs sg sd_mod nls_iso8859_1 nls_cp437 vfat fat radeon drm usb_storage scsi_mod eth1394 usbhid snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer ide_cd uhci_hcd snd intel_rng psmouse cdrom usbcore evdev soundcore ohci1394 intel_agp i2c_i801 snd_page_alloc ieee1394 e100 mii unix Sep 6 12:06:05 arrakis kernel: CPU: 0 Sep 6 12:06:05 arrakis kernel: EIP: 0060:[<c0115ede>] Not tainted VLI Sep 6 12:06:05 arrakis kernel: EFLAGS: 00010282 (2.6.18-rc6 #113) Sep 6 12:06:05 arrakis kernel: EIP is at __put_task_struct+0x3e/0x100 Sep 6 12:06:05 arrakis kernel: eax: 00000000 ebx: c7ba7580 ecx: c0316bd4 edx: 00000000 Sep 6 12:06:05 arrakis kernel: esi: 00000000 edi: cc4b2d40 ebp: c953ef48 esp: c953ef14 Sep 6 12:06:05 arrakis kernel: ds: 007b es: 007b ss: 0068 Sep 6 12:06:05 arrakis kernel: Process bug (pid: 28719, ti=c953e000 task=c953da90 task.ti=c953e000) Sep 6 12:06:05 arrakis kernel: Stack: 00000000 c02ddbf5 00000071 c02c8401 c7ba7580 c019666a c7ba7580 c953ef48 Sep 6 12:06:05 arrakis kernel: 00000001 00000101 00000000 00000002 00000004 39350030 cc4b0038 c953ef98 Sep 6 12:06:05 arrakis kernel: c0174750 cc4b2d40 c137bdf0 fffffffe c137be60 c01745f0 cc4b2d40 c953ef98 Sep 6 12:06:05 arrakis kernel: Call Trace: Sep 6 12:06:05 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150 Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 12:06:05 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 12:06:05 
arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 12:06:05 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb Sep 6 12:06:05 arrakis kernel: Code: 84 af 00 00 00 8b 43 08 85 c0 75 77 89 e0 25 00 f0 ff ff 3b 18 74 3e 8b 83 84 01 00 00 89 04 24 e8 68 e0 00 00 8b 83 70 01 00 00 <ff> 48 04 0f 94 c2 84 d2 75 10 89 5c 24 18 8b 5c 24 10 83 c4 14 Sep 6 12:06:05 arrakis kernel: EIP: [<c0115ede>] __put_task_struct+0x3e/0x100 SS:ESP 0068:c953ef14 Sep 6 19:38:23 arrakis kernel: BUG: warning at kernel/fork.c:113/__put_task_struct() Sep 6 19:38:23 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100 Sep 6 19:38:23 arrakis kernel: [<c019667e>] proc_pid_readdir+0x14e/0x150 Sep 6 19:38:23 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:23 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 Sep 6 19:38:23 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:23 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 Sep 6 19:38:23 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:23 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb Sep 6 19:38:26 arrakis kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 0000000c Sep 6 19:38:26 arrakis kernel: printing eip: Sep 6 19:38:26 arrakis kernel: c01d2a69 Sep 6 19:38:26 arrakis kernel: *pde = 00000000 Sep 6 19:38:26 arrakis kernel: Oops: 0000 [#1] Sep 6 19:38:26 arrakis kernel: PREEMPT Sep 6 19:38:26 arrakis kernel: Modules linked in: button battery ac thermal processor cpufreq_powersave speedstep_ich speedstep_lib freq_table snd_pcm_oss snd_mixer_oss smbfs nls_iso8859_1 nls_cp437 vfat fat radeon drm sg sd_mod usb_storage scsi_mod eth1394 usbhid snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer intel_rng uhci_hcd snd ide_cd intel_agp e100 usbcore i2c_i801 cdrom ohci1394 psmouse soundcore evdev mii ieee1394 snd_page_alloc unix Sep 6 19:38:26 arrakis kernel: CPU: 0 Sep 6 19:38:26 arrakis kernel: EIP: 
0060:[<c01d2a69>] Not tainted VLI Sep 6 19:38:26 arrakis kernel: EFLAGS: 00010017 (2.6.18-rc6 #115) Sep 6 19:38:26 arrakis kernel: EIP is at rwsem_wake+0x99/0x140 Sep 6 19:38:26 arrakis kernel: eax: 00000000 ebx: d6e483b4 ecx: cd522f64 edx: 00000001 Sep 6 19:38:26 arrakis kernel: esi: cd522f64 edi: b393a000 ebp: d6e48380 esp: d34dff58 Sep 6 19:38:26 arrakis kernel: ds: 007b es: 007b ss: 0068 Sep 6 19:38:26 arrakis kernel: Process firefox-bin (pid: 2400, ti=d34df000 task=d6e6c580 task.ti=d34df000) Sep 6 19:38:26 arrakis kernel: Stack: cf2d94e8 d6e483b8 00000292 cd784df4 d6e483b4 b393a000 d6e48380 c013219e Sep 6 19:38:26 arrakis kernel: 00000025 c0112b99 d6e483b4 cd784df4 b393a000 00000000 00030002 00000000 Sep 6 19:38:26 arrakis kernel: b393a000 d6e6c580 00000004 d34dffbc b77d4b94 09473ed0 c0112810 bfa3ea00 Sep 6 19:38:26 arrakis kernel: Call Trace: Sep 6 19:38:26 arrakis kernel: [<c013219e>] .text.lock.rwsem+0x21/0x43 Sep 6 19:38:26 arrakis kernel: [<c0112b99>] do_page_fault+0x389/0x580 Sep 6 19:38:26 arrakis kernel: [<c0112810>] do_page_fault+0x0/0x580 Sep 6 19:38:26 arrakis kernel: [<c01039f1>] error_code+0x39/0x40 Sep 6 19:38:26 arrakis kernel: Code: c3 e8 1c 0b 0f 00 eb ef 8b 4b 04 89 ce f6 41 0c 02 75 7c 31 d2 8d b6 00 00 00 00 8d bc 27 00 00 00 00 8b 01 42 3b 44 24 04 74 08 <f6> 40 0c 01 89 c1 75 ef 89 d0 89 d5 c1 e0 10 8d 54 02 ff 01 13 Sep 6 19:38:26 arrakis kernel: EIP: [<c01d2a69>] rwsem_wake+0x99/0x140 SS:ESP 0068:d34dff58 Sep 6 19:38:26 arrakis kernel: <6>note: firefox-bin[2400] exited with preempt_count 1 Sep 6 19:38:26 arrakis kernel: BUG: scheduling while atomic: firefox-bin/0x00000001/2400 Sep 6 19:38:26 arrakis kernel: [<c02c3552>] schedule+0x662/0x670 Sep 6 19:38:26 arrakis kernel: [<c02c4ebd>] rwsem_down_read_failed+0xbd/0x1a0 Sep 6 19:38:26 arrakis kernel: [<c0237435>] vt_console_print+0x85/0x2a0 Sep 6 19:38:26 arrakis kernel: [<c0132184>] .text.lock.rwsem+0x7/0x43 Sep 6 19:38:26 arrakis kernel: [<c0133585>] futex_wake+0x25/0xf0 Sep 6 19:38:26 
arrakis kernel: [<c0135210>] do_futex+0x70/0x110 Sep 6 19:38:26 arrakis kernel: [<c0119a34>] release_console_sem+0x104/0x110 Sep 6 19:38:26 arrakis kernel: [<c013530c>] sys_futex+0x5c/0x130 Sep 6 19:38:26 arrakis kernel: [<c011633b>] mm_release+0x8b/0x90 Sep 6 19:38:26 arrakis kernel: [<c011b0e5>] exit_mm+0x25/0x110 Sep 6 19:38:26 arrakis kernel: [<c011b999>] do_exit+0xf9/0x530 Sep 6 19:38:26 arrakis kernel: [<c01042cc>] die+0x20c/0x210 Sep 6 19:38:26 arrakis kernel: [<c0112ab4>] do_page_fault+0x2a4/0x580 Sep 6 19:38:26 arrakis kernel: [<c0112810>] do_page_fault+0x0/0x580 Sep 6 19:38:26 arrakis kernel: [<c01039f1>] error_code+0x39/0x40 Sep 6 19:38:26 arrakis kernel: [<c01d2a69>] rwsem_wake+0x99/0x140 Sep 6 19:38:26 arrakis kernel: [<c013219e>] .text.lock.rwsem+0x21/0x43 Sep 6 19:38:26 arrakis kernel: [<c0112b99>] do_page_fault+0x389/0x580 Sep 6 19:38:26 arrakis kernel: [<c0112810>] do_page_fault+0x0/0x580 Sep 6 19:38:26 arrakis kernel: [<c01039f1>] error_code+0x39/0x40 Sep 6 19:38:27 arrakis kernel: BUG: warning at kernel/fork.c:113/__put_task_struct() Sep 6 19:38:27 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100 Sep 6 19:38:27 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150 Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:27 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:27 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:27 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb Sep 6 19:38:27 arrakis kernel: BUG: unable to handle kernel paging request at virtual address 61672d78 Sep 6 19:38:27 arrakis kernel: printing eip: Sep 6 19:38:27 arrakis kernel: c01cf5ff Sep 6 19:38:27 arrakis kernel: *pde = 00000000 Sep 6 19:38:27 arrakis kernel: Oops: 0002 [#2] Sep 6 19:38:27 arrakis kernel: PREEMPT Sep 6 19:38:27 arrakis kernel: Modules linked in: button battery ac thermal 
processor cpufreq_powersave speedstep_ich speedstep_lib freq_table snd_pcm_oss snd_mixer_oss smbfs nls_iso8859_1 nls_cp437 vfat fat radeon drm sg sd_mod usb_storage scsi_mod eth1394 usbhid snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer intel_rng uhci_hcd snd ide_cd intel_agp e100 usbcore i2c_i801 cdrom ohci1394 psmouse soundcore evdev mii ieee1394 snd_page_alloc unix Sep 6 19:38:27 arrakis kernel: CPU: 0 Sep 6 19:38:27 arrakis kernel: EIP: 0060:[<c01cf5ff>] Not tainted VLI Sep 6 19:38:27 arrakis kernel: EFLAGS: 00010013 (2.6.18-rc6 #115) Sep 6 19:38:27 arrakis kernel: EIP is at _atomic_dec_and_lock+0xf/0x50 Sep 6 19:38:27 arrakis kernel: eax: cd772000 ebx: 61672d78 ecx: 00000000 edx: 00000001 Sep 6 19:38:27 arrakis kernel: esi: 61672d78 edi: cd1f84e0 ebp: cd772f48 esp: cd772ef8 Sep 6 19:38:27 arrakis kernel: ds: 007b es: 007b ss: 0068 Sep 6 19:38:27 arrakis kernel: Process bug (pid: 10459, ti=cd772000 task=cd75a030 task.ti=cd772000) Sep 6 19:38:27 arrakis kernel: Stack: 00000206 c0123f67 61672d78 c03af980 cd521ab0 00000000 c0115ed8 61672d78 Sep 6 19:38:27 arrakis kernel: c02ddbd5 00000071 c02c8401 cd521ab0 c019666a cd521ab0 cd772f48 00000001 Sep 6 19:38:27 arrakis kernel: 00000101 00000000 00000002 00000004 35310030 cd1f8400 cd772f98 c0174750 Sep 6 19:38:27 arrakis kernel: Call Trace: Sep 6 19:38:27 arrakis kernel: [<c0123f67>] free_uid+0x27/0xa0 Sep 6 19:38:27 arrakis kernel: [<c0115ed8>] __put_task_struct+0x38/0x100 Sep 6 19:38:27 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150 Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:27 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:27 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 Sep 6 19:38:27 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb Sep 6 19:38:27 arrakis kernel: Code: 01 c1 e2 0a 89 06 8b 5c 24 0c 89 d0 89 ca 8b 74 24 10 
83 c4 14 c3 90 90 90 90 90 90 53 b8 01 00 00 00 8b 5c 24 08 e8 21 52 f4 ff <ff> 0b 0f 94 c0 84 c0 ba 01 00 00 00 74 04 5b 89 d0 c3 b8 01 00
Sep  6 19:38:27 arrakis kernel: EIP: [<c01cf5ff>] _atomic_dec_and_lock+0xf/0x50 SS:ESP 0068:cd772ef8
Sep  6 19:38:27 arrakis kernel: <6>note: bug[10459] exited with preempt_count 1

Sometimes the machine just hung, with nothing in the logs. The machine is a Sony laptop (i686).

I have been testing the patch on another machine (x86_64) and had no problem at all, so the reproducibility of the bug might depend on the arch or some config option. I'll help nail down this issue if I can, just tell me what to do.

Thanks,
-- 
Jean Delvare

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] proc: readdir race fix (take 3) 2006-09-06 21:12 ` Jean Delvare @ 2006-09-06 22:25 ` Oleg Nesterov 2006-09-06 22:38 ` [PATCH] proc-readdir-race-fix-take-3-fix-3 Oleg Nesterov 2006-09-06 22:43 ` [PATCH] proc: readdir race fix (take 3) Eric W. Biederman 1 sibling, 1 reply; 25+ messages in thread From: Oleg Nesterov @ 2006-09-06 22:25 UTC (permalink / raw) To: Jean Delvare Cc: Eric W. Biederman, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, ak On 09/06, Jean Delvare wrote: > > On Wednesday 6 September 2006 11:01, Jean Delvare wrote: > > Eric, Kame, thanks a lot for working on this. I'll be giving some good > > testing to this patch today, and will return back to you when I'm done. > > The original issue is indeed fixed, but there's a problem with the patch. > When stressing /proc (to verify the bug was fixed), my test machine ended > up crashing. Here are the 2 traces I found in the logs: > > Sep 6 12:06:00 arrakis kernel: BUG: warning at > kernel/fork.c:113/__put_task_struct() > Sep 6 12:06:00 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100 > Sep 6 12:06:00 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150 > Sep 6 12:06:00 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 > Sep 6 12:06:00 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 > Sep 6 12:06:00 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb I think there is a bug in next_tgid(), > -static struct task_struct *next_tgid(struct task_struct *start) > -{ > - struct task_struct *pos; > + task = NULL; > rcu_read_lock(); > - pos = start; > - if (pid_alive(start)) > - pos = next_task(start); > - if (pid_alive(pos) && (pos != &init_task)) { > - get_task_struct(pos); > - goto done; > +retry: > + pid = find_ge_pid(tgid); > + if (pid) { > + tgid = pid->nr + 1; > + task = pid_task(pid, PIDTYPE_PID); > + /* What we to know is if the pid we have find is the > 
+ * pid of a thread_group_leader. Testing for task > + * being a thread_group_leader is the obvious thing > + * todo but there is a window when it fails, due to > + * the pid transfer logic in de_thread. > + * > + * So we perform the straight forward test of seeing > + * if the pid we have found is the pid of a thread > + * group leader, and don't worry if the task we have > + * found doesn't happen to be a thread group leader. > + * As we don't care in the case of readdir. > + */ > + if (!task || !has_group_leader_pid(task)) > + goto retry; > + get_task_struct(task); > } > - pos = NULL; > -done: > rcu_read_unlock(); > - put_task_struct(start); > - return pos; > + return task; > } If the task found is not a group leader, we go to retry, but the task != NULL. Now, if find_ge_pid(tgid) returns NULL, we return that wrong task, and it was not get_task_struct()'ed. Oleg. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH] proc-readdir-race-fix-take-3-fix-3 2006-09-06 22:25 ` Oleg Nesterov @ 2006-09-06 22:38 ` Oleg Nesterov 2006-09-06 22:59 ` Eric W. Biederman 2006-09-08 6:38 ` Eric W. Biederman 0 siblings, 2 replies; 25+ messages in thread From: Oleg Nesterov @ 2006-09-06 22:38 UTC (permalink / raw) To: Jean Delvare Cc: Eric W. Biederman, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, ak On 09/07, Oleg Nesterov wrote: > > On 09/06, Jean Delvare wrote: > > > > On Wednesday 6 September 2006 11:01, Jean Delvare wrote: > > > Eric, Kame, thanks a lot for working on this. I'll be giving some good > > > testing to this patch today, and will return back to you when I'm done. > > > > The original issue is indeed fixed, but there's a problem with the patch. > > When stressing /proc (to verify the bug was fixed), my test machine ended > > up crashing. Here are the 2 traces I found in the logs: > > > > Sep 6 12:06:00 arrakis kernel: BUG: warning at > > kernel/fork.c:113/__put_task_struct() > > Sep 6 12:06:00 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100 > > Sep 6 12:06:00 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150 > > Sep 6 12:06:00 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 > > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 > > Sep 6 12:06:00 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 > > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 > > Sep 6 12:06:00 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb > > If the task found is not a group leader, we go to retry, but > the task != NULL. > > Now, if find_ge_pid(tgid) returns NULL, we return that wrong > task, and it was not get_task_struct()'ed. 
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- t/fs/proc/base.c~	2006-09-07 02:33:26.000000000 +0400
+++ t/fs/proc/base.c	2006-09-07 02:34:19.000000000 +0400
@@ -2149,9 +2149,9 @@ static struct task_struct *next_tgid(uns
 	struct task_struct *task;
 	struct pid *pid;
 
-	task = NULL;
 	rcu_read_lock();
 retry:
+	task = NULL;
 	pid = find_ge_pid(tgid);
 	if (pid) {
 		tgid = pid->nr + 1;

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] proc-readdir-race-fix-take-3-fix-3 2006-09-06 22:38 ` [PATCH] proc-readdir-race-fix-take-3-fix-3 Oleg Nesterov @ 2006-09-06 22:59 ` Eric W. Biederman 2006-09-08 6:38 ` Eric W. Biederman 1 sibling, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2006-09-06 22:59 UTC (permalink / raw) To: Oleg Nesterov Cc: Jean Delvare, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, ak Oleg Nesterov <oleg@tv-sign.ru> writes: > On 09/07, Oleg Nesterov wrote: >> >> On 09/06, Jean Delvare wrote: >> > >> > On Wednesday 6 September 2006 11:01, Jean Delvare wrote: >> > > Eric, Kame, thanks a lot for working on this. I'll be giving some good >> > > testing to this patch today, and will return back to you when I'm done. >> > >> > The original issue is indeed fixed, but there's a problem with the patch. >> > When stressing /proc (to verify the bug was fixed), my test machine ended >> > up crashing. Here are the 2 traces I found in the logs: >> > >> > Sep 6 12:06:00 arrakis kernel: BUG: warning at >> > kernel/fork.c:113/__put_task_struct() >> > Sep 6 12:06:00 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100 >> > Sep 6 12:06:00 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150 >> > Sep 6 12:06:00 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0 >> > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 >> > Sep 6 12:06:00 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0 >> > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0 >> > Sep 6 12:06:00 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb >> >> If the task found is not a group leader, we go to retry, but >> the task != NULL. >> >> Now, if find_ge_pid(tgid) returns NULL, we return that wrong >> task, and it was not get_task_struct()'ed. Yep. That would do it. And of course having written the code it was very hard for me to step back far enough to see that. The other two failure modes still don't make sense to me but they may have been a side effect of this. 
And it would have taken a thread being the highest pid in the system
that exits while we are calling filldir to trigger this.

Wow.  That was a good stress test.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
>
> --- t/fs/proc/base.c~	2006-09-07 02:33:26.000000000 +0400
> +++ t/fs/proc/base.c	2006-09-07 02:34:19.000000000 +0400
> @@ -2149,9 +2149,9 @@ static struct task_struct *next_tgid(uns
>  	struct task_struct *task;
>  	struct pid *pid;
>
> -	task = NULL;
>  	rcu_read_lock();
>  retry:
> +	task = NULL;
>  	pid = find_ge_pid(tgid);
>  	if (pid) {
>  		tgid = pid->nr + 1;
* Re: [PATCH] proc-readdir-race-fix-take-3-fix-3
  2006-09-06 22:38 ` [PATCH] proc-readdir-race-fix-take-3-fix-3 Oleg Nesterov
  2006-09-06 22:59 ` Eric W. Biederman
@ 2006-09-08  6:38 ` Eric W. Biederman
  1 sibling, 0 replies; 25+ messages in thread

From: Eric W. Biederman @ 2006-09-08 6:38 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Jean Delvare, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, ak

Ok.  While we are on the topic of races in readdir in /proc, I just
wanted to note that I believe thread groups have the same potential
problem.

I don't think it is severe enough to cause any real world problems,
but it makes sense to note it.

Something to think about.

Eric
* Re: [PATCH] proc: readdir race fix (take 3)
  2006-09-06 21:12 ` Jean Delvare
  2006-09-06 22:25 ` Oleg Nesterov
@ 2006-09-06 22:43 ` Eric W. Biederman
  2006-09-07  8:31 ` Jean Delvare
  1 sibling, 1 reply; 25+ messages in thread

From: Eric W. Biederman @ 2006-09-06 22:43 UTC (permalink / raw)
To: Jean Delvare
Cc: Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki, linux-kernel, ak

Jean Delvare <jdelvare@suse.de> writes:

> On Wednesday 6 September 2006 11:01, Jean Delvare wrote:
>> Eric, Kame, thanks a lot for working on this. I'll be giving some good
>> testing to this patch today, and will return back to you when I'm done.
>
> The original issue is indeed fixed, but there's a problem with the patch.
> When stressing /proc (to verify the bug was fixed), my test machine ended
> up crashing. Here are the 2 traces I found in the logs:

Ugh.

So the death in __put_task_struct() is from:
	WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));

So it appears we have something that is decrementing but not
incrementing the count on the task struct.

Now what is interesting is that there are a couple of other failure
modes present here.  free_uid called from __put_task_struct is failing.

And you seem to have a recursive page fault going on somewhere.

I suspect the triggering of this bug is the result of an earlier
oops, that left some process half cleaned up.

Have you tested 2.6.18-rc6 without my patch?  If not can you please
test the same 2.6.18-rc6 configuration with my patch?

> Sometimes the machine just hung, with nothing in the logs. The machine is
> a Sony laptop (i686).
>
> I have been testing the patch on another machine (x86_64) and had no
> problem at all, so the reproduceability of the bug might depend on the
> arch or some config option. I'll help nailing down this issue if I can,
> just tell me what to do.

So I don't know what is going on with your laptop.  It feels nasty.
I think my patch is just tripping on the problem, rather than causing
it.
The previous version of fs/proc/base.c should have tripped over this
problem as well if it happened to have hit the same process.

I'm staring at the patch and I can not think of anything that would
explain your problem.  The reference counting is simple and the only
bug I had in a posted version was a failure to decrement the count on
the task_struct.

I guess the practical question is what was your test methodology to
reproduce this problem?  A couple of more people running the same
test on a few more machines might at least give us confidence in what
is going on.

Eric
* Re: [PATCH] proc: readdir race fix (take 3)
  2006-09-06 22:43 ` [PATCH] proc: readdir race fix (take 3) Eric W. Biederman
@ 2006-09-07  8:31 ` Jean Delvare
  2006-09-07 13:57 ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread

From: Jean Delvare @ 2006-09-07 8:31 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki, linux-kernel, ak

[-- Attachment #1: Type: text/plain, Size: 1041 bytes --]

On Thursday 7 September 2006 00:43, Eric W. Biederman wrote:
> Have you tested 2.6.18-rc6 without my patch?

Yes I did, it didn't crash after a couple hours. Of course it doesn't
prove anything as the crash appears to be the result of a race.

I'll now apply Oleg's fix and see if things get better.

> I guess the practical question is what was your test methodology to
> reproduce this problem?  A couple of more people running the same
> test on a few more machines might at least give us confidence in what
> is going on.

"My" test program forks 1000 children who sleep for 1 second then look
for themselves in /proc, warn if they can't find themselves, and exit.
So basically the idea is that the process list will shrink very rapidly
at the same moment every child does readdir(/proc).

I attached the test program, I take no credit (nor shame) for it, it
was provided to me by IBM (possibly on behalf of one of their own
customers) as a way to demonstrate and reproduce the original
readdir(/proc) race bug.
-- 
Jean Delvare

[-- Attachment #2: test.c --]
[-- Type: text/x-csrc, Size: 1151 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <dirent.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/param.h>
#include <utmp.h>
#include <pwd.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <syslog.h>
#include <errno.h>
#include <stdarg.h>
#include <ctype.h>

#define NUM_CHILDREN 1000

findme(i)
int i;
{
	DIR *dir = NULL;
	struct dirent *d;
	int pid;
	int mypid;

	mypid = getpid();
	if ((dir = opendir("/proc")) == (DIR *)0) {
		perror("failed to open /proc\n");
		exit(1);
	}
	while ((d = readdir(dir)) != (struct dirent *)0) {
		if ((pid = (pid_t)atoi(d->d_name)) == 0)
			continue;
		if (pid == mypid)
			return (1);
	}
	printf("\nfailed to find myself: pid %d, iteration %d\n", mypid, i);
	return (0);
}

fork_child(i)
int i;
{
	int pid;

	switch ((pid = fork())) {
	case 0:		/* child */
		sleep(1);
		findme(i);
		exit(0);
		;;
	case -1:	/* error */
		perror("failed to fork\n");
		exit(1);
		;;
	default:	/* parent */
		;;
	}
}

main()
{
	int i;

	(void)signal(SIGCHLD, SIG_IGN);
	for (i = 0; i < NUM_CHILDREN; i++) {
		fork_child(i);
	}
}
* Re: [PATCH] proc: readdir race fix (take 3)
  2006-09-07  8:31 ` Jean Delvare
@ 2006-09-07 13:57 ` Eric W. Biederman
  2006-09-07 18:07 ` Jean Delvare
  0 siblings, 1 reply; 25+ messages in thread

From: Eric W. Biederman @ 2006-09-07 13:57 UTC (permalink / raw)
To: Jean Delvare
Cc: Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki, linux-kernel, ak

Jean Delvare <jdelvare@suse.de> writes:

> On Thursday 7 September 2006 00:43, Eric W. Biederman wrote:
>> Have you tested 2.6.18-rc6 without my patch?
>
> Yes I did, it didn't crash after a couple hours. Of course it doesn't
> prove anything as the crash appears to be the result of a race.
>
> I'll now apply Oleg's fix and see if things get better.
>
>> I guess the practical question is what was your test methodology to
>> reproduce this problem?  A couple of more people running the same
>> test on a few more machines might at least give us confidence in what
>> is going on.
>
> "My" test program forks 1000 children who sleep for 1 second then look for
> themselves in /proc, warn if they can't find themselves, and exit. So
> basically the idea is that the process list will shrink very rapidly at
> the same moment every child does readdir(/proc).
>
> I attached the test program, I take no credit (nor shame) for it, it was
> provided to me by IBM (possibly on behalf of one of their own customers)
> as a way to demonstrate and reproduce the original readdir(/proc) race
> bug.

Ok.  So whatever is creating lots of child threads that tripped you up
is probably peculiar to the environment on your laptop.

Eric
* Re: [PATCH] proc: readdir race fix (take 3)
  2006-09-07 13:57 ` Eric W. Biederman
@ 2006-09-07 18:07 ` Jean Delvare
  2006-09-07 18:40 ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread

From: Jean Delvare @ 2006-09-07 18:07 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki, linux-kernel, ak

On Thursday 7 September 2006 15:57, Eric W. Biederman wrote:
> Jean Delvare <jdelvare@suse.de> writes:
> > I'll now apply Oleg's fix and see if things get better.

After 8 hours of stress testing on two machines, no crash and no freeze.
So Oleg's fix seems to do the trick. Thanks Oleg :)

I'll keep the patches applied on both machines, even without stress
tests it is still better to make sure nothing bad happens in the long
run.

> > "My" test program forks 1000 children who sleep for 1 second then
> > look for themselves in /proc, warn if they can't find themselves, and
> > exit. So basically the idea is that the process list will shrink very
> > rapidly at the same moment every child does readdir(/proc).
> >
> > I attached the test program, I take no credit (nor shame) for it, it
> > was provided to me by IBM (possibly on behalf of one of their own
> > customers) as a way to demonstrate and reproduce the original
> > readdir(/proc) race bug.
>
> Ok.  So whatever is creating lots of child threads that tripped you
> up is probably peculiar to the environment on your laptop.

There's nothing really special running on this laptop. Slackware 10.2
with xterm, firefox, sylpheed, xchat, and that's about it. At least one
of the crashes I had yesterday happened when I was actively using
firefox, I can't tell for the other one.

The difference with the system where no problem was observed may be
that the laptop has a preemptive kernel, and the desktop hasn't.

-- 
Jean Delvare
* Re: [PATCH] proc: readdir race fix (take 3)
  2006-09-07 18:07 ` Jean Delvare
@ 2006-09-07 18:40 ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread

From: Eric W. Biederman @ 2006-09-07 18:40 UTC (permalink / raw)
To: Jean Delvare
Cc: Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki, linux-kernel, ak

Jean Delvare <jdelvare@suse.de> writes:

> On Thursday 7 September 2006 15:57, Eric W. Biederman wrote:
>> Jean Delvare <jdelvare@suse.de> writes:
>> > I'll now apply Oleg's fix and see if things get better.
>
> After 8 hours of stress testing on two machines, no crash and no freeze.
> So Oleg's fix seems to do the trick. Thanks Oleg :)
>
> I'll keep the patches applied on both machines, even without stress tests
> it is still better to make sure nothing bad happens in the long run.
>
>> > "My" test program forks 1000 children who sleep for 1 second then
>> > look for themselves in /proc, warn if they can't find themselves, and
>> > exit. So basically the idea is that the process list will shrink very
>> > rapidly at the same moment every child does readdir(/proc).
>> >
>> > I attached the test program, I take no credit (nor shame) for it, it
>> > was provided to me by IBM (possibly on behalf of one of their own
>> > customers) as a way to demonstrate and reproduce the original
>> > readdir(/proc) race bug.
>>
>> Ok.  So whatever is creating lots of child threads that tripped you
>> up is probably peculiar to the environment on your laptop.
>
> There's nothing really special running on this laptop. Slackware 10.2 with
> xterm, firefox, sylpheed, xchat, and that's about it. At least one of the
> crashes I had yesterday happened when I was actively using firefox, I
> can't tell for the other one.

Well firefox is threaded so that may be enough.  It takes a threaded
program to be able to trigger it.

> The difference with the system where no problem was observed may be that
> the laptop has a preemptive kernel, and the desktop hasn't.
I suspect that just has the potential to make the window bigger.

Eric
* Re: [PATCH] proc: readdir race fix.
  2006-09-04 23:13 ` [PATCH] proc: readdir race fix Eric W. Biederman
  2006-09-05  1:30 ` KAMEZAWA Hiroyuki
  2006-09-05  2:26 ` KAMEZAWA Hiroyuki
@ 2006-09-05  5:26 ` KAMEZAWA Hiroyuki
  2006-09-05  5:41 ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 25+ messages in thread

From: KAMEZAWA Hiroyuki @ 2006-09-05 5:26 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: linux-kernel, akpm, ak, oleg, jdelvare

On Mon, 04 Sep 2006 17:13:10 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

worked well with my (short) test set. and looks good.
Thank you.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH] proc: readdir race fix.
  2006-09-05  5:26 ` [PATCH] proc: readdir race fix KAMEZAWA Hiroyuki
@ 2006-09-05  5:41 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 25+ messages in thread

From: KAMEZAWA Hiroyuki @ 2006-09-05 5:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: ebiederm, linux-kernel, akpm, ak, oleg, jdelvare

On Tue, 5 Sep 2006 14:26:38 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 04 Sep 2006 17:13:10 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>
> worked well with my (short) test set. and looks good.
> Thank you.
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Sorry (>_<.  I should have acked another e-mail; please ignore this one.
Hmm, please change the subject of the e-mail if the patch is changed.

-Kame
end of thread, other threads: [~2006-09-08 6:39 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2006-08-25  9:29 [RFC][PATCH] ps command race fix take 4 [4/4] proc root open/release/llseek KAMEZAWA Hiroyuki
2006-09-04 23:13 ` [PATCH] proc: readdir race fix Eric W. Biederman
2006-09-05  1:30 ` KAMEZAWA Hiroyuki
2006-09-05  2:30 ` Eric W. Biederman
2006-09-05  2:41 ` KAMEZAWA Hiroyuki
2006-09-05  2:26 ` KAMEZAWA Hiroyuki
2006-09-05  2:54 ` Eric W. Biederman
2006-09-05  3:07 ` Eric W. Biederman
2006-09-05  5:39 ` KAMEZAWA Hiroyuki
2006-09-05 10:10 ` Oleg Nesterov
2006-09-05 11:36 ` Eric W. Biederman
2006-09-05 14:52 ` [PATCH] proc: readdir race fix (take 3) Eric W. Biederman
2006-09-06  9:01 ` Jean Delvare
2006-09-06 21:12 ` Jean Delvare
2006-09-06 22:25 ` Oleg Nesterov
2006-09-06 22:38 ` [PATCH] proc-readdir-race-fix-take-3-fix-3 Oleg Nesterov
2006-09-06 22:59 ` Eric W. Biederman
2006-09-08  6:38 ` Eric W. Biederman
2006-09-06 22:43 ` [PATCH] proc: readdir race fix (take 3) Eric W. Biederman
2006-09-07  8:31 ` Jean Delvare
2006-09-07 13:57 ` Eric W. Biederman
2006-09-07 18:07 ` Jean Delvare
2006-09-07 18:40 ` Eric W. Biederman
2006-09-05  5:26 ` [PATCH] proc: readdir race fix KAMEZAWA Hiroyuki
2006-09-05  5:41 ` KAMEZAWA Hiroyuki