From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S265212AbUBAEnf (ORCPT ); Sat, 31 Jan 2004 23:43:35 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S265215AbUBAEnf (ORCPT ); Sat, 31 Jan 2004 23:43:35 -0500 Received: from nevyn.them.org ([66.93.172.17]:21122 "EHLO nevyn.them.org") by vger.kernel.org with ESMTP id S265212AbUBAEnc (ORCPT ); Sat, 31 Jan 2004 23:43:32 -0500 Date: Sat, 31 Jan 2004 23:43:31 -0500 From: Daniel Jacobowitz To: Linus Torvalds Cc: Kernel Mailing List , Roland McGrath , Ingo Molnar Subject: Re: More waitpid issues with CLONE_DETACHED/CLONE_THREAD Message-ID: <20040201044331.GA27271@nevyn.them.org> Mail-Followup-To: Linus Torvalds , Kernel Mailing List , Roland McGrath , Ingo Molnar References: <20040201032525.GA10254@nevyn.them.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.1i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Jan 31, 2004 at 08:38:43PM -0800, Linus Torvalds wrote: > > > On Sat, 31 Jan 2004, Daniel Jacobowitz wrote: > > > > This may be related to the python bug reported today... > > Indeed. > > Having a "waitpid(x, .., WNOHANG)" return 0 is a very interesting > condition. That condition basically guarantees that: > > - the kernel did find the child > - but the kernel decided that the child cannot be reaped right then. > > If you see the process as a Zombie in a "ps" listing, then we know that > that isn't the reason why it couldn't be reaped. Can you verify that > /proc//status shows it as "Z (zombie)"? Yes, it is a zombie when the waitpid is executed. > In fact, if we see it as "Z (zombie)", we know even more: it means that > wait_task_zombie() was never called, because that would have started out > with changing the process state to "X (dead)". > > And that in turn implies that "eligible_child()" would have returned 2. > > Which is a normal occurrence: it happens when a process group leader still > has threads attached to it. At that point it may be a Zombie, but we can't > reap it yet. The threads have to go away before the thing can be reaped. > > Can you verify that that process doesn't have any sub-threads? (Again, > that should be easily visible in /proc//task/). It is quite easily visible - in fact, it's hilarious. 8454 pts/8 Z 0:00 [linux-dp] drow@nevyn:~% ls /proc/8454 auxv cmdline cwd@ environ exe@ fd/ maps mem mounts root@ stat drow@nevyn:~% ls /proc/8454/task ls: /proc/8454/task: No such file or directory What that means I'm not entirely sure. > Another alternative is that the process is a zombie, but it is being > traced. When that happens, it shows up on the "ptrace_children" list, and > we'll see in in wait4(), but we won't be able to reap it. At this point it is being traced - this is gdbserver. But the same process that is tracing it is calling waitpid. And the problem persists after the tracer dies; the process above is actually from a couple of hours ago, and its tracer was killed. > Roland, Ingo - have you followed the discussion on linux-kernel? Something > strange does seem to be going on.. The only two kernels I've tried were 2.6.0-test7 and 2.6.2-rc3, by the way - same behavior in both. I'll try to write a single program testcase for this. -- Daniel Jacobowitz MontaVista Software Debian GNU/Linux Developer