From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S265203AbUBADZb (ORCPT ); Sat, 31 Jan 2004 22:25:31 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S265205AbUBADZb (ORCPT ); Sat, 31 Jan 2004 22:25:31 -0500 Received: from nevyn.them.org ([66.93.172.17]:30336 "EHLO nevyn.them.org") by vger.kernel.org with ESMTP id S265203AbUBADZ2 (ORCPT ); Sat, 31 Jan 2004 22:25:28 -0500 Date: Sat, 31 Jan 2004 22:25:25 -0500 From: Daniel Jacobowitz To: linux-kernel@vger.kernel.org Subject: More waitpid issues with CLONE_DETACHED/CLONE_THREAD Message-ID: <20040201032525.GA10254@nevyn.them.org> Mail-Followup-To: linux-kernel@vger.kernel.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.1i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org This may be related to the python bug reported today... I've been playing around with gdbserver support for NPTL threading all day today. Right now it works, except that when I say "kill" in the GDB client, gdbserver hangs. The problem is that we kill the child, and wait for it, but wait never returns it. write(2, "Killing inferior\n", 17) = 17 ptrace(PTRACE_CONT, 8454, 0, SIG_0) = 0 tkill(8454, SIGKILL) = 0 --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(8454, 0xbfffec04, WNOHANG) = 0 waitpid(8454, 0xbfffec04, WNOHANG|__WCLONE) = -1 ECHILD (No child processes) nanosleep({0, 1000000}, 0) = ? ERESTART_RESTARTBLOCK (To be restarted) --- SIGCHLD (Child exited) @ 0 (0) --- setup() = 0 waitpid(8454, 0xbfffec04, WNOHANG) = 0 waitpid(8454, 0xbfffec04, WNOHANG|__WCLONE) = -1 ECHILD (No child processes) and so on (looping on waitpid). At this point, process 8454 is marked as a zombie, and nothing can reap it. After gdbserver is killed, it reparents to init: Name: linux-dp State: Z (zombie) SleepAVG: 91% Tgid: 8454 Pid: 8454 PPid: 1 TracerPid: 0 but init can't reap it either. 8454 was the original (i.e. non-CLONE_DETACHED) thread. Same behavior if I use ptrace_kill. GDB doesn't suffer from the same problem. A little time with strace and I found out why: GDB PTRACE_KILL's the detached threads, PTRACE_KILL's the parent thread, waitpid's the detached threads, and then waitpid's the parent thread. No design, just different order of items on the linked list. If I change gdbserver to do "kill thread; wait for thread; kill next thread; wait for next thread; kill parent last; wait for parent last" then it terminates and I don't get an unkillable zombie. ptrace(PTRACE_KILL, 18348, 0, 0) = 0 waitpid(18348, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes) --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(18348, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18348 ptrace(PTRACE_KILL, 18349, 0, 0) = 0 waitpid(18349, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes) waitpid(18349, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18349 --- SIGCHLD (Child exited) @ 0 (0) --- ptrace(PTRACE_KILL, 18350, 0, 0) = 0 waitpid(18350, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes) waitpid(18350, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18350 ptrace(PTRACE_KILL, 18351, 0, 0) = 0 waitpid(18351, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes) waitpid(18351, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18351 ptrace(PTRACE_KILL, 18352, 0, 0) = 0 waitpid(18352, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes) waitpid(18352, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18352 --- SIGCHLD (Child exited) @ 0 (0) --- ptrace(PTRACE_KILL, 18329, 0, 0) = 0 waitpid(18329, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG) = 18329 exit_group(0) = ? So it looks like something gets very confused if the parent is SIGKILLed before the children. What should happen? -- Daniel Jacobowitz MontaVista Software Debian GNU/Linux Developer