From: ebiederm@xmission.com (Eric W. Biederman)
To: Guenter Roeck
Cc: Vovo Yang, Ingo Molnar, linux-kernel@vger.kernel.org
Subject: Re: Threads stuck in zap_pid_ns_processes()
Date: Thu, 01 Jun 2017 14:36:38 -0500
Message-ID: <87tw3z8pq1.fsf@xmission.com>
In-Reply-To: <20170601184549.GA28522@roeck-us.net> (Guenter Roeck's message of "Thu, 1 Jun 2017 11:45:49 -0700")
References: <20170511224724.GB15676@roeck-us.net> <8760h79e22.fsf@xmission.com>
 <8760h66wak.fsf@xmission.com> <20170512165214.GA12960@roeck-us.net>
 <874lwqyo8i.fsf@xmission.com> <20170512194304.GE12960@roeck-us.net>
 <87wp9lvo4u.fsf@xmission.com> <87inkfab4l.fsf@xmission.com>
 <20170601184549.GA28522@roeck-us.net>

Guenter Roeck writes:

> On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck writes:
>> >
>> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
>> > a zombie process.
>> >
>> > I guess the only question left is if zap_pid_ns_processes() should (or could)
>> > somehow detect that situation and return instead of waiting forever.
>> > What do you think ?
>>
>> Any chance you can point me at the chromium code that is performing the
>> ptrace?
>>
>> I want to conduct a review of the kernel semantics to see if the current
>> semantics make it unnecessarily easy to get into hang situations.
>> If the semantics make it really easy to get into a hang situation I want
>> to see if there is anything we can do to delicately change the semantics
>> to avoid the hangs without breaking existing userspace.
>>
> The internal bug should be accessible to you.
>
> https://bugs.chromium.org/p/chromium/issues/detail?id=721298&desc=2
>
> It has some additional information, and points to the following code in Chrome.
>
> https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989&l=85
>
> With the information we have, I don't really have a good idea what we could or
> should change in Chrome to make the problem disappear, so I just concluded that
> we'll have to live with the forever-sleeping task.

I believe I see what is happening.  The code assumes that a thread will
stay stopped and will not go away once ptrace attach completes.

Unfortunately, if someone sends SIGKILL to the process, or exec sends
SIGKILL to the individual thread, then PTRACE_DETACH will fail.

At that point you can use waitpid to reap the zombie and detach from the
thread.

So I think the forever-sleeping can be fixed with something as simple
as changing ResumeThread to say:

// Resumes a thread by detaching from it.
static bool ResumeThread(pid_t pid) {
	if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
		return true;
	/* Someone killed the thread? */
	return waitpid(pid, NULL, 0) == pid;
}

It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
allow this case to work.  And odds are good that we could make that
change without breaking anyone.  So it is worth a try.

Eric
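
For anyone who wants to try the detach-then-reap fallback outside of
Breakpad, here is a minimal standalone sketch.  It is not the Breakpad
code: it calls glibc's ptrace() and waitpid() instead of sys_ptrace(),
and the names WaitForTask, ResumeThreadSketch and DemoResumeAfterSigkill
are hypothetical.  Passing __WALL is an addition that is not in the
snippet above; it is there on the assumption that the tracee may be a
clone()-created thread, which a plain waitpid() may not report on older
kernels.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: waitpid() retried on EINTR.  __WALL (an assumption,
// not in the original snippet) asks for clone()-created threads as well as
// ordinary children.
static pid_t WaitForTask(pid_t pid) {
  pid_t r;
  do {
    r = waitpid(pid, nullptr, __WALL);
  } while (r < 0 && errno == EINTR);
  return r;
}

// Sketch of the proposed ResumeThread: detach, and if the detach fails
// (for example because the tracee was killed), reap the task instead.
static bool ResumeThreadSketch(pid_t pid) {
  if (ptrace(PTRACE_DETACH, pid, nullptr, nullptr) >= 0)
    return true;
  // Someone killed the thread?
  return WaitForTask(pid) == pid;
}

// Hypothetical demo: attach to a child, SIGKILL it while it is stopped,
// then resume it through the fallback path.
static bool DemoResumeAfterSigkill() {
  pid_t child = fork();
  if (child < 0)
    return false;
  if (child == 0) {
    for (;;)
      pause();  // child idles until it is killed
  }
  if (ptrace(PTRACE_ATTACH, child, nullptr, nullptr) == 0)
    WaitForTask(child);  // wait for the stop that follows the attach
  kill(child, SIGKILL);
  return ResumeThreadSketch(child);
}
```

The demo still goes through the waitpid() branch if PTRACE_ATTACH is
refused (for instance in a restricted sandbox), since the killed child
has to be reaped either way.  Calling ResumeThreadSketch() on a pid that
is neither traced nor a child returns false, because both calls fail.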