From mboxrd@z Thu Jan 1 00:00:00 1970
From: ebiederm@xmission.com (Eric W. Biederman)
To: Vovo Yang
Cc: Guenter Roeck, Ingo Molnar, linux-kernel@vger.kernel.org
Subject: Re: Threads stuck in zap_pid_ns_processes()
References: <20170511171108.GB15063@roeck-us.net> <87shkbfggm.fsf@xmission.com> <20170511202104.GA14720@roeck-us.net> <87y3u3axx8.fsf@xmission.com> <20170511224724.GB15676@roeck-us.net> <8760h79e22.fsf@xmission.com>
Date: Fri, 12 May 2017 08:26:27 -0500
In-Reply-To: (Vovo Yang's message of "Fri, 12 May 2017 17:30:15 +0800")
Message-ID: <8760h66wak.fsf@xmission.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Vovo Yang writes:

> On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
> wrote:
>> Guenter Roeck writes:
>>
>>> What I know so far is
>>> - We see this condition on a regular basis in the field. Regular is
>>>   relative, of course - let's say maybe 1 in a Million Chromebooks
>>>   per day reports a crash because of it. That is not that many,
>>>   but it adds up.
>>> - We are able to reproduce the problem with a performance benchmark
>>>   which opens 100 chrome tabs. While that is a lot, it should not
>>>   result in a kernel hang/crash.
>>> - Vovo provided the test code last night. I don't know if this is
>>>   exactly what is observed in the benchmark, or how it relates to the
>>>   benchmark in the first place, but it is the first time we are actually
>>>   able to reliably create a condition where the problem is seen.
>>
>> Thank you. It will be interesting to hear what is happening in the
>> chrome performance benchmark that triggers this.
>>
> What's happening in the benchmark:
> 1. A chrome renderer process was created with CLONE_NEWPID
> 2. The process crashed
> 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>    thread of the crashed process to dump info
> 4. When breakpad detaches from the crashed process, the crashed process gets
>    stuck in zap_pid_ns_processes()

Very interesting, thank you.

So the question is specifically which interaction is causing this.

In the test case provided it was a sibling task in the pid namespace
dying and not being reaped, which may be what is happening with
breakpad. So far I have yet to see a kernel bug, but I won't rule one out.

Eric
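
For readers unfamiliar with the phrase "dying and not being reaped", the following is a minimal userspace sketch of what an unreaped task looks like. It is not the test case from this thread and does not reproduce the hang: it uses no PID namespace and no ptrace. It only shows, assuming a Linux system with /proc mounted, that an exited child stays around as a zombie (state 'Z') until its parent waits for it. The helper names task_state and zombie_demo are made up for this illustration.

```python
import os


def task_state(pid):
    """Return the state letter from /proc/<pid>/stat ('Z' means zombie).

    Returns None if the task no longer exists.
    """
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The comm field is parenthesised and may contain spaces, so the
    # state letter is the first token after the last ')'.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Fork a child that exits at once; report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before = task_state(pid)  # exited but not yet reaped

    os.waitpid(pid, 0)  # reap the child
    after = task_state(pid)  # the /proc entry should now be gone
    return before, after


if __name__ == "__main__":
    print(zombie_demo())
```

On a typical Linux system the first value should be 'Z' and the second None. The relevance to the discussion above is only by analogy: a task that has exited but has not been waited for still exists as far as the kernel is concerned.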