From: ebiederm@xmission.com (Eric W. Biederman)
To: Guenter Roeck
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Vovo Yang
Subject: Re: Threads stuck in zap_pid_ns_processes()
Date: Thu, 11 May 2017 12:31:21 -0500
Message-ID: <87shkbfggm.fsf@xmission.com>
In-Reply-To: <20170511171108.GB15063@roeck-us.net> (Guenter Roeck's message of "Thu, 11 May 2017 10:11:08 -0700")
References: <20170511171108.GB15063@roeck-us.net>
X-Mailing-List: linux-kernel@vger.kernel.org

Guenter Roeck writes:

> Hi all,
>
> the test program attached below almost always results in one of the child
> processes being stuck in zap_pid_ns_processes(). When this happens, I can
> see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> a single thread left in the pid namespace (the one that is stuck).
> Traceback from /proc//stack is
>
> [] zap_pid_ns_processes+0x1ee/0x2a0
> [] do_exit+0x10d4/0x1330
> [] do_group_exit+0x86/0x130
> [] get_signal+0x367/0x8a0
> [] do_signal+0x83/0xb90
> [] exit_to_usermode_loop+0x75/0xc0
> [] syscall_return_slowpath+0xc6/0xd0
> [] entry_SYSCALL_64_fastpath+0xab/0xad
> [] 0xffffffffffffffff
>
> After 120 seconds, I get the "hung task" message.
>
> Example from v4.11:
>
> ...
> [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> [ 3263.379561] Not tainted 4.11.0+ #1
> [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3263.379577] clone D 0 27910 27909 0x00000000
> [ 3263.379587] Call Trace:
> [ 3263.379608] __schedule+0x677/0xda0
> [ 3263.379621] ? pci_mmcfg_check_reserved+0xc0/0xc0
> [ 3263.379634] ? task_stopped_code+0x70/0x70
> [ 3263.379643] schedule+0x4d/0xd0
> [ 3263.379653] zap_pid_ns_processes+0x1ee/0x2a0
> [ 3263.379659] ? copy_pid_ns+0x4d0/0x4d0
> [ 3263.379670] do_exit+0x10d4/0x1330
> ...
>
> The problem is seen in all kernels up to v4.11.
>
> Any idea what might be going on and how to fix the problem ?

Let me see. Reading the code, it looks like we have three tasks; let's
call them main, child1, and child2.

child1 is started by main in a new pid namespace, and child2 is started
by child1 using CLONE_THREAD, so the two are threads of the same thread
group.

child2 exits first but is ptraced by main, so it is not reaped. Further,
child2 calls do_group_exit(), forcing child1 to exit and making for fun
races.

A pthread_exit() or syscall(SYS_exit, 0); would skip the group exit
and make the window larger.

child1 exits next, calls zap_pid_ns_processes(), and waits there for
child2 to be reaped by main.

main is just sitting around doing nothing for 3600 seconds, not reaping
anyone.

I would expect that when main exits everything would be cleaned up, and
that the only real issue is the hung task warning.

Does everything clean up when main exits?

Eric

>
> Thanks,
> Guenter
>
> ---
> This test program was kindly provided by Vovo Yang .
>
> Note that the ptrace() call in child1() is not necessary for the problem
> to be seen, though it seems to make it a bit more likely.

That would appear to just slow things down a smidge, as nothing
substantial happens ptrace-wise until after zap_pid_ns_processes().

> ---
>
> #define _GNU_SOURCE
> #include <errno.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/ptrace.h>
> #include <unistd.h>
>
> #define STACK_SIZE 65536
>
> int child1(void* arg);
> int child2(void* arg);
>
> int main(int argc, char **argv)
> {
>     int child_pid;
>     char* child_stack = malloc(STACK_SIZE);
>     char* stack_top = child_stack + STACK_SIZE;
>     char command[256];
>
>     child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
>     if (child_pid == -1) {
>         printf("parent: clone failed: %s\n", strerror(errno));
>         return EXIT_FAILURE;
>     }
>     printf("parent: child1_pid: %d\n", child_pid);
>
>     sleep(2);
>     printf("child state, if it's D (disk sleep), the child process is hung\n");
>     sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
>     system(command);
>     sleep(3600);
>     return EXIT_SUCCESS;
> }
>
> int child1(void* arg)
> {
>     int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
>     char* child_stack = malloc(STACK_SIZE);
>     char* stack_top = child_stack + STACK_SIZE;
>     long ret;
>
>     ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>     if (ret == -1) {
>         printf("child1: ptrace failed: %s\n", strerror(errno));
>         return EXIT_FAILURE;
>     }
>
>     ret = clone(&child2, stack_top, flags, NULL);
>     if (ret == -1) {
>         printf("child1: clone failed: %s\n", strerror(errno));
>         return EXIT_FAILURE;
>     }
>     printf("child1: child2 pid: %ld\n", ret);
>
>     sleep(1);
>     printf("child1: end\n");
>     return EXIT_SUCCESS;
> }
>
> int child2(void* arg)
> {
>     long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>     if (ret == -1) {
>         printf("child2: ptrace failed: %s\n", strerror(errno));
>         return EXIT_FAILURE;
>     }
>
>     printf("child2: end\n");
>     return EXIT_SUCCESS;
> }
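
If the reading above is right and child1 is only waiting for main to
reap child2, then having main wait for its children instead of sleeping
should let the pid namespace finish shutting down. Below is a rough
sketch of such a loop. reap_all_children() is a made-up helper, not
something from the test program, and it has not been tried against the
hang itself.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: wait for every child and tracee of the caller
 * until none are left. Returns the number of tasks reaped, or -1 on
 * an unexpected error.
 */
int reap_all_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        /* __WALL: wait for all children, "clone" or "non-clone". */
        pid_t pid = waitpid(-1, &status, __WALL);

        if (pid == -1) {
            if (errno == EINTR)
                continue;
            /* ECHILD means there is nothing left to wait for. */
            return errno == ECHILD ? reaped : -1;
        }
        if (WIFSTOPPED(status)) {
            /* A tracee stopped on a signal: pass it on and resume. */
            ptrace(PTRACE_CONT, pid, NULL, (void *)(long)WSTOPSIG(status));
            continue;
        }
        reaped++; /* exited or killed by a signal */
    }
}
```

In the test program this call would stand in for sleep(3600) at the end
of main().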