From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751972AbaKYQ57 (ORCPT ); Tue, 25 Nov 2014 11:57:59 -0500 Received: from mx1.redhat.com ([209.132.183.28]:45229 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751663AbaKYQ55 (ORCPT ); Tue, 25 Nov 2014 11:57:57 -0500 Date: Tue, 25 Nov 2014 17:57:45 +0100 From: Oleg Nesterov To: "Eric W. Biederman" Cc: Andrew Morton , Aaron Tomlin , Pavel Emelyanov , Serge Hallyn , Sterling Alexander , linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes() Message-ID: <20141125165745.GB28913@redhat.com> References: <20141124200626.GA21006@redhat.com> <874mtodzhd.fsf@x220.int.ebiederm.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <874mtodzhd.fsf@x220.int.ebiederm.org> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/24, Eric W. Biederman wrote: > > Oleg Nesterov writes: > > > The comments in zap_pid_ns_processes() are simply wrong, we need > > to explain how this code actually works. > > > > 1. "Ignore SIGCHLD" looks like optimization but it is not, we also > > need this for correctness. > > > > 2. The comment above sys_wait4() could be more clear. > > > > 3. The comment about TASK_DEAD children is outdated. Contrary to > > what it says we do not need to make sure they all go away. > > We absolutely do need to wait until they all go away. I can easily miss something, and that is why I asked you to review this change. But if I missed something, perhaps we should update the comments? This "wait until TASK_DEAD children" loop was added to ensure that child_reaper can't be reaped before other tasks in this pid namespace. 6347e90091041e "pidns: guarantee that the pidns init will be the last pidns process reaped". But this was needed because proc_flush_task(pid == 1) called kern_unmount(proc_mnt). After 0a01f2cc390e10633 "pidns: Make the pidns proc mount/umount logic obvious" we can rely on schedule_work() in free_pid(nr_hashed == 0). So in any case I think that the current comment is outdated. > - rusage will be wrong if we don't wait. I already answered in 0/2, let me quote myself: Do you mean cstime/cutime/c* accounting? Firstly it is not clear what makes child_reaper special in _this_ sense, but this doesn't matter at all. The auotoreaping/EXIT_DEAD children are not accounted, only wait_task_zombie() accumulates these counters. (just in case, accounting in __exit_signal() is another thing). > - We won't wait for an injected process in a pid namespace, > or a processes debugged with gdb to be reaped before the pid > init process exits if we don't wait. Yes, and I do not see why this is bad, but this is off-topic. Again, lets discuss this in another thread. This patch doesn't try to document the desired semantics, it only tries to explain why zap_pid_ns _has_ to wait until EXIT_DEAD (and in fact EXIT_ZOMBIE) tasks go away. > The user visible semantics go from weird to completely insane if we > relax the rule that the init process is the last process in a pid > namespace. > > I don't see anything approaching a good reason for changing the user > space semantics. See above. > > @@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > > /* Don't allow any more processes into the pid namespace */ > > disable_pid_allocation(pid_ns); > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > The above line disables injections of new processes in this pid > namespace. Yes, sure. See below. > > - /* Ignore SIGCHLD causing any terminated children to autoreap */ > > + /* > > + * Ignore SIGCHLD causing any terminated children to autoreap. > > + * This speeds up the namespace shutdown, plus see the comment > > + * below. > > + */ > > This speeds up the namespace shutdown and ensures that after we > have waited for all existing zombies there will be no more > zombies to wait for. Afaics, no, see below. > > - * sys_wait4() above can't reap the TASK_DEAD children. > > - * Make sure they all go away, see free_pid(). > > + * sys_wait4() above can't reap the TASK_DEAD children but we do not > > + * really care, we could reparent them to the global init. > > We do care. See above. I only meant that nothing bad can happen from the kernel perspective. > > + * But this namespace can also have other tasks injected by setns(). > > + * Again, we do not really need to wait until they are all reaped, > > We do, and setns does not matter here. Injection was stopped way up above. I think that setns() does matter. Yes injection was stopped. But a task T can be already injected before disable_pid_allocation() was called. Now it is killed (or it could even exit before). To simplify, suppose it is already EXIT_ZOMBIE, although this doesn't really matter. The sys_wait4() loop can't see it, it is not our child. Now suppose that its parent doesn't do wait() but exits. This means that the exiting parent will try to reparent T to pid_ns->child_reaper. child_reaper already sleeps in "wait for nr_hashed == init_pids" loop, it can do nothing with T. So we rely on autoreaping (forced by ignored SIGCHLD), and this is the (technical) reason why child_reaper can't exit: pid_ns->child_reaper should be valid. No? > > + * We rely on ignored SIGCHLD, an injected zombie must be autoreaped > > + * if reparented. > > Your new comment is about 90% wrong. See above. Eric. I'd really ask you to take another look. But in any case: thanks for looking at this. Oleg.