From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751232AbdAXP3T (ORCPT ); Tue, 24 Jan 2017 10:29:19 -0500 Received: from out03.mta.xmission.com ([166.70.13.233]:40569 "EHLO out03.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751199AbdAXP3P (ORCPT ); Tue, 24 Jan 2017 10:29:15 -0500 From: ebiederm@xmission.com (Eric W. Biederman) To: Oleg Nesterov Cc: Pavel Tikhomirov , Lennart Poettering , Kay Sievers , Ingo Molnar , Peter Zijlstra , Andrew Morton , Cyrill Gorcunov , John Stultz , Thomas Gleixner , Nicolas Pitre , Michal Hocko , Stanislav Kinsburskiy , Mateusz Guzik , linux-kernel@vger.kernel.org, Pavel Emelyanov , Konstantin Khorenko References: <20170119164346.4214-1-ptikhomirov@virtuozzo.com> <20170123164420.GA2145@redhat.com> <87tw8p8wo8.fsf@xmission.com> <20170124140738.GA21034@redhat.com> Date: Wed, 25 Jan 2017 04:24:52 +1300 In-Reply-To: <20170124140738.GA21034@redhat.com> (Oleg Nesterov's message of "Tue, 24 Jan 2017 15:07:38 +0100") Message-ID: <87bmuwjxa3.fsf@xmission.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1cW31v-0004lQ-HQ;;;mid=<87bmuwjxa3.fsf@xmission.com>;;;hst=in01.mta.xmission.com;;;ip=101.100.131.98;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX1/EWWeFvRJen7DLXuSGAK6ce/jFseh1HC0= X-SA-Exim-Connect-IP: 101.100.131.98 X-SA-Exim-Mail-From: ebiederm@xmission.com X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP * 0.5 XMGappySubj_01 Very gappy subject * 0.0 T_TM2_M_HEADER_IN_MSG BODY: No description available. * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% * [score: 0.5000] * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC * [sa08 1397; Body=1 Fuz1=1 Fuz2=1] X-Spam-DCC: XMission; sa08 1397; Body=1 Fuz1=1 Fuz2=1 X-Spam-Combo: ;Oleg Nesterov X-Spam-Relay-Country: X-Spam-Timing: total 295 ms - load_scoreonly_sql: 0.03 (0.0%), signal_user_changed: 2.9 (1.0%), b_tie_ro: 2.2 (0.8%), parse: 0.64 (0.2%), extract_message_metadata: 3.7 (1.3%), get_uri_detail_list: 2.2 (0.8%), tests_pri_-1000: 4.0 (1.3%), tests_pri_-950: 0.91 (0.3%), tests_pri_-900: 0.87 (0.3%), tests_pri_-400: 24 (8.1%), check_bayes: 23 (7.8%), b_tokenize: 8 (2.8%), b_tok_get_all: 8 (2.8%), b_comp_prob: 1.93 (0.7%), b_tok_touch_all: 2.8 (0.9%), b_finish: 0.56 (0.2%), tests_pri_0: 247 (83.9%), check_dkim_signature: 0.40 (0.1%), check_dkim_adsp: 2.5 (0.8%), tests_pri_500: 3.2 (1.1%), rewrite_mail: 0.00 (0.0%) Subject: Re: setns() && PR_SET_CHILD_SUBREAPER X-Spam-Flag: No X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600) X-SA-Exim-Scanned: Yes (on in01.mta.xmission.com) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Oleg Nesterov writes: > On 01/24, Eric W. Biederman wrote: >> >> Oleg Nesterov writes: >> >> > Suppose we have a process P in the root namespace and another namespace X. >> > >> > P does setns() and enters the X namespace. >> > P forks a child C. >> > >> > C forks a grandchild G. >> > C exits. >> > >> > The question is, where should we reparent the grandchild G? In the normal >> > case it will be reparented to X->child_reaper and this looks correct. >> > >> > But lets suppose that P runs with the ->has_child_subreaper bit set. In >> > this case it will be reparented to P's sub-reaper or a global init, and >> > given that P can't control its ->has_child_subreaper flag this does not >> > look right to me. >> > >> > I can make a simple patch but perhaps I missed something or we actually >> > want this (imo strange) behaviour? >> >> We definitely do not want a child to be repareted out of a pid namespace >> when the pid namespace has a perfectly fine child_reaper. >> >> The special case for the init_task in find_new_reaper appears to be the >> instance of this problem that was considered in the code. > > Actually we should blame the same_thread_group(reaper, child_reaper) check, > it should had ensured we could not cross the namespaces, but it is not > enough. Because this logic predates setns(). > >> Semantically what we want to do is walk up the parents in the process >> tree. If a parent has is_child_subreaper we stop at it. If the >> transition from one parent to the next we are switching pid namespaces >> we want the reaper from the pid namespace. > > Yes, this is what I have in mind, see the patch below. I need to re-check > it and update the comment to explain why we can't simply check child_reaper > as we currently do. > > This way we can start the search from father->real_parent, but the comment > above the "reaper == &init_task" is no longer correct, we always need this > check although perhaps is_idle_task(reaper) would be better. > >> As I recall has_child_subreaper was just supposed to be an optimization >> so the common case would not have to walk up the process tree when >> finding it's parent. > > Yep. > >> If we retain any optimizations such as has_child_subreaper please >> consider the case where a process with is_child_subreaper set exits, >> and what happens to it's children. > > Yes, in this case it should not have any effect. Well, there is another > corner case, perhaps we should turn > > if (!reaper->signal->is_child_subreaper) > continue; > > into > if (!reaper->signal->is_child_subreaper) { > if (!reaper->signal->has_child_subreaper) > break; > continue; > } > > this looks a bit more correct if the exited "is_child_subreaper" process > was forked, and after that its parent called prctl(SET_CHILD_SUBREAPER). > But I think we do not care and Pavel is going to eliminate the case when > a child of is_child_subreaper task can run without has_child_subreaper > flag set. As long as we update the flag when reparenting so that it is accurate and we clear it when creating a child in a new pid namespace. > So what do you think about the patch below? That does look like the correct logic. Whose tree do we want to run merge these fixes through? I can pick them up if that would be convinient. Eric > Oleg. > > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -569,15 +569,15 @@ static struct task_struct *find_new_reaper(struct task_struct *father, > return thread; > > if (father->signal->has_child_subreaper) { > + unsigned int level = task_pid(father)->level; > /* > * Find the first ->is_child_subreaper ancestor in our pid_ns. > - * We start from father to ensure we can not look into another > - * namespace, this is safe because all its threads are dead. > + * We check pid->level, this is slightly more efficient than > + * task_active_pid_ns(reaper) != task_active_pid_ns(father). > */ > - for (reaper = father; > - !same_thread_group(reaper, child_reaper); > + for (reaper = father->real_parent; > + task_pid(reaper)->level == level; > reaper = reaper->real_parent) { > - /* call_usermodehelper() descendants need this check */ > if (reaper == &init_task) > break; > if (!reaper->signal->is_child_subreaper)