From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752997AbdELOcz (ORCPT ); Fri, 12 May 2017 10:32:55 -0400 Received: from out02.mta.xmission.com ([166.70.13.232]:47740 "EHLO out02.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751881AbdELOcx (ORCPT ); Fri, 12 May 2017 10:32:53 -0400 From: ebiederm@xmission.com (Eric W. Biederman) To: Kirill Tkhai Cc: , , , , , , , , , , References: <149459442953.20826.15806664408144117255.stgit@localhost.localdomain> Date: Fri, 12 May 2017 09:26:19 -0500 In-Reply-To: <149459442953.20826.15806664408144117255.stgit@localhost.localdomain> (Kirill Tkhai's message of "Fri, 12 May 2017 16:07:28 +0300") Message-ID: <87shka40dw.fsf@xmission.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1d9BcY-0002Vp-L9;;;mid=<87shka40dw.fsf@xmission.com>;;;hst=in01.mta.xmission.com;;;ip=97.121.81.159;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX18tiqErjhH0isxZzH0UlXeunZuFjSGO9vI= X-SA-Exim-Connect-IP: 97.121.81.159 X-SA-Exim-Mail-From: ebiederm@xmission.com X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP * 0.0 TVD_RCVD_IP Message was received from an IP address * 0.7 XMSubLong Long Subject * 0.5 XMGappySubj_01 Very gappy subject * 0.0 T_TM2_M_HEADER_IN_MSG BODY: No description available. * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% * [score: 0.5000] * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC * [sa05 1397; Body=1 Fuz1=1 Fuz2=1] * 0.0 T_TooManySym_01 4+ unique symbols in subject * 0.0 T_TooManySym_02 5+ unique symbols in subject * 0.0 T_TooManySym_03 6+ unique symbols in subject X-Spam-DCC: XMission; sa05 1397; Body=1 Fuz1=1 Fuz2=1 X-Spam-Combo: *;Kirill Tkhai X-Spam-Relay-Country: X-Spam-Timing: total 5853 ms - load_scoreonly_sql: 0.05 (0.0%), signal_user_changed: 2.7 (0.0%), b_tie_ro: 1.80 (0.0%), parse: 0.87 (0.0%), extract_message_metadata: 29 (0.5%), get_uri_detail_list: 3.7 (0.1%), tests_pri_-1000: 13 (0.2%), tests_pri_-950: 1.09 (0.0%), tests_pri_-900: 0.93 (0.0%), tests_pri_-400: 37 (0.6%), check_bayes: 36 (0.6%), b_tokenize: 12 (0.2%), b_tok_get_all: 14 (0.2%), b_comp_prob: 2.6 (0.0%), b_tok_touch_all: 5 (0.1%), b_finish: 1.23 (0.0%), tests_pri_0: 570 (9.7%), check_dkim_signature: 0.55 (0.0%), check_dkim_adsp: 2.8 (0.0%), tests_pri_500: 5195 (88.8%), poll_dns_idle: 5186 (88.6%), rewrite_mail: 0.00 (0.0%) Subject: Re: [PATCH] pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes() X-Spam-Flag: No X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600) X-SA-Exim-Scanned: Yes (on in01.mta.xmission.com) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Kirill Tkhai writes: > Imagine we have a pid namespace and a task from its parent's pid_ns, > which made setns() to the pid namespace. The task is doing fork(), > while the pid namespace's child reaper is dying. We have the race > between them: > > Task from parent pid_ns Child reaper > copy_process() .. > alloc_pid() .. > .. zap_pid_ns_processes() > .. disable_pid_allocation() > .. read_lock(&tasklist_lock) > .. iterate over pids in pid_ns > .. kill tasks linked to pids > .. read_unlock(&tasklist_lock) > write_lock_irq(&tasklist_lock); .. > attach_pid(p, PIDTYPE_PID); .. > .. .. > > So, just created task p won't receive SIGKILL signal, > and the pid namespace will be in contradictory state. > Only manual kill will help there, but does the userspace > care about this? I suppose, the most users just inject > a task into a pid namespace and wait a SIGCHLD from it. > > The patch fixes the problem. It moves disable_pid_allocation() > into find_child_reaper() where tasklist_lock is held, > and this allows to simply check for (pid_ns->nr_hashed & PIDNS_HASH_ADDING) > in copy_process(). If allocation is disabled, we just > return -ENOMEM like it's made for such cases in alloc_pid(). This problem sounds very theoretical has it ever come up in practice? I am asking to see if this is something we will care enough about to backport. Please look at what happens when you call spin_unlock_irq(&pidmap_lock) under writelock_irq(&tasklist_lock); Please also look at what happens when pid == &init_pid but p->nsproxy->pid_ns_for_children happens to be have PIDNS_HASH_ADDING set. All of that said I think this is a fix worth fixing. Eric > Signed-off-by: Kirill Tkhai > CC: Andrew Morton > CC: Ingo Molnar > CC: Peter Zijlstra > CC: Oleg Nesterov > CC: Mike Rapoport > CC: Michal Hocko > CC: Andy Lutomirski > CC: "Eric W. Biederman" > CC: Andrei Vagin > CC: Cyrill Gorcunov > CC: Serge Hallyn > --- > kernel/exit.c | 2 ++ > kernel/fork.c | 15 ++++++++++----- > kernel/pid_namespace.c | 3 --- > 3 files changed, 12 insertions(+), 8 deletions(-) > > diff --git a/kernel/exit.c b/kernel/exit.c > index 516acdb0e0ec..9310e69fbc5f 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -586,6 +586,8 @@ static struct task_struct *find_child_reaper(struct task_struct *father) > return reaper; > } > > + /* Don't allow any more processes into the pid namespace */ > + disable_pid_allocation(pid_ns); > write_unlock_irq(&tasklist_lock); > if (unlikely(pid_ns == &init_pid_ns)) { > panic("Attempted to kill init! exitcode=0x%08x\n", > diff --git a/kernel/fork.c b/kernel/fork.c > index bfd91b180778..dbafabf6c7b1 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1523,6 +1523,7 @@ static __latent_entropy struct task_struct *copy_process( > unsigned long tls, > int node) > { > + struct pid_namespace *pid_ns; > int retval; > struct task_struct *p; > > @@ -1735,8 +1736,9 @@ static __latent_entropy struct task_struct *copy_process( > if (retval) > goto bad_fork_cleanup_io; > > + pid_ns = p->nsproxy->pid_ns_for_children; > if (pid != &init_struct_pid) { > - pid = alloc_pid(p->nsproxy->pid_ns_for_children); > + pid = alloc_pid(pid_ns); > if (IS_ERR(pid)) { > retval = PTR_ERR(pid); > goto bad_fork_cleanup_thread; > @@ -1845,10 +1847,11 @@ static __latent_entropy struct task_struct *copy_process( > */ > recalc_sigpending(); > if (signal_pending(current)) { > - spin_unlock(¤t->sighand->siglock); > - write_unlock_irq(&tasklist_lock); > retval = -ERESTARTNOINTR; > - goto bad_fork_cancel_cgroup; > + goto bad_fork_unlock_siglock; > + } else if (unlikely(!(pid_ns->nr_hashed & PIDNS_HASH_ADDING))) { > + retval = -ENOMEM; > + goto bad_fork_unlock_siglock; > } > > if (likely(p->pid)) { > @@ -1906,7 +1909,9 @@ static __latent_entropy struct task_struct *copy_process( > > return p; > > -bad_fork_cancel_cgroup: > +bad_fork_unlock_siglock: > + spin_unlock(¤t->sighand->siglock); > + write_unlock_irq(&tasklist_lock); > cgroup_cancel_fork(p); > bad_fork_free_pid: > cgroup_threadgroup_change_end(current); > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index d1f3e9f558b8..aedf86a8017e 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -210,9 +210,6 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > struct task_struct *task, *me = current; > int init_pids = thread_group_leader(me) ? 1 : 2; > > - /* Don't allow any more processes into the pid namespace */ > - disable_pid_allocation(pid_ns); > - > /* > * Ignore SIGCHLD causing any terminated children to autoreap. > * This speeds up the namespace shutdown, plus see the comment