From: ebiederm@xmission.com (Eric W. Biederman)
To: Linus Torvalds
Cc: Oleg Nesterov, Russell King - ARM Linux admin, Peter Zijlstra, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing
References: <20190830140805.GD13294@shell.armlinux.org.uk> <20190830160957.GC2634@redhat.com>
Date: Fri, 30 Aug 2019 14:36:15 -0500
In-Reply-To: (Linus Torvalds's message of "Fri, 30 Aug 2019 09:21:31 -0700")
Message-ID: <87o906wimo.fsf@x220.int.ebiederm.org>
Subject: Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
X-Mailing-List: linux-kernel@vger.kernel.org

Linus Torvalds writes:

> On Fri, Aug 30, 2019 at 9:10 AM Oleg Nesterov wrote:
>>
>>
>> Yes, please see
>>
>>         [PATCH 2/3] introduce probe_slab_address()
>>         https://lore.kernel.org/lkml/20141027195425.GC11736@redhat.com/
>>
>> I sent 5 years ago ;) Do you think
>>
>>         /*
>>          * Same as probe_kernel_address(), but @addr must be the valid pointer
>>          * to a slab object, potentially freed/reused/unmapped.
>>          */
>>         #ifdef CONFIG_DEBUG_PAGEALLOC
>>         #define probe_slab_address(addr, retval)        \
>>                 probe_kernel_address(addr, retval)
>>         #else
>>         #define probe_slab_address(addr, retval)        \
>>                 ({                                      \
>>                         (retval) = *(typeof(retval) *)(addr);   \
>>                         0;                              \
>>                 })
>>         #endif
>>
>> can work?
>
> Ugh. I would much rather handle the general case, because honestly,
> tracing has had a lot of issues with our hacky "probe_kernel_read()"
> stuff that bases itself on user addresses.
>
> It's also one of the few remaining users of "set_fs()" in core code,
> and we really should try to get rid of those.
>
> So your code would work for this particular case, but not for other
> cases that can trap simply because the pointer isn't reliable (tracing
> being the main case for that - but if the source of the pointer itself
> might have been free'd, you would also have that situation).
>
> So I'd really prefer to have something a bit fancier. On most
> architectures, doing a good exception fixup for kernel addresses is
> really not that hard.
>
> On x86, for example, we actually have *exactly* that. The
> "__get_user_asm()" macro is basically it. It purely does a load
> instruction from an unchecked address.
>
> (It's a really odd syntax, but you could remove the __chk_user_ptr()
> from the __get_user_size() macro, and now you'd have basically a "any
> regular size kernel access with exception handling").
>
> But yes, your hack is I guess optimal for this particular case where
> you simply can depend on "we know the pointer was valid, we just don't
> know if it was freed".
>
> Hmm. Don't we RCU-free the task struct? Because then we don't even
> need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
> the pointer as long as we have the RCU read lock. We do that in other
> cases.

Sort of. The rcu delay happens when release_task queues
delayed_put_task_struct with call_rcu. Which unfortunately means that
any time after exit_notify, release_task can operate on a task. So it
is possible that by the time do_task_dead is called the rcu grace
period is already up.

Which is the problem the users of task_rcu_dereference are fighting.
They are performing an rcu walk over the set of cpus in
task_numa_migrate and in the userspace membarrier system calls.
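
To make that window concrete, here is a small userspace model, not kernel code. It assumes the current scheme: a task starts with two references, release_task drops one through call_rcu, and finish_task_switch drops the other immediately. The event names and freed_under_reader() are made up for illustration, and the caller is trusted to pass an ordering that RCU would actually allow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event names; each models one step of the real code paths. */
enum model_event {
	EV_RELEASE_TASK,   /* release_task(): call_rcu(delayed_put_task_struct) */
	EV_GRACE_PERIOD,   /* grace period ends, the delayed put runs */
	EV_READER_LOCK,    /* remote cpu: rcu_read_lock(); p = rq->curr; */
	EV_FINAL_SCHEDULE, /* finish_task_switch(): immediate put_task_struct() */
	EV_READER_UNLOCK,  /* remote cpu: rcu_read_unlock() */
};

/*
 * Replay an ordering and report whether the last reference is dropped
 * while a reader still holds the task pointer it read from rq->curr.
 * Assumption: the task starts with a usage count of 2.
 */
static bool freed_under_reader(const enum model_event *ev, size_t n)
{
	int usage = 2;
	bool reader_holds = false;
	bool rcu_put_queued = false;

	for (size_t i = 0; i < n; i++) {
		switch (ev[i]) {
		case EV_RELEASE_TASK:
			rcu_put_queued = true;
			break;
		case EV_GRACE_PERIOD:
			if (rcu_put_queued) {
				assert(usage > 0);
				usage--;
				rcu_put_queued = false;
			}
			break;
		case EV_READER_LOCK:
			reader_holds = true;
			break;
		case EV_READER_UNLOCK:
			reader_holds = false;
			break;
		case EV_FINAL_SCHEDULE:
			assert(usage > 0);
			usage--;
			break;
		}
		/* usage == 0 means the task_struct is freed right here */
		if (usage == 0 && reader_holds)
			return true;
	}
	return false;
}
```

When the reap and its grace period finish before the final schedule, the model reports a free inside the read-side section. When the reap comes after the final schedule, the last put is the rcu-delayed one and the reader is safe.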

For a short while we had the rcu delay in put_task_struct, but that
required changes all over the place and was just a pain to work with.

Then I did:

> commit 8c7904a00b06d2ee51149794b619e07369fcf9d4
> Author: Eric W. Biederman
> Date:   Fri Mar 31 02:31:37 2006 -0800
>
>     [PATCH] task: RCU protect task->usage
>
>     A big problem with rcu protected data structures that are also reference
>     counted is that you must jump through several hoops to increase the reference
>     count. I think someone finally implemented atomic_inc_not_zero(&count) to
>     automate the common case. Unfortunately this means you must special case the
>     rcu access case.
>
>     When data structures are only visible via rcu in a manner that is not
>     determined by the reference count on the object (i.e. tasks are visible until
>     their zombies are reaped) there is a much simpler technique we can employ.
>     Simply delaying the decrement of the reference count until the rcu interval is
>     over.
>
>     What that means is that the proc code that looks up a task and later
>     wants to sleep can now do:
>
>     rcu_read_lock();
>     task = find_task_by_pid(some_pid);
>     if (task) {
>             get_task_struct(task);
>     }
>     rcu_read_unlock();
>
>     The effect on the rest of the kernel is that put_task_struct becomes cheaper
>     and immediate, and in the case where the task has been reaped it frees the
>     task immediate instead of unnecessarily waiting an until the rcu interval is
>     over.
>
>     Cleanup of task_struct does not happen when its reference count drops to
>     zero, instead cleanup happens when release_task is called. Tasks can only
>     be looked up via rcu before release_task is called. All rcu protected
>     members of task_struct are freed by release_task.
>
>     Therefore we can move call_rcu from put_task_struct into release_task. And
>     we can modify release_task to not immediately release the reference count
>     but instead have it call put_task_struct from the function it gives to
>     call_rcu.
>
>     The end result:
>
>     - get_task_struct is safe in an rcu context where we have just looked
>       up the task.
>
>     - put_task_struct() simplifies into its old pre rcu self.
>
>     This reorganization also makes put_task_struct uncallable from modules as
>     it is not exported but it does not appear to be called from any modules so
>     this should not be an issue, and is trivially fixed.
>
>     Signed-off-by: Eric W. Biederman
>     Signed-off-by: Andrew Morton
>     Signed-off-by: Linus Torvalds

About a decade later task_struct grew some new rcu users and Oleg
introduced task_rcu_dereference to deal with them:

> commit 150593bf869393d10a79f6bd3df2585ecc20a9bb
> Author: Oleg Nesterov
> Date:   Wed May 18 19:02:18 2016 +0200
>
>     sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
>
>     Generally task_struct is only protected by RCU if it was found on a
>     RCU protected list (say, for_each_process() or find_task_by_vpid()).
>
>     As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
>     drops the (potentially) last reference without RCU gp, this means
>     that we need to fix the code which uses foreign_rq->curr under
>     rcu_read_lock().
>
>     Add a new helper which can be used to dereference rq->curr or any
>     other pointer to task_struct assuming that it should be cleared or
>     updated before the final put_task_struct(). It returns non-NULL
>     only if this task can't go away before rcu_read_unlock().
>
>     ( Also add try_get_task_struct() to make it easier to use this API
>       correctly. )

So I think it makes a lot of sense to change how we do this. Either
move the rcu delay back into put_task_struct, or go halfway and create
a put_dead_task_struct that performs the rcu delay only after a task
has been taken off the run queues and has stopped being a zombie.
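
As a rough illustration of the halfway option, the decision inside put_dead_task_struct can be modelled in userspace. This is only a sketch under assumptions: EXIT_RCU does not exist today and would have to be added as a new exit_state value, the MODEL_* constants, struct task_model and the model_* helpers are invented here, and the pi_lock is left out because the model is single threaded. The point is that whichever of release_task and finish_task_switch runs second is the one that pays the rcu delay, and the compare-and-exchange makes sure that happens only once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for exit_state values; MODEL_EXIT_RCU models the new state. */
enum { MODEL_EXIT_NONE, MODEL_EXIT_DEAD, MODEL_EXIT_RCU };

struct task_model {
	bool sched_dead;       /* models task->state == TASK_DEAD */
	atomic_int exit_state; /* models task->exit_state */
	int rcu_deferred;      /* puts that would go through call_rcu() */
	int immediate;         /* puts that would be a plain put_task_struct() */
};

static void model_task_init(struct task_model *t)
{
	t->sched_dead = false;
	atomic_init(&t->exit_state, MODEL_EXIT_NONE);
	t->rcu_deferred = 0;
	t->immediate = 0;
}

/* Defer only when the task is both reaped and no longer scheduled. */
static void model_put_dead_task_struct(struct task_model *t)
{
	int expected = MODEL_EXIT_DEAD;

	/* The model hands out exactly two references per task. */
	assert(t->rcu_deferred + t->immediate < 2);

	if (t->sched_dead &&
	    atomic_compare_exchange_strong(&t->exit_state, &expected,
					   MODEL_EXIT_RCU))
		t->rcu_deferred++;
	else
		t->immediate++;
}

/* Reaper side: the task stops being a zombie, then drops its reference. */
static void model_release_task(struct task_model *t)
{
	atomic_store(&t->exit_state, MODEL_EXIT_DEAD);
	model_put_dead_task_struct(t);
}

/* Scheduler side: the task has been switched away from for the last time. */
static void model_finish_task_switch(struct task_model *t)
{
	t->sched_dead = true;
	model_put_dead_task_struct(t);
}
```

Either call order ends with exactly one deferred put and one immediate put, so the final free is always behind a grace period that starts after the task left the run queue.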

Something like:

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..bf323418094e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,7 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_dead_task_struct(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..3a85bc2e8031 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,24 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void put_dead_task_struct(struct task_struct *task)
+{
+	bool delay = false;
+	unsigned long flags;
+
+	/* Is the task both reaped and no longer being scheduled? */
+	raw_spin_lock_irqsave(&task->pi_lock, flags);
+	if ((task->state == TASK_DEAD) &&
+	    (cmpxchg(&task->exit_state, EXIT_DEAD, EXIT_RCU) == EXIT_DEAD))
+		delay = true;
+	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+
+	/* If both are true, rcu delay the put_task_struct */
+	if (delay)
+		call_rcu(&task->rcu, delayed_put_task_struct);
+	else
+		put_task_struct(task);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,76 +240,13 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_dead_task_struct(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
 		goto repeat;
 }
 
-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
-	struct sighand_struct *sighand;
-	struct task_struct *task;
-
-	/*
-	 * We need to verify that release_task() was not called and thus
-	 * delayed_put_task_struct() can't run and drop the last reference
-	 * before rcu_read_unlock(). We check task->sighand != NULL,
-	 * but we can read the already freed and reused memory.
-	 */
-retry:
-	task = rcu_dereference(*ptask);
-	if (!task)
-		return NULL;
-
-	probe_kernel_address(&task->sighand, sighand);
-
-	/*
-	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
-	 * was already freed we can not miss the preceding update of this
-	 * pointer.
-	 */
-	smp_rmb();
-	if (unlikely(task != READ_ONCE(*ptask)))
-		goto retry;
-
-	/*
-	 * We've re-checked that "task == *ptask", now we have two different
-	 * cases:
-	 *
-	 * 1. This is actually the same task/task_struct. In this case
-	 *    sighand != NULL tells us it is still alive.
-	 *
-	 * 2. This is another task which got the same memory for task_struct.
-	 *    We can't know this of course, and we can not trust
-	 *    sighand != NULL.
-	 *
-	 *    In this case we actually return a random value, but this is
-	 *    correct.
-	 *
-	 * If we return NULL - we can pretend that we actually noticed that
-	 * *ptask was updated when the previous task has exited. Or pretend
-	 * that probe_slab_address(&sighand) reads NULL.
-	 *
-	 * If we return the new task (because sighand is not NULL for any
-	 * reason) - this is fine too. This (new) task can't go away before
-	 * another gp pass.
-	 *
-	 * And note: We could even eliminate the false positive if re-read
-	 * task->sighand once again to avoid the falsely NULL. But this case
-	 * is very unlikely so we don't care.
-	 */
-	if (!sighand)
-		return NULL;
-
-	return task;
-}
-
 void rcuwait_wake_up(struct rcuwait *w)
 {
 	struct task_struct *task;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..5b697c0572ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_dead_task_struct(prev);
 	}
 
 	tick_nohz_task_switch();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..c3e1a302211a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		return;
 
 	rcu_read_lock();
-	cur = task_rcu_dereference(&dst_rq->curr);
+	cur = rcu_dereference(dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..74df8e0dfc84 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
 			if (!fallback)
@@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
 		if (cpu == raw_smp_processor_id())
 			continue;
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
 				__cpumask_set_cpu(cpu, tmpmask);

Eric