From: Peter Zijlstra <peterz@infradead.org>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>,
linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
Vladimir Davydov <vdavydov@parallels.com>,
Kirill Tkhai <tkhai@yandex.ru>, Christoph Lameter <cl@linux.com>
Subject: Re: [PATCH 3/3] introduce task_rcu_dereference()
Date: Wed, 18 May 2016 21:10:45 +0200 [thread overview]
Message-ID: <20160518191045.GP3193@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <20160518182318.GA15661@redhat.com>
On Wed, May 18, 2016 at 08:23:18PM +0200, Oleg Nesterov wrote:
> IOW. We can never know if we have a garbage in "sighand" or the real value,
> this task_struct can be freed/reallocated when we do probe_slab_address().
>
> And this is fine. We re-check that "task == *ptask" after that. Now we have
> two different cases:
>
> 1. This is actually the same task/task_struct. In this case
> sighand != NULL tells us it is still alive.
>
> 2. This is another task which got the same memory for task_struct.
> We can't know this of course, and we can not trust sighand != NULL.
>
> In this case we actually return a random value, but this is correct.
>
> If we return NULL - we can pretend that we actually noticed that
> *ptask was updated when the previous task has exited. Or pretend
> that probe_slab_address(&sighand) reads NULL.
>
> If we return the new task (because sighand is not NULL for any
> reason) - this is fine too. This (new) task can't go away before
> another gp pass.
>
> And please note again the "We could even eliminate the false positive"
> comment above (hmm, it should probably say false negative). We could
> re-read task->sighand once again to avoid the falsely NULL.
>
> But this case is very unlikely so I think we do not really care.
>
Ah right, lets stick that in.. :-)
OK, something like so then?
---
include/linux/sched.h | 3 ++
kernel/exit.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 29 +++++---------------
3 files changed, 86 insertions(+), 22 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b43b45a22b9..7f90002e9344 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2134,6 +2134,9 @@ static inline void put_task_struct(struct task_struct *t)
__put_task_struct(t);
}
+struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
extern void task_cputime(struct task_struct *t,
cputime_t *utime, cputime_t *stime);
diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195667e1..58d7e05821ae 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -211,6 +211,82 @@ void release_task(struct task_struct *p)
}
/*
+ * Note that if this function returns a valid task_struct pointer (!NULL)
+ * task->usage must remain >0 for the duration of the RCU critical section.
+ */
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+ struct sighand_struct *sighand;
+ struct task_struct *task;
+
+ /*
+ * We need to verify that release_task() was not called and thus
+ * delayed_put_task_struct() can't run and drop the last reference
+ * before rcu_read_unlock(). We check task->sighand != NULL,
+ * but we can read the already freed and reused memory.
+ */
+retry:
+ task = rcu_dereference(*ptask);
+ if (!task)
+ return NULL;
+
+ probe_kernel_address(&task->sighand, sighand);
+
+ /*
+ * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
+ * was already freed we can not miss the preceding update of this
+ * pointer.
+ */
+ smp_rmb();
+ if (unlikely(task != READ_ONCE(*ptask)))
+ goto retry;
+
+ /*
+ * We've re-checked that "task == *ptask", now we have two different
+ * cases:
+ *
+ * 1. This is actually the same task/task_struct. In this case
+ * sighand != NULL tells us it is still alive.
+ *
+ * 2. This is another task which got the same memory for task_struct.
+ * We can't know this of course, and we can not trust
+ * sighand != NULL.
+ *
+ * In this case we actually return a random value, but this is
+ * correct.
+ *
+ * If we return NULL - we can pretend that we actually noticed that
+ * *ptask was updated when the previous task has exited. Or pretend
+ * that probe_slab_address(&sighand) reads NULL.
+ *
+ * If we return the new task (because sighand is not NULL for any
+ * reason) - this is fine too. This (new) task can't go away before
+ * another gp pass.
+ *
+ * And note: We could even eliminate the false positive if re-read
+ * task->sighand once again to avoid the falsely NULL. But this case
+ * is very unlikely so we don't care.
+ */
+ if (!sighand)
+ return NULL;
+
+ return task;
+}
+
+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+ struct task_struct *task;
+
+ rcu_read_lock();
+ task = task_rcu_dereference(ptask);
+ if (task)
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ return task;
+}
+
+/*
* Determine if a process group is "orphaned", according to the POSIX
* definition in 2.2.2.52. Orphaned process groups are not to be affected
* by terminal-generated stop signals. Newly orphaned process groups are
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e83db73..1d3a410c481b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1374,30 +1374,15 @@ static void task_numa_compare(struct task_numa_env *env,
int dist = env->dist;
bool assigned = false;
- rcu_read_lock();
-
- raw_spin_lock_irq(&dst_rq->lock);
- cur = dst_rq->curr;
- /*
- * No need to move the exiting task or idle task.
- */
- if ((cur->flags & PF_EXITING) || is_idle_task(cur))
- cur = NULL;
- else {
- /*
- * The task_struct must be protected here to protect the
- * p->numa_faults access in the task_weight since the
- * numa_faults could already be freed in the following path:
- * finish_task_switch()
- * --> put_task_struct()
- * --> __put_task_struct()
- * --> task_numa_free()
- */
- get_task_struct(cur);
+ cur = try_get_task_struct(&dst_rq->curr);
+ if (cur) {
+ if ((cur->flags & PF_EXITING) || is_idle_task(cur)) {
+ put_task_struct(cur);
+ cur = NULL;
+ }
}
- raw_spin_unlock_irq(&dst_rq->lock);
-
+ rcu_read_lock();
/*
* Because we have preemption enabled we can get migrated around and
* end try selecting ourselves (current == env->p) as a swap candidate.
next prev parent reply other threads:[~2016-05-18 19:10 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-22 7:17 [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Kirill Tkhai
2014-10-22 21:30 ` introduce task_rcu_dereference? Oleg Nesterov
2014-10-22 22:23 ` Oleg Nesterov
2014-10-23 18:15 ` Oleg Nesterov
2014-10-23 8:10 ` Kirill Tkhai
2014-10-23 18:18 ` Oleg Nesterov
2014-10-24 7:51 ` Kirill Tkhai
2014-10-27 19:53 ` [PATCH 0/3] introduce task_rcu_dereference() Oleg Nesterov
2014-10-27 19:54 ` [PATCH 1/3] probe_kernel_address() can use __probe_kernel_read() Oleg Nesterov
2014-10-27 19:54 ` [PATCH 2/3] introduce probe_slab_address() Oleg Nesterov
2014-10-27 19:21 ` Christoph Lameter
2014-10-28 5:44 ` Kirill Tkhai
2014-10-28 5:48 ` Kirill Tkhai
2014-10-28 15:01 ` Peter Zijlstra
2014-10-28 17:56 ` Kirill Tkhai
2014-10-28 18:00 ` Kirill Tkhai
2014-10-28 19:55 ` Oleg Nesterov
2014-10-28 20:12 ` Oleg Nesterov
2014-10-29 5:10 ` Kirill Tkhai
2014-10-27 19:54 ` [PATCH 3/3] introduce task_rcu_dereference() Oleg Nesterov
2014-10-28 6:22 ` Kirill Tkhai
2016-05-18 17:02 ` Peter Zijlstra
2016-05-18 18:23 ` Oleg Nesterov
2016-05-18 19:10 ` Peter Zijlstra [this message]
2016-05-18 19:57 ` Oleg Nesterov
2016-05-26 11:34 ` Peter Zijlstra
2016-06-03 10:49 ` [tip:sched/core] sched/fair: Use task_rcu_dereference() tip-bot for Oleg Nesterov
2016-06-03 10:48 ` [tip:sched/core] sched/api: Introduce task_rcu_dereference() and try_get_task_struct() tip-bot for Oleg Nesterov
2014-10-28 11:02 ` [tip:sched/core] sched/numa: Fix unsafe get_task_struct() in task_numa_assign() tip-bot for Kirill Tkhai
2014-11-08 3:48 ` [PATCH v4] sched/numa: fix " Sasha Levin
2014-11-09 14:07 ` Kirill Tkhai
2014-11-10 10:03 ` Peter Zijlstra
2014-11-10 15:48 ` Sasha Levin
2014-11-10 16:01 ` Peter Zijlstra
2014-11-16 9:50 ` [tip:sched/urgent] sched/numa: Avoid selecting oneself as swap target tip-bot for Peter Zijlstra
2014-11-10 16:03 ` [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Peter Zijlstra
2014-11-10 16:09 ` Sasha Levin
2014-11-10 16:16 ` Peter Zijlstra
2014-11-10 16:10 ` Kirill Tkhai
2014-11-10 16:36 ` Kirill Tkhai
2014-11-10 16:44 ` Sasha Levin
2014-11-10 20:01 ` Kirill Tkhai
2014-11-12 9:49 ` Kirill Tkhai
2014-11-15 2:38 ` Sasha Levin
2014-11-18 17:30 ` Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160518191045.GP3193@twins.programming.kicks-ass.net \
--to=peterz@infradead.org \
--cc=cl@linux.com \
--cc=ktkhai@parallels.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=oleg@redhat.com \
--cc=tkhai@yandex.ru \
--cc=vdavydov@parallels.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.