From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Don Zickus <dzickus@redhat.com>,
Frederic Weisbecker <fweisbec@gmail.com>,
Ingo Molnar <mingo@elte.hu>,
Jerome Marchand <jmarchan@redhat.com>,
Mandeep Singh Baines <msb@google.com>,
Roland McGrath <roland@redhat.com>,
linux-kernel@vger.kernel.org, stable@kernel.org,
"Eric W. Biederman" <ebiederm@xmission.com>
Subject: Re: while_each_thread() under rcu_read_lock() is broken?
Date: Fri, 18 Jun 2010 15:33:54 -0700
Message-ID: <20100618223354.GL2365@linux.vnet.ibm.com>
In-Reply-To: <20100618193403.GA17314@redhat.com>

On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote:
> (add cc's)
>
> Hmm. Once I sent this patch, I suddenly realized with horror that
> while_each_thread() is NOT safe under rcu_read_lock(). Both
> do_each_thread/while_each_thread or do/while_each_thread() can
> race with exec().
>
> Yes, it is safe to do next_thread() or next_task(). But:
>
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g)
>
> suppose that t is not the group leader, and it does de_thread() and then
> release_task(g). After that next_thread(t) returns t, not g, and the loop
> will never stop.
>
> I _really_ hope I missed something, will recheck tomorrow with the fresh
> head. Still I'd like to share my concerns...
>
> If I am right, probably we can fix this, something like
>
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g && pid_alive(g))
>
> [we can't do while (!thread_group_leader(t = next_thread(t)))].
> but this needs barriers, and we should validate the callers anyway.
>
> Or, perhaps,
>
> #define XXX(t) ({
> struct task_struct *__prev = t;
> t = next_thread(t);
> t != g && t != __prev;
> })
>
> #define while_each_thread(g, t) \
> while (XXX(t))

Isn't the above vulnerable to a pthread_create() immediately following
the offending exec()? Especially if the task doing the traversal is
preempted?
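
To make the concern concrete, here is a minimal userspace sketch, not kernel code. It assumes a toy model in which a thread group is just a ring of `next` pointers; `struct toy_thread`, `toy_xxx()`, `toy_walk()`, and the iteration limit are all hypothetical names introduced for illustration. `toy_xxx()` mirrors the quoted XXX() condition with g passed explicitly.

```c
#include <stdbool.h>

/* Toy stand-in for task_struct: only the thread-group ring pointer. */
struct toy_thread {
	struct toy_thread *next;	/* models what next_thread() returns */
};

/* The quoted XXX() loop condition, with g passed in explicitly. */
static bool toy_xxx(const struct toy_thread *g, const struct toy_thread **t)
{
	const struct toy_thread *prev = *t;

	*t = (*t)->next;
	return *t != g && *t != prev;
}

/*
 * Resume a traversal at t on behalf of a reader that started at g.
 * Returns the number of loop iterations, or -1 once the (hypothetical)
 * limit is reached, which stands in for "never terminates".
 */
static int toy_walk(const struct toy_thread *g, const struct toy_thread *t,
		    int limit)
{
	int iterations = 0;

	while (toy_xxx(g, &t)) {
		if (++iterations >= limit)
			return -1;
	}
	return iterations;
}

/* exec() only: t ends up alone in its ring and g is no longer reachable. */
static int toy_exec_only(void)
{
	struct toy_thread g, t;

	g.next = &t;
	t.next = &g;	/* initial ring: g -> t -> g */
	t.next = &t;	/* de_thread() + release_task(g), as seen from t */
	return toy_walk(&g, &t, 1000);
}

/* exec() followed by pthread_create(): a new thread n joins t's ring. */
static int toy_exec_then_create(void)
{
	struct toy_thread g, t, n;

	g.next = &t;
	t.next = &g;	/* initial ring: g -> t -> g */
	t.next = &n;	/* after exec(), the new thread is linked in */
	n.next = &t;
	return toy_walk(&g, &t, 1000);
}
```

In this model toy_exec_only() returns 0, because the t != __prev check stops the loop, while toy_exec_then_create() returns -1: the reader alternates between t and n, never sees g, and never sees the same task twice in a row.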

I cannot claim to understand the task-list code, but here are some
techniques that might (or might not) help:

o	Check ACCESS_ONCE(p->group_leader) == g; if false, restart
	the traversal. Any race on update of p->group_leader would
	sort itself out on later iterations. This of course might
	require careful attention to the order of updates to
	->group_leader and the list pointers. I also don't like it
	much from a worst-case response-time viewpoint. ;-)

	Plus it requires all operations on the tasks be idempotent,
	which is a bit ugly and restrictive.

o	Maintain an ->oldnext field that tracks the old pointer to
	the next task for one RCU grace period after a de_thread()
	operation. When the grace period expires (presumably via
	call_rcu()), the ->oldnext field is set to NULL.

	If the ->oldnext field is non-NULL, any subsequent de_thread()
	operations wait until it is NULL. (I convinced myself that
	pthread_create() need -not- wait, but perhaps mistakenly --
	the idea is that any recent de_thread() victim remains group
	leader, so is skipped by while_each_thread(), but you would
	know better than I.)

	Then while_each_thread() does checks similar to what you have
	above, possibly with the addition of the ->group_leader check,
	but follows the ->oldnext pointer if the checks indicate that
	this task has de_thread()ed itself. The ->oldnext access will
	need to be preceded by a memory barrier, but this is off the
	fast path, so should be OK. There would also need to be
	memory barriers on the update side.

o	Do the de_thread() incrementally. So if the list is tasks A,
	B, and C, in that order, and if we are de_thread()ing B,
	then make A's pointer refer to C, wait a grace period, then
	complete the de_thread() operation. I would be surprised if
	people would actually be happy with the resulting increase in
	exec() overhead, but I mention it for completeness. Of course,
	synchronize_rcu_expedited() has much shorter latency, and might
	work in this situation. (Though please do let me know if you
	choose this approach -- it will mean that I need to worry about
	synchronize_rcu_expedited() scalability sooner rather than
	later! Which is OK as long as I know.)

This all assumes that it is OK for de_thread() to block, but I have
no idea if this is the case.

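The incremental de_thread() idea in the third item above can be modeled in userspace as a two-phase unlink. This is only a rough sketch under stated assumptions: the ring is singly linked, `struct toy_task` and all `toy_*()` functions are hypothetical, and `toy_wait_for_readers()` is an empty placeholder for where a real implementation would wait for a grace period (for example with synchronize_rcu() or synchronize_rcu_expedited()).

```c
/* Toy stand-in for task_struct: only the thread-group ring pointer. */
struct toy_task {
	struct toy_task *next;
};

/*
 * Phase 1: hide the victim from new readers by pointing its predecessor
 * past it. The victim's own ->next is left intact, so a reader that is
 * already standing on the victim can still get back into the ring.
 */
static void toy_unlink_begin(struct toy_task *pred, struct toy_task *victim)
{
	pred->next = victim->next;
}

/* Hypothetical placeholder: a real version would wait for a grace period. */
static void toy_wait_for_readers(void)
{
}

/* Phase 2: with no readers left on the victim, make it a ring of one. */
static void toy_unlink_finish(struct toy_task *victim)
{
	victim->next = victim;
}

/*
 * Model of the original while_each_thread() loop for a reader that
 * started at g and is currently at t. Returns the number of tasks
 * visited before reaching g again, or -1 once the (hypothetical)
 * limit is hit.
 */
static int toy_steps_to_leader(const struct toy_task *g,
			       const struct toy_task *t, int limit)
{
	int steps = 0;

	while ((t = t->next) != g) {
		if (++steps >= limit)
			return -1;
	}
	return steps;
}
```

With a ring A -> B -> C -> A and B as the victim, a reader standing on B between the two phases still reaches A after visiting C. Only after toy_unlink_finish() would a reader left on B spin, and the wait between the phases is what is supposed to rule that reader out.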
> Please tell me I am wrong!

It would take a braver man than me to say that Oleg Nesterov is wrong!

							Thanx, Paul

> Oleg.
>
> On 06/18, Oleg Nesterov wrote:
> >
> > check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
> > "softlockup: check all tasks in hung_task" commit ce9dbe24 looks
> > absolutely wrong.
> >
> > - rcu_lock_break() does put_task_struct(). If the task has exited
> > it is not safe to even read its ->state, nothing protects this
> > task_struct.
> >
> > - The TASK_DEAD checks are wrong too. Contrary to the comment, we
> > can't use it to check if the task was unhashed. It can be unhashed
> > without TASK_DEAD, or it can be valid with TASK_DEAD.
> >
> > For example, an autoreaping task can do release_task(current)
> > long before it sets TASK_DEAD in do_exit().
> >
> > Or, a zombie task can have ->state == TASK_DEAD but release_task()
> > was not called, and in this case we must not break the loop.
> >
> > Change this code to check pid_alive() instead, and do this before we
> > drop the reference to the task_struct.
> >
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > ---
> >
> > kernel/hung_task.c | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK 2009-12-18 19:05:38.000000000 +0100
> > +++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200
> > @@ -113,15 +113,20 @@ static void check_hung_task(struct task_
> > * For preemptible RCU it is sufficient to call rcu_read_unlock in order
> > * exit the grace period. For classic RCU, a reschedule is required.
> > */
> > -static void rcu_lock_break(struct task_struct *g, struct task_struct *t)
> > +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
> > {
> > + bool can_cont;
> > +
> > get_task_struct(g);
> > get_task_struct(t);
> > rcu_read_unlock();
> > cond_resched();
> > rcu_read_lock();
> > + can_cont = pid_alive(g) && pid_alive(t);
> > put_task_struct(t);
> > put_task_struct(g);
> > +
> > + return can_cont;
> > }
> >
> > /*
> > @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t
> > goto unlock;
> > if (!--batch_count) {
> > batch_count = HUNG_TASK_BATCHING;
> > - rcu_lock_break(g, t);
> > - /* Exit if t or g was unhashed during refresh. */
> > - if (t->state == TASK_DEAD || g->state == TASK_DEAD)
> > + if (!rcu_lock_break(g, t))
> > goto unlock;
> > }
> > /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
>