From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Don Zickus <dzickus@redhat.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Ingo Molnar <mingo@elte.hu>,
	Jerome Marchand <jmarchan@redhat.com>,
	Mandeep Singh Baines <msb@google.com>,
	Roland McGrath <roland@redhat.com>,
	linux-kernel@vger.kernel.org, stable@kernel.org,
	"Eric W. Biederman" <ebiederm@xmission.com>
Subject: Re: while_each_thread() under rcu_read_lock() is broken?
Date: Fri, 18 Jun 2010 15:33:54 -0700
Message-ID: <20100618223354.GL2365@linux.vnet.ibm.com>
In-Reply-To: <20100618193403.GA17314@redhat.com>

On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote:
> (add cc's)
> 
> Hmm. Once I sent this patch, I suddenly realized with horror that
> while_each_thread() is NOT safe under rcu_read_lock(). Both
> do_each_thread/while_each_thread and do/while_each_thread() can
> race with exec().
> 
> Yes, it is safe to do next_thread() or next_task(). But:
> 
> 	#define while_each_thread(g, t) \
> 		while ((t = next_thread(t)) != g)
> 
> suppose that t is not the group leader, and it does de_thread() and then
> release_task(g). After that next_thread(t) returns t, not g, and the loop
> will never stop.
> 
> I _really_ hope I missed something; I will recheck tomorrow with a fresh
> head. Still, I'd like to share my concerns...
> 
> If I am right, probably we can fix this, something like
> 
> 	#define while_each_thread(g, t) \
> 		while ((t = next_thread(t)) != g && pid_alive(g))
> 
> [we can't do while (!thread_group_leader(t = next_thread(t)))],
> but this needs barriers, and we should validate the callers anyway.
> 
> Or, perhaps,
> 
> 	#define XXX(g, t) ({					\
> 		struct task_struct *__prev = t;			\
> 		t = next_thread(t);				\
> 		t != g && t != __prev;				\
> 	})
> 
> 	#define while_each_thread(g, t) \
> 		while (XXX(g, t))

Isn't the above vulnerable to a pthread_create() immediately following
the offending exec()?  Especially if the task doing the traversal is
preempted?
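
To make the failure mode concrete, here is a minimal userspace sketch
(simplified stand-ins, not kernel code) of why the loop cannot terminate
once the old leader has been unhashed:

	#include <stdio.h>

	/* Simplified stand-in for task_struct's thread-group linkage. */
	struct task {
		struct task *next;
	};

	static struct task *next_thread(struct task *t)
	{
		return t->next;
	}

	int main(void)
	{
		struct task g, t;	/* g: old leader, t: the exec()ing thread */
		struct task *pos;
		int iters = 0;

		/* Initial thread-group ring: g -> t -> g. */
		g.next = &t;
		t.next = &g;

		/*
		 * t calls exec(): de_thread() leaves t as the only thread
		 * of the group and release_task(g) unhashes the old
		 * leader, so the ring collapses to t -> t.  The reader
		 * still holds a pointer into it.
		 */
		t.next = &t;

		/*
		 * while_each_thread(g, pos) with pos currently at t,
		 * bounded so that this demo halts; the real loop would
		 * spin forever because pos can never become &g again.
		 */
		pos = &t;
		while ((pos = next_thread(pos)) != &g && ++iters < 10)
			;

		printf("gave up after %d iterations\n", iters);
		return 0;
	}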

I cannot claim to understand the task-list code, but here are some
techniques that might (or might not) help:

o	Check ACCESS_ONCE(p->group_leader) == g; if false, restart
	the traversal.  Any race on an update of p->group_leader would
	sort itself out on later iterations.  This of course might
	require careful attention to the order of updates to ->group_leader
	and the list pointers.  I also don't like it much from a
	worst-case response-time viewpoint.  ;-)

	Plus it requires that all operations on the tasks be idempotent,
	which is a bit ugly and restrictive.  (A rough sketch of this
	restart check follows the list below.)

o	Maintain an ->oldnext field that tracks the old pointer to
	the next task for one RCU grace period after a de_thread()
	operation.  When the grace period expires (presumably via
	call_rcu()), the ->oldnext field is set to NULL.

	If the ->oldnext field is non-NULL, any subsequent de_thread()
	operations wait until it is NULL.  (I convinced myself that
	pthread_create() need -not- wait, but perhaps mistakenly --
	the idea is that any recent de_thread() victim remains group
	leader, so is skipped by while_each_thread(), but you would
	know better than I.)

	Then while_each_thread() does checks similar to what you have
	above, possibly with the addition of the ->group_leader check,
	but follows the ->oldnext pointer if the checks indicate that
	this task has de_thread()ed itself.  The ->oldnext access will
	need to be preceded by a memory barrier, but this is off the
	fast path, so should be OK.  There would also need to be
	memory barriers on the update side.  (A rough sketch of the
	->oldnext lifetime management follows the list below.)

o	Do the de_thread() incrementally.  So if the list is tasks A,
	B, and C, in that order, and if we are de-thread()ing B,
	then make A's pointer refer to C, wait a grace period, then
	complete the de_thread() operation.  I would be surprised if
	people would actually be happy with the resulting increase in
	exec() overhead, but mentioning it for completeness.  Of course,
	synchronize_rcu_expedited() has much shorter latency, and might
	work in this situation.  (Though please do let me know if you
	choose this approach -- it will mean that I need to worry about
	synchronize_rcu_expedited() scalability sooner rather than
	later!  Which is OK as long as I know.)

	This all assumes that it is OK for de_thread() to block, but I have
	no idea if this is the case.  (The third sketch below illustrates
	the intended sequencing.)
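
For concreteness, here is a rough sketch of the first option.  It is not
a tested patch: walk_threads_restart() and its visit() callback are made
up for illustration, and it assumes ->group_leader and the list pointers
are updated in an order that makes the re-check meaningful.

	/*
	 * Rough sketch of option 1 (hypothetical helper, not a tested
	 * patch).  On every step, re-check that the task we stepped to
	 * still regards g as its group leader; if not, we raced with
	 * exec(), so restart the walk.  Restarts are potentially
	 * unbounded (the response-time concern above), and visit()
	 * must be idempotent.
	 */
	static void walk_threads_restart(struct task_struct *g,
					 void (*visit)(struct task_struct *))
	{
		struct task_struct *t;

		rcu_read_lock();
	restart:
		t = g;
		do {
			visit(t);			/* must be idempotent */
			t = next_thread(t);
			if (t != g && ACCESS_ONCE(t->group_leader) != g)
				goto restart;		/* raced with exec() */
		} while (t != g);
		rcu_read_unlock();
	}

A real version would presumably also want a pid_alive(g)-style bailout
in case g itself has been unhashed, as in Oleg's pid_alive() proposal
above.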
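
Similarly, a rough sketch of the ->oldnext lifetime management from the
second option.  The ->oldnext and ->oldnext_rcu fields and both helpers
are hypothetical (they are not existing task_struct members), and which
task carries the field, and exactly when readers follow it, is left open
here just as it is above.

	/*
	 * Rough sketch of option 2: ->oldnext and ->oldnext_rcu are
	 * hypothetical fields, not existing members of task_struct.
	 * de_thread() parks the stale forward pointer for one grace
	 * period so that concurrent readers which notice the change of
	 * leadership can still find their way along the old list.
	 */
	static void oldnext_expired(struct rcu_head *head)
	{
		struct task_struct *victim =
			container_of(head, struct task_struct, oldnext_rcu);

		/* Grace period over: no pre-existing reader needs it. */
		victim->oldnext = NULL;
	}

	/*
	 * Update side: called from de_thread() before the list is
	 * rewritten, so that a reader which observes the new list layout
	 * also observes ->oldnext.  A later de_thread() would have to
	 * wait for ->oldnext to become NULL again.
	 */
	static void publish_oldnext(struct task_struct *victim,
				    struct task_struct *old_next)
	{
		victim->oldnext = old_next;
		smp_wmb();	/* publish ->oldnext before the list update */
		call_rcu(&victim->oldnext_rcu, oldnext_expired);
	}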
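
And the sequencing of the third option, again only a sketch:
unlink_from_thread_list() and finish_de_thread() are hypothetical
helpers standing in for the two halves of the operation, and this
assumes de_thread() is allowed to sleep at that point.

	/*
	 * Rough sketch of option 3 (hypothetical helpers, not the real
	 * de_thread()).  With threads A, B, C on the list and B being
	 * de_thread()ed: make A's pointer refer to C, wait a grace
	 * period so that no reader can still step onto B, and only then
	 * complete the operation.
	 */
	static void de_thread_incremental(struct task_struct *victim)
	{
		unlink_from_thread_list(victim);	/* A now points to C */

		/*
		 * The expedited variant keeps the added exec() latency
		 * small(er), at the cost of more work for RCU.
		 */
		synchronize_rcu_expedited();

		finish_de_thread(victim);	/* no reader can see B now */
	}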

> Please tell me I am wrong!

It would take a braver man than me to say that Oleg Nesterov is wrong!

							Thanx, Paul

> Oleg.
> 
> On 06/18, Oleg Nesterov wrote:
> >
> > check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
> > "softlockup: check all tasks in hung_task" commit ce9dbe24 looks
> > absolutely wrong.
> >
> > 	- rcu_lock_break() does put_task_struct(). If the task has exited
> > 	  it is not safe to even read its ->state, nothing protects this
> > 	  task_struct.
> >
> > 	- The TASK_DEAD checks are wrong too. Contrary to the comment, we
> > 	  can't use it to check if the task was unhashed. It can be unhashed
> > 	  without TASK_DEAD, or it can be valid with TASK_DEAD.
> >
> > 	  For example, an autoreaping task can do release_task(current)
> > 	  long before it sets TASK_DEAD in do_exit().
> >
> > 	  Or, a zombie task can have ->state == TASK_DEAD but release_task()
> > 	  was not called, and in this case we must not break the loop.
> >
> > Change this code to check pid_alive() instead, and do this before we
> > drop the reference to the task_struct.
> >
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > ---
> >
> >  kernel/hung_task.c |   11 +++++++----
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK	2009-12-18 19:05:38.000000000 +0100
> > +++ 35-rc2/kernel/hung_task.c	2010-06-18 20:06:11.000000000 +0200
> > @@ -113,15 +113,20 @@ static void check_hung_task(struct task_
> >   * For preemptible RCU it is sufficient to call rcu_read_unlock in order
> >   * exit the grace period. For classic RCU, a reschedule is required.
> >   */
> > -static void rcu_lock_break(struct task_struct *g, struct task_struct *t)
> > +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
> >  {
> > +	bool can_cont;
> > +
> >  	get_task_struct(g);
> >  	get_task_struct(t);
> >  	rcu_read_unlock();
> >  	cond_resched();
> >  	rcu_read_lock();
> > +	can_cont = pid_alive(g) && pid_alive(t);
> >  	put_task_struct(t);
> >  	put_task_struct(g);
> > +
> > +	return can_cont;
> >  }
> >
> >  /*
> > @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t
> >  			goto unlock;
> >  		if (!--batch_count) {
> >  			batch_count = HUNG_TASK_BATCHING;
> > -			rcu_lock_break(g, t);
> > -			/* Exit if t or g was unhashed during refresh. */
> > -			if (t->state == TASK_DEAD || g->state == TASK_DEAD)
> > +			if (!rcu_lock_break(g, t))
> >  				goto unlock;
> >  		}
> >  		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
> 

Thread overview: 45+ messages
2010-06-18 19:02 [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Oleg Nesterov
2010-06-18 19:34 ` while_each_thread() under rcu_read_lock() is broken? Oleg Nesterov
2010-06-18 21:08   ` Roland McGrath
2010-06-18 22:37     ` Oleg Nesterov
2010-06-18 22:33   ` Paul E. McKenney [this message]
2010-06-21 17:09     ` Oleg Nesterov
2010-06-21 17:44       ` Oleg Nesterov
2010-06-21 18:00         ` Oleg Nesterov
2010-06-21 19:02         ` Roland McGrath
2010-06-21 20:06           ` Oleg Nesterov
2010-06-21 21:19             ` Eric W. Biederman
2010-06-22 14:34               ` Oleg Nesterov
2010-07-08 23:59             ` Roland McGrath
2010-07-09  0:41               ` Paul E. McKenney
2010-07-09  1:01                 ` Roland McGrath
2010-07-09 16:18                   ` Paul E. McKenney
2010-06-21 20:51       ` Paul E. McKenney
2010-06-21 21:22         ` Eric W. Biederman
2010-06-21 21:38           ` Paul E. McKenney
2010-06-22 21:23         ` Oleg Nesterov
2010-06-22 22:12           ` Paul E. McKenney
2010-06-23 15:24             ` Oleg Nesterov
2010-06-24 18:07               ` Paul E. McKenney
2010-06-24 18:50                 ` Chris Friesen
2010-06-24 22:00                   ` Oleg Nesterov
2010-06-25  0:08                     ` Eric W. Biederman
2010-06-25  3:42                       ` Paul E. McKenney
2010-06-25 10:08                       ` Oleg Nesterov
2010-07-09  0:52                       ` Roland McGrath
2010-06-24 21:14                 ` Roland McGrath
2010-06-25  3:37                   ` Paul E. McKenney
2010-07-09  0:41                     ` Roland McGrath
2010-06-24 21:57                 ` Oleg Nesterov
2010-06-25  3:41                   ` Paul E. McKenney
2010-06-25  9:55                     ` Oleg Nesterov
2010-06-28 23:43                       ` Paul E. McKenney
2010-06-29 13:05                         ` Oleg Nesterov
2010-06-29 15:34                           ` Paul E. McKenney
2010-06-29 17:54                             ` Oleg Nesterov
2010-06-19  5:00   ` Mandeep Baines
2010-06-19  5:35     ` Frederic Weisbecker
2010-06-19 15:44       ` Mandeep Baines
2010-06-19 19:19     ` Oleg Nesterov
2010-06-18 20:11 ` [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Frederic Weisbecker
2010-06-18 20:38 ` Mandeep Singh Baines
