From: Steven Rostedt <rostedt@goodmis.org>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Austin Schuh <austin@peloton-tech.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Richard Weinberger <richard.weinberger@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>,
	rt-users <linux-rt-users@vger.kernel.org>
Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT
Date: Fri, 27 Jun 2014 10:01:57 -0400
Message-ID: <20140627100157.6b0143a5@gandalf.local.home>
In-Reply-To: <1403873856.5827.56.camel@marge.simpson.net>

On Fri, 27 Jun 2014 14:57:36 +0200
Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> On Thu, 2014-06-26 at 17:07 -0700, Austin Schuh wrote:
> 
> > I'm not sure where to go from there.  Any changes to the workpool to
> > try to fix that will be hard, or could affect latency significantly.
> 
> Oh what the hell, I'm out of frozen shark, may as well stock up.  Nobody
> else has posted spit yet.  I _know_ Steven has some shark, I saw tglx
> toss him a chunk.
> 
> It's the same root cause as -rt letting tasks schedule off with plugged
> IO, leading to deadlock scenarios.  Extending the way I dealt with that
> to deal with workqueue forward progress requirements.. works.. though it
> could perhaps use a face lift, tummy tuck.. and minor body-ectomy.

Yeah, I'd say ;-)

> 
> Subject: rtmutex: move pre/post schedule() progress guarantees to before we schedule
> 
> Queued IO can lead to IO deadlock should a task require wakeup from a task
> which is blocked on that queued IO, which is why we usually pull the plug
> while scheduling off.  In RT, pulling the plug could lead us to block on
> a contended sleeping lock while n the process of blocking.  Bad idea, so

"in the process"

> we don't, but that leaves us with various flavors of IO stall like below.
> 
> ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
> pulled.  dbench2 mutex owner is waiting for kjournald, who is waiting for
> the buffer queued by dbench1.  Game over.
> 
> The exact same situation can occur with workqueues.  We must notify the
> workqueue management that a worker is blocking, as if we don't, we can
> lock the box up completely.  It's too late to do that once we have arrived
> in schedule(), as we can't take a sleeping lock.
> 
> Therefore, move progress guarantee work up to before we reach the point of
> no return, and tell the scheduler that we're handling it early.
> 
> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> ---
>  include/linux/sched.h    |    2 +
>  kernel/locking/rtmutex.c |   59 +++++++++++++++++++++++++++++++++++++++++++----
>  kernel/sched/core.c      |   19 +++++++++++----
>  3 files changed, 72 insertions(+), 8 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1256,6 +1256,8 @@ struct task_struct {
>  	/* Revert to default priority/policy when forking */
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
> +	unsigned sched_presched:1;
> +	unsigned rtmutex_presched:1;
>  
>  	pid_t pid;
>  	pid_t tgid;
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -23,6 +23,7 @@
>  #include <linux/sched/deadline.h>
>  #include <linux/timer.h>
>  #include <linux/ww_mutex.h>
> +#include <linux/blkdev.h>
>  
>  #include "rtmutex_common.h"
>  
> @@ -782,6 +783,41 @@ void rt_mutex_adjust_pi(struct task_stru
>  }
>  
>  #ifdef CONFIG_PREEMPT_RT_FULL
> +#include "../workqueue_internal.h"
> +
> +static inline void rt_mutex_presched(struct task_struct *tsk)
> +{
> +	/* Recursive handling of presched work is a very bad idea. */

The above comment can be simplified to just:

	/* Recursion protection */

> +	if (tsk->rtmutex_presched || tsk->sched_presched)
> +		return;
> +
> +	/* Tell the scheduler we handle pre/post schedule() work. */
> +	tsk->rtmutex_presched = 1;
> +
> +	/*
> +	 * If a worker went to sleep, notify and ask workqueue whether
> +	 * it wants to wake up a task to maintain concurrency.
> +	 */
> +	if (tsk->flags & PF_WQ_WORKER)
> +		wq_worker_sleeping(tsk);
> +
> +	/*
> +	 * If we are going to sleep and we have plugged IO queued,
> +	 * make sure to submit it to avoid deadlocks.
> +	 */
> +	if (blk_needs_flush_plug(tsk))
> +		blk_schedule_flush_plug(tsk);
> +}
> +
> +static inline void rt_mutex_postsched(struct task_struct *tsk)
> +{
> +	if (!tsk->rtmutex_presched)
> +		return;
> +	if (tsk->flags & PF_WQ_WORKER)
> +		wq_worker_running(tsk);
> +	tsk->rtmutex_presched = 0;
> +}
> +
>  /*
>   * preemptible spin_lock functions:
>   */
> @@ -857,13 +893,23 @@ static void  noinline __sched rt_spin_lo
>  	struct rt_mutex_waiter waiter, *top_waiter;
>  	int ret;
>  
> +	/*
> +	 * We can't do presched work if we're already holding a lock
> +	 * else we can deadlock.  eg, if we're holding slab_lock,
> +	 * ksoftirqd can block while processing BLOCK_SOFTIRQ after
> +	 * having acquired q->queue_lock.  If _we_ then block on
> +	 * that q->queue_lock while flushing our plug, deadlock.
> +	 */
> +	if (__migrate_disabled(self) < 2)
> +		rt_mutex_presched(self);
> +
>  	rt_mutex_init_waiter(&waiter, true);
>  
>  	raw_spin_lock(&lock->wait_lock);
>  
>  	if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
>  		raw_spin_unlock(&lock->wait_lock);
> -		return;
> +		goto out;
>  	}
>  
>  	BUG_ON(rt_mutex_owner(lock) == self);
> @@ -928,6 +974,8 @@ static void  noinline __sched rt_spin_lo
>  	raw_spin_unlock(&lock->wait_lock);
>  
>  	debug_rt_mutex_free_waiter(&waiter);
> +out:
> +	rt_mutex_postsched(self);
>  }
>  
>  /*
> @@ -1274,18 +1322,20 @@ rt_mutex_slowlock(struct rt_mutex *lock,
>  		  int detect_deadlock, struct ww_acquire_ctx *ww_ctx)
>  {
>  	struct rt_mutex_waiter waiter;
> +	struct task_struct *self = current;
>  	int ret = 0;
>  
> +	rt_mutex_presched(self);
>  	rt_mutex_init_waiter(&waiter, false);
>  
>  	raw_spin_lock(&lock->wait_lock);
>  
>  	/* Try to acquire the lock again: */
> -	if (try_to_take_rt_mutex(lock, current, NULL)) {
> +	if (try_to_take_rt_mutex(lock, self, NULL)) {
>  		if (ww_ctx)
>  			ww_mutex_account_lock(lock, ww_ctx);
>  		raw_spin_unlock(&lock->wait_lock);
> -		return 0;
> +		goto out;
>  	}
>  
>  	set_current_state(state);
> @@ -1322,7 +1372,8 @@ rt_mutex_slowlock(struct rt_mutex *lock,
>  		hrtimer_cancel(&timeout->timer);
>  
>  	debug_rt_mutex_free_waiter(&waiter);
> -
> +out:
> +	rt_mutex_postsched(self);
>  	return ret;
>  }
>  
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2915,11 +2915,18 @@ static void __sched __schedule(void)
>  		goto need_resched;
>  }
>  
> -static inline void sched_submit_work(struct task_struct *tsk)
> +static inline void sched_presched_work(struct task_struct *tsk)
>  {
> +	/* It's being handled by rtmutex code */
> +	if (tsk->rtmutex_presched)
> +		return;
> +
>  	if (!tsk->state || tsk_is_pi_blocked(tsk))
>  		return;
>  
> +	/* Tell rtmutex presched code that we're handling it. */
> +	tsk->sched_presched = 1;
> +
>  	/*
>  	 * If a worker went to sleep, notify and ask workqueue whether
>  	 * it wants to wake up a task to maintain concurrency.
> @@ -2935,19 +2942,23 @@ static inline void sched_submit_work(str
>  		blk_schedule_flush_plug(tsk);
>  }
>  
> -static inline void sched_update_worker(struct task_struct *tsk)
> +static inline void sched_postsched_work(struct task_struct *tsk)
>  {
> +	/* It's being handled by rtmutex code */
> +	if (tsk->rtmutex_presched)
> +		return;
>  	if (tsk->flags & PF_WQ_WORKER)
>  		wq_worker_running(tsk);
> +	tsk->sched_presched = 0;
>  }
>  
>  asmlinkage void __sched schedule(void)
>  {
>  	struct task_struct *tsk = current;
>  
> -	sched_submit_work(tsk);
> +	sched_presched_work(tsk);
>  	__schedule();
> -	sched_update_worker(tsk);
> +	sched_postsched_work(tsk);
>  }

This seems like a lot of hacks. I'm wondering if it would work if we
just have rt_spin_lock_slowlock() not call schedule(), but call
__schedule() directly. I mean, it would keep with the mainline
paradigm, as spinlocks don't sleep there, and a task going to sleep in
the -rt kernel is similar to it being preempted by a very long NMI.

Does a spin_lock going to sleep really need to do all the presched and
postsched work?
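
Something along these lines is all I'm picturing (completely untested
sketch, quoting the -rt slowlock wait loop from memory; __schedule()
would also need to be made callable from rtmutex.c):

	/* in rt_spin_lock_slowlock()'s wait loop */
	if (top_waiter != &waiter || adaptive_wait(lock, lock_owner)) {
		/*
		 * Block via __schedule() directly, skipping
		 * sched_submit_work()/sched_update_worker(), the same
		 * way a mainline spinlock never enters schedule() at all.
		 */
		__schedule();	/* instead of schedule_rt_mutex(lock) */
	}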


-- Steve



>  EXPORT_SYMBOL(schedule);
>  
> 


