* [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
@ 2019-05-07 17:05 Rik van Riel
  2019-05-07 21:22 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread

From: Rik van Riel @ 2019-05-07 17:05 UTC (permalink / raw)
To: linux-xfs; +Cc: kernel-team, linux-kernel, Darrick J. Wong, David Chinner

The code in xlog_wait uses the spinlock to make adding the task to
the wait queue, and setting the task state to UNINTERRUPTIBLE, atomic
with respect to the waker.

Doing the wakeup after releasing the spinlock opens up the following
race condition:

- add task to wait queue

- wake up task

- set task state to UNINTERRUPTIBLE

Simply moving the spin_unlock to after the wake_up_all results
in the waker not being able to see a task on the waitqueue before
it has set its state to UNINTERRUPTIBLE.

The lock ordering of taking the waitqueue lock inside the l_icloglock
is already used inside xlog_wait; it is unclear why the waker was doing
things differently.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Chris Mason <clm@fb.com>

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c3b610b687d1..8b9be76b2412 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2710,7 +2710,6 @@ xlog_state_do_callback(
 	int		funcdidcallbacks; /* flag: function did callbacks */
 	int		repeats; /* for issuing console warnings if
				  * looping too many times */
-	int		wake = 0;

 	spin_lock(&log->l_icloglock);
 	first_iclog = iclog = log->l_iclog;
@@ -2912,11 +2911,9 @@ xlog_state_do_callback(
 #endif

 	if (log->l_iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_IOERROR))
-		wake = 1;
-	spin_unlock(&log->l_icloglock);
-
-	if (wake)
 		wake_up_all(&log->l_flush_wait);
+
+	spin_unlock(&log->l_icloglock);
 }

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-07 17:05 [PATCH] fs,xfs: fix missed wakeup on l_flush_wait Rik van Riel
@ 2019-05-07 21:22 ` Dave Chinner
  2019-05-08 14:08   ` Rik van Riel
  2019-05-08 16:39   ` Chris Mason
  0 siblings, 2 replies; 7+ messages in thread

From: Dave Chinner @ 2019-05-07 21:22 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-xfs, kernel-team, linux-kernel, Darrick J. Wong, David Chinner

On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> The code in xlog_wait uses the spinlock to make adding the task to
> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> with respect to the waker.
>
> Doing the wakeup after releasing the spinlock opens up the following
> race condition:
>
> - add task to wait queue
>
> - wake up task
>
> - set task state to UNINTERRUPTIBLE
>
> Simply moving the spin_unlock to after the wake_up_all results
> in the waker not being able to see a task on the waitqueue before
> it has set its state to UNINTERRUPTIBLE.

Yup, seems like an issue. Good find, Rik.

So, what problem is this actually fixing? Was it noticed by
inspection, or is it actually manifesting on production machines?
If it is manifesting IRL, what are the symptoms (e.g. hang running
out of log space?) and do you have a test case or any way to
exercise it easily?

And, FWIW, did you check all the other xlog_wait() users for the
same problem?

> The lock ordering of taking the waitqueue lock inside the l_icloglock
> is already used inside xlog_wait; it is unclear why the waker was doing
> things differently.

Historical, most likely, and the wakeup code has changed in years
gone by and a race condition that rarely manifests is unlikely to be
noticed.

....

Yeah, the conversion from counting semaphore outside the iclog lock
to use wait queues in 2008 introduced this bug. The wait queue
addition was moved inside the lock, the wakeup left outside.

So:

Fixes: d748c62367eb ("[XFS] Convert l_flushsema to a sv_t")

Apart from the commit message, the change looks fine.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-07 21:22 ` Dave Chinner
@ 2019-05-08 14:08   ` Rik van Riel
  2019-05-08 21:32     ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread

From: Rik van Riel @ 2019-05-08 14:08 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, kernel-team, linux-kernel, Darrick J. Wong, David Chinner

On Wed, 2019-05-08 at 07:22 +1000, Dave Chinner wrote:
> On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> > The code in xlog_wait uses the spinlock to make adding the task to
> > the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> > with respect to the waker.
> >
> > Doing the wakeup after releasing the spinlock opens up the following
> > race condition:
> >
> > - add task to wait queue
> >
> > - wake up task
> >
> > - set task state to UNINTERRUPTIBLE
> >
> > Simply moving the spin_unlock to after the wake_up_all results
> > in the waker not being able to see a task on the waitqueue before
> > it has set its state to UNINTERRUPTIBLE.
>
> Yup, seems like an issue. Good find, Rik.
>
> So, what problem is this actually fixing? Was it noticed by
> inspection, or is it actually manifesting on production machines?
> If it is manifesting IRL, what are the symptoms (e.g. hang running
> out of log space?) and do you have a test case or any way to
> exercise it easily?

Chris spotted a hung kworker task, in UNINTERRUPTIBLE
state, but with an empty wait queue. This does not seem
like something that is easily reproducible.

> And, FWIW, did you check all the other xlog_wait() users for the
> same problem?

I did not, but am looking now. The xlog_wait code itself
is fine, but it seems there are a few other wakers that
are doing the wakeup after releasing the lock.

It looks like xfs_log_force_umount() and the other wakeup
in xlog_state_do_callback() suffer from the same issue.

> > The lock ordering of taking the waitqueue lock inside the l_icloglock
> > is already used inside xlog_wait; it is unclear why the waker was doing
> > things differently.
>
> Historical, most likely, and the wakeup code has changed in years
> gone by and a race condition that rarely manifests is unlikely to be
> noticed.
>
> ....
>
> Yeah, the conversion from counting semaphore outside the iclog lock
> to use wait queues in 2008 introduced this bug. The wait queue
> addition was moved inside the lock, the wakeup left outside.

So: It looks like that conversion may have introduced the same
bug in multiple places.

That first wakeup in xlog_state_do_callback() looks pretty
straightforward. That spin_unlock could be moved down as well,
or a lock & unlock pair could be placed around the wake_up_all.

I am not sure about xfs_log_force_umount(). Could the unlock
be moved to after the wake_up_all, or does that create lock
ordering issues with the xlog_grant_head_wake_all calls?
Could a simple lock + unlock of log->l_icloglock around the
wake_up_all do the trick, or is there some other state that
also needs to stay locked?

--
All Rights Reversed.
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-08 14:08 ` Rik van Riel
@ 2019-05-08 21:32   ` Dave Chinner
  2019-05-09 14:27     ` Rik van Riel
  0 siblings, 1 reply; 7+ messages in thread

From: Dave Chinner @ 2019-05-08 21:32 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-xfs, kernel-team, linux-kernel, Darrick J. Wong, David Chinner

On Wed, May 08, 2019 at 10:08:59AM -0400, Rik van Riel wrote:
> On Wed, 2019-05-08 at 07:22 +1000, Dave Chinner wrote:
> > On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> > > The code in xlog_wait uses the spinlock to make adding the task to
> > > the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> > > with respect to the waker.
> > >
> > > Doing the wakeup after releasing the spinlock opens up the following
> > > race condition:
> > >
> > > - add task to wait queue
> > >
> > > - wake up task
> > >
> > > - set task state to UNINTERRUPTIBLE
> > >
> > > Simply moving the spin_unlock to after the wake_up_all results
> > > in the waker not being able to see a task on the waitqueue before
> > > it has set its state to UNINTERRUPTIBLE.
> >
> > Yup, seems like an issue. Good find, Rik.
> >
> > So, what problem is this actually fixing? Was it noticed by
> > inspection, or is it actually manifesting on production machines?
> > If it is manifesting IRL, what are the symptoms (e.g. hang running
> > out of log space?) and do you have a test case or any way to
> > exercise it easily?
>
> Chris spotted a hung kworker task, in UNINTERRUPTIBLE
> state, but with an empty wait queue. This does not seem
> like something that is easily reproducible.

Yeah, I just read that, not something we can trigger with a
regression test :P

> > And, FWIW, did you check all the other xlog_wait() users for the
> > same problem?
>
> I did not, but am looking now. The xlog_wait code itself
> is fine, but it seems there are a few other wakers that
> are doing the wakeup after releasing the lock.
>
> It looks like xfs_log_force_umount() and the other wakeup
> in xlog_state_do_callback() suffer from the same issue.

Hmmm, the first wakeup in xsdc is this one, right:

	/* wake up threads waiting in xfs_log_force() */
	wake_up_all(&iclog->ic_force_wait);

At the end of the iclog iteration loop? That one is under the
l_icloglock - the lock is dropped to run callbacks, then picked up
again once the callbacks are done and before the ic_callback_lock is
dropped (about 10 lines above the wakeup). So unless I'm missing
something (like enough coffee!) that one looks fine.

.....

> I am not sure about xfs_log_force_umount(). Could the unlock
> be moved to after the wake_up_all, or does that create lock
> ordering issues with the xlog_grant_head_wake_all calls?
> Could a simple lock + unlock of log->l_icloglock around the
> wake_up_all do the trick, or is there some other state that
> also needs to stay locked?

Need to be careful which lock is used with which wait queue :)

This one is waking the xc_commit_wait queue (CIL push commit
sequencing wait queue), which is protected by the
log->l_cilp->xc_push_lock. That should nest just fine inside any
locks we are holding at this point, so you should just be able to
wrap it. It's not a common code path, though, it'll only hit this
code when the filesystem is already considered to be in an
unrecoverable state.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-08 21:32 ` Dave Chinner
@ 2019-05-09 14:27   ` Rik van Riel
  0 siblings, 0 replies; 7+ messages in thread

From: Rik van Riel @ 2019-05-09 14:27 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, kernel-team, linux-kernel, Darrick J. Wong, David Chinner

On Thu, 2019-05-09 at 07:32 +1000, Dave Chinner wrote:

> Hmmm, the first wakeup in xsdc is this one, right:
>
>	/* wake up threads waiting in xfs_log_force() */
>	wake_up_all(&iclog->ic_force_wait);
>
> At the end of the iclog iteration loop? That one is under the
> l_icloglock - the lock is dropped to run callbacks, then picked up
> again once the callbacks are done and before the ic_callback_lock is
> dropped (about 10 lines above the wakeup). So unless I'm missing
> something (like enough coffee!) that one looks fine.

You are right. I failed to spot that the spin_unlock
unlocks a different lock than the spin_lock above it
takes.

> .....
>
> > I am not sure about xfs_log_force_umount(). Could the unlock
> > be moved to after the wake_up_all, or does that create lock
> > ordering issues with the xlog_grant_head_wake_all calls?
> > Could a simple lock + unlock of log->l_icloglock around the
> > wake_up_all do the trick, or is there some other state that
> > also needs to stay locked?
>
> Need to be careful which lock is used with which wait queue :)
>
> This one is waking the xc_commit_wait queue (CIL push commit
> sequencing wait queue), which is protected by the
> log->l_cilp->xc_push_lock. That should nest just fine inside any
> locks we are holding at this point, so you should just be able to
> wrap it. It's not a common code path, though, it'll only hit this
> code when the filesystem is already considered to be in an
> unrecoverable state.

Awesome.

--
All Rights Reversed.
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-07 21:22 ` Dave Chinner
  2019-05-08 14:08   ` Rik van Riel
@ 2019-05-08 16:39   ` Chris Mason
  2019-05-08 21:40     ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread

From: Chris Mason @ 2019-05-08 16:39 UTC (permalink / raw)
To: Dave Chinner
Cc: Rik van Riel, linux-xfs@vger.kernel.org, Kernel Team,
	linux-kernel@vger.kernel.org, Darrick J. Wong, David Chinner

On 7 May 2019, at 17:22, Dave Chinner wrote:

> On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
>> The code in xlog_wait uses the spinlock to make adding the task to
>> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
>> with respect to the waker.
>>
>> Doing the wakeup after releasing the spinlock opens up the following
>> race condition:
>>
>> - add task to wait queue
>>
>> - wake up task
>>
>> - set task state to UNINTERRUPTIBLE
>>
>> Simply moving the spin_unlock to after the wake_up_all results
>> in the waker not being able to see a task on the waitqueue before
>> it has set its state to UNINTERRUPTIBLE.
>
> Yup, seems like an issue. Good find, Rik.
>
> So, what problem is this actually fixing? Was it noticed by
> inspection, or is it actually manifesting on production machines?
> If it is manifesting IRL, what are the symptoms (e.g. hang running
> out of log space?) and do you have a test case or any way to
> exercise it easily?

The steps to reproduce are semi-complicated, they create a bunch of
files, do stuff, and then delete all the files in a loop. I think they
shotgunned it across 500 or so machines to trigger 5 times, and then
left the wreckage for us to poke at.

The symptoms were identical to the bug fixed here:

commit 696a562072e3c14bcd13ae5acc19cdf27679e865
Author: Brian Foster <bfoster@redhat.com>
Date:   Tue Mar 28 14:51:44 2017 -0700

    xfs: use dedicated log worker wq to avoid deadlock with cil wq

But since our 4.16 kernel is newer than that, I briefly hoped that
m_sync_workqueue needed to be flagged with WQ_MEM_RECLAIM. I don't
have a great picture of how all of these workqueues interact, but I
do think it needs WQ_MEM_RECLAIM. It can't be the cause of this
deadlock, the workqueue watchdog would have fired.

Rik mentioned that I found sleeping procs with an empty iclog
waitqueue list, which is when he noticed this race. We sent a wakeup
to the sleeping process, and ftrace showed the process looping back
around to sleep on the iclog again. Long story short, Rik's patch
definitely wouldn't have prevented the deadlock, and the iclog
waitqueue I was poking must not have been the same one that process
was sleeping on.

The actual problem ended up being the blkmq IO schedulers sitting on
a request. Switching schedulers makes the box come back to life, so
it's either a kyber bug or slightly higher up in blkmqland.

That's a huge tangent around acking Rik's patch, but it's hard to be
sure if we've hit the lost wakeup in prod. I could search through all
the related hung task timeouts, but they are probably all stuck in
blkmq.

Acked-but-I'm-still-blaming-Jens-by: Chris Mason <clm@fb.com>

-chris
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-08 16:39 ` Chris Mason
@ 2019-05-08 21:40   ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread

From: Dave Chinner @ 2019-05-08 21:40 UTC (permalink / raw)
To: Chris Mason
Cc: Rik van Riel, linux-xfs@vger.kernel.org, Kernel Team,
	linux-kernel@vger.kernel.org, Darrick J. Wong, David Chinner

On Wed, May 08, 2019 at 04:39:41PM +0000, Chris Mason wrote:
> On 7 May 2019, at 17:22, Dave Chinner wrote:
>
> > On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> >> The code in xlog_wait uses the spinlock to make adding the task to
> >> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> >> with respect to the waker.
> >>
> >> Doing the wakeup after releasing the spinlock opens up the following
> >> race condition:
> >>
> >> - add task to wait queue
> >>
> >> - wake up task
> >>
> >> - set task state to UNINTERRUPTIBLE
> >>
> >> Simply moving the spin_unlock to after the wake_up_all results
> >> in the waker not being able to see a task on the waitqueue before
> >> it has set its state to UNINTERRUPTIBLE.
> >
> > Yup, seems like an issue. Good find, Rik.
> >
> > So, what problem is this actually fixing? Was it noticed by
> > inspection, or is it actually manifesting on production machines?
> > If it is manifesting IRL, what are the symptoms (e.g. hang running
> > out of log space?) and do you have a test case or any way to
> > exercise it easily?
>
> The steps to reproduce are semi-complicated, they create a bunch of
> files, do stuff, and then delete all the files in a loop. I think they
> shotgunned it across 500 or so machines to trigger 5 times, and then
> left the wreckage for us to poke at.
>
> The symptoms were identical to the bug fixed here:
>
> commit 696a562072e3c14bcd13ae5acc19cdf27679e865
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Tue Mar 28 14:51:44 2017 -0700
>
>     xfs: use dedicated log worker wq to avoid deadlock with cil wq
>
> But since our 4.16 kernel is newer than that, I briefly hoped that
> m_sync_workqueue needed to be flagged with WQ_MEM_RECLAIM. I don't
> have a great picture of how all of these workqueues interact, but I
> do think it needs WQ_MEM_RECLAIM. It can't be the cause of this
> deadlock, the workqueue watchdog would have fired.

It shouldn't matter, because the m_sync_workqueue is largely
advisory - the only real function it has is to ensure that idle
filesystems have dirty metadata flushed and the journal emptied and
marked clean (that's what "covering the log" means) so if we crash
on an idle filesystem recovery is unnecessary....

i.e. if the filesystem is heavily busy it doesn't matter whether the
m_sync_workqueue is run or not.

....

> That's a huge tangent around acking Rik's patch, but it's hard to be
> sure if we've hit the lost wakeup in prod. I could search through all
> the related hung task timeouts, but they are probably all stuck in
> blkmq.
>
> Acked-but-I'm-still-blaming-Jens-by: Chris Mason <clm@fb.com>

No worries, quite the wild goose chase. :)

I just wanted some background on how it manifested so that we have
some idea of whether we have other unresolved bug reports that might
be a result of this problem.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads: [~2019-05-09 14:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-07 17:05 [PATCH] fs,xfs: fix missed wakeup on l_flush_wait Rik van Riel
2019-05-07 21:22 ` Dave Chinner
2019-05-08 14:08   ` Rik van Riel
2019-05-08 21:32     ` Dave Chinner
2019-05-09 14:27       ` Rik van Riel
2019-05-08 16:39   ` Chris Mason
2019-05-08 21:40     ` Dave Chinner