* [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
@ 2019-05-07 17:05 Rik van Riel
  2019-05-07 21:22 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread

From: Rik van Riel @ 2019-05-07 17:05 UTC (permalink / raw)
To: linux-xfs; +Cc: kernel-team, linux-kernel, Darrick J. Wong, David Chinner

The code in xlog_wait uses the spinlock to make adding the task to
the wait queue, and setting the task state to UNINTERRUPTIBLE, atomic
with respect to the waker.

Doing the wakeup after releasing the spinlock opens up the following
race condition:

- add task to wait queue

- wake up task

- set task state to UNINTERRUPTIBLE

Simply moving the spin_unlock to after the wake_up_all results
in the waker not being able to see a task on the waitqueue before
it has set its state to UNINTERRUPTIBLE.

The lock ordering of taking the waitqueue lock inside the l_icloglock
is already used inside xlog_wait; it is unclear why the waker was doing
things differently.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Chris Mason <clm@fb.com>

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c3b610b687d1..8b9be76b2412 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2710,7 +2710,6 @@ xlog_state_do_callback(
 	int		funcdidcallbacks; /* flag: function did callbacks */
 	int		repeats; /* for issuing console warnings if
				  * looping too many times */
-	int		wake = 0;

 	spin_lock(&log->l_icloglock);
 	first_iclog = iclog = log->l_iclog;
@@ -2912,11 +2911,9 @@ xlog_state_do_callback(
 #endif

 	if (log->l_iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_IOERROR))
-		wake = 1;
-	spin_unlock(&log->l_icloglock);
-
-	if (wake)
 		wake_up_all(&log->l_flush_wait);
+
+	spin_unlock(&log->l_icloglock);
 }

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-07 17:05 [PATCH] fs,xfs: fix missed wakeup on l_flush_wait Rik van Riel
@ 2019-05-07 21:22 ` Dave Chinner
  2019-05-08 14:08   ` Rik van Riel
  2019-05-08 16:39   ` Chris Mason
  0 siblings, 2 replies; 7+ messages in thread

From: Dave Chinner @ 2019-05-07 21:22 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-xfs, kernel-team, linux-kernel, Darrick J. Wong, David Chinner

On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> The code in xlog_wait uses the spinlock to make adding the task to
> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> with respect to the waker.
>
> Doing the wakeup after releasing the spinlock opens up the following
> race condition:
>
> - add task to wait queue
>
> - wake up task
>
> - set task state to UNINTERRUPTIBLE
>
> Simply moving the spin_unlock to after the wake_up_all results
> in the waker not being able to see a task on the waitqueue before
> it has set its state to UNINTERRUPTIBLE.

Yup, seems like an issue. Good find, Rik.

So, what problem is this actually fixing? Was it noticed by
inspection, or is it actually manifesting on production machines?
If it is manifesting IRL, what are the symptoms (e.g. hang running
out of log space?) and do you have a test case or any way to
exercise it easily?

And, FWIW, did you check all the other xlog_wait() users for the
same problem?

> The lock ordering of taking the waitqueue lock inside the l_icloglock
> is already used inside xlog_wait; it is unclear why the waker was doing
> things differently.

Historical, most likely, and the wakeup code has changed in years
gone by and a race condition that rarely manifests is unlikely to be
noticed.

....

Yeah, the conversion from counting semaphore outside the iclog lock
to use wait queues in 2008 introduced this bug. The wait queue
addition was moved inside the lock, the wakeup left outside.

So:

Fixes: d748c62367eb ("[XFS] Convert l_flushsema to a sv_t")

Apart from the commit message, the change looks fine.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-07 21:22 ` Dave Chinner
@ 2019-05-08 14:08   ` Rik van Riel
  2019-05-08 21:32     ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread

From: Rik van Riel @ 2019-05-08 14:08 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, kernel-team, linux-kernel, Darrick J. Wong, David Chinner

On Wed, 2019-05-08 at 07:22 +1000, Dave Chinner wrote:
> On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> > The code in xlog_wait uses the spinlock to make adding the task to
> > the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> > with respect to the waker.
> >
> > Doing the wakeup after releasing the spinlock opens up the following
> > race condition:
> >
> > - add task to wait queue
> >
> > - wake up task
> >
> > - set task state to UNINTERRUPTIBLE
> >
> > Simply moving the spin_unlock to after the wake_up_all results
> > in the waker not being able to see a task on the waitqueue before
> > it has set its state to UNINTERRUPTIBLE.
>
> Yup, seems like an issue. Good find, Rik.
>
> So, what problem is this actually fixing? Was it noticed by
> inspection, or is it actually manifesting on production machines?
> If it is manifesting IRL, what are the symptoms (e.g. hang running
> out of log space?) and do you have a test case or any way to
> exercise it easily?

Chris spotted a hung kworker task, in UNINTERRUPTIBLE
state, but with an empty wait queue. This does not seem
like something that is easily reproducible.

> And, FWIW, did you check all the other xlog_wait() users for the
> same problem?

I did not, but am looking now. The xlog_wait code itself
is fine, but it seems there are a few other wakers that
are doing the wakeup after releasing the lock.

It looks like xfs_log_force_umount() and the other wakeup
in xlog_state_do_callback() suffer from the same issue.

> > The lock ordering of taking the waitqueue lock inside the l_icloglock
> > is already used inside xlog_wait; it is unclear why the waker was doing
> > things differently.
>
> Historical, most likely, and the wakeup code has changed in years
> gone by and a race condition that rarely manifests is unlikely to be
> noticed.
>
> ....
>
> Yeah, the conversion from counting semaphore outside the iclog lock
> to use wait queues in 2008 introduced this bug. The wait queue
> addition was moved inside the lock, the wakeup left outside.

So: It looks like that conversion may have introduced the same
bug in multiple places.

That first wakeup in xlog_state_do_callback() looks pretty
straightforward. That spin_unlock could be moved down as well,
or a lock & unlock pair could be placed around the wake_up_all.

I am not sure about xfs_log_force_umount(). Could the unlock
be moved to after the wake_up_all, or does that create lock
ordering issues with the xlog_grant_head_wake_all calls?
Could a simple lock + unlock of log->l_icloglock around the
wake_up_all do the trick, or is there some other state that
also needs to stay locked?

--
All Rights Reversed.
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-08 14:08 ` Rik van Riel
@ 2019-05-08 21:32   ` Dave Chinner
  2019-05-09 14:27     ` Rik van Riel
  0 siblings, 1 reply; 7+ messages in thread

From: Dave Chinner @ 2019-05-08 21:32 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-xfs, kernel-team, linux-kernel, Darrick J. Wong, David Chinner

On Wed, May 08, 2019 at 10:08:59AM -0400, Rik van Riel wrote:
> On Wed, 2019-05-08 at 07:22 +1000, Dave Chinner wrote:
> > On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> > > The code in xlog_wait uses the spinlock to make adding the task to
> > > the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> > > with respect to the waker.
> > >
> > > Doing the wakeup after releasing the spinlock opens up the following
> > > race condition:
> > >
> > > - add task to wait queue
> > >
> > > - wake up task
> > >
> > > - set task state to UNINTERRUPTIBLE
> > >
> > > Simply moving the spin_unlock to after the wake_up_all results
> > > in the waker not being able to see a task on the waitqueue before
> > > it has set its state to UNINTERRUPTIBLE.
> >
> > Yup, seems like an issue. Good find, Rik.
> >
> > So, what problem is this actually fixing? Was it noticed by
> > inspection, or is it actually manifesting on production machines?
> > If it is manifesting IRL, what are the symptoms (e.g. hang running
> > out of log space?) and do you have a test case or any way to
> > exercise it easily?
>
> Chris spotted a hung kworker task, in UNINTERRUPTIBLE
> state, but with an empty wait queue. This does not seem
> like something that is easily reproducible.

Yeah, I just read that, not something we can trigger with a
regression test :P

> > And, FWIW, did you check all the other xlog_wait() users for the
> > same problem?
>
> I did not, but am looking now. The xlog_wait code itself
> is fine, but it seems there are a few other wakers that
> are doing the wakeup after releasing the lock.
>
> It looks like xfs_log_force_umount() and the other wakeup
> in xlog_state_do_callback() suffer from the same issue.

Hmmm, the first wakeup in xsdc is this one, right:

	/* wake up threads waiting in xfs_log_force() */
	wake_up_all(&iclog->ic_force_wait);

At the end of the iclog iteration loop? That one is under the
l_icloglock - the lock is dropped to run callbacks, then picked up
again once the callbacks are done and before the ic_callback_lock is
dropped (about 10 lines above the wakeup). So unless I'm missing
something (like enough coffee!) that one looks fine.

.....

> I am not sure about xfs_log_force_umount(). Could the unlock
> be moved to after the wake_up_all, or does that create lock
> ordering issues with the xlog_grant_head_wake_all calls?
> Could a simple lock + unlock of log->l_icloglock around the
> wake_up_all do the trick, or is there some other state that
> also needs to stay locked?

Need to be careful which lock is used with which wait queue :)

This one is waking the xc_commit_wait queue (CIL push commit
sequencing wait queue), which is protected by the
log->l_cilp->xc_push_lock. That should nest just fine inside any
locks we are holding at this point, so you should just be able to
wrap it. It's not a common code path, though, it'll only hit this
code when the filesystem is already considered to be in an
unrecoverable state.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-08 21:32 ` Dave Chinner
@ 2019-05-09 14:27   ` Rik van Riel
  0 siblings, 0 replies; 7+ messages in thread

From: Rik van Riel @ 2019-05-09 14:27 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, kernel-team, linux-kernel, Darrick J. Wong, David Chinner

On Thu, 2019-05-09 at 07:32 +1000, Dave Chinner wrote:

> Hmmm, the first wakeup in xsdc is this one, right:
>
>	/* wake up threads waiting in xfs_log_force() */
>	wake_up_all(&iclog->ic_force_wait);
>
> At the end of the iclog iteration loop? That one is under the
> l_icloglock - the lock is dropped to run callbacks, then picked up
> again once the callbacks are done and before the ic_callback_lock is
> dropped (about 10 lines above the wakeup). So unless I'm missing
> something (like enough coffee!) that one looks fine.

You are right. I failed to spot that the spin_unlock
unlocks a different lock than the spin_lock above it
takes.

> .....
>
> > I am not sure about xfs_log_force_umount(). Could the unlock
> > be moved to after the wake_up_all, or does that create lock
> > ordering issues with the xlog_grant_head_wake_all calls?
> > Could a simple lock + unlock of log->l_icloglock around the
> > wake_up_all do the trick, or is there some other state that
> > also needs to stay locked?
>
> Need to be careful which lock is used with which wait queue :)
>
> This one is waking the xc_commit_wait queue (CIL push commit
> sequencing wait queue), which is protected by the
> log->l_cilp->xc_push_lock. That should nest just fine inside any
> locks we are holding at this point, so you should just be able to
> wrap it. It's not a common code path, though, it'll only hit this
> code when the filesystem is already considered to be in an
> unrecoverable state.

Awesome.

--
All Rights Reversed.
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-07 21:22 ` Dave Chinner
  2019-05-08 14:08   ` Rik van Riel
@ 2019-05-08 16:39   ` Chris Mason
  2019-05-08 21:40     ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread

From: Chris Mason @ 2019-05-08 16:39 UTC (permalink / raw)
To: Dave Chinner
Cc: Rik van Riel, linux-xfs@vger.kernel.org, Kernel Team,
	linux-kernel@vger.kernel.org, Darrick J. Wong, David Chinner

On 7 May 2019, at 17:22, Dave Chinner wrote:

> On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
>> The code in xlog_wait uses the spinlock to make adding the task to
>> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
>> with respect to the waker.
>>
>> Doing the wakeup after releasing the spinlock opens up the following
>> race condition:
>>
>> - add task to wait queue
>>
>> - wake up task
>>
>> - set task state to UNINTERRUPTIBLE
>>
>> Simply moving the spin_unlock to after the wake_up_all results
>> in the waker not being able to see a task on the waitqueue before
>> it has set its state to UNINTERRUPTIBLE.
>
> Yup, seems like an issue. Good find, Rik.
>
> So, what problem is this actually fixing? Was it noticed by
> inspection, or is it actually manifesting on production machines?
> If it is manifesting IRL, what are the symptoms (e.g. hang running
> out of log space?) and do you have a test case or any way to
> exercise it easily?

The steps to reproduce are semi-complicated, they create a bunch of
files, do stuff, and then delete all the files in a loop. I think they
shotgunned it across 500 or so machines to trigger 5 times, and then
left the wreckage for us to poke at.

The symptoms were identical to the bug fixed here:

commit 696a562072e3c14bcd13ae5acc19cdf27679e865
Author: Brian Foster <bfoster@redhat.com>
Date:   Tue Mar 28 14:51:44 2017 -0700

    xfs: use dedicated log worker wq to avoid deadlock with cil wq

But since our 4.16 kernel is newer than that, I briefly hoped that
m_sync_workqueue needed to be flagged with WQ_MEM_RECLAIM. I don't
have a great picture of how all of these workqueues interact, but I
do think it needs WQ_MEM_RECLAIM. It can't be the cause of this
deadlock, the workqueue watchdog would have fired.

Rik mentioned that I found sleeping procs with an empty iclog
waitqueue list, which is when he noticed this race. We sent a wakeup
to the sleeping process, and ftrace showed the process looping back
around to sleep on the iclog again. Long story short, Rik's patch
definitely wouldn't have prevented the deadlock, and the iclog
waitqueue I was poking must not have been the same one that process
was sleeping on.

The actual problem ended up being the blkmq IO schedulers sitting on
a request. Switching schedulers makes the box come back to life, so
it's either a kyber bug or slightly higher up in blkmqland.

That's a huge tangent around acking Rik's patch, but it's hard to be
sure if we've hit the lost wakeup in prod. I could search through all
the related hung task timeouts, but they are probably all stuck in
blkmq.

Acked-but-I'm-still-blaming-Jens-by: Chris Mason <clm@fb.com>

-chris
* Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
  2019-05-08 16:39 ` Chris Mason
@ 2019-05-08 21:40   ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread

From: Dave Chinner @ 2019-05-08 21:40 UTC (permalink / raw)
To: Chris Mason
Cc: Rik van Riel, linux-xfs@vger.kernel.org, Kernel Team,
	linux-kernel@vger.kernel.org, Darrick J. Wong, David Chinner

On Wed, May 08, 2019 at 04:39:41PM +0000, Chris Mason wrote:
> On 7 May 2019, at 17:22, Dave Chinner wrote:
>
> > On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> >> The code in xlog_wait uses the spinlock to make adding the task to
> >> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> >> with respect to the waker.
> >>
> >> Doing the wakeup after releasing the spinlock opens up the following
> >> race condition:
> >>
> >> - add task to wait queue
> >>
> >> - wake up task
> >>
> >> - set task state to UNINTERRUPTIBLE
> >>
> >> Simply moving the spin_unlock to after the wake_up_all results
> >> in the waker not being able to see a task on the waitqueue before
> >> it has set its state to UNINTERRUPTIBLE.
> >
> > Yup, seems like an issue. Good find, Rik.
> >
> > So, what problem is this actually fixing? Was it noticed by
> > inspection, or is it actually manifesting on production machines?
> > If it is manifesting IRL, what are the symptoms (e.g. hang running
> > out of log space?) and do you have a test case or any way to
> > exercise it easily?
>
> The steps to reproduce are semi-complicated, they create a bunch of
> files, do stuff, and then delete all the files in a loop. I think they
> shotgunned it across 500 or so machines to trigger 5 times, and then
> left the wreckage for us to poke at.
>
> The symptoms were identical to the bug fixed here:
>
> commit 696a562072e3c14bcd13ae5acc19cdf27679e865
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Tue Mar 28 14:51:44 2017 -0700
>
>     xfs: use dedicated log worker wq to avoid deadlock with cil wq
>
> But since our 4.16 kernel is newer than that, I briefly hoped that
> m_sync_workqueue needed to be flagged with WQ_MEM_RECLAIM. I don't
> have a great picture of how all of these workqueues interact, but I
> do think it needs WQ_MEM_RECLAIM. It can't be the cause of this
> deadlock, the workqueue watchdog would have fired.

It shouldn't matter, because the m_sync_workqueue is largely
advisory - the only real function it has is to ensure that idle
filesystems have dirty metadata flushed and the journal emptied and
marked clean (that's what "covering the log" means) so if we crash
on an idle filesystem recovery is unnecessary....

i.e. if the filesystem is heavily busy it doesn't matter whether the
m_sync_workqueue is run or not.

....

> That's a huge tangent around acking Rik's patch, but it's hard to be
> sure if we've hit the lost wakeup in prod. I could search through all
> the related hung task timeouts, but they are probably all stuck in
> blkmq.
>
> Acked-but-I'm-still-blaming-Jens-by: Chris Mason <clm@fb.com>

No worries, quite the wild goose chase. :)

I just wanted some background on how it manifested so that we have
some idea of whether we have other unresolved bug reports that might
be a result of this problem.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads: [~2019-05-09 14:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-07 17:05 [PATCH] fs,xfs: fix missed wakeup on l_flush_wait Rik van Riel
2019-05-07 21:22 ` Dave Chinner
2019-05-08 14:08   ` Rik van Riel
2019-05-08 21:32     ` Dave Chinner
2019-05-09 14:27       ` Rik van Riel
2019-05-08 16:39   ` Chris Mason
2019-05-08 21:40     ` Dave Chinner