* Re: xfstests 073 regression
       [not found] ` <20110730134422.GA1884@infradead.org>
@ 2011-07-31 15:10   ` Wu Fengguang
  2011-07-31 15:14     ` [GIT PULL] fix xfstests 073 regression for 3.1-rc1 Wu Fengguang
  2011-07-31 23:47     ` xfstests 073 regression Dave Chinner
  0 siblings, 2 replies; 19+ messages in thread

From: Wu Fengguang @ 2011-07-31 15:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jan Kara, Dave Chinner, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> This patch fixes the hang for me.

Thanks. It'd be better to get the _simple and tested_ fix into 3.1-rc1.
(If anyone finds major problems with the fix, please speak up.)

Linus, will you pull from this branch for the one-liner change below?

  git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git urgent-for-linus

Thanks,
Fengguang
---

From 0e995816f4fb69cef602b7fe82da68ced6be3b41 Mon Sep 17 00:00:00 2001
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Fri, 29 Jul 2011 22:14:35 -0600
Subject: [PATCH] don't busy retry the inode on failed grab_super_passive()

This fixes a soft lockup that occurs when

a) the flusher is working on a work item queued by
   __bdi_start_writeback(), while

b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(),
   which grabs sb->s_umount and enqueues a new work item for the
   flusher to execute.

The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops forever without releasing
wb->list_lock, which further blocks other tasks.

Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode; however, it will most likely be picked up
soon by the work queued by writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().

bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html

Reported-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c | 7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1599aa9..04cf3b9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			requeue_io(inode, wb);
+			/*
+			 * grab_super_passive() may fail consistently due to
+			 * s_umount being grabbed by someone else. Don't use
+			 * requeue_io() to avoid busy retrying the inode/sb.
+			 */
+			redirty_tail(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 19+ messages in thread
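The difference between requeue_io() and redirty_tail() in the patch above can be illustrated with a small self-contained model (plain Python with hypothetical names, not the kernel code): requeue_io() puts the inode straight back on the queue being processed, so a persistently failing grab_super_passive() turns the loop into a busy spin; redirty_tail() moves the inode to the dirty list instead, so the current pass terminates.

```python
from collections import deque

def writeback_pass(b_io, b_dirty, grab_super_passive, use_redirty,
                   max_iters=1000):
    """One __writeback_inodes_wb()-style pass over the b_io queue.

    Returns the number of loop iterations before the pass finished, or
    max_iters if it was still spinning (the soft-lockup case).
    """
    iters = 0
    while b_io and iters < max_iters:
        iters += 1
        inode = b_io.popleft()
        if not grab_super_passive(inode):
            if use_redirty:
                b_dirty.append(inode)  # redirty_tail(): off b_io, retry later
            else:
                b_io.append(inode)     # requeue_io(): back on b_io -> busy spin
            continue
        # normal case: write the inode back (elided in this model)
    return iters

# s_umount is held by someone else for the whole pass, so the grab
# always fails -- the xfstests 073 situation.
always_fail = lambda inode: False

spin = writeback_pass(deque(["inode1"]), deque(), always_fail, use_redirty=False)
done = writeback_pass(deque(["inode1"]), deque(), always_fail, use_redirty=True)
print(spin, done)  # requeue_io spins until the cap; redirty_tail finishes in 1
```

The real loop has no iteration cap, of course: with requeue_io() it never exits and never drops wb->list_lock, which is exactly the reported lockup.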
* [GIT PULL] fix xfstests 073 regression for 3.1-rc1
  2011-07-31 15:10 ` xfstests 073 regression Wu Fengguang
@ 2011-07-31 15:14   ` Wu Fengguang
  2011-07-31 23:47   ` xfstests 073 regression Dave Chinner
  1 sibling, 0 replies; 19+ messages in thread

From: Wu Fengguang @ 2011-07-31 15:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jan Kara, Dave Chinner, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

[add GIT PULL in title]
* Re: xfstests 073 regression
  2011-07-31 15:10 ` xfstests 073 regression Wu Fengguang
  2011-07-31 15:14 ` [GIT PULL] fix xfstests 073 regression for 3.1-rc1 Wu Fengguang
@ 2011-07-31 23:47 ` Dave Chinner
  2011-08-01  0:25   ` Linus Torvalds
  2011-08-01  5:24   ` Wu Fengguang
  1 sibling, 2 replies; 19+ messages in thread

From: Dave Chinner @ 2011-07-31 23:47 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 11:10:14PM +0800, Wu Fengguang wrote:
> On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> > This patch fixes the hang for me.
>
> Thanks. It'd be better to get the _simple and tested_ fix into 3.1-rc1.
> (if anyone find major problems with the fix, please speak up)

Yes, I already have, a couple of hours before you sent this:

http://www.spinics.net/lists/linux-fsdevel/msg47357.html

We haven't found the root cause of the problem, and writeback cannot
hold off grab_super_passive() because writeback only holds read locks
on s_umount, just like grab_super_passive(). So if grab_super_passive()
is failing, there is some other, as-yet-unidentified actor causing this
problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
  2011-07-31 23:47 ` xfstests 073 regression Dave Chinner
@ 2011-08-01  0:25   ` Linus Torvalds
  2011-08-01  1:28     ` Dave Chinner
  2011-08-01  5:24   ` Wu Fengguang
  1 sibling, 1 reply; 19+ messages in thread

From: Linus Torvalds @ 2011-08-01 0:25 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, I already have, a couple of hours before you sent this:
>
> http://www.spinics.net/lists/linux-fsdevel/msg47357.html
>
> We haven't found the root cause of the problem, and writeback cannot
> hold off grab_super_passive() because writeback only holds read
> locks on s_umount, just like grab_super_passive.

With read-write semaphores, even read-vs-read recursion is a deadlock
possibility.

Why? Because if a writer comes in on another thread while the read lock
is initially held, then the writer will block. And due to fairness, a
subsequent reader will *also* block. So no, nesting readers is *not*
allowed for rw_semaphores, even if naively you'd think it should work.

So if xfstests 073 does mount/umount testing, then it is entirely
possible that a reader blocks another reader due to a pending writer.

NOTE! The rwlock *spinlocks* are designed to be unfair to writers, and
by design allow recursive readers. That's important and very much by
design: it is ok to take a rwlock for reading without disabling
interrupts even if there may be *interrupts* that also need it for
reading. With the spinning rwlock, there is also much less chance of
starvation due to this unfairness. In contrast, the rw_semaphores
really can be starved pretty easily if you are excessively unfair to
writers.

Linus
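The fairness argument can be made concrete with a toy model of a fair rw-semaphore's admission rule (plain Python, illustrative only, not the kernel's rwsem): a request may proceed only if no earlier waiter is queued ahead of it *and* it is compatible with the current holders. Reader A holds the lock, writer W queues behind it, then reader B queues behind W: B is blocked even though only a read lock is held.

```python
def can_acquire(holders, queue, req):
    """Fair rwsem admission: a request runs only if nothing is queued
    ahead of it (FIFO, no overtaking) and it is compatible with the
    current holders."""
    kind, name = req
    if queue and queue[0] != req:
        return False                      # fairness: no overtaking
    if kind == "read":
        return all(k == "read" for k, _ in holders)
    return not holders                    # writer needs exclusivity

holders = [("read", "A")]                 # reader A holds s_umount
queue = [("write", "W"), ("read", "B")]   # pending writer, then reader B

print(can_acquire(holders, queue, ("write", "W")))  # False: A still holds it
print(can_acquire(holders, queue, ("read", "B")))   # False: W is ahead of B

# An unfair rwlock (like the spinning rwlock_t) admits readers without
# looking at the queue, so B would run alongside A:
unfair_read_ok = all(k == "read" for k, _ in holders)
print(unfair_read_ok)                               # True
```

So if A can only release after B has run (read-lock "recursion"), the fair semaphore deadlocks where the unfair spinlock would not, which is exactly the distinction drawn above.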
* Re: xfstests 073 regression
  2011-08-01  0:25 ` Linus Torvalds
@ 2011-08-01  1:28   ` Dave Chinner
  2011-08-01  1:40     ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread

From: Dave Chinner @ 2011-08-01 1:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 02:25:27PM -1000, Linus Torvalds wrote:
> On Sun, Jul 31, 2011 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Yes, I already have, a couple of hours before you sent this:
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg47357.html
> >
> > We haven't found the root cause of the problem, and writeback cannot
> > hold off grab_super_passive() because writeback only holds read
> > locks on s_umount, just like grab_super_passive.
>
> With read-write semaphores, even read-vs-read recursion is a deadlock
> possibility.
>
> Why? Because if a writer comes in on another thread, while the read
> lock is initially held, then the writer will now block. And due to
> fairness, now a subsequent reader will *also* block.

Yes, I know.

> So no, nesting readers is *not* allowed for rw_semaphores even if
> naively you'd think it should work.

Right, hence my question of where the deadlock was. I can see how we
get a temporary holdoff due to background flushing holding the
s_umount lock (and hence why this change prevents that holdoff), but
that should not deadlock even if there is a pending write lock.

> So if xfstests 073 does mount/umount testing, then it is entirely
> possible that a reader blocks another reader due to a pending writer.

And I know that remount,ro will take the lock in write mode, hence
cause such a holdoff to occur. Test 073 does this, and it is likely to
be the cause of the problem.

However, if we can't grab the superblock, it implies that there is
something major happening on the superblock. It's likely to be going
away, or in the case of a remount, something significant is changing.
In both cases, writeback of dirty data is going to be handled by the
umount/remount process, and so whatever writeback is currently in
progress is just getting in the way of executing the higher level,
more important operation.

AFAICT, there's a bunch of non-trivial ways that we can get
read-write-read deadlocks that the bdi-flusher "avoids" by ignoring
inodes when it can't get the superblock. However, if it can't get the
superblock reference, why shouldn't we just abort whatever writeback
is going through writeback_inodes_wb()? That writeback is guaranteed
to be unable to complete until the reference to s_umount is dropped,
so why do we continue to try one inode at a time?

IOWs, what I'm asking is whether this "just move the inodes one at a
time to a different queue" is just a bandaid for a particular symptom
of a deeper problem we haven't realised existed....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
  2011-08-01  1:28 ` Dave Chinner
@ 2011-08-01  1:40   ` Linus Torvalds
  2011-08-01  2:09     ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread

From: Linus Torvalds @ 2011-08-01 1:40 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> IOWs, what I'm asking is whether this "just move the inodes one at a
> time to a different queue" is just a bandaid for a particular
> symptom of a deeper problem we haven't realised existed....

Deeper problems in writeback? Unpossible.

The writeback code has pretty much always been just a collection of
"bandaids for particular symptoms of deeper problems". So let's just
say I'd not be shocked. But what else would you suggest? You could
just break out of the loop if you can't get the read lock, but while
the *common* case is likely that a lot of the inodes are on the same
filesystem, that's certainly not the only possible case.

Linus
* Re: xfstests 073 regression
  2011-08-01  1:40 ` Linus Torvalds
@ 2011-08-01  2:09   ` Dave Chinner
  2011-08-01  2:21     ` Linus Torvalds
                        ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread

From: Dave Chinner @ 2011-08-01 2:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 03:40:20PM -1000, Linus Torvalds wrote:
> On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > IOWs, what I'm asking is whether this "just move the inodes one at a
> > time to a different queue" is just a bandaid for a particular
> > symptom of a deeper problem we haven't realised existed....
>
> Deeper problems in writeback? Unpossible.

Heh.

But that's exactly why I'd like to understand the problem fully.

> The writeback code has pretty much always been just a collection of
> "bandaids for particular symptoms of deeper problems". So let's just
> say I'd not be shocked. But what else would you suggest? You could
> just break out of the loop if you can't get the read lock, but while
> the *common* case is likely that a lot of the inodes are on the same
> filesystem, that's certainly not the only possible case.

Right, but in this specific case of executing writeback_inodes_wb(),
we can only be operating on a specific bdi without being told which
sb to flush. If we are told which sb, then we go through
__writeback_inodes_sb() and avoid the grab_super_passive()
altogether, because some other thread holds the s_umount lock.

These no-specific-sb cases can come only from
wb_check_background_flush() or wb_check_old_data_flush() which, by
definition, are opportunistic background asynchronous writeback
executed only when there is no other work to do. Further, if there is
new work queued while they are running, they abort.

Hence if we can't grab the superblock here, it is simply another case
of a "new work pending" interrupt, right? And so aborting the work is
the correct thing to do? Especially as it avoids all the ordering
problems of redirtying inodes and allows the writeback work to
restart (from whatever context it is started from next time) where it
stopped.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
  2011-08-01  2:09 ` Dave Chinner
@ 2011-08-01  2:21   ` Linus Torvalds
  2011-08-01  5:52   ` Wu Fengguang
  2011-08-01 11:23   ` Christoph Hellwig
  2 siblings, 0 replies; 19+ messages in thread

From: Linus Torvalds @ 2011-08-01 2:21 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 4:09 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Hence if we can't grab the superblock here, it is simply another
> case of a "new work pending" interrupt, right? And so aborting the
> work is the correct thing to do? Especially as it avoids all the
> ordering problems of redirtying inodes and allows the writeback work
> to restart (from whatever context it is started from next time) where
> it stopped.

Ok, that does sound like a reasonable approach to me. Which probably
means that there is some oversimplification and gotcha hidden in that
argument, but I don't see it immediately.

I do agree that it would be nice if we could avoid re-ordering
problems. The whole dirty-time issue and whether inodes are actually
correctly ordered on the dirty lists has always been one huge rat's
nest. Ask Andrew about it, he was espousing his love for that code at
the recent Intel Tech Days ;)

Linus
* Re: xfstests 073 regression
  2011-08-01  2:09 ` Dave Chinner
  2011-08-01  2:21   ` Linus Torvalds
@ 2011-08-01  5:52   ` Wu Fengguang
  2011-08-01 16:44     ` Christoph Hellwig
  2011-08-01 11:23   ` Christoph Hellwig
  2 siblings, 1 reply; 19+ messages in thread

From: Wu Fengguang @ 2011-08-01 5:52 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Mon, Aug 01, 2011 at 10:09:51AM +0800, Dave Chinner wrote:
> On Sun, Jul 31, 2011 at 03:40:20PM -1000, Linus Torvalds wrote:
> > On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > IOWs, what I'm asking is whether this "just move the inodes one at a
> > > time to a different queue" is just a bandaid for a particular
> > > symptom of a deeper problem we haven't realised existed....
> >
> > Deeper problems in writeback? Unpossible.
>
> Heh.
>
> But that's exactly why I'd like to understand the problem fully.
>
> > The writeback code has pretty much always been just a collection of
> > "bandaids for particular symptoms of deeper problems". So let's just
> > say I'd not be shocked. But what else would you suggest? You could
> > just break out of the loop if you can't get the read lock, but while
> > the *common* case is likely that a lot of the inodes are on the same
> > filesystem, that's certainly not the only possible case.
>
> Right, but in this specific case of executing writeback_inodes_wb(),
> we can only be operating on a specific bdi without being told which
> sb to flush. If we are told which sb, then we go through
> __writeback_inodes_sb() and avoid the grab_super_passive()
> altogether because some other thread holds the s_umount lock.
>
> These no-specific-sb cases can come only from
> wb_check_background_flush() or wb_check_old_data_flush() which, by
> definition, are opportunistic background asynchronous writeback
> executed only when there is no other work to do. Further, if there
> is new work queued while they are running, they abort.

There is another type of work that won't abort: the one started by
__bdi_start_writeback(). I'd call it a "nr_pages" work, since its
termination condition is simply nr_pages and nothing more -- unlike
the for_background or for_kupdate works, which will abort as soon as
other works are queued.

Here I listed the two conditions for the deadlock (missing the 3rd
one: the read-write-read lock):

http://lkml.org/lkml/2011/7/31/63

In particular, the deadlock, once triggered, does not depend on how
large nr_pages is. It can be fixed by either of

1) the flusher aborts the work early
2) the flusher doesn't busy retry the inode(s)

In the other email, I proposed to fix (2) for now and then do (1) in
the future:

: So I'd propose this patch as the reasonable fix for 3.1. In the long
: term, we may further consider making the nr_pages works give up
: temporarily when there comes a sync work, which could eliminate lots
: of redirty_tail()s at this point.

> Hence if we can't grab the superblock here, it is simply another
> case of a "new work pending" interrupt, right? And so aborting the
> work is the correct thing to do? Especially as it avoids all the
> ordering problems of redirtying inodes and allows the writeback work
> to restart (from whatever context it is started from next time) where
> it stopped.

The long-term solution (1) I proposed is actually the same as your
proposal to abort the work :)

Thanks,
Fengguang
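The distinction drawn here — background/kupdate works yield to newly queued work, while a plain "nr_pages" work only checks its page budget — can be sketched as a small predicate (hypothetical Python mirroring the thread's description of wb_writeback()'s stop conditions, not the actual kernel code):

```python
def work_should_stop(work, pages_left, other_work_pending):
    """Stop conditions for one writeback work item, as described in the
    thread:
      - every work stops once its nr_pages budget is exhausted;
      - only for_background/for_kupdate works also yield to queued work.
    """
    if pages_left <= 0:
        return True
    if (work.get("for_background") or work.get("for_kupdate")) \
            and other_work_pending:
        return True
    return False

bg = {"for_background": True}
nr = {}  # a plain "nr_pages" work from __bdi_start_writeback()

print(work_should_stop(bg, pages_left=100, other_work_pending=True))  # True
print(work_should_stop(nr, pages_left=100, other_work_pending=True))  # False
```

The second result is the problem case: with LONG_MAX pages left and no yield condition, an nr_pages work keeps going even while the s_umount holder's queued work waits behind it.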
* Re: xfstests 073 regression
  2011-08-01  5:52 ` Wu Fengguang
@ 2011-08-01 16:44   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread

From: Christoph Hellwig @ 2011-08-01 16:44 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dave Chinner, Linus Torvalds, Christoph Hellwig, Jan Kara,
    Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Mon, Aug 01, 2011 at 01:52:12PM +0800, Wu Fengguang wrote:
> There is another type of work that won't abort: the one started
> by __bdi_start_writeback(). I'd call it a "nr_pages" work since its
> termination condition is simply nr_pages and nothing more. It's not
> like the for_background or for_kupdate works that will abort as soon
> as other works are queued.

We have two callers of __bdi_start_writeback: bdi_start_writeback and
wakeup_flusher_threads.

The first one wants to write all pages for a bdi when a laptop-mode
completion comes in. It's a bit like the first pass of a sync in that
it's a) asynchronous and b) skipping inodes or even superblocks has
absolutely no data integrity or system balance effect.

The second one is a huge mess. We may call it either in places like
sync that want to write all pages, and have similar characteristics
as above. Or in some places where the VM wants to write a certain
number of pages - but the writeback code totally fucks that goal up
by trying to write the requested number of pages per BDI, leading to
much more massive writeback than we have requested on any larger
system. And what the callers appear to actually want is to either
increase background writeback a bit, or free a certain number of
pages (in one case even just in the lowmem zone).
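The over-writing complaint about wakeup_flusher_threads() is simple arithmetic: the nr_pages target is applied per BDI, not globally, so on a system with many BDIs the total can be many times what the caller asked for. A back-of-the-envelope sketch with hypothetical numbers:

```python
requested = 1024   # pages the caller asked wakeup_flusher_threads() for
dirty_bdis = 8     # BDIs with dirty inodes on a larger system

# Each dirty BDI gets its own work item carrying the full nr_pages
# target, so the worst-case total writeback is the product, not the
# requested amount.
worst_case = requested * dirty_bdis
print(worst_case)  # 8192 pages: 8x more than the caller wanted
```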
* Re: xfstests 073 regression
  2011-08-01  2:09 ` Dave Chinner
  2011-08-01  2:21   ` Linus Torvalds
  2011-08-01  5:52   ` Wu Fengguang
@ 2011-08-01 11:23   ` Christoph Hellwig
  2011-08-01 16:52     ` Christoph Hellwig
  2 siblings, 1 reply; 19+ messages in thread

From: Christoph Hellwig @ 2011-08-01 11:23 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Wu Fengguang, Christoph Hellwig, Jan Kara,
    Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

I think Dave's theory sounds good, but I'd like to get some evidence
for it. We already have fairly good trace points in the writeback
code, but for this they aren't quite enough yet. Curt posted some
patches to pass a "why" argument down the writeback stack so that we
know the high-level caller in all tracepoints. I'll put that into my
tree to check where these calls come from.
* Re: xfstests 073 regression
  2011-08-01 11:23 ` Christoph Hellwig
@ 2011-08-01 16:52   ` Christoph Hellwig
  2011-08-02 11:44     ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread

From: Christoph Hellwig @ 2011-08-01 16:52 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Wu Fengguang, Christoph Hellwig, Jan Kara,
    Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

[-- Attachment #1: Type: text/plain, Size: 3499 bytes --]

wb_check_background_flush is indeed what we're hitting. See the trace
output using a patch inspired by Curt's below:

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush

[-- Attachment #2: fs-trace-io-source --]
[-- Type: text/plain, Size: 7728 bytes --]

From: Christoph Hellwig <hch@lst.de>
Subject: writeback: trace the source of writeback works

Trace who queued up work to the flusher thread. Also add a new
tracepoint to the failed grab_super_passive case.
Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/fs-writeback.c
===================================================================
--- xfs.orig/fs/fs-writeback.c	2011-08-01 17:59:15.318991514 +0200
+++ xfs/fs/fs-writeback.c	2011-08-01 18:31:33.221826327 +0200
@@ -41,7 +41,7 @@ struct wb_writeback_work {
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
-
+	enum writeback_reason why;
 	struct list_head list;		/* pending work list */
 	struct completion *done;	/* set if the caller waits */
 };
@@ -115,7 +115,7 @@ static void bdi_queue_work(struct backin
 
 static void
 __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
-		      bool range_cyclic)
+		      bool range_cyclic, enum writeback_reason why)
 {
 	struct wb_writeback_work *work;
@@ -135,6 +135,7 @@ __bdi_start_writeback(struct backing_dev
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
+	work->why	= why;
 
 	bdi_queue_work(bdi, work);
 }
@@ -152,7 +153,7 @@ __bdi_start_writeback(struct backing_dev
  */
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages)
 {
-	__bdi_start_writeback(bdi, nr_pages, true);
+	__bdi_start_writeback(bdi, nr_pages, true, WB_START_WRITEBACK);
 }
 
 /**
@@ -618,6 +619,7 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
+			trace_writeback_grab_super_failed(wb->bdi, work);
 			redirty_tail(inode, wb);
 			continue;
 		}
@@ -636,12 +638,14 @@ static long __writeback_inodes_wb(struct
 	return wrote;
 }
 
-long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
+			 enum writeback_reason why)
 {
 	struct wb_writeback_work work = {
 		.nr_pages	= nr_pages,
 		.sync_mode	= WB_SYNC_NONE,
 		.range_cyclic	= 1,
+		.why		= why,
 	};
 
 	spin_lock(&wb->list_lock);
@@ -813,6 +817,7 @@ static long wb_check_background_flush(st
 			.sync_mode	= WB_SYNC_NONE,
 			.for_background	= 1,
 			.range_cyclic	= 1,
+			.why		= WB_BACKGROUND,
 		};
 
 		return wb_writeback(wb, &work);
@@ -846,6 +851,7 @@ static long wb_check_old_data_flush(stru
 			.sync_mode	= WB_SYNC_NONE,
 			.for_kupdate	= 1,
 			.range_cyclic	= 1,
+			.why		= WB_OLD,
 		};
 
 		return wb_writeback(wb, &work);
@@ -977,7 +983,7 @@ void wakeup_flusher_threads(long nr_page
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		if (!bdi_has_dirty_io(bdi))
 			continue;
-		__bdi_start_writeback(bdi, nr_pages, false);
+		__bdi_start_writeback(bdi, nr_pages, false, WB_FLUSHER_THREADS);
 	}
 	rcu_read_unlock();
 }
@@ -1207,6 +1213,7 @@ void writeback_inodes_sb_nr(struct super
 		.tagged_writepages	= 1,
 		.done			= &done,
 		.nr_pages		= nr,
+		.why			= WB_WRITEBACK_INODES,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
@@ -1285,6 +1292,7 @@ void sync_inodes_sb(struct super_block *
 		.nr_pages	= LONG_MAX,
 		.range_cyclic	= 0,
 		.done		= &done,
+		.why		= WB_SYNC_INODES,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
Index: xfs/include/trace/events/writeback.h
===================================================================
--- xfs.orig/include/trace/events/writeback.h	2011-08-01 17:58:27.649249764 +0200
+++ xfs/include/trace/events/writeback.h	2011-08-01 18:31:26.501862733 +0200
@@ -21,6 +21,16 @@
 		{I_REFERENCED,		"I_REFERENCED"} \
 	)
 
+#define WB_TYPES \
+	{ WB_SYNC_INODES,	"sync_inodes_sb" }, \
+	{ WB_WRITEBACK_INODES,	"writeback_inodes_sb" }, \
+	{ WB_START_WRITEBACK,	"bdi_start_writeback" }, \
+	{ WB_FLUSHER_THREADS,	"wakeup_flusher_threads" }, \
+	{ WB_EMERGENCY,		"emergency flush" }, \
+	{ WB_BALANCE,		"balance_dirty_pages" }, \
+	{ WB_BACKGROUND,	"wb_check_background_flush" }, \
+	{ WB_OLD,		"wb_check_old_data_flush" }
+
 struct wb_writeback_work;
 
 DECLARE_EVENT_CLASS(writeback_work_class,
@@ -34,6 +44,7 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__field(int, for_kupdate)
 		__field(int, range_cyclic)
 		__field(int, for_background)
+		__field(int, why)
 	),
 	TP_fast_assign(
 		strncpy(__entry->name, dev_name(bdi->dev), 32);
@@ -43,16 +54,18 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__entry->for_kupdate = work->for_kupdate;
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background = work->for_background;
+		__entry->why = work->why;
 	),
 	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d",
+		  "kupdate=%d range_cyclic=%d background=%d reason=%s",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
 		  __entry->nr_pages,
 		  __entry->sync_mode,
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
-		  __entry->for_background
+		  __entry->for_background,
+		  __print_symbolic(__entry->why, WB_TYPES)
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \
@@ -65,6 +78,7 @@ DEFINE_WRITEBACK_WORK_EVENT(writeback_ex
 DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_grab_super_failed);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
Index: xfs/include/linux/writeback.h
===================================================================
--- xfs.orig/include/linux/writeback.h	2011-08-01 18:24:26.754136700 +0200
+++ xfs/include/linux/writeback.h	2011-08-01 18:32:37.158146620 +0200
@@ -79,13 +79,26 @@ struct writeback_control {
  * fs/fs-writeback.c
  */
 struct bdi_writeback;
+
+enum writeback_reason {
+	WB_SYNC_INODES,
+	WB_WRITEBACK_INODES,
+	WB_START_WRITEBACK,
+	WB_FLUSHER_THREADS,
+	WB_EMERGENCY,
+	WB_BALANCE,
+	WB_BACKGROUND,
+	WB_OLD,
+};
+
 int inode_wait(void *);
 void writeback_inodes_sb(struct super_block *);
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr);
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
+			 enum writeback_reason why);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
Index: xfs/mm/backing-dev.c
===================================================================
--- xfs.orig/mm/backing-dev.c	2011-08-01 18:24:26.774136593 +0200
+++ xfs/mm/backing-dev.c	2011-08-01 18:25:09.910569569 +0200
@@ -456,7 +456,7 @@ static int bdi_forker_thread(void *ptr)
 			 * the bdi from the thread. Hopefully 1024 is
 			 * large enough for efficient IO.
 			 */
-			writeback_inodes_wb(&bdi->wb, 1024);
+			writeback_inodes_wb(&bdi->wb, 1024, WB_EMERGENCY);
 		} else {
 			/*
 			 * The spinlock makes sure we do not lose
Index: xfs/mm/page-writeback.c
===================================================================
--- xfs.orig/mm/page-writeback.c	2011-08-01 18:24:26.794136483 +0200
+++ xfs/mm/page-writeback.c	2011-08-01 18:27:43.019740106 +0200
@@ -738,7 +738,8 @@ static void balance_dirty_pages(struct a
 		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > task_bdi_thresh) {
 			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
+							     write_chunk,
+							     WB_BALANCE);
 			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
* Re: xfstests 073 regression
  2011-08-01 16:52 ` Christoph Hellwig
@ 2011-08-02 11:44   ` Wu Fengguang
  2011-08-02 12:04     ` Christoph Hellwig
                        ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread

From: Wu Fengguang @ 2011-08-02 11:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Linus Torvalds, Jan Kara, Andrew Morton,
    linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> wb_check_background_flush is indeed what we're hitting.

That means s_umount is NOT held by another queued writeback work.

> See the trace output using a patch inspired by Curt's below:
>
> # tracer: nop
> #
> #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> #              | |       |          |         |
>            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush

What's that bdi 7:0? And sb_dev=0:0,
nr_pages=9223372036854775807=0x7fffffffffffffff.

All are indicating some special bdi/inode.

Thanks,
Fengguang
* Re: xfstests 073 regression
  2011-08-02 11:44           ` Wu Fengguang
@ 2011-08-02 12:04             ` Christoph Hellwig
  2011-08-02 12:04             ` Dave Chinner
  2011-08-02 12:05             ` Wu Fengguang
  2 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2011-08-02 12:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Dave Chinner, Linus Torvalds, Jan Kara,
	Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> What's that bdi 7:0?

/dev/loop0

> And sb_dev=0:0

that we didn't have a sb specified in this wb_work

> , nr_pages=9223372036854775807=0x7fffffffffffffff.

LONG_MAX.  It all fits perfectly:

static long wb_check_background_flush(struct bdi_writeback *wb)
{
	if (over_bground_thresh()) {

		struct wb_writeback_work work = {
			.nr_pages	= LONG_MAX,
			.sync_mode	= WB_SYNC_NONE,
			.for_background	= 1,
			.range_cyclic	= 1,
		};

		return wb_writeback(wb, &work);
	}

	return 0;
}
* Re: xfstests 073 regression
  2011-08-02 11:44           ` Wu Fengguang
  2011-08-02 12:04             ` Christoph Hellwig
@ 2011-08-02 12:04             ` Dave Chinner
  2011-08-02 12:16               ` Wu Fengguang
  2011-08-02 12:05             ` Wu Fengguang
  2 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2011-08-02 12:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton,
	linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > wb_check_background_flush is indeed what we're hitting.
>
> That means s_umount is NOT held by another queued writeback work.

Right. We already kind of knew that was occurring because there's
a remount,ro going on.

> > See the trace output using a patch inspired by Curt's below:
> >
> > # tracer: nop
> > #
> > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > #              | |       |          |         |
> >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>
> What's that bdi 7:0? And sb_dev=0:0, nr_pages=9223372036854775807=0x7fffffffffffffff.
>
> All are indicating some special bdi/inode.

#define LOOP_MAJOR		7

It's a loop device. xfstests uses them quite a lot.

Maybe it would be a good idea to run xfstests on an xfs filesystem
in your regular writeback testing cycle to get decent coverage of
this case?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
  2011-08-02 12:04             ` Dave Chinner
@ 2011-08-02 12:16               ` Wu Fengguang
  2011-08-02 12:26                 ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2011-08-02 12:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton,
	linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 08:04:45PM +0800, Dave Chinner wrote:
> On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> > On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > > wb_check_background_flush is indeed what we're hitting.
> >
> > That means s_umount is NOT held by another queued writeback work.
>
> Right. We already kind of knew that was occurring because there's
> a remount,ro going on.

Yes, and even better if it can be confirmed with a full sysrq-t trace.

> > > See the trace output using a patch inspired by Curt's below:
> > >
> > > # tracer: nop
> > > #
> > > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > > #              | |       |          |         |
> > >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> > >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> > >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >
> > What's that bdi 7:0? And sb_dev=0:0, nr_pages=9223372036854775807=0x7fffffffffffffff.
> >
> > All are indicating some special bdi/inode.
>
> #define LOOP_MAJOR		7
>
> It's a loop device. xfstests uses them quite a lot.

Yeah, it is.

> Maybe it would be a good idea to run xfstests on an xfs filesystem
> in your regular writeback testing cycle to get decent coverage of
> this case?
I've run xfstests case 073 on two of my boxes, however I still cannot
reproduce the problem. This is the script I used; anything wrong with it?

#!/bin/sh

export TEST_DEV=/dev/sda5
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sda6
export SCRATCH_MNT=/mnt/scratch

mount $TEST_DEV $TEST_DIR

./check 073

And the interesting thing is, that test case always fails on one box
and succeeds on another.

Thanks,
Fengguang
* Re: xfstests 073 regression
  2011-08-02 12:16               ` Wu Fengguang
@ 2011-08-02 12:26                 ` Wu Fengguang
  0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-08-02 12:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton,
	linux-fsdevel@vger.kernel.org, LKML

> And the interesting thing is, that test case always fails on one box
> and succeeds on another.

Ah, it's my fault. The failure is caused by some debug aids I added
to the tools. Sorry for the noise.
* Re: xfstests 073 regression
  2011-08-02 11:44           ` Wu Fengguang
  2011-08-02 12:04             ` Christoph Hellwig
  2011-08-02 12:04             ` Dave Chinner
@ 2011-08-02 12:05             ` Wu Fengguang
  2 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-08-02 12:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Linus Torvalds, Jan Kara, Andrew Morton,
	linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > wb_check_background_flush is indeed what we're hitting.
>
> That means s_umount is NOT held by another queued writeback work.
>
> > See the trace output using a patch inspired by Curt's below:
> >
> > # tracer: nop
> > #
> > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > #              | |       |          |         |
> >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >      <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>
> What's that bdi 7:0? And sb_dev=0:0, nr_pages=9223372036854775807=0x7fffffffffffffff.
>
> All are indicating some special bdi/inode.

I see, it's loop0:

lrwxrwxrwx 1 root root 0 Aug  2 20:46 /sys/class/block/loop0/bdi -> ../../bdi/7:0

Thanks,
Fengguang
* Re: xfstests 073 regression
  2011-07-31 23:47 ` xfstests 073 regression Dave Chinner
  2011-08-01  0:25   ` Linus Torvalds
@ 2011-08-01  5:24   ` Wu Fengguang
  1 sibling, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-08-01 5:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton,
	linux-fsdevel@vger.kernel.org, LKML

On Mon, Aug 01, 2011 at 07:47:49AM +0800, Dave Chinner wrote:
> On Sun, Jul 31, 2011 at 11:10:14PM +0800, Wu Fengguang wrote:
> > On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> > > This patch fixes the hang for me.
> >
> > Thanks. It'd be better to get the _simple and tested_ fix into 3.1-rc1.
> > (if anyone find major problems with the fix, please speak up)
>
> Yes, I already have, a couple of hours before you sent this:
>
> http://www.spinics.net/lists/linux-fsdevel/msg47357.html

Err, I found it in my spam folder.. and added you to the white list :)

> We haven't found the root cause of the problem, and writeback cannot
> hold off grab_super_passive() because writeback only holds read
> locks on s_umount, just like grab_super_passive. So if
> grab_super_passive is failing, there is some other, not yet
> identified actor causing this problem....

Yeah, sorry, I totally overlooked the read lock. Since Linus has
pointed out the possibility of a read-write-read deadlock, I'll jump
to your more recent email for the root cause.

Thanks,
Fengguang
Thread overview: 19+ messages
[not found] <20110728164105.GA18258@infradead.org>
[not found] ` <20110729142121.GA21149@localhost>
[not found] ` <20110730134422.GA1884@infradead.org>
2011-07-31 15:10 ` xfstests 073 regression Wu Fengguang
2011-07-31 15:14 ` [GIT PULL] fix xfstests 073 regression for 3.1-rc1 Wu Fengguang
2011-07-31 23:47 ` xfstests 073 regression Dave Chinner
2011-08-01 0:25 ` Linus Torvalds
2011-08-01 1:28 ` Dave Chinner
2011-08-01 1:40 ` Linus Torvalds
2011-08-01 2:09 ` Dave Chinner
2011-08-01 2:21 ` Linus Torvalds
2011-08-01 5:52 ` Wu Fengguang
2011-08-01 16:44 ` Christoph Hellwig
2011-08-01 11:23 ` Christoph Hellwig
2011-08-01 16:52 ` Christoph Hellwig
2011-08-02 11:44 ` Wu Fengguang
2011-08-02 12:04 ` Christoph Hellwig
2011-08-02 12:04 ` Dave Chinner
2011-08-02 12:16 ` Wu Fengguang
2011-08-02 12:26 ` Wu Fengguang
2011-08-02 12:05 ` Wu Fengguang
2011-08-01 5:24 ` Wu Fengguang