* [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
@ 2015-09-11 19:37 Chris Mason
  2015-09-11 20:02 ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: Chris Mason @ 2015-09-11 19:37 UTC (permalink / raw)
  To: Linus Torvalds, LKML, linux-fsdevel; +Cc: Josef Bacik, Dave Chinner, neilb

Linus, this is the plugging problem I mentioned in my btrfs pull.  It
impacts only MD raid10 and btrfs raid5/6, and I'm not wild about the
patch.  But I wanted to at least send in the basic fix for rc1 so this
doesn't cause bigger problems for early testers:

Commit d353d7587 added a plug/finish_plug pair to writeback_sb_inodes,
but writeback_sb_inodes has a horrible secret...it's called with the
wb->list_lock held.

Btrfs raid5/6 and MD raid10 have horrible secrets of their own...they
both do allocations in their unplug callbacks.

None of the options to fix it are very pretty.  We don't want to kick
off workers for all of these unplugs, and the lock doesn't look hot
enough to justify bigger restructuring.

[ 2854.025042] BUG: sleeping function called from invalid context at mm/page_alloc.c:3189
[ 2854.041366] in_atomic(): 1, irqs_disabled(): 0, pid: 145562, name: kworker/u66:15
[ 2854.056813] INFO: lockdep is turned off.
[ 2854.064870] CPU: 13 PID: 145562 Comm: kworker/u66:15 Not tainted 4.2.0-mason+ #1
[ 2854.080082] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 2854.096211] Workqueue: writeback wb_workfn (flush-btrfs-244)
[ 2854.107821]  ffffffff81a2bbee ffff880ee09a7598 ffffffff813307bb ffff880ee09a7598
[ 2854.123162]  ffff881010d1ca00 ffff880ee09a75c8 ffffffff81086615 0000000000000000
[ 2854.138556]  0000000000000000 0000000000000c75 ffffffff81a2bbee ffff880ee09a75f8
[ 2854.153936] Call Trace:
[ 2854.181101]  [<ffffffff81086722>] __might_sleep+0x52/0x90
[ 2854.192136]  [<ffffffff8116d2b4>] __alloc_pages_nodemask+0x344/0xbe0
[ 2854.229682]  [<ffffffff811b54aa>] alloc_pages_current+0x10a/0x1e0
[ 2854.255508]  [<ffffffffa0663f19>] full_stripe_write+0x59/0xc0 [btrfs]
[ 2854.268600]  [<ffffffffa0663fb9>] __raid56_parity_write+0x39/0x60 [btrfs]
[ 2854.282385]  [<ffffffffa06640fb>] run_plug+0x11b/0x140 [btrfs]
[ 2854.294259]  [<ffffffffa0664143>] btrfs_raid_unplug+0x23/0x70 [btrfs]
[ 2854.307334]  [<ffffffff81307622>] blk_flush_plug_list+0x82/0x1f0
[ 2854.319542]  [<ffffffff813077c4>] blk_finish_plug+0x34/0x50
[ 2854.330878]  [<ffffffff812079c2>] writeback_sb_inodes+0x122/0x580
[ 2854.343256]  [<ffffffff81208016>] wb_writeback+0x136/0x4e0

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Jens Axboe <axboe@fb.com>
---
 fs/fs-writeback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ae0f438..07c9c50 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1539,7 +1539,9 @@ static long writeback_sb_inodes(struct super_block *sb,
 			break;
 		}
 	}
+	spin_unlock(&wb->list_lock);
 	blk_finish_plug(&plug);
+	spin_lock(&wb->list_lock);
 	return wrote;
 }
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread
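[Editorial sketch -- a minimal illustration of the failure mode above.
The callback and caller are hypothetical stand-ins, not the real btrfs
or MD functions; only struct blk_plug_cb, bio_alloc() and the locking
pattern reflect the real 4.2-era interfaces.]

/* A driver registers an unplug callback while a plug is active: */
static void toy_unplug(struct blk_plug_cb *cb, bool from_schedule)
{
	/*
	 * GFP_NOIO may still sleep.  If blk_finish_plug() is reached
	 * in atomic context, running this callback is exactly the
	 * "sleeping function called from invalid context" above.
	 */
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	/* ... fill in and submit the bio ... */
	bio_put(bio);
}

/* And the writeback path, as of d353d7587, effectively did: */
static void toy_writeback(struct bdi_writeback *wb, struct blk_plug *plug)
{
	spin_lock(&wb->list_lock);	/* atomic context begins */
	/* ... issue plugged writeback IO ... */
	blk_finish_plug(plug);		/* runs toy_unplug() -> may sleep */
	spin_unlock(&wb->list_lock);
}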
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 19:37 [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() Chris Mason
@ 2015-09-11 20:02 ` Linus Torvalds
  2015-09-11 20:37   ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2015-09-11 20:02 UTC (permalink / raw)
  To: Chris Mason, LKML, linux-fsdevel, Josef Bacik, Dave Chinner,
	Neil Brown, Jan Kara, Christoph Hellwig

I hate this fix.

On Fri, Sep 11, 2015 at 12:37 PM, Chris Mason <clm@fb.com> wrote:
> Linus, this is the plugging problem I mentioned in my btrfs pull.  It
> impacts only MD raid10 and btrfs raid5/6, and I'm not wild about the
> patch.  But I wanted to at least send in the basic fix for rc1 so this
> doesn't cause bigger problems for early testers:
>
> Commit d353d7587 added a plug/finish_plug pair to writeback_sb_inodes,
> but writeback_sb_inodes has a horrible secret...it's called with the
> wb->list_lock held.

Quite frankly, just dropping and retaking the lock around the
blk_finish_plug() is just disgusting.

The whole "drop and retake lock" pattern in general is a bad idea,
because it can so easily break the caller (because now the lock no
longer covers things over the call).

Yes, in this case we already do something similar in
writeback_single_inode(), so I guess the argument is that it doesn't
make things much worse, and that the caller already cannot depend on
the lock being held.  True, but no less disgusting for that.

So we could do this, but in this case I don't think there's any good
_reason_ for doing that disgusting thing.  How about we instead:

 (a) revert that commit d353d7587 as broken (because it clearly is)

 (b) add a big honking comment about the fact that we hold 'list_lock'
     in writeback_sb_inodes()

 (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
     _outside_ the spinlock.

because that way we not only avoid the ugliness, it should also be
more effective at plugging things since we gather _all_ the writeback
rather than just one superblock.

Let's not paper over a completely broken commit.  Let's just admit
that commit d353d7587 was pure and utter shite, and rather than try to
fix up the mistake, make it all better!

Anyway, I will start by reverting that commit, and adding the comment.
I'm more than happy to take the patch that moves the plugging, but
since that one was about performance rather than correctness, I think
it would be good to just re-verify the numbers.  Dave?

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 20:02 ` Linus Torvalds
@ 2015-09-11 20:37   ` Linus Torvalds
  2015-09-11 20:40     ` Josef Bacik
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2015-09-11 20:37 UTC (permalink / raw)
  To: Chris Mason, LKML, linux-fsdevel, Josef Bacik, Dave Chinner,
	Neil Brown, Jan Kara, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 971 bytes --]

On Fri, Sep 11, 2015 at 1:02 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> How about we instead:
>
>  (a) revert that commit d353d7587 as broken (because it clearly is)
>
>  (b) add a big honking comment about the fact that we hold 'list_lock'
>      in writeback_sb_inodes()
>
>  (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
>      _outside_ the spinlock.

Ok, I've done (a) and (b) in my tree.  And attached is the totally
untested patch for (c).  It looks ObviouslyCorrect(tm), but since this
is a performance issue, I'm not going to commit it without some more
ACK's from people.

I obviously think this is a *much* better approach than dropping and
retaking the lock, but there might be something silly I'm missing.

For example, maybe we want to unplug and replug around the
"inode_sleep_on_writeback()" in wb_writeback()?  So while the revert
was a no-brainer, this one I really want people to think about.

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1102 bytes --]

 fs/fs-writeback.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d8ea7ed411b2..587ac08eabb6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1546,12 +1546,15 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 		.range_cyclic	= 1,
 		.reason		= reason,
 	};
+	struct blk_plug plug;
 
+	blk_start_plug(&plug);
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, &work);
 	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
+	blk_finish_plug(&plug);
 
 	return nr_pages - work.nr_pages;
 }
@@ -1579,10 +1582,12 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long oldest_jif;
 	struct inode *inode;
 	long progress;
+	struct blk_plug plug;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
 
+	blk_start_plug(&plug);
 	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
@@ -1662,6 +1667,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		}
 	}
 	spin_unlock(&wb->list_lock);
+	blk_finish_plug(&plug);
 
 	return nr_pages - work->nr_pages;
 }

^ permalink raw reply related	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 20:37   ` Linus Torvalds
@ 2015-09-11 20:40     ` Josef Bacik
  2015-09-11 21:04       ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: Josef Bacik @ 2015-09-11 20:40 UTC (permalink / raw)
  To: Linus Torvalds, Chris Mason, LKML, linux-fsdevel, Dave Chinner,
	Neil Brown, Jan Kara, Christoph Hellwig

On 09/11/2015 04:37 PM, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:02 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> How about we instead:
>>
>>  (a) revert that commit d353d7587 as broken (because it clearly is)
>>
>>  (b) add a big honking comment about the fact that we hold 'list_lock'
>>      in writeback_sb_inodes()
>>
>>  (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
>>      _outside_ the spinlock.
>
> Ok, I've done (a) and (b) in my tree.  And attached is the totally
> untested patch for (c).  It looks ObviouslyCorrect(tm), but since this
> is a performance issue, I'm not going to commit it without some more
> ACK's from people.
>
> I obviously think this is a *much* better approach than dropping and
> retaking the lock, but there might be something silly I'm missing.
>
> For example, maybe we want to unplug and replug around the
> "inode_sleep_on_writeback()" in wb_writeback()?  So while the revert
> was a no-brainer, this one I really want people to think about.

So we talked about this when we were trying to figure out a solution.
The problem with this approach is now we have a plug that covers
multiple super blocks (__writeback_inodes_wb loops through the sb's and
starts writeback), which is likely to give us crappier performance than
no plug at all.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 20:40     ` Josef Bacik
@ 2015-09-11 21:04       ` Linus Torvalds
  2015-09-11 22:06         ` Linus Torvalds
  2015-09-11 23:06         ` Chris Mason
  0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2015-09-11 21:04 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> So we talked about this when we were trying to figure out a solution.  The
> problem with this approach is now we have a plug that covers multiple super
> blocks (__writeback_inodes_wb loops through the sb's and starts writeback),
> which is likely to give us crappier performance than no plug at all.

Why would that be?

Either they are on separate disks, and the IO is all independent
anyway, and at most it got started by some (small) CPU-amount later.
Actual throughput should be the same.  No?

Or the filesystems are on the same disk, in which case it should
presumably be a win to submit the IO together.

Of course, actual numbers would be the deciding factor if this really
is noticeable.  But "cleaner code and saner locking" is definitely an
issue at least for me.

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 21:04       ` Linus Torvalds
@ 2015-09-11 22:06         ` Linus Torvalds
  2015-09-11 23:16           ` Chris Mason
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2015-09-11 22:06 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, actual numbers would be the deciding factor if this really
> is noticeable.  But "cleaner code and saner locking" is definitely an
> issue at least for me.

Anyway, I'll hold off pushing out the revert too for a while, in the
hope that we'll have actual numbers for or against whatever the
particular solution should be.

I do get the feeling that the whole wb->list_lock needs more loving.
For example, there's that locked_inode_to_wb_and_lock_list() thing
(which might be better off trying to just do "trylock" on it, but
that's a separate issue) showing lock inversion worries.

Maybe we should *not* get that wb->list_lock early at all, and nest it
inside the inode spinlocks, and just move the list_lock locking down a
lot (ie not even try to hold it over big functions that then are
forced to release it anyway).

For example, realistically, it looks like the longest we ever *really*
hold that lock is at the top of the loop of writeback_sb_inodes() -
maybe we could just explicitly have a function that does "find the
first inode that matches this sb and needs writeout activity", and
literally only hold the lock over that function.  And *not* take the
lock in the caller at all?

The callers seem to want that lock mainly because they do that
"list_empty(&wb->b_io)" test.  But the way our doubly linked lists
work, "list_empty()" is actually something that can be done racily
without the lock held...

That said, I doubt anybody really wants to touch this, so at least for
now we're stuck with either the "plug outside the lock" or the "drop
and retake lock" options.  It really would be lovely to have numbers
either way.

<insert princess leia gif> "Dave, you're our only hope"

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
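[Editorial sketch of the helper Linus describes -- illustrative only.
wb_inode() and redirty_tail() are real static helpers in
fs/fs-writeback.c; the function itself, its name, and its policy of
requeueing other superblocks' inodes are hypothetical.]

/*
 * Find the next b_io inode belonging to @sb, holding wb->list_lock
 * only for the duration of the search.
 */
static struct inode *grab_next_inode_for_sb(struct bdi_writeback *wb,
					    struct super_block *sb)
{
	struct inode *inode = NULL;

	spin_lock(&wb->list_lock);
	while (!list_empty(&wb->b_io)) {
		inode = wb_inode(wb->b_io.prev);
		if (inode->i_sb == sb)
			break;
		/* Not our superblock: requeue it and keep scanning. */
		redirty_tail(inode, wb);
		inode = NULL;
	}
	spin_unlock(&wb->list_lock);
	return inode;	/* caller takes inode->i_lock, never list_lock */
}

With a helper along these lines, writeback_sb_inodes() could loop
without holding list_lock at all, and any plug/unplug in the caller
would never run under the lock.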
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 22:06         ` Linus Torvalds
@ 2015-09-11 23:16           ` Chris Mason
  2015-09-11 23:36             ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: Chris Mason @ 2015-09-11 23:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 03:06:18PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Of course, actual numbers would be the deciding factor if this really
> > is noticeable.  But "cleaner code and saner locking" is definitely an
> > issue at least for me.
>
> Anyway, I'll hold off pushing out the revert too for a while, in the
> hope that we'll have actual numbers for or against whatever the
> particular solution should be.
>
> I do get the feeling that the whole wb->list_lock needs more loving.
> For example, there's that locked_inode_to_wb_and_lock_list() thing
> (which might be better off trying to just do "trylock" on it, but
> that's a separate issue) showing lock inversion worries.

At the very least, it's kind of sad how many of us were surprised to
find the lock held when Dave's patch was unplugging.  If this bug
slipped through, more are going to.

It's also true N-1 of those people were really surprised about
scheduling unplug functions, so maybe we can't put all the blame on
wb->list_lock.

> Maybe we should *not* get that wb->list_lock early at all, and nest it
> inside the inode spinlocks, and just move the list_lock locking down a
> lot (ie not even try to hold it over big functions that then are
> forced to release it anyway).
>
> For example, realistically, it looks like the longest we ever *really*
> hold that lock is at the top of the loop of writeback_sb_inodes() -
> maybe we could just explicitly have a function that does "find the
> first inode that matches this sb and needs writeout activity", and
> literally only hold the lock over that function.  And *not* take the
> lock in the caller at all?
>
> The callers seem to want that lock mainly because they do that
> "list_empty(&wb->b_io)" test.  But the way our doubly linked lists
> work, "list_empty()" is actually something that can be done racily
> without the lock held...
>
> That said, I doubt anybody really wants to touch this, so at least for
> now we're stuck with either the "plug outside the lock" or the "drop
> and retake lock" options.  It really would be lovely to have numbers
> either way.

For 4.3 timeframes, what runs do you want to see numbers for:

1) revert
2) my hack
3) plug over multiple sbs (on different devices)
4) ?

-chris

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 23:16           ` Chris Mason
@ 2015-09-11 23:36             ` Linus Torvalds
  2015-09-12  0:52               ` Linus Torvalds
  2015-09-12 23:00               ` Chris Mason
  0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2015-09-11 23:36 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
>
> For 4.3 timeframes, what runs do you want to see numbers for:
>
> 1) revert
> 2) my hack
> 3) plug over multiple sbs (on different devices)
> 4) ?

Just 2 or 3.

I don't think the plain revert is all that interesting, and I think
the "anything else" is far too late for this merge window.

So we'll go with either (2) your patch (which I obviously don't
_like_, but apart from the ugliness I don't think there's anything
technically wrong with), or with (3) the "plug across a bigger area".

So the only issue with (3) is whether that's just "revert plus the
patch I sent out", or whether we should unplug/replug over the "wait
synchronously for an inode" case (iow, the
"inode_sleep_on_writeback()").

The existing plug code (that has the spinlock issue) already has a
"wait on inode" case, and did *not* unplug over that call, but
broadening the plugging further now ends up having two of those "wait
synchronously on inode".

Are we really ok with waiting synchronously for an inode while holding
the plug?  No chance of deadlock (waiting for IO that we've plugged)?
That issue is true even of the current code, though, and I have _not_
really thought that through, it's just a worry.

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 23:36             ` Linus Torvalds
@ 2015-09-12  0:52               ` Linus Torvalds
  2015-09-12  2:15                 ` Chris Mason
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2015-09-12  0:52 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Are we really ok with waiting synchronously for an inode while holding
> the plug?  No chance of deadlock (waiting for IO that we've plugged)?
> That issue is true even of the current code, though, and I have _not_
> really thought that through, it's just a worry.

Never mind.  We still flush the plug on explicit scheduling events.  I
wonder why I thought we got rid of that.  Some kind of "senior
moment", I guess.

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12  0:52               ` Linus Torvalds
@ 2015-09-12  2:15                 ` Chris Mason
  2015-09-12  2:27                   ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: Chris Mason @ 2015-09-12  2:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 05:52:27PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Are we really ok with waiting synchronously for an inode while holding
> > the plug?  No chance of deadlock (waiting for IO that we've plugged)?
> > That issue is true even of the current code, though, and I have _not_
> > really thought that through, it's just a worry.
>
> Never mind.  We still flush the plug on explicit scheduling events.  I
> wonder why I thought we got rid of that.  Some kind of "senior moment",

But flushing on schedule is a little different.  It ends up calling
blk_schedule_flush_plug(), which will hand off the work to kblockd
through blk_run_queue_async().

Not a huge deal, but if we're scheduling to wait for that IO, we should
really run the plug ourselves so that we're not waiting for kblockd
too.

-chris

^ permalink raw reply	[flat|nested] 56+ messages in thread
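[Editorial sketch contrasting the two unplug paths Chris describes.
blk_needs_flush_plug(), blk_schedule_flush_plug() and blk_flush_plug()
are the real v4.2-era inlines in include/linux/blkdev.h; the two toy
functions below are hypothetical.]

/* Illustrative only -- neither function is real writeback code. */

/*
 * Implicit: just block.  schedule() notices the non-empty plug in
 * sched_submit_work() and calls blk_schedule_flush_plug(), which
 * punts the actual unplug to a kblockd worker.
 */
static void toy_wait_implicit(void)
{
	io_schedule();
}

/*
 * Explicit: issue our plugged IO synchronously before sleeping, so
 * completing the wait doesn't also depend on kblockd being run.
 */
static void toy_wait_explicit(void)
{
	blk_flush_plug(current);
	io_schedule();
}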
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12  2:15                 ` Chris Mason
@ 2015-09-12  2:27                   ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2015-09-12  2:27 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 7:15 PM, Chris Mason <clm@fb.com> wrote:
>
> But flushing on schedule is a little different.  It ends up calling
> blk_schedule_flush_plug(), which will hand off the work to kblockd
> through blk_run_queue_async().

I was more worried about some actual deadlock from the changes.

And blk_schedule_flush_plug() is fine in that it doesn't actually
remove the plug, it just schedules the currently plugged pending IO,
so the IO will start from waiting on the inode, but the plug will
still remain for the rest of the writeback, and it all looks like it
should be fine.

And the reason we use kblockd is simple: stack usage.  The reschedule
can happen pretty deep on the stack, and we don't actually want to
necessarily then cause much more stack use through things like md/raid
allocating new requests etc.

So it all looks fine to me.

Btw, very tangentially related: grepping a bit shows that
"blk_flush_plug()" isn't actually used anywhere any more.  Can we get
rid of that?

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 23:36             ` Linus Torvalds
  2015-09-12  0:52               ` Linus Torvalds
@ 2015-09-12 23:00               ` Chris Mason
  2015-09-12 23:29                 ` Linus Torvalds
  2015-09-13 23:12                 ` Dave Chinner
  1 sibling, 2 replies; 56+ messages in thread
From: Chris Mason @ 2015-09-12 23:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> >
> > For 4.3 timeframes, what runs do you want to see numbers for:
> >
> > 1) revert
> > 2) my hack
> > 3) plug over multiple sbs (on different devices)
> > 4) ?
>
> Just 2 or 3.
>
> I don't think the plain revert is all that interesting, and I think
> the "anything else" is far too late for this merge window.

I did the plain revert as well, just to have a baseline.  This box is a
little different from Dave's.  Bare metal two socket box (E5-2660 v2 @
2.20GHz) with 144GB of ram.  I have two pcie flash devices, one nvme
and one fusionio, and I put one FS on each device (two mounts total).

The test created 1.6M files, 4K each.  I used Dave's fs_mark command
line, spread out over 16 directories from each mounted filesystem.  In
btrfs they are spread over subvolumes to cut down lock contention.

I need to change around the dirty ratios more to smooth out the IO, and
I had trouble with both XFS and btrfs getting runs that were not CPU
bound.  I included the time to run sync at the end of the run because
the results were not very consistent without it.

The XFS runs generally had one CPU pegged at 100%, and I think this is
throwing off the results.  On Monday I'll redo them with two (four?)
filesystems per flash device, hopefully that'll break things up.

The btrfs runs generally had all the CPUs pegged at 100%.  I switched
to mount -o nodatasum and squeezed out an extra 50K files/sec at much
lower CPU utilization.

                    wall time    fs_mark files/sec    bytes written/sec

XFS:
baseline v4.2:      5m6s         118,578              272MB/s
Dave's patch:       4m46s        151,421              294MB/s
my hack:            5m5s         150,714              275MB/s
Linus plug:         5m15s        147,735              266MB/s

Btrfs (nodatasum):
baseline v4.2:      4m39s        242,643              313MB/s
Dave's patch:       3m46s        252,452              389MB/s
my hack:            3m48s        257,924              379MB/s
Linus plug:         3m58s        247,528              369MB/s

Bottom line, not as conclusive as I'd like.  My hack doesn't seem to
hurt, but FS internals are consuming enough CPU that this lock just
isn't showing up.

Linus' plug patch is consistently slower, and I don't have a great
explanation.  My guesses: not keeping the flash pipelines full, or the
imbalance between the different speed flash is averaging the overall
result down, or it's my kblockd vs explicit unplug handwaving from
yesterday.

So, next step is either more runs on flash or grab a box with a bunch
of spindles.  I'd rather do the spindle runs, I agree with Dave that
his patch should help much more on actual drives.

-chris

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12 23:00               ` Chris Mason
@ 2015-09-12 23:29                 ` Linus Torvalds
  2015-09-12 23:46                   ` Chris Mason
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2015-09-12 23:29 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
>
> I did the plain revert as well, just to have a baseline.

Ahh, I ended up not expecting you to get this done until after rc1 was
out, so I in the meantime just merged my fix instead rather than leave
the expected scheduling-while-atomic problem.

And just as well that you did a baseline, since apparently the numbers
are all over the map.  I don't see how your hack and Dave's original
can _possibly_ differ that much, but they clearly did on your xfs
test.  So there's probably huge variance that depends on random
details.

I'll leave things as they are until we have something that looks a bit
more believable ;)

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12 23:29                 ` Linus Torvalds
@ 2015-09-12 23:46                   ` Chris Mason
  2015-09-13 13:12                     ` Chris Mason
  0 siblings, 1 reply; 56+ messages in thread
From: Chris Mason @ 2015-09-12 23:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Sat, Sep 12, 2015 at 04:29:06PM -0700, Linus Torvalds wrote:
> On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
> >
> > I did the plain revert as well, just to have a baseline.
>
> Ahh, I ended up not expecting you to get this done until after rc1 was
> out, so I in the meantime just merged my fix instead rather than leave
> the expected scheduling-while-atomic problem.

Yeah, I wasn't sure I'd be able to do the runs, but it was a rainy
afternoon and this was more fun than cleaning.  Really glad something
got in for rc1 either way.

> And just as well that you did a baseline, since apparently the numbers
> are all over the map.  I don't see how your hack and Dave's original
> can _possibly_ differ that much, but they clearly did on your xfs
> test.  So there's probably huge variance that depends on random
> details.

I don't think the XFS numbers can be trusted too much since it was
basically bottlenecked behind that single pegged CPU.  It was bouncing
around and I couldn't quite track it down to a process name (or perf
profile).

The btrfs numbers were much more consistent, but your patch is still a
win over plain 4.2.

> I'll leave things as they are until we have something that looks a bit
> more believable ;)

We can build from here, thanks Linus.

-chris

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12 23:46                   ` Chris Mason
@ 2015-09-13 13:12                     ` Chris Mason
  2015-09-13 22:56                       ` Dave Chinner
  0 siblings, 1 reply; 56+ messages in thread
From: Chris Mason @ 2015-09-13 13:12 UTC (permalink / raw)
  To: Linus Torvalds, Josef Bacik, LKML, linux-fsdevel, Dave Chinner,
	Neil Brown, Jan Kara, Christoph Hellwig

On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> I don't think the XFS numbers can be trusted too much since it was
> basically bottlenecked behind that single pegged CPU.  It was bouncing
> around and I couldn't quite track it down to a process name (or perf
> profile).

I'll do more runs Monday, but I was able to grab a perf profile of the
pegged XFS CPU.  It was just the writeback worker thread, and it hit
btrfs differently because we defer more of this stuff to endio workers,
effectively spreading it out over more CPUs.

With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
Here's one of the profiles, but it bounced around a lot so I wouldn't
use this to actually tune anything:

 11.42%  kworker/u82:61  [kernel.kallsyms]  [k] _raw_spin_lock
      |
      ---_raw_spin_lock
         |
         |--83.43%-- xfs_extent_busy_trim
         |          xfs_alloc_compute_aligned
         |          |
         |          |--99.92%-- xfs_alloc_ag_vextent_near
         |          |          xfs_alloc_ag_vextent
         |          |          xfs_alloc_vextent
         |          |          xfs_bmap_btalloc
         |          |          xfs_bmap_alloc
         |          |          xfs_bmapi_write
         |          |          xfs_iomap_write_allocate
         |          |          xfs_map_blocks
         |          |          xfs_vm_writepage
         |          |          __writepage
         |          |          write_cache_pages
         |          |          generic_writepages
         |          |          xfs_vm_writepages
         |          |          do_writepages
         |          |          __writeback_single_inode
         |          |          writeback_sb_inodes
         |          |          __writeback_inodes_wb
         |          |          wb_writeback
         |          |          wb_do_writeback
         |          |          wb_workfn
         |          |          process_one_work
         |          |          worker_thread
         |          |          kthread
         |          |          ret_from_fork
         |           --0.08%-- [...]
         |

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-13 13:12                     ` Chris Mason
@ 2015-09-13 22:56                       ` Dave Chinner
  0 siblings, 0 replies; 56+ messages in thread
From: Dave Chinner @ 2015-09-13 22:56 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Jan Kara, Christoph Hellwig

On Sun, Sep 13, 2015 at 09:12:44AM -0400, Chris Mason wrote:
> On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> > I don't think the XFS numbers can be trusted too much since it was
> > basically bottlenecked behind that single pegged CPU.  It was bouncing
> > around and I couldn't quite track it down to a process name (or perf
> > profile).
>
> I'll do more runs Monday, but I was able to grab a perf profile of the
> pegged XFS CPU.  It was just the writeback worker thread, and it hit
> btrfs differently because we defer more of this stuff to endio workers,
> effectively spreading it out over more CPUs.
>
> With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
> Here's one of the profiles, but it bounced around a lot so I wouldn't
> use this to actually tune anything:

mkfs.xfs -d agcount=64 ....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12 23:00               ` Chris Mason
  2015-09-12 23:29                 ` Linus Torvalds
@ 2015-09-13 23:12                 ` Dave Chinner
  2015-09-14 20:06                   ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Dave Chinner @ 2015-09-13 23:12 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Jan Kara, Christoph Hellwig

On Sat, Sep 12, 2015 at 07:00:27PM -0400, Chris Mason wrote:
> I did the plain revert as well, just to have a baseline.  This box is a
> little different from Dave's.  Bare metal two socket box (E5-2660 v2 @
> 2.20GHz) with 144GB of ram.  I have two pcie flash devices, one nvme
> and one fusionio, and I put one FS on each device (two mounts total).
>
> The test created 1.6M files, 4K each.  I used Dave's fs_mark command
> line, spread out over 16 directories from each mounted filesystem.  In
> btrfs they are spread over subvolumes to cut down lock contention.
>
> I need to change around the dirty ratios more to smooth out the IO, and
> I had trouble with both XFS and btrfs getting runs that were not CPU
> bound.  I included the time to run sync at the end of the run because
> the results were not very consistent without it.
>
> The XFS runs generally had one CPU pegged at 100%, and I think this is
> throwing off the results.  On Monday I'll redo them with two (four?)
> filesystems per flash device, hopefully that'll break things up.
>
> The btrfs runs generally had all the CPUs pegged at 100%.  I switched
> to mount -o nodatasum and squeezed out an extra 50K files/sec at much
> lower CPU utilization.
>
>                     wall time    fs_mark files/sec    bytes written/sec
>
> XFS:
> baseline v4.2:      5m6s         118,578              272MB/s
> Dave's patch:       4m46s        151,421              294MB/s
> my hack:            5m5s         150,714              275MB/s
> Linus plug:         5m15s        147,735              266MB/s
>
> Btrfs (nodatasum):
> baseline v4.2:      4m39s        242,643              313MB/s
> Dave's patch:       3m46s        252,452              389MB/s
> my hack:            3m48s        257,924              379MB/s
> Linus plug:         3m58s        247,528              369MB/s

Really need to run these numbers on slower disks where block layer
merging makes a difference to performance.  The high level plugging
improves performance on spinning disks by a huge amount on XFS because
the merging reduces the number of IOs issued to disk by 2 orders of
magnitude.

Plugging makes comparatively little difference on devices that can
sustain extremely high IOPS and hence sink the tens to hundreds of
thousands of individual 4k IOs that this workload generates through
writeback.

i.e. while throughput increases, those are not the numbers that matter
here - it's the change in write IO behaviour that needs to be
categorised and measured by the different patches...

(I'm on holidays, so I'm not going to get to this any time soon)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-13 23:12                 ` Dave Chinner
@ 2015-09-14 20:06                   ` Linus Torvalds
  2015-09-16 15:16                     ` Chris Mason
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2015-09-14 20:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Josef Bacik, LKML, linux-fsdevel, Neil Brown,
	Jan Kara, Christoph Hellwig

On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Really need to run these numbers on slower disks where block layer
> merging makes a difference to performance.

Yeah.  We've seen plugging and io schedulers not make much difference
for high-performance flash (although I think the people who argued
that noop should generally be used for non-rotating media were wrong -
the elevator ends up still being critical to merging, and while
merging isn't a life-or-death situation, it tends to still help).

For rotating rust with nasty seek times, the plugging is likely to
make the biggest difference.

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-14 20:06                   ` Linus Torvalds
@ 2015-09-16 15:16                     ` Chris Mason
  2015-09-16 19:58                       ` Jan Kara
  0 siblings, 1 reply; 56+ messages in thread
From: Chris Mason @ 2015-09-16 15:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Josef Bacik, LKML, linux-fsdevel, Neil Brown,
	Jan Kara, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 1990 bytes --]

On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Really need to run these numbers on slower disks where block layer
> > merging makes a difference to performance.
>
> Yeah.  We've seen plugging and io schedulers not make much difference
> for high-performance flash (although I think the people who argued
> that noop should generally be used for non-rotating media were wrong -
> the elevator ends up still being critical to merging, and while
> merging isn't a life-or-death situation, it tends to still help).

Yeah, my big concern was that holding the plug longer would result in
lower overall perf because we weren't keeping the flash busy.  So I
started with the flash boxes to make sure we weren't regressing past
4.2 levels at least.

I'm still worried about that, but this probably isn't the right
benchmark to show it.  And if it's really a problem, it'll happen
everywhere we plug and not just here.

> For rotating rust with nasty seek times, the plugging is likely to
> make the biggest difference.

For rotating storage, I grabbed a big box and did the fs_mark run
against 8 spindles.  These are all behind a megaraid card as jbods, so
I flipped the card's cache to write-through.

I changed around the run a bit, making enough files for fs_mark to run
for ~10 minutes, and I took out the sync.  I ran only xfs to cut down
on the iterations, and after the fs_mark run, I did a short 30 second
run with blktrace in the background to capture the io sizes.

v4.2:     178K files/sec
Chinner:  192K files/sec
Mason:    192K files/sec
Linus:    193K files/sec

I added support to iowatcher to graph IO size, and attached the graph.

Short version, Linus' patch still gives bigger IOs and similar perf to
Dave's original.  I should have done the blktrace runs for 60 seconds
instead of 30, I suspect that would even out the average sizes between
the three patches.

-chris

[-- Attachment #2: fs_mark.png --]
[-- Type: application/octet-stream, Size: 34363 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-16 15:16                     ` Chris Mason
@ 2015-09-16 19:58                       ` Jan Kara
  2015-09-16 20:00                         ` Chris Mason
  0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2015-09-16 19:58 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Dave Chinner, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Jan Kara, Christoph Hellwig

On Wed 16-09-15 11:16:21, Chris Mason wrote:
> Yeah, my big concern was that holding the plug longer would result in
> lower overall perf because we weren't keeping the flash busy.  So I
> started with the flash boxes to make sure we weren't regressing past
> 4.2 levels at least.
>
> I'm still worried about that, but this probably isn't the right
> benchmark to show it.  And if it's really a problem, it'll happen
> everywhere we plug and not just here.
>
> For rotating storage, I grabbed a big box and did the fs_mark run
> against 8 spindles.  These are all behind a megaraid card as jbods, so
> I flipped the card's cache to write-through.
>
> v4.2:     178K files/sec
> Chinner:  192K files/sec
> Mason:    192K files/sec
> Linus:    193K files/sec
>
> I added support to iowatcher to graph IO size, and attached the graph.
>
> Short version, Linus' patch still gives bigger IOs and similar perf to
> Dave's original.  I should have done the blktrace runs for 60 seconds
> instead of 30, I suspect that would even out the average sizes between
> the three patches.

Thanks for the data Chris.  So I guess we are fine with what's
currently in, right?

                                                        Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-16 19:58                       ` Jan Kara
@ 2015-09-16 20:00                         ` Chris Mason
  2015-09-16 22:07                           ` Dave Chinner
  0 siblings, 1 reply; 56+ messages in thread
From: Chris Mason @ 2015-09-16 20:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Dave Chinner, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig

On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > Short version, Linus' patch still gives bigger IOs and similar perf to
> > Dave's original.  I should have done the blktrace runs for 60 seconds
> > instead of 30, I suspect that would even out the average sizes between
> > the three patches.
>
> Thanks for the data Chris.  So I guess we are fine with what's
> currently in, right?

Looks like it works well to me.

-chris

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-16 20:00                         ` Chris Mason
@ 2015-09-16 22:07                           ` Dave Chinner
  2015-09-17  0:37                             ` Dave Chinner
  0 siblings, 1 reply; 56+ messages in thread
From: Dave Chinner @ 2015-09-16 22:07 UTC (permalink / raw)
  To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig

On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > Dave's original.  I should have done the blktrace runs for 60 seconds
> > > instead of 30, I suspect that would even out the average sizes between
> > > the three patches.
> >
> > Thanks for the data Chris.  So I guess we are fine with what's
> > currently in, right?
>
> Looks like it works well to me.

Graph looks good, though I'll confirm it on my test rig once I get out
from under the pile of email and other stuff that is queued up after
being away for a week...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-16 22:07                           ` Dave Chinner
@ 2015-09-17  0:37                             ` Dave Chinner
  2015-09-17  1:12                               ` Linus Torvalds
  2015-09-17  3:48                               ` Chris Mason
  0 siblings, 2 replies; 56+ messages in thread
From: Dave Chinner @ 2015-09-17  0:37 UTC (permalink / raw)
  To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig
  Cc: Tejun Heo

[cc Tejun]

On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > Looks like it works well to me.
>
> Graph looks good, though I'll confirm it on my test rig once I get out
> from under the pile of email and other stuff that is queued up after
> being away for a week...

I ran some tests in the background while reading other email.....

TL;DR: Results look really bad - not only is the plugging problematic,
baseline writeback performance has regressed significantly.  We need
to revert the plugging changes until the underlying writeback
performance regressions are sorted out.

In more detail, these tests were run on my usual 16p/16GB RAM
performance test VM with storage set up as described here:

http://permalink.gmane.org/gmane.linux.kernel/1768786

The test:

$ ~/tests/fsmark-10-4-test-xfs.sh
meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

#  ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
#       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     0        80000         4096     106938.0           543310
     0       160000         4096     102922.7           476362
     0       240000         4096     107182.9           538206
     0       320000         4096     107871.7           619821
     0       400000         4096      99255.6           622021
     0       480000         4096     103217.8           609943
     0       560000         4096      96544.2           640988
     0       640000         4096     100347.3           676237
     0       720000         4096      87534.8           483495
     0       800000         4096      72577.5          2556920
     0       880000         4096      97569.0           646996

<RAM fills here, sustained performance is now dependent on writeback>

     0       960000         4096      80147.0           515679
     0      1040000         4096     100394.2           816979
     0      1120000         4096      91466.5           739009
     0      1200000         4096      85868.1           977506
     0      1280000         4096      89691.5           715207
     0      1360000         4096      52547.5           712810
     0      1440000         4096      47999.1           685282
     0      1520000         4096      47894.3           697261
     0      1600000         4096      47549.4           789977
     0      1680000         4096      40029.2           677885
     0      1760000         4096      16637.4         12804557
     0      1840000         4096      16883.6         24295975
     0      1920000         4096      18852.1         22264137
     0      2000000         4096      27107.3         10310408
     0      2080000         4096      23654.1         19823567
     0      2160000         4096      24588.9         19769484
     0      2240000         4096      13959.3         40168163
     0      2320000         4096      26239.2         17806813
     0      2400000         4096      17858.5         24596715
     0      2480000         4096      12072.9         38177650
     0      2560000         4096       9989.7         43413729

<This is Real Bad. It gets significantly worse as the test goes on:>

     0      6880000         4096      11773.4         16470400
     0      6960000         4096       5556.5         27325343
     0      7040000         4096      13816.6         22440991
     0      7120000         4096       7322.4         42086686
     0      7200000         4096       8033.0         31182907
     0      7280000         4096      10589.4         30876729
     0      7360000         4096      15844.5         19903105
     0      7440000         4096       9187.7         35041718
     0      7520000         4096      11800.9         12254920
     0      7600000         4096      20049.9         13052020
     0      7680000         4096      17444.9         12055262
     0      7760000         4096      25900.6         10633885
     0      7840000         4096       6398.8         12014901
     0      7920000         4096      12147.4         14655678
<...>
     0      9360000         4096      19406.0          5874193
     0      9440000         4096       7261.3         14542806
     0      9520000         4096      14978.7         21921186
     0      9600000         4096      12656.7         10539894

real    10m9.240s
user    0m51.817s
sys     13m23.874s

A quick reading of the IO behaviour shows an average of about 1000
IOPS and 100MB/s, which matches roughly with the average files/s
created.  There's no obvious peaks and gaps in the IO throughput, but
CPU usage has regular gaps where it falls to nothing and the only
"usage" is IO wait time.

Reverting just Linus' patch, so no plugging at all in the code (i.e.
the baseline numbers):

FSUse%        Count         Size    Files/sec     App Overhead
     0        80000         4096     109017.5           586268
     0       160000         4096      93548.1           553333
     0       240000         4096     104141.7           817182
     0       320000         4096     112846.2           597494
     0       400000         4096      98628.6           641524
     0       480000         4096      86586.1          1274195
     0       560000         4096      95965.0           776961
     0       640000         4096      84585.9           583330
     0       720000         4096      93787.1           695658
     0       800000         4096      94112.7           645925
     0       880000         4096      92259.2           792950

<RAM fills here, sustained performance is now dependent on writeback>

     0       960000         4096      62936.9           763642
     0      1040000         4096      22484.8           737471
     0      1120000         4096      22038.6           706791
     0      1200000         4096      24096.5           680878
     0      1280000         4096      23973.1           839826
     0      1360000         4096      24761.9           779486
     0      1440000         4096      25467.7           976022
     0      1520000         4096      25338.7           935552
     0      1600000         4096      27484.8           683351
     0      1680000         4096      24383.2          3039781
     0      1760000         4096      24857.0          2657008

<maintains 13-26k files/s all the way to the end .... >

     0      9040000         4096      21438.6          4869317
     0      9120000         4096      25387.1           642198
     0      9200000         4096      26815.9           727763
     0      9280000         4096      25175.5          4458811
     0      9360000         4096      19832.8           982122
     0      9440000         4096      17643.8           805599
     0      9520000         4096      13923.4          7819152
     0      9600000         4096      18101.3          1071163

real    7m7.926s
user    0m57.976s
sys     14m14.620s

The baseline of no plugging is a full 3 minutes faster than the
plugging behaviour of Linus' patch.  The IO behaviour demonstrates
that, sustaining between 25-30,000 IOPS and throughput of 130-150MB/s.

Hence, while Linus' patch does change the IO patterns, it does not
result in a performance improvement like the original plugging patch
did.

So I went back and had a look at my original patch, which I've been
using locally for a couple of years and was similar to the original
commit.  It has this description from when I last updated the perf
numbers from testing done on 3.17:

| Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
| metadata CRCs enabled.
|
| Test:
|
| $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
| /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
| /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
| /mnt/scratch/6 -d /mnt/scratch/7
|
| Result:
|              wall    sys     create rate     Physical write IO
|              time    CPU     (avg files/s)    IOPS    Bandwidth
|              -----   -----   -------------   ------   ---------
| unpatched    5m54s   15m32s  32,500+/-2200   28,000   150MB/s
| patched      3m19s   13m28s  52,900+/-1800    1,500   280MB/s
| improvement  -43.8%  -13.3%     +62.7%       -94.6%   +86.6%

IOWs, what we are seeing here is that the baseline writeback
performance has regressed quite significantly since I took these
numbers back on 3.17.  I'm running on exactly the same test setup; the
only difference is the kernel, and so the current kernel baseline is
~20% slower than the baseline numbers I have in my patch.

So, let's put the original [broken] plugging back in:

.....
     0      8960000         4096      28420.0          2129129
     0      9040000         4096      26863.0          6893094
     0      9120000         4096      46721.4           705927
     0      9200000         4096      46384.0          1081627
     0      9280000         4096      39692.3          2148938
     0      9360000         4096      41976.0          3540489
     0      9440000         4096      28025.1         10364440
     0      9520000         4096      38580.2          2457019
     0      9600000         4096      33526.1          7661316

real    5m12.114s
user    0m50.294s
sys     12m54.425s

While we get back that regression and then some more, it's still much
slower than the high level plugging on slightly older kernels.  And
the performance is, again, highly variable through the midpoint of the
test run. e.g:

....
     0      4720000         4096      39224.4          9401401
     0      4800000         4096      25342.9         18889665
     0      4880000         4096      14841.7         36668633
     0      4960000         4096      27624.6         17731873
     0      5040000         4096      37874.8         11390831
     0      5120000         4096      29009.2         12331965
     0      5200000         4096      49568.0          4346746
     0      5280000         4096      83074.5           877142
     0      5360000         4096      32498.8          5401470
     0      5440000         4096      55468.1          3188117
     0      5520000         4096      19156.0          4112120
     0      5600000         4096      44894.0          3186559
....

We have a low of ~15k files/s and a high of 83k files/s whilst
averaging about 38k files/s.

IOWs, ignoring the plugging, the recent changes to the writeback code
have really screwed up writeback performance and changed the way it
behaves from a throughput consistency point of view.

I'd suggest we should start by reverting all the plugging changes for
now, because there are other problems these plugging changes have
exposed in the current writeback code that need to be fixed first...

Cheers,

Dave.

PS: I was going to run this on an ext4 filesystem to check it wasn't
just XFS seeing this, but that reminded me why I don't run this test
on ext4:

FSUse%        Count         Size    Files/sec     App Overhead
     5        80000         4096     211668.3           534504
     5       160000         4096     191731.8           532701
     5       240000         4096     196251.2           489772
     5       320000         4096     207597.1           464032
     5       400000         4096     121616.0           601156
     5       480000         4096     197729.9           573925

<Really, really fast in memory, but background writeback starts here>

     5       560000         4096     128698.4           546362
     5       640000         4096     137904.3           543718
     5       720000         4096      63680.9           498110
     5       800000         4096      25825.9          1538213
     5       880000         4096      32708.7           632235

<dirty page threshold (20% RAM) is hit here.  performance sucks from
here on in (still 10GB of free RAM) because the writeback IO pattern
is a random 4k write pattern>

     5       960000         4096       6940.2          7250666
     5      1040000         4096       7474.8           711872
     5      1120000         4096       4127.5         14034167
     5      1200000         4096       4822.1           752398
....

Hence it takes around an hour to do a single run of this test on ext4
because its delayed allocation algorithm does not linearise writes to
small files with similar inode locality like XFS does, and that's the
behaviour that high level plugging really helps.  IOWs, this specific
workload does not demonstrate any significant performance difference
on ext4 with the different plugging algorithms because it does random
IO rather than sequential IO.

I also ran it on btrfs, but it's pushing peaks of 15-25,000 IOPS every
30s regardless of whether there is plugging at a high level or not, so
again btrfs behaviour is pretty much unaffected by the high level
plugging on this workload....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  0:37                             ` Dave Chinner
@ 2015-09-17  1:12                               ` Linus Torvalds
  2015-09-17  2:14                                 ` Dave Chinner
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2015-09-17  1:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> TL;DR: Results look really bad - not only is the plugging problematic,
> baseline writeback performance has regressed significantly.

Dave, if you're testing my current -git, the other performance issue
might still be the spinlock thing.  I haven't gotten the -tip pull for
the spinlock regression yet, and since you were the only one who
noticed, I didn't consider it critical enough to bypass the normal
schedule, so -rc1 still has the crazy-bad test-and-set loop for
spinlocks.

So it's waiting in the locking fix branch (and is cc'd for stable),
but hasn't reached me yet.  If you want to test that theory out, you
can make virt_spin_lock() just unconditionally return false to disable
the bad cmpxchg loop for now.

The plugging IO pauses are interesting, though.  Plugging really
*shouldn't* cause that kind of pauses, _regardless_ of what level it
happens on, so I wonder if the patch ends up just exposing some really
basic problem that just normally goes hidden.

Can you match up the IO wait times with just *where* it is waiting?
Is it waiting for that inode I_SYNC thing in
inode_sleep_on_writeback()?

But it does sound like we should just revert the whole plugging for
now, if only because "it has odd effects".

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread
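[Editorial note: the test hack Linus suggests amounts to a one-line
stub.  The sketch below assumes the ~4.3-rc1 x86 qspinlock code; the
header location is from memory, and the real function contains a
paravirt test-and-set loop rather than this stub.]

/* arch/x86/include/asm/qspinlock.h -- experiment only: */
static inline bool virt_spin_lock(struct qspinlock *lock)
{
	return false;	/* never take the test-and-set fallback;
			 * always use the queued slowpath */
}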
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  1:12                               ` Linus Torvalds
@ 2015-09-17  2:14                                 ` Dave Chinner
  2015-09-17 19:39                                   ` Linus Torvalds
  2015-09-17 23:03                                   ` Dave Chinner
  0 siblings, 2 replies; 56+ messages in thread
From: Dave Chinner @ 2015-09-17  2:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > TL;DR: Results look really bad - not only is the plugging problematic,
> > baseline writeback performance has regressed significantly.
>
> Dave, if you're testing my current -git, the other performance issue
> might still be the spinlock thing.

I have the fix as the first commit in my local tree - it'll remain
there until I get a conflict after an update. :)

> The plugging IO pauses are interesting, though.  Plugging really
> *shouldn't* cause that kind of pauses, _regardless_ of what level it
> happens on, so I wonder if the patch ends up just exposing some really
> basic problem that just normally goes hidden.

Right, that's what I suspect - it didn't happen on older kernels, but
we've just completely reworked the writeback code for the control
group awareness since I last looked really closely at this...

> Can you match up the IO wait times with just *where* it is waiting?
> Is it waiting for that inode I_SYNC thing in
> inode_sleep_on_writeback()?

I'll do some more investigation.

> But it does sound like we should just revert the whole plugging for
> now, if only because "it has odd effects".

Yup - we can add it again next merge window once we get to the bottom
of whatever is going on and have had time to test the new code
properly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-17 2:14 ` Dave Chinner @ 2015-09-17 19:39 ` Linus Torvalds 2015-09-17 22:42 ` Chris Mason 2015-09-17 23:03 ` Dave Chinner 1 sibling, 1 reply; 56+ messages in thread From: Linus Torvalds @ 2015-09-17 19:39 UTC (permalink / raw) To: Dave Chinner Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote: >> >> Dave, if you're testing my current -git, the other performance issue >> might still be the spinlock thing. > > I have the fix as the first commit in my local tree - it'll remain > there until I get a conflict after an update. :) Ok. I'm happy to report that you should get a conflict now, and that the spinlock code should work well for your virtualized case again. No updates on the plugging thing yet, I'll wait a bit and follow this thread and see if somebody comes up with any explanations or theories in the hope that we might not need to revert (or at least have a more targeted change). Linus ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-17 19:39 ` Linus Torvalds @ 2015-09-17 22:42 ` Chris Mason 2015-09-17 23:08 ` Linus Torvalds 0 siblings, 1 reply; 56+ messages in thread From: Chris Mason @ 2015-09-17 22:42 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 12:39:51PM -0700, Linus Torvalds wrote: > On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > >> Dave, if you're testing my current -git, the other performance issue > >> might still be the spinlock thing. > > > > I have the fix as the first commit in my local tree - it'll remain > > there until I get a conflict after an update. :) > > Ok. I'm happy to report that you should get a conflict now, and that > the spinlock code should work well for your virtualized case again. > > No updates on the plugging thing yet, I'll wait a bit and follow this > thread and see if somebody comes up with any explanations or theories > in the hope that we might not need to revert (or at least have a more > targeted change). Playing around with the plug a little, most of the unplugs are coming from the cond_resched_lock(). Not really sure why we are doing the cond_resched() there, we should be doing it before we retake the lock instead. This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K files/sec up to 213K. Average IO size is the same as 4.3-rc1. It probably won't help Dave, since most of his unplugs should have been from the cond_resched_lock() too.

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..05ed541 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1481,6 +1481,19 @@ static long writeback_sb_inodes(struct super_block *sb,
 		wbc_detach_inode(&wbc);
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
+
+		if (need_resched()) {
+			/*
+			 * we're plugged and don't want to hand off to kblockd
+			 * for the actual unplug work. But we do want to
+			 * reschedule. So flush our plug and then
+			 * schedule away
+			 */
+			blk_flush_plug(current);
+			cond_resched();
+		}
+
+		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
@@ -1488,7 +1501,7 @@
 		requeue_inode(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		cond_resched_lock(&wb->list_lock);
+
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.

^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-17 22:42 ` Chris Mason @ 2015-09-17 23:08 ` Linus Torvalds 2015-09-17 23:56 ` Chris Mason 0 siblings, 1 reply; 56+ messages in thread From: Linus Torvalds @ 2015-09-17 23:08 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote: > > Playing around with the plug a little, most of the unplugs are coming > from the cond_resched_lock(). Not really sure why we are doing the > cond_resched() there, we should be doing it before we retake the lock > instead. > > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K > files/sec up to 213K. Average IO size is the same as 4.3-rc1. Ok, so at least for you, part of the problem really ends up being that there's a mix of the "synchronous" unplugging (by the actual explicit "blk_finish_plug(&plug);") and the writeback that is handed off to kblockd_workqueue. I'm not seeing why that should be an issue. Sure, there's some CPU overhead to context switching, but I don't see that it should be that big of a deal. I wonder if there is something more serious wrong with the kblockd_workqueue. Linus ^ permalink raw reply [flat|nested] 56+ messages in thread
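The synchronous/asynchronous split Linus is pointing at comes down to a single from_schedule flag on blk_flush_plug_list(). The two entry points looked roughly like this in 4.2-era include/linux/blkdev.h (quoted from memory, so a sketch rather than a verbatim copy):

	static inline void blk_flush_plug(struct task_struct *tsk)
	{
		struct blk_plug *plug = tsk->plug;

		if (plug)
			blk_flush_plug_list(plug, false);	/* submit in this context */
	}

	static inline void blk_schedule_flush_plug(struct task_struct *tsk)
	{
		struct blk_plug *plug = tsk->plug;

		if (plug)
			blk_flush_plug_list(plug, true);	/* defer to kblockd */
	}

With from_schedule == true the queued requests are handed to the kblockd workqueue rather than being issued directly, which is where the extra context switches in this discussion come from.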
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-17 23:08 ` Linus Torvalds @ 2015-09-17 23:56 ` Chris Mason 2015-09-18 0:37 ` Dave Chinner 0 siblings, 1 reply; 56+ messages in thread From: Chris Mason @ 2015-09-17 23:56 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote: > On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote: > > > > Playing around with the plug a little, most of the unplugs are coming > > from the cond_resched_lock(). Not really sure why we are doing the > > cond_resched() there, we should be doing it before we retake the lock > > instead. > > > > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K > > files/sec up to 213K. Average IO size is the same as 4.3-rc1. > > Ok, so at least for you, part of the problem really ends up being that > there's a mix of the "synchronous" unplugging (by the actual explicit > "blk_finish_plug(&plug);") and the writeback that is handed off to > kblockd_workqueue. > > I'm not seeing why that should be an issue. Sure, there's some CPU > overhead to context switching, but I don't see that it should be that > big of a deal. > > I wonder if there is something more serious wrong with the kblockd_workqueue. I'm driving the box pretty hard, it's right on the line between CPU bound and IO bound. So I've got 32 fs_mark processes banging away and 32 CPUs (16 really, with hyperthreading). They are popping in and out of balance_dirty_pages() so I have high CPU utilization alternating with high IO wait times. There are no reads at all, so all of these waits are for buffered writes. People in balance_dirty_pages are indirectly waiting on the unplug, so maybe the context switch overhead on a loaded box is enough to explain it. We've definitely gotten more than 9% by inlining small synchronous items in btrfs in the past, but those were more explicitly synchronous. I know it's painfully hand-wavy. I don't see any other users of the kblockd workqueues, and the perf profiles don't jump out at me. I'll feel better about the patch if Dave confirms any gains. -chris ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-17 23:56 ` Chris Mason @ 2015-09-18 0:37 ` Dave Chinner 2015-09-18 1:50 ` Linus Torvalds 0 siblings, 1 reply; 56+ messages in thread From: Dave Chinner @ 2015-09-18 0:37 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 07:56:47PM -0400, Chris Mason wrote: > On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote: > > On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote: > > > > > > Playing around with the plug a little, most of the unplugs are coming > > > from the cond_resched_lock(). Not really sure why we are doing the > > > cond_resched() there, we should be doing it before we retake the lock > > > instead. > > > > > > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K > > > files/sec up to 213K. Average IO size is the same as 4.3-rc1. > > > > Ok, so at least for you, part of the problem really ends up being that > > there's a mix of the "synchronous" unplugging (by the actual explicit > > "blk_finish_plug(&plug);") and the writeback that is handed off to > > kblockd_workqueue. > > > > I'm not seeing why that should be an issue. Sure, there's some CPU > > overhead to context switching, but I don't see that it should be that > > big of a deal. It may well change the dispatch order of enough IOs for it to be significant on an IO bound device. > > I wonder if there is something more serious wrong with the kblockd_workqueue. > > I'm driving the box pretty hard, it's right on the line between CPU > bound and IO bound. So I've got 32 fs_mark processes banging away and > 32 CPUs (16 really, with hyperthreading). I'm only using 8 threads right now, so I have ~6-7 idle CPUs on this workload. Hence if it's CPU load related, I probably won't see any change in behaviour. > They are popping in and out of balance_dirty_pages() so I have high CPU > utilization alternating with high IO wait times. There are no reads at all, > so all of these waits are for buffered writes. > > People in balance_dirty_pages are indirectly waiting on the unplug, so > maybe the context switch overhead on a loaded box is enough to explain > it. We've definitely gotten more than 9% by inlining small synchronous > items in btrfs in the past, but those were more explicitly synchronous. > > I know it's painfully hand-wavy. I don't see any other users of the > kblockd workqueues, and the perf profiles don't jump out at me. I'll > feel better about the patch if Dave confirms any gains. In outright performance on my test machine, the difference in files/s is noise. However, the consistency looks to be substantially improved and the context switch rate is now running at under 3,000/sec. Numbers, including the std deviation of the files/s number output during the fsmark run (averaged across 3 separate benchmark runs):

                files/s   std-dev   wall time
4.3-rc1-noplug  34400     2.0e04    5m25s
4.3-rc1         56600     2.3e04    3m23s
4.3-rc1-flush   56079     1.4e04    3m14s

std-dev is well down, and the improvement in wall time is large enough to be significant. Looks good to me. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 0:37 ` Dave Chinner @ 2015-09-18 1:50 ` Linus Torvalds 2015-09-18 5:40 ` Dave Chinner 0 siblings, 1 reply; 56+ messages in thread From: Linus Torvalds @ 2015-09-18 1:50 UTC (permalink / raw) To: Dave Chinner Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo [-- Attachment #1: Type: text/plain, Size: 2457 bytes --] On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote: >> > >> > I'm not seeing why that should be an issue. Sure, there's some CPU >> > overhead to context switching, but I don't see that it should be that >> > big of a deal. > > It may well change the dispatch order of enough IOs for it to be > significant on an IO bound device. Hmm. Maybe. We obviously try to order the IO's a bit by inode, and I could see the use of a workqueue maybe changing that sufficiently. But it still sounds a bit unlikely. And in fact, I think I have a better explanation. > In outright performance on my test machine, the difference in > files/s is noise. However, the consistency looks to be substantially > improved and the context switch rate is now running at under > 3,000/sec. Hmm. I don't recall seeing you mention how many context switches per second you had before. What is it down from? However, I think I may have found something more interesting here. The fact is that a *normal* schedule will trigger that whole blk_schedule_flush_plug(), but a cond_resched() or a cond_resched_lock() doesn't actually do a normal schedule at all. Those trigger a *preemption* schedule. And a preemption schedule does not trigger that unplugging at all. Why? A kernel "preemption" very much tries to avoid touching thread state, because the whole point is that normally we may be preempting threads in random places, so we don't run things like sched_submit_work(), because the thread may be in the middle of *creating* that work, and we don't want to introduce races. The preemption scheduling can also be done with "task->state" set to sleeping, and it won't actually sleep. Now, for the explicit schedules like "cond_resched()" and "cond_resched_lock()", those races obviously don't exist, but they happen to share the same preemption scheduling logic. So it turns out that as far as I can see, the whole "cond_resched()" will not start any IO at all, and it will just be left on the thread plug until we schedule back to the thread. So I don't think this has anything to do with kblockd_workqueue. I don't think it even gets to that point. I may be missing something, but just to humor me, can you test the attached patch *without* Chris's patch to do explicit plugging? This should make cond_resched() and cond_resched_lock() run the unplugging. It may be entirely broken, I haven't thought this entirely through. 
Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 633 bytes --]

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97d276ff1edb..388ea9e7ab8a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4548,6 +4548,7 @@ SYSCALL_DEFINE0(sched_yield)
 int __sched _cond_resched(void)
 {
 	if (should_resched(0)) {
+		sched_submit_work(current);
 		preempt_schedule_common();
 		return 1;
 	}
@@ -4572,9 +4573,10 @@ int __cond_resched_lock(spinlock_t *lock)
 
 	if (spin_needbreak(lock) || resched) {
 		spin_unlock(lock);
-		if (resched)
+		if (resched) {
+			sched_submit_work(current);
 			preempt_schedule_common();
-		else
+		} else
 			cpu_relax();
 		ret = 1;
 		spin_lock(lock);

^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 1:50 ` Linus Torvalds @ 2015-09-18 5:40 ` Dave Chinner 2015-09-18 6:04 ` Linus Torvalds 0 siblings, 1 reply; 56+ messages in thread From: Dave Chinner @ 2015-09-18 5:40 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 06:50:29PM -0700, Linus Torvalds wrote: > On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > > >> > I'm not seeing why that should be an issue. Sure, there's some CPU > >> > overhead to context switching, but I don't see that it should be that > >> > big of a deal. > > > > It may well change the dispatch order of enough IOs for it to be > > significant on an IO bound device. > > Hmm. Maybe. We obviously try to order the IO's a bit by inode, and I > could see the use of a workqueue maybe changing that sufficiently. But > it still sounds a bit unlikely. > > And in fact, I think I have a better explanation. > > > In outright performance on my test machine, the difference in > > files/s is noise. However, the consistency looks to be substantially > > improved and the context switch rate is now running at under > > 3,000/sec. > > Hmm. I don't recall seeing you mention how many context switches per > second you had before. What is it down from? Around 4000-4500/sec, so there's not a huge amount of context switching going on. It's not like btrfs, where this workload generates over 250,000 context switches/sec.... > However, I think I may have found something more interesting here. > > The fact is that a *normal* schedule will trigger that whole > blk_schedule_flush_plug(), but a cond_resched() or a cond_resched_lock() > doesn't actually do a normal schedule at all. Those trigger a > *preemption* schedule. Ok, makes sense - the plug is not being flushed as we switch away, but Chris' patch makes it do that. > So it turns out that as far as I can see, the whole "cond_resched()" > will not start any IO at all, and it will just be left on the thread > plug until we schedule back to the thread. > > So I don't think this has anything to do with kblockd_workqueue. I > don't think it even gets to that point. > > I may be missing something, but just to humor me, can you test the > attached patch *without* Chris's patch to do explicit plugging? This > should make cond_resched() and cond_resched_lock() run the unplugging. Context switches go back to the 4-4500/sec range. Otherwise behaviour and performance are indistinguishable from Chris' patch. Cheers, Dave. PS: just hit another "did this just get broken in 4.3-rc1" issue - I can't run blktrace while there's an IO load because:

$ sudo blktrace -d /dev/vdc
BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
....
[ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
[ 641.425952] CPU: 13 PID: 11163 Comm: blktrace Not tainted 4.3.0-rc1-dgc+ #430
[ 641.427321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 641.429066]  0000000000000005 ffff8800b731faa0 ffffffff81785609 00000000002040d0
[ 641.430585]  ffff8800b731fb28 ffffffff81183688 0000000000000000 ffffffff824926f8
[ 641.432143]  ffff8800b731faf0 0000000000000042 0000000000000010 0000000000000100
[ 641.433695] Call Trace:
[ 641.434196]  [<ffffffff81785609>] dump_stack+0x4b/0x72
[ 641.435219]  [<ffffffff81183688>] warn_alloc_failed+0xd8/0x130
[ 641.436402]  [<ffffffff811863bb>] __alloc_pages_nodemask+0x66b/0x7d0
[ 641.437684]  [<ffffffff811c0e19>] cache_grow.constprop.61+0xc9/0x2d0
[ 641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
[ 641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
[ 641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
[ 641.442492]  [<ffffffff8115e041>] blk_trace_setup+0x51/0x90
[ 641.443586]  [<ffffffff8115e229>] blk_trace_ioctl+0xb9/0x110
[ 641.444702]  [<ffffffff811e586c>] ? mntput_no_expire+0x2c/0x1a0
[ 641.446419]  [<ffffffff8176fa58>] blkdev_ioctl+0x528/0x690
[ 641.447489]  [<ffffffff811fd37d>] block_ioctl+0x3d/0x50
[ 641.449194]  [<ffffffff811d93da>] do_vfs_ioctl+0x2ba/0x490
[ 641.450491]  [<ffffffff811d9629>] SyS_ioctl+0x79/0x90
[ 641.451575]  [<ffffffff81d800ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 641.453231] Mem-Info:
[ 641.453714] active_anon:11651 inactive_anon:2411 isolated_anon:0 active_file:8456 inactive_file:2070698 isolated_file:0 unevictable:1011 dirty:374605 writeback:1176 unstable:0 slab_reclaimable:1514989 slab_unreclaimable:233692 mapped:9332 shmem:2250 pagetables:1891 bounce:0 free:24305 free_pcp:1448 free_cma:0
[ 641.461950] DMA free:15908kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 641.470017] lowmem_reserve[]: 0 2960 16005 16005
[ 641.471041] DMA32 free:57076kB min:2992kB low:3740kB high:4488kB active_anon:8692kB inactive_anon:2572kB active_file:6224kB inactive_file:1507352kB unevictable:1004kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:3032864kB mlocked:1004kB dirty:275336kB writeback:896kB mapped:6800kB shmem:1836kB slab_reclaimable:1086292kB slab_unreclaimable:183168kB kernel_stack:23600kB pagetables:1372kB unstable:0kB bounce:0kB free_pcp:3440kB local_pcp:16kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 641.479677] lowmem_reserve[]: 0 0 13044 13044
[ 641.480888] Normal free:18040kB min:13188kB low:16484kB high:19780kB active_anon:38264kB inactive_anon:7072kB active_file:27600kB inactive_file:6775516kB unevictable:3040kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13357908kB mlocked:3040kB dirty:1223624kB writeback:5144kB mapped:30528kB shmem:7164kB slab_reclaimable:4975584kB slab_unreclaimable:755076kB kernel_stack:44912kB pagetables:6192kB unstable:0kB bounce:0kB free_pcp:3432kB local_pcp:88kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 641.492392] lowmem_reserve[]: 0 0 0 0
[ 641.493170] DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[ 641.496035] DMA32: 6584*4kB (UEM) 3783*8kB (UM) 17*16kB (M) 9*32kB (M) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 57352kB
[ 641.498705] Normal: 3372*4kB (UE) 393*8kB (UEM) 37*16kB (UEM) 33*32kB (UEM) 6*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18664kB
[ 641.501479] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 641.503019] 2082056 total pagecache pages
[ 641.503759] 0 pages in swap cache
[ 641.504403] Swap cache stats: add 0, delete 0, find 0/0
[ 641.505374] Free swap  = 497976kB
[ 641.506002] Total swap = 497976kB
[ 641.506622] 4194173 pages RAM
[ 641.507171] 0 pages HighMem/MovableOnly
[ 641.507878] 92503 pages reserved

This is from:

(gdb) l *(relay_open+0x68)
0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
577			return NULL;
578		if (subbuf_size > UINT_MAX / n_subbufs)
579			return NULL;
580
581		chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
582		if (!chan)
583			return NULL;
584
585		chan->version = RELAYFS_CHANNEL_VERSION;
586		chan->n_subbufs = n_subbufs;

and struct rchan has a member struct rchan_buf *buf[NR_CPUS]; and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation that fails here.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 56+ messages in thread
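For readers keeping score, the order-5 number follows directly from the NR_CPUS array (editorial arithmetic, assuming 4 KiB pages and 8-byte pointers):

	/*
	 * sizeof(struct rchan_buf *[8192]) = 8192 * 8 = 65536 bytes.
	 * Adding the other struct rchan members pushes the kzalloc() size
	 * just past 64 KiB, so the slab allocator rounds up to the next
	 * power-of-two object: 128 KiB = 32 contiguous pages = order 5.
	 */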
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 5:40 ` Dave Chinner @ 2015-09-18 6:04 ` Linus Torvalds 2015-09-18 6:06 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 56+ messages in thread From: Linus Torvalds @ 2015-09-18 6:04 UTC (permalink / raw) To: Dave Chinner, Jens Axboe Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote: > > Ok, makes sense - the plug is not being flushed as we switch away, > but Chris' patch makes it do that. Yup. And I actually think Chris' patch is better than the one I sent out (but maybe the scheduler people should take a look at the behavior of cond_resched()), I just wanted you to test that to verify the behavior. The fact that Chris' patch ends up lowering the context switches (because it does the unplugging directly) is also an argument for his approach. I just wanted to understand the oddity with kblockd_workqueue. And I think that's solved. > Context switches go back to the 4-4500/sec range. Otherwise > behaviour and performance are indistinguishable from Chris' patch. .. this was exactly what I wanted to hear. So it sounds like we have no odd unexplained behavior left in this area. Which is not to say that there wouldn't be room for improvement, but it just makes me much happier about the state of these patches to feel like we understand what was going on. > PS: just hit another "did this just get broken in 4.3-rc1" issue - I > can't run blktrace while there's an IO load because: > > $ sudo blktrace -d /dev/vdc > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory > .... > > [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0 > [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400 > [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0 > [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0 > > (gdb) l *(relay_open+0x68) > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582). > 577 return NULL; > 578 if (subbuf_size > UINT_MAX / n_subbufs) > 579 return NULL; > 580 > 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL); > 582 if (!chan) > 583 return NULL; > 584 > 585 chan->version = RELAYFS_CHANNEL_VERSION; > 586 chan->n_subbufs = n_subbufs; > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS]; > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation > that fails here.... Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)? From a quick check, none of this code seems to be new. That said, having that struct rchan_buf *buf[NR_CPUS]; in "struct rchan" really is something we should fix. We really should strive to not allocate things by CONFIG_NR_CPU's, but by the actual real CPU count. This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens? Linus ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 6:04 ` Linus Torvalds @ 2015-09-18 6:06 ` Linus Torvalds 2015-09-18 14:21 ` Jens Axboe 2015-09-18 13:16 ` Chris Mason 2015-09-18 22:17 ` Dave Chinner 2 siblings, 1 reply; 56+ messages in thread From: Linus Torvalds @ 2015-09-18 6:06 UTC (permalink / raw) To: Dave Chinner, Jens Axboe Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo Gaah, my mailer autocompleted Jens' email with an old one.. Sorry for the repeat email with the correct address. On Thu, Sep 17, 2015 at 11:04 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote: >> >> PS: just hit another "did this just get broken in 4.3-rc1" issue - I >> can't run blktrace while there's a IO load because: >> >> $ sudo blktrace -d /dev/vdc >> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error >> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory >> .... >> >> [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0 >> [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400 >> [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0 >> [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0 >> >> gdb) l *(relay_open+0x68) >> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582). >> 577 return NULL; >> 578 if (subbuf_size > UINT_MAX / n_subbufs) >> 579 return NULL; >> 580 >> 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL); >> 582 if (!chan) >> 583 return NULL; >> 584 >> 585 chan->version = RELAYFS_CHANNEL_VERSION; >> 586 chan->n_subbufs = n_subbufs; >> >> and struct rchan has a member struct rchan_buf *buf[NR_CPUS]; >> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation >> that fails here.... > > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)? > From a quick check, none of this code seems to be new. > > That said, having that > > struct rchan_buf *buf[NR_CPUS]; > > in "struct rchan" really is something we should fix. We really should > strive to not allocate things by CONFIG_NR_CPU's, but by the actual > real CPU count. > > This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens? > > Linus ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 6:06 ` Linus Torvalds @ 2015-09-18 14:21 ` Jens Axboe 0 siblings, 0 replies; 56+ messages in thread From: Jens Axboe @ 2015-09-18 14:21 UTC (permalink / raw) To: Linus Torvalds, Dave Chinner Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On 09/18/2015 12:06 AM, Linus Torvalds wrote: > Gaah, my mailer autocompleted Jens' email with an old one.. > > Sorry for the repeat email with the correct address. > > On Thu, Sep 17, 2015 at 11:04 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote: >>> >>> PS: just hit another "did this just get broken in 4.3-rc1" issue - I >>> can't run blktrace while there's a IO load because: >>> >>> $ sudo blktrace -d /dev/vdc >>> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error >>> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory >>> .... >>> >>> [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0 >>> [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400 >>> [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0 >>> [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0 >>> >>> gdb) l *(relay_open+0x68) >>> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582). >>> 577 return NULL; >>> 578 if (subbuf_size > UINT_MAX / n_subbufs) >>> 579 return NULL; >>> 580 >>> 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL); >>> 582 if (!chan) >>> 583 return NULL; >>> 584 >>> 585 chan->version = RELAYFS_CHANNEL_VERSION; >>> 586 chan->n_subbufs = n_subbufs; >>> >>> and struct rchan has a member struct rchan_buf *buf[NR_CPUS]; >>> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation >>> that fails here.... >> >> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)? >> From a quick check, none of this code seems to be new. >> >> That said, having that >> >> struct rchan_buf *buf[NR_CPUS]; >> >> in "struct rchan" really is something we should fix. We really should >> strive to not allocate things by CONFIG_NR_CPU's, but by the actual >> real CPU count. >> >> This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens? The relayfs code mostly came out of IBM, but yes, that alloc doesn't look nice. Not a regression, though, I don't think that has changed in years. I'll take a stab at fixing this. -- Jens Axboe ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 6:04 ` Linus Torvalds 2015-09-18 6:06 ` Linus Torvalds @ 2015-09-18 13:16 ` Chris Mason 2015-09-18 14:23 ` Jens Axboe 2015-09-18 22:17 ` Dave Chinner 2 siblings, 1 reply; 56+ messages in thread From: Chris Mason @ 2015-09-18 13:16 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Jens Axboe, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote: > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > Ok, makes sense - the plug is not being flushed as we switch away, > > but Chris' patch makes it do that. > > Yup. Huh, that does make much more sense, thanks Linus. I'm wondering where else I've assumed that cond_resched() unplugged. > > And I actually think Chris' patch is better than the one I sent out > (but maybe the scheduler people should take a look at the behavior of > cond_resched()), I just wanted you to test that to verify the > behavior. Ok, I'll fix up the description and comments and send out. -chris ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 13:16 ` Chris Mason @ 2015-09-18 14:23 ` Jens Axboe 2015-09-18 15:32 ` Linus Torvalds 0 siblings, 1 reply; 56+ messages in thread From: Jens Axboe @ 2015-09-18 14:23 UTC (permalink / raw) To: Chris Mason, Linus Torvalds, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On 09/18/2015 07:16 AM, Chris Mason wrote: > On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote: >> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote: >>> >>> Ok, makes sense - the plug is not being flushed as we switch away, >>> but Chris' patch makes it do that. >> >> Yup. > > Huh, that does make much more sense, thanks Linus. I'm wondering where > else I've assumed that cond_resched() unplugged. > >> >> And I actually think Chris' patch is better than the one I sent out >> (but maybe the scheduler people should take a look at the behavior of >> cond_resched()), I just wanted you to test that to verify the >> behavior. It makes no sense for a preemption schedule to NOT unplug, the fact that it doesn't is news to me as well. It was never the intent of the unplug-on-schedule to NOT unplug for certain schedule-out events, that seems like very odd behavior. -- Jens Axboe ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 14:23 ` Jens Axboe @ 2015-09-18 15:32 ` Linus Torvalds 2015-09-18 15:59 ` Peter Zijlstra 0 siblings, 1 reply; 56+ messages in thread From: Linus Torvalds @ 2015-09-18 15:32 UTC (permalink / raw) To: Jens Axboe, Ingo Molnar, Peter Zijlstra, Frederic Weisbecker Cc: Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Fri, Sep 18, 2015 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote: > > It makes no sense for a preemption schedule to NOT unplug, the fact that it > doesn't is news to me as well. It was never the intent of the > unplug-on-schedule to NOT unplug for certain schedule-out events, that seems > like very odd behavior. Actually, even a *full* schedule doesn't unplug, unless the process is going to sleep. See sched_submit_work(), which will only call the unplugging if the process is actually going to sleep (ok, so it's a bit subtle if you don't know the state rules, but it's the "!tsk->state" check there). So preemption and cond_resched() isn't _that_ odd. We've basically treated a non-sleeping schedule as a no-op for the task work. The thinking was probably that it might be better to delay starting the IO in case we get scheduled back quickly, and we're obviously not actually _sleeping_, so it's likely not too bad. Now, that's probably bogus, and I think that we should perhaps just make the rule be that "if we actually switch to another task, we run blk_schedule_flush_plug()". But it should be noted that that really *does* introduce a lot of new potential races. Traditionally, our block layer plugging has been entirely thread-synchronous, and would never happen asynchronously. But with preemption, that "switch to another thread" really *does* happen asynchronously. So making things always happen on task switch is actually fairly dangerous, and potentially adds the need for much more synchronization for the IO submission. What we possibly *could* make the scheduler rule be:

 - if it's not an actual PREEMPT_ACTIVE (ie in a random place)
 - _and_ we actually switch to another thread
 - _then_ do the whole blk_schedule_flush_plug(tsk) thing.

Adding some scheduler people to the explicit cc list. That said, the "cond_resched[_lock]()" functions currently always set PREEMPT_ACTIVE (indirectly - they use preempt_schedule_common()), so even though those are synchronous, right now they *look* asynchronous to the scheduler, so we'd still have to sort that out. Ingo/Peter/Frederic? Comments? Linus ^ permalink raw reply [flat|nested] 56+ messages in thread
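For reference, the helper Linus is describing - with the "!tsk->state" check - looked roughly like this in 4.2-era kernel/sched/core.c (reconstructed from memory, so treat it as a sketch):

	static inline void sched_submit_work(struct task_struct *tsk)
	{
		/* Still runnable (state == 0) or PI-blocked: leave the plug alone. */
		if (!tsk->state || tsk_is_pi_blocked(tsk))
			return;
		/*
		 * If we are going to sleep and we have plugged IO queued,
		 * make sure to submit it to avoid deadlocks.
		 */
		if (blk_needs_flush_plug(tsk))
			blk_schedule_flush_plug(tsk);
	}

Note that a cond_resched() caller is TASK_RUNNING (tsk->state == 0), so even where this helper does get called it returns early for that case, which is consistent with the behaviour described above.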
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 15:32 ` Linus Torvalds @ 2015-09-18 15:59 ` Peter Zijlstra 2015-09-18 16:02 ` Peter Zijlstra 2015-09-18 16:12 ` Linus Torvalds 0 siblings, 2 replies; 56+ messages in thread From: Peter Zijlstra @ 2015-09-18 15:59 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Fri, Sep 18, 2015 at 08:32:15AM -0700, Linus Torvalds wrote: > Actually, even a *full* schedule doesn't unplug, unless the process is > going to sleep. See sched_submit_work(), which will only call the > unplugging if the process is actually going to sleep (ok, so it's a > bit subtle if you don't know the state rules, but it's the > "!tsk->state" check there) > > So preemption and cond_resched() isn't _that_ odd. We've basically > treated a non-sleeping schedule as a no-op for the task work. > > The thinking was probably that it might be better to delay starting > the IO in case we get scheduled back quickly, and we're obviously not > actually _sleeping_, so it's likely not too bad. > > Now, that's probably bogus, and I think that we should perhaps just > make the rule be that "if we actually switch to another task, we run > blk_schedule_flush_plug()". That gets to be a tad tricky. We must either do it while holding scheduler locks (and I would really want to avoid calling block layer code with those held) or when the new task is already scheduled (and that would suck too, because then we account stuff to the wrong task). > But it should be noted that that really *does* introduce a lot of new > potential races. Traditionally, our block layer plugging has been > entirely thread-synchronous, and would never happen asynchronously. > But with preemption, that "switch to another thread" really *does* > happen asynchronously. > > So making things always happen on task switch is actually fairly > dangerous, and potentially adds the need for much more synchronization > for the IO submission. Can't say I'm a fan of it :/ > What we possibly *could* make the scheduler rule be: > > - if it's not an actual PREEMPT_ACTIVE (ie in a random place) PREEMPT_ACTIVE is actually a recursion flag, 'random place' does not factor into it. That is, most 'random places' will not have PREEMPT_ACTIVE set. > - _and_ we actually switch to another thread This is 'hard', as per the above. > - _then_ do the whole blk_schedule_flush_plug(tsk) thing. > > adding some scheduler people to the explicit cc list. > > That said, the "cond_resched[_lock]()" functions currently always set > PREEMPT_ACTIVE (indirectly - they use preempt_schedule_common()), so > even though those are synchronous, right now they *look* asynchronous > to the scheduler, so we'd still have to sort that out. See PREEMPT_ACTIVE being a recursion flag, we set it there so we won't preempt while we're already scheduling. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 15:59 ` Peter Zijlstra @ 2015-09-18 16:02 ` Peter Zijlstra 2015-09-18 16:12 ` Linus Torvalds 1 sibling, 0 replies; 56+ messages in thread From: Peter Zijlstra @ 2015-09-18 16:02 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Fri, Sep 18, 2015 at 05:59:56PM +0200, Peter Zijlstra wrote: > > - if it's not an actual PREEMPT_ACTIVE (ie in a random place) > > PREEMPT_ACTIVE is actually a recursion flag, 'random place' does not > factor into it. That is, most 'random places' will not have > PREEMPT_ACTIVE set. Ah, _not_ PREEMPT_ACTIVE, its late and I can't read no moar. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 15:59 ` Peter Zijlstra 2015-09-18 16:02 ` Peter Zijlstra @ 2015-09-18 16:12 ` Linus Torvalds 2015-09-28 14:47 ` Peter Zijlstra 1 sibling, 1 reply; 56+ messages in thread From: Linus Torvalds @ 2015-09-18 16:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Fri, Sep 18, 2015 at 8:59 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > See PREEMPT_ACTIVE being a recursion flag, we set it there so we won't > preempt while we're already scheduling. PREEMPT_ACTIVE does more than that. It really is a sign that "this is not synchronous". It causes the scheduler to ignore the current task flags (because it might already be TASK_SLEEPING, but we aren't _actually_ ready to sleep yet) etc. So no. It's not "you can't be preempted during scheduling". That's the *normal* preempt count, and all scheduling calls end up setting that some way (ie "schedule()" just does preempt_disable()). So I disagree with your notion that it's a recursion flag. It is absolutely nothing of the sort. It gets set by preemption - and, somewhat illogically, by cond_resched(). The fact that cond_resched() sets it is *probably* because some of the callers end up calling it from page fault paths etc, and the same "ignore TASK_SLEEPING etc" rules apply. But it does mean that "cond_resched()" is a bit misleading as a name. It's really a "cond_preempt()". Linus ^ permalink raw reply [flat|nested] 56+ messages in thread
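Concretely, the path under discussion looked roughly like this around this time (a sketch only - the exact form of the PREEMPT_ACTIVE accounting changed from version to version):

	static void __sched notrace preempt_schedule_common(void)
	{
		do {
			__preempt_count_add(PREEMPT_ACTIVE);	/* tell __schedule() this is a preemption */
			__schedule();
			__preempt_count_sub(PREEMPT_ACTIVE);

			/* re-check in case we missed a resched request */
		} while (need_resched());
	}

	int __sched _cond_resched(void)
	{
		if (should_resched(0)) {
			preempt_schedule_common();
			return 1;
		}
		return 0;
	}

With PREEMPT_ACTIVE set, __schedule() skips the deactivate-task path even if task->state is set, which is the "ignore the current task flags" behaviour Linus refers to.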
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 16:12 ` Linus Torvalds @ 2015-09-28 14:47 ` Peter Zijlstra 2015-09-28 16:08 ` Linus Torvalds 0 siblings, 1 reply; 56+ messages in thread From: Peter Zijlstra @ 2015-09-28 14:47 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Fri, Sep 18, 2015 at 09:12:38AM -0700, Linus Torvalds wrote: > > So I disagree with your notion that it's a recursion flag. It is > absolutely nothing of the sort. OK, agreed. I had it classed under recursion in my head, clearly I indexed it sloppily. In any case I have a patch that kills off PREEMPT_ACTIVE entirely. I just have to clean it up, benchmark, split and write changelogs. But it should be forthcoming 'soon'. As is, it boots.. > It gets set by preemption - and, > somewhat illogically, by cond_resched(). I suspect that was done to make cond_resched() (voluntary preemption) more robust and only have a single preemption path/logic. But all that was done well before I got involved. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-28 14:47 ` Peter Zijlstra @ 2015-09-28 16:08 ` Linus Torvalds 2015-09-29 7:55 ` Ingo Molnar 0 siblings, 1 reply; 56+ messages in thread From: Linus Torvalds @ 2015-09-28 16:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Mon, Sep 28, 2015 at 10:47 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >> It gets set by preemption - and, >> somewhat illogically, by cond_resched(). > > I suspect that was done to make cond_resched() (voluntary preemption) > more robust and only have a single preemption path/logic. But all that > was done well before I got involved. So I think it's actually the name that is bad, not necessarily the behavior. We tend to put "cond_resched()" (and particularly "cond_resched_lock()") in some fairly awkward places, and it's not always entirely clear that task->state == TASK_RUNNING there. So the preemptive behavior of not *really* putting the task to sleep may actually be the right one. But it is rather non-intuitive given the name - because "cond_resched()" basically is not at all equivalent to "if (need_resched()) schedule()", which you'd kind of expect. An explicit schedule will actually act on the task->state, and make us go to sleep. "cond_resched()" really is just a "voluntary preemption point". And I think it would be better if it got named that way. Linus ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-28 16:08 ` Linus Torvalds @ 2015-09-29 7:55 ` Ingo Molnar 0 siblings, 0 replies; 56+ messages in thread From: Ingo Molnar @ 2015-09-29 7:55 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Jens Axboe, Frederic Weisbecker, Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Sep 28, 2015 at 10:47 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > >> It gets set by preemption - and, > >> somewhat illogically, by cond_resched(). > > > > I suspect that was done to make cond_resched() (voluntary preemption) > > more robust and only have a single preemption path/logic. But all that > > was done well before I got involved. > > So I think it's actually the name that is bad, not necessarily the behavior. > > We tend to put "cond_resched()" (and particularly > "cond_resched_lock()") in some fairly awkward places, and it's not > always entirely clear that task->state == TASK_RUNNING there. > > So the preemptive behavior of not *really* putting the task to sleep > may actually be the right one. But it is rather non-intuitive given > the name - because "cond_resched()" basically is not at all equivalent > to "if (need_resched()) schedule()", which you'd kind of expect. > > An explicit schedule will actually act on the task->state, and make us > go to sleep. "cond_resched()" really is just a "voluntary preemption > point". And I think it would be better if it got named that way. cond_preempt() perhaps? That would allude to preempt_schedule() and such, and would make it clearer that it's supposed to be an invariant on the sleep state (which schedule() is not). Thanks, Ingo ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 6:04 ` Linus Torvalds 2015-09-18 6:06 ` Linus Torvalds 2015-09-18 13:16 ` Chris Mason @ 2015-09-18 22:17 ` Dave Chinner 2015-09-21 9:24 ` Jan Kara 2 siblings, 1 reply; 56+ messages in thread From: Dave Chinner @ 2015-09-18 22:17 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote: > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > Ok, makes sense - the plug is not being flushed as we switch away, > > but Chris' patch makes it do that. > > Yup. > > And I actually think Chris' patch is better than the one I sent out > (but maybe the scheduler people should take a look at the behavior of > cond_resched()), I just wanted you to test that to verify the > behavior. > > The fact that Chris' patch ends up lowering the context switches > (because it does the unplugging directly) is also an argument for his > approach. > > I just wanted to understand the oddity with kblockd_workqueue. And I > think that's solved. > > > Context switches go back to the 4-4500/sec range. Otherwise > > behaviour and performance is indistinguishable from Chris' patch. > > .. this was exactly what I wanted to hear. So it sounds like we have > no odd unexplained behavior left in this area. > > Which is not to say that there wouldn't be room for improvement, but > it just makes me much happier about the state of these patches to feel > like we understand what was going on. > > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I > > can't run blktrace while there's a IO load because: > > > > $ sudo blktrace -d /dev/vdc > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory > > .... > > > > [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0 > > [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400 > > [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0 > > [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0 > > > > gdb) l *(relay_open+0x68) > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582). > > 577 return NULL; > > 578 if (subbuf_size > UINT_MAX / n_subbufs) > > 579 return NULL; > > 580 > > 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL); > > 582 if (!chan) > > 583 return NULL; > > 584 > > 585 chan->version = RELAYFS_CHANNEL_VERSION; > > 586 chan->n_subbufs = n_subbufs; > > > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS]; > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation > > that fails here.... > > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)? > From a quick check, none of this code seems to be new. Yes, I always build MAX_SMP kernels for testing, because XFS is often used on such machines and so I want to find issues exactly like this in my testing rather than on customer machines... :/ > That said, having that > > struct rchan_buf *buf[NR_CPUS]; > > in "struct rchan" really is something we should fix. We really should > strive to not allocate things by CONFIG_NR_CPU's, but by the actual > real CPU count. *nod*. But it doesn't fix the problem of the memory allocation failing when there's still gigabytes of immediately reclaimable memory available in the page cache. 
If this is failing under page cache memory pressure, then we're going to be doing an awful lot more falling back to vmalloc in the filesystem code where large allocations like this are done: extended attribute buffers, for example, are order-5 and are used a lot when doing things like backups, which tend to also produce significant page cache memory pressure. Hence I'm tending towards there being a memory reclaim behaviour regression, not so much worrying about whether this specific allocation is optimal or not. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 56+ messages in thread
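The vmalloc fallback Dave alludes to is the pattern XFS uses for these large buffers; it looks roughly like the following (a sketch in the style of fs/xfs/kmem.c, not a verbatim copy - the real version also twiddles task allocation flags):

	void *
	kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
	{
		void	*ptr;

		/* Try a physically contiguous allocation first... */
		ptr = kmem_zalloc(size, flags | KM_MAYFAIL);
		if (ptr)
			return ptr;
		/* ...and fall back to virtually contiguous memory. */
		return vzalloc(size);
	}

If order-5 kmalloc() calls start failing while gigabytes of clean page cache are reclaimable, these fallbacks fire far more often, which is why Dave reads this as a reclaim regression rather than a relay.c problem.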
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-18 22:17 ` Dave Chinner @ 2015-09-21 9:24 ` Jan Kara 2015-09-21 20:21 ` Andrew Morton 0 siblings, 1 reply; 56+ messages in thread From: Jan Kara @ 2015-09-21 9:24 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Jens Axboe, Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo, linux-mm, Andrew Morton On Sat 19-09-15 08:17:14, Dave Chinner wrote: > On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote: > > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote: > > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I > > > can't run blktrace while there's a IO load because: > > > > > > $ sudo blktrace -d /dev/vdc > > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error > > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory > > > .... > > > > > > [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0 > > > [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400 > > > [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0 > > > [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0 > > > > > > gdb) l *(relay_open+0x68) > > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582). > > > 577 return NULL; > > > 578 if (subbuf_size > UINT_MAX / n_subbufs) > > > 579 return NULL; > > > 580 > > > 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL); > > > 582 if (!chan) > > > 583 return NULL; > > > 584 > > > 585 chan->version = RELAYFS_CHANNEL_VERSION; > > > 586 chan->n_subbufs = n_subbufs; > > > > > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS]; > > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation > > > that fails here.... > > > > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)? > > From a quick check, none of this code seems to be new. > > Yes, I always build MAX_SMP kernels for testing, because XFS is > often used on such machines and so I want to find issues exactly > like this in my testing rather than on customer machines... :/ > > > That said, having that > > > > struct rchan_buf *buf[NR_CPUS]; > > > > in "struct rchan" really is something we should fix. We really should > > strive to not allocate things by CONFIG_NR_CPU's, but by the actual > > real CPU count. > > *nod*. But it doesn't fix the problem of the memory allocation > failing when there's still gigabytes of immediately reclaimable > memory available in the page cache. If this is failing under page > cache memory pressure, then we're going to be doing an awful lot > more falling back to vmalloc in the filesystem code where large > allocations like this are done e.g. extended attribute buffers are > order-5, and used a lot when doing things like backups which tend to > also produce significant page cache memory pressure. > > Hence I'm tending towards there being a memory reclaim behaviour > regression, not so much worrying about whether this specific > allocation is optimal or not. Yup, looks like a regression in reclaim. Added linux-mm folks to CC. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-21 9:24 ` Jan Kara @ 2015-09-21 20:21 ` Andrew Morton 0 siblings, 0 replies; 56+ messages in thread From: Andrew Morton @ 2015-09-21 20:21 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Linus Torvalds, Jens Axboe, Chris Mason, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo, linux-mm On Mon, 21 Sep 2015 11:24:29 +0200 Jan Kara <jack@suse.cz> wrote: > On Sat 19-09-15 08:17:14, Dave Chinner wrote: > > On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote: > > > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I > > > > can't run blktrace while there's a IO load because: > > > > > > > > $ sudo blktrace -d /dev/vdc > > > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error > > > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory > > > > .... > > > > > > > > [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0 > > > > [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400 > > > > [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0 > > > > [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0 > > > > > > > > gdb) l *(relay_open+0x68) > > > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582). > > > > 577 return NULL; > > > > 578 if (subbuf_size > UINT_MAX / n_subbufs) > > > > 579 return NULL; > > > > 580 > > > > 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL); > > > > 582 if (!chan) > > > > 583 return NULL; > > > > 584 > > > > 585 chan->version = RELAYFS_CHANNEL_VERSION; > > > > 586 chan->n_subbufs = n_subbufs; > > > > > > > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS]; > > > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation > > > > that fails here.... > > > > > > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)? > > > From a quick check, none of this code seems to be new. > > > > Yes, I always build MAX_SMP kernels for testing, because XFS is > > often used on such machines and so I want to find issues exactly > > like this in my testing rather than on customer machines... :/ > > > > > That said, having that > > > > > > struct rchan_buf *buf[NR_CPUS]; > > > > > > in "struct rchan" really is something we should fix. We really should > > > strive to not allocate things by CONFIG_NR_CPU's, but by the actual > > > real CPU count. > > > > *nod*. But it doesn't fix the problem of the memory allocation > > failing when there's still gigabytes of immediately reclaimable > > memory available in the page cache. If this is failing under page > > cache memory pressure, then we're going to be doing an awful lot > > more falling back to vmalloc in the filesystem code where large > > allocations like this are done e.g. extended attribute buffers are > > order-5, and used a lot when doing things like backups which tend to > > also produce significant page cache memory pressure. > > > > Hence I'm tending towards there being a memory reclaim behaviour > > regression, not so much worrying about whether this specific > > allocation is optimal or not. > > Yup, looks like a regression in reclaim. Added linux-mm folks to CC. That's going to be hard to find. 
Possibly Vlastimil's 5-patch series "mm, compaction: more robust check for scanners meeting", possibly Joonsoo's "mm/compaction: correct to flush migrated pages if pageblock skip happens". But probably something else :( Teach relay.c about alloc_percpu()? ^ permalink raw reply [flat|nested] 56+ messages in thread
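Andrew's closing suggestion would look something along these lines (a hypothetical sketch, not a patch from this thread):

	/* In struct rchan, replace the fixed NR_CPUS-sized array: */
	-	struct rchan_buf *buf[NR_CPUS];
	+	struct rchan_buf * __percpu *buf;

	/* In relay_open(), size the allocation by the CPUs that exist: */
		chan->buf = alloc_percpu(struct rchan_buf *);
		if (!chan->buf) {
			kfree(chan);
			return NULL;
		}

	/* Per-CPU access then becomes: */
		struct rchan_buf *buf = *per_cpu_ptr(chan->buf, cpu);

This keeps struct rchan small regardless of CONFIG_NR_CPUS and sidesteps the high-order kzalloc() entirely.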
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  2:14 ` Dave Chinner
  2015-09-17 19:39 ` Linus Torvalds
@ 2015-09-17 23:03 ` Dave Chinner
  2015-09-17 23:13 ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Dave Chinner @ 2015-09-17 23:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 12:14:53PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > TL;DR: Results look really bad - not only is the plugging
> > > problematic, baseline writeback performance has regressed
> > > significantly.
> >
> > Dave, if you're testing my current -git, the other performance issue
> > might still be the spinlock thing.
>
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)
>
> > The plugging IO pauses are interesting, though. Plugging really
> > *shouldn't* cause that kind of pauses, _regardless_ of what level it
> > happens on, so I wonder if the patch ends up just exposing some really
> > basic problem that just normally goes hidden.
>
> Right, that's what I suspect - it didn't happen on older kernels,
> but we've just completely reworked the writeback code for the
> control group awareness since I last looked really closely at
> this...
>
> > Can you match up the IO wait times with just *where* it is
> > waiting? Is it waiting for that inode I_SYNC thing in
> > inode_sleep_on_writeback()?
>
> I'll do some more investigation.

Ok, I'm happy to report there is actually nothing wrong with the
plugging code that is in your tree. I finally tracked the problem I
was seeing down to a misbehaving RAID controller. [*]

With that problem sorted:

kernel        files/s   wall time
3.17          32500     5m54s
4.3-noplug    34400     5m25s
3.17-plug     52900     3m19s
4.3-badplug   60540     3m24s
4.3-rc1       56600     3m23s

So the 3.17/4.3-noplug baselines show no regression - 4.3 is slightly
faster. All the plugging variants show roughly the same improvement and
IO behaviour. These numbers are reproducible and there are no weird
performance inconsistencies during any of the 4.3-rc1 kernel runs.
Hence my numbers and observed behaviour now align with Chris' results,
and so I think we can say the reworked high level plugging is behaving
as we expected it to.

Cheers,

Dave.

[*] It seems to have a dodgy battery connector, and so has been
"losing" battery backup and changing the cache mode of the HBA from
write back to write through. This results in changing from NVRAM
performance to SSD native performance and back again. A small
vibration would cause the connection to the battery to reconnect and
the controller would switch back to writeback mode. The few log
entries in the bios showed changes in status from a few seconds apart
to minutes apart - enough for the cache status to change several times
during a 5-10 minute benchmark run. I didn't notice the hardware was
playing up because it wasn't triggering the machine alert indicator
through the bios like it's supposed to, and so the visible and audible
alarms were not being triggered, nor was the BMC logging the raid
controller cache status changes. In the end, I noticed it by chance -
during a low level test the behaviour changed very obviously as one of
my dogs ran past the rack.
I unplugged everything inside the server, plugged it all back in, powered it back up and fiddled with cables until I found what was causing the problem. And having done this, the BMC is now sending warnings and the audible alarm is working when the battery is disconnected... :/ -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17 23:03 ` Dave Chinner
@ 2015-09-17 23:13 ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2015-09-17 23:13 UTC (permalink / raw)
To: Dave Chinner
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Ok, I'm happy to report there is actually nothing wrong with the
> plugging code that is in your tree. I finally tracked the problem I
> was seeing down to a misbehaving RAID controller.[*]

Hey, that's great. Just out of interest, try the patch that Chris just
sent that turns the unplugging synchronous for the special
"cond_resched()" case. If that helps your case too, let's just do it,
even while we wonder why kblockd_workqueue messes up so noticeably.

Because regardless of the kblockd_workqueue questions, I think it's
nicer to avoid the cond_resched_lock(), and do the rescheduling while
we're not holding any locks.

                 Linus

^ permalink raw reply [flat|nested] 56+ messages in thread
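The patch Linus refers to has, in rough outline, the following shape. This is a sketch with the exact placement inside writeback_sb_inodes() simplified; the point is that the writeback thread flushes its own plug synchronously before yielding the CPU, instead of letting the context switch hand the queued IO to kblockd:

        /* about to give up the CPU: get our queued IO moving now, in
         * this task's context, so anyone throttling in
         * balance_dirty_pages() isn't left waiting for a context
         * switch to the kblockd workqueue */
        if (need_resched()) {
                blk_flush_plug(current);
                cond_resched();
        }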
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  0:37 ` Dave Chinner
  2015-09-17  1:12 ` Linus Torvalds
@ 2015-09-17  3:48 ` Chris Mason
  2015-09-17  4:30 ` Dave Chinner
  1 sibling, 1 reply; 56+ messages in thread
From: Chris Mason @ 2015-09-17 3:48 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> [cc Tejun]
>
> On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > > > instead of 30, I suspect that would even out the average sizes between
> > > > > the three patches.
> > > >
> > > > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > > > right?
> > >
> > > Looks like it works well to me.
> >
> > Graph looks good, though I'll confirm it on my test rig once I get
> > out from under the pile of email and other stuff that is queued up
> > after being away for a week...
>
> I ran some tests in the background while reading other email.....
>
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly. We need to revert the plugging changes until the
> underlying writeback performance regressions are sorted out.
>
> In more detail, these tests were run on my usual 16p/16GB RAM
> performance test VM with storage set up as described here:
>
> http://permalink.gmane.org/gmane.linux.kernel/1768786
>
> The test:
>
> $ ~/tests/fsmark-10-4-test-xfs.sh
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > FSUse% Count Size Files/sec App Overhead > 0 80000 4096 106938.0 543310 > 0 160000 4096 102922.7 476362 > 0 240000 4096 107182.9 538206 > 0 320000 4096 107871.7 619821 > 0 400000 4096 99255.6 622021 > 0 480000 4096 103217.8 609943 > 0 560000 4096 96544.2 640988 > 0 640000 4096 100347.3 676237 > 0 720000 4096 87534.8 483495 > 0 800000 4096 72577.5 2556920 > 0 880000 4096 97569.0 646996 > > <RAM fills here, sustained performance is now dependent on writeback> I think too many variables have changed here. My numbers: FSUse% Count Size Files/sec App Overhead 0 160000 4096 356407.1 1458461 0 320000 4096 368755.1 1030047 0 480000 4096 358736.8 992123 0 640000 4096 361912.5 1009566 0 800000 4096 342851.4 1004152 0 960000 4096 358357.2 996014 0 1120000 4096 338025.8 1004412 0 1280000 4096 354440.3 997380 0 1440000 4096 335225.9 1000222 0 1600000 4096 278786.1 1164962 0 1760000 4096 268161.4 1205255 0 1920000 4096 259158.0 1298054 0 2080000 4096 276939.1 1219411 0 2240000 4096 252385.1 1245496 0 2400000 4096 280674.1 1189161 0 2560000 4096 290155.4 1141941 0 2720000 4096 280842.2 1179964 0 2880000 4096 272446.4 1155527 0 3040000 4096 268827.4 1235095 0 3200000 4096 251767.1 1250006 0 3360000 4096 248339.8 1235471 0 3520000 4096 267129.9 1200834 0 3680000 4096 257320.7 1244854 0 3840000 4096 233540.8 1267764 0 4000000 4096 269237.0 1216324 0 4160000 4096 249787.6 1291767 0 4320000 4096 256185.7 1253776 0 4480000 4096 257849.7 1212953 0 4640000 4096 253933.9 1181216 0 4800000 4096 263567.2 1233937 0 4960000 4096 255666.4 1231802 0 5120000 4096 257083.2 1282893 0 5280000 4096 254285.0 1229031 0 5440000 4096 265561.6 1219472 0 5600000 4096 266374.1 1229886 0 5760000 4096 241003.7 1257064 0 5920000 4096 245047.4 1298330 0 6080000 4096 254771.7 1257241 0 6240000 4096 254355.2 1261006 0 6400000 4096 254800.4 1201074 0 6560000 4096 262794.5 1234816 0 6720000 4096 248103.0 1287921 0 6880000 4096 231397.3 1291224 0 7040000 4096 227898.0 1285359 0 7200000 4096 227279.6 1296340 0 7360000 4096 232561.5 1748248 0 7520000 4096 231055.3 1169373 0 7680000 4096 245738.5 1121856 0 7840000 4096 234961.7 1147035 0 8000000 4096 243973.0 1152202 0 8160000 4096 246292.6 1169527 0 8320000 4096 249433.2 1197921 0 8480000 4096 222576.0 1253650 0 8640000 4096 239407.5 1263257 0 8800000 4096 246037.1 1218109 0 8960000 4096 242306.5 1293567 0 9120000 4096 238525.9 3745133 0 9280000 4096 269869.5 1159541 0 9440000 4096 266447.1 4794719 0 9600000 4096 265748.9 1161584 0 9760000 4096 269067.8 1149918 0 9920000 4096 248896.2 1164112 0 10080000 4096 261342.9 1174536 0 10240000 4096 254778.3 1225425 0 10400000 4096 257702.2 1211634 0 10560000 4096 233972.5 1203665 0 10720000 4096 232647.1 1197486 0 10880000 4096 242320.6 1203984 I can push the dirty threshold lower to try and make sure we end up in the hard dirty limits but none of this is going to be related to the plugging patch. I do see lower numbers if I let the test run even longer, but there are a lot of things in the way that can slow it down as the filesystem gets that big. I'll try again with lower ratios. [ ... ] > The baseline of no plugging is a full 3 minutes faster than the > plugging behaviour of Linus' patch. The IO behaviour demonstrates > that, sustaining between 25-30,000 IOPS and throughput of > 130-150MB/s. Hence, while Linus' patch does change the IO patterns, > it does not result in a performance improvement like the original > plugging patch did. > How consistent is this across runs? 
> So I went back and had a look at my original patch, which I've been
> using locally for a couple of years and was similar to the original
> commit. It has this description from when I last updated the perf
> numbers from testing done on 3.17:
>
> | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
> | metadata CRCs enabled.
> |
> | Test:
> |
> | $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
> | /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
> | /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
> | /mnt/scratch/6 -d /mnt/scratch/7
> |
> | Result:
> |               wall    sys     create rate     Physical write IO
> |               time    CPU     (avg files/s)   IOPS     Bandwidth
> |               -----   -----   -------------   ------   ---------
> | unpatched     5m54s   15m32s  32,500+/-2200   28,000   150MB/s
> | patched       3m19s   13m28s  52,900+/-1800    1,500   280MB/s
> | improvement   -43.8%  -13.3%  +62.7%          -94.6%   +86.6%
>
> IOWs, what we are seeing here is that the baseline writeback
> performance has regressed quite significantly since I took these
> numbers back on 3.17. I'm running on exactly the same test setup;
> the only difference is the kernel and so the current kernel baseline
> is ~20% slower than the baseline numbers I have in my patch.

All of this in a VM, I'd much rather see this reproduced on bare metal.
I've had really consistent results with VMs in the past, but there is a
huge amount of code between 3.17 and now.

-chris

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-17 3:48 ` Chris Mason @ 2015-09-17 4:30 ` Dave Chinner 2015-09-17 12:13 ` Chris Mason 0 siblings, 1 reply; 56+ messages in thread From: Dave Chinner @ 2015-09-17 4:30 UTC (permalink / raw) To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote: > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote: > > [cc Tejun] > > > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote: > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015 > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. > > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write > > # App overhead is time in microseconds spent in the test not doing file writing related system calls. > > > > FSUse% Count Size Files/sec App Overhead > > 0 80000 4096 106938.0 543310 > > 0 160000 4096 102922.7 476362 > > 0 240000 4096 107182.9 538206 > > 0 320000 4096 107871.7 619821 > > 0 400000 4096 99255.6 622021 > > 0 480000 4096 103217.8 609943 > > 0 560000 4096 96544.2 640988 > > 0 640000 4096 100347.3 676237 > > 0 720000 4096 87534.8 483495 > > 0 800000 4096 72577.5 2556920 > > 0 880000 4096 97569.0 646996 > > > > <RAM fills here, sustained performance is now dependent on writeback> > > I think too many variables have changed here. > > My numbers: > > FSUse% Count Size Files/sec App Overhead > 0 160000 4096 356407.1 1458461 > 0 320000 4096 368755.1 1030047 > 0 480000 4096 358736.8 992123 > 0 640000 4096 361912.5 1009566 > 0 800000 4096 342851.4 1004152 <snip> > I can push the dirty threshold lower to try and make sure we end up in > the hard dirty limits but none of this is going to be related to the > plugging patch. The point of this test is to drive writeback as hard as possible, not to measure how fast we can create files in memory. i.e. if the test isn't pushing the dirty limits on your machines, then it really isn't putting a meaningful load on writeback, and so the plugging won't make significant difference because writeback isn't IO bound.... > I do see lower numbers if I let the test run even > longer, but there are a lot of things in the way that can slow it down > as the filesystem gets that big. Sure, that's why I hit the dirty limits early in the test - so it measures steady state performance before the fs gets to any significant scalability limits.... > > The baseline of no plugging is a full 3 minutes faster than the > > plugging behaviour of Linus' patch. The IO behaviour demonstrates > > that, sustaining between 25-30,000 IOPS and throughput of > > 130-150MB/s. Hence, while Linus' patch does change the IO patterns, > > it does not result in a performance improvement like the original > > plugging patch did. > > How consistent is this across runs? That's what I'm trying to work out. 
I didn't report it until I got consistently bad results - the numbers I
reported were from the third time I ran the comparison, and they were
representative and reproducible. I also ran my inode creation workload
that is similar (but has no data writeback so doesn't go through
writeback paths at all) and that shows no change in performance, so this
problem (whatever it is) is only manifesting itself through data
writeback....

The only measurable change I've noticed in my monitoring graphs is
that there is a lot more iowait time than I normally see, even when
the plugging appears to be working as desired. That's what I'm
trying to track down now, and once I've got to the bottom of that I
should have some idea of where the performance has gone....

As it is, there are a bunch of other things going wrong with
4.3-rc1+ right now that I'm working through - I haven't updated my
kernel tree for 10 days because I've been away on holidays so I'm
doing my usual "-rc1 is broken again" dance that I do every release
cycle. (e.g. every second boot hangs because systemd appears to be
waiting for iscsi devices to appear without first starting the iscsi
target daemon. Never happened before today, every new kernel I've
booted today has hung on the first cold boot of the VM.)

> > IOWs, what we are seeing here is that the baseline writeback
> > performance has regressed quite significantly since I took these
> > numbers back on 3.17. I'm running on exactly the same test setup;
> > the only difference is the kernel and so the current kernel baseline
> > is ~20% slower than the baseline numbers I have in my patch.
>
> All of this in a VM, I'd much rather see this reproduced on bare metal.
> I've had really consistent results with VMs in the past, but there is a
> huge amount of code between 3.17 and now.

I'm pretty sure it's not the VM - with the locking fix in place,
everything else I've looked at (since applying the locking fix Linus
mentioned) is within measurement error compared to 4.2. The only thing
that is out of whack from a performance POV is data writeback....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-17 4:30 ` Dave Chinner @ 2015-09-17 12:13 ` Chris Mason 0 siblings, 0 replies; 56+ messages in thread From: Chris Mason @ 2015-09-17 12:13 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo On Thu, Sep 17, 2015 at 02:30:08PM +1000, Dave Chinner wrote: > On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote: > > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote: > > > [cc Tejun] > > > > > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote: > > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 > > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015 > > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. > > > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. > > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) > > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write > > > # App overhead is time in microseconds spent in the test not doing file writing related system calls. > > > > > > FSUse% Count Size Files/sec App Overhead > > > 0 80000 4096 106938.0 543310 > > > 0 160000 4096 102922.7 476362 > > > 0 240000 4096 107182.9 538206 > > > 0 320000 4096 107871.7 619821 > > > 0 400000 4096 99255.6 622021 > > > 0 480000 4096 103217.8 609943 > > > 0 560000 4096 96544.2 640988 > > > 0 640000 4096 100347.3 676237 > > > 0 720000 4096 87534.8 483495 > > > 0 800000 4096 72577.5 2556920 > > > 0 880000 4096 97569.0 646996 > > > > > > <RAM fills here, sustained performance is now dependent on writeback> > > > > I think too many variables have changed here. > > > > My numbers: > > > > FSUse% Count Size Files/sec App Overhead > > 0 160000 4096 356407.1 1458461 > > 0 320000 4096 368755.1 1030047 > > 0 480000 4096 358736.8 992123 > > 0 640000 4096 361912.5 1009566 > > 0 800000 4096 342851.4 1004152 > > <snip> > > > I can push the dirty threshold lower to try and make sure we end up in > > the hard dirty limits but none of this is going to be related to the > > plugging patch. > > The point of this test is to drive writeback as hard as possible, > not to measure how fast we can create files in memory. i.e. if the > test isn't pushing the dirty limits on your machines, then it really > isn't putting a meaningful load on writeback, and so the plugging > won't make significant difference because writeback isn't IO > bound.... It does end up IO bound on my rig, just because we do eventually hit the dirty limits. Otherwise there would be zero benefits in fs_mark from any patches vs plain v4.2 But I setup a run last night with a dirty_ratio_bytes at 3G and dirty_background_ratio_bytes at 1.5G. 
There is definitely variation, but nothing like what you saw: FSUse% Count Size Files/sec App Overhead 0 160000 4096 317427.9 1524951 0 320000 4096 319723.9 1023874 0 480000 4096 336696.4 1053884 0 640000 4096 257113.1 1190851 0 800000 4096 257644.2 1198054 0 960000 4096 254896.6 1225610 0 1120000 4096 241052.6 1203227 0 1280000 4096 214961.2 1386236 0 1440000 4096 239985.7 1264659 0 1600000 4096 232174.3 1310018 0 1760000 4096 250477.9 1227289 0 1920000 4096 221500.9 1276223 0 2080000 4096 235212.1 1284989 0 2240000 4096 238580.2 1257260 0 2400000 4096 224182.6 1326821 0 2560000 4096 234628.7 1236402 0 2720000 4096 244675.3 1228400 0 2880000 4096 234364.0 1268408 0 3040000 4096 229712.6 1306148 0 3200000 4096 241170.5 1254490 0 3360000 4096 220487.8 1331456 0 3520000 4096 215831.7 1313682 0 3680000 4096 210934.7 1235750 0 3840000 4096 218435.4 1258077 0 4000000 4096 232127.7 1271555 0 4160000 4096 212017.6 1381525 0 4320000 4096 216309.3 1370558 0 4480000 4096 239072.4 1269086 0 4640000 4096 221959.1 1333164 0 4800000 4096 228396.8 1213160 0 4960000 4096 225747.5 1318503 0 5120000 4096 115727.0 1237327 0 5280000 4096 184171.4 1547357 0 5440000 4096 209917.8 1380510 0 5600000 4096 181074.7 1391764 0 5760000 4096 263516.7 1155172 0 5920000 4096 236405.8 1239719 0 6080000 4096 231587.2 1221408 0 6240000 4096 237118.8 1244272 0 6400000 4096 236773.2 1201428 0 6560000 4096 243987.5 1240527 0 6720000 4096 232428.0 1283265 0 6880000 4096 234839.9 1209152 0 7040000 4096 234947.3 1223456 0 7200000 4096 231463.1 1260628 0 7360000 4096 226750.3 1290098 0 7520000 4096 213632.0 1236409 0 7680000 4096 194710.2 1411595 0 7840000 4096 213963.1 4146893 0 8000000 4096 225109.8 1323573 0 8160000 4096 251322.1 1380271 0 8320000 4096 220167.2 1159390 0 8480000 4096 210991.2 1110593 0 8640000 4096 197922.8 1126072 0 8800000 4096 203539.3 1143501 0 8960000 4096 193041.7 1134329 0 9120000 4096 184667.9 1119222 0 9280000 4096 165968.7 1172738 0 9440000 4096 192767.3 1098361 0 9600000 4096 227115.7 1158097 0 9760000 4096 232139.8 1264245 0 9920000 4096 213320.5 1270505 0 10080000 4096 217013.4 1324569 0 10240000 4096 227171.6 1308668 0 10400000 4096 208591.4 1392098 0 10560000 4096 212006.0 1359188 0 10720000 4096 213449.3 1352084 0 10880000 4096 219890.1 1326240 0 11040000 4096 215907.7 1239180 0 11200000 4096 214207.2 1334846 0 11360000 4096 212875.2 1338429 0 11520000 4096 211690.0 1249519 0 11680000 4096 217013.0 1262050 0 11840000 4096 204730.1 1205087 0 12000000 4096 191146.9 1188635 0 12160000 4096 207844.6 1157033 0 12320000 4096 208857.7 1168111 0 12480000 4096 198256.4 1388368 0 12640000 4096 214996.1 1305412 0 12800000 4096 212332.9 1357814 0 12960000 4096 210325.8 1336127 0 13120000 4096 200292.1 1282419 0 13280000 4096 202030.2 1412105 0 13440000 4096 216553.7 1424076 0 13600000 4096 218721.7 1298149 0 13760000 4096 202037.4 1266877 0 13920000 4096 224032.3 1198159 0 14080000 4096 206105.6 1336489 0 14240000 4096 227540.3 1160841 0 14400000 4096 236921.7 1190394 0 14560000 4096 229343.3 1147451 0 14720000 4096 199435.1 1284374 0 14880000 4096 215177.3 1178542 0 15040000 4096 206194.1 1170832 0 15200000 4096 215762.3 1125633 0 15360000 4096 194511.0 1122947 0 15520000 4096 179008.5 1292603 0 15680000 4096 208636.9 1094960 0 15840000 4096 192173.1 1237891 0 16000000 4096 212888.9 1111551 0 16160000 4096 218403.0 1143400 0 16320000 4096 207260.5 1233526 0 16480000 4096 202123.2 1151509 0 16640000 4096 191033.0 1257706 0 16800000 4096 196865.4 1154520 0 16960000 4096 210361.2 1128930 0 17120000 4096 
201755.2 1160469 0 17280000 4096 196946.6 1173529 0 17440000 4096 199677.8 1165750 0 17600000 4096 194248.4 1234944 0 17760000 4096 200027.9 1256599 0 17920000 4096 206507.0 1166820 0 18080000 4096 215082.7 1167599 0 18240000 4096 201475.5 1212202 0 18400000 4096 208247.6 1252255 0 18560000 4096 205482.7 1311436 0 18720000 4096 200111.9 1358784 0 18880000 4096 200028.3 1351332 0 19040000 4096 198873.4 1287400 0 19200000 4096 209609.3 1268400 0 19360000 4096 203538.6 1249787 0 19520000 4096 203427.9 1294105 0 19680000 4096 201905.3 1280714 0 19840000 4096 209642.9 1283281 0 20000000 4096 203438.9 1315427 0 20160000 4096 199690.7 1252267 0 20320000 4096 185965.2 1398905 0 20480000 4096 203221.6 1214029 0 20640000 4096 208654.8 1232679 0 20800000 4096 212488.6 1298458 0 20960000 4096 189701.1 1356640 0 21120000 4096 198522.1 1361240 0 21280000 4096 203857.3 1263402 0 21440000 4096 204616.8 1362853 0 21600000 4096 196310.6 1266710 0 21760000 4096 203275.4 1391150 0 21920000 4096 205998.5 1378741 0 22080000 4096 205434.2 1283787 0 22240000 4096 195918.0 1415912 0 22400000 4096 186193.0 1413623 0 22560000 4096 192911.3 1393471 0 22720000 4096 203726.3 1264281 0 22880000 4096 204853.4 1221048 0 23040000 4096 222803.2 1153031 0 23200000 4096 198558.6 1346256 0 23360000 4096 201001.4 1278817 0 23520000 4096 206225.2 1270440 0 23680000 4096 190894.2 1425299 0 23840000 4096 198555.6 1334122 0 24000000 4096 202386.4 1332157 0 24160000 4096 205103.1 1313607 > > > I do see lower numbers if I let the test run even > > longer, but there are a lot of things in the way that can slow it down > > as the filesystem gets that big. > > Sure, that's why I hit the dirty limits early in the test - so it > measures steady state performance before the fs gets to any > significant scalability limits.... > > > > The baseline of no plugging is a full 3 minutes faster than the > > > plugging behaviour of Linus' patch. The IO behaviour demonstrates > > > that, sustaining between 25-30,000 IOPS and throughput of > > > 130-150MB/s. Hence, while Linus' patch does change the IO patterns, > > > it does not result in a performance improvement like the original > > > plugging patch did. > > > > How consistent is this across runs? > > That's what I'm trying to work out. I didn't report it until I got > consistently bad results - the numbers I reported were from the > third time I ran the comparison, and they were representative and > reproducable. I also ran my inode creation workload that is similar > (but has not data writeback so doesn't go through writeback paths at > all) and that shows no change in performance, so this problem > (whatever it is) is only manifesting itself through data > writeback.... The big change between Linus' patch and your patch is with Linus kblockd is probably doing most of the actual unplug work (except for the last super block in the list). If a process is waiting for dirty writeout progress, it has to wait for that context switch to kblockd. In the VM, that's going to hurt more then my big two socket mostly idle machine. > > The only measurable change I've noticed in my monitoring graphs is > that there is a lot more iowait time than I normally see, even when > the plugging appears to be working as desired. That's what I'm > trying to track down now, and once I've got to the bottom of that I > should have some idea of where the performance has gone.... 
> > As it is, there are a bunch of other things going wrong with > 4.3-rc1+ right now that I'm working through - I haven't updated my > kernel tree for 10 days because I've been away on holidays so I'm > doing my usual "-rc1 is broken again" dance that I do every release > cycle. (e.g every second boot hangs because systemd appears to be > waiting for iscsi devices to appear without first starting the iscsi > target daemon. Never happened before today, every new kernel I've > booted today has hung on the first cold boot of the VM). I've been doing 4.2 plus patches because rc1 didn't boot on this strange box. Let me nail that down and rerun. -chris ^ permalink raw reply [flat|nested] 56+ messages in thread
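The kblockd hand-off Chris mentions above is the from_schedule split in the block layer's unplug path. A simplified sketch of queue_unplugged() from block/blk-core.c of this era (tracing and queue-lock handling elided):

static void queue_unplugged(struct request_queue *q, unsigned int depth,
                            bool from_schedule)
{
        if (from_schedule)
                /* plug flushed by the scheduler: the task is about
                 * to sleep, so punt the queue run to kblockd */
                blk_run_queue_async(q);
        else
                /* explicit blk_finish_plug(): issue the requests
                 * right here, in the caller's context */
                __blk_run_queue(q);
}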
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-11 21:04 ` Linus Torvalds 2015-09-11 22:06 ` Linus Torvalds @ 2015-09-11 23:06 ` Chris Mason 2015-09-11 23:13 ` Linus Torvalds 1 sibling, 1 reply; 56+ messages in thread From: Chris Mason @ 2015-09-11 23:06 UTC (permalink / raw) To: Linus Torvalds Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig On Fri, Sep 11, 2015 at 02:04:13PM -0700, Linus Torvalds wrote: > On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote: > > > > So we talked about this when we were trying to figure out a solution. The > > problem with this approach is now we have a plug that covers multiple super > > blocks (__writeback_inodes_wb loops through the sb's starts writeback), > > which is likely to give us crappier performance than no plug at all. > > Why would that be? Either they are on separate disks, and the IO is > all independent anyway, and at most it got started by some (small) > CPU-amount later. Actual throughput should be the same. No? > > Or the filesystems are on the same disk, in which case it should > presumably be a win to submit the IO together. > > Of course, actual numbers would be the deciding factor if this really > is noticeable. But "cleaner code and saner locking" is definitely an > issue at least for me. Originally I was worried about the latency impact of holding the plugs over more than one super with high end flash. I just didn't want to hold onto the IO for longer than we had to. But, since this isn't really latency sensitive anyway, if we find we're not keeping the flash pipelines full the right answer is to short circuit the plugging in general. I'd agree actual throughput should be the same. But benchmarking is the best choice, I'll be able to reproduce Dave's original results without too much trouble. Our thinking for this duct tape patch was the lock wasn't very hot and this was the best immediate compromise between the bug and the perf improvement. Happy to sign up for pushing the lock higher if that's what people would rather see. -chris ^ permalink raw reply [flat|nested] 56+ messages in thread
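A sketch of what "pushing the lock higher" could look like: one plug started in wb_writeback() before wb->list_lock is taken and finished after the lock is dropped, so the flush never runs under the spinlock and a single plug covers every superblock in the pass. Signatures follow the 4.3-era code; the writeback loop body and accounting are elided:

static long wb_writeback(struct bdi_writeback *wb,
                         struct wb_writeback_work *work)
{
        struct blk_plug plug;
        long progress = 0;

        blk_start_plug(&plug);          /* before taking the lock */
        spin_lock(&wb->list_lock);

        /* ... the usual loop over queue_io(), writeback_sb_inodes()
         * and __writeback_inodes_wb(), all run under list_lock and
         * now without any plug of their own ... */

        spin_unlock(&wb->list_lock);
        blk_finish_plug(&plug);         /* unplug with no spinlock held */

        return progress;
}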
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 23:06 ` Chris Mason
@ 2015-09-11 23:13 ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2015-09-11 23:13 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 4:06 PM, Chris Mason <clm@fb.com> wrote:
>
> Originally I was worried about the latency impact of holding the
> plugs over more than one super with high end flash. I just didn't want
> to hold onto the IO for longer than we had to.
>
> But, since this isn't really latency sensitive anyway, if we find we're
> not keeping the flash pipelines full the right answer is to short
> circuit the plugging in general. I'd agree actual throughput should be
> the same.

Yeah, this only triggers for system-wide writeback, so I don't see
that it should be latency-sensitive at that level, afaik. But hey,
it's IO, and I've been surprised by magic pattern sensitivities
before...

                 Linus

^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() @ 2015-09-09 15:23 Chris Mason 2015-09-11 18:49 ` Jens Axboe 0 siblings, 1 reply; 56+ messages in thread From: Chris Mason @ 2015-09-09 15:23 UTC (permalink / raw) To: linux-fsdevel, Josef Bacik, Dave Chinner Commit d353d7587 added a plug/finish_plug pair to writeback_sb_inodes, but writeback_sb_inodes has a horrible secret...it's called with the wb->list_lock held. Btrfs raid5/6 and MD raid10 have horrible secrets of their own...they both do allocations in their unplug callbacks. None of the options to fix it are very pretty. We don't want to kick off workers for all of these unplugs, and the lock doesn't look hot enough to justify bigger restructuring. [ 2854.025042] BUG: sleeping function called from invalid context at mm/page_alloc.c:3189 [ 2854.041366] in_atomic(): 1, irqs_disabled(): 0, pid: 145562, name: kworker/u66:15 [ 2854.056813] INFO: lockdep is turned off. [ 2854.064870] CPU: 13 PID: 145562 Comm: kworker/u66:15 Not tainted 4.2.0-mason+ #1 [ 2854.080082] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012 [ 2854.096211] Workqueue: writeback wb_workfn (flush-btrfs-244) [ 2854.107821] ffffffff81a2bbee ffff880ee09a7598 ffffffff813307bb ffff880ee09a7598 [ 2854.123162] ffff881010d1ca00 ffff880ee09a75c8 ffffffff81086615 0000000000000000 [ 2854.138556] 0000000000000000 0000000000000c75 ffffffff81a2bbee ffff880ee09a75f8 [ 2854.153936] Call Trace: [ 2854.181101] [<ffffffff81086722>] __might_sleep+0x52/0x90 [ 2854.192136] [<ffffffff8116d2b4>] __alloc_pages_nodemask+0x344/0xbe0 [ 2854.229682] [<ffffffff811b54aa>] alloc_pages_current+0x10a/0x1e0 [ 2854.255508] [<ffffffffa0663f19>] full_stripe_write+0x59/0xc0 [btrfs] [ 2854.268600] [<ffffffffa0663fb9>] __raid56_parity_write+0x39/0x60 [btrfs] [ 2854.282385] [<ffffffffa06640fb>] run_plug+0x11b/0x140 [btrfs] [ 2854.294259] [<ffffffffa0664143>] btrfs_raid_unplug+0x23/0x70 [btrfs] [ 2854.307334] [<ffffffff81307622>] blk_flush_plug_list+0x82/0x1f0 [ 2854.319542] [<ffffffff813077c4>] blk_finish_plug+0x34/0x50 [ 2854.330878] [<ffffffff812079c2>] writeback_sb_inodes+0x122/0x580 [ 2854.343256] [<ffffffff81208016>] wb_writeback+0x136/0x4e0 Signed-off-by: Chris Mason <clm@fb.com> --- fs/fs-writeback.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index ae0f438..07c9c50 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1539,7 +1539,9 @@ static long writeback_sb_inodes(struct super_block *sb, break; } } + spin_unlock(&wb->list_lock); blk_finish_plug(&plug); + spin_lock(&wb->list_lock); return wrote; } -- 1.8.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
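To make the bug mechanism concrete: the callbacks involved run synchronously out of blk_finish_plug(), and the btrfs raid5/6 and MD raid10 callbacks allocate memory on that path. A purely illustrative sketch - example_unplug() is hypothetical, standing in for btrfs_raid_unplug() and MD raid10's equivalent:

/* runs from blk_flush_plug_list() when the plug is finished; if the
 * caller still holds a spinlock such as wb->list_lock, the sleeping
 * allocation below produces exactly the might_sleep() splat quoted
 * in the changelog above */
static void example_unplug(struct blk_plug_cb *cb, bool from_schedule)
{
        struct page *page;

        page = alloc_page(GFP_NOIO);    /* may sleep under memory pressure */
        if (!page)
                return;                 /* error handling elided */

        /* ... assemble and submit the stripe/parity bio ... */
}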
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() 2015-09-09 15:23 Chris Mason @ 2015-09-11 18:49 ` Jens Axboe 0 siblings, 0 replies; 56+ messages in thread From: Jens Axboe @ 2015-09-11 18:49 UTC (permalink / raw) To: Chris Mason; +Cc: linux-fsdevel, Josef Bacik, Dave Chinner On Wed, Sep 09 2015, Chris Mason wrote: > Commit d353d7587 added a plug/finish_plug pair to writeback_sb_inodes, > but writeback_sb_inodes has a horrible secret...it's called with the > wb->list_lock held. > > Btrfs raid5/6 and MD raid10 have horrible secrets of their own...they > both do allocations in their unplug callbacks. > > None of the options to fix it are very pretty. We don't want to kick > off workers for all of these unplugs, and the lock doesn't look hot > enough to justify bigger restructuring. > > [ 2854.025042] BUG: sleeping function called from invalid context at mm/page_alloc.c:3189 > [ 2854.041366] in_atomic(): 1, irqs_disabled(): 0, pid: 145562, name: kworker/u66:15 > [ 2854.056813] INFO: lockdep is turned off. > [ 2854.064870] CPU: 13 PID: 145562 Comm: kworker/u66:15 Not tainted 4.2.0-mason+ #1 > [ 2854.080082] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012 > [ 2854.096211] Workqueue: writeback wb_workfn (flush-btrfs-244) > [ 2854.107821] ffffffff81a2bbee ffff880ee09a7598 ffffffff813307bb ffff880ee09a7598 > [ 2854.123162] ffff881010d1ca00 ffff880ee09a75c8 ffffffff81086615 0000000000000000 > [ 2854.138556] 0000000000000000 0000000000000c75 ffffffff81a2bbee ffff880ee09a75f8 > [ 2854.153936] Call Trace: > [ 2854.181101] [<ffffffff81086722>] __might_sleep+0x52/0x90 > [ 2854.192136] [<ffffffff8116d2b4>] __alloc_pages_nodemask+0x344/0xbe0 > [ 2854.229682] [<ffffffff811b54aa>] alloc_pages_current+0x10a/0x1e0 > [ 2854.255508] [<ffffffffa0663f19>] full_stripe_write+0x59/0xc0 [btrfs] > [ 2854.268600] [<ffffffffa0663fb9>] __raid56_parity_write+0x39/0x60 [btrfs] > [ 2854.282385] [<ffffffffa06640fb>] run_plug+0x11b/0x140 [btrfs] > [ 2854.294259] [<ffffffffa0664143>] btrfs_raid_unplug+0x23/0x70 [btrfs] > [ 2854.307334] [<ffffffff81307622>] blk_flush_plug_list+0x82/0x1f0 > [ 2854.319542] [<ffffffff813077c4>] blk_finish_plug+0x34/0x50 > [ 2854.330878] [<ffffffff812079c2>] writeback_sb_inodes+0x122/0x580 > [ 2854.343256] [<ffffffff81208016>] wb_writeback+0x136/0x4e0 > > Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Jens Axboe <axboe@fb.com> -- Jens Axboe ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2015-09-29  7:55 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-11 19:37 [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() Chris Mason
2015-09-11 20:02 ` Linus Torvalds
2015-09-11 20:37 ` Linus Torvalds
2015-09-11 20:40 ` Josef Bacik
2015-09-11 21:04 ` Linus Torvalds
2015-09-11 22:06 ` Linus Torvalds
2015-09-11 23:16 ` Chris Mason
2015-09-11 23:36 ` Linus Torvalds
2015-09-12  0:52 ` Linus Torvalds
2015-09-12  2:15 ` Chris Mason
2015-09-12  2:27 ` Linus Torvalds
2015-09-12 23:00 ` Chris Mason
2015-09-12 23:29 ` Linus Torvalds
2015-09-12 23:46 ` Chris Mason
2015-09-13 13:12 ` Chris Mason
2015-09-13 22:56 ` Dave Chinner
2015-09-13 23:12 ` Dave Chinner
2015-09-14 20:06 ` Linus Torvalds
2015-09-16 15:16 ` Chris Mason
2015-09-16 19:58 ` Jan Kara
2015-09-16 20:00 ` Chris Mason
2015-09-16 22:07 ` Dave Chinner
2015-09-17  0:37 ` Dave Chinner
2015-09-17  1:12 ` Linus Torvalds
2015-09-17  2:14 ` Dave Chinner
2015-09-17 19:39 ` Linus Torvalds
2015-09-17 22:42 ` Chris Mason
2015-09-17 23:08 ` Linus Torvalds
2015-09-17 23:56 ` Chris Mason
2015-09-18  0:37 ` Dave Chinner
2015-09-18  1:50 ` Linus Torvalds
2015-09-18  5:40 ` Dave Chinner
2015-09-18  6:04 ` Linus Torvalds
2015-09-18  6:06 ` Linus Torvalds
2015-09-18 14:21 ` Jens Axboe
2015-09-18 13:16 ` Chris Mason
2015-09-18 14:23 ` Jens Axboe
2015-09-18 15:32 ` Linus Torvalds
2015-09-18 15:59 ` Peter Zijlstra
2015-09-18 16:02 ` Peter Zijlstra
2015-09-18 16:12 ` Linus Torvalds
2015-09-28 14:47 ` Peter Zijlstra
2015-09-28 16:08 ` Linus Torvalds
2015-09-29  7:55 ` Ingo Molnar
2015-09-18 22:17 ` Dave Chinner
2015-09-21  9:24 ` Jan Kara
2015-09-21 20:21 ` Andrew Morton
2015-09-17 23:03 ` Dave Chinner
2015-09-17 23:13 ` Linus Torvalds
2015-09-17  3:48 ` Chris Mason
2015-09-17  4:30 ` Dave Chinner
2015-09-17 12:13 ` Chris Mason
2015-09-11 23:06 ` Chris Mason
2015-09-11 23:13 ` Linus Torvalds

-- strict thread matches above, loose matches on Subject: below --
2015-09-09 15:23 Chris Mason
2015-09-11 18:49 ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).