* [PATCH] writeback: Fix broken sync writeback @ 2010-02-12 9:16 Jens Axboe 2010-02-12 15:45 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jens Axboe @ 2010-02-12 9:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel, jack, jengelh, stable, gregkh [-- Attachment #1: Type: text/plain, Size: 2390 bytes --] Hi, There's currently a writeback bug in the 2.6.32 and 2.6.33-rc kernels that prevents proper writeback when sync(1) is being run. Instead of flushing everything older than the sync run, it will do chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and over. This results in very suboptimal writeback for many files; see the initial report from Jan Engelhardt: http://lkml.org/lkml/2010/1/22/382 This fixes it by using the passed-in page writeback count, instead of doing MAX_WRITEBACK_PAGES batches, which gets us much better performance (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) finish properly even when new pages are being dirtied. Thanks to Jan Kara <jack@suse.cz> for spotting this problem! Cc: stable@kernel.org Reported-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> --- I'm sending this out as a patch on the list since I'll be going away skiing for a week very shortly; a mail+patch is easier to discuss here. Greg, I'm attaching a 2.6.32-based patch as well, since this one will not apply to 2.6.32.
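Before the diff, the effect of the change can be illustrated with a minimal userspace sketch (hypothetical model, not kernel code; names and the batch constant are simplified): with the patch, a WB_SYNC_ALL request consumes the whole requested page count in a single pass instead of restarting once per MAX_WRITEBACK_PAGES batch.

```c
/*
 * Minimal userspace model of the wb_writeback() batching change below.
 * Illustrative only -- not kernel code.  Before the patch every pass is
 * capped at MAX_WRITEBACK_PAGES; with it, WB_SYNC_ALL gets the full
 * requested page count in one go.
 */
#include <assert.h>

#define MAX_WRITEBACK_PAGES 1024L
#define WB_SYNC_NONE 0
#define WB_SYNC_ALL  1

/* How many loop passes are needed to write nr_pages, assuming each
 * batch is written in full? */
long writeback_passes(long nr_pages, int sync_mode)
{
	long passes = 0;

	while (nr_pages > 0) {
		long to_write = 0;

		/* mirrors the patched batch-size selection */
		if (sync_mode == WB_SYNC_ALL)
			to_write = nr_pages;
		if (!to_write)
			to_write = MAX_WRITEBACK_PAGES;

		nr_pages -= to_write;	/* batch assumed fully written */
		passes++;
	}
	return passes;
}
```

With a 16384-page request, the unpatched behaviour takes 16 restarts while the patched WB_SYNC_ALL path finishes in one.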
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a7c42c..55aeea9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -749,6 +749,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	}

 	for (;;) {
+		long to_write = 0;
+
 		/*
 		 * Stop writeback when nr_pages has been consumed
 		 */
@@ -762,12 +764,17 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (args->for_background && !over_bground_thresh())
 			break;

+		if (args->sync_mode == WB_SYNC_ALL)
+			to_write = args->nr_pages;
+		if (!to_write)
+			to_write = MAX_WRITEBACK_PAGES;
+
 		wbc.more_io = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = to_write;
 		wbc.pages_skipped = 0;
 		writeback_inodes_wb(wb, &wbc);
-		args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		args->nr_pages -= to_write - wbc.nr_to_write;
+		wrote += to_write - wbc.nr_to_write;

 		/*
 		 * If we consumed everything, see if we have more
@@ -782,7 +789,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < to_write)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to

--
Jens Axboe

[-- Attachment #2: writeback-fix-broken-sync-2.6.32.patch --]
[-- Type: text/x-diff, Size: 2409 bytes --]

commit 057226ca7447880e4e766a82cf32197e492ba963
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Fri Feb 12 10:14:34 2010 +0100

    writeback: Fix broken sync writeback

    There's currently a writeback bug in the 2.6.32 and 2.6.33-rc kernels
    that prevents proper writeback when sync(1) is being run. Instead of
    flushing everything older than the sync run, it will do chunks of
    normal MAX_WRITEBACK_PAGES writeback and restart over and over.

    This results in very suboptimal writeback for many files; see the
    initial report from Jan Engelhardt:

    http://lkml.org/lkml/2010/1/22/382

    This fixes it by using the passed-in page writeback count, instead of
    doing MAX_WRITEBACK_PAGES batches, which gets us much better
    performance (Jan reports it's up from ~400KB/sec to 10MB/sec) and
    makes sync(1) finish properly even when new pages are being dirtied.

    Thanks to Jan Kara <jack@suse.cz> for spotting this problem!

    Cc: stable@kernel.org
    Reported-by: Jan Engelhardt <jengelh@medozas.de>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9d5360c..8a46c67 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -773,6 +773,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	}

 	for (;;) {
+		long to_write = 0;
+
 		/*
 		 * Stop writeback when nr_pages has been consumed
 		 */
@@ -786,13 +788,18 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (args->for_background && !over_bground_thresh())
 			break;

+		if (args->sync_mode == WB_SYNC_ALL)
+			to_write = args->nr_pages;
+		if (!to_write)
+			to_write = MAX_WRITEBACK_PAGES;
+
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = to_write;
 		wbc.pages_skipped = 0;
 		writeback_inodes_wb(wb, &wbc);
-		args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		args->nr_pages -= to_write - wbc.nr_to_write;
+		wrote += to_write - wbc.nr_to_write;

 		/*
 		 * If we consumed everything, see if we have more
@@ -807,7 +814,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < to_write)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to

^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-12 9:16 [PATCH] writeback: Fix broken sync writeback Jens Axboe @ 2010-02-12 15:45 ` Linus Torvalds 2010-02-13 12:58 ` Jan Engelhardt 2010-02-15 14:17 ` [PATCH] writeback: Fix broken sync writeback Jan Kara 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2010-02-12 15:45 UTC (permalink / raw) To: Jens Axboe; +Cc: Linux Kernel, jack, jengelh, stable, gregkh

On Fri, 12 Feb 2010, Jens Axboe wrote:
>
> This fixes it by using the passed in page writeback count, instead of
> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> finish properly even when new pages are being dirted.

This seems broken.

The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't generate a single IO bigger than a couple of MB. And then we have that loop around things to do multiple chunks. Your change to use nr_pages seems to make the whole looping totally pointless, and breaks that "don't do huge hunks" logic.

So I don't think that your patch is correct.

That said, I _do_ believe you when you say it makes a difference, which makes me think there is a bug there. I just don't think you fixed the right bug, and your change just happens to do what you wanted by pure random luck.

The _real_ bug seems to be the one you mentioned, but then ignored:

> Instead of flushing everything older than the sync run, it will do
> chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
> over.

and it would seem that the _logical_ way to fix it would be something like the appended... Hmm?

Even when you do a 'fdatasync()', it's not going to guarantee that any _future_ data is written back, so the 'oldest_jif' thing would seem to be sane regardless of sync mode.

NOTE NOTE NOTE!
I do have to admit that this patch scares me, because there could be some bug in the 'older_than_this' logic that means that somebody sets it even if the inode is already dirty. So this patch makes conceptual sense to me, and I think it's the right thing to do, but I also suspect that we do not actually have a lot of test coverage of the whole 'older_than_this' logic, because it historically has been just a small optimization for kupdated.

So this patch scares me, as it could break 'fdatasync' entirely. So somebody should really double-check the whole 'dirtied_when' logic, just to be safe. If anybody ever sets 'dirtied_when' to the current time even if the inode is already dirty (and has an earlier 'dirtied_when'), then that would open 'fdatasync()' and friends up to not writing things back properly at all (because a newer write set 'dirtied_when' so that old writes get ignored and thought to be 'totally new').

Comments?

		Linus

---
 fs/fs-writeback.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a7c42c..a0a8424 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -738,11 +738,16 @@ static long wb_writeback(struct bdi_writeback *wb,
 	long wrote = 0;
 	struct inode *inode;

-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
+	/*
+	 * We never write back data that was dirtied _after_ we
+	 * started writeback. But kupdate doesn't even want to
+	 * write back recently dirtied stuff, only older data.
+	 */
+	oldest_jif = jiffies-1;
+	wbc.older_than_this = &oldest_jif;
+	if (wbc.for_kupdate)
+		oldest_jif -= msecs_to_jiffies(dirty_expire_interval * 10);
+
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;

^ permalink raw reply related [flat|nested] 39+ messages in thread
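The cutoff the patch above installs can be modelled with a tiny userspace sketch (illustrative names and simplified semantics; the real kernel compares inode->dirtied_when against *wbc->older_than_this with its wrap-safe time_after() macro):

```c
/*
 * Userspace sketch of the 'older_than_this' cutoff in the patch above.
 * Simplified model: an inode is eligible for this writeback pass only
 * if it was dirtied at or before the cutoff, so data dirtied after
 * writeback started can no longer keep a sync spinning forever.
 */
#include <assert.h>

typedef unsigned long jiffies_t;

/* wrap-safe "a is after b", same definition as the kernel's time_after() */
int toy_time_after(jiffies_t a, jiffies_t b)
{
	return (long)(b - a) < 0;
}

int inode_eligible(jiffies_t dirtied_when, jiffies_t oldest_jif)
{
	/* skip inodes dirtied after the cutoff */
	return !toy_time_after(dirtied_when, oldest_jif);
}
```

With oldest_jif set to "one tick before writeback started", anything dirtied from the start of the pass onward is skipped, which is exactly the livelock-avoidance property being discussed.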
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-12 15:45 ` Linus Torvalds @ 2010-02-13 12:58 ` Jan Engelhardt 2010-02-15 14:49 ` Jan Kara 2010-02-15 14:17 ` [PATCH] writeback: Fix broken sync writeback Jan Kara 1 sibling, 1 reply; 39+ messages in thread From: Jan Engelhardt @ 2010-02-13 12:58 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jens Axboe, Linux Kernel, jack, stable, gregkh On Friday 2010-02-12 16:45, Linus Torvalds wrote: >On Fri, 12 Feb 2010, Jens Axboe wrote: >> >> This fixes it by using the passed in page writeback count, instead of >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) >> finish properly even when new pages are being dirted. > >This seems broken. It seems so. Jens, Jan Kara, your patch does not entirely fix this. While there is no sync/fsync to be seen in these traces, I can tell there's a livelock, without Dirty decreasing at all. INFO: task flush-8:0:354 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
flush-8:0     D 00000000005691a0  5560   354      2 0x28000000000
Call Trace:
 [0000000000568de4] start_this_handle+0x36c/0x520
 [00000000005691a0] jbd2_journal_start+0xb4/0xe0
 [000000000054f99c] ext4_journal_start_sb+0x54/0x9c
 [0000000000540de0] ext4_da_writepages+0x1e0/0x460
 [00000000004ab3e4] do_writepages+0x28/0x4c
 [00000000004f825c] writeback_single_inode+0xf0/0x330
 [00000000004f90a8] writeback_inodes_wb+0x4e4/0x600
 [00000000004f9354] wb_writeback+0x190/0x20c
 [00000000004f967c] wb_do_writeback+0x1d4/0x1f0
 [00000000004f96c0] bdi_writeback_task+0x28/0xa0
 [00000000004b71dc] bdi_start_fn+0x64/0xc8
 [00000000004786f0] kthread+0x58/0x6c
 [000000000042ae7c] kernel_thread+0x30/0x48
 [0000000000478644] kthreadd+0xb8/0x10c
1 lock held by flush-8:0/354:
 #0:  (&type->s_umount_key#18){......}, at: [<00000000004f8fc8>]
#               writeback_inodes_wb+0x404/0x600
INFO: task jbd2/sda6-8:588 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda6-8   D 000000000056eea0  7272   588      2 0x18000000000
Call Trace:
 [000000000056976c] jbd2_journal_commit_transaction+0x218/0x15e0
 [000000000056eea0] kjournald2+0x138/0x2fc
 [00000000004786f0] kthread+0x58/0x6c
 [000000000042ae7c] kernel_thread+0x30/0x48
 [0000000000478644] kthreadd+0xb8/0x10c
no locks held by jbd2/sda6-8/588.
INFO: task rpmbuild:6580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rpmbuild      D 00000000005691a0    16  6580   6562 0x310061101000080
Call Trace:
 [0000000000568de4] start_this_handle+0x36c/0x520
 [00000000005691a0] jbd2_journal_start+0xb4/0xe0
 [000000000054f99c] ext4_journal_start_sb+0x54/0x9c
 [000000000053dcd4] ext4_dirty_inode+0x8/0x3c
 [00000000004f8888] __mark_inode_dirty+0x20/0x15c
 [00000000004eeb34] file_update_time+0x104/0x124
 [00000000004a50d4] __generic_file_aio_write+0x264/0x36c
 [00000000004a5228] generic_file_aio_write+0x4c/0xa4
 [00000000005390c0] ext4_file_write+0x94/0xa4
 [00000000004dc12c] do_sync_write+0x84/0xd4
 [00000000004dca78] vfs_write+0x70/0x12c
 [00000000004dcbcc] SyS_write+0x2c/0x5c
 [0000000000406214] linux_sparc_syscall32+0x34/0x40
1 lock held by rpmbuild/6580:
 #0:  (&sb->s_type->i_mutex_key#9){......}, at: [<00000000004a5214>]
#               generic_file_aio_write+0x38/0xa4
etc.

>
>The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't
>generate a single IO bigger than a couple of MB. And then we have that
>loop around things to do multiple chunks. Your change to use nr_pages
>seems to make the whole looping totally pointless, and breaks that "don't
>do huge hunks" logic.
>
>So I don't think that your patch is correct.
>
>That said, I _do_ believe you when you say it makes a difference, which
>makes me think there is a bug there. I just don't think you fixed the
>right bug, and your change just happens to do what you wanted by pure
>random luck.
>
>The _real_ bug seems to bethe one you mentioned, but then ignored:
>
>> Instead of flushing everything older than the sync run, it will do
>> chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
>> over.
>
>and it would seem that the _logical_ way to fix it would be something like
>the appended...
>
>Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that
>any _future_ data is written back, so the 'oldest_jif' thing would seem to
>be sane regardless of sync mode.
>
> NOTE NOTE NOTE!
> I do have to admit that this patch scares me, because
> there could be some bug in the 'older_than_this' logic that means that
> somebody sets it even if the inode is already dirty. So this patch makes
> conceptual sense to me, and I think it's the right thing to do, but I
> also suspect that we do not actually have a lot of test coverage of the
> whole 'older_than_this' logic, because it historically has been just a
> small optimization for kupdated.
>
> So this patch scares me, as it could break 'fdatasync' entirely. So
> somebody should really double-check the whole 'dirtied_when' logic, just
> to be safe. If anybody ever sets 'dirtied_when' to the current time even
> if the inode is already dirty (and has an earlier dirtied_when'), then
> that would open up 'fdatasync()' and friends up to not writing things
> back properly at all (because a newer write set 'dirtied_when' so that
> old writes get ignored and thought to be 'totally new')
>
>Comments?
>
> Linus
>
>---
> fs/fs-writeback.c | 15 ++++++++++-----
> 1 files changed, 10 insertions(+), 5 deletions(-)
>
>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>index 1a7c42c..a0a8424 100644
>--- a/fs/fs-writeback.c
>+++ b/fs/fs-writeback.c
>@@ -738,11 +738,16 @@ static long wb_writeback(struct bdi_writeback *wb,
> long wrote = 0;
> struct inode *inode;
>
>- if (wbc.for_kupdate) {
>- wbc.older_than_this = &oldest_jif;
>- oldest_jif = jiffies -
>- msecs_to_jiffies(dirty_expire_interval * 10);
>- }
>+ /*
>+ * We never write back data that was dirtied _after_ we
>+ * started writeback. But kupdate doesn't even want to
>+ * write back recently dirtied stuff, only older data.
>+ */
>+ oldest_jif = jiffies-1;
>+ wbc.older_than_this = &oldest_jif;
>+ if (wbc.for_kupdate)
>+ oldest_jif -= msecs_to_jiffies(dirty_expire_interval * 10);
>+
> if (!wbc.range_cyclic) {
> wbc.range_start = 0;
> wbc.range_end = LLONG_MAX;
>
 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-13 12:58 ` Jan Engelhardt @ 2010-02-15 14:49 ` Jan Kara 2010-02-15 15:41 ` Jan Engelhardt 0 siblings, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-02-15 14:49 UTC (permalink / raw) To: Jan Engelhardt Cc: Linus Torvalds, Jens Axboe, Linux Kernel, jack, stable, gregkh On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > > On Friday 2010-02-12 16:45, Linus Torvalds wrote: > >On Fri, 12 Feb 2010, Jens Axboe wrote: > >> > >> This fixes it by using the passed in page writeback count, instead of > >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > >> finish properly even when new pages are being dirted. > > > >This seems broken. > > It seems so. Jens, Jan Kara, your patch does not entirely fix this. > While there is no sync/fsync to be seen in these traces, I can > tell there's a livelock, without Dirty decreasing at all. I don't think this is directly connected with my / Jens' patch. Similar traces happen even without the patch (see e.g. http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch makes it worse... So are you able to reproduce these warnings and without the patch they did not happen? Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0? > INFO: task flush-8:0:354 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> flush-8:0 D 00000000005691a0 5560 354 2 0x28000000000 > Call Trace: > [0000000000568de4] start_this_handle+0x36c/0x520 > [00000000005691a0] jbd2_journal_start+0xb4/0xe0 > [000000000054f99c] ext4_journal_start_sb+0x54/0x9c > [0000000000540de0] ext4_da_writepages+0x1e0/0x460 > [00000000004ab3e4] do_writepages+0x28/0x4c > [00000000004f825c] writeback_single_inode+0xf0/0x330 > [00000000004f90a8] writeback_inodes_wb+0x4e4/0x600 > [00000000004f9354] wb_writeback+0x190/0x20c > [00000000004f967c] wb_do_writeback+0x1d4/0x1f0 > [00000000004f96c0] bdi_writeback_task+0x28/0xa0 > [00000000004b71dc] bdi_start_fn+0x64/0xc8 > [00000000004786f0] kthread+0x58/0x6c > [000000000042ae7c] kernel_thread+0x30/0x48 > [0000000000478644] kthreadd+0xb8/0x10c > 1 lock held by flush-8:0/354: > #0: (&type->s_umount_key#18){......}, at: [<00000000004f8fc8>] > # writeback_inodes_wb+0x404/0x600 > INFO: task jbd2/sda6-8:588 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > jbd2/sda6-8 D 000000000056eea0 7272 588 2 0x18000000000 > Call Trace: > [000000000056976c] jbd2_journal_commit_transaction+0x218/0x15e0 > [000000000056eea0] kjournald2+0x138/0x2fc > [00000000004786f0] kthread+0x58/0x6c > [000000000042ae7c] kernel_thread+0x30/0x48 > [0000000000478644] kthreadd+0xb8/0x10c > no locks held by jbd2/sda6-8/588. > INFO: task rpmbuild:6580 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> rpmbuild D 00000000005691a0 16 6580 6562 0x310061101000080 > Call Trace: > [0000000000568de4] start_this_handle+0x36c/0x520 > [00000000005691a0] jbd2_journal_start+0xb4/0xe0 > [000000000054f99c] ext4_journal_start_sb+0x54/0x9c > [000000000053dcd4] ext4_dirty_inode+0x8/0x3c > [00000000004f8888] __mark_inode_dirty+0x20/0x15c > [00000000004eeb34] file_update_time+0x104/0x124 > [00000000004a50d4] __generic_file_aio_write+0x264/0x36c > [00000000004a5228] generic_file_aio_write+0x4c/0xa4 > [00000000005390c0] ext4_file_write+0x94/0xa4 > [00000000004dc12c] do_sync_write+0x84/0xd4 > [00000000004dca78] vfs_write+0x70/0x12c > [00000000004dcbcc] SyS_write+0x2c/0x5c > [0000000000406214] linux_sparc_syscall32+0x34/0x40 > 1 lock held by rpmbuild/6580: > #0: (&sb->s_type->i_mutex_key#9){......}, at: [<00000000004a5214>] > # generic_file_aio_write+0x38/0xa4 > etc. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-15 14:49 ` Jan Kara @ 2010-02-15 15:41 ` Jan Engelhardt 2010-02-15 15:58 ` Jan Kara 2010-06-27 16:44 ` Jan Engelhardt 0 siblings, 2 replies; 39+ messages in thread From: Jan Engelhardt @ 2010-02-15 15:41 UTC (permalink / raw) To: Jan Kara; +Cc: Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh

On Monday 2010-02-15 15:49, Jan Kara wrote:
>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
>> >> This fixes it by using the passed in page writeback count, instead of
>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
>> >> finish properly even when new pages are being dirted.
>> >
>> >This seems broken.
>>
>> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
>> While there is no sync/fsync to be seen in these traces, I can
>> tell there's a livelock, without Dirty decreasing at all.
>
> I don't think this is directly connected with my / Jens' patch.

I start to think so too.

>Similar traces happen even without the patch (see e.g.
>http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch
>makes it worse... So are you able to reproduce these warnings and
>without the patch they did not happen?

Your patch speeds up the slow sync; without the patch, there was no real chance to observe the hard lockup, as the slow sync would take up all the time.

So far, no reproduction. It seems to be just as you say.

> Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0?

0000000000569554 <jbd2_journal_commit_transaction>:
  56976c:	40 04 ee 62	call  6a50f4 <schedule>

Since there is an obvious schedule() call in jbd2_journal_commit_transaction's C code, I think that's where it is.

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-15 15:41 ` Jan Engelhardt @ 2010-02-15 15:58 ` Jan Kara 2010-06-27 16:44 ` Jan Engelhardt 1 sibling, 0 replies; 39+ messages in thread From: Jan Kara @ 2010-02-15 15:58 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Mon 15-02-10 16:41:17, Jan Engelhardt wrote: > > On Monday 2010-02-15 15:49, Jan Kara wrote: > >On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > >> >> > >> >> This fixes it by using the passed in page writeback count, instead of > >> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > >> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > >> >> finish properly even when new pages are being dirted. > > >> It seems so. Jens, Jan Kara, your patch does not entirely fix this. > >> While there is no sync/fsync to be seen in these traces, I can > >> tell there's a livelock, without Dirty decreasing at all. > > > > I don't think this is directly connected with my / Jens' patch. > > I start to think so too. > > >Similar traces happen even without the patch (see e.g. > >http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch > >makes it worse... So are you able to reproduce these warnings and > >without the patch they did not happen? > > Your patch speeds up the slow sync; without the patch, there was > no real chance to observ the hard lockup, as the slow sync would > take up all time. > > So far, no reproduction. It seems to be just as you say. > > > Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0? > > 0000000000569554 <jbd2_journal_commit_transaction>: > 56976c: 40 04 ee 62 call 6a50f4 <schedule> > > Since there is an obvious schedule() call in jbd2_journal_commit_transaction's > C code, I think that's where it is. OK. Thanks. 
It seems some process is spending excessive time with a transaction open (jbd2_journal_commit_transaction waits for all handles of a transaction to be dropped). If you see the traces again, try to obtain stack traces of all the other processes and maybe we can catch the process and see whether it's doing something unexpected. The patch can have an influence on this because we now pass larger nr_to_write to ext4_writepages so maybe that makes some corner case more likely. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-15 15:41 ` Jan Engelhardt 2010-02-15 15:58 ` Jan Kara @ 2010-06-27 16:44 ` Jan Engelhardt 2010-10-24 23:41 ` Sync writeback still broken Jan Engelhardt 1 sibling, 1 reply; 39+ messages in thread From: Jan Engelhardt @ 2010-06-27 16:44 UTC (permalink / raw) To: Jan Kara; +Cc: Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh Hi Jan Kara, On Monday 2010-02-15 16:41, Jan Engelhardt wrote: >On Monday 2010-02-15 15:49, Jan Kara wrote: >>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: >>> >> >>> >> This fixes it by using the passed in page writeback count, instead of >>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance >>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) >>> >> finish properly even when new pages are being dirted. >>> > >>> >This seems broken. >>> >>> It seems so. Jens, Jan Kara, your patch does not entirely fix this. >>> While there is no sync/fsync to be seen in these traces, I can >>> tell there's a livelock, without Dirty decreasing at all. What ultimately became of the discussion and/or the patch? Your original ad-hoc patch certainly still does its job; had no need to reboot in 86 days and still counting. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Sync writeback still broken 2010-06-27 16:44 ` Jan Engelhardt @ 2010-10-24 23:41 ` Jan Engelhardt 2010-10-30 0:57 ` Linus Torvalds 2010-10-31 12:24 ` Jan Kara 0 siblings, 2 replies; 39+ messages in thread From: Jan Engelhardt @ 2010-10-24 23:41 UTC (permalink / raw) To: Jan Kara, Andrew Morton Cc: Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh

On Sunday 2010-06-27 18:44, Jan Engelhardt wrote:
>On Monday 2010-02-15 16:41, Jan Engelhardt wrote:
>>On Monday 2010-02-15 15:49, Jan Kara wrote:
>>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
>>>> >>
>>>> >> This fixes it by using the passed in page writeback count, instead of
>>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
>>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
>>>> >> finish properly even when new pages are being dirted.
>>>> >
>>>> >This seems broken.
>>>>
>>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
>>>> While there is no sync/fsync to be seen in these traces, I can
>>>> tell there's a livelock, without Dirty decreasing at all.
>
>What ultimately became of the discussion and/or the patch?
>
>Your original ad-hoc patch certainly still does its job; had no need to
>reboot in 86 days and still counting.

I still observe this behavior on 2.6.36-rc8. This is starting to get frustrating, so I will be happily following akpm's advice to poke people.

Thread entrypoint: http://lkml.org/lkml/2010/2/12/41

Previously, many concurrent extractions of tarballs and so on have been one way to trigger the issue; I now also have a rather small testcase (below) that freezes the box here (which has 24G RAM, so even though I am not calling msync, I should be fine) sometime after memset finishes.
----
/* calculate all possible 32-bit hashes
   needs 16G of address space, so better have a 64-bit kernel at hand
*/
#define _GNU_SOURCE 1
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#define S_IWUGO (S_IWUSR | S_IWGRP | S_IWOTH)

#define jrot(x,k) (((x) << (k)) | ((x) >> (32 - (k))))

/* jhash_mix - mix 3 32-bit values reversibly. */
#define jhash_mix(a, b, c) { \
	a -= c; a ^= jrot(c, 4);  c += b; \
	b -= a; b ^= jrot(a, 6);  a += c; \
	c -= b; c ^= jrot(b, 8);  b += a; \
	a -= c; a ^= jrot(c, 16); c += b; \
	b -= a; b ^= jrot(a, 19); a += c; \
	c -= b; c ^= jrot(b, 4);  b += a; \
}

#define jhash_final(a, b, c) { \
	c ^= b; c -= jrot(b, 14); \
	a ^= c; a -= jrot(c, 11); \
	b ^= a; b -= jrot(a, 25); \
	c ^= b; c -= jrot(b, 16); \
	a ^= c; a -= jrot(c, 4);  \
	b ^= a; b -= jrot(a, 14); \
	c ^= b; c -= jrot(b, 24); \
}

static uint32_t hash_jlookup3(const void *vkey, size_t length)
{
	static const unsigned int JHASH_GOLDEN_RATIO = 0x9e3779b9;
	const uint8_t *key = vkey;
	uint32_t a, b, c;

	a = b = c = JHASH_GOLDEN_RATIO + length;
	/* All but the last block: affect some 32 bits of (a,b,c) */
	for (; length > 12; length -= 12, key += 12) {
		a += key[0] + ((uint32_t)key[1] << 8) +
		     ((uint32_t)key[2] << 16) + ((uint32_t)key[3] << 24);
		b += key[4] + ((uint32_t)key[5] << 8) +
		     ((uint32_t)key[6] << 16) + ((uint32_t)key[7] << 24);
		c += key[8] + ((uint32_t)key[9] << 8) +
		     ((uint32_t)key[10] << 16) + ((uint32_t)key[11] << 24);
		jhash_mix(a, b, c);
	}

	switch (length) {
	case 12: c += ((uint32_t)key[11]) << 24;
	case 11: c += ((uint32_t)key[10]) << 16;
	case 10: c += ((uint32_t)key[9]) << 8;
	case 9:  c += key[8];
	case 8:  b += ((uint32_t)key[7]) << 24;
	case 7:  b += ((uint32_t)key[6]) << 16;
	case 6:  b += ((uint32_t)key[5]) << 8;
	case 5:  b += key[4];
	case 4:  a += ((uint32_t)key[3]) << 24;
	case 3:  a += ((uint32_t)key[2]) << 16;
	case 2:  a += ((uint32_t)key[1]) << 8;
	case 1:  a += key[0];
		break;
	case 0:  return c;
	}

	jhash_final(a,b,c);
	return c;
}

static uint32_t *freq;
static const unsigned long long freq_size = 0x100000000UL * sizeof(*freq);

static void map_freq(void)
{
	int fd;

	fd = open("jenkins3.frq", O_RDWR | O_CREAT, S_IRUGO | S_IWUGO);
	if (fd < 0) {
		perror("open");
		abort();
	}
	if (ftruncate(fd, freq_size) < 0) {
		perror("ftruncate");
		abort();
	}
	freq = mmap(NULL, freq_size, PROT_READ | PROT_WRITE,
	            MAP_SHARED, fd, 0);
	if (freq == MAP_FAILED) {
		perror("mmap");
		abort();
	}
}

static inline void calc_all_hashes(void)
{
	uint32_t x, y;

	memset(freq, 0, freq_size);
	for (x = 0; x < UINT32_MAX; ++x) {
		if ((x & 0xFFFFF) == 0)
			fprintf(stderr, "\r\e[2K""fill: %08x", x);
		y = hash_jlookup3(&x, sizeof(x));
		if (freq[y] < UINT32_MAX)
			++freq[y];
	}
}

int main(void)
{
	map_freq();
	calc_all_hashes();
	return 0;
}

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-24 23:41 ` Sync writeback still broken Jan Engelhardt @ 2010-10-30 0:57 ` Linus Torvalds 2010-10-30 1:16 ` Linus Torvalds 2010-10-31 12:24 ` Jan Kara 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2010-10-30 0:57 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linux Kernel, stable, Greg KH, Jens Axboe Guys, what is the status of this? The original patch in that email thread still makes no sense and the commit log for it cannot be the real issue. But the _problem_ seems to be real, and the code is apparently a total mess, still. And the chunking is necessary - as even quoted in that whole thread: On Tue, Feb 23, 2010 at 01:53:50PM +1100, Dave Chinner wrote: > > Ignoring nr_to_write completely can lead to issues like we used to > have with XFS - it would write an entire extent (8GB) at a time and > starve all other writeback. Those starvation problems - which were > very obvious on NFS servers - went away when we trimmed back the > amount to write in a single pass to saner amounts... so we can't just stay with one single inode and do that one completely. At the same time, the VFS chunking code itself is at least supposed to try to write out 4MB at a time, which means that the whole "only 400kB/s throughput" thing is pretty damn unlikely - but if it's true, then that obviously means that the chunking is somehow broken. IOW, we haven't seemed to get anywhere, and I haven't seen anybody reply to Jan's plaintive email. Anybody? Linus On Sun, Oct 24, 2010 at 4:41 PM, Jan Engelhardt <jengelh@medozas.de> wrote: >> >>What ultimately became of the discussion and/or the patch? >> >>Your original ad-hoc patch certainly still does its job; had no need to >>reboot in 86 days and still counting. > > I still observe this behavior on 2.6.36-rc8. This is starting to > get frustrating, so I will be happily following akpm's advise to > poke people. 
>
> Thread entrypoint: http://lkml.org/lkml/2010/2/12/41
>
> Previously, many concurrent extractions of tarballs and so on have been
> one way to trigger the issue; I now also have a rather small testcase
> (below) that freezes the box here (which has 24G RAM, so even if I'm
> lacking to call msync, I should be fine) sometime after memset finishes.
>
> ----
> [testcase snipped -- quoted in full in the parent message]

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken
  2010-10-30  0:57 ` Linus Torvalds
@ 2010-10-30  1:16   ` Linus Torvalds
  2010-10-30  1:30     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Linus Torvalds @ 2010-10-30  1:16 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Jan Kara, Andrew Morton, Linux Kernel, stable, Greg KH, Jens Axboe

On Fri, Oct 29, 2010 at 5:57 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Guys, what is the status of this?
>
> The original patch in that email thread still makes no sense and the
> commit log for it cannot be the real issue. But the _problem_ seems to
> be real, and the code is apparently a total mess, still.

Btw, is the problem just that insane WB_SYNC_ALL thing?

The problem with WB_SYNC_ALL seems to be that it synchronously writes
out one inode at a time. And it's not just the data, it's the inode
itself.

So rather than write out all pages for _all_ inodes, and then wait for
them, and write out _all_ metadata, and then wait for that, it seems
like the WB_SYNC_ALL code does the truly stupid thing, which is to
"write out some data for one inode, then _synchronously_ wait for
that, then write out the metadata for that single inode, then
_synchronously_ wait for THAT", and then rinse and repeat for each
inode.

The sane thing for "WB_SYNC_ALL" would be to:
 - for_each_inode: write out all data (no waiting)
 - for_each_inode: wait for the data for that inode, write out the inode
 - for_each_inode: wait for the inode

so that you avoid the whole synchronous wait thing, and can do all
inodes in one go.

I dunno. Who even uses WB_SYNC_ALL? It's just "sync()" itself, isn't
it? And "umount()", I guess. I didn't actually look at the code.

                      Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-30 1:16 ` Linus Torvalds @ 2010-10-30 1:30 ` Linus Torvalds 2010-10-30 3:18 ` Andrew Morton 2010-10-30 13:15 ` Christoph Hellwig 2 siblings, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2010-10-30 1:30 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linux Kernel, stable, Greg KH, Jens Axboe On Fri, Oct 29, 2010 at 6:16 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I dunno. Who even uses WB_SYNC_ALL? It's just "sync()" itself, isn't > it? And "umount()", I guess. I didn't actually look at the code. Hmm. The test-case Jan provided doesn't seem to do sync() or anything, so either this is different from his original email issues, or there's some external "sync()" going on unmentioned. Or something else entirely? Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-30 1:16 ` Linus Torvalds 2010-10-30 1:30 ` Linus Torvalds @ 2010-10-30 3:18 ` Andrew Morton 2010-10-30 13:15 ` Christoph Hellwig 2 siblings, 0 replies; 39+ messages in thread From: Andrew Morton @ 2010-10-30 3:18 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Engelhardt, Jan Kara, Linux Kernel, stable, Greg KH, Jens Axboe On Fri, 29 Oct 2010 18:16:48 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Oct 29, 2010 at 5:57 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > Guys, what is the status of this? > > > > The original patch in that email thread still makes no sense and the > > commit log for it cannot be the real issue. But the _problem_ seems to > > be real, and the code is apparently a total mess, still. > > Btw, is the problem just that insane WB_SYNC_ALL thing? > > The problem with WB_SYNC_ALL seems to be that it synchrnously writes > out one inode at a time. And it's not just the data, it's the inode > itself. > > So rather than write out all pages for _all_ inodes, and then wait for > them, and write out _all_ metadata, and then wait for that, it seems > like the WB_SYNC_ALL code does the truly stupid thing, which is to > "write out some data for one inode, then _synchronously_ wait for > that, then write out the metadata for that single inode, then > _synchronously_ wait for THAT" and then rinse and repeat for each > inode. > > The sane thing for "WB_SYNC_ALL" would be to: > - for_each_inode: write out all data (no waiting) > - for_each_inode: wait for the data for that inode, write out the inode > - for_each_inode: wait for the inode > > so that you avoid the whole synchronous wait thing, and can do all > inodes in one go. The way I originally implemented all this was that the top level would do two passes. The first was as async as possible, to get lots of IO underway against lots of devices. 
Then the second pass was WB_SYNC_ALL to get any straggler IO started and
then to wait on everything.

eg (2.6.20):

static void do_sync(unsigned long wait)
{
	wakeup_pdflush(0);
	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
	DQUOT_SYNC(NULL);
	sync_supers();		/* Write the superblocks */
	sync_filesystems(0);	/* Start syncing the filesystems */
	sync_filesystems(wait);	/* Waitingly sync the filesystems */
	sync_inodes(wait);	/* Mappings, inodes and blockdevs, again. */
	if (!wait)
		printk("Emergency Sync complete\n");
	if (unlikely(laptop_mode))
		laptop_sync_completion();
}

note the two sync_inodes() calls and the two sync_filesystems() calls.

This used to work reasonably well. Maybe it got broken, who knows.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-30 1:16 ` Linus Torvalds 2010-10-30 1:30 ` Linus Torvalds 2010-10-30 3:18 ` Andrew Morton @ 2010-10-30 13:15 ` Christoph Hellwig 2 siblings, 0 replies; 39+ messages in thread From: Christoph Hellwig @ 2010-10-30 13:15 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Engelhardt, Jan Kara, Andrew Morton, Linux Kernel, stable, Greg KH, Jens Axboe On Fri, Oct 29, 2010 at 06:16:48PM -0700, Linus Torvalds wrote: > Btw, is the problem just that insane WB_SYNC_ALL thing? > > The problem with WB_SYNC_ALL seems to be that it synchrnously writes > out one inode at a time. And it's not just the data, it's the inode > itself. > > So rather than write out all pages for _all_ inodes, and then wait for > them, and write out _all_ metadata, and then wait for that, it seems > like the WB_SYNC_ALL code does the truly stupid thing, which is to > "write out some data for one inode, then _synchronously_ wait for > that, then write out the metadata for that single inode, then > _synchronously_ wait for THAT" and then rinse and repeat for each > inode. > > The sane thing for "WB_SYNC_ALL" would be to: > - for_each_inode: write out all data (no waiting) > - for_each_inode: wait for the data for that inode, write out the inode > - for_each_inode: wait for the inode What we do currently at a high level is: for_each_inode: write data (no waiting) for_each_inode: write metadata (no waiting) for_each_inode: write and wait on data for_each_inode: write and wait on metadata except that as pointed out in the earlier thread we switch around inodes in a really dumb way. It's still not quite as efficient as your version above. I'd like to get to something similar to your by splitting the data and metadata list, but for now we have a fairly nice workaround in XFS for the inefficient metadata writes. 
We basically do not start any real I/O at all in ->write_inode but only
put the inode changes in the log, and do a single force of the log
contents at the end of the sync process. Dimitri has implemented
something very similar for ext4 as well recently.

Together with fixing the behaviour of randomly switching between the
inodes, that should get us to a point where we are doing pretty well.
I'd still like to see the split data/metadata dirty tracking and
writeout in common code eventually.

>
> so that you avoid the whole synchronous wait thing, and can do all
> inodes in one go.
>
> I dunno. Who even uses WB_SYNC_ALL? It's just "sync()" itself, isn't
> it? And "umount()", I guess. I didn't actually look at the code.

It's sync/umount as far as the writeback threads are concerned; we of
course use the flag for synchronous writeout of single inodes, but the
issue does not apply there.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-24 23:41 ` Sync writeback still broken Jan Engelhardt 2010-10-30 0:57 ` Linus Torvalds @ 2010-10-31 12:24 ` Jan Kara 2010-10-31 22:40 ` Jan Kara 1 sibling, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-10-31 12:24 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Mon 25-10-10 01:41:48, Jan Engelhardt wrote: > On Sunday 2010-06-27 18:44, Jan Engelhardt wrote: > >On Monday 2010-02-15 16:41, Jan Engelhardt wrote: > >>On Monday 2010-02-15 15:49, Jan Kara wrote: > >>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > >>>> >> > >>>> >> This fixes it by using the passed in page writeback count, instead of > >>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > >>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > >>>> >> finish properly even when new pages are being dirted. > >>>> > > >>>> >This seems broken. > >>>> > >>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this. > >>>> While there is no sync/fsync to be seen in these traces, I can > >>>> tell there's a livelock, without Dirty decreasing at all. > > > >What ultimately became of the discussion and/or the patch? > > > >Your original ad-hoc patch certainly still does its job; had no need to > >reboot in 86 days and still counting. > > I still observe this behavior on 2.6.36-rc8. This is starting to > get frustrating, so I will be happily following akpm's advise to > poke people. Yes, that's a good way :) > Thread entrypoint: http://lkml.org/lkml/2010/2/12/41 > > Previously, many concurrent extractions of tarballs and so on have been > one way to trigger the issue; I now also have a rather small testcase > (below) that freezes the box here (which has 24G RAM, so even if I'm > lacking to call msync, I should be fine) sometime after memset finishes. I've tried your test but didn't succeed in freezing my laptop. 
Everything was running smooth, the machine even felt reasonably responsive although constantly reading and writing to disk. Also sync(1) finished in a couple of seconds as one would expect in an optimistic case. Needless to say that my laptop has only 1G of ram so I had to downsize the hash table from 16G to 1G to be able to run the test and the disk is Intel SSD so the performance of the backing storage compared to the amount of needed IO is much in my favor. OK, so I've taken a machine with standard rotational drive and 28GB of ram and there I can see sync(1) hanging (but otherwise the machine looks OK). Investigating further... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-31 12:24 ` Jan Kara @ 2010-10-31 22:40 ` Jan Kara 2010-11-05 21:33 ` Jan Kara 0 siblings, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-10-31 22:40 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Sun 31-10-10 13:24:37, Jan Kara wrote: > On Mon 25-10-10 01:41:48, Jan Engelhardt wrote: > > On Sunday 2010-06-27 18:44, Jan Engelhardt wrote: > > >On Monday 2010-02-15 16:41, Jan Engelhardt wrote: > > >>On Monday 2010-02-15 15:49, Jan Kara wrote: > > >>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > > >>>> >> > > >>>> >> This fixes it by using the passed in page writeback count, instead of > > >>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > > >>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > > >>>> >> finish properly even when new pages are being dirted. > > >>>> > > > >>>> >This seems broken. > > >>>> > > >>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this. > > >>>> While there is no sync/fsync to be seen in these traces, I can > > >>>> tell there's a livelock, without Dirty decreasing at all. > > > > > >What ultimately became of the discussion and/or the patch? > > > > > >Your original ad-hoc patch certainly still does its job; had no need to > > >reboot in 86 days and still counting. > > > > I still observe this behavior on 2.6.36-rc8. This is starting to > > get frustrating, so I will be happily following akpm's advise to > > poke people. > Yes, that's a good way :) > > > Thread entrypoint: http://lkml.org/lkml/2010/2/12/41 > > > > Previously, many concurrent extractions of tarballs and so on have been > > one way to trigger the issue; I now also have a rather small testcase > > (below) that freezes the box here (which has 24G RAM, so even if I'm > > lacking to call msync, I should be fine) sometime after memset finishes. 
> I've tried your test but didn't succeed in freezing my laptop.
> Everything was running smooth, the machine even felt reasonably responsive
> although constantly reading and writing to disk. Also sync(1) finished in a
> couple of seconds as one would expect in an optimistic case.
> Needless to say that my laptop has only 1G of ram so I had to downsize
> the hash table from 16G to 1G to be able to run the test and the disk is
> Intel SSD so the performance of the backing storage compared to the amount
> of needed IO is much in my favor.
> OK, so I've taken a machine with standard rotational drive and 28GB of
> ram and there I can see sync(1) hanging (but otherwise the machine looks
> OK). Investigating further...

So with the writeback tracing, I verified that indeed the trouble is that
work queued by sync(1) gets queued behind the background writeback which is
just running. And background writeback won't stop because your process is
dirtying pages so aggressively. Actually, it would stop after writing
LONG_MAX pages, but that's effectively infinity. I have a patch
(e.g. http://www.kerneltrap.com/mailarchive/linux-fsdevel/2010/8/3/6886244)
to stop background writeback when other work is queued, but it's kind of
hacky, so I can see why Christoph doesn't like it ;) So I'll have to code
something different to fix this issue...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-31 22:40 ` Jan Kara @ 2010-11-05 21:33 ` Jan Kara 2010-11-05 21:34 ` Jan Kara 2010-11-05 22:03 ` Jan Engelhardt 0 siblings, 2 replies; 39+ messages in thread From: Jan Kara @ 2010-11-05 21:33 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh [-- Attachment #1: Type: text/plain, Size: 3842 bytes --] On Sun 31-10-10 23:40:12, Jan Kara wrote: > On Sun 31-10-10 13:24:37, Jan Kara wrote: > > On Mon 25-10-10 01:41:48, Jan Engelhardt wrote: > > > On Sunday 2010-06-27 18:44, Jan Engelhardt wrote: > > > >On Monday 2010-02-15 16:41, Jan Engelhardt wrote: > > > >>On Monday 2010-02-15 15:49, Jan Kara wrote: > > > >>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > > > >>>> >> > > > >>>> >> This fixes it by using the passed in page writeback count, instead of > > > >>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > > > >>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > > > >>>> >> finish properly even when new pages are being dirted. > > > >>>> > > > > >>>> >This seems broken. > > > >>>> > > > >>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this. > > > >>>> While there is no sync/fsync to be seen in these traces, I can > > > >>>> tell there's a livelock, without Dirty decreasing at all. > > > > > > > >What ultimately became of the discussion and/or the patch? > > > > > > > >Your original ad-hoc patch certainly still does its job; had no need to > > > >reboot in 86 days and still counting. > > > > > > I still observe this behavior on 2.6.36-rc8. This is starting to > > > get frustrating, so I will be happily following akpm's advise to > > > poke people. 
> > Yes, that's a good way :) > > > > > Thread entrypoint: http://lkml.org/lkml/2010/2/12/41 > > > > > > Previously, many concurrent extractions of tarballs and so on have been > > > one way to trigger the issue; I now also have a rather small testcase > > > (below) that freezes the box here (which has 24G RAM, so even if I'm > > > lacking to call msync, I should be fine) sometime after memset finishes. > > I've tried your test but didn't succeed in freezing my laptop. > > Everything was running smooth, the machine even felt reasonably responsive > > although constantly reading and writing to disk. Also sync(1) finished in a > > couple of seconds as one would expect in an optimistic case. > > Needless to say that my laptop has only 1G of ram so I had to downsize > > the hash table from 16G to 1G to be able to run the test and the disk is > > Intel SSD so the performance of the backing storage compared to the amount > > of needed IO is much in my favor. > > OK, so I've taken a machine with standard rotational drive and 28GB of > > ram and there I can see sync(1) hanging (but otherwise the machine looks > > OK). Investigating further... > So with the writeback tracing, I verified that indeed the trouble is that > work queued by sync(1) gets queued behind the background writeback which is > just running. And background writeback won't stop because your process is > dirtying pages so agressively. Actually, it would stop after writing > LONG_MAX pages but that's effectively infinity. I have a patch > (e.g. http://www.kerneltrap.com/mailarchive/linux-fsdevel/2010/8/3/6886244) > to stop background writeback when other work is queued but it's kind > of hacky so I can see why Christoph doesn't like it ;) > So I'll have to code something different to fix this issue... 
OK, so at Kernel Summit we agreed to fix the issue as I originally wanted
by patches
http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2
and
http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2

I needed one more patch to resolve the issue (attached) which I've just
posted for review and possible inclusion. I had a similar one a long time
ago, but now I'm better able to explain why it works because of
tracepoints. Yay! ;). With those three patches I'm not able to trigger
livelocks (but sync still takes 15 minutes because the throughput to disk
is about 4MB/s - no big surprise given the random nature of the load)

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Avoid-livelocking-of-WB_SYNC_ALL-writeback.patch --]
[-- Type: text/x-patch, Size: 3000 bytes --]

>From 44c256bbd627ae75039b99724ce3c7caa7f4fd95 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 5 Nov 2010 17:56:03 +0100
Subject: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback

When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and thus
we easily end up with negative nr_to_write after the function returns.
wb_writeback() then decides we need another round of writeback but this
is wrong in some cases! For example, when a single large file is
continuously dirtied, we would never finish syncing it because each pass
would be able to write MAX_WRITEBACK_PAGES and the inode dirty timestamp
never gets updated (as the inode is never completely clean).

Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
do not need nr_to_write in WB_SYNC_ALL mode anyway since livelock
avoidance is done differently for it.

After this patch, the program from http://lkml.org/lkml/2010/10/24/154
is no longer able to stall sync forever.
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   18 ++++++++++++++----
 1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6b4d02a..d5873a6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -629,6 +629,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
+	long write_chunk;
 	struct inode *inode;
 
 	if (wbc.for_kupdate) {
@@ -640,6 +641,15 @@ static long wb_writeback(struct bdi_writeback *wb,
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
 	}
+	/*
+	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
+	 * we need to write everything and livelock avoidance is implemented
+	 * differently.
+	 */
+	if (wbc.sync_mode == WB_SYNC_NONE)
+		write_chunk = MAX_WRITEBACK_PAGES;
+	else
+		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
 	for (;;) {
@@ -665,7 +675,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 			break;
 
 		wbc.more_io = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
@@ -675,8 +685,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
-		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
 		 * If we consumed everything, see if we have more
@@ -691,7 +701,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 21:33 ` Jan Kara @ 2010-11-05 21:34 ` Jan Kara 2010-11-05 21:41 ` Linus Torvalds 2010-11-05 22:03 ` Jan Engelhardt 1 sibling, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-11-05 21:34 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Fri 05-11-10 22:33:24, Jan Kara wrote: > I needed one more patch to resolve the issue (attached) which I've just > posted for review and possible inclusion. I had a similar one long time ago > but now I'm better able to explain why it works because of tracepoints. > Yay! ;). With those three patches I'm not able to trigger livelocks (but > sync takes still 15 minutes because the througput to disk is about 4MB/s - > no big surprise given the random nature of the load) PS: And big thanks to you for providing the test case and your persistence ;) Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 21:34 ` Jan Kara @ 2010-11-05 21:41 ` Linus Torvalds 0 siblings, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2010-11-05 21:41 UTC (permalink / raw) To: Jan Kara Cc: Jan Engelhardt, Andrew Morton, Jens Axboe, Linux Kernel, stable, gregkh On Fri, Nov 5, 2010 at 2:34 PM, Jan Kara <jack@suse.cz> wrote: > On Fri 05-11-10 22:33:24, Jan Kara wrote: >> I needed one more patch to resolve the issue (attached) which I've just >> posted for review and possible inclusion. I had a similar one long time ago >> but now I'm better able to explain why it works because of tracepoints. >> Yay! ;). With those three patches I'm not able to trigger livelocks (but >> sync takes still 15 minutes because the througput to disk is about 4MB/s - >> no big surprise given the random nature of the load) > PS: And big thanks to you for providing the test case and your > persistence ;) Ok, I'm inclined to just apply these three patches. Can we get a quick ack/tested-by for them? I'm sure there's more work to be done, but let's put this damn thing to rest if it really does fix the problem Jan(E) sees. Comments? Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 21:33 ` Jan Kara 2010-11-05 21:34 ` Jan Kara @ 2010-11-05 22:03 ` Jan Engelhardt 2010-11-07 12:57 ` Jan Kara 2011-01-20 22:50 ` Jan Engelhardt 1 sibling, 2 replies; 39+ messages in thread From: Jan Engelhardt @ 2010-11-05 22:03 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Friday 2010-11-05 22:33, Jan Kara wrote: > OK, so at Kernel Summit we agreed to fix the issue as I originally wanted >by patches >http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2 >and >http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2 > > I needed one more patch to resolve the issue (attached) which I've just >posted for review and possible inclusion. If you have a branch that has all three and is easy to fetch, I'll get on it right away. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 22:03 ` Jan Engelhardt @ 2010-11-07 12:57 ` Jan Kara 2011-01-20 22:50 ` Jan Engelhardt 1 sibling, 0 replies; 39+ messages in thread From: Jan Kara @ 2010-11-07 12:57 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Fri 05-11-10 23:03:31, Jan Engelhardt wrote: > > On Friday 2010-11-05 22:33, Jan Kara wrote: > > OK, so at Kernel Summit we agreed to fix the issue as I originally wanted > >by patches > >http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2 > >and > >http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2 > > > > I needed one more patch to resolve the issue (attached) which I've just > >posted for review and possible inclusion. > > If you have a branch that has all three and is easy to fetch, I'll > get on it right away. In my repo: git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6.git I've created branch writeback_fixes which has 2.6.37-rc1 + the three patches. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 22:03 ` Jan Engelhardt 2010-11-07 12:57 ` Jan Kara @ 2011-01-20 22:50 ` Jan Engelhardt 2011-01-21 15:09 ` Jan Kara 1 sibling, 1 reply; 39+ messages in thread From: Jan Engelhardt @ 2011-01-20 22:50 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Friday 2010-11-05 23:03, Jan Engelhardt wrote: >On Friday 2010-11-05 22:33, Jan Kara wrote: >> OK, so at Kernel Summit we agreed to fix the issue as I originally wanted >>by patches >>http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2 >>and >>http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2 >> >> I needed one more patch to resolve the issue (attached) which I've just >>posted for review and possible inclusion. > >If you have a branch that has all three and is easy to fetch, I'll >get on it right away. I tested these now [2.6.37-rc1+yourpatches]. Within the last 13 days of uptime, the following messages have accumulated, however the machine is still alive, so I guess it's fine. (I see Andrew merged the patches already, so if it becomes a problem again I will notice it within the next kernel releases.) INFO: task flush-259:0:324 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
flush-259:0   D 00000000005a3d38     0   324      2 0x18000000000
Call Trace:
 [00000000005a39c4] do_get_write_access+0x244/0x480
 [00000000005a3d38] jbd2_journal_get_write_access+0x18/0x40
 [0000000000591278] __ext4_journal_get_write_access+0x18/0x60
 [0000000000572a94] ext4_reserve_inode_write+0x54/0xa0
 [0000000000572afc] ext4_mark_inode_dirty+0x1c/0x1e0
 [000000000058ba98] ext4_ext_insert_extent+0x718/0x17a0
 [000000000058f2d4] ext4_ext_map_blocks+0x9f4/0x12c0
 [00000000005742f4] ext4_map_blocks+0x1b4/0x240
 [0000000000576864] mpage_da_map_and_submit+0x84/0x420
 [0000000000576f5c] write_cache_pages_da+0x27c/0x3e0
 [00000000005772f8] ext4_da_writepages+0x238/0x4a0
 [00000000004cc7a4] do_writepages+0x24/0x60
 [0000000000525264] writeback_single_inode+0x84/0x240
 [00000000005258d8] writeback_sb_inodes+0x98/0x140
 [0000000000525f44] writeback_inodes_wb+0xc4/0x180
 [0000000000526294] wb_writeback+0x294/0x300
INFO: task jbd2/sda2-8:327 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda2-8   D 000000000052ce28     0   327      2 0x58000000000
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [000000000052ce28] sync_buffer+0x48/0x60
 [000000000078895c] __wait_on_bit+0x7c/0xe0
 [0000000000788a04] out_of_line_wait_on_bit+0x44/0x60
 [00000000005a4e10] jbd2_journal_commit_transaction+0x850/0x1280
 [00000000005a87bc] kjournald2+0x9c/0x1c0
 [0000000000482b84] kthread+0x64/0x80
 [000000000042b790] kernel_thread+0x30/0x60
 [0000000000482e84] kthreadd+0xe4/0x160
INFO: task lighttpd:3260 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lighttpd      D 000000000052ce28     0  3260      1 0x410051101000080
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [000000000052ce28] sync_buffer+0x48/0x60
 [000000000078895c] __wait_on_bit+0x7c/0xe0
 [0000000000788a04] out_of_line_wait_on_bit+0x44/0x60
 [000000000057ba44] ext4_find_entry+0x3c4/0x520
 [000000000057c7c8] ext4_lookup+0x28/0x180
 [000000000050e5a0] d_alloc_and_lookup+0x40/0x80
 [000000000050e6dc] do_lookup+0xfc/0x160
 [0000000000510578] link_path_walk+0x6b8/0xb00
 [0000000000510aa8] path_walk+0x28/0xa0
 [0000000000510b64] do_path_lookup+0x44/0xa0
 [0000000000510f6c] user_path_at+0x2c/0x80
 [00000000005090c8] vfs_fstatat+0x28/0x80
 [000000000044390c] compat_sys_stat64+0xc/0x40
 [0000000000406114] linux_sparc_syscall32+0x34/0x40
INFO: task rsync:2451 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rsync         D 00000000005a3d38     0  2451   2445 0x410061101000080
Call Trace:
 [00000000005a39c4] do_get_write_access+0x244/0x480
 [00000000005a3d38] jbd2_journal_get_write_access+0x18/0x40
 [0000000000591278] __ext4_journal_get_write_access+0x18/0x60
 [0000000000572a94] ext4_reserve_inode_write+0x54/0xa0
 [0000000000572afc] ext4_mark_inode_dirty+0x1c/0x1e0
 [0000000000577844] ext4_dirty_inode+0x24/0x40
 [0000000000525b60] __mark_inode_dirty+0x20/0x200
 [0000000000519d84] file_update_time+0xc4/0x140
 [00000000004c43f4] __generic_file_aio_write+0x1f4/0x3a0
 [00000000004c45d4] generic_file_aio_write+0x34/0xc0
 [0000000000504338] do_sync_write+0x78/0xc0
 [00000000005049a8] vfs_write+0x68/0x140
 [0000000000504c4c] SyS_write+0x2c/0x60
 [0000000000406114] linux_sparc_syscall32+0x34/0x40
INFO: task cron:4376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cron          D 00000000004c33f8     0  4376   3244 0x210021101000080
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [00000000004c33f8] sync_page+0x58/0x80
 [0000000000788808] __wait_on_bit_lock+0x68/0xe0
 [00000000004c3360] __lock_page+0x40/0x60
 [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
 [00000000004c5108] filemap_fault+0x248/0x3c0
 [00000000004dca90] __do_fault+0x30/0x480
 [00000000004df27c] handle_mm_fault+0x13c/0x9a0
 [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
 [00000000004079a8] sparc64_realfault_common+0x10/0x20
INFO: task cron:4377 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cron          D 00000000004c33f8     0  4377   3244 0x210021101000080
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [00000000004c33f8] sync_page+0x58/0x80
 [0000000000788808] __wait_on_bit_lock+0x68/0xe0
 [00000000004c3360] __lock_page+0x40/0x60
 [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
 [00000000004c5108] filemap_fault+0x248/0x3c0
 [00000000004dca90] __do_fault+0x30/0x480
 [00000000004df27c] handle_mm_fault+0x13c/0x9a0
 [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
 [00000000004079a8] sparc64_realfault_common+0x10/0x20
INFO: task cron:4378 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cron          D 00000000004c33f8     0  4378   3244 0x210021101000080
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [00000000004c33f8] sync_page+0x58/0x80
 [0000000000788808] __wait_on_bit_lock+0x68/0xe0
 [00000000004c3360] __lock_page+0x40/0x60
 [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
 [00000000004c5108] filemap_fault+0x248/0x3c0
 [00000000004dca90] __do_fault+0x30/0x480
 [00000000004df27c] handle_mm_fault+0x13c/0x9a0
 [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
 [00000000004079a8] sparc64_realfault_common+0x10/0x20

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2011-01-20 22:50 ` Jan Engelhardt @ 2011-01-21 15:09 ` Jan Kara 0 siblings, 0 replies; 39+ messages in thread From: Jan Kara @ 2011-01-21 15:09 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh

Hi,

On Thu 20-01-11 23:50:36, Jan Engelhardt wrote:
> On Friday 2010-11-05 23:03, Jan Engelhardt wrote:
> >On Friday 2010-11-05 22:33, Jan Kara wrote:
> >> OK, so at Kernel Summit we agreed to fix the issue as I originally wanted
> >>by patches
> >>http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2
> >>and
> >>http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2
> >>
> >> I needed one more patch to resolve the issue (attached) which I've just
> >>posted for review and possible inclusion.
> >
> >If you have a branch that has all three and is easy to fetch, I'll
> >get on it right away.
>
> I tested these now [2.6.37-rc1+yourpatches].
>
> Within the last 13 days of uptime, the following messages have
> accumulated, however the machine is still alive, so I guess it's
> fine. (I see Andrew merged the patches already, so if it becomes
> a problem again I will notice it within the next kernel releases.)

Thanks for the update :) I went through the messages below and there are no signs of bdi-writeout causing any problems. Apparently your disk is rather busy (maybe in combination with suboptimal IO scheduling decisions?) since I could attribute all the reports to waiting for an IO on a buffer to finish.

Honza

> INFO: task flush-259:0:324 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> flush-259:0 D 00000000005a3d38 0 324 2 0x18000000000
> Call Trace:
>  [00000000005a39c4] do_get_write_access+0x244/0x480
>  [00000000005a3d38] jbd2_journal_get_write_access+0x18/0x40
>  [0000000000591278] __ext4_journal_get_write_access+0x18/0x60
>  [0000000000572a94] ext4_reserve_inode_write+0x54/0xa0
>  [0000000000572afc] ext4_mark_inode_dirty+0x1c/0x1e0
>  [000000000058ba98] ext4_ext_insert_extent+0x718/0x17a0
>  [000000000058f2d4] ext4_ext_map_blocks+0x9f4/0x12c0
>  [00000000005742f4] ext4_map_blocks+0x1b4/0x240
>  [0000000000576864] mpage_da_map_and_submit+0x84/0x420
>  [0000000000576f5c] write_cache_pages_da+0x27c/0x3e0
>  [00000000005772f8] ext4_da_writepages+0x238/0x4a0
>  [00000000004cc7a4] do_writepages+0x24/0x60
>  [0000000000525264] writeback_single_inode+0x84/0x240
>  [00000000005258d8] writeback_sb_inodes+0x98/0x140
>  [0000000000525f44] writeback_inodes_wb+0xc4/0x180
>  [0000000000526294] wb_writeback+0x294/0x300
> INFO: task jbd2/sda2-8:327 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda2-8 D 000000000052ce28 0 327 2 0x58000000000
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [000000000052ce28] sync_buffer+0x48/0x60
>  [000000000078895c] __wait_on_bit+0x7c/0xe0
>  [0000000000788a04] out_of_line_wait_on_bit+0x44/0x60
>  [00000000005a4e10] jbd2_journal_commit_transaction+0x850/0x1280
>  [00000000005a87bc] kjournald2+0x9c/0x1c0
>  [0000000000482b84] kthread+0x64/0x80
>  [000000000042b790] kernel_thread+0x30/0x60
>  [0000000000482e84] kthreadd+0xe4/0x160
> INFO: task lighttpd:3260 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> lighttpd D 000000000052ce28 0 3260 1 0x410051101000080
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [000000000052ce28] sync_buffer+0x48/0x60
>  [000000000078895c] __wait_on_bit+0x7c/0xe0
>  [0000000000788a04] out_of_line_wait_on_bit+0x44/0x60
>  [000000000057ba44] ext4_find_entry+0x3c4/0x520
>  [000000000057c7c8] ext4_lookup+0x28/0x180
>  [000000000050e5a0] d_alloc_and_lookup+0x40/0x80
>  [000000000050e6dc] do_lookup+0xfc/0x160
>  [0000000000510578] link_path_walk+0x6b8/0xb00
>  [0000000000510aa8] path_walk+0x28/0xa0
>  [0000000000510b64] do_path_lookup+0x44/0xa0
>  [0000000000510f6c] user_path_at+0x2c/0x80
>  [00000000005090c8] vfs_fstatat+0x28/0x80
>  [000000000044390c] compat_sys_stat64+0xc/0x40
>  [0000000000406114] linux_sparc_syscall32+0x34/0x40
> INFO: task rsync:2451 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> rsync D 00000000005a3d38 0 2451 2445 0x410061101000080
> Call Trace:
>  [00000000005a39c4] do_get_write_access+0x244/0x480
>  [00000000005a3d38] jbd2_journal_get_write_access+0x18/0x40
>  [0000000000591278] __ext4_journal_get_write_access+0x18/0x60
>  [0000000000572a94] ext4_reserve_inode_write+0x54/0xa0
>  [0000000000572afc] ext4_mark_inode_dirty+0x1c/0x1e0
>  [0000000000577844] ext4_dirty_inode+0x24/0x40
>  [0000000000525b60] __mark_inode_dirty+0x20/0x200
>  [0000000000519d84] file_update_time+0xc4/0x140
>  [00000000004c43f4] __generic_file_aio_write+0x1f4/0x3a0
>  [00000000004c45d4] generic_file_aio_write+0x34/0xc0
>  [0000000000504338] do_sync_write+0x78/0xc0
>  [00000000005049a8] vfs_write+0x68/0x140
>  [0000000000504c4c] SyS_write+0x2c/0x60
>  [0000000000406114] linux_sparc_syscall32+0x34/0x40
> INFO: task cron:4376 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cron D 00000000004c33f8 0 4376 3244 0x210021101000080
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [00000000004c33f8] sync_page+0x58/0x80
>  [0000000000788808] __wait_on_bit_lock+0x68/0xe0
>  [00000000004c3360] __lock_page+0x40/0x60
>  [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
>  [00000000004c5108] filemap_fault+0x248/0x3c0
>  [00000000004dca90] __do_fault+0x30/0x480
>  [00000000004df27c] handle_mm_fault+0x13c/0x9a0
>  [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
>  [00000000004079a8] sparc64_realfault_common+0x10/0x20
> INFO: task cron:4377 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cron D 00000000004c33f8 0 4377 3244 0x210021101000080
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [00000000004c33f8] sync_page+0x58/0x80
>  [0000000000788808] __wait_on_bit_lock+0x68/0xe0
>  [00000000004c3360] __lock_page+0x40/0x60
>  [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
>  [00000000004c5108] filemap_fault+0x248/0x3c0
>  [00000000004dca90] __do_fault+0x30/0x480
>  [00000000004df27c] handle_mm_fault+0x13c/0x9a0
>  [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
>  [00000000004079a8] sparc64_realfault_common+0x10/0x20
> INFO: task cron:4378 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cron D 00000000004c33f8 0 4378 3244 0x210021101000080
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [00000000004c33f8] sync_page+0x58/0x80
>  [0000000000788808] __wait_on_bit_lock+0x68/0xe0
>  [00000000004c3360] __lock_page+0x40/0x60
>  [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
>  [00000000004c5108] filemap_fault+0x248/0x3c0
>  [00000000004dca90] __do_fault+0x30/0x480
>  [00000000004df27c] handle_mm_fault+0x13c/0x9a0
>  [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
>  [00000000004079a8] sparc64_realfault_common+0x10/0x20

--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-12 15:45 ` Linus Torvalds 2010-02-13 12:58 ` Jan Engelhardt @ 2010-02-15 14:17 ` Jan Kara 2010-02-16 0:05 ` Linus Torvalds 1 sibling, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-02-15 14:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jens Axboe, Linux Kernel, jack, jengelh, stable, gregkh

On Fri 12-02-10 07:45:04, Linus Torvalds wrote:
> On Fri, 12 Feb 2010, Jens Axboe wrote:
> >
> > This fixes it by using the passed in page writeback count, instead of
> > doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> > (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> > finish properly even when new pages are being dirtied.
>
> This seems broken.
>
> The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't
> generate a single IO bigger than a couple of MB. And then we have that
> loop around things to do multiple chunks. Your change to use nr_pages
> seems to make the whole looping totally pointless, and breaks that "don't
> do huge hunks" logic.
>
> So I don't think that your patch is correct.
>
> That said, I _do_ believe you when you say it makes a difference, which
> makes me think there is a bug there. I just don't think you fixed the
> right bug, and your change just happens to do what you wanted by pure
> random luck.

A few points on this: Before Jens rewrote the writeback code to use per-bdi flusher threads, MAX_WRITEBACK_PAGES applied only to writeback done by pdflush and kupdate. So, for example, sync used .nr_pages as in the suggested patch. So Jens' patch was just trying to get back to the old behavior... Personally, I don't see why the VM shouldn't generate IO from a single inode larger than a few MB for data integrity syncs.
There are two reasons I know about for MAX_WRITEBACK_PAGES:

a) avoid livelocking of writeout when the file is continuously grown (even nastier is when the file is sparse and huge and the writer just fills the huge hole with data gradually)

b) avoid holding I_SYNC for too long because that could block tasks being write-throttled.

I find a) a valid reason, but MAX_WRITEBACK_PAGES works around the livelock only for background writeback, where we cycle through all inodes and do not care whether we ever make the inode clean. Limiting data integrity writeback to MAX_WRITEBACK_PAGES just makes things *worse* because you give processes more time to generate new dirty pages. Using dirtied_when does not solve this because the inode dirty timestamp is updated only at the clean->dirty transition, but the inode never gets clean.

I don't think b) is a very valid reason anymore. Currently, a throttled task writes all inodes from the BDI and the I_SYNC-locked inode is just skipped. It can only be a problem when that inode would be the only one with a big enough amount of dirty data. Then the throttled thread will exit balance_dirty_pages only once the amount of dirty data at that BDI drops below the BDI dirty threshold.

> The _real_ bug seems to be the one you mentioned, but then ignored:
>
> > Instead of flushing everything older than the sync run, it will do
> > chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
> > over.
>
> and it would seem that the _logical_ way to fix it would be something like
> the appended...
>
> Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that
> any _future_ data is written back, so the 'oldest_jif' thing would seem to
> be sane regardless of sync mode.
>
> NOTE NOTE NOTE! I do have to admit that this patch scares me, because
> there could be some bug in the 'older_than_this' logic that means that
> somebody sets it even if the inode is already dirty.
> So this patch makes
> conceptual sense to me, and I think it's the right thing to do, but I
> also suspect that we do not actually have a lot of test coverage of the
> whole 'older_than_this' logic, because it historically has been just a
> small optimization for kupdated.

A similar logic to what you are suggesting is actually already there. See the 'start' variable in writeback_inodes_wb(). It was there exactly to avoid sync(1) livelocks, but it assumes that writeback_inodes_wb() handles all the IO for the sync, not that it is called just to write MAX_WRITEBACK_PAGES. So the only danger of your patch using older_than_this would be that the ordering on the wb->b_dirty list isn't always by dirty time. But as I wrote above, the main problem I see in your patch is that it does not avoid all the cases of livelocks.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-15 14:17 ` [PATCH] writeback: Fix broken sync writeback Jan Kara @ 2010-02-16 0:05 ` Linus Torvalds 2010-02-16 23:00 ` Jan Kara 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2010-02-16 0:05 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Linux Kernel, jengelh, stable, gregkh

On Mon, 15 Feb 2010, Jan Kara wrote:
> >
> > Personally, I don't see why VM shouldn't generate IO from a single inode
> larger than a few MB for data integrity syncs. There are two reasons I know
> about for MAX_WRITEBACK_PAGES:

Two issues:
 - it shouldn't matter for performance (if you have a broken disk that
   wants multi-megabyte writes to get good performance, you need to fix
   the driver, not the VM)
 - I generally think that _latency_ is much more important than
   throughput, and huge writes are simply bad for latency.

But the real complaint I had about your patch was that it made no sense. Your statement that it speeds up sync is indicative of some _other_ independent problem (perhaps starting from the beginning of the file each time? Who knows?) and the patch _clearly_ doesn't actually address the underlying problem, it just changes the logic in a way that makes no obvious sense to anybody, without actually explaining why it would matter.

So if you can explain _why_ that patch makes such a big difference for you, and actually write that up, I wouldn't mind it. But right now it was sent as a voodoo patch with no sensible explanation for why it would really make any difference, since the outer loop should already do what the patch does.

Linus
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-16 0:05 ` Linus Torvalds @ 2010-02-16 23:00 ` Jan Kara 2010-02-16 23:34 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-02-16 23:00 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh

On Mon 15-02-10 16:05:43, Linus Torvalds wrote:
>
> On Mon, 15 Feb 2010, Jan Kara wrote:
> >
> > Personally, I don't see why VM shouldn't generate IO from a single inode
> > larger than a few MB for data integrity syncs. There are two reasons I know
> > about for MAX_WRITEBACK_PAGES:
>
> Two issues:
>  - it shouldn't matter for performance (if you have a broken disk that
>    wants multi-megabyte writes to get good performance, you need to fix
>    the driver, not the VM)

The IO size actually does matter for performance, because if you switch after 4 MB (the current value of MAX_WRITEBACK_PAGES) to writing another inode, that means a seek to a distant place, and that has significant impact (seek times are in the 10-20 ms range for common drives, so with 50 MB/s throughput this is like 13-25% of the IO time...). A similar reason why it matters is if the filesystem does delayed allocation - then allocation is performed from writeback code, and if it happens in 4 MB chunks, chances for fragmentation are higher. Actually XFS, btrfs, and ext4 *work around* the fact that the VM sends only 4 MB chunks and write more regardless of the nr_to_write set - I agree this really sucks but that's the way it is.

>  - I generally think that _latency_ is much more important than
>    throughput, and huge writes are simply bad for latency.

I agree that latency matters, but is the VM the right place to decide where to break writes into smaller chunks for latency reasons? It knows neither about the placement of the file nor about possible concurrent requests for that device.
So I personally believe that the IO scheduler should be the layer that cares about scheduling IO so that we have decent latencies. As splitting bios is probably not an option, we might want an artificial upper limit on bio size for latency reasons, but that is still way below the VM...

> But the real complaint I had about your patch was that it made no sense.
>
> Your statement that it speeds up sync is indicative of some _other_
> independent problem (perhaps starting from the beginning of the file each
> time? Who knows?) and the patch _clearly_ doesn't actually address the
> underlying problem, it just changes the logic in a way that makes no
> obvious sense to anybody, without actually explaining why it would matter.
>
> So if you can explain _why_ that patch makes such a big difference for
> you, and actually write that up, I wouldn't mind it. But right now it was
> sent as a voodoo patch with no sensible explanation for why it would
> really make any difference, since the outer loop should already do what
> the patch does.

OK, fair enough :). The patch is actually Jens', but looking at the code again, the nr_to_write trimming should rather happen inside writeback_inodes_wb, not outside of it. That would make the logic clearer and magically also fix the problem, because write_cache_pages() ignores nr_to_write in WB_SYNC_ALL mode. But I guess it's better not to depend on this and to explicitly handle that in writeback_inodes_wb... So I'll post a new patch with a better changelog shortly.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-16 23:00 ` Jan Kara @ 2010-02-16 23:34 ` Linus Torvalds 2010-02-17 0:01 ` Linus Torvalds 2010-02-17 1:33 ` Jan Kara 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2010-02-16 23:34 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Linux Kernel, jengelh, stable, gregkh

On Wed, 17 Feb 2010, Jan Kara wrote:
> >
> > The IO size actually does matter for performance because if you switch
> after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,

No.

Dammit, read the code.

That's my whole _point_. Look at the for-loop.

We DO NOT SWITCH to another inode, because we just continue in the for-loop.

This is why I think your patch is crap. You clearly haven't even read the code, the patch makes no sense, and there must be something else going on than what you _claim_ is going on.

> So I'll post a new patch with a better changelog shortly.

Not "shortly". Because you clearly haven't looked at the code. Look here:

	/*
	 * If we consumed everything, see if we have more
	 */
	if (wbc.nr_to_write <= 0)
		continue;

and here:

	/*
	 * Did we write something? Try for more
	 */
	if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
		continue;

the point is, the way the code is written, it is very much _meant_ to write out one file in one go, except it is supposed to chunk it up so that you never see _too_ many dirty pages in flight at any time. Because tons of dirty pages in the queues makes for bad latency.

Now, I admit that since your patch makes a difference, there is a bug somewhere. That's never what I've argued against. What I argue against is your patch itself, and your explanation for it. They don't make sense. I suspect that there is something else going on.
In particular, I wonder if multiple calls to "writeback_inodes_wb()" somehow mess up the wbc state, and it starts writing the same inode from the beginning, or it simply has some bug that means that it moves the inode to the head of the queue, or something. For example, it might be that the logic in writeback_inodes_wb() moves an inode back (the "redirty_tail()" cases) in bad ways when it shouldn't. See what I'm trying to say? I think your patch probably just hides the _real_ bug. Because it really shouldn't matter if we do things in 4MB chunks or whatever, because the for-loop should happily continue. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-16 23:34 ` Linus Torvalds @ 2010-02-17 0:01 ` Linus Torvalds 2010-02-17 1:33 ` Jan Kara 1 sibling, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2010-02-17 0:01 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue, 16 Feb 2010, Linus Torvalds wrote: > > For example, it might be that the logic in writeback_inodes_wb() moves an > inode back (the "redirty_tail()" cases) in bad ways when it shouldn't. Or another example: perhaps we screw up the inode list ordering when we move the inodes between b_dirty <-> b_io <-> b_more_io? And if we put inodes on the b_more_io list, do we perhaps then end up doing too much waiting in inode_wait_for_writeback()? See what I'm saying? Your patch - by just submitting the maximal sized buffers - may well end up hiding the real problem. But if there is a real problem in that whole list manipulation or waiting, then that problem still exists for async writeback. Wouldn't it be better to fix the real problem, so that async writeback also gets the correct IO patterns? NOTE! It's entirely possible that we do end up wanting to really submit the maximal dirty IO for synchronous dirty writeback, in order to get better IO patterns. So maybe your patch ends up being the right one in the end. But I really _really_ want to understand this. But right now, that patch seems like voodoo programming to me, and I personally suspect that the real problem is in the horribly complex b_io/b_more_io interaction. Or one of the _other_ horribly complex details in the write-back logic (even just writing back a single inode is complicated, see all the logic about "range_start/range_end" in the lower level write_cache_pages() function). Our whole writeback code is very very complicated. I don't like it. But that's also why I want to feel like I understand the patch when I apply it. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-16 23:34 ` Linus Torvalds 2010-02-17 0:01 ` Linus Torvalds @ 2010-02-17 1:33 ` Jan Kara 2010-02-17 1:57 ` Dave Chinner 2010-02-17 3:35 ` Linus Torvalds 1 sibling, 2 replies; 39+ messages in thread From: Jan Kara @ 2010-02-17 1:33 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh

On Tue 16-02-10 15:34:01, Linus Torvalds wrote:
> On Wed, 17 Feb 2010, Jan Kara wrote:
> > >
> > > The IO size actually does matter for performance because if you switch
> > after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,
>
> No.
>
> Dammit, read the code.
>
> That's my whole _point_. Look at the for-loop.
>
> We DO NOT SWITCH to another inode, because we just continue in the
> for-loop.
>
> This is why I think your patch is crap. You clearly haven't even read the
> code, the patch makes no sense, and there must be something else going on
> than what you _claim_ is going on.

I've read the code. Maybe I'm missing something but look:

  writeback_inodes_wb(nr_to_write = 1024)
    -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list
    ...
    writeback_single_inode()
      ...writes 1024 pages.

if we haven't written everything in the inode (more than 1024 dirty pages) we end up doing either requeue_io() or redirty_tail(). In the first case the inode is put on the b_more_io list, in the second case at the tail of the b_dirty list. In either case it will not receive further writeout until we go through all other members of the current b_io list.

So I claim we currently *do* switch to another inode after 4 MB. That is a fact. I *think* it is by design - mainly to avoid the situation where someone continuously writes a huge file and kupdate or pdflush would never get to writing other files with dirty data (at least that's the impression I've built over the years - heck, even 2.6.16 seems to have this redirty_tail logic with a comment about the above livelock).
I do find this design broken as well as you likely do and think that the livelock issue described in the above paragraph should be solved differently (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix. The question is what to do now for 2.6.33 and 2.6.32-stable. Personally, I think that changing the writeback logic so that it does not switch inodes after 4 MB is too risky for these two kernels. So with the above explanation would you accept some fix along the lines of original Jens' fix? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 1:33 ` Jan Kara @ 2010-02-17 1:57 ` Dave Chinner 2010-02-17 3:35 ` Linus Torvalds 1 sibling, 0 replies; 39+ messages in thread From: Dave Chinner @ 2010-02-17 1:57 UTC (permalink / raw) To: Jan Kara Cc: Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Wed, Feb 17, 2010 at 02:33:37AM +0100, Jan Kara wrote: > On Tue 16-02-10 15:34:01, Linus Torvalds wrote: > > On Wed, 17 Feb 2010, Jan Kara wrote: > > > > > > The IO size actually does matter for performance because if you switch > > > after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode, > > > > No. > > > > Dammit, read the code. > > > > That's my whole _point_. Look at the for-loop. > > > > We DO NOT SWITCH to another inode, because we just continue in the > > for-loop. > > > > This is why I think your patch is crap. You clearly haven't even read the > > code, the patch makes no sense, and there must be something else going on > > than what you _claim_ is going on. > I've read the code. Maybe I'm missing something but look: > writeback_inodes_wb(nr_to_write = 1024) > -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list > ... > writeback_single_inode() > ...writes 1024 pages. > if we haven't written everything in the inode (more than 1024 dirty > pages) we end up doing either requeue_io() or redirty_tail(). In the > first case the inode is put to b_more_io list, in the second case to > the tail of b_dirty list. In either case it will not receive further > writeout until we go through all other members of current b_io list. > > So I claim we currently *do* switch to another inode after 4 MB. That > is a fact. That is my understanding of how it works, too. 
> I *think* it is by design - mainly to avoid the situation where someone > continuously writes a huge file and kupdate or pdflush would never get to > writing other files with dirty data (at least that's impression I've built > over the years - heck, even 2.6.16 seems to have this redirty_tail logic > with a comment about the above livelock). Right, and there is another condition as well - writing lots of small files could starve large file writeback as the large file only got 4MB written back every 30s writeback period because it took that long to write out all the new files. IIRC it was in 2.6.16 that both these problems were fixed... > I do find this design broken as well as you likely do and think that the > livelock issue described in the above paragraph should be solved differently > (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix. > > The question is what to do now for 2.6.33 and 2.6.32-stable. Personally, > I think that changing the writeback logic so that it does not switch inodes > after 4 MB is too risky for these two kernels. So with the above > explanation would you accept some fix along the lines of original Jens' > fix? We've had this sync() writeback behaviour for a long time - my question is why is it only now a problem? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 1:33 ` Jan Kara 2010-02-17 1:57 ` Dave Chinner @ 2010-02-17 3:35 ` Linus Torvalds 2010-02-17 4:30 ` tytso 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2010-02-17 3:35 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Wed, 17 Feb 2010, Jan Kara wrote: > > I've read the code. Maybe I'm missing something but look: > writeback_inodes_wb(nr_to_write = 1024) > -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list > ... > writeback_single_inode() > ...writes 1024 pages. > if we haven't written everything in the inode (more than 1024 dirty > pages) we end up doing either requeue_io() or redirty_tail(). In the > first case the inode is put to b_more_io list, in the second case to > the tail of b_dirty list. In either case it will not receive further > writeout until we go through all other members of current b_io list. > > So I claim we currently *do* switch to another inode after 4 MB. That > is a fact. Ok, I think that's the bug. I do agree that it may well be intentional, but considering the performance impact, I suspect it's been "intentional without any performance numbers". Which just makes me very unhappy to just paper it over for the sync case, and leave the now known-broken state alone for the async case. That really isn't how we want to do things. That said, if we've done this forever, I can certainly see the allure to just keep doing it, and then handle the sync case separately. > I do find this design broken as well as you likely do and think that the > livelock issue described in the above paragraph should be solved differently > (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix. Hmm. The thing is, the new radix tree bit you propose also sounds like overdesigning things. 
If we really do switch inodes (which I obviously didn't expect, even if I may have been aware of it many years ago), then the max rate limiting is just always bad. If it's bad for synchronous syncs, then it's bad for background syncing too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely - since the whole latency argument goes away if we don't always honor it ("Oh, we have good latency - _except_ if you do 'sync()' to synchronously write something out" - that's just insane).

> The question is what to do now for 2.6.33 and 2.6.32-stable. Personally,
> I think that changing the writeback logic so that it does not switch inodes
> after 4 MB is too risky for these two kernels. So with the above
> explanation would you accept some fix along the lines of original Jens'
> fix?

What is affected if we just remove MAX_WRITEBACK_PAGES entirely (as opposed to the patch under discussion that effectively removes it for WB_SYNC_ALL)? I see balance_dirty_pages -> bdi_start_writeback, but that if anything would be something that I think would be better off with efficient writeback, and doesn't seem like it should try to round-robin over inodes for latency reasons. But I guess we can do it in stages, if it's about "minimal changes" for 2.6.32/33.

Linus
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 3:35 ` Linus Torvalds @ 2010-02-17 4:30 ` tytso 2010-02-17 5:16 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: tytso @ 2010-02-17 4:30 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue, Feb 16, 2010 at 07:35:35PM -0800, Linus Torvalds wrote: > > writeback_single_inode() > > ...writes 1024 pages. > > if we haven't written everything in the inode (more than 1024 dirty > > pages) we end up doing either requeue_io() or redirty_tail(). In the > > first case the inode is put to b_more_io list, in the second case to > > the tail of b_dirty list. In either case it will not receive further > > writeout until we go through all other members of current b_io list. > > > > So I claim we currently *do* switch to another inode after 4 MB. That > > is a fact. > > Ok, I think that's the bug. I do agree that it may well be intentional, > but considering the performance impact, I suspect it's been "intentional > without any performance numbers". This is well known amongst file system developers. We've even raised it from time to time, but apparently most people are too scared to touch the writeback code. I proposed upping the limit some six months ago, but I got serious pushback. As a result, I followed XFS's lead, and so now, both XFS and ext4 will write more pages than what is requested by the writeback logic, to work around this bug..... What we really want to do is to time how fast the device is. If the device is some Piece of Sh*t USB stick, then maybe you only want to write 4MB at a time to avoid latency problems. Heck, maybe you only want to write 32k at a time, if it's really slow.... But if it's some super-fast RAID array, maybe you want to write a lot more than 4MB at a time. 
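[Editorial note: a minimal sketch of the kind of device-speed scaling Ted suggests - purely illustrative, the writeback code has no such helper and the bounds are made up - sizing the chunk so one writeback pass costs roughly a fixed latency budget:]

```c
#include <assert.h>

/* Hypothetical helper: size a writeback chunk so that one pass takes
 * about target_ms of device time, clamped to sane bounds. */
static long chunk_bytes(long bw_bytes_per_sec, long target_ms)
{
	long chunk = bw_bytes_per_sec / 1000 * target_ms;

	if (chunk < 32 * 1024)			/* Ted's "really slow" floor */
		chunk = 32 * 1024;
	if (chunk > 128 * 1024 * 1024)		/* ext4-style 128MB cap */
		chunk = 128 * 1024 * 1024;
	return chunk;
}
```

A 400kB/s USB stick with a 100ms budget gets ~40kB chunks, a 500MB/s array gets 50MB chunks - the spread Ted is arguing for instead of one fixed 4MB constant.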
We've had this logic for a long time, and given the increase in disk density, and spindle speeds, the 4MB limit, which might have made sense 10 years ago, probably doesn't make sense now. > If it's bad for synchronous syncs, then it's bad for background syncing > too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely - > since the whole latency argument goes away if we don't always honor it > ("Oh, we have good latency - _except_ if you do 'sync()' to synchronously > write something out" - that's just insane). I tried arguing for this six months ago, and got the argument that it might cause latency problems on slow USB sticks. So I added a forced override for ext4, which now writes 128MB at a time --- with a sysfs tuning knob that allows the old behaviour to be restored if users really complained. No one did actually complain.... - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 4:30 ` tytso @ 2010-02-17 5:16 ` Linus Torvalds 2010-02-22 17:29 ` Jan Kara 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2010-02-17 5:16 UTC (permalink / raw) To: tytso; +Cc: Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue, 16 Feb 2010, tytso@mit.edu wrote: > > We've had this logic for a long time, and given the increase in disk > density, and spindle speeds, the 4MB limit, which might have made > sense 10 years ago, probably doesn't make sense now. I still don't think that 4MB is enough on its own to suck quite that much. Even a fast device should be perfectly happy with 4MB IOs, or it must be sucking really badly. In order to see the kinds of problems that got quoted in the original thread, there must be something else going on too, methinks (disk light was "blinking"). So I would guess that it's also getting stuck on that inode_wait_for_writeback(inode); inside that loop in wb_writeback(). In fact, I'm starting to wonder about that "Nothing written" case. The code basically decides that "if I wrote zero pages, I didn't write anything at all, so I must wait for the inode to complete old writes in order to not busy-loop". Which sounds sensible on the face of it, but the thing is, inodes can be dirty without actually having any dirty _pages_ associated with them. Are we perhaps ending up in a situation where we essentially wait synchronously on just the inode itself being written out? That would explain the "40kB/s" kind of behavior. If we were actually doing real 4MB chunks, that would _not_ explain 40kB/s throughput. But if we do a 4MB chunk (for the one file that had real dirty data in it), and then do a few hundred trivial "write out the inode data _synchronously_" (due to access time changes etc) in between until we hit the file that has real dirty data again - now _that_ would explain 40kB/s throughput. 
It's not just seeking around - it's not even trying to push multiple IO's to get any elevator going or anything like that. And then the patch that started this discussion makes sense: it improves performance because in between those synchronous inode updates it now writes big chunks. But again, it's mostly hiding us just doing insane things. I dunno. Just a theory. The more I look at that code, the uglier it looks. And I do get the feeling that the "4MB chunking" is really just making the more fundamental problems show up. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
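[Editorial note: Linus's theory is easy to sanity-check with arithmetic - a toy model with invented numbers, not measured data. One streamed 4MB chunk plus a few hundred synchronous inode writeouts, each stalling for a seek, transaction commit and cache flush, collapses throughput to roughly the figure reported in the thread:]

```c
#include <assert.h>

/* Effective throughput (kB/s) for one dirty file written as a single
 * streamed chunk, interleaved with nr_sync synchronous inode writeouts
 * that each stall for sync_ms (seek + commit + cache flush). */
static double effective_kbps(double chunk_kb, double stream_kbps,
			     int nr_sync, double sync_ms)
{
	double secs = chunk_kb / stream_kbps + nr_sync * sync_ms / 1000.0;

	return chunk_kb / secs;
}
```

A 4MB chunk streamed at 60MB/s with 500 intervening 20ms synchronous inode writes comes out around 400kB/s, while the same chunk with no synchronous stalls runs at full device speed - so under this model the stalls, not the 4MB chunking itself, dominate.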
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 5:16 ` Linus Torvalds @ 2010-02-22 17:29 ` Jan Kara 2010-02-22 21:01 ` tytso 0 siblings, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-02-22 17:29 UTC (permalink / raw) To: Linus Torvalds Cc: tytso, Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue 16-02-10 21:16:46, Linus Torvalds wrote: > On Tue, 16 Feb 2010, tytso@mit.edu wrote: > > > > We've had this logic for a long time, and given the increase in disk > > density, and spindle speeds, the 4MB limit, which might have made > > sense 10 years ago, probably doesn't make sense now. > > I still don't think that 4MB is enough on its own to suck quite that > much. Even a fast device should be perfectly happy with 4MB IOs, or it > must be sucking really badly. > > In order to see the kinds of problems that got quoted in the original > thread, there must be something else going on too, methinks (disk light > was "blinking"). <snip> > Are we perhaps ending up in a situation where we essentially wait > synchronously on just the inode itself being written out? That would > explain the "40kB/s" kind of behavior. > > If we were actually doing real 4MB chunks, that would _not_ explain 40kB/s > throughput. Yes, it is actually 400kB/s but still you're right that that seems too low if the only problem were seeks. I was looking at the code and it's an even bigger mess than I thought :(. a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode (when 1024 is passed in nr_to_write). Writeback code kind of expects that in WB_SYNC_ALL mode all dirty pages in the given range are written (the same way as write_cache_pages does that). b) because of delayed allocation, inode is redirtied during ->writepages call and thus writeback_single_inode calls redirty_tail at it. Thus each inode will be written at least twice (synchronously, which means a transaction commit and a disk cache flush for each such write).
c) the livelock avoidance in writeback_inodes_wb does not work because the function exits as soon as nr_to_write (set to 1024) gets to 0 and thus the 'start' timestamp gets always renewed. d) ext4_writepage never succeeds to write a page with delayed-allocated data. So pageout() function never succeeds in cleaning a page on ext4. I think that when other problems in writeback code make writeout slow (like in Jan Engelhardt's case), this can bite us and I assume this might be the reason why Jan saw kswapd active doing some work during his problems. > But if we do a 4MB chunk (for the one file that had real dirty data in > it), and then do a few hundred trivial "write out the inode data > _synchronously_" (due to access time changes etc) in between until we hit > the file that has real dirty data again - now _that_ would explain 40kB/s > throughput. It's not just seeking around - it's not even trying to push > multiple IO's to get any elevator going or anything like that. > > And then the patch that started this discussion makes sense: it improves > performance because in between those synchronous inode updates it now > writes big chunks. But again, it's mostly hiding us just doing insane > things. I'm not quite sure whether some of the above problems are really causing the sync writeback to be so slow. At least problems a) and c) would be worked-around by the patch setting nr_to_write to LONG_MAX for sync writeback and the effect of b) would be mitigated to just two synchronous inode writes instead of (dirty_pages + 8191) / 8192 + 1 sync writes. For c) I think the original patch is actually the right solution but I agree that it just hides the other problems... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
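[Editorial note: the arithmetic behind Jan's point b) can be spelled out with an illustrative helper - not a kernel function. With nr_to_write == 1024 bumped by ext4 to 8192 pages (32MB) per ->writepages round, every round redirties the inode and so costs one synchronous inode write:]

```c
#include <assert.h>

/* Synchronous inode writes per file during sync(1), per Jan's estimate:
 * one per 8192-page (32MB) ->writepages round, plus the final one. */
static long sync_inode_writes(long dirty_pages)
{
	return (dirty_pages + 8191) / 8192 + 1;
}
```

A 400MB dirty file (102400 pages) pays for 14 synchronous inode writes under the old behaviour, against the two that remain once nr_to_write is effectively unlimited for sync - and each saved write is a transaction commit plus a disk cache flush.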
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-22 17:29 ` Jan Kara @ 2010-02-22 21:01 ` tytso 2010-02-22 22:26 ` Jan Kara 2010-02-23 2:53 ` Dave Chinner 0 siblings, 2 replies; 39+ messages in thread From: tytso @ 2010-02-22 21:01 UTC (permalink / raw) To: Jan Kara Cc: Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote: > > a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode > (when 1024 is passed in nr_to_write). Writeback code kind of expects that > in WB_SYNC_ALL mode all dirty pages in the given range are written (the > same way as write_cache_pages does that). Well, we return after writing 128MB because of the magic s_max_writeback_mb_bump. The fact that nr_to_write limits the number of pages which are written is something which is intentional to the writeback code. I've disagreed with it, but I don't think it would be legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that what you are saying we should do? (If it is indeed legit to ignore nr_to_write, I would have done it a long time ago; I introduced s_max_writeback_mb_bump instead as a workaround to what I consider to be a serious misfeature in the writeback code.) > b) because of delayed allocation, inode is redirtied during ->writepages > call and thus writeback_single_inode calls redirty_tail at it. Thus each > inode will be written at least twice (synchronously, which means a > transaction commit and a disk cache flush for each such write). Hmm, does this happen with XFS, too? If not, I wonder how they handle it? And whether we need to push a solution into the generic layers. > d) ext4_writepage never succeeds to write a page with delayed-allocated > data. So pageout() function never succeeds in cleaning a page on ext4. 
> I think that when other problems in writeback code make writeout slow (like > in Jan Engelhardt's case), this can bite us and I assume this might be the > reason why Jan saw kswapd active doing some work during his problems. Yeah, I've noticed this. What it means is that if we have a massive memory pressure in a particular zone, pages which are subject to delayed allocation won't get written out by mm/vmscan.c. Anonymous pages will be written out to swap, and data pages which are re-written via random access mmap() (and so we know where they will be written on disk) will get written, and that's not a problem. So with relatively large zones, it happens, but most of the time I don't think it's a major problem. I am worried about this issue in certain configurations where pseudo NUMA zones have been created and are artificially really tiny (128MB) for container support, but that's not a standard upstream thing. This is done to avoid a lock inversion, and so this is an ext4-specific thing (at least I don't think XFS's delayed allocation has this misfeature). It would be interesting if we have documented evidence that this is easily triggered under normal situations. If so, we should look into figuring out how to fix this... - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-22 21:01 ` tytso @ 2010-02-22 22:26 ` Jan Kara 2010-02-23 2:53 ` Dave Chinner 1 sibling, 0 replies; 39+ messages in thread From: Jan Kara @ 2010-02-22 22:26 UTC (permalink / raw) To: tytso Cc: Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Mon 22-02-10 16:01:12, tytso@mit.edu wrote: > On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote: > > > > a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode > > (when 1024 is passed in nr_to_write). Writeback code kind of expects that > > in WB_SYNC_ALL mode all dirty pages in the given range are written (the > > same way as write_cache_pages does that). > > Well, we return after writing 128MB because of the magic I think it's really just 32 MB because writeback code passes nr_to_write 1024 -> ext4 bumps that to 8192 pages which is 32 MB. As I've understood 128 MB is just an absolute maximum over which we never bump. > s_max_writeback_mb_bump. The fact that nr_to_write limits the number > of pages which are written is something which is intentional to the > writeback code. I've disagreed with it, but I don't think it would be > legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that > what you are saying we should do? (If it is indeed legit to ignore The behavior of nr_to_write for WB_SYNC_ALL mode was always kind of fuzzy. Until Jens rewrote the writeback code, we never passed anything different than LONG_MAX (or similarly huge values) when doing WB_SYNC_ALL writeback. Also write_cache_pages which is used as writepages implementation for most filesystems (ext2, ext3, udf, ...) ignores nr_to_write for WB_SYNC_ALL writeback. So until recently we actually never had to decide whether write_cache_pages behavior is a bug or de-facto standard... 
Since it's hard to avoid livelocking problems when synchronous writeback would stop earlier than after writing the whole range, I'm inclined to standardize on ignoring the nr_to_write limit for synchronous writeback. > nr_to_write, I would have done it a long time ago; I introduced > s_max_writeback_mb_bump instead as a workaround to what I consider to > be a serious misfeature in the writeback code.) I agree that nr_to_write currently brings more problems than advantages. > > b) because of delayed allocation, inode is redirtied during ->writepages > > call and thus writeback_single_inode calls redirty_tail at it. Thus each > > inode will be written at least twice (synchronously, which means a > > transaction commit and a disk cache flush for each such write). > > Hmm, does this happen with XFS, too? If not, I wonder how they handle > it? And whether we need to push a solution into the generic layers. Yes, I think so. As soon as you mark the inode dirty during the writepages call, the dirty flag won't be cleared even though ->write_inode will be called after that. So the right fix would IMO be to move clearing of the I_DIRTY_SYNC and I_DIRTY_DATASYNC flags after the writepages call. Or we could even put some dirty bit handling inside of ->write_inode because sometimes it would be useful if a filesystem would know in its ->write_inode method what dirty bits were set... Another thing is that in particular in the ext[34] case, we would be much better off with submitting all the inode writes and then waiting just once. This is actually what should usually happen because sync calls asynchronous writeback first and only after it a synchronous one is called. But when the machine is under a constant IO load and the livelock avoidance code is invalidated by passing nr_to_write == 1024 to sync writeback, we end up doing a lot of sync IO and thus this problem becomes much more visible. > > d) ext4_writepage never succeeds to write a page with delayed-allocated > > data.
> > So pageout() function never succeeds in cleaning a page on ext4. > > I think that when other problems in writeback code make writeout slow (like > > in Jan Engelhardt's case), this can bite us and I assume this might be the > > reason why Jan saw kswapd active doing some work during his problems. > > Yeah, I've noticed this. What it means is that if we have a massive > memory pressure in a particular zone, pages which are subject to > delayed allocation won't get written out by mm/vmscan.c. Anonymous > pages will be written out to swap, and data pages which are re-written > via random access mmap() (and so we know where they will be written on > disk) will get written, and that's not a problem. So with relatively > large zones, it happens, but most of the time I don't think it's a > major problem. Yes, I also don't think it's a major problem in common scenarios. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
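[Editorial note: the payoff of Jan's "submit everything, then wait once" point can be shown with a toy cost model - invented numbers, not measurements:]

```c
#include <assert.h>

/* Toy model: total milliseconds to sync-write n inodes when submitting
 * each write costs submit_ms and each synchronous wait (transaction
 * commit + disk cache flush) costs flush_ms. */
static double sync_cost_ms(int n, double submit_ms, double flush_ms,
			   int batched)
{
	if (batched)			/* submit all, then wait once */
		return n * submit_ms + flush_ms;
	return n * (submit_ms + flush_ms);	/* wait after every inode */
}
```

For 100 inodes at 1ms per submit and 20ms per flush, the wait-per-inode variant costs 2100ms against 120ms batched - which is why falling out of the asynchronous pass and into per-inode synchronous IO hurts so much.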
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-22 21:01 ` tytso 2010-02-22 22:26 ` Jan Kara @ 2010-02-23 2:53 ` Dave Chinner 2010-02-23 3:23 ` tytso 1 sibling, 1 reply; 39+ messages in thread From: Dave Chinner @ 2010-02-23 2:53 UTC (permalink / raw) To: tytso, Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Mon, Feb 22, 2010 at 04:01:12PM -0500, tytso@mit.edu wrote: > On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote: > > > > a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode > > (when 1024 is passed in nr_to_write). Writeback code kind of expects that > > in WB_SYNC_ALL mode all dirty pages in the given range are written (the > > same way as write_cache_pages does that). > > Well, we return after writing 128MB because of the magic > s_max_writeback_mb_bump. The fact that nr_to_write limits the number > of pages which are written is something which is intentional to the > writeback code. I've disagreed with it, but I don't think it would be > legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that > what you are saying we should do? (If it is indeed legit to ignore > nr_to_write, I would have done it a long time ago; I introduced > s_max_writeback_mb_bump instead as a workaround to what I consider to > be a serious misfeature in the writeback code.) Ignoring nr_to_write completely can lead to issues like we used to have with XFS - it would write an entire extent (8GB) at a time and starve all other writeback. Those starvation problems - which were very obvious on NFS servers - went away when we trimmed back the amount to write in a single pass to saner amounts... > > b) because of delayed allocation, inode is redirtied during ->writepages > > call and thus writeback_single_inode calls redirty_tail at it. Thus each > > inode will be written at least twice (synchronously, which means a > > transaction commit and a disk cache flush for each such write). 
> > Hmm, does this happen with XFS, too? If not, I wonder how they handle > it? And whether we need to push a solution into the generic layers. Yes, it does - XFS dirties the inode during delayed allocation. In the worst case XFS writes back the inode twice, synchronously. However, XFS inodes carry their own dirty state and so the second flush is often a no-op because the inode is actually clean and not pinned in the log any more. The inode is only dirty the second time around if the file size was updated as a result of IO completion. FWIW, we've only just changed the XFS code to issue transactions for sync inode writes instead of writing the inode directly because of changes to the async inode writeback that avoid writing the inode over and over again while it still has dirty data on it (coming in 2.6.34). These changes mean we need to issue a transaction to capture any unlogged changes to the inode during sync writes, but we still avoid the second transaction if the XFS inode is cleaned by the first transaction. As to a generic solution, why do you think I've been advocating separate per-sb data sync and inode writeback methods that separate data writeback from inode writeback for so long? ;) > > d) ext4_writepage never succeeds to write a page with delayed-allocated > > data. So pageout() function never succeeds in cleaning a page on ext4. > > I think that when other problems in writeback code make writeout slow (like > > in Jan Engelhardt's case), this can bite us and I assume this might be the > > reason why Jan saw kswapd active doing some work during his problems. [...] > This is done to avoid a lock inversion, and so this is an > ext4-specific thing (at least I don't think XFS's delayed allocation > has this misfeature). Not that I know of, but then again I don't know what inversion ext4 is trying to avoid. Can you describe the inversion, Ted? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-23 2:53 ` Dave Chinner @ 2010-02-23 3:23 ` tytso 2010-02-23 5:53 ` Dave Chinner 0 siblings, 1 reply; 39+ messages in thread From: tytso @ 2010-02-23 3:23 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue, Feb 23, 2010 at 01:53:50PM +1100, Dave Chinner wrote: > > Ignoring nr_to_write completely can lead to issues like we used to > have with XFS - it would write an entire extent (8GB) at a time and > starve all other writeback. Those starvation problems - which were > very obvious on NFS servers - went away when we trimmed back the > amount to write in a single pass to saner amounts... How do you determine what a "sane amount" is? Is it something that is determined dynamically, or is it a hard-coded or manually tuned value? > As to a generic solution, why do you think I've been advocating > separate per-sb data sync and inode writeback methods that separate > data writeback from inode writeback for so long? ;) Heh. > > This is done to avoid a lock inversion, and so this is an > > ext4-specific thing (at least I don't think XFS's delayed allocation > > has this misfeature). > > Not that I know of, but then again I don't know what inversion ext4 > is trying to avoid. Can you describe the inversion, Ted? The locking order is journal_start_handle (starting a micro transaction in jbd) -> lock_page. A more detailed description of why this locking order is non-trivial for us to fix in ext4 can be found in the description of commit f0e6c985. Regards, - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-23 3:23 ` tytso @ 2010-02-23 5:53 ` Dave Chinner 2010-02-24 14:56 ` Jan Kara 0 siblings, 1 reply; 39+ messages in thread From: Dave Chinner @ 2010-02-23 5:53 UTC (permalink / raw) To: tytso, Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Mon, Feb 22, 2010 at 10:23:17PM -0500, tytso@mit.edu wrote: > On Tue, Feb 23, 2010 at 01:53:50PM +1100, Dave Chinner wrote: > > > > Ignoring nr_to_write completely can lead to issues like we used to > > have with XFS - it would write an entire extent (8GB) at a time and > > starve all other writeback. Those starvation problems - which were > > very obvious on NFS servers - went away when we trimmed back the > > amount to write in a single pass to saner amounts... > > How do you determine what a "sane amount" is? Is it something that is > determined dynamically, or is it a hard-coded or manually tuned value? "sane amount" was a hard coded mapping look-ahead value that limited the size of the allocation that was done. It got set to 64 pages, back in 2.6.16 (IIRC) so that we'd do at least a 256k allocation on x86 or 1MB on ia64 (which had 16k page size at the time). It was previously unbound, which is why XFS was mapping entire extents... Hence ->writepage() on XFS will write 64 pages at a time if there are contiguous pages in the page cache, and then go back up to write_cache_pages() for the loop to either terminate due to nr_to_write <= 0 or call ->writepage on the next dirty page not under writeback on the inode. This behaviour pretty much defines the _minimum IO size_ we'd issue if there are contiguous dirty pages in the page cache. It was (and still is) primarily to work around the deficiencies of the random single page writeback that the VM's memory reclaim algorithms are so fond of (another of my pet peeves).... 
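[Editorial note: Dave's figures follow directly from the 64-page look-ahead - trivial arithmetic, shown only to make the minimum-IO claim concrete:]

```c
#include <assert.h>

/* Minimum contiguous IO issued by XFS's ->writepage look-ahead when
 * there are contiguous dirty pages: 64 pages, whatever the page size. */
static long min_io_bytes(long page_size)
{
	return 64 * page_size;
}
```

64 x 4k pages gives the 256k x86 figure and 64 x 16k pages the 1MB ia64 figure Dave quotes.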
> > As to a generic solution, why do you think I've been advocating > > separate per-sb data sync and inode writeback methods that separate > > data writeback from inode writeback for so long? ;) > > Heh. > > > > This is done to avoid a lock inversion, and so this is an > > > ext4-specific thing (at least I don't think XFS's delayed allocation > > > has this misfeature). > > > > Not that I know of, but then again I don't know what inversion ext4 > > is trying to avoid. Can you describe the inversion, Ted? > > The locking order is journal_start_handle (starting a micro > transaction in jbd) -> lock_page. A more detailed description of why > this locking order is non-trivial for us to fix in ext4 can be found > in the description of commit f0e6c985. Nasty - you need to start a transaction before you lock pages for writeback and allocation, but ->writepage hands you a locked page. And you can't use an existing transaction handle open because you can't guarantee that you have journal credits reserved for the allocation? IIUC, ext3/4 has this problem due to the ordered data writeback constraints, right? Regardless of the reason for the ext4 behaviour, you are right about XFS - it doesn't have this particular problem with delalloc because it doesn't have any transaction/page lock ordering constraints in the transaction subsystem. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-23 5:53 ` Dave Chinner @ 2010-02-24 14:56 ` Jan Kara 0 siblings, 0 replies; 39+ messages in thread From: Jan Kara @ 2010-02-24 14:56 UTC (permalink / raw) To: Dave Chinner Cc: tytso, Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue 23-02-10 16:53:35, Dave Chinner wrote: > > > > This is done to avoid a lock inversion, and so this is an > > > > ext4-specific thing (at least I don't think XFS's delayed allocation > > > > has this misfeature). > > > > > > Not that I know of, but then again I don't know what inversion ext4 > > > is trying to avoid. Can you describe the inversion, Ted? > > > > The locking order is journal_start_handle (starting a micro > > transaction in jbd) -> lock_page. A more detailed description of why > > this locking order is non-trivial for us to fix in ext4 can be found > > in the description of commit f0e6c985. > > Nasty - you need to start a transaction before you lock pages for > writeback and allocation, but ->writepage hands you a locked page. > And you can't use an existing transaction handle open because you > can't guarantee that you have journal credits reserved for the > allocation? Exactly. > IIUC, ext3/4 has this problem due to the ordered data writeback > constraints, right? Not quite. I don't know how XFS solves this but in ext3/4 starting a transaction can block (waiting for journal space) until all users of a previous transaction are done with it and we can commit it. Thus the transaction start / stop behave just like an ordinary lock. Because you need a transaction started when writing a page (for metadata updates) there is some lock ordering forced between a page lock and a transaction start / stop. Ext4 chose it to be transaction -> page lock (which makes writepages more efficient and writepage hard), ext3 has page lock -> transaction (so it has working ->writepage).
Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2011-01-21 15:09 UTC | newest] Thread overview: 39+ messages 2010-02-12 9:16 [PATCH] writeback: Fix broken sync writeback Jens Axboe 2010-02-12 15:45 ` Linus Torvalds 2010-02-13 12:58 ` Jan Engelhardt 2010-02-15 14:49 ` Jan Kara 2010-02-15 15:41 ` Jan Engelhardt 2010-02-15 15:58 ` Jan Kara 2010-06-27 16:44 ` Jan Engelhardt 2010-10-24 23:41 ` Sync writeback still broken Jan Engelhardt 2010-10-30 0:57 ` Linus Torvalds 2010-10-30 1:16 ` Linus Torvalds 2010-10-30 1:30 ` Linus Torvalds 2010-10-30 3:18 ` Andrew Morton 2010-10-30 13:15 ` Christoph Hellwig 2010-10-31 12:24 ` Jan Kara 2010-10-31 22:40 ` Jan Kara 2010-11-05 21:33 ` Jan Kara 2010-11-05 21:34 ` Jan Kara 2010-11-05 21:41 ` Linus Torvalds 2010-11-05 22:03 ` Jan Engelhardt 2010-11-07 12:57 ` Jan Kara 2011-01-20 22:50 ` Jan Engelhardt 2011-01-21 15:09 ` Jan Kara 2010-02-15 14:17 ` [PATCH] writeback: Fix broken sync writeback Jan Kara 2010-02-16 0:05 ` Linus Torvalds 2010-02-16 23:00 ` Jan Kara 2010-02-16 23:34 ` Linus Torvalds 2010-02-17 0:01 ` Linus Torvalds 2010-02-17 1:33 ` Jan Kara 2010-02-17 1:57 ` Dave Chinner 2010-02-17 3:35 ` Linus Torvalds 2010-02-17 4:30 ` tytso 2010-02-17 5:16 ` Linus Torvalds 2010-02-22 17:29 ` Jan Kara 2010-02-22 21:01 ` tytso 2010-02-22 22:26 ` Jan Kara 2010-02-23 2:53 ` Dave Chinner 2010-02-23 3:23 ` tytso 2010-02-23 5:53 ` Dave Chinner 2010-02-24 14:56 ` Jan Kara