* [PATCH] writeback: Fix broken sync writeback @ 2010-02-12 9:16 Jens Axboe 2010-02-12 15:45 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jens Axboe @ 2010-02-12 9:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel, jack, jengelh, stable, gregkh [-- Attachment #1: Type: text/plain, Size: 2390 bytes --] Hi, There's currently a writeback bug in the 2.6.32 and 2.6.33-rc kernels that prevents proper writeback when sync(1) is being run. Instead of flushing everything older than the sync run, it will do chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and over. This results in very suboptimal writeback for many files; see the initial report from Jan Engelhardt: http://lkml.org/lkml/2010/1/22/382 This fixes it by using the passed-in page writeback count, instead of doing MAX_WRITEBACK_PAGES batches, which gets us much better performance (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) finish properly even when new pages are being dirtied. Thanks to Jan Kara <jack@suse.cz> for spotting this problem! Cc: stable@kernel.org Reported-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> --- I'm sending this out as a patch on the list since I'll be going away skiing for a week very shortly; a mail+patch is easier to discuss here. Greg, I'm attaching a 2.6.32-based patch as well, since this one will not apply to 2.6.32.
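Before the diff, the effect of the change can be illustrated with a minimal userspace sketch (hypothetical model, not kernel code; names and the batch constant are simplified): with the patch, a WB_SYNC_ALL request consumes the whole requested page count in a single pass instead of restarting once per MAX_WRITEBACK_PAGES batch.

```c
/*
 * Minimal userspace model of the wb_writeback() batching change below.
 * Illustrative only -- not kernel code.  Before the patch every pass is
 * capped at MAX_WRITEBACK_PAGES; with it, WB_SYNC_ALL gets the full
 * requested page count in one go.
 */
#include <assert.h>

#define MAX_WRITEBACK_PAGES 1024L
#define WB_SYNC_NONE 0
#define WB_SYNC_ALL  1

/* How many loop passes are needed to write nr_pages, assuming each
 * batch is written in full? */
long writeback_passes(long nr_pages, int sync_mode)
{
	long passes = 0;

	while (nr_pages > 0) {
		long to_write = 0;

		/* mirrors the patched batch-size selection */
		if (sync_mode == WB_SYNC_ALL)
			to_write = nr_pages;
		if (!to_write)
			to_write = MAX_WRITEBACK_PAGES;

		nr_pages -= to_write;	/* batch assumed fully written */
		passes++;
	}
	return passes;
}
```

With a 16384-page request, the unpatched behaviour takes 16 restarts while the patched WB_SYNC_ALL path finishes in one.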
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a7c42c..55aeea9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -749,6 +749,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	}

 	for (;;) {
+		long to_write = 0;
+
 		/*
 		 * Stop writeback when nr_pages has been consumed
 		 */
@@ -762,12 +764,17 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (args->for_background && !over_bground_thresh())
 			break;

+		if (args->sync_mode == WB_SYNC_ALL)
+			to_write = args->nr_pages;
+		if (!to_write)
+			to_write = MAX_WRITEBACK_PAGES;
+
 		wbc.more_io = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = to_write;
 		wbc.pages_skipped = 0;
 		writeback_inodes_wb(wb, &wbc);
-		args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		args->nr_pages -= to_write - wbc.nr_to_write;
+		wrote += to_write - wbc.nr_to_write;

 		/*
 		 * If we consumed everything, see if we have more
@@ -782,7 +789,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < to_write)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to

--
Jens Axboe

[-- Attachment #2: writeback-fix-broken-sync-2.6.32.patch --]
[-- Type: text/x-diff, Size: 2409 bytes --]

commit 057226ca7447880e4e766a82cf32197e492ba963
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Fri Feb 12 10:14:34 2010 +0100

    writeback: Fix broken sync writeback

    There's currently a writeback bug in the 2.6.32 and 2.6.33-rc kernels
    that prevents proper writeback when sync(1) is being run. Instead of
    flushing everything older than the sync run, it will do chunks of
    normal MAX_WRITEBACK_PAGES writeback and restart over and over.

    This results in very suboptimal writeback for many files; see the
    initial report from Jan Engelhardt:

    http://lkml.org/lkml/2010/1/22/382

    This fixes it by using the passed-in page writeback count, instead of
    doing MAX_WRITEBACK_PAGES batches, which gets us much better
    performance (Jan reports it's up from ~400KB/sec to 10MB/sec) and
    makes sync(1) finish properly even when new pages are being dirtied.

    Thanks to Jan Kara <jack@suse.cz> for spotting this problem!

    Cc: stable@kernel.org
    Reported-by: Jan Engelhardt <jengelh@medozas.de>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9d5360c..8a46c67 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -773,6 +773,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	}

 	for (;;) {
+		long to_write = 0;
+
 		/*
 		 * Stop writeback when nr_pages has been consumed
 		 */
@@ -786,13 +788,18 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (args->for_background && !over_bground_thresh())
 			break;

+		if (args->sync_mode == WB_SYNC_ALL)
+			to_write = args->nr_pages;
+		if (!to_write)
+			to_write = MAX_WRITEBACK_PAGES;
+
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = to_write;
 		wbc.pages_skipped = 0;
 		writeback_inodes_wb(wb, &wbc);
-		args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		args->nr_pages -= to_write - wbc.nr_to_write;
+		wrote += to_write - wbc.nr_to_write;

 		/*
 		 * If we consumed everything, see if we have more
@@ -807,7 +814,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < to_write)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to

^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-12 9:16 [PATCH] writeback: Fix broken sync writeback Jens Axboe @ 2010-02-12 15:45 ` Linus Torvalds 2010-02-13 12:58 ` Jan Engelhardt 2010-02-15 14:17 ` [PATCH] writeback: Fix broken sync writeback Jan Kara 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2010-02-12 15:45 UTC (permalink / raw) To: Jens Axboe; +Cc: Linux Kernel, jack, jengelh, stable, gregkh

On Fri, 12 Feb 2010, Jens Axboe wrote:
>
> This fixes it by using the passed in page writeback count, instead of
> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> finish properly even when new pages are being dirted.

This seems broken.

The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't generate a single IO bigger than a couple of MB. And then we have that loop around things to do multiple chunks. Your change to use nr_pages seems to make the whole looping totally pointless, and breaks that "don't do huge hunks" logic.

So I don't think that your patch is correct.

That said, I _do_ believe you when you say it makes a difference, which makes me think there is a bug there. I just don't think you fixed the right bug, and your change just happens to do what you wanted by pure random luck.

The _real_ bug seems to be the one you mentioned, but then ignored:

> Instead of flushing everything older than the sync run, it will do
> chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
> over.

and it would seem that the _logical_ way to fix it would be something like the appended... Hmm?

Even when you do a 'fdatasync()', it's not going to guarantee that any _future_ data is written back, so the 'oldest_jif' thing would seem to be sane regardless of sync mode.

NOTE NOTE NOTE!
I do have to admit that this patch scares me, because there could be some bug in the 'older_than_this' logic that means that somebody sets it even if the inode is already dirty. So this patch makes conceptual sense to me, and I think it's the right thing to do, but I also suspect that we do not actually have a lot of test coverage of the whole 'older_than_this' logic, because it historically has been just a small optimization for kupdated.

So this patch scares me, as it could break 'fdatasync' entirely. So somebody should really double-check the whole 'dirtied_when' logic, just to be safe. If anybody ever sets 'dirtied_when' to the current time even if the inode is already dirty (and has an earlier 'dirtied_when'), then that would open 'fdatasync()' and friends up to not writing things back properly at all (because a newer write set 'dirtied_when' so that old writes get ignored and thought to be 'totally new').

Comments?

		Linus

---
 fs/fs-writeback.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a7c42c..a0a8424 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -738,11 +738,16 @@ static long wb_writeback(struct bdi_writeback *wb,
 	long wrote = 0;
 	struct inode *inode;

-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
+	/*
+	 * We never write back data that was dirtied _after_ we
+	 * started writeback. But kupdate doesn't even want to
+	 * write back recently dirtied stuff, only older data.
+	 */
+	oldest_jif = jiffies-1;
+	wbc.older_than_this = &oldest_jif;
+	if (wbc.for_kupdate)
+		oldest_jif -= msecs_to_jiffies(dirty_expire_interval * 10);
+
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;

^ permalink raw reply related [flat|nested] 39+ messages in thread
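The cutoff the patch above installs can be modelled with a tiny userspace sketch (illustrative names and simplified semantics; the real kernel compares inode->dirtied_when against *wbc->older_than_this with its wrap-safe time_after() macro):

```c
/*
 * Userspace sketch of the 'older_than_this' cutoff in the patch above.
 * Simplified model: an inode is eligible for this writeback pass only
 * if it was dirtied at or before the cutoff, so data dirtied after
 * writeback started can no longer keep a sync spinning forever.
 */
#include <assert.h>

typedef unsigned long jiffies_t;

/* wrap-safe "a is after b", same definition as the kernel's time_after() */
int toy_time_after(jiffies_t a, jiffies_t b)
{
	return (long)(b - a) < 0;
}

int inode_eligible(jiffies_t dirtied_when, jiffies_t oldest_jif)
{
	/* skip inodes dirtied after the cutoff */
	return !toy_time_after(dirtied_when, oldest_jif);
}
```

With oldest_jif set to "one tick before writeback started", anything dirtied from the start of the pass onward is skipped, which is exactly the livelock-avoidance property being discussed.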
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-12 15:45 ` Linus Torvalds @ 2010-02-13 12:58 ` Jan Engelhardt 2010-02-15 14:49 ` Jan Kara 2010-02-15 14:17 ` [PATCH] writeback: Fix broken sync writeback Jan Kara 1 sibling, 1 reply; 39+ messages in thread From: Jan Engelhardt @ 2010-02-13 12:58 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jens Axboe, Linux Kernel, jack, stable, gregkh On Friday 2010-02-12 16:45, Linus Torvalds wrote: >On Fri, 12 Feb 2010, Jens Axboe wrote: >> >> This fixes it by using the passed in page writeback count, instead of >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) >> finish properly even when new pages are being dirted. > >This seems broken. It seems so. Jens, Jan Kara, your patch does not entirely fix this. While there is no sync/fsync to be seen in these traces, I can tell there's a livelock, without Dirty decreasing at all. INFO: task flush-8:0:354 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
flush-8:0     D 00000000005691a0  5560   354      2 0x28000000000
Call Trace:
 [0000000000568de4] start_this_handle+0x36c/0x520
 [00000000005691a0] jbd2_journal_start+0xb4/0xe0
 [000000000054f99c] ext4_journal_start_sb+0x54/0x9c
 [0000000000540de0] ext4_da_writepages+0x1e0/0x460
 [00000000004ab3e4] do_writepages+0x28/0x4c
 [00000000004f825c] writeback_single_inode+0xf0/0x330
 [00000000004f90a8] writeback_inodes_wb+0x4e4/0x600
 [00000000004f9354] wb_writeback+0x190/0x20c
 [00000000004f967c] wb_do_writeback+0x1d4/0x1f0
 [00000000004f96c0] bdi_writeback_task+0x28/0xa0
 [00000000004b71dc] bdi_start_fn+0x64/0xc8
 [00000000004786f0] kthread+0x58/0x6c
 [000000000042ae7c] kernel_thread+0x30/0x48
 [0000000000478644] kthreadd+0xb8/0x10c
1 lock held by flush-8:0/354:
 #0:  (&type->s_umount_key#18){......}, at: [<00000000004f8fc8>]
#               writeback_inodes_wb+0x404/0x600
INFO: task jbd2/sda6-8:588 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda6-8   D 000000000056eea0  7272   588      2 0x18000000000
Call Trace:
 [000000000056976c] jbd2_journal_commit_transaction+0x218/0x15e0
 [000000000056eea0] kjournald2+0x138/0x2fc
 [00000000004786f0] kthread+0x58/0x6c
 [000000000042ae7c] kernel_thread+0x30/0x48
 [0000000000478644] kthreadd+0xb8/0x10c
no locks held by jbd2/sda6-8/588.
INFO: task rpmbuild:6580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rpmbuild      D 00000000005691a0    16  6580   6562 0x310061101000080
Call Trace:
 [0000000000568de4] start_this_handle+0x36c/0x520
 [00000000005691a0] jbd2_journal_start+0xb4/0xe0
 [000000000054f99c] ext4_journal_start_sb+0x54/0x9c
 [000000000053dcd4] ext4_dirty_inode+0x8/0x3c
 [00000000004f8888] __mark_inode_dirty+0x20/0x15c
 [00000000004eeb34] file_update_time+0x104/0x124
 [00000000004a50d4] __generic_file_aio_write+0x264/0x36c
 [00000000004a5228] generic_file_aio_write+0x4c/0xa4
 [00000000005390c0] ext4_file_write+0x94/0xa4
 [00000000004dc12c] do_sync_write+0x84/0xd4
 [00000000004dca78] vfs_write+0x70/0x12c
 [00000000004dcbcc] SyS_write+0x2c/0x5c
 [0000000000406214] linux_sparc_syscall32+0x34/0x40
1 lock held by rpmbuild/6580:
 #0:  (&sb->s_type->i_mutex_key#9){......}, at: [<00000000004a5214>]
#               generic_file_aio_write+0x38/0xa4
etc.

>
>The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't
>generate a single IO bigger than a couple of MB. And then we have that
>loop around things to do multiple chunks. Your change to use nr_pages
>seems to make the whole looping totally pointless, and breaks that "don't
>do huge hunks" logic.
>
>So I don't think that your patch is correct.
>
>That said, I _do_ believe you when you say it makes a difference, which
>makes me think there is a bug there. I just don't think you fixed the
>right bug, and your change just happens to do what you wanted by pure
>random luck.
>
>The _real_ bug seems to bethe one you mentioned, but then ignored:
>
>> Instead of flushing everything older than the sync run, it will do
>> chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
>> over.
>
>and it would seem that the _logical_ way to fix it would be something like
>the appended...
>
>Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that
>any _future_ data is written back, so the 'oldest_jif' thing would seem to
>be sane regardless of sync mode.
>
> NOTE NOTE NOTE!
> I do have to admit that this patch scares me, because
> there could be some bug in the 'older_than_this' logic that means that
> somebody sets it even if the inode is already dirty. So this patch makes
> conceptual sense to me, and I think it's the right thing to do, but I
> also suspect that we do not actually have a lot of test coverage of the
> whole 'older_than_this' logic, because it historically has been just a
> small optimization for kupdated.
>
> So this patch scares me, as it could break 'fdatasync' entirely. So
> somebody should really double-check the whole 'dirtied_when' logic, just
> to be safe. If anybody ever sets 'dirtied_when' to the current time even
> if the inode is already dirty (and has an earlier dirtied_when'), then
> that would open up 'fdatasync()' and friends up to not writing things
> back properly at all (because a newer write set 'dirtied_when' so that
> old writes get ignored and thought to be 'totally new')
>
>Comments?
>
> Linus
>
>---
> fs/fs-writeback.c | 15 ++++++++++-----
> 1 files changed, 10 insertions(+), 5 deletions(-)
>
>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>index 1a7c42c..a0a8424 100644
>--- a/fs/fs-writeback.c
>+++ b/fs/fs-writeback.c
>@@ -738,11 +738,16 @@ static long wb_writeback(struct bdi_writeback *wb,
> long wrote = 0;
> struct inode *inode;
>
>- if (wbc.for_kupdate) {
>- wbc.older_than_this = &oldest_jif;
>- oldest_jif = jiffies -
>- msecs_to_jiffies(dirty_expire_interval * 10);
>- }
>+ /*
>+ * We never write back data that was dirtied _after_ we
>+ * started writeback. But kupdate doesn't even want to
>+ * write back recently dirtied stuff, only older data.
>+ */
>+ oldest_jif = jiffies-1;
>+ wbc.older_than_this = &oldest_jif;
>+ if (wbc.for_kupdate)
>+ oldest_jif -= msecs_to_jiffies(dirty_expire_interval * 10);
>+
> if (!wbc.range_cyclic) {
> wbc.range_start = 0;
> wbc.range_end = LLONG_MAX;
>
 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-13 12:58 ` Jan Engelhardt @ 2010-02-15 14:49 ` Jan Kara 2010-02-15 15:41 ` Jan Engelhardt 0 siblings, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-02-15 14:49 UTC (permalink / raw) To: Jan Engelhardt Cc: Linus Torvalds, Jens Axboe, Linux Kernel, jack, stable, gregkh On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > > On Friday 2010-02-12 16:45, Linus Torvalds wrote: > >On Fri, 12 Feb 2010, Jens Axboe wrote: > >> > >> This fixes it by using the passed in page writeback count, instead of > >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > >> finish properly even when new pages are being dirted. > > > >This seems broken. > > It seems so. Jens, Jan Kara, your patch does not entirely fix this. > While there is no sync/fsync to be seen in these traces, I can > tell there's a livelock, without Dirty decreasing at all. I don't think this is directly connected with my / Jens' patch. Similar traces happen even without the patch (see e.g. http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch makes it worse... So are you able to reproduce these warnings and without the patch they did not happen? Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0? > INFO: task flush-8:0:354 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> flush-8:0 D 00000000005691a0 5560 354 2 0x28000000000 > Call Trace: > [0000000000568de4] start_this_handle+0x36c/0x520 > [00000000005691a0] jbd2_journal_start+0xb4/0xe0 > [000000000054f99c] ext4_journal_start_sb+0x54/0x9c > [0000000000540de0] ext4_da_writepages+0x1e0/0x460 > [00000000004ab3e4] do_writepages+0x28/0x4c > [00000000004f825c] writeback_single_inode+0xf0/0x330 > [00000000004f90a8] writeback_inodes_wb+0x4e4/0x600 > [00000000004f9354] wb_writeback+0x190/0x20c > [00000000004f967c] wb_do_writeback+0x1d4/0x1f0 > [00000000004f96c0] bdi_writeback_task+0x28/0xa0 > [00000000004b71dc] bdi_start_fn+0x64/0xc8 > [00000000004786f0] kthread+0x58/0x6c > [000000000042ae7c] kernel_thread+0x30/0x48 > [0000000000478644] kthreadd+0xb8/0x10c > 1 lock held by flush-8:0/354: > #0: (&type->s_umount_key#18){......}, at: [<00000000004f8fc8>] > # writeback_inodes_wb+0x404/0x600 > INFO: task jbd2/sda6-8:588 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > jbd2/sda6-8 D 000000000056eea0 7272 588 2 0x18000000000 > Call Trace: > [000000000056976c] jbd2_journal_commit_transaction+0x218/0x15e0 > [000000000056eea0] kjournald2+0x138/0x2fc > [00000000004786f0] kthread+0x58/0x6c > [000000000042ae7c] kernel_thread+0x30/0x48 > [0000000000478644] kthreadd+0xb8/0x10c > no locks held by jbd2/sda6-8/588. > INFO: task rpmbuild:6580 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> rpmbuild D 00000000005691a0 16 6580 6562 0x310061101000080 > Call Trace: > [0000000000568de4] start_this_handle+0x36c/0x520 > [00000000005691a0] jbd2_journal_start+0xb4/0xe0 > [000000000054f99c] ext4_journal_start_sb+0x54/0x9c > [000000000053dcd4] ext4_dirty_inode+0x8/0x3c > [00000000004f8888] __mark_inode_dirty+0x20/0x15c > [00000000004eeb34] file_update_time+0x104/0x124 > [00000000004a50d4] __generic_file_aio_write+0x264/0x36c > [00000000004a5228] generic_file_aio_write+0x4c/0xa4 > [00000000005390c0] ext4_file_write+0x94/0xa4 > [00000000004dc12c] do_sync_write+0x84/0xd4 > [00000000004dca78] vfs_write+0x70/0x12c > [00000000004dcbcc] SyS_write+0x2c/0x5c > [0000000000406214] linux_sparc_syscall32+0x34/0x40 > 1 lock held by rpmbuild/6580: > #0: (&sb->s_type->i_mutex_key#9){......}, at: [<00000000004a5214>] > # generic_file_aio_write+0x38/0xa4 > etc. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-15 14:49 ` Jan Kara @ 2010-02-15 15:41 ` Jan Engelhardt 2010-02-15 15:58 ` Jan Kara 2010-06-27 16:44 ` Jan Engelhardt 0 siblings, 2 replies; 39+ messages in thread From: Jan Engelhardt @ 2010-02-15 15:41 UTC (permalink / raw) To: Jan Kara; +Cc: Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh

On Monday 2010-02-15 15:49, Jan Kara wrote:
>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
>> >> This fixes it by using the passed in page writeback count, instead of
>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
>> >> finish properly even when new pages are being dirted.
>> >
>> >This seems broken.
>>
>> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
>> While there is no sync/fsync to be seen in these traces, I can
>> tell there's a livelock, without Dirty decreasing at all.
>
> I don't think this is directly connected with my / Jens' patch.

I start to think so too.

>Similar traces happen even without the patch (see e.g.
>http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch
>makes it worse... So are you able to reproduce these warnings and
>without the patch they did not happen?

Your patch speeds up the slow sync; without the patch, there was no real chance to observe the hard lockup, as the slow sync would take up all the time.

So far, no reproduction. It seems to be just as you say.

> Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0?

0000000000569554 <jbd2_journal_commit_transaction>:
  56976c:	40 04 ee 62	call  6a50f4 <schedule>

Since there is an obvious schedule() call in jbd2_journal_commit_transaction's C code, I think that's where it is.

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-15 15:41 ` Jan Engelhardt @ 2010-02-15 15:58 ` Jan Kara 2010-06-27 16:44 ` Jan Engelhardt 1 sibling, 0 replies; 39+ messages in thread From: Jan Kara @ 2010-02-15 15:58 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Mon 15-02-10 16:41:17, Jan Engelhardt wrote: > > On Monday 2010-02-15 15:49, Jan Kara wrote: > >On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > >> >> > >> >> This fixes it by using the passed in page writeback count, instead of > >> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > >> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > >> >> finish properly even when new pages are being dirted. > > >> It seems so. Jens, Jan Kara, your patch does not entirely fix this. > >> While there is no sync/fsync to be seen in these traces, I can > >> tell there's a livelock, without Dirty decreasing at all. > > > > I don't think this is directly connected with my / Jens' patch. > > I start to think so too. > > >Similar traces happen even without the patch (see e.g. > >http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch > >makes it worse... So are you able to reproduce these warnings and > >without the patch they did not happen? > > Your patch speeds up the slow sync; without the patch, there was > no real chance to observ the hard lockup, as the slow sync would > take up all time. > > So far, no reproduction. It seems to be just as you say. > > > Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0? > > 0000000000569554 <jbd2_journal_commit_transaction>: > 56976c: 40 04 ee 62 call 6a50f4 <schedule> > > Since there is an obvious schedule() call in jbd2_journal_commit_transaction's > C code, I think that's where it is. OK. Thanks. 
It seems some process is spending excessive time with a transaction open (jbd2_journal_commit_transaction waits for all handles of a transaction to be dropped). If you see the traces again, try to obtain stack traces of all the other processes and maybe we can catch the process and see whether it's doing something unexpected. The patch can have an influence on this because we now pass larger nr_to_write to ext4_writepages so maybe that makes some corner case more likely. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-15 15:41 ` Jan Engelhardt 2010-02-15 15:58 ` Jan Kara @ 2010-06-27 16:44 ` Jan Engelhardt 2010-10-24 23:41 ` Sync writeback still broken Jan Engelhardt 1 sibling, 1 reply; 39+ messages in thread From: Jan Engelhardt @ 2010-06-27 16:44 UTC (permalink / raw) To: Jan Kara; +Cc: Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh Hi Jan Kara, On Monday 2010-02-15 16:41, Jan Engelhardt wrote: >On Monday 2010-02-15 15:49, Jan Kara wrote: >>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: >>> >> >>> >> This fixes it by using the passed in page writeback count, instead of >>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance >>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) >>> >> finish properly even when new pages are being dirted. >>> > >>> >This seems broken. >>> >>> It seems so. Jens, Jan Kara, your patch does not entirely fix this. >>> While there is no sync/fsync to be seen in these traces, I can >>> tell there's a livelock, without Dirty decreasing at all. What ultimately became of the discussion and/or the patch? Your original ad-hoc patch certainly still does its job; had no need to reboot in 86 days and still counting. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Sync writeback still broken 2010-06-27 16:44 ` Jan Engelhardt @ 2010-10-24 23:41 ` Jan Engelhardt 2010-10-30 0:57 ` Linus Torvalds 2010-10-31 12:24 ` Jan Kara 0 siblings, 2 replies; 39+ messages in thread From: Jan Engelhardt @ 2010-10-24 23:41 UTC (permalink / raw) To: Jan Kara, Andrew Morton Cc: Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh

On Sunday 2010-06-27 18:44, Jan Engelhardt wrote:
>On Monday 2010-02-15 16:41, Jan Engelhardt wrote:
>>On Monday 2010-02-15 15:49, Jan Kara wrote:
>>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
>>>> >>
>>>> >> This fixes it by using the passed in page writeback count, instead of
>>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
>>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
>>>> >> finish properly even when new pages are being dirted.
>>>> >
>>>> >This seems broken.
>>>>
>>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
>>>> While there is no sync/fsync to be seen in these traces, I can
>>>> tell there's a livelock, without Dirty decreasing at all.
>
>What ultimately became of the discussion and/or the patch?
>
>Your original ad-hoc patch certainly still does its job; had no need to
>reboot in 86 days and still counting.

I still observe this behavior on 2.6.36-rc8. This is starting to get frustrating, so I will be happily following akpm's advice to poke people.

Thread entrypoint: http://lkml.org/lkml/2010/2/12/41

Previously, many concurrent extractions of tarballs and so on have been one way to trigger the issue; I now also have a rather small testcase (below) that freezes the box here (which has 24G RAM, so even though I am not calling msync, I should be fine) sometime after memset finishes.
----
/* calculate all possible 32-bit hashes
   needs 16G of address space, so better have a 64-bit kernel at hand
*/
#define _GNU_SOURCE 1
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#define S_IWUGO (S_IWUSR | S_IWGRP | S_IWOTH)

#define jrot(x,k) (((x) << (k)) | ((x) >> (32 - (k))))

/* jhash_mix - mix 3 32-bit values reversibly. */
#define jhash_mix(a, b, c) { \
	a -= c; a ^= jrot(c, 4);  c += b; \
	b -= a; b ^= jrot(a, 6);  a += c; \
	c -= b; c ^= jrot(b, 8);  b += a; \
	a -= c; a ^= jrot(c, 16); c += b; \
	b -= a; b ^= jrot(a, 19); a += c; \
	c -= b; c ^= jrot(b, 4);  b += a; \
}

#define jhash_final(a, b, c) { \
	c ^= b; c -= jrot(b, 14); \
	a ^= c; a -= jrot(c, 11); \
	b ^= a; b -= jrot(a, 25); \
	c ^= b; c -= jrot(b, 16); \
	a ^= c; a -= jrot(c, 4);  \
	b ^= a; b -= jrot(a, 14); \
	c ^= b; c -= jrot(b, 24); \
}

static uint32_t hash_jlookup3(const void *vkey, size_t length)
{
	static const unsigned int JHASH_GOLDEN_RATIO = 0x9e3779b9;
	const uint8_t *key = vkey;
	uint32_t a, b, c;

	a = b = c = JHASH_GOLDEN_RATIO + length;
	/* All but the last block: affect some 32 bits of (a,b,c) */
	for (; length > 12; length -= 12, key += 12) {
		a += key[0] + ((uint32_t)key[1] << 8) +
		     ((uint32_t)key[2] << 16) + ((uint32_t)key[3] << 24);
		b += key[4] + ((uint32_t)key[5] << 8) +
		     ((uint32_t)key[6] << 16) + ((uint32_t)key[7] << 24);
		c += key[8] + ((uint32_t)key[9] << 8) +
		     ((uint32_t)key[10] << 16) + ((uint32_t)key[11] << 24);
		jhash_mix(a, b, c);
	}

	switch (length) {
	case 12: c += ((uint32_t)key[11]) << 24;
	case 11: c += ((uint32_t)key[10]) << 16;
	case 10: c += ((uint32_t)key[9]) << 8;
	case 9:  c += key[8];
	case 8:  b += ((uint32_t)key[7]) << 24;
	case 7:  b += ((uint32_t)key[6]) << 16;
	case 6:  b += ((uint32_t)key[5]) << 8;
	case 5:  b += key[4];
	case 4:  a += ((uint32_t)key[3]) << 24;
	case 3:  a += ((uint32_t)key[2]) << 16;
	case 2:  a += ((uint32_t)key[1]) << 8;
	case 1:  a += key[0];
		break;
	case 0:  return c;
	}

	jhash_final(a,b,c);
	return c;
}

static uint32_t *freq;
static const unsigned long long freq_size = 0x100000000UL * sizeof(*freq);

static void map_freq(void)
{
	int fd;

	fd = open("jenkins3.frq", O_RDWR | O_CREAT, S_IRUGO | S_IWUGO);
	if (fd < 0) {
		perror("open");
		abort();
	}
	if (ftruncate(fd, freq_size) < 0) {
		perror("ftruncate");
		abort();
	}
	freq = mmap(NULL, freq_size, PROT_READ | PROT_WRITE,
	            MAP_SHARED, fd, 0);
	if (freq == MAP_FAILED) {
		perror("mmap");
		abort();
	}
}

static inline void calc_all_hashes(void)
{
	uint32_t x, y;

	memset(freq, 0, freq_size);
	for (x = 0; x < UINT32_MAX; ++x) {
		if ((x & 0xFFFFF) == 0)
			fprintf(stderr, "\r\e[2K""fill: %08x", x);
		y = hash_jlookup3(&x, sizeof(x));
		if (freq[y] < UINT32_MAX)
			++freq[y];
	}
}

int main(void)
{
	map_freq();
	calc_all_hashes();
	return 0;
}

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-24 23:41 ` Sync writeback still broken Jan Engelhardt @ 2010-10-30 0:57 ` Linus Torvalds 2010-10-30 1:16 ` Linus Torvalds 2010-10-31 12:24 ` Jan Kara 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2010-10-30 0:57 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linux Kernel, stable, Greg KH, Jens Axboe Guys, what is the status of this? The original patch in that email thread still makes no sense and the commit log for it cannot be the real issue. But the _problem_ seems to be real, and the code is apparently a total mess, still. And the chunking is necessary - as even quoted in that whole thread: On Tue, Feb 23, 2010 at 01:53:50PM +1100, Dave Chinner wrote: > > Ignoring nr_to_write completely can lead to issues like we used to > have with XFS - it would write an entire extent (8GB) at a time and > starve all other writeback. Those starvation problems - which were > very obvious on NFS servers - went away when we trimmed back the > amount to write in a single pass to saner amounts... so we can't just stay with one single inode and do that one completely. At the same time, the VFS chunking code itself is at least supposed to try to write out 4MB at a time, which means that the whole "only 400kB/s throughput" thing is pretty damn unlikely - but if it's true, then that obviously means that the chunking is somehow broken. IOW, we haven't seemed to get anywhere, and I haven't seen anybody reply to Jan's plaintive email. Anybody? Linus On Sun, Oct 24, 2010 at 4:41 PM, Jan Engelhardt <jengelh@medozas.de> wrote: >> >>What ultimately became of the discussion and/or the patch? >> >>Your original ad-hoc patch certainly still does its job; had no need to >>reboot in 86 days and still counting. > > I still observe this behavior on 2.6.36-rc8. This is starting to > get frustrating, so I will be happily following akpm's advise to > poke people. 
>
> Thread entrypoint: http://lkml.org/lkml/2010/2/12/41
>
> Previously, many concurrent extractions of tarballs and so on have been
> one way to trigger the issue; I now also have a rather small testcase
> (below) that freezes the box here (which has 24G RAM, so even if I'm
> lacking to call msync, I should be fine) sometime after memset finishes.
>
> ----
> [testcase snipped -- quoted in full in the parent message]

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken
  2010-10-30  0:57 ` Linus Torvalds
@ 2010-10-30  1:16   ` Linus Torvalds
  2010-10-30  1:30     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Linus Torvalds @ 2010-10-30  1:16 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Jan Kara, Andrew Morton, Linux Kernel, stable, Greg KH, Jens Axboe

On Fri, Oct 29, 2010 at 5:57 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Guys, what is the status of this?
>
> The original patch in that email thread still makes no sense and the
> commit log for it cannot be the real issue. But the _problem_ seems to
> be real, and the code is apparently a total mess, still.

Btw, is the problem just that insane WB_SYNC_ALL thing?

The problem with WB_SYNC_ALL seems to be that it synchronously writes
out one inode at a time. And it's not just the data, it's the inode
itself.

So rather than write out all pages for _all_ inodes, and then wait for
them, and write out _all_ metadata, and then wait for that, it seems
like the WB_SYNC_ALL code does the truly stupid thing, which is to
"write out some data for one inode, then _synchronously_ wait for
that, then write out the metadata for that single inode, then
_synchronously_ wait for THAT", and then rinse and repeat for each
inode.

The sane thing for "WB_SYNC_ALL" would be to:
 - for_each_inode: write out all data (no waiting)
 - for_each_inode: wait for the data for that inode, write out the inode
 - for_each_inode: wait for the inode

so that you avoid the whole synchronous wait thing, and can do all
inodes in one go.

I dunno. Who even uses WB_SYNC_ALL? It's just "sync()" itself, isn't
it? And "umount()", I guess. I didn't actually look at the code.

                      Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-30 1:16 ` Linus Torvalds @ 2010-10-30 1:30 ` Linus Torvalds 2010-10-30 3:18 ` Andrew Morton 2010-10-30 13:15 ` Christoph Hellwig 2 siblings, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2010-10-30 1:30 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linux Kernel, stable, Greg KH, Jens Axboe On Fri, Oct 29, 2010 at 6:16 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I dunno. Who even uses WB_SYNC_ALL? It's just "sync()" itself, isn't > it? And "umount()", I guess. I didn't actually look at the code. Hmm. The test-case Jan provided doesn't seem to do sync() or anything, so either this is different from his original email issues, or there's some external "sync()" going on unmentioned. Or something else entirely? Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-30 1:16 ` Linus Torvalds 2010-10-30 1:30 ` Linus Torvalds @ 2010-10-30 3:18 ` Andrew Morton 2010-10-30 13:15 ` Christoph Hellwig 2 siblings, 0 replies; 39+ messages in thread From: Andrew Morton @ 2010-10-30 3:18 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Engelhardt, Jan Kara, Linux Kernel, stable, Greg KH, Jens Axboe On Fri, 29 Oct 2010 18:16:48 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Oct 29, 2010 at 5:57 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > Guys, what is the status of this? > > > > The original patch in that email thread still makes no sense and the > > commit log for it cannot be the real issue. But the _problem_ seems to > > be real, and the code is apparently a total mess, still. > > Btw, is the problem just that insane WB_SYNC_ALL thing? > > The problem with WB_SYNC_ALL seems to be that it synchrnously writes > out one inode at a time. And it's not just the data, it's the inode > itself. > > So rather than write out all pages for _all_ inodes, and then wait for > them, and write out _all_ metadata, and then wait for that, it seems > like the WB_SYNC_ALL code does the truly stupid thing, which is to > "write out some data for one inode, then _synchronously_ wait for > that, then write out the metadata for that single inode, then > _synchronously_ wait for THAT" and then rinse and repeat for each > inode. > > The sane thing for "WB_SYNC_ALL" would be to: > - for_each_inode: write out all data (no waiting) > - for_each_inode: wait for the data for that inode, write out the inode > - for_each_inode: wait for the inode > > so that you avoid the whole synchronous wait thing, and can do all > inodes in one go. The way I originally implemented all this was that the top level would do two passes. The first was as async as possible, to get lots of IO underway against lots of devices. 
Then the second pass was WB_SYNC_ALL to get any straggler IO started and
then to wait on everything.

eg (2.6.20):

static void do_sync(unsigned long wait)
{
	wakeup_pdflush(0);
	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
	DQUOT_SYNC(NULL);
	sync_supers();		/* Write the superblocks */
	sync_filesystems(0);	/* Start syncing the filesystems */
	sync_filesystems(wait);	/* Waitingly sync the filesystems */
	sync_inodes(wait);	/* Mappings, inodes and blockdevs, again. */
	if (!wait)
		printk("Emergency Sync complete\n");
	if (unlikely(laptop_mode))
		laptop_sync_completion();
}

note the two sync_inodes() calls and the two sync_filesystems() calls.

This used to work reasonably well. Maybe it got broken, who knows.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-30 1:16 ` Linus Torvalds 2010-10-30 1:30 ` Linus Torvalds 2010-10-30 3:18 ` Andrew Morton @ 2010-10-30 13:15 ` Christoph Hellwig 2 siblings, 0 replies; 39+ messages in thread From: Christoph Hellwig @ 2010-10-30 13:15 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Engelhardt, Jan Kara, Andrew Morton, Linux Kernel, stable, Greg KH, Jens Axboe On Fri, Oct 29, 2010 at 06:16:48PM -0700, Linus Torvalds wrote: > Btw, is the problem just that insane WB_SYNC_ALL thing? > > The problem with WB_SYNC_ALL seems to be that it synchrnously writes > out one inode at a time. And it's not just the data, it's the inode > itself. > > So rather than write out all pages for _all_ inodes, and then wait for > them, and write out _all_ metadata, and then wait for that, it seems > like the WB_SYNC_ALL code does the truly stupid thing, which is to > "write out some data for one inode, then _synchronously_ wait for > that, then write out the metadata for that single inode, then > _synchronously_ wait for THAT" and then rinse and repeat for each > inode. > > The sane thing for "WB_SYNC_ALL" would be to: > - for_each_inode: write out all data (no waiting) > - for_each_inode: wait for the data for that inode, write out the inode > - for_each_inode: wait for the inode What we do currently at a high level is: for_each_inode: write data (no waiting) for_each_inode: write metadata (no waiting) for_each_inode: write and wait on data for_each_inode: write and wait on metadata except that as pointed out in the earlier thread we switch around inodes in a really dumb way. It's still not quite as efficient as your version above. I'd like to get to something similar to your by splitting the data and metadata list, but for now we have a fairly nice workaround in XFS for the inefficient metadata writes. 
We basically do not start any real I/O at all in ->write_inode but only
put the inode changes in the log, and do a single force of the log
contents at the end of the sync process. Dimitri has implemented
something very similar for ext4 as well recently.

Together with fixing the behaviour of randomly switching between the
inodes, that should get us to a point where we are doing pretty well.
I'd still like to see the split data/metadata dirty tracking and
writeout in common code eventually.

>
> so that you avoid the whole synchronous wait thing, and can do all
> inodes in one go.
>
> I dunno. Who even uses WB_SYNC_ALL? It's just "sync()" itself, isn't
> it? And "umount()", I guess. I didn't actually look at the code.

It's sync/umount as far as the writeback threads are concerned; we of
course use the flag for synchronous writeout of single inodes, but the
issue does not apply there.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-24 23:41 ` Sync writeback still broken Jan Engelhardt 2010-10-30 0:57 ` Linus Torvalds @ 2010-10-31 12:24 ` Jan Kara 2010-10-31 22:40 ` Jan Kara 1 sibling, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-10-31 12:24 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Mon 25-10-10 01:41:48, Jan Engelhardt wrote: > On Sunday 2010-06-27 18:44, Jan Engelhardt wrote: > >On Monday 2010-02-15 16:41, Jan Engelhardt wrote: > >>On Monday 2010-02-15 15:49, Jan Kara wrote: > >>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > >>>> >> > >>>> >> This fixes it by using the passed in page writeback count, instead of > >>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > >>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > >>>> >> finish properly even when new pages are being dirted. > >>>> > > >>>> >This seems broken. > >>>> > >>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this. > >>>> While there is no sync/fsync to be seen in these traces, I can > >>>> tell there's a livelock, without Dirty decreasing at all. > > > >What ultimately became of the discussion and/or the patch? > > > >Your original ad-hoc patch certainly still does its job; had no need to > >reboot in 86 days and still counting. > > I still observe this behavior on 2.6.36-rc8. This is starting to > get frustrating, so I will be happily following akpm's advise to > poke people. Yes, that's a good way :) > Thread entrypoint: http://lkml.org/lkml/2010/2/12/41 > > Previously, many concurrent extractions of tarballs and so on have been > one way to trigger the issue; I now also have a rather small testcase > (below) that freezes the box here (which has 24G RAM, so even if I'm > lacking to call msync, I should be fine) sometime after memset finishes. I've tried your test but didn't succeed in freezing my laptop. 
Everything was running smooth, the machine even felt reasonably responsive although constantly reading and writing to disk. Also sync(1) finished in a couple of seconds as one would expect in an optimistic case. Needless to say that my laptop has only 1G of ram so I had to downsize the hash table from 16G to 1G to be able to run the test and the disk is Intel SSD so the performance of the backing storage compared to the amount of needed IO is much in my favor. OK, so I've taken a machine with standard rotational drive and 28GB of ram and there I can see sync(1) hanging (but otherwise the machine looks OK). Investigating further... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-31 12:24 ` Jan Kara @ 2010-10-31 22:40 ` Jan Kara 2010-11-05 21:33 ` Jan Kara 0 siblings, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-10-31 22:40 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Sun 31-10-10 13:24:37, Jan Kara wrote: > On Mon 25-10-10 01:41:48, Jan Engelhardt wrote: > > On Sunday 2010-06-27 18:44, Jan Engelhardt wrote: > > >On Monday 2010-02-15 16:41, Jan Engelhardt wrote: > > >>On Monday 2010-02-15 15:49, Jan Kara wrote: > > >>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > > >>>> >> > > >>>> >> This fixes it by using the passed in page writeback count, instead of > > >>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > > >>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > > >>>> >> finish properly even when new pages are being dirted. > > >>>> > > > >>>> >This seems broken. > > >>>> > > >>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this. > > >>>> While there is no sync/fsync to be seen in these traces, I can > > >>>> tell there's a livelock, without Dirty decreasing at all. > > > > > >What ultimately became of the discussion and/or the patch? > > > > > >Your original ad-hoc patch certainly still does its job; had no need to > > >reboot in 86 days and still counting. > > > > I still observe this behavior on 2.6.36-rc8. This is starting to > > get frustrating, so I will be happily following akpm's advise to > > poke people. > Yes, that's a good way :) > > > Thread entrypoint: http://lkml.org/lkml/2010/2/12/41 > > > > Previously, many concurrent extractions of tarballs and so on have been > > one way to trigger the issue; I now also have a rather small testcase > > (below) that freezes the box here (which has 24G RAM, so even if I'm > > lacking to call msync, I should be fine) sometime after memset finishes. 
> I've tried your test but didn't succeed in freezing my laptop.
> Everything was running smooth, the machine even felt reasonably responsive
> although constantly reading and writing to disk. Also sync(1) finished in a
> couple of seconds as one would expect in an optimistic case.
> Needless to say that my laptop has only 1G of ram so I had to downsize
> the hash table from 16G to 1G to be able to run the test and the disk is
> Intel SSD so the performance of the backing storage compared to the amount
> of needed IO is much in my favor.
> OK, so I've taken a machine with standard rotational drive and 28GB of
> ram and there I can see sync(1) hanging (but otherwise the machine looks
> OK). Investigating further...

So with the writeback tracing, I verified that indeed the trouble is that
work queued by sync(1) gets queued behind the background writeback which is
just running. And background writeback won't stop because your process is
dirtying pages so aggressively. Actually, it would stop after writing
LONG_MAX pages, but that's effectively infinity. I have a patch
(e.g. http://www.kerneltrap.com/mailarchive/linux-fsdevel/2010/8/3/6886244)
to stop background writeback when other work is queued, but it's kind of
hacky, so I can see why Christoph doesn't like it ;) So I'll have to code
something different to fix this issue...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-10-31 22:40 ` Jan Kara @ 2010-11-05 21:33 ` Jan Kara 2010-11-05 21:34 ` Jan Kara 2010-11-05 22:03 ` Jan Engelhardt 0 siblings, 2 replies; 39+ messages in thread From: Jan Kara @ 2010-11-05 21:33 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh [-- Attachment #1: Type: text/plain, Size: 3842 bytes --] On Sun 31-10-10 23:40:12, Jan Kara wrote: > On Sun 31-10-10 13:24:37, Jan Kara wrote: > > On Mon 25-10-10 01:41:48, Jan Engelhardt wrote: > > > On Sunday 2010-06-27 18:44, Jan Engelhardt wrote: > > > >On Monday 2010-02-15 16:41, Jan Engelhardt wrote: > > > >>On Monday 2010-02-15 15:49, Jan Kara wrote: > > > >>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote: > > > >>>> >> > > > >>>> >> This fixes it by using the passed in page writeback count, instead of > > > >>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance > > > >>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1) > > > >>>> >> finish properly even when new pages are being dirted. > > > >>>> > > > > >>>> >This seems broken. > > > >>>> > > > >>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this. > > > >>>> While there is no sync/fsync to be seen in these traces, I can > > > >>>> tell there's a livelock, without Dirty decreasing at all. > > > > > > > >What ultimately became of the discussion and/or the patch? > > > > > > > >Your original ad-hoc patch certainly still does its job; had no need to > > > >reboot in 86 days and still counting. > > > > > > I still observe this behavior on 2.6.36-rc8. This is starting to > > > get frustrating, so I will be happily following akpm's advise to > > > poke people. 
> > Yes, that's a good way :) > > > > > Thread entrypoint: http://lkml.org/lkml/2010/2/12/41 > > > > > > Previously, many concurrent extractions of tarballs and so on have been > > > one way to trigger the issue; I now also have a rather small testcase > > > (below) that freezes the box here (which has 24G RAM, so even if I'm > > > lacking to call msync, I should be fine) sometime after memset finishes. > > I've tried your test but didn't succeed in freezing my laptop. > > Everything was running smooth, the machine even felt reasonably responsive > > although constantly reading and writing to disk. Also sync(1) finished in a > > couple of seconds as one would expect in an optimistic case. > > Needless to say that my laptop has only 1G of ram so I had to downsize > > the hash table from 16G to 1G to be able to run the test and the disk is > > Intel SSD so the performance of the backing storage compared to the amount > > of needed IO is much in my favor. > > OK, so I've taken a machine with standard rotational drive and 28GB of > > ram and there I can see sync(1) hanging (but otherwise the machine looks > > OK). Investigating further... > So with the writeback tracing, I verified that indeed the trouble is that > work queued by sync(1) gets queued behind the background writeback which is > just running. And background writeback won't stop because your process is > dirtying pages so agressively. Actually, it would stop after writing > LONG_MAX pages but that's effectively infinity. I have a patch > (e.g. http://www.kerneltrap.com/mailarchive/linux-fsdevel/2010/8/3/6886244) > to stop background writeback when other work is queued but it's kind > of hacky so I can see why Christoph doesn't like it ;) > So I'll have to code something different to fix this issue... 
OK, so at Kernel Summit we agreed to fix the issue as I originally wanted
by patches
http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2
and
http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2

I needed one more patch to resolve the issue (attached) which I've just
posted for review and possible inclusion. I had a similar one a long time
ago, but now I'm better able to explain why it works because of
tracepoints. Yay! ;). With those three patches I'm not able to trigger
livelocks (but sync still takes 15 minutes because the throughput to disk
is about 4MB/s - no big surprise given the random nature of the load)

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Avoid-livelocking-of-WB_SYNC_ALL-writeback.patch --]
[-- Type: text/x-patch, Size: 3000 bytes --]

>From 44c256bbd627ae75039b99724ce3c7caa7f4fd95 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 5 Nov 2010 17:56:03 +0100
Subject: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback

When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and thus
we easily end up with negative nr_to_write after the function returns.
wb_writeback() then decides we need another round of writeback but this
is wrong in some cases! For example, when a single large file is
continuously dirtied, we would never finish syncing it because each pass
would be able to write MAX_WRITEBACK_PAGES and the inode dirty timestamp
never gets updated (as the inode is never completely clean).

Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
do not need nr_to_write in WB_SYNC_ALL mode anyway since livelock
avoidance is done differently for it.

After this patch, the program from http://lkml.org/lkml/2010/10/24/154
is no longer able to stall sync forever.
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   18 ++++++++++++++----
 1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6b4d02a..d5873a6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -629,6 +629,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
+	long write_chunk;
 	struct inode *inode;
 
 	if (wbc.for_kupdate) {
@@ -640,6 +641,15 @@ static long wb_writeback(struct bdi_writeback *wb,
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
 	}
+	/*
+	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
+	 * we need to write everything and livelock avoidance is implemented
+	 * differently.
+	 */
+	if (wbc.sync_mode == WB_SYNC_NONE)
+		write_chunk = MAX_WRITEBACK_PAGES;
+	else
+		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
 	for (;;) {
@@ -665,7 +675,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 			break;
 
 		wbc.more_io = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
@@ -675,8 +685,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
-		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
 		 * If we consumed everything, see if we have more
@@ -691,7 +701,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 21:33 ` Jan Kara @ 2010-11-05 21:34 ` Jan Kara 2010-11-05 21:41 ` Linus Torvalds 2010-11-05 22:03 ` Jan Engelhardt 1 sibling, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-11-05 21:34 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Fri 05-11-10 22:33:24, Jan Kara wrote: > I needed one more patch to resolve the issue (attached) which I've just > posted for review and possible inclusion. I had a similar one long time ago > but now I'm better able to explain why it works because of tracepoints. > Yay! ;). With those three patches I'm not able to trigger livelocks (but > sync takes still 15 minutes because the througput to disk is about 4MB/s - > no big surprise given the random nature of the load) PS: And big thanks to you for providing the test case and your persistence ;) Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 21:34 ` Jan Kara @ 2010-11-05 21:41 ` Linus Torvalds 0 siblings, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2010-11-05 21:41 UTC (permalink / raw) To: Jan Kara Cc: Jan Engelhardt, Andrew Morton, Jens Axboe, Linux Kernel, stable, gregkh On Fri, Nov 5, 2010 at 2:34 PM, Jan Kara <jack@suse.cz> wrote: > On Fri 05-11-10 22:33:24, Jan Kara wrote: >> I needed one more patch to resolve the issue (attached) which I've just >> posted for review and possible inclusion. I had a similar one long time ago >> but now I'm better able to explain why it works because of tracepoints. >> Yay! ;). With those three patches I'm not able to trigger livelocks (but >> sync takes still 15 minutes because the througput to disk is about 4MB/s - >> no big surprise given the random nature of the load) > PS: And big thanks to you for providing the test case and your > persistence ;) Ok, I'm inclined to just apply these three patches. Can we get a quick ack/tested-by for them? I'm sure there's more work to be done, but let's put this damn thing to rest if it really does fix the problem Jan(E) sees. Comments? Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 21:33 ` Jan Kara 2010-11-05 21:34 ` Jan Kara @ 2010-11-05 22:03 ` Jan Engelhardt 2010-11-07 12:57 ` Jan Kara 2011-01-20 22:50 ` Jan Engelhardt 1 sibling, 2 replies; 39+ messages in thread From: Jan Engelhardt @ 2010-11-05 22:03 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Friday 2010-11-05 22:33, Jan Kara wrote: > OK, so at Kernel Summit we agreed to fix the issue as I originally wanted >by patches >http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2 >and >http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2 > > I needed one more patch to resolve the issue (attached) which I've just >posted for review and possible inclusion. If you have a branch that has all three and is easy to fetch, I'll get on it right away. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 22:03 ` Jan Engelhardt @ 2010-11-07 12:57 ` Jan Kara 2011-01-20 22:50 ` Jan Engelhardt 1 sibling, 0 replies; 39+ messages in thread From: Jan Kara @ 2010-11-07 12:57 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Fri 05-11-10 23:03:31, Jan Engelhardt wrote: > > On Friday 2010-11-05 22:33, Jan Kara wrote: > > OK, so at Kernel Summit we agreed to fix the issue as I originally wanted > >by patches > >http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2 > >and > >http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2 > > > > I needed one more patch to resolve the issue (attached) which I've just > >posted for review and possible inclusion. > > If you have a branch that has all three and is easy to fetch, I'll > get on it right away. In my repo: git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6.git I've created branch writeback_fixes which has 2.6.37-rc1 + the three patches. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2010-11-05 22:03 ` Jan Engelhardt 2010-11-07 12:57 ` Jan Kara @ 2011-01-20 22:50 ` Jan Engelhardt 2011-01-21 15:09 ` Jan Kara 1 sibling, 1 reply; 39+ messages in thread From: Jan Engelhardt @ 2011-01-20 22:50 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh On Friday 2010-11-05 23:03, Jan Engelhardt wrote: >On Friday 2010-11-05 22:33, Jan Kara wrote: >> OK, so at Kernel Summit we agreed to fix the issue as I originally wanted >>by patches >>http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2 >>and >>http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2 >> >> I needed one more patch to resolve the issue (attached) which I've just >>posted for review and possible inclusion. > >If you have a branch that has all three and is easy to fetch, I'll >get on it right away. I tested these now [2.6.37-rc1+yourpatches]. Within the last 13 days of uptime, the following messages have accumulated, however the machine is still alive, so I guess it's fine. (I see Andrew merged the patches already, so if it becomes a problem again I will notice it within the next kernel releases.) INFO: task flush-259:0:324 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
flush-259:0   D 00000000005a3d38     0   324      2 0x18000000000
Call Trace:
 [00000000005a39c4] do_get_write_access+0x244/0x480
 [00000000005a3d38] jbd2_journal_get_write_access+0x18/0x40
 [0000000000591278] __ext4_journal_get_write_access+0x18/0x60
 [0000000000572a94] ext4_reserve_inode_write+0x54/0xa0
 [0000000000572afc] ext4_mark_inode_dirty+0x1c/0x1e0
 [000000000058ba98] ext4_ext_insert_extent+0x718/0x17a0
 [000000000058f2d4] ext4_ext_map_blocks+0x9f4/0x12c0
 [00000000005742f4] ext4_map_blocks+0x1b4/0x240
 [0000000000576864] mpage_da_map_and_submit+0x84/0x420
 [0000000000576f5c] write_cache_pages_da+0x27c/0x3e0
 [00000000005772f8] ext4_da_writepages+0x238/0x4a0
 [00000000004cc7a4] do_writepages+0x24/0x60
 [0000000000525264] writeback_single_inode+0x84/0x240
 [00000000005258d8] writeback_sb_inodes+0x98/0x140
 [0000000000525f44] writeback_inodes_wb+0xc4/0x180
 [0000000000526294] wb_writeback+0x294/0x300
INFO: task jbd2/sda2-8:327 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda2-8   D 000000000052ce28     0   327      2 0x58000000000
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [000000000052ce28] sync_buffer+0x48/0x60
 [000000000078895c] __wait_on_bit+0x7c/0xe0
 [0000000000788a04] out_of_line_wait_on_bit+0x44/0x60
 [00000000005a4e10] jbd2_journal_commit_transaction+0x850/0x1280
 [00000000005a87bc] kjournald2+0x9c/0x1c0
 [0000000000482b84] kthread+0x64/0x80
 [000000000042b790] kernel_thread+0x30/0x60
 [0000000000482e84] kthreadd+0xe4/0x160
INFO: task lighttpd:3260 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lighttpd      D 000000000052ce28     0  3260      1 0x410051101000080
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [000000000052ce28] sync_buffer+0x48/0x60
 [000000000078895c] __wait_on_bit+0x7c/0xe0
 [0000000000788a04] out_of_line_wait_on_bit+0x44/0x60
 [000000000057ba44] ext4_find_entry+0x3c4/0x520
 [000000000057c7c8] ext4_lookup+0x28/0x180
 [000000000050e5a0] d_alloc_and_lookup+0x40/0x80
 [000000000050e6dc] do_lookup+0xfc/0x160
 [0000000000510578] link_path_walk+0x6b8/0xb00
 [0000000000510aa8] path_walk+0x28/0xa0
 [0000000000510b64] do_path_lookup+0x44/0xa0
 [0000000000510f6c] user_path_at+0x2c/0x80
 [00000000005090c8] vfs_fstatat+0x28/0x80
 [000000000044390c] compat_sys_stat64+0xc/0x40
 [0000000000406114] linux_sparc_syscall32+0x34/0x40
INFO: task rsync:2451 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rsync         D 00000000005a3d38     0  2451   2445 0x410061101000080
Call Trace:
 [00000000005a39c4] do_get_write_access+0x244/0x480
 [00000000005a3d38] jbd2_journal_get_write_access+0x18/0x40
 [0000000000591278] __ext4_journal_get_write_access+0x18/0x60
 [0000000000572a94] ext4_reserve_inode_write+0x54/0xa0
 [0000000000572afc] ext4_mark_inode_dirty+0x1c/0x1e0
 [0000000000577844] ext4_dirty_inode+0x24/0x40
 [0000000000525b60] __mark_inode_dirty+0x20/0x200
 [0000000000519d84] file_update_time+0xc4/0x140
 [00000000004c43f4] __generic_file_aio_write+0x1f4/0x3a0
 [00000000004c45d4] generic_file_aio_write+0x34/0xc0
 [0000000000504338] do_sync_write+0x78/0xc0
 [00000000005049a8] vfs_write+0x68/0x140
 [0000000000504c4c] SyS_write+0x2c/0x60
 [0000000000406114] linux_sparc_syscall32+0x34/0x40
INFO: task cron:4376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cron          D 00000000004c33f8     0  4376   3244 0x210021101000080
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [00000000004c33f8] sync_page+0x58/0x80
 [0000000000788808] __wait_on_bit_lock+0x68/0xe0
 [00000000004c3360] __lock_page+0x40/0x60
 [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
 [00000000004c5108] filemap_fault+0x248/0x3c0
 [00000000004dca90] __do_fault+0x30/0x480
 [00000000004df27c] handle_mm_fault+0x13c/0x9a0
 [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
 [00000000004079a8] sparc64_realfault_common+0x10/0x20
INFO: task cron:4377 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cron          D 00000000004c33f8     0  4377   3244 0x210021101000080
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [00000000004c33f8] sync_page+0x58/0x80
 [0000000000788808] __wait_on_bit_lock+0x68/0xe0
 [00000000004c3360] __lock_page+0x40/0x60
 [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
 [00000000004c5108] filemap_fault+0x248/0x3c0
 [00000000004dca90] __do_fault+0x30/0x480
 [00000000004df27c] handle_mm_fault+0x13c/0x9a0
 [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
 [00000000004079a8] sparc64_realfault_common+0x10/0x20
INFO: task cron:4378 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cron          D 00000000004c33f8     0  4378   3244 0x210021101000080
Call Trace:
 [00000000007880d4] io_schedule+0x54/0xc0
 [00000000004c33f8] sync_page+0x58/0x80
 [0000000000788808] __wait_on_bit_lock+0x68/0xe0
 [00000000004c3360] __lock_page+0x40/0x60
 [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
 [00000000004c5108] filemap_fault+0x248/0x3c0
 [00000000004dca90] __do_fault+0x30/0x480
 [00000000004df27c] handle_mm_fault+0x13c/0x9a0
 [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
 [00000000004079a8] sparc64_realfault_common+0x10/0x20

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Sync writeback still broken 2011-01-20 22:50 ` Jan Engelhardt @ 2011-01-21 15:09 ` Jan Kara 0 siblings, 0 replies; 39+ messages in thread From: Jan Kara @ 2011-01-21 15:09 UTC (permalink / raw) To: Jan Engelhardt Cc: Jan Kara, Andrew Morton, Linus Torvalds, Jens Axboe, Linux Kernel, stable, gregkh

Hi,

On Thu 20-01-11 23:50:36, Jan Engelhardt wrote:
> On Friday 2010-11-05 23:03, Jan Engelhardt wrote:
> >On Friday 2010-11-05 22:33, Jan Kara wrote:
> >> OK, so at Kernel Summit we agreed to fix the issue as I originally wanted
> >>by patches
> >>http://marc.info/?l=linux-fsdevel&m=128861735213143&w=2
> >>and
> >>http://marc.info/?l=linux-fsdevel&m=128861734813131&w=2
> >>
> >> I needed one more patch to resolve the issue (attached) which I've just
> >>posted for review and possible inclusion.
> >
> >If you have a branch that has all three and is easy to fetch, I'll
> >get on it right away.
>
> I tested these now [2.6.37-rc1+yourpatches].
>
> Within the last 13 days of uptime, the following messages have
> accumulated, however the machine is still alive, so I guess it's
> fine. (I see Andrew merged the patches already, so if it becomes
> a problem again I will notice it within the next kernel releases.)

Thanks for the update :) I went through the messages below and there are no signs of bdi-writeout causing any problems. Apparently your disk is rather busy (maybe in combination with suboptimal IO scheduling decisions?) since I could attribute all the reports to waiting for an IO on a buffer to finish.

Honza

> INFO: task flush-259:0:324 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> flush-259:0 D 00000000005a3d38 0 324 2 0x18000000000
> Call Trace:
>  [00000000005a39c4] do_get_write_access+0x244/0x480
>  [00000000005a3d38] jbd2_journal_get_write_access+0x18/0x40
>  [0000000000591278] __ext4_journal_get_write_access+0x18/0x60
>  [0000000000572a94] ext4_reserve_inode_write+0x54/0xa0
>  [0000000000572afc] ext4_mark_inode_dirty+0x1c/0x1e0
>  [000000000058ba98] ext4_ext_insert_extent+0x718/0x17a0
>  [000000000058f2d4] ext4_ext_map_blocks+0x9f4/0x12c0
>  [00000000005742f4] ext4_map_blocks+0x1b4/0x240
>  [0000000000576864] mpage_da_map_and_submit+0x84/0x420
>  [0000000000576f5c] write_cache_pages_da+0x27c/0x3e0
>  [00000000005772f8] ext4_da_writepages+0x238/0x4a0
>  [00000000004cc7a4] do_writepages+0x24/0x60
>  [0000000000525264] writeback_single_inode+0x84/0x240
>  [00000000005258d8] writeback_sb_inodes+0x98/0x140
>  [0000000000525f44] writeback_inodes_wb+0xc4/0x180
>  [0000000000526294] wb_writeback+0x294/0x300
> INFO: task jbd2/sda2-8:327 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda2-8 D 000000000052ce28 0 327 2 0x58000000000
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [000000000052ce28] sync_buffer+0x48/0x60
>  [000000000078895c] __wait_on_bit+0x7c/0xe0
>  [0000000000788a04] out_of_line_wait_on_bit+0x44/0x60
>  [00000000005a4e10] jbd2_journal_commit_transaction+0x850/0x1280
>  [00000000005a87bc] kjournald2+0x9c/0x1c0
>  [0000000000482b84] kthread+0x64/0x80
>  [000000000042b790] kernel_thread+0x30/0x60
>  [0000000000482e84] kthreadd+0xe4/0x160
> INFO: task lighttpd:3260 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> lighttpd D 000000000052ce28 0 3260 1 0x410051101000080
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [000000000052ce28] sync_buffer+0x48/0x60
>  [000000000078895c] __wait_on_bit+0x7c/0xe0
>  [0000000000788a04] out_of_line_wait_on_bit+0x44/0x60
>  [000000000057ba44] ext4_find_entry+0x3c4/0x520
>  [000000000057c7c8] ext4_lookup+0x28/0x180
>  [000000000050e5a0] d_alloc_and_lookup+0x40/0x80
>  [000000000050e6dc] do_lookup+0xfc/0x160
>  [0000000000510578] link_path_walk+0x6b8/0xb00
>  [0000000000510aa8] path_walk+0x28/0xa0
>  [0000000000510b64] do_path_lookup+0x44/0xa0
>  [0000000000510f6c] user_path_at+0x2c/0x80
>  [00000000005090c8] vfs_fstatat+0x28/0x80
>  [000000000044390c] compat_sys_stat64+0xc/0x40
>  [0000000000406114] linux_sparc_syscall32+0x34/0x40
> INFO: task rsync:2451 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> rsync D 00000000005a3d38 0 2451 2445 0x410061101000080
> Call Trace:
>  [00000000005a39c4] do_get_write_access+0x244/0x480
>  [00000000005a3d38] jbd2_journal_get_write_access+0x18/0x40
>  [0000000000591278] __ext4_journal_get_write_access+0x18/0x60
>  [0000000000572a94] ext4_reserve_inode_write+0x54/0xa0
>  [0000000000572afc] ext4_mark_inode_dirty+0x1c/0x1e0
>  [0000000000577844] ext4_dirty_inode+0x24/0x40
>  [0000000000525b60] __mark_inode_dirty+0x20/0x200
>  [0000000000519d84] file_update_time+0xc4/0x140
>  [00000000004c43f4] __generic_file_aio_write+0x1f4/0x3a0
>  [00000000004c45d4] generic_file_aio_write+0x34/0xc0
>  [0000000000504338] do_sync_write+0x78/0xc0
>  [00000000005049a8] vfs_write+0x68/0x140
>  [0000000000504c4c] SyS_write+0x2c/0x60
>  [0000000000406114] linux_sparc_syscall32+0x34/0x40
> INFO: task cron:4376 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cron D 00000000004c33f8 0 4376 3244 0x210021101000080
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [00000000004c33f8] sync_page+0x58/0x80
>  [0000000000788808] __wait_on_bit_lock+0x68/0xe0
>  [00000000004c3360] __lock_page+0x40/0x60
>  [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
>  [00000000004c5108] filemap_fault+0x248/0x3c0
>  [00000000004dca90] __do_fault+0x30/0x480
>  [00000000004df27c] handle_mm_fault+0x13c/0x9a0
>  [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
>  [00000000004079a8] sparc64_realfault_common+0x10/0x20
> INFO: task cron:4377 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cron D 00000000004c33f8 0 4377 3244 0x210021101000080
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [00000000004c33f8] sync_page+0x58/0x80
>  [0000000000788808] __wait_on_bit_lock+0x68/0xe0
>  [00000000004c3360] __lock_page+0x40/0x60
>  [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
>  [00000000004c5108] filemap_fault+0x248/0x3c0
>  [00000000004dca90] __do_fault+0x30/0x480
>  [00000000004df27c] handle_mm_fault+0x13c/0x9a0
>  [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
>  [00000000004079a8] sparc64_realfault_common+0x10/0x20
> INFO: task cron:4378 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cron D 00000000004c33f8 0 4378 3244 0x210021101000080
> Call Trace:
>  [00000000007880d4] io_schedule+0x54/0xc0
>  [00000000004c33f8] sync_page+0x58/0x80
>  [0000000000788808] __wait_on_bit_lock+0x68/0xe0
>  [00000000004c3360] __lock_page+0x40/0x60
>  [00000000004c4eac] __lock_page_or_retry+0x4c/0x60
>  [00000000004c5108] filemap_fault+0x248/0x3c0
>  [00000000004dca90] __do_fault+0x30/0x480
>  [00000000004df27c] handle_mm_fault+0x13c/0x9a0
>  [000000000044de4c] do_sparc64_fault+0x32c/0x7c0
>  [00000000004079a8] sparc64_realfault_common+0x10/0x20

--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-12 15:45 ` Linus Torvalds 2010-02-13 12:58 ` Jan Engelhardt @ 2010-02-15 14:17 ` Jan Kara 2010-02-16 0:05 ` Linus Torvalds 1 sibling, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-02-15 14:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jens Axboe, Linux Kernel, jack, jengelh, stable, gregkh

On Fri 12-02-10 07:45:04, Linus Torvalds wrote:
> On Fri, 12 Feb 2010, Jens Axboe wrote:
> >
> > This fixes it by using the passed in page writeback count, instead of
> > doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> > (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> > finish properly even when new pages are being dirtied.
>
> This seems broken.
>
> The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't
> generate a single IO bigger than a couple of MB. And then we have that
> loop around things to do multiple chunks. Your change to use nr_pages
> seems to make the whole looping totally pointless, and breaks that "don't
> do huge hunks" logic.
>
> So I don't think that your patch is correct.
>
> That said, I _do_ believe you when you say it makes a difference, which
> makes me think there is a bug there. I just don't think you fixed the
> right bug, and your change just happens to do what you wanted by pure
> random luck.

A few points on this: Before Jens rewrote the writeback code to use per-bdi flusher threads, MAX_WRITEBACK_PAGES applied only to writeback done by pdflush and kupdate. So, for example, sync used .nr_pages as in the suggested patch. So Jens' patch was just trying to get back to the old behavior... Personally, I don't see why the VM shouldn't generate IO from a single inode larger than a few MB for data integrity syncs.
There are two reasons I know about for MAX_WRITEBACK_PAGES:

a) avoid livelocking of writeout when the file is continuously grown (even nastier is when the file is sparse and huge and the writer just fills the huge hole with data gradually)

b) avoid holding I_SYNC for too long because that could block tasks being write-throttled.

I find a) a valid reason, but MAX_WRITEBACK_PAGES works around the livelock only for background writeback, where we cycle through all inodes and do not care whether we ever make the inode clean. Limiting data integrity writeback to MAX_WRITEBACK_PAGES just makes things *worse* because you give processes more time to generate new dirty pages. Using dirtied_when does not solve this because the inode dirty timestamp is updated only at the clean->dirty transition, but the inode never gets clean.

I don't think b) is a very valid reason anymore. Currently, a throttled task writes all inodes from the BDI and the I_SYNC-locked inode is just skipped. It can only be a problem when that inode would be the only one with a big enough amount of dirty data. Then the throttled thread will exit balance_dirty_pages only once the amount of dirty data at that BDI drops below the BDI dirty threshold.

> The _real_ bug seems to be the one you mentioned, but then ignored:
>
> > Instead of flushing everything older than the sync run, it will do
> > chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
> > over.
>
> and it would seem that the _logical_ way to fix it would be something like
> the appended...
>
> Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that
> any _future_ data is written back, so the 'oldest_jif' thing would seem to
> be sane regardless of sync mode.
>
> NOTE NOTE NOTE! I do have to admit that this patch scares me, because
> there could be some bug in the 'older_than_this' logic that means that
> somebody sets it even if the inode is already dirty.
> So this patch makes
> conceptual sense to me, and I think it's the right thing to do, but I
> also suspect that we do not actually have a lot of test coverage of the
> whole 'older_than_this' logic, because it historically has been just a
> small optimization for kupdated.

A similar logic to what you are suggesting is actually already there. See the 'start' variable in writeback_inodes_wb(). It was there exactly to avoid sync(1) livelocks, but it assumes that writeback_inodes_wb() handles all the IO for the sync, not that it is called just to write MAX_WRITEBACK_PAGES. So the only danger of your patch using older_than_this would be that the ordering on the wb->b_dirty list isn't always by dirty time. But as I wrote above, the main problem I see in your patch is that it does not avoid all the cases of livelocks.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-15 14:17 ` [PATCH] writeback: Fix broken sync writeback Jan Kara @ 2010-02-16 0:05 ` Linus Torvalds 2010-02-16 23:00 ` Jan Kara 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2010-02-16 0:05 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Linux Kernel, jengelh, stable, gregkh

On Mon, 15 Feb 2010, Jan Kara wrote:
> >
> > Personally, I don't see why VM shouldn't generate IO from a single inode
> larger than a few MB for data integrity syncs. There are two reasons I know
> about for MAX_WRITEBACK_PAGES:

Two issues:
 - it shouldn't matter for performance (if you have a broken disk that
   wants multi-megabyte writes to get good performance, you need to fix
   the driver, not the VM)
 - I generally think that _latency_ is much more important than
   throughput, and huge writes are simply bad for latency.

But the real complaint I had about your patch was that it made no sense. Your statement that it speeds up sync is indicative of some _other_ independent problem (perhaps starting from the beginning of the file each time? Who knows?) and the patch _clearly_ doesn't actually address the underlying problem, it just changes the logic in a way that makes no obvious sense to anybody, without actually explaining why it would matter.

So if you can explain _why_ that patch makes such a big difference for you, and actually write that up, I wouldn't mind it. But right now it was sent as a voodoo patch with no sensible explanation for why it would really make any difference, since the outer loop should already do what the patch does.

Linus
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-16 0:05 ` Linus Torvalds @ 2010-02-16 23:00 ` Jan Kara 2010-02-16 23:34 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-02-16 23:00 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh

On Mon 15-02-10 16:05:43, Linus Torvalds wrote:
>
> On Mon, 15 Feb 2010, Jan Kara wrote:
> >
> > Personally, I don't see why VM shouldn't generate IO from a single inode
> > larger than a few MB for data integrity syncs. There are two reasons I know
> > about for MAX_WRITEBACK_PAGES:
>
> Two issues:
>  - it shouldn't matter for performance (if you have a broken disk that
>    wants multi-megabyte writes to get good performance, you need to fix
>    the driver, not the VM)

The IO size actually does matter for performance, because if you switch after 4 MB (the current value of MAX_WRITEBACK_PAGES) to writing another inode, that means a seek to a distant place, and that has significant impact (seek times are in the 10-20 ms range for common drives, so with 50 MB/s throughput this is like 13-25% of the IO time...). A similar reason why it matters is if the filesystem does delayed allocation - then allocation is performed from writeback code, and if it happens in 4 MB chunks, chances for fragmentation are higher. Actually XFS, btrfs, and ext4 *work around* the fact that the VM sends only 4 MB chunks and write more regardless of the nr_to_write set - I agree this really sucks but that's the way it is.

>  - I generally think that _latency_ is much more important than
>    throughput, and huge writes are simply bad for latency.

I agree that latency matters, but is the VM the right place to decide where to break writes into smaller chunks for latency reasons? It knows neither about the placement of the file nor about possible concurrent requests for that device.
So I personally believe that the IO scheduler should be the layer that cares about scheduling IO so that we have decent latencies. As splitting bios is probably not an option, we might want an artificial upper limit on bio size for latency reasons, but that is still way below the VM...

> But the real complaint I had about your patch was that it made no sense.
>
> Your statement that it speeds up sync is indicative of some _other_
> independent problem (perhaps starting from the beginning of the file each
> time? Who knows?) and the patch _clearly_ doesn't actually address the
> underlying problem, it just changes the logic in a way that makes no
> obvious sense to anybody, without actually explaining why it would matter.
>
> So if you can explain _why_ that patch makes such a big difference for
> you, and actually write that up, I wouldn't mind it. But right now it was
> sent as a voodoo patch with no sensible explanation for why it would
> really make any difference, since the outer loop should already do what
> the patch does.

OK, fair enough :). The patch is actually Jens', but looking at the code again, the nr_to_write trimming should rather happen inside writeback_inodes_wb, not outside of it. That would make the logic clearer and magically also fix the problem, because write_cache_pages() ignores nr_to_write in WB_SYNC_ALL mode. But I guess it's better not to depend on this and to explicitly handle that in writeback_inodes_wb... So I'll post a new patch with a better changelog shortly.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-16 23:00 ` Jan Kara @ 2010-02-16 23:34 ` Linus Torvalds 2010-02-17 0:01 ` Linus Torvalds 2010-02-17 1:33 ` Jan Kara 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2010-02-16 23:34 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Linux Kernel, jengelh, stable, gregkh

On Wed, 17 Feb 2010, Jan Kara wrote:
> >
> > The IO size actually does matter for performance because if you switch
> after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,

No.

Dammit, read the code.

That's my whole _point_. Look at the for-loop.

We DO NOT SWITCH to another inode, because we just continue in the for-loop.

This is why I think your patch is crap. You clearly haven't even read the code, the patch makes no sense, and there must be something else going on than what you _claim_ is going on.

> So I'll post a new patch with a better changelog shortly.

Not "shortly". Because you clearly haven't looked at the code. Look here:

	/*
	 * If we consumed everything, see if we have more
	 */
	if (wbc.nr_to_write <= 0)
		continue;

and here:

	/*
	 * Did we write something? Try for more
	 */
	if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
		continue;

the point is, the way the code is written, it is very much _meant_ to write out one file in one go, except it is supposed to chunk it up so that you never see _too_ many dirty pages in flight at any time. Because tons of dirty pages in the queues makes for bad latency.

Now, I admit that since your patch makes a difference, there is a bug somewhere. That's never what I've argued against. What I argue against is your patch itself, and your explanation for it. They don't make sense. I suspect that there is something else going on.
In particular, I wonder if multiple calls to "writeback_inodes_wb()" somehow mess up the wbc state, and it starts writing the same inode from the beginning, or it simply has some bug that means that it moves the inode to the head of the queue, or something. For example, it might be that the logic in writeback_inodes_wb() moves an inode back (the "redirty_tail()" cases) in bad ways when it shouldn't. See what I'm trying to say? I think your patch probably just hides the _real_ bug. Because it really shouldn't matter if we do things in 4MB chunks or whatever, because the for-loop should happily continue. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-16 23:34 ` Linus Torvalds @ 2010-02-17 0:01 ` Linus Torvalds 2010-02-17 1:33 ` Jan Kara 1 sibling, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2010-02-17 0:01 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue, 16 Feb 2010, Linus Torvalds wrote: > > For example, it might be that the logic in writeback_inodes_wb() moves an > inode back (the "redirty_tail()" cases) in bad ways when it shouldn't. Or another example: perhaps we screw up the inode list ordering when we move the inodes between b_dirty <-> b_io <-> b_more_io? And if we put inodes on the b_more_io list, do we perhaps then end up doing too much waiting in inode_wait_for_writeback()? See what I'm saying? Your patch - by just submitting the maximal sized buffers - may well end up hiding the real problem. But if there is a real problem in that whole list manipulation or waiting, then that problem still exists for async writeback. Wouldn't it be better to fix the real problem, so that async writeback also gets the correct IO patterns? NOTE! It's entirely possible that we do end up wanting to really submit the maximal dirty IO for synchronous dirty writeback, in order to get better IO patterns. So maybe your patch ends up being the right one in the end. But I really _really_ want to understand this. But right now, that patch seems like voodoo programming to me, and I personally suspect that the real problem is in the horribly complex b_io/b_more_io interaction. Or one of the _other_ horribly complex details in the write-back logic (even just writing back a single inode is complicated, see all the logic about "range_start/range_end" in the lower level write_cache_pages() function). Our whole writeback code is very very complicated. I don't like it. But that's also why I want to feel like I understand the patch when I apply it. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-16 23:34 ` Linus Torvalds 2010-02-17 0:01 ` Linus Torvalds @ 2010-02-17 1:33 ` Jan Kara 2010-02-17 1:57 ` Dave Chinner 2010-02-17 3:35 ` Linus Torvalds 1 sibling, 2 replies; 39+ messages in thread From: Jan Kara @ 2010-02-17 1:33 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh

On Tue 16-02-10 15:34:01, Linus Torvalds wrote:
> On Wed, 17 Feb 2010, Jan Kara wrote:
> > >
> > > The IO size actually does matter for performance because if you switch
> > after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,
>
> No.
>
> Dammit, read the code.
>
> That's my whole _point_. Look at the for-loop.
>
> We DO NOT SWITCH to another inode, because we just continue in the
> for-loop.
>
> This is why I think your patch is crap. You clearly haven't even read the
> code, the patch makes no sense, and there must be something else going on
> than what you _claim_ is going on.

I've read the code. Maybe I'm missing something but look:

  writeback_inodes_wb(nr_to_write = 1024)
    -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list
    ...
    writeback_single_inode()
      ...writes 1024 pages.

if we haven't written everything in the inode (more than 1024 dirty pages) we end up doing either requeue_io() or redirty_tail(). In the first case the inode is put on the b_more_io list, in the second case at the tail of the b_dirty list. In either case it will not receive further writeout until we go through all other members of the current b_io list.

So I claim we currently *do* switch to another inode after 4 MB. That is a fact. I *think* it is by design - mainly to avoid the situation where someone continuously writes a huge file and kupdate or pdflush would never get to writing other files with dirty data (at least that's the impression I've built over the years - heck, even 2.6.16 seems to have this redirty_tail logic with a comment about the above livelock).
I do find this design broken as well as you likely do and think that the livelock issue described in the above paragraph should be solved differently (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix. The question is what to do now for 2.6.33 and 2.6.32-stable. Personally, I think that changing the writeback logic so that it does not switch inodes after 4 MB is too risky for these two kernels. So with the above explanation would you accept some fix along the lines of original Jens' fix? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 1:33 ` Jan Kara @ 2010-02-17 1:57 ` Dave Chinner 2010-02-17 3:35 ` Linus Torvalds 1 sibling, 0 replies; 39+ messages in thread From: Dave Chinner @ 2010-02-17 1:57 UTC (permalink / raw) To: Jan Kara Cc: Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Wed, Feb 17, 2010 at 02:33:37AM +0100, Jan Kara wrote: > On Tue 16-02-10 15:34:01, Linus Torvalds wrote: > > On Wed, 17 Feb 2010, Jan Kara wrote: > > > > > > The IO size actually does matter for performance because if you switch > > > after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode, > > > > No. > > > > Dammit, read the code. > > > > That's my whole _point_. Look at the for-loop. > > > > We DO NOT SWITCH to another inode, because we just continue in the > > for-loop. > > > > This is why I think your patch is crap. You clearly haven't even read the > > code, the patch makes no sense, and there must be something else going on > > than what you _claim_ is going on. > I've read the code. Maybe I'm missing something but look: > writeback_inodes_wb(nr_to_write = 1024) > -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list > ... > writeback_single_inode() > ...writes 1024 pages. > if we haven't written everything in the inode (more than 1024 dirty > pages) we end up doing either requeue_io() or redirty_tail(). In the > first case the inode is put to b_more_io list, in the second case to > the tail of b_dirty list. In either case it will not receive further > writeout until we go through all other members of current b_io list. > > So I claim we currently *do* switch to another inode after 4 MB. That > is a fact. That is my understanding of how it works, too. 
> I *think* it is by design - mainly to avoid the situation where someone > continuously writes a huge file and kupdate or pdflush would never get to > writing other files with dirty data (at least that's impression I've built > over the years - heck, even 2.6.16 seems to have this redirty_tail logic > with a comment about the above livelock). Right, and there is another condition as well - writing lots of small files could starve large file writeback as the large file only got 4MB written back every 30s writeback period because it took that long to write out all the new files. IIRC it was in 2.6.16 that both these problems were fixed... > I do find this design broken as well as you likely do and think that the > livelock issue described in the above paragraph should be solved differently > (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix. > > The question is what to do now for 2.6.33 and 2.6.32-stable. Personally, > I think that changing the writeback logic so that it does not switch inodes > after 4 MB is too risky for these two kernels. So with the above > explanation would you accept some fix along the lines of original Jens' > fix? We've had this sync() writeback behaviour for a long time - my question is why is it only now a problem? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 1:33 ` Jan Kara 2010-02-17 1:57 ` Dave Chinner @ 2010-02-17 3:35 ` Linus Torvalds 2010-02-17 4:30 ` tytso 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2010-02-17 3:35 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Wed, 17 Feb 2010, Jan Kara wrote: > > I've read the code. Maybe I'm missing something but look: > writeback_inodes_wb(nr_to_write = 1024) > -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list > ... > writeback_single_inode() > ...writes 1024 pages. > if we haven't written everything in the inode (more than 1024 dirty > pages) we end up doing either requeue_io() or redirty_tail(). In the > first case the inode is put to b_more_io list, in the second case to > the tail of b_dirty list. In either case it will not receive further > writeout until we go through all other members of current b_io list. > > So I claim we currently *do* switch to another inode after 4 MB. That > is a fact. Ok, I think that's the bug. I do agree that it may well be intentional, but considering the performance impact, I suspect it's been "intentional without any performance numbers". Which just makes me very unhappy to just paper it over for the sync case, and leave the now known-broken state alone for the async case. That really isn't how we want to do things. That said, if we've done this forever, I can certainly see the allure to just keep doing it, and then handle the sync case separately. > I do find this design broken as well as you likely do and think that the > livelock issue described in the above paragraph should be solved differently > (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix. Hmm. The thing is, the new radix tree bit you propose also sounds like overdesigning things. 
If we really do switch inodes (which I obviously didn't expect, even if I may have been aware of it many years ago), then the max rate limiting is just always bad. If it's bad for synchronous syncs, then it's bad for background syncing too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely - since the whole latency argument goes away if we don't always honor it ("Oh, we have good latency - _except_ if you do 'sync()' to synchronously write something out" - that's just insane).

> The question is what to do now for 2.6.33 and 2.6.32-stable. Personally,
> I think that changing the writeback logic so that it does not switch inodes
> after 4 MB is too risky for these two kernels. So with the above
> explanation would you accept some fix along the lines of original Jens'
> fix?

What is affected if we just remove MAX_WRITEBACK_PAGES entirely (as opposed to the patch under discussion that effectively removes it for WB_SYNC_ALL)? I see balance_dirty_pages -> bdi_start_writeback, but that if anything would be something that I think would be better off with efficient writeback, and doesn't seem like it should try to round-robin over inodes for latency reasons. But I guess we can do it in stages, if it's about "minimal changes" for 2.6.32/33.

Linus
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 3:35 ` Linus Torvalds @ 2010-02-17 4:30 ` tytso 2010-02-17 5:16 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: tytso @ 2010-02-17 4:30 UTC (permalink / raw) To: Linus Torvalds Cc: Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue, Feb 16, 2010 at 07:35:35PM -0800, Linus Torvalds wrote: > > writeback_single_inode() > > ...writes 1024 pages. > > if we haven't written everything in the inode (more than 1024 dirty > > pages) we end up doing either requeue_io() or redirty_tail(). In the > > first case the inode is put to b_more_io list, in the second case to > > the tail of b_dirty list. In either case it will not receive further > > writeout until we go through all other members of current b_io list. > > > > So I claim we currently *do* switch to another inode after 4 MB. That > > is a fact. > > Ok, I think that's the bug. I do agree that it may well be intentional, > but considering the performance impact, I suspect it's been "intentional > without any performance numbers". This is well known amongst file system developers. We've even raised it from time to time, but apparently most people are too scared to touch the writeback code. I proposed upping the limit some six months ago, but I got serious pushback. As a result, I followed XFS's lead, and so now, both XFS and ext4 will write more pages than what is requested by the writeback logic, to work around this bug..... What we really want to do is to time how fast the device is. If the device is some Piece of Sh*t USB stick, then maybe you only want to write 4MB at a time to avoid latency problems. Heck, maybe you only want to write 32k at a time, if it's really slow.... But if it's some super-fast RAID array, maybe you want to write a lot more than 4MB at a time. 
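[Editorial note: a minimal sketch of the kind of device-speed scaling Ted suggests - purely illustrative, the writeback code has no such helper and the bounds are made up - sizing the chunk so one writeback pass costs roughly a fixed latency budget:]

```c
#include <assert.h>

/* Hypothetical helper: size a writeback chunk so that one pass takes
 * about target_ms of device time, clamped to sane bounds. */
static long chunk_bytes(long bw_bytes_per_sec, long target_ms)
{
	long chunk = bw_bytes_per_sec / 1000 * target_ms;

	if (chunk < 32 * 1024)			/* Ted's "really slow" floor */
		chunk = 32 * 1024;
	if (chunk > 128 * 1024 * 1024)		/* ext4-style 128MB cap */
		chunk = 128 * 1024 * 1024;
	return chunk;
}
```

A 400kB/s USB stick with a 100ms budget gets ~40kB chunks, a 500MB/s array gets 50MB chunks - the spread Ted is arguing for instead of one fixed 4MB constant.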
We've had this logic for a long time, and given the increase in disk density, and spindle speeds, the 4MB limit, which might have made sense 10 years ago, probably doesn't make sense now. > If it's bad for synchronous syncs, then it's bad for background syncing > too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely - > since the whole latency argument goes away if we don't always honor it > ("Oh, we have good latency - _except_ if you do 'sync()' to synchronously > write something out" - that's just insane). I tried arguing for this six months ago, and got the argument that it might cause latency problems on slow USB sticks. So I added a forced override for ext4, which now writes 128MB at a time --- with a sysfs tuning knob that allows the old behaviour to be restored if users really complained. No one did actually complain.... - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 4:30 ` tytso @ 2010-02-17 5:16 ` Linus Torvalds 2010-02-22 17:29 ` Jan Kara 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2010-02-17 5:16 UTC (permalink / raw) To: tytso; +Cc: Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue, 16 Feb 2010, tytso@mit.edu wrote: > > We've had this logic for a long time, and given the increase in disk > density, and spindle speeds, the 4MB limit, which might have made > sense 10 years ago, probably doesn't make sense now. I still don't think that 4MB is enough on its own to suck quite that much. Even a fast device should be perfectly happy with 4MB IOs, or it must be sucking really badly. In order to see the kinds of problems that got quoted in the original thread, there must be something else going on too, methinks (disk light was "blinking"). So I would guess that it's also getting stuck on that inode_wait_for_writeback(inode); inside that loop in wb_writeback(). In fact, I'm starting to wonder about that "Nothing written" case. The code basically decides that "if I wrote zero pages, I didn't write anything at all, so I must wait for the inode to complete old writes in order to not busy-loop". Which sounds sensible on the face of it, but the thing is, inodes can be dirty without actually having any dirty _pages_ associated with them. Are we perhaps ending up in a situation where we essentially wait synchronously on just the inode itself being written out? That would explain the "40kB/s" kind of behavior. If we were actually doing real 4MB chunks, that would _not_ explain 40kB/s throughput. But if we do a 4MB chunk (for the one file that had real dirty data in it), and then do a few hundred trivial "write out the inode data _synchronously_" (due to access time changes etc) in between until we hit the file that has real dirty data again - now _that_ would explain 40kB/s throughput. 
It's not just seeking around - it's not even trying to push multiple IO's to get any elevator going or anything like that. And then the patch that started this discussion makes sense: it improves performance because in between those synchronous inode updates it now writes big chunks. But again, it's mostly hiding us just doing insane things. I dunno. Just a theory. The more I look at that code, the uglier it looks. And I do get the feeling that the "4MB chunking" is really just making the more fundamental problems show up. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
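[Editorial note: Linus's theory is easy to sanity-check with arithmetic - a toy model with invented numbers, not measured data. One streamed 4MB chunk plus a few hundred synchronous inode writeouts, each stalling for a seek, transaction commit and cache flush, collapses throughput to roughly the figure reported in the thread:]

```c
#include <assert.h>

/* Effective throughput (kB/s) for one dirty file written as a single
 * streamed chunk, interleaved with nr_sync synchronous inode writeouts
 * that each stall for sync_ms (seek + commit + cache flush). */
static double effective_kbps(double chunk_kb, double stream_kbps,
			     int nr_sync, double sync_ms)
{
	double secs = chunk_kb / stream_kbps + nr_sync * sync_ms / 1000.0;

	return chunk_kb / secs;
}
```

A 4MB chunk streamed at 60MB/s with 500 intervening 20ms synchronous inode writes comes out around 400kB/s, while the same chunk with no synchronous stalls runs at full device speed - so under this model the stalls, not the 4MB chunking itself, dominate.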
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-17 5:16 ` Linus Torvalds @ 2010-02-22 17:29 ` Jan Kara 2010-02-22 21:01 ` tytso 0 siblings, 1 reply; 39+ messages in thread From: Jan Kara @ 2010-02-22 17:29 UTC (permalink / raw) To: Linus Torvalds Cc: tytso, Jan Kara, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue 16-02-10 21:16:46, Linus Torvalds wrote: > On Tue, 16 Feb 2010, tytso@mit.edu wrote: > > > > We've had this logic for a long time, and given the increase in disk > > density, and spindle speeds, the 4MB limit, which might have made > > sense 10 years ago, probably doesn't make sense now. > > I still don't think that 4MB is enough on its own to suck quite that > much. Even a fast device should be perfectly happy with 4MB IOs, or it > must be sucking really badly. > > In order to see the kinds of problems that got quoted in the original > thread, there must be something else going on too, methinks (disk light > was "blinking"). <snip> > Are we perhaps ending up in a situation where we essentially wait > synchronously on just the inode itself being written out? That would > explain the "40kB/s" kind of behavior. > > If we were actually doing real 4MB chunks, that would _not_ explain 40kB/s > throughput. Yes, it is actually 400kB/s but still you're right that that seems too low if the only problem were seeks. I was looking at the code and it's an even bigger mess than I thought :(. a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode (when 1024 is passed in nr_to_write). Writeback code kind of expects that in WB_SYNC_ALL mode all dirty pages in the given range are written (the same way as write_cache_pages does that). b) because of delayed allocation, inode is redirtied during ->writepages call and thus writeback_single_inode calls redirty_tail at it. Thus each inode will be written at least twice (synchronously, which means a transaction commit and a disk cache flush for each such write).
c) the livelock avoidance in writeback_inodes_wb does not work because the function exits as soon as nr_to_write (set to 1024) gets to 0 and thus the 'start' timestamp gets always renewed. d) ext4_writepage never succeeds to write a page with delayed-allocated data. So pageout() function never succeeds in cleaning a page on ext4. I think that when other problems in writeback code make writeout slow (like in Jan Engelhardt's case), this can bite us and I assume this might be the reason why Jan saw kswapd active doing some work during his problems. > But if we do a 4MB chunk (for the one file that had real dirty data in > it), and then do a few hundred trivial "write out the inode data > _synchronously_" (due to access time changes etc) in between until we hit > the file that has real dirty data again - now _that_ would explain 40kB/s > throughput. It's not just seeking around - it's not even trying to push > multiple IO's to get any elevator going or anything like that. > > And then the patch that started this discussion makes sense: it improves > performance because in between those synchronous inode updates it now > writes big chunks. But again, it's mostly hiding us just doing insane > things. I'm not quite sure whether some of the above problems are really causing the sync writeback to be so slow. At least problems a) and c) would be worked-around by the patch setting nr_to_write to LONG_MAX for sync writeback and the effect of b) would be mitigated to just two synchronous inode writes instead of (dirty_pages + 8191) / 8192 + 1 sync writes. For c) I think the original patch is actually the right solution but I agree that it just hides the other problems... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
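[Editorial note: the arithmetic behind Jan's point b) can be spelled out with an illustrative helper - not a kernel function. With nr_to_write == 1024 bumped by ext4 to 8192 pages (32MB) per ->writepages round, every round redirties the inode and so costs one synchronous inode write:]

```c
#include <assert.h>

/* Synchronous inode writes per file during sync(1), per Jan's estimate:
 * one per 8192-page (32MB) ->writepages round, plus the final one. */
static long sync_inode_writes(long dirty_pages)
{
	return (dirty_pages + 8191) / 8192 + 1;
}
```

A 400MB dirty file (102400 pages) pays for 14 synchronous inode writes under the old behaviour, against the two that remain once nr_to_write is effectively unlimited for sync - and each saved write is a transaction commit plus a disk cache flush.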
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-22 17:29 ` Jan Kara @ 2010-02-22 21:01 ` tytso 2010-02-22 22:26 ` Jan Kara 2010-02-23 2:53 ` Dave Chinner 0 siblings, 2 replies; 39+ messages in thread From: tytso @ 2010-02-22 21:01 UTC (permalink / raw) To: Jan Kara Cc: Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote: > > a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode > (when 1024 is passed in nr_to_write). Writeback code kind of expects that > in WB_SYNC_ALL mode all dirty pages in the given range are written (the > same way as write_cache_pages does that). Well, we return after writing 128MB because of the magic s_max_writeback_mb_bump. The fact that nr_to_write limits the number of pages which are written is something which is intentional to the writeback code. I've disagreed with it, but I don't think it would be legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that what you are saying we should do? (If it is indeed legit to ignore nr_to_write, I would have done it a long time ago; I introduced s_max_writeback_mb_bump instead as a workaround to what I consider to be a serious misfeature in the writeback code.) > b) because of delayed allocation, inode is redirtied during ->writepages > call and thus writeback_single_inode calls redirty_tail at it. Thus each > inode will be written at least twice (synchronously, which means a > transaction commit and a disk cache flush for each such write). Hmm, does this happen with XFS, too? If not, I wonder how they handle it? And whether we need to push a solution into the generic layers. > d) ext4_writepage never succeeds to write a page with delayed-allocated > data. So pageout() function never succeeds in cleaning a page on ext4. 
> I think that when other problems in writeback code make writeout slow (like > in Jan Engelhardt's case), this can bite us and I assume this might be the > reason why Jan saw kswapd active doing some work during his problems. Yeah, I've noticed this. What it means is that if we have a massive memory pressure in a particular zone, pages which are subject to delayed allocation won't get written out by mm/vmscan.c. Anonymous pages will be written out to swap, and data pages which are re-written via random access mmap() (and so we know where they will be written on disk) will get written, and that's not a problem. So with relatively large zones, it happens, but most of the time I don't think it's a major problem. I am worried about this issue in certain configurations where pseudo NUMA zones have been created and are artificially really tiny (128MB) for container support, but that's not a standard upstream thing. This is done to avoid a lock inversion, and so this is an ext4-specific thing (at least I don't think XFS's delayed allocation has this misfeature). It would be interesting if we have documented evidence that this is easily triggered under normal situations. If so, we should look into figuring out how to fix this... - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-22 21:01 ` tytso @ 2010-02-22 22:26 ` Jan Kara 2010-02-23 2:53 ` Dave Chinner 1 sibling, 0 replies; 39+ messages in thread From: Jan Kara @ 2010-02-22 22:26 UTC (permalink / raw) To: tytso Cc: Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Mon 22-02-10 16:01:12, tytso@mit.edu wrote: > On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote: > > > > a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode > > (when 1024 is passed in nr_to_write). Writeback code kind of expects that > > in WB_SYNC_ALL mode all dirty pages in the given range are written (the > > same way as write_cache_pages does that). > > Well, we return after writing 128MB because of the magic I think it's really just 32 MB because writeback code passes nr_to_write 1024 -> ext4 bumps that to 8192 pages which is 32 MB. As I've understood 128 MB is just an absolute maximum over which we never bump. > s_max_writeback_mb_bump. The fact that nr_to_write limits the number > of pages which are written is something which is intentional to the > writeback code. I've disagreed with it, but I don't think it would be > legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that > what you are saying we should do? (If it is indeed legit to ignore The behavior of nr_to_write for WB_SYNC_ALL mode was always kind of fuzzy. Until Jens rewrote the writeback code, we never passed anything different than LONG_MAX (or similarly huge values) when doing WB_SYNC_ALL writeback. Also write_cache_pages which is used as writepages implementation for most filesystems (ext2, ext3, udf, ...) ignores nr_to_write for WB_SYNC_ALL writeback. So until recently we actually never had to decide whether write_cache_pages behavior is a bug or de-facto standard... 
Since it's hard to avoid livelocking problems when synchronous writeback would stop earlier than after writing the whole range, I'm inclined to standardize on ignoring the nr_to_write limit for synchronous writeback. > nr_to_write, I would have done it a long time ago; I introduced > s_max_writeback_mb_bump instead as a workaround to what I consider to > be a serious misfeature in the writeback code.) I agree that nr_to_write currently brings more problems than advantages. > > b) because of delayed allocation, inode is redirtied during ->writepages > > call and thus writeback_single_inode calls redirty_tail at it. Thus each > > inode will be written at least twice (synchronously, which means a > > transaction commit and a disk cache flush for each such write). > > Hmm, does this happen with XFS, too? If not, I wonder how they handle > it? And whether we need to push a solution into the generic layers. Yes, I think so. As soon as you mark the inode dirty during the writepages call, the dirty flag won't be cleared even though ->write_inode will be called after that. So the right fix would IMO be to move clearing of the I_DIRTY_SYNC and I_DIRTY_DATASYNC flags after the writepages call. Or we could even put some dirty bit handling inside of ->write_inode because sometimes it would be useful if a filesystem would know in its ->write_inode method what dirty bits were set... Another thing is that in particular in the ext[34] case, we would be much better off with submitting all the inode writes and then waiting just once. This is actually what should usually happen because sync calls asynchronous writeback first and only after it a synchronous one is called. But when the machine is under a constant IO load and the livelock avoidance code is invalidated by passing nr_to_write == 1024 to sync writeback, we end up doing a lot of sync IO and thus this problem becomes much more visible. > > d) ext4_writepage never succeeds to write a page with delayed-allocated > > data.
> > So pageout() function never succeeds in cleaning a page on ext4. > > I think that when other problems in writeback code make writeout slow (like > > in Jan Engelhardt's case), this can bite us and I assume this might be the > > reason why Jan saw kswapd active doing some work during his problems. > > Yeah, I've noticed this. What it means is that if we have a massive > memory pressure in a particular zone, pages which are subject to > delayed allocation won't get written out by mm/vmscan.c. Anonymous > pages will be written out to swap, and data pages which are re-written > via random access mmap() (and so we know where they will be written on > disk) will get written, and that's not a problem. So with relatively > large zones, it happens, but most of the time I don't think it's a > major problem. Yes, I also don't think it's a major problem in common scenarios. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
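[Editorial note: the payoff of Jan's "submit everything, then wait once" point can be shown with a toy cost model - invented numbers, not measurements:]

```c
#include <assert.h>

/* Toy model: total milliseconds to sync-write n inodes when submitting
 * each write costs submit_ms and each synchronous wait (transaction
 * commit + disk cache flush) costs flush_ms. */
static double sync_cost_ms(int n, double submit_ms, double flush_ms,
			   int batched)
{
	if (batched)			/* submit all, then wait once */
		return n * submit_ms + flush_ms;
	return n * (submit_ms + flush_ms);	/* wait after every inode */
}
```

For 100 inodes at 1ms per submit and 20ms per flush, the wait-per-inode variant costs 2100ms against 120ms batched - which is why falling out of the asynchronous pass and into per-inode synchronous IO hurts so much.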
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-22 21:01 ` tytso 2010-02-22 22:26 ` Jan Kara @ 2010-02-23 2:53 ` Dave Chinner 2010-02-23 3:23 ` tytso 1 sibling, 1 reply; 39+ messages in thread From: Dave Chinner @ 2010-02-23 2:53 UTC (permalink / raw) To: tytso, Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Mon, Feb 22, 2010 at 04:01:12PM -0500, tytso@mit.edu wrote: > On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote: > > > > a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode > > (when 1024 is passed in nr_to_write). Writeback code kind of expects that > > in WB_SYNC_ALL mode all dirty pages in the given range are written (the > > same way as write_cache_pages does that). > > Well, we return after writing 128MB because of the magic > s_max_writeback_mb_bump. The fact that nr_to_write limits the number > of pages which are written is something which is intentional to the > writeback code. I've disagreed with it, but I don't think it would be > legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that > what you are saying we should do? (If it is indeed legit to ignore > nr_to_write, I would have done it a long time ago; I introduced > s_max_writeback_mb_bump instead as a workaround to what I consider to > be a serious misfeature in the writeback code.) Ignoring nr_to_write completely can lead to issues like we used to have with XFS - it would write an entire extent (8GB) at a time and starve all other writeback. Those starvation problems - which were very obvious on NFS servers - went away when we trimmed back the amount to write in a single pass to saner amounts... > > b) because of delayed allocation, inode is redirtied during ->writepages > > call and thus writeback_single_inode calls redirty_tail at it. Thus each > > inode will be written at least twice (synchronously, which means a > > transaction commit and a disk cache flush for each such write). 
> > Hmm, does this happen with XFS, too? If not, I wonder how they handle > it? And whether we need to push a solution into the generic layers. Yes, it does - XFS dirties the inode during delayed allocation. In the worst case XFS writes back the inode twice, synchronously. However, XFS inodes carry their own dirty state and so the second flush is often a no-op because the inode is actually clean and not pinned in the log any more. The inode is only dirty the second time around if the file size was updated as a result of IO completion. FWIW, we've only just changed the XFS code to issue transactions for sync inode writes instead of writing the inode directly because of changes to the async inode writeback that avoid writing the inode over and over again while it still has dirty data on it (coming in 2.6.34). These changes mean we need to issue a transaction to capture any unlogged changes to the inode during sync writes, but we still avoid the second transaction if the XFS inode is cleaned by the first transaction. As to a generic solution, why do you think I've been advocating separate per-sb data sync and inode writeback methods that separate data writeback from inode writeback for so long? ;) > > d) ext4_writepage never succeeds to write a page with delayed-allocated > > data. So pageout() function never succeeds in cleaning a page on ext4. > > I think that when other problems in writeback code make writeout slow (like > > in Jan Engelhardt's case), this can bite us and I assume this might be the > > reason why Jan saw kswapd active doing some work during his problems. [...] > This is done to avoid a lock inversion, and so this is an > ext4-specific thing (at least I don't think XFS's delayed allocation > has this misfeature). Not that I know of, but then again I don't know what inversion ext4 is trying to avoid. Can you describe the inversion, Ted? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-23 2:53 ` Dave Chinner @ 2010-02-23 3:23 ` tytso 2010-02-23 5:53 ` Dave Chinner 0 siblings, 1 reply; 39+ messages in thread From: tytso @ 2010-02-23 3:23 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue, Feb 23, 2010 at 01:53:50PM +1100, Dave Chinner wrote: > > Ignoring nr_to_write completely can lead to issues like we used to > have with XFS - it would write an entire extent (8GB) at a time and > starve all other writeback. Those starvation problems - which were > very obvious on NFS servers - went away when we trimmed back the > amount to write in a single pass to saner amounts... How do you determine what a "sane amount" is? Is it something that is determined dynamically, or is it a hard-coded or manually tuned value? > As to a generic solution, why do you think I've been advocating > separate per-sb data sync and inode writeback methods that separate > data writeback from inode writeback for so long? ;) Heh. > > This is done to avoid a lock inversion, and so this is an > > ext4-specific thing (at least I don't think XFS's delayed allocation > > has this misfeature). > > Not that I know of, but then again I don't know what inversion ext4 > is trying to avoid. Can you describe the inversion, Ted? The locking order is journal_start_handle (starting a micro transaction in jbd) -> lock_page. A more detailed description of why this locking order is non-trivial for us to fix in ext4 can be found in the description of commit f0e6c985. Regards, - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-23 3:23 ` tytso @ 2010-02-23 5:53 ` Dave Chinner 2010-02-24 14:56 ` Jan Kara 0 siblings, 1 reply; 39+ messages in thread From: Dave Chinner @ 2010-02-23 5:53 UTC (permalink / raw) To: tytso, Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Mon, Feb 22, 2010 at 10:23:17PM -0500, tytso@mit.edu wrote: > On Tue, Feb 23, 2010 at 01:53:50PM +1100, Dave Chinner wrote: > > > > Ignoring nr_to_write completely can lead to issues like we used to > > have with XFS - it would write an entire extent (8GB) at a time and > > starve all other writeback. Those starvation problems - which were > > very obvious on NFS servers - went away when we trimmed back the > > amount to write in a single pass to saner amounts... > > How do you determine what a "sane amount" is? Is it something that is > determined dynamically, or is it a hard-coded or manually tuned value? "sane amount" was a hard coded mapping look-ahead value that limited the size of the allocation that was done. It got set to 64 pages, back in 2.6.16 (IIRC) so that we'd do at least a 256k allocation on x86 or 1MB on ia64 (which had 16k page size at the time). It was previously unbound, which is why XFS was mapping entire extents... Hence ->writepage() on XFS will write 64 pages at a time if there are contiguous pages in the page cache, and then go back up to write_cache_pages() for the loop to either terminate due to nr_to_write <= 0 or call ->writepage on the next dirty page not under writeback on the inode. This behaviour pretty much defines the _minimum IO size_ we'd issue if there are contiguous dirty pages in the page cache. It was (and still is) primarily to work around the deficiencies of the random single page writeback that the VM's memory reclaim algorithms are so fond of (another of my pet peeves).... 
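[Editorial note: Dave's figures follow directly from the 64-page look-ahead - trivial arithmetic, shown only to make the minimum-IO claim concrete:]

```c
#include <assert.h>

/* Minimum contiguous IO issued by XFS's ->writepage look-ahead when
 * there are contiguous dirty pages: 64 pages, whatever the page size. */
static long min_io_bytes(long page_size)
{
	return 64 * page_size;
}
```

64 x 4k pages gives the 256k x86 figure and 64 x 16k pages the 1MB ia64 figure Dave quotes.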
> > As to a generic solution, why do you think I've been advocating > > separate per-sb data sync and inode writeback methods that separate > > data writeback from inode writeback for so long? ;) > > Heh. > > > > This is done to avoid a lock inversion, and so this is an > > > ext4-specific thing (at least I don't think XFS's delayed allocation > > > has this misfeature). > > > > Not that I know of, but then again I don't know what inversion ext4 > > is trying to avoid. Can you describe the inversion, Ted? > > The locking order is journal_start_handle (starting a micro > transaction in jbd) -> lock_page. A more detailed description of why > this locking order is non-trivial for us to fix in ext4 can be found > in the description of commit f0e6c985. Nasty - you need to start a transaction before you lock pages for writeback and allocation, but ->writepage hands you a locked page. And you can't use an existing transaction handle open because you can't guarantee that you have journal credits reserved for the allocation? IIUC, ext3/4 has this problem due to the ordered data writeback constraints, right? Regardless of the reason for the ext4 behaviour, you are right about XFS - it doesn't have this particular problem with delalloc because it doesn't have any transaction/page lock ordering constraints in the transaction subsystem. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] writeback: Fix broken sync writeback 2010-02-23 5:53 ` Dave Chinner @ 2010-02-24 14:56 ` Jan Kara 0 siblings, 0 replies; 39+ messages in thread From: Jan Kara @ 2010-02-24 14:56 UTC (permalink / raw) To: Dave Chinner Cc: tytso, Jan Kara, Linus Torvalds, Jens Axboe, Linux Kernel, jengelh, stable, gregkh On Tue 23-02-10 16:53:35, Dave Chinner wrote: > > > > This is done to avoid a lock inversion, and so this is an > > > > ext4-specific thing (at least I don't think XFS's delayed allocation > > > > has this misfeature). > > > > > > Not that I know of, but then again I don't know what inversion ext4 > > > is trying to avoid. Can you describe the inversion, Ted? > > > > The locking order is journal_start_handle (starting a micro > > transaction in jbd) -> lock_page. A more detailed description of why > > this locking order is non-trivial for us to fix in ext4 can be found > > in the description of commit f0e6c985. > > Nasty - you need to start a transaction before you lock pages for > writeback and allocation, but ->writepage hands you a locked page. > And you can't use an existing transaction handle open because you > can't guarantee that you have journal credits reserved for the > allocation? Exactly. > IIUC, ext3/4 has this problem due to the ordered data writeback > constraints, right? Not quite. I don't know how XFS solves this but in ext3/4 starting a transaction can block (waiting for journal space) until all users of a previous transaction are done with it and we can commit it. Thus the transaction start / stop behave just like an ordinary lock. Because you need a transaction started when writing a page (for metadata updates) there is some lock ordering forced between a page lock and a transaction start / stop. Ext4 chose it to be transaction -> page lock (which makes writepages more efficient and writepage hard), ext3 has page lock -> transaction (so it has working ->writepage).
Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2011-01-21 15:09 UTC | newest] Thread overview: 39+ messages 2010-02-12 9:16 [PATCH] writeback: Fix broken sync writeback Jens Axboe 2010-02-12 15:45 ` Linus Torvalds 2010-02-13 12:58 ` Jan Engelhardt 2010-02-15 14:49 ` Jan Kara 2010-02-15 15:41 ` Jan Engelhardt 2010-02-15 15:58 ` Jan Kara 2010-06-27 16:44 ` Jan Engelhardt 2010-10-24 23:41 ` Sync writeback still broken Jan Engelhardt 2010-10-30 0:57 ` Linus Torvalds 2010-10-30 1:16 ` Linus Torvalds 2010-10-30 1:30 ` Linus Torvalds 2010-10-30 3:18 ` Andrew Morton 2010-10-30 13:15 ` Christoph Hellwig 2010-10-31 12:24 ` Jan Kara 2010-10-31 22:40 ` Jan Kara 2010-11-05 21:33 ` Jan Kara 2010-11-05 21:34 ` Jan Kara 2010-11-05 21:41 ` Linus Torvalds 2010-11-05 22:03 ` Jan Engelhardt 2010-11-07 12:57 ` Jan Kara 2011-01-20 22:50 ` Jan Engelhardt 2011-01-21 15:09 ` Jan Kara 2010-02-15 14:17 ` [PATCH] writeback: Fix broken sync writeback Jan Kara 2010-02-16 0:05 ` Linus Torvalds 2010-02-16 23:00 ` Jan Kara 2010-02-16 23:34 ` Linus Torvalds 2010-02-17 0:01 ` Linus Torvalds 2010-02-17 1:33 ` Jan Kara 2010-02-17 1:57 ` Dave Chinner 2010-02-17 3:35 ` Linus Torvalds 2010-02-17 4:30 ` tytso 2010-02-17 5:16 ` Linus Torvalds 2010-02-22 17:29 ` Jan Kara 2010-02-22 21:01 ` tytso 2010-02-22 22:26 ` Jan Kara 2010-02-23 2:53 ` Dave Chinner 2010-02-23 3:23 ` tytso 2010-02-23 5:53 ` Dave Chinner 2010-02-24 14:56 ` Jan Kara