* [PATCH 0/3] sync livelock fixes v2
@ 2011-05-02 3:17 Wu Fengguang
2011-05-02 3:17 ` [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage Wu Fengguang
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-05-02 3:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Wu Fengguang, LKML,
linux-fsdevel
Andrew,
changes from v1:
- retain tagged sync behavior for all WB_SYNC_ALL callers (Dave Chinner)
- rename wbc.for_sync to wbc.tagged_sync
- remove the changes to trace events
The below sync livelock test script is working fine after this patchset:
sync time: 2
Dirty: 26492 kB
Writeback: 30260 kB
NFS_Unstable: 0 kB
WritebackTmp: 0 kB
sync NOT livelocked
In particular patch 2 fixes the sync livelock problem introduced by patch
"writeback: try more writeback as long as something was written".
Thanks,
Fengguang
---
#!/bin/sh
umount /dev/sda7
# mkfs.xfs -f /dev/sda7
mkfs.ext4 /dev/sda7
mount /dev/sda7 /fs
echo $((50<<20)) > /proc/sys/vm/dirty_bytes
pid=
for i in `seq 10`
do
dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 &
pid="$pid $!"
done
sleep 1
tic=$(date +'%s')
sync
tac=$(date +'%s')
echo
echo sync time: $((tac-tic))
egrep '(Dirty|Writeback|NFS_Unstable)' /proc/meminfo
pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; }
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage
2011-05-02 3:17 [PATCH 0/3] sync livelock fixes v2 Wu Fengguang
@ 2011-05-02 3:17 ` Wu Fengguang
2011-05-04 21:00 ` Jan Kara
2011-05-02 3:17 ` [PATCH 2/3] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
2011-05-02 3:17 ` [PATCH 3/3] writeback: avoid extra sync work at enqueue time Wu Fengguang
2 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2011-05-02 3:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig, LKML,
linux-fsdevel
[-- Attachment #1: writeback-for-sync.patch --]
[-- Type: text/plain, Size: 4449 bytes --]
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
livelock prevention for it, too.
Note that writeback_inodes_sb() is called by not only sync(), they are
treated the same because the other callers need also need livelock
prevention.
Impacts:
- it changes the order in which pages/inodes are synced to disk. Now in
the WB_SYNC_NONE stage, it won't proceed to write the next inode until
finished with the current inode.
- this adds a new field to the writeback trace events and may possibly
break some scripts.
CC: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/ext4/inode.c | 4 ++--
fs/fs-writeback.c | 9 +++++----
include/linux/writeback.h | 1 +
mm/page-writeback.c | 4 ++--
4 files changed, 10 insertions(+), 8 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:16:12.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:33.000000000 +0800
@@ -36,6 +36,7 @@ struct wb_writeback_work {
long nr_pages;
struct super_block *sb;
enum writeback_sync_modes sync_mode;
+ unsigned int tagged_sync:1;
unsigned int for_kupdate:1;
unsigned int range_cyclic:1;
unsigned int for_background:1;
@@ -644,6 +645,7 @@ static long wb_writeback(struct bdi_writ
{
struct writeback_control wbc = {
.sync_mode = work->sync_mode,
+ .tagged_sync = work->tagged_sync,
.older_than_this = NULL,
.for_kupdate = work->for_kupdate,
.for_background = work->for_background,
@@ -651,7 +653,7 @@ static long wb_writeback(struct bdi_writ
};
unsigned long oldest_jif;
long wrote = 0;
- long write_chunk;
+ long write_chunk = MAX_WRITEBACK_PAGES;
struct inode *inode;
if (!wbc.range_cyclic) {
@@ -672,9 +674,7 @@ static long wb_writeback(struct bdi_writ
* (quickly) tag currently dirty pages
* (maybe slowly) sync all tagged pages
*/
- if (wbc.sync_mode == WB_SYNC_NONE)
- write_chunk = MAX_WRITEBACK_PAGES;
- else
+ if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
write_chunk = LONG_MAX;
wbc.wb_start = jiffies; /* livelock avoidance */
@@ -1209,6 +1209,7 @@ void writeback_inodes_sb_nr(struct super
struct wb_writeback_work work = {
.sb = sb,
.sync_mode = WB_SYNC_NONE,
+ .tagged_sync = 1,
.done = &done,
.nr_pages = nr,
};
--- linux-next.orig/include/linux/writeback.h 2011-05-02 11:16:12.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-05-02 11:17:31.000000000 +0800
@@ -46,6 +46,7 @@ struct writeback_control {
unsigned encountered_congestion:1; /* An output: a queue is full */
unsigned for_kupdate:1; /* A kupdate writeback */
unsigned for_background:1; /* A background writeback */
+ unsigned tagged_sync:1; /* do livelock prevention for sync */
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned more_io:1; /* more io to be dispatched */
--- linux-next.orig/mm/page-writeback.c 2011-05-02 11:07:57.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-05-02 11:16:12.000000000 +0800
@@ -892,12 +892,12 @@ int write_cache_pages(struct address_spa
range_whole = 1;
cycled = 1; /* ignore range_cyclic tests */
}
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
retry:
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && (index <= end)) {
--- linux-next.orig/fs/ext4/inode.c 2011-05-02 11:07:57.000000000 +0800
+++ linux-next/fs/ext4/inode.c 2011-05-02 11:16:12.000000000 +0800
@@ -2741,7 +2741,7 @@ static int write_cache_pages_da(struct a
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
@@ -2975,7 +2975,7 @@ static int ext4_da_writepages(struct add
}
retry:
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
tag_pages_for_writeback(mapping, index, end);
while (!ret && wbc->nr_to_write > 0) {
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 2/3] writeback: update dirtied_when for synced inode to prevent livelock
2011-05-02 3:17 [PATCH 0/3] sync livelock fixes v2 Wu Fengguang
2011-05-02 3:17 ` [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage Wu Fengguang
@ 2011-05-02 3:17 ` Wu Fengguang
2011-05-04 21:10 ` Jan Kara
2011-05-02 3:17 ` [PATCH 3/3] writeback: avoid extra sync work at enqueue time Wu Fengguang
2 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2011-05-02 3:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig, LKML,
linux-fsdevel
[-- Attachment #1: writeback-fix-sync-busyloop.patch --]
[-- Type: text/plain, Size: 3619 bytes --]
With the more aggressive "keep writeback as long as we wrote something"
logic in wb_writeback(), the "use LONG_MAX .nr_to_write" trick in commit
b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") is
no longer enough to stop sync livelock.
The fix is to explicitly update .dirtied_when on synced inodes, so that
they are no longer considered for writeback in the next round.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 9 +++++++++
1 file changed, 9 insertions(+)
ext3/ext4 are working fine now, however tests show that XFS may still
livelock inside the XFS routines:
[ 3581.181253] sync D ffff8800b6ca15d8 4560 4403 4392 0x00000000
[ 3581.181734] ffff88006f775bc8 0000000000000046 ffff8800b6ca12b8 00000001b6ca1938
[ 3581.182411] ffff88006f774000 00000000001d2e40 00000000001d2e40 ffff8800b6ca1280
[ 3581.183088] 00000000001d2e40 ffff88006f775fd8 00000340af111ef2 00000000001d2e40
[ 3581.183765] Call Trace:
[ 3581.184008] [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.184392] [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.184756] [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.185120] [<ffffffff812ed520>] xfs_ioend_wait+0x87/0x9f
[ 3581.185474] [<ffffffff8108c97a>] ? wake_up_bit+0x2a/0x2a
[ 3581.185827] [<ffffffff812f742a>] xfs_sync_inode_data+0x92/0x9d
[ 3581.186198] [<ffffffff812f76e2>] xfs_inode_ag_walk+0x1a5/0x287
[ 3581.186569] [<ffffffff812f779b>] ? xfs_inode_ag_walk+0x25e/0x287
[ 3581.186946] [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.187311] [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.187669] [<ffffffff81092175>] ? local_clock+0x41/0x5a
[ 3581.188020] [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.188403] [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.188773] [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.189130] [<ffffffff812e236c>] ? xfs_perag_get+0x80/0xd0
[ 3581.189488] [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.189858] [<ffffffff812f7831>] ? xfs_inode_ag_iterator+0x6d/0x8f
[ 3581.190241] [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.190606] [<ffffffff812f780b>] xfs_inode_ag_iterator+0x47/0x8f
[ 3581.190982] [<ffffffff811611f5>] ? __sync_filesystem+0x7a/0x7a
[ 3581.191352] [<ffffffff812f7877>] xfs_sync_data+0x24/0x43
[ 3581.191703] [<ffffffff812f7911>] xfs_quiesce_data+0x2c/0x88
[ 3581.192065] [<ffffffff812f5556>] xfs_fs_sync_fs+0x21/0x48
[ 3581.192419] [<ffffffff811611e1>] __sync_filesystem+0x66/0x7a
[ 3581.192783] [<ffffffff8116120b>] sync_one_sb+0x16/0x18
[ 3581.193128] [<ffffffff8113e3e3>] iterate_supers+0x72/0xce
[ 3581.193482] [<ffffffff81161140>] sync_filesystems+0x20/0x22
[ 3581.193842] [<ffffffff8116127e>] sys_sync+0x21/0x33
[ 3581.194177] [<ffffffff819016c2>] system_call_fastpath+0x16/0x1b
--- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:16:50.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
@@ -431,6 +431,15 @@ writeback_single_inode(struct inode *ino
requeue_io(inode, wb);
} else {
/*
+ * sync livelock prevention: each inode is
+ * tagged and synced in one shot. If still
+ * dirty, move it back to s_dirty with updated
+ * dirty time to prevent syncing it again.
+ */
+ if (wbc->sync_mode == WB_SYNC_ALL ||
+ wbc->tagged_sync)
+ inode->dirtied_when = jiffies;
+ /*
* Writeback blocked by something other than
* congestion. Delay the inode for some time to
* avoid spinning on the CPU (100% iowait)
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-02 3:17 [PATCH 0/3] sync livelock fixes v2 Wu Fengguang
2011-05-02 3:17 ` [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage Wu Fengguang
2011-05-02 3:17 ` [PATCH 2/3] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
@ 2011-05-02 3:17 ` Wu Fengguang
2011-05-04 21:24 ` Jan Kara
2 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2011-05-02 3:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig, LKML,
linux-fsdevel
[-- Attachment #1: writeback-kill-wb_start.patch --]
[-- Type: text/plain, Size: 2337 bytes --]
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 17 ++++-------------
include/linux/writeback.h | 3 ---
2 files changed, 4 insertions(+), 16 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:27.000000000 +0800
@@ -544,15 +544,6 @@ static int writeback_sb_inodes(struct su
continue;
}
- /*
- * Was this inode dirtied after sync_sb_inodes was called?
- * This keeps sync from extra jobs and livelock.
- */
- if (inode_dirtied_after(inode, wbc->wb_start)) {
- spin_unlock(&inode->i_lock);
- return 1;
- }
-
__iget(inode);
pages_skipped = wbc->pages_skipped;
@@ -585,8 +576,6 @@ static void __writeback_inodes_wb(struct
{
int ret = 0;
- if (!wbc->wb_start)
- wbc->wb_start = jiffies; /* livelock avoidance */
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -683,10 +672,12 @@ static long wb_writeback(struct bdi_writ
* (quickly) tag currently dirty pages
* (maybe slowly) sync all tagged pages
*/
- if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
+ if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) {
write_chunk = LONG_MAX;
+ oldest_jif = jiffies;
+ wbc.older_than_this = &oldest_jif;
+ }
- wbc.wb_start = jiffies; /* livelock avoidance */
for (;;) {
/*
* Stop writeback when nr_pages has been consumed
--- linux-next.orig/include/linux/writeback.h 2011-05-02 11:16:12.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-05-02 11:17:27.000000000 +0800
@@ -26,9 +26,6 @@ struct writeback_control {
enum writeback_sync_modes sync_mode;
unsigned long *older_than_this; /* If !NULL, only write back inodes
older than this */
- unsigned long wb_start; /* Time writeback_inodes_wb was
- called. This is needed to avoid
- extra jobs and livelock */
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
long pages_skipped; /* Pages which were not written */
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage
2011-05-02 3:17 ` [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage Wu Fengguang
@ 2011-05-04 21:00 ` Jan Kara
2011-05-05 12:14 ` Wu Fengguang
0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2011-05-04 21:00 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel
On Mon 02-05-11 11:17:51, Wu Fengguang wrote:
> sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> livelock prevention for it, too.
>
> Note that writeback_inodes_sb() is called by not only sync(), they are
> treated the same because the other callers need also need livelock
> prevention.
I was thinking about this and could not find any - which other callers
of writeback_inodes_sb() need the livelock prevention?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] writeback: update dirtied_when for synced inode to prevent livelock
2011-05-02 3:17 ` [PATCH 2/3] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
@ 2011-05-04 21:10 ` Jan Kara
2011-05-05 12:18 ` Wu Fengguang
0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2011-05-04 21:10 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel
On Mon 02-05-11 11:17:52, Wu Fengguang wrote:
> With the more aggressive "keep writeback as long as we wrote something"
> logic in wb_writeback(), the "use LONG_MAX .nr_to_write" trick in commit
> b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") is
> no longer enough to stop sync livelock.
>
> The fix is to explicitly update .dirtied_when on synced inodes, so that
> they are no longer considered for writeback in the next round.
Looks good. But shouldn't you also update i_dirtied_when in the case where
!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) but inode->i_state & I_DIRTY?
Mostly for consistency since the chances of livelock because of this are
really close to zero.
Honza
>
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/fs-writeback.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> ext3/ext4 are working fine now, however tests show that XFS may still
> livelock inside the XFS routines:
>
> [ 3581.181253] sync D ffff8800b6ca15d8 4560 4403 4392 0x00000000
> [ 3581.181734] ffff88006f775bc8 0000000000000046 ffff8800b6ca12b8 00000001b6ca1938
> [ 3581.182411] ffff88006f774000 00000000001d2e40 00000000001d2e40 ffff8800b6ca1280
> [ 3581.183088] 00000000001d2e40 ffff88006f775fd8 00000340af111ef2 00000000001d2e40
> [ 3581.183765] Call Trace:
> [ 3581.184008] [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
> [ 3581.184392] [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
> [ 3581.184756] [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
> [ 3581.185120] [<ffffffff812ed520>] xfs_ioend_wait+0x87/0x9f
> [ 3581.185474] [<ffffffff8108c97a>] ? wake_up_bit+0x2a/0x2a
> [ 3581.185827] [<ffffffff812f742a>] xfs_sync_inode_data+0x92/0x9d
> [ 3581.186198] [<ffffffff812f76e2>] xfs_inode_ag_walk+0x1a5/0x287
> [ 3581.186569] [<ffffffff812f779b>] ? xfs_inode_ag_walk+0x25e/0x287
> [ 3581.186946] [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
> [ 3581.187311] [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
> [ 3581.187669] [<ffffffff81092175>] ? local_clock+0x41/0x5a
> [ 3581.188020] [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
> [ 3581.188403] [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
> [ 3581.188773] [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
> [ 3581.189130] [<ffffffff812e236c>] ? xfs_perag_get+0x80/0xd0
> [ 3581.189488] [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
> [ 3581.189858] [<ffffffff812f7831>] ? xfs_inode_ag_iterator+0x6d/0x8f
> [ 3581.190241] [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
> [ 3581.190606] [<ffffffff812f780b>] xfs_inode_ag_iterator+0x47/0x8f
> [ 3581.190982] [<ffffffff811611f5>] ? __sync_filesystem+0x7a/0x7a
> [ 3581.191352] [<ffffffff812f7877>] xfs_sync_data+0x24/0x43
> [ 3581.191703] [<ffffffff812f7911>] xfs_quiesce_data+0x2c/0x88
> [ 3581.192065] [<ffffffff812f5556>] xfs_fs_sync_fs+0x21/0x48
> [ 3581.192419] [<ffffffff811611e1>] __sync_filesystem+0x66/0x7a
> [ 3581.192783] [<ffffffff8116120b>] sync_one_sb+0x16/0x18
> [ 3581.193128] [<ffffffff8113e3e3>] iterate_supers+0x72/0xce
> [ 3581.193482] [<ffffffff81161140>] sync_filesystems+0x20/0x22
> [ 3581.193842] [<ffffffff8116127e>] sys_sync+0x21/0x33
> [ 3581.194177] [<ffffffff819016c2>] system_call_fastpath+0x16/0x1b
>
> --- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:16:50.000000000 +0800
> +++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
> @@ -431,6 +431,15 @@ writeback_single_inode(struct inode *ino
> requeue_io(inode, wb);
> } else {
> /*
> + * sync livelock prevention: each inode is
> + * tagged and synced in one shot. If still
> + * dirty, move it back to s_dirty with updated
> + * dirty time to prevent syncing it again.
> + */
> + if (wbc->sync_mode == WB_SYNC_ALL ||
> + wbc->tagged_sync)
> + inode->dirtied_when = jiffies;
> + /*
> * Writeback blocked by something other than
> * congestion. Delay the inode for some time to
> * avoid spinning on the CPU (100% iowait)
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-02 3:17 ` [PATCH 3/3] writeback: avoid extra sync work at enqueue time Wu Fengguang
@ 2011-05-04 21:24 ` Jan Kara
2011-05-05 12:27 ` Wu Fengguang
0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2011-05-04 21:24 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel
On Mon 02-05-11 11:17:53, Wu Fengguang wrote:
> This removes writeback_control.wb_start and does more straightforward
> sync livelock prevention by setting .older_than_this to prevent extra
> inodes from being enqueued in the first place.
>
> --- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
> +++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:27.000000000 +0800
> @@ -683,10 +672,12 @@ static long wb_writeback(struct bdi_writ
> * (quickly) tag currently dirty pages
> * (maybe slowly) sync all tagged pages
> */
> - if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> + if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) {
> write_chunk = LONG_MAX;
> + oldest_jif = jiffies;
> + wbc.older_than_this = &oldest_jif;
> + }
What are the implications of not doing dirty-time livelock avoidance for
other types of writeback? Is that a mistake? I'd prefer to have in
wb_writeback():
if (wbc.for_kupdate)
oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
else
oldest_jif = jiffies;
wbc.older_than_this = &oldest_jif;
And when you have this, you can make wbc.older_than_this just a plain
number and remove all those checks for wbc.older_than_this == NULL.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage
2011-05-04 21:00 ` Jan Kara
@ 2011-05-05 12:14 ` Wu Fengguang
2011-05-05 13:55 ` Jan Kara
0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2011-05-05 12:14 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 05:00:59AM +0800, Jan Kara wrote:
> On Mon 02-05-11 11:17:51, Wu Fengguang wrote:
> > sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> > WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> > livelock prevention for it, too.
> >
> > Note that writeback_inodes_sb() is called by not only sync(), they are
> > treated the same because the other callers need also need livelock
> > prevention.
> I was thinking about this and could not find any - which other callers
> of writeback_inodes_sb() need the livelock prevention?
For example, the writeback_inodes_sb_if_idle() call from ext4.
In general anyone that pass get_nr_dirty_pages() as work->nr_pages
may be highly over-estimating the work set.
It won't be directly livelocked since ext4 won't wait for completion,
however there is possibility the works queued behind are delayed and
livelocked.
Ideally simple ->nr_pages works should be given lower priority and
even may be merged with each other, and that would be future work.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] writeback: update dirtied_when for synced inode to prevent livelock
2011-05-04 21:10 ` Jan Kara
@ 2011-05-05 12:18 ` Wu Fengguang
0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-05-05 12:18 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 05:10:50AM +0800, Jan Kara wrote:
> On Mon 02-05-11 11:17:52, Wu Fengguang wrote:
> > With the more aggressive "keep writeback as long as we wrote something"
> > logic in wb_writeback(), the "use LONG_MAX .nr_to_write" trick in commit
> > b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") is
> > no longer enough to stop sync livelock.
> >
> > The fix is to explicitly update .dirtied_when on synced inodes, so that
> > they are no longer considered for writeback in the next round.
> Looks good. But shouldn't you also update i_dirtied_when in the case where
> !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) but inode->i_state & I_DIRTY?
> Mostly for consistency since the chances of livelock because of this are
> really close to zero.
Yes the I_DIRTY_SYNC may well be redirtied by the FS. It's very unlikely
to cause live locks, so I'm more than glad to kill the optional code :)
Thanks,
Fengguang
> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > fs/fs-writeback.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > ext3/ext4 are working fine now, however tests show that XFS may still
> > livelock inside the XFS routines:
> >
> > [ 3581.181253] sync D ffff8800b6ca15d8 4560 4403 4392 0x00000000
> > [ 3581.181734] ffff88006f775bc8 0000000000000046 ffff8800b6ca12b8 00000001b6ca1938
> > [ 3581.182411] ffff88006f774000 00000000001d2e40 00000000001d2e40 ffff8800b6ca1280
> > [ 3581.183088] 00000000001d2e40 ffff88006f775fd8 00000340af111ef2 00000000001d2e40
> > [ 3581.183765] Call Trace:
> > [ 3581.184008] [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
> > [ 3581.184392] [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
> > [ 3581.184756] [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
> > [ 3581.185120] [<ffffffff812ed520>] xfs_ioend_wait+0x87/0x9f
> > [ 3581.185474] [<ffffffff8108c97a>] ? wake_up_bit+0x2a/0x2a
> > [ 3581.185827] [<ffffffff812f742a>] xfs_sync_inode_data+0x92/0x9d
> > [ 3581.186198] [<ffffffff812f76e2>] xfs_inode_ag_walk+0x1a5/0x287
> > [ 3581.186569] [<ffffffff812f779b>] ? xfs_inode_ag_walk+0x25e/0x287
> > [ 3581.186946] [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
> > [ 3581.187311] [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
> > [ 3581.187669] [<ffffffff81092175>] ? local_clock+0x41/0x5a
> > [ 3581.188020] [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
> > [ 3581.188403] [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
> > [ 3581.188773] [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
> > [ 3581.189130] [<ffffffff812e236c>] ? xfs_perag_get+0x80/0xd0
> > [ 3581.189488] [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
> > [ 3581.189858] [<ffffffff812f7831>] ? xfs_inode_ag_iterator+0x6d/0x8f
> > [ 3581.190241] [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
> > [ 3581.190606] [<ffffffff812f780b>] xfs_inode_ag_iterator+0x47/0x8f
> > [ 3581.190982] [<ffffffff811611f5>] ? __sync_filesystem+0x7a/0x7a
> > [ 3581.191352] [<ffffffff812f7877>] xfs_sync_data+0x24/0x43
> > [ 3581.191703] [<ffffffff812f7911>] xfs_quiesce_data+0x2c/0x88
> > [ 3581.192065] [<ffffffff812f5556>] xfs_fs_sync_fs+0x21/0x48
> > [ 3581.192419] [<ffffffff811611e1>] __sync_filesystem+0x66/0x7a
> > [ 3581.192783] [<ffffffff8116120b>] sync_one_sb+0x16/0x18
> > [ 3581.193128] [<ffffffff8113e3e3>] iterate_supers+0x72/0xce
> > [ 3581.193482] [<ffffffff81161140>] sync_filesystems+0x20/0x22
> > [ 3581.193842] [<ffffffff8116127e>] sys_sync+0x21/0x33
> > [ 3581.194177] [<ffffffff819016c2>] system_call_fastpath+0x16/0x1b
> >
> > --- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:16:50.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
> > @@ -431,6 +431,15 @@ writeback_single_inode(struct inode *ino
> > requeue_io(inode, wb);
> > } else {
> > /*
> > + * sync livelock prevention: each inode is
> > + * tagged and synced in one shot. If still
> > + * dirty, move it back to s_dirty with updated
> > + * dirty time to prevent syncing it again.
> > + */
> > + if (wbc->sync_mode == WB_SYNC_ALL ||
> > + wbc->tagged_sync)
> > + inode->dirtied_when = jiffies;
> > + /*
> > * Writeback blocked by something other than
> > * congestion. Delay the inode for some time to
> > * avoid spinning on the CPU (100% iowait)
> >
> >
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-04 21:24 ` Jan Kara
@ 2011-05-05 12:27 ` Wu Fengguang
2011-05-05 12:41 ` Christoph Hellwig
2011-05-05 14:01 ` Jan Kara
0 siblings, 2 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-05-05 12:27 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 05:24:27AM +0800, Jan Kara wrote:
> On Mon 02-05-11 11:17:53, Wu Fengguang wrote:
> > This removes writeback_control.wb_start and does more straightforward
> > sync livelock prevention by setting .older_than_this to prevent extra
> > inodes from being enqueued in the first place.
> >
> > --- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:27.000000000 +0800
> > @@ -683,10 +672,12 @@ static long wb_writeback(struct bdi_writ
> > * (quickly) tag currently dirty pages
> > * (maybe slowly) sync all tagged pages
> > */
> > - if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> > + if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) {
> > write_chunk = LONG_MAX;
> > + oldest_jif = jiffies;
> > + wbc.older_than_this = &oldest_jif;
> > + }
> What are the implications of not doing dirty-time livelock avoidance for
> other types of writeback? Is that a mistake? I'd prefer to have in
> wb_writeback():
> if (wbc.for_kupdate)
> oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
> else
> oldest_jif = jiffies;
> wbc.older_than_this = &oldest_jif;
>
> And when you have this, you can make wbc.older_than_this just a plain
> number and remove all those checks for wbc.older_than_this == NULL.
Good point. Here is the fixed patch. Will you send the patch to change
the type when the current patches are settled down?
Thanks,
Fengguang
---
Subject: writeback: avoid extra sync work at enqueue time
Date: Sat Apr 23 11:26:07 CST 2011
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 16 +++-------------
include/linux/writeback.h | 3 ---
2 files changed, 3 insertions(+), 16 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-05-04 20:19:20.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-05 13:15:42.000000000 +0800
@@ -544,15 +544,6 @@ static int writeback_sb_inodes(struct su
continue;
}
- /*
- * Was this inode dirtied after sync_sb_inodes was called?
- * This keeps sync from extra jobs and livelock.
- */
- if (inode_dirtied_after(inode, wbc->wb_start)) {
- spin_unlock(&inode->i_lock);
- return 1;
- }
-
__iget(inode);
pages_skipped = wbc->pages_skipped;
@@ -585,9 +576,6 @@ static void __writeback_inodes_wb(struct
{
int ret = 0;
- if (!wbc->wb_start)
- wbc->wb_start = jiffies; /* livelock avoidance */
-
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
struct super_block *sb = inode->i_sb;
@@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
write_chunk = LONG_MAX;
- wbc.wb_start = jiffies; /* livelock avoidance */
+ oldest_jif = jiffies;
+ wbc.older_than_this = &oldest_jif;
+
for (;;) {
/*
* Stop writeback when nr_pages has been consumed
--- linux-next.orig/include/linux/writeback.h 2011-05-04 20:19:20.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-05-05 13:14:53.000000000 +0800
@@ -26,9 +26,6 @@ struct writeback_control {
enum writeback_sync_modes sync_mode;
unsigned long *older_than_this; /* If !NULL, only write back inodes
older than this */
- unsigned long wb_start; /* Time writeback_inodes_wb was
- called. This is needed to avoid
- extra jobs and livelock */
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
long pages_skipped; /* Pages which were not written */
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-05 12:27 ` Wu Fengguang
@ 2011-05-05 12:41 ` Christoph Hellwig
2011-05-05 12:42 ` Christoph Hellwig
2011-05-05 14:01 ` Jan Kara
1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2011-05-05 12:41 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 08:27:32PM +0800, Wu Fengguang wrote:
> - wbc.wb_start = jiffies; /* livelock avoidance */
> + oldest_jif = jiffies;
> + wbc.older_than_this = &oldest_jif;
Can't you kill the oldest_jif variable now?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-05 12:41 ` Christoph Hellwig
@ 2011-05-05 12:42 ` Christoph Hellwig
2011-05-05 12:48 ` Wu Fengguang
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2011-05-05 12:42 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 02:41:22PM +0200, Christoph Hellwig wrote:
> On Thu, May 05, 2011 at 08:27:32PM +0800, Wu Fengguang wrote:
> > - wbc.wb_start = jiffies; /* livelock avoidance */
> > + oldest_jif = jiffies;
> > + wbc.older_than_this = &oldest_jif;
>
> Can't you kill the oldest_jif variable now?
Err, -ENOCOFFEE. You could only kill it if you implement Jan's suggestion
of not making older_than_this a pointer..
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-05 12:42 ` Christoph Hellwig
@ 2011-05-05 12:48 ` Wu Fengguang
0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-05-05 12:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jan Kara, Andrew Morton, Dave Chinner, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 08:42:37PM +0800, Christoph Hellwig wrote:
> On Thu, May 05, 2011 at 02:41:22PM +0200, Christoph Hellwig wrote:
> > On Thu, May 05, 2011 at 08:27:32PM +0800, Wu Fengguang wrote:
> > > - wbc.wb_start = jiffies; /* livelock avoidance */
> > > + oldest_jif = jiffies;
> > > + wbc.older_than_this = &oldest_jif;
> >
> > Can't you kill the oldest_jif variable now?
>
> Err, -ENOCOFFEE. You could only kill it if you implement Jan's suggestion
> of not making older_than_this a pointer..
Yes. And that won't be a straightforward code refactor. It will have
very small side effects, so we'd better kill it in a standalone patch.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage
2011-05-05 12:14 ` Wu Fengguang
@ 2011-05-05 13:55 ` Jan Kara
2011-05-05 14:06 ` Wu Fengguang
0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2011-05-05 13:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu 05-05-11 20:14:02, Wu Fengguang wrote:
> On Thu, May 05, 2011 at 05:00:59AM +0800, Jan Kara wrote:
> > On Mon 02-05-11 11:17:51, Wu Fengguang wrote:
> > > sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> > > WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> > > livelock prevention for it, too.
> > >
> > > Note that writeback_inodes_sb() is called by not only sync(), they are
> > > treated the same because the other callers need also need livelock
> > > prevention.
> > I was thinking about this and could not find any - which other callers
> > of writeback_inodes_sb() need the livelock prevention?
>
> For example, the writeback_inodes_sb_if_idle() call from ext4.
> In general anyone that pass get_nr_dirty_pages() as work->nr_pages
> may be highly over-estimating the work set.
OK, I see what you mean. I agree using tagging in these cases probably
makes sense.
> It won't be directly livelocked since ext4 won't wait for completion,
> however there is possibility the works queued behind are delayed and
> livelocked.
Actually it will, writeback_inodes_sb() waits for completion (because of
s_umount locking). ext4 should use writeback_inodes_sb_if_idle_nr() (with
some relatively small number) anyway to avoid the pauses.
> Ideally simple ->nr_pages works should be given lower priority and
> even may be merged with each other, and that would be future work.
Merging works wanting to do the same thing would be nice. I though about
it some time ago for a while but getting all the combinations right and
making the merging code resonably simple was hard so I postponed it for
later because it was not urgent.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-05 12:27 ` Wu Fengguang
2011-05-05 12:41 ` Christoph Hellwig
@ 2011-05-05 14:01 ` Jan Kara
2011-05-05 14:10 ` Wu Fengguang
1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2011-05-05 14:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu 05-05-11 20:27:32, Wu Fengguang wrote:
> On Thu, May 05, 2011 at 05:24:27AM +0800, Jan Kara wrote:
> > On Mon 02-05-11 11:17:53, Wu Fengguang wrote:
> > > This removes writeback_control.wb_start and does more straightforward
> > > sync livelock prevention by setting .older_than_this to prevent extra
> > > inodes from being enqueued in the first place.
> > >
> > > --- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
> > > +++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:27.000000000 +0800
> > > @@ -683,10 +672,12 @@ static long wb_writeback(struct bdi_writ
> > > * (quickly) tag currently dirty pages
> > > * (maybe slowly) sync all tagged pages
> > > */
> > > - if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> > > + if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) {
> > > write_chunk = LONG_MAX;
> > > + oldest_jif = jiffies;
> > > + wbc.older_than_this = &oldest_jif;
> > > + }
> > What are the implications of not doing dirty-time livelock avoidance for
> > other types of writeback? Is that a mistake? I'd prefer to have in
> > wb_writeback():
> > if (wbc.for_kupdate)
> > oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
> > else
> > oldest_jif = jiffies;
> > wbc.older_than_this = &oldest_jif;
> >
> > And when you have this, you can make wbc.older_than_this just a plain
> > number and remove all those checks for wbc.older_than_this == NULL.
>
> Good point. Here is the fixed patch. Will you send the patch to change
> the type when the current patches are settled down?
OK, I will do that.
> @@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
> if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> write_chunk = LONG_MAX;
>
> - wbc.wb_start = jiffies; /* livelock avoidance */
> + oldest_jif = jiffies;
> + wbc.older_than_this = &oldest_jif;
> +
I might be already confused with all the code moving around but won't
this overwrite the value set for the for_kupdate case?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage
2011-05-05 13:55 ` Jan Kara
@ 2011-05-05 14:06 ` Wu Fengguang
0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-05-05 14:06 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 09:55:08PM +0800, Jan Kara wrote:
> On Thu 05-05-11 20:14:02, Wu Fengguang wrote:
> > On Thu, May 05, 2011 at 05:00:59AM +0800, Jan Kara wrote:
> > > On Mon 02-05-11 11:17:51, Wu Fengguang wrote:
> > > > sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> > > > WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> > > > livelock prevention for it, too.
> > > >
> > > > Note that writeback_inodes_sb() is called by not only sync(), they are
> > > > treated the same because the other callers need also need livelock
> > > > prevention.
> > > I was thinking about this and could not find any - which other callers
> > > of writeback_inodes_sb() need the livelock prevention?
> >
> > For example, the writeback_inodes_sb_if_idle() call from ext4.
> > In general anyone that pass get_nr_dirty_pages() as work->nr_pages
> > may be highly over-estimating the work set.
> OK, I see what you mean. I agree using tagging in these cases probably
> makes sense.
>
> > It won't be directly livelocked since ext4 won't wait for completion,
> > however there is possibility the works queued behind are delayed and
> > livelocked.
> Actually it will, writeback_inodes_sb() waits for completion (because of
> s_umount locking). ext4 should use writeback_inodes_sb_if_idle_nr() (with
> some relatively small number) anyway to avoid the pauses.
Ah yes!
> > Ideally simple ->nr_pages works should be given lower priority and
> > even may be merged with each other, and that would be future work.
> Merging works wanting to do the same thing would be nice. I though about
> it some time ago for a while but getting all the combinations right and
> making the merging code resonably simple was hard so I postponed it for
> later because it was not urgent.
Agreed.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-05 14:01 ` Jan Kara
@ 2011-05-05 14:10 ` Wu Fengguang
2011-05-05 14:13 ` Wu Fengguang
2011-05-05 14:34 ` Jan Kara
0 siblings, 2 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-05-05 14:10 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 10:01:34PM +0800, Jan Kara wrote:
> On Thu 05-05-11 20:27:32, Wu Fengguang wrote:
> > On Thu, May 05, 2011 at 05:24:27AM +0800, Jan Kara wrote:
> > > On Mon 02-05-11 11:17:53, Wu Fengguang wrote:
> > > > This removes writeback_control.wb_start and does more straightforward
> > > > sync livelock prevention by setting .older_than_this to prevent extra
> > > > inodes from being enqueued in the first place.
> > > >
> > > > --- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
> > > > +++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:27.000000000 +0800
> > > > @@ -683,10 +672,12 @@ static long wb_writeback(struct bdi_writ
> > > > * (quickly) tag currently dirty pages
> > > > * (maybe slowly) sync all tagged pages
> > > > */
> > > > - if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> > > > + if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) {
> > > > write_chunk = LONG_MAX;
> > > > + oldest_jif = jiffies;
> > > > + wbc.older_than_this = &oldest_jif;
> > > > + }
> > > What are the implications of not doing dirty-time livelock avoidance for
> > > other types of writeback? Is that a mistake? I'd prefer to have in
> > > wb_writeback():
> > > if (wbc.for_kupdate)
> > > oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
> > > else
> > > oldest_jif = jiffies;
> > > wbc.older_than_this = &oldest_jif;
> > >
> > > And when you have this, you can make wbc.older_than_this just a plain
> > > number and remove all those checks for wbc.older_than_this == NULL.
> >
> > Good point. Here is the fixed patch. Will you send the patch to change
> > the type when the current patches are settled down?
> OK, I will do that.
Thank you.
> > @@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
> > if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> > write_chunk = LONG_MAX;
> >
> > - wbc.wb_start = jiffies; /* livelock avoidance */
> > + oldest_jif = jiffies;
> > + wbc.older_than_this = &oldest_jif;
> > +
> I might be already confused with all the code moving around but won't
> this overwrite the value set for the for_kupdate case?
It's the opposite -- it will be overwritten inside the loop by
for_kupdate, which may run for long time and hence need to update
oldest_jif from time to time.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-05 14:10 ` Wu Fengguang
@ 2011-05-05 14:13 ` Wu Fengguang
2011-05-05 14:34 ` Jan Kara
1 sibling, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2011-05-05 14:13 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 10:10:39PM +0800, Wu Fengguang wrote:
> On Thu, May 05, 2011 at 10:01:34PM +0800, Jan Kara wrote:
> > On Thu 05-05-11 20:27:32, Wu Fengguang wrote:
> > > On Thu, May 05, 2011 at 05:24:27AM +0800, Jan Kara wrote:
> > > > On Mon 02-05-11 11:17:53, Wu Fengguang wrote:
> > > > > This removes writeback_control.wb_start and does more straightforward
> > > > > sync livelock prevention by setting .older_than_this to prevent extra
> > > > > inodes from being enqueued in the first place.
> > > > >
> > > > > --- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
> > > > > +++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:27.000000000 +0800
> > > > > @@ -683,10 +672,12 @@ static long wb_writeback(struct bdi_writ
> > > > > * (quickly) tag currently dirty pages
> > > > > * (maybe slowly) sync all tagged pages
> > > > > */
> > > > > - if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> > > > > + if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) {
> > > > > write_chunk = LONG_MAX;
> > > > > + oldest_jif = jiffies;
> > > > > + wbc.older_than_this = &oldest_jif;
> > > > > + }
> > > > What are the implications of not doing dirty-time livelock avoidance for
> > > > other types of writeback? Is that a mistake? I'd prefer to have in
> > > > wb_writeback():
> > > > if (wbc.for_kupdate)
> > > > oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
> > > > else
> > > > oldest_jif = jiffies;
> > > > wbc.older_than_this = &oldest_jif;
> > > >
> > > > And when you have this, you can make wbc.older_than_this just a plain
> > > > number and remove all those checks for wbc.older_than_this == NULL.
> > >
> > > Good point. Here is the fixed patch. Will you send the patch to change
> > > the type when the current patches are settled down?
> > OK, I will do that.
>
> Thank you.
>
> > > @@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
> > > if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> > > write_chunk = LONG_MAX;
> > >
> > > - wbc.wb_start = jiffies; /* livelock avoidance */
> > > + oldest_jif = jiffies;
> > > + wbc.older_than_this = &oldest_jif;
> > > +
> > I might be already confused with all the code moving around but won't
> > this overwrite the value set for the for_kupdate case?
>
> It's the opposite -- it will be overwritten inside the loop by
> for_kupdate, which may run for long time and hence need to update
> oldest_jif from time to time.
The code is now:
oldest_jif = jiffies;
work->older_than_this = &oldest_jif;
for (;;) {
// ...
if (work->for_kupdate || work->for_background) {
oldest_jif = jiffies -
msecs_to_jiffies(dirty_expire_interval * 10);
work->older_than_this = &oldest_jif;
}
retry:
// ...
/*
* background writeback will start with expired inodes, and
* if none is found, fallback to all inodes. This order helps
* reduce the number of dirty pages reaching the end of LRU
* lists and cause trouble to the page reclaim.
*/
if (work->for_background &&
work->older_than_this &&
list_empty(&wb->b_io) &&
list_empty(&wb->b_more_io)) {
work->older_than_this = NULL;
goto retry;
}
// ...
}
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] writeback: avoid extra sync work at enqueue time
2011-05-05 14:10 ` Wu Fengguang
2011-05-05 14:13 ` Wu Fengguang
@ 2011-05-05 14:34 ` Jan Kara
1 sibling, 0 replies; 19+ messages in thread
From: Jan Kara @ 2011-05-05 14:34 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel@vger.kernel.org
On Thu 05-05-11 22:10:39, Wu Fengguang wrote:
> On Thu, May 05, 2011 at 10:01:34PM +0800, Jan Kara wrote:
> > On Thu 05-05-11 20:27:32, Wu Fengguang wrote:
> > > On Thu, May 05, 2011 at 05:24:27AM +0800, Jan Kara wrote:
> > > > On Mon 02-05-11 11:17:53, Wu Fengguang wrote:
> > > > > This removes writeback_control.wb_start and does more straightforward
> > > > > sync livelock prevention by setting .older_than_this to prevent extra
> > > > > inodes from being enqueued in the first place.
> > > > >
> > > > > --- linux-next.orig/fs/fs-writeback.c 2011-05-02 11:17:24.000000000 +0800
> > > > > +++ linux-next/fs/fs-writeback.c 2011-05-02 11:17:27.000000000 +0800
> > > > > @@ -683,10 +672,12 @@ static long wb_writeback(struct bdi_writ
> > > > > * (quickly) tag currently dirty pages
> > > > > * (maybe slowly) sync all tagged pages
> > > > > */
> > > > > - if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> > > > > + if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) {
> > > > > write_chunk = LONG_MAX;
> > > > > + oldest_jif = jiffies;
> > > > > + wbc.older_than_this = &oldest_jif;
> > > > > + }
> > > > What are the implications of not doing dirty-time livelock avoidance for
> > > > other types of writeback? Is that a mistake? I'd prefer to have in
> > > > wb_writeback():
> > > > if (wbc.for_kupdate)
> > > > oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
> > > > else
> > > > oldest_jif = jiffies;
> > > > wbc.older_than_this = &oldest_jif;
> > > >
> > > > And when you have this, you can make wbc.older_than_this just a plain
> > > > number and remove all those checks for wbc.older_than_this == NULL.
> > >
> > > Good point. Here is the fixed patch. Will you send the patch to change
> > > the type when the current patches are settled down?
> > OK, I will do that.
>
> Thank you.
>
> > > @@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
> > > if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> > > write_chunk = LONG_MAX;
> > >
> > > - wbc.wb_start = jiffies; /* livelock avoidance */
> > > + oldest_jif = jiffies;
> > > + wbc.older_than_this = &oldest_jif;
> > > +
> > I might be already confused with all the code moving around but won't
> > this overwrite the value set for the for_kupdate case?
>
> It's the opposite -- it will be overwritten inside the loop by
> for_kupdate, which may run for long time and hence need to update
> oldest_jif from time to time.
Oh, right. I guess I'll setup that git tree (at least for myself if for
nothing else) so that I can see what is actually the current code state.
Because I think I've already lost the track of all the patches...
I've looked at mmotm on the web but it seems to have older versions of the
patches and also some of the patches which I think Andrew took (like this
one) are not there. Can you please make latest versions of all the patches
available somewhere so that I don't have dig them from the mailing list
archives?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2011-05-05 14:34 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-02 3:17 [PATCH 0/3] sync livelock fixes v2 Wu Fengguang
2011-05-02 3:17 ` [PATCH 1/3] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage Wu Fengguang
2011-05-04 21:00 ` Jan Kara
2011-05-05 12:14 ` Wu Fengguang
2011-05-05 13:55 ` Jan Kara
2011-05-05 14:06 ` Wu Fengguang
2011-05-02 3:17 ` [PATCH 2/3] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
2011-05-04 21:10 ` Jan Kara
2011-05-05 12:18 ` Wu Fengguang
2011-05-02 3:17 ` [PATCH 3/3] writeback: avoid extra sync work at enqueue time Wu Fengguang
2011-05-04 21:24 ` Jan Kara
2011-05-05 12:27 ` Wu Fengguang
2011-05-05 12:41 ` Christoph Hellwig
2011-05-05 12:42 ` Christoph Hellwig
2011-05-05 12:48 ` Wu Fengguang
2011-05-05 14:01 ` Jan Kara
2011-05-05 14:10 ` Wu Fengguang
2011-05-05 14:13 ` Wu Fengguang
2011-05-05 14:34 ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).