* [PATCH 0/6] dirty inode lists time delay/ordering fixes
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, linux-kernel
Andrew,
Four bug fixes on the dirty inode lists :-)
They can be put immediately after the patch file named
writeback-fix-periodic-superblock-dirty-inode-flushing.patch:
[PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8
[PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes()
[PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
[PATCH 4/6] writeback: introduce writeback_control.more_io to indicate more io
Also the check_dirty_inode_list.patch should be updated in place:
[PATCH 5/6] check dirty inode list
And this one is required to avoid warnings from the above debug patch:
[PATCH 6/6] prevent time-ordering warnings
Although not yet tested in highly parallel write workloads, I'm pretty sure
that the time ordering bug is gone :)
Thank you,
Fengguang
--
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Andrew Morton, linux-kernel
[-- Attachment #1: inode-dirty-time-ordering-fix.patch --]
[-- Type: text/plain, Size: 5497 bytes --]
Streamline the management of dirty inode lists and fix time ordering bugs.
The writeback logic used to moving not-yet-expired dirty inodes from s_dirty to
s_io, *only to* move them back. The move-inodes-back-and-forth thing is a mess,
which is eliminated by this patch.
The new scheme is:
- s_dirty acts as a time ordered io delaying queue;
- s_io/s_more_io together acts as an io dispatching queue.
On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
every full scan of s_io. Otherwise(i.e. for sync/throttle/background
writeback), we always pull from s_dirty on each run(a partial scan).
Note that the line
list_splice_init(&sb->s_more_io, &sb->s_io);
is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
sit in s_io for a long time, preventing new expired inodes to get in.
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 60 +++++++++++++++++++++++++++-----------------
1 file changed, 38 insertions(+), 22 deletions(-)
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -118,7 +118,7 @@ void __mark_inode_dirty(struct inode *in
goto out;
/*
- * If the inode was already on s_dirty or s_io, don't
+ * If the inode was already on s_dirty/s_io/s_more_io, don't
* reposition it (that would break s_dirty time-ordering).
*/
if (!was_dirty) {
@@ -172,6 +172,33 @@ static void requeue_io(struct inode *ino
}
/*
+ * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
+ */
+static void move_expired_inodes(struct list_head *delaying_queue,
+ struct list_head *dispatch_queue,
+ unsigned long *older_than_this)
+{
+ while (!list_empty(delaying_queue)) {
+ struct inode *inode = list_entry(delaying_queue->prev,
+ struct inode, i_list);
+ if (older_than_this &&
+ time_after(inode->dirtied_when, *older_than_this))
+ break;
+ list_move(&inode->i_list, dispatch_queue);
+ }
+}
+
+/*
+ * Queue all expired dirty inodes for io, eldest first.
+ */
+static void queue_io(struct super_block *sb,
+ unsigned long *older_than_this)
+{
+ list_splice_init(&sb->s_more_io, sb->s_io.prev);
+ move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+}
+
+/*
* Write a single inode's dirty pages and inode data out to disk.
* If `wait' is set, wait on the writeout.
*
@@ -221,7 +248,7 @@ __sync_single_inode(struct inode *inode,
/*
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything. Redirty
- * the inode. It is moved from s_io onto s_dirty.
+ * the inode; Move it from s_io onto s_more_io/s_dirty.
*/
/*
* akpm: if the caller was the kupdate function we put
@@ -234,10 +261,9 @@ __sync_single_inode(struct inode *inode,
*/
if (wbc->for_kupdate) {
/*
- * For the kupdate function we leave the inode
- * at the head of sb_dirty so it will get more
- * writeout as soon as the queue becomes
- * uncongested.
+ * For the kupdate function we move the inode
+ * to s_more_io so it will get more writeout as
+ * soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
requeue_io(inode);
@@ -295,10 +321,10 @@ __writeback_single_inode(struct inode *i
/*
* We're skipping this inode because it's locked, and we're not
- * doing writeback-for-data-integrity. Move it to the head of
- * s_dirty so that writeback can proceed with the other inodes
- * on s_io. We'll have another go at writing back this inode
- * when the s_dirty iodes get moved back onto s_io.
+ * doing writeback-for-data-integrity. Move it to s_more_io so
+ * that writeback can proceed with the other inodes on s_io.
+ * We'll have another go at writing back this inode when we
+ * completed a full scan of s_io.
*/
requeue_io(inode);
@@ -365,7 +391,7 @@ sync_sb_inodes(struct super_block *sb, s
const unsigned long start = jiffies; /* livelock avoidance */
if (!wbc->for_kupdate || list_empty(&sb->s_io))
- list_splice_init(&sb->s_dirty, &sb->s_io);
+ queue_io(sb, wbc->older_than_this);
while (!list_empty(&sb->s_io)) {
struct inode *inode = list_entry(sb->s_io.prev,
@@ -410,13 +436,6 @@ sync_sb_inodes(struct super_block *sb, s
if (time_after(inode->dirtied_when, start))
break;
- /* Was this inode dirtied too recently? */
- if (wbc->older_than_this && time_after(inode->dirtied_when,
- *wbc->older_than_this)) {
- list_splice_init(&sb->s_io, sb->s_dirty.prev);
- break;
- }
-
/* Is another pdflush already flushing this queue? */
if (current_is_pdflush() && !writeback_acquire(bdi))
break;
@@ -446,9 +465,6 @@ sync_sb_inodes(struct super_block *sb, s
break;
}
- if (list_empty(&sb->s_io))
- list_splice_init(&sb->s_more_io, &sb->s_io);
-
return; /* Leave any unwritten inodes on s_io */
}
@@ -458,7 +474,7 @@ sync_sb_inodes(struct super_block *sb, s
* Note:
* We don't need to grab a reference to superblock here. If it has non-empty
* ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until both the ->s_dirty and ->s_io lists are
+ * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
* empty. Since __sync_single_inode() regains inode_lock before it finally moves
* inode from superblock lists we are OK.
*
--
^ permalink raw reply [flat|nested] 7+ messages in thread* [PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Andrew Morton, linux-kernel
[-- Attachment #1: inode-dirty-time-ordering-fix.patch --]
[-- Type: text/plain, Size: 5497 bytes --]
Streamline the management of dirty inode lists and fix time ordering bugs.
The writeback logic used to moving not-yet-expired dirty inodes from s_dirty to
s_io, *only to* move them back. The move-inodes-back-and-forth thing is a mess,
which is eliminated by this patch.
The new scheme is:
- s_dirty acts as a time ordered io delaying queue;
- s_io/s_more_io together acts as an io dispatching queue.
On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
every full scan of s_io. Otherwise(i.e. for sync/throttle/background
writeback), we always pull from s_dirty on each run(a partial scan).
Note that the line
list_splice_init(&sb->s_more_io, &sb->s_io);
is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
sit in s_io for a long time, preventing new expired inodes to get in.
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 60 +++++++++++++++++++++++++++-----------------
1 file changed, 38 insertions(+), 22 deletions(-)
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -118,7 +118,7 @@ void __mark_inode_dirty(struct inode *in
goto out;
/*
- * If the inode was already on s_dirty or s_io, don't
+ * If the inode was already on s_dirty/s_io/s_more_io, don't
* reposition it (that would break s_dirty time-ordering).
*/
if (!was_dirty) {
@@ -172,6 +172,33 @@ static void requeue_io(struct inode *ino
}
/*
+ * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
+ */
+static void move_expired_inodes(struct list_head *delaying_queue,
+ struct list_head *dispatch_queue,
+ unsigned long *older_than_this)
+{
+ while (!list_empty(delaying_queue)) {
+ struct inode *inode = list_entry(delaying_queue->prev,
+ struct inode, i_list);
+ if (older_than_this &&
+ time_after(inode->dirtied_when, *older_than_this))
+ break;
+ list_move(&inode->i_list, dispatch_queue);
+ }
+}
+
+/*
+ * Queue all expired dirty inodes for io, eldest first.
+ */
+static void queue_io(struct super_block *sb,
+ unsigned long *older_than_this)
+{
+ list_splice_init(&sb->s_more_io, sb->s_io.prev);
+ move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+}
+
+/*
* Write a single inode's dirty pages and inode data out to disk.
* If `wait' is set, wait on the writeout.
*
@@ -221,7 +248,7 @@ __sync_single_inode(struct inode *inode,
/*
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything. Redirty
- * the inode. It is moved from s_io onto s_dirty.
+ * the inode; Move it from s_io onto s_more_io/s_dirty.
*/
/*
* akpm: if the caller was the kupdate function we put
@@ -234,10 +261,9 @@ __sync_single_inode(struct inode *inode,
*/
if (wbc->for_kupdate) {
/*
- * For the kupdate function we leave the inode
- * at the head of sb_dirty so it will get more
- * writeout as soon as the queue becomes
- * uncongested.
+ * For the kupdate function we move the inode
+ * to s_more_io so it will get more writeout as
+ * soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
requeue_io(inode);
@@ -295,10 +321,10 @@ __writeback_single_inode(struct inode *i
/*
* We're skipping this inode because it's locked, and we're not
- * doing writeback-for-data-integrity. Move it to the head of
- * s_dirty so that writeback can proceed with the other inodes
- * on s_io. We'll have another go at writing back this inode
- * when the s_dirty iodes get moved back onto s_io.
+ * doing writeback-for-data-integrity. Move it to s_more_io so
+ * that writeback can proceed with the other inodes on s_io.
+ * We'll have another go at writing back this inode when we
+ * completed a full scan of s_io.
*/
requeue_io(inode);
@@ -365,7 +391,7 @@ sync_sb_inodes(struct super_block *sb, s
const unsigned long start = jiffies; /* livelock avoidance */
if (!wbc->for_kupdate || list_empty(&sb->s_io))
- list_splice_init(&sb->s_dirty, &sb->s_io);
+ queue_io(sb, wbc->older_than_this);
while (!list_empty(&sb->s_io)) {
struct inode *inode = list_entry(sb->s_io.prev,
@@ -410,13 +436,6 @@ sync_sb_inodes(struct super_block *sb, s
if (time_after(inode->dirtied_when, start))
break;
- /* Was this inode dirtied too recently? */
- if (wbc->older_than_this && time_after(inode->dirtied_when,
- *wbc->older_than_this)) {
- list_splice_init(&sb->s_io, sb->s_dirty.prev);
- break;
- }
-
/* Is another pdflush already flushing this queue? */
if (current_is_pdflush() && !writeback_acquire(bdi))
break;
@@ -446,9 +465,6 @@ sync_sb_inodes(struct super_block *sb, s
break;
}
- if (list_empty(&sb->s_io))
- list_splice_init(&sb->s_more_io, &sb->s_io);
-
return; /* Leave any unwritten inodes on s_io */
}
@@ -458,7 +474,7 @@ sync_sb_inodes(struct super_block *sb, s
* Note:
* We don't need to grab a reference to superblock here. If it has non-empty
* ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until both the ->s_dirty and ->s_io lists are
+ * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
* empty. Since __sync_single_inode() regains inode_lock before it finally moves
* inode from superblock lists we are OK.
*
--
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes()
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Anton Altaparmakov, Andrew Morton, linux-kernel
[-- Attachment #1: nfs-dirty-inodes.patch --]
[-- Type: text/plain, Size: 2520 bytes --]
NTFS's if-condition on dirty inodes is not complete.
Fix it with sb_has_dirty_inodes().
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
---
fs/fs-writeback.c | 10 +++++++++-
fs/ntfs/super.c | 4 ++--
include/linux/fs.h | 1 +
3 files changed, 12 insertions(+), 3 deletions(-)
--- linux-2.6.23-rc2-mm2.orig/fs/ntfs/super.c
+++ linux-2.6.23-rc2-mm2/fs/ntfs/super.c
@@ -2381,14 +2381,14 @@ static void ntfs_put_super(struct super_
*/
ntfs_commit_inode(vol->mft_ino);
write_inode_now(vol->mft_ino, 1);
- if (!list_empty(&sb->s_dirty)) {
+ if (sb_has_dirty_inodes(sb)) {
const char *s1, *s2;
mutex_lock(&vol->mft_ino->i_mutex);
truncate_inode_pages(vol->mft_ino->i_mapping, 0);
mutex_unlock(&vol->mft_ino->i_mutex);
write_inode_now(vol->mft_ino, 1);
- if (!list_empty(&sb->s_dirty)) {
+ if (sb_has_dirty_inodes(sb)) {
static const char *_s1 = "inodes";
static const char *_s2 = "";
s1 = _s1;
--- linux-2.6.23-rc2-mm2.orig/include/linux/fs.h
+++ linux-2.6.23-rc2-mm2/include/linux/fs.h
@@ -1712,6 +1712,7 @@ extern int bdev_read_only(struct block_d
extern int set_blocksize(struct block_device *, int);
extern int sb_set_blocksize(struct super_block *, int);
extern int sb_min_blocksize(struct super_block *, int);
+extern int sb_has_dirty_inodes(struct super_block *);
extern int generic_file_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -198,6 +198,14 @@ static void queue_io(struct super_block
move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
}
+int sb_has_dirty_inodes(struct super_block *sb)
+{
+ return !list_empty(&sb->s_dirty) ||
+ !list_empty(&sb->s_io) ||
+ !list_empty(&sb->s_more_io);
+}
+EXPORT_SYMBOL(sb_has_dirty_inodes);
+
/*
* Write a single inode's dirty pages and inode data out to disk.
* If `wait' is set, wait on the writeout.
@@ -497,7 +505,7 @@ writeback_inodes(struct writeback_contro
restart:
sb = sb_entry(super_blocks.prev);
for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
- if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
+ if (sb_has_dirty_inodes(sb)) {
/* we're making our own get_super here */
sb->s_count++;
spin_unlock(&sb_lock);
--
^ permalink raw reply [flat|nested] 7+ messages in thread* [PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes()
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Anton Altaparmakov, Andrew Morton, linux-kernel
[-- Attachment #1: nfs-dirty-inodes.patch --]
[-- Type: text/plain, Size: 2520 bytes --]
NTFS's if-condition on dirty inodes is not complete.
Fix it with sb_has_dirty_inodes().
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
---
fs/fs-writeback.c | 10 +++++++++-
fs/ntfs/super.c | 4 ++--
include/linux/fs.h | 1 +
3 files changed, 12 insertions(+), 3 deletions(-)
--- linux-2.6.23-rc2-mm2.orig/fs/ntfs/super.c
+++ linux-2.6.23-rc2-mm2/fs/ntfs/super.c
@@ -2381,14 +2381,14 @@ static void ntfs_put_super(struct super_
*/
ntfs_commit_inode(vol->mft_ino);
write_inode_now(vol->mft_ino, 1);
- if (!list_empty(&sb->s_dirty)) {
+ if (sb_has_dirty_inodes(sb)) {
const char *s1, *s2;
mutex_lock(&vol->mft_ino->i_mutex);
truncate_inode_pages(vol->mft_ino->i_mapping, 0);
mutex_unlock(&vol->mft_ino->i_mutex);
write_inode_now(vol->mft_ino, 1);
- if (!list_empty(&sb->s_dirty)) {
+ if (sb_has_dirty_inodes(sb)) {
static const char *_s1 = "inodes";
static const char *_s2 = "";
s1 = _s1;
--- linux-2.6.23-rc2-mm2.orig/include/linux/fs.h
+++ linux-2.6.23-rc2-mm2/include/linux/fs.h
@@ -1712,6 +1712,7 @@ extern int bdev_read_only(struct block_d
extern int set_blocksize(struct block_device *, int);
extern int sb_set_blocksize(struct super_block *, int);
extern int sb_min_blocksize(struct super_block *, int);
+extern int sb_has_dirty_inodes(struct super_block *);
extern int generic_file_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -198,6 +198,14 @@ static void queue_io(struct super_block
move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
}
+int sb_has_dirty_inodes(struct super_block *sb)
+{
+ return !list_empty(&sb->s_dirty) ||
+ !list_empty(&sb->s_io) ||
+ !list_empty(&sb->s_more_io);
+}
+EXPORT_SYMBOL(sb_has_dirty_inodes);
+
/*
* Write a single inode's dirty pages and inode data out to disk.
* If `wait' is set, wait on the writeout.
@@ -497,7 +505,7 @@ writeback_inodes(struct writeback_contro
restart:
sb = sb_entry(super_blocks.prev);
for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
- if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
+ if (sb_has_dirty_inodes(sb)) {
/* we're making our own get_super here */
sb->s_count++;
spin_unlock(&sb_lock);
--
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, David Chinner, Andrew Morton, linux-kernel
[-- Attachment #1: no-skipped.patch --]
[-- Type: text/plain, Size: 4428 bytes --]
Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug:
> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.
It can be produced by the following test scheme:
# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done
# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023
% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x
% dmesg|grep 847824 # generated by a debug printk
[ 529.263184] redirtied inode 847824 line 548
[ 564.250872] redirtied inode 847824 line 548
[ 594.272797] redirtied inode 847824 line 548
[ 629.231330] redirtied inode 847824 line 548
[ 659.224674] redirtied inode 847824 line 548
[ 689.219890] redirtied inode 847824 line 548
[ 724.226655] redirtied inode 847824 line 548
[ 759.198568] redirtied inode 847824 line 548
# line 548 in fs/fs-writeback.c:
543 if (wbc->pages_skipped != pages_skipped) {
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
548 redirty_tail(inode);
549 }
More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basicly no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.
Also the comment in generic_sync_sb_inodes():
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
and the comment in __block_write_full_page():
1713 /*
1714 * The page was marked dirty, but the buffers were
1715 * clean. Someone wrote them back by hand with
1716 * ll_rw_block/submit_bh. A rare case.
1717 */
do not quite agree with each other. The page writeback should be skipped for
'locked buffer', but here it is 'clean buffer'!
This patch fixes this bug. Though I'm not sure why __block_write_full_page()
is called only to do nothing and who actually issued the writeback for us.
This is the two possible new behaviors after the patch:
1) pretty nice: wait 30s and write ALL:)
2) not so good:
- during the dd: ~16M
- after 30s: ~4M
- after 5s: ~4M
- after 5s: ~176M
The next patch will fix case (2).
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/buffer.c | 1 -
1 file changed, 1 deletion(-)
--- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
+++ linux-2.6.23-rc2-mm2/fs/buffer.c
@@ -1713,7 +1713,6 @@ done:
* The page and buffer_heads can be released at any time from
* here on.
*/
- wbc->pages_skipped++; /* We didn't write this page */
}
return err;
--
^ permalink raw reply [flat|nested] 7+ messages in thread* [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, David Chinner, Andrew Morton, linux-kernel
[-- Attachment #1: no-skipped.patch --]
[-- Type: text/plain, Size: 4428 bytes --]
Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug:
> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.
It can be produced by the following test scheme:
# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done
# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023
% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x
% dmesg|grep 847824 # generated by a debug printk
[ 529.263184] redirtied inode 847824 line 548
[ 564.250872] redirtied inode 847824 line 548
[ 594.272797] redirtied inode 847824 line 548
[ 629.231330] redirtied inode 847824 line 548
[ 659.224674] redirtied inode 847824 line 548
[ 689.219890] redirtied inode 847824 line 548
[ 724.226655] redirtied inode 847824 line 548
[ 759.198568] redirtied inode 847824 line 548
# line 548 in fs/fs-writeback.c:
543 if (wbc->pages_skipped != pages_skipped) {
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
548 redirty_tail(inode);
549 }
More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basicly no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.
Also the comment in generic_sync_sb_inodes():
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
and the comment in __block_write_full_page():
1713 /*
1714 * The page was marked dirty, but the buffers were
1715 * clean. Someone wrote them back by hand with
1716 * ll_rw_block/submit_bh. A rare case.
1717 */
do not quite agree with each other. The page writeback should be skipped for
'locked buffer', but here it is 'clean buffer'!
This patch fixes this bug. Though I'm not sure why __block_write_full_page()
is called only to do nothing and who actually issued the writeback for us.
This is the two possible new behaviors after the patch:
1) pretty nice: wait 30s and write ALL:)
2) not so good:
- during the dd: ~16M
- after 30s: ~4M
- after 5s: ~4M
- after 5s: ~176M
The next patch will fix case (2).
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/buffer.c | 1 -
1 file changed, 1 deletion(-)
--- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
+++ linux-2.6.23-rc2-mm2/fs/buffer.c
@@ -1713,7 +1713,6 @@ done:
* The page and buffer_heads can be released at any time from
* here on.
*/
- wbc->pages_skipped++; /* We didn't write this page */
}
return err;
--
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 4/6] writeback: introduce writeback_control.more_io to indicate more io
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, David Chinner, Andrew Morton, linux-kernel
[-- Attachment #1: writeback-more-data.patch --]
[-- Type: text/plain, Size: 3850 bytes --]
After making dirty a 100M file, the normal behavior is to
start the writeback for all data after 30s delays. But
sometimes the following happens instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analyze shows that the internal io dispatch queues goes like this:
s_io s_more_io
-------------------------
1) 100M,1K 0
2) 1K 96M
3) 0 96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)
nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out. The big dirty file is actually still sitting in s_more_io. We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a flag
writeback_control.more_io to indicate this situation. With it the big dirty file
no longer has to wait for the next kupdate invocation 5s later.
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 2 ++
include/linux/writeback.h | 1 +
mm/page-writeback.c | 9 ++++++---
3 files changed, 9 insertions(+), 3 deletions(-)
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -472,6 +472,8 @@ sync_sb_inodes(struct super_block *sb, s
if (wbc->nr_to_write <= 0)
break;
}
+ if (!list_empty(&sb->s_more_io))
+ wbc->more_io = 1;
return; /* Leave any unwritten inodes on s_io */
}
--- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
+++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
@@ -61,6 +61,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned more_io:1; /* more io to be dispatched */
void *fs_private; /* For use by ->writepages() */
};
--- linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
@@ -382,6 +382,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -389,8 +390,9 @@ static void background_writeout(unsigned
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
+ congestion_wait(WRITE, HZ/10);
+ else
break;
}
}
@@ -455,11 +457,12 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */
--
^ permalink raw reply [flat|nested] 7+ messages in thread* [PATCH 4/6] writeback: introduce writeback_control.more_io to indicate more io
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, David Chinner, Andrew Morton, linux-kernel
[-- Attachment #1: writeback-more-data.patch --]
[-- Type: text/plain, Size: 3850 bytes --]
After making dirty a 100M file, the normal behavior is to
start the writeback for all data after 30s delays. But
sometimes the following happens instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analyze shows that the internal io dispatch queues goes like this:
s_io s_more_io
-------------------------
1) 100M,1K 0
2) 1K 96M
3) 0 96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)
nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out. The big dirty file is actually still sitting in s_more_io. We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a flag
writeback_control.more_io to indicate this situation. With it the big dirty file
no longer has to wait for the next kupdate invocation 5s later.
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 2 ++
include/linux/writeback.h | 1 +
mm/page-writeback.c | 9 ++++++---
3 files changed, 9 insertions(+), 3 deletions(-)
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -472,6 +472,8 @@ sync_sb_inodes(struct super_block *sb, s
if (wbc->nr_to_write <= 0)
break;
}
+ if (!list_empty(&sb->s_more_io))
+ wbc->more_io = 1;
return; /* Leave any unwritten inodes on s_io */
}
--- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
+++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
@@ -61,6 +61,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned more_io:1; /* more io to be dispatched */
void *fs_private; /* For use by ->writepages() */
};
--- linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
@@ -382,6 +382,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -389,8 +390,9 @@ static void background_writeout(unsigned
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
+ congestion_wait(WRITE, HZ/10);
+ else
break;
}
}
@@ -455,11 +457,12 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */
--
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 5/6] check dirty inode list
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Mike Waychison, Andrew Morton, linux-kernel
[-- Attachment #1: check_dirty_inode_list.patch --]
[-- Type: text/plain, Size: 6004 bytes --]
From: Andrew Morton <akpm@linux-foundation.org>
The per-superblock dirty-inode list super_block.s_dirty is supposed to be
sorted in reverse order of each inode's time-of-first-dirtying. This is so
that the kupdate function can avoid having to walk all the dirty inodes on the
list: it terminates the search as soon as it finds an inode which was dirtied
less than 30 seconds ago (dirty_expire_centisecs).
We have a bunch of several-year-old bugs which cause that list to not be in
the correct reverse-time-order. The result of this is that under certain
obscure circumstances, inodes get stuck and basically never get written back.
It has been reported a couple of times, but nobody really cared much because
most people use ordered-mode journalling filesystems, which take care of the
writeback independently. Plus we will _eventually_ get onto these inodes even
when the list is out of order, and a /bin/sync will still work OK.
However this is a pretty important data-integrity issue for filesystems such
as ext2.
As preparation for fixing these bugs, this patch adds a pile of fantastically
expensive debugging code which checks the sanity of the s_dirty list all over
the place, so we find out as soon as it goes bad.
The debugging code is controlled by /proc/sys/fs/inode_debug, which defaults
to off. The debugging will disable itself whenever it detects a misordering,
to avoid log spew.
We can remove all this code later.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/fs-writeback.c | 77 ++++++++++++++++++++++++++++++++++++
include/linux/writeback.h | 1
kernel/sysctl.c | 8 +++
3 files changed, 86 insertions(+)
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -24,6 +24,75 @@
#include <linux/buffer_head.h>
#include "internal.h"
+int sysctl_inode_debug __read_mostly;
+
+static int __check(struct list_head *head, int print_stuff)
+{
+ struct list_head *cursor = head;
+ unsigned long dirtied_when = 0;
+
+ while ((cursor = cursor->prev) != head) {
+ struct inode *inode = list_entry(cursor, struct inode, i_list);
+ if (print_stuff) {
+ printk("%p:%lu\n", inode, inode->dirtied_when);
+ } else {
+ if (dirtied_when &&
+ time_before(inode->dirtied_when, dirtied_when))
+ return 1;
+ dirtied_when = inode->dirtied_when;
+ }
+ }
+ return 0;
+}
+
+static void __check_dirty_inode_list(struct super_block *sb,
+ struct inode *inode, const char *file, int line)
+{
+ if (!sysctl_inode_debug)
+ return;
+
+ if (__check(&sb->s_dirty, 0)) {
+ sysctl_inode_debug = 0;
+ if (inode)
+ printk("%s:%d: s_dirty got screwed up. inode=%p:%lu\n",
+ file, line, inode, inode->dirtied_when);
+ else
+ printk("%s:%d: s_dirty got screwed up\n", file, line);
+ __check(&sb->s_dirty, 1);
+ }
+ if (__check(&sb->s_io, 0)) {
+ sysctl_inode_debug = 0;
+ if (inode)
+ printk("%s:%d: s_io got screwed up. inode=%p:%lu\n",
+ file, line, inode, inode->dirtied_when);
+ else
+ printk("%s:%d: s_io got screwed up\n", file, line);
+ __check(&sb->s_io, 1);
+ }
+ if (__check(&sb->s_more_io, 0)) {
+ sysctl_inode_debug = 0;
+ if (inode)
+ printk("%s:%d: s_more_io got screwed up. inode=%p:%lu\n",
+ file, line, inode, inode->dirtied_when);
+ else
+ printk("%s:%d: s_more_io got screwed up\n", file, line);
+ __check(&sb->s_more_io, 1);
+ }
+}
+
+#define check_dirty_inode_list(sb) \
+ do { \
+ if (unlikely(sysctl_inode_debug)) \
+ __check_dirty_inode_list(sb, NULL, __FILE__, __LINE__); \
+ } while (0)
+
+#define check_dirty_inode(inode) \
+ do { \
+ if (unlikely(sysctl_inode_debug)) \
+ __check_dirty_inode_list(inode->i_sb, inode, \
+ __FILE__, __LINE__); \
+ } while (0)
+
/**
* __mark_inode_dirty - internal function
* @inode: inode to mark
@@ -122,8 +191,10 @@ void __mark_inode_dirty(struct inode *in
* reposition it (that would break s_dirty time-ordering).
*/
if (!was_dirty) {
+ check_dirty_inode(inode);
inode->dirtied_when = jiffies;
list_move(&inode->i_list, &sb->s_dirty);
+ check_dirty_inode(inode);
}
}
out:
@@ -152,6 +223,7 @@ static void redirty_tail(struct inode *i
{
struct super_block *sb = inode->i_sb;
+ check_dirty_inode(inode);
if (!list_empty(&sb->s_dirty)) {
struct inode *tail_inode;
@@ -161,6 +233,7 @@ static void redirty_tail(struct inode *i
inode->dirtied_when = jiffies;
}
list_move(&inode->i_list, &sb->s_dirty);
+ check_dirty_inode(inode);
}
/*
@@ -168,7 +241,9 @@ static void redirty_tail(struct inode *i
*/
static void requeue_io(struct inode *inode)
{
+ check_dirty_inode(inode);
list_move(&inode->i_list, &inode->i_sb->s_more_io);
+ check_dirty_inode(inode);
}
static void inode_sync_complete(struct inode *inode)
@@ -463,8 +538,10 @@ int generic_sync_sb_inodes(struct super_
if (!ret)
ret = err;
if (wbc->sync_mode == WB_SYNC_HOLD) {
+ check_dirty_inode(inode);
inode->dirtied_when = jiffies;
list_move(&inode->i_list, &sb->s_dirty);
+ check_dirty_inode(inode);
}
if (current_is_pdflush())
writeback_release(bdi);
--- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
+++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
@@ -140,5 +140,6 @@ void writeback_set_ratelimit(void);
extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
read-only. */
+extern int sysctl_inode_debug;
#endif /* WRITEBACK_H */
--- linux-2.6.23-rc2-mm2.orig/kernel/sysctl.c
+++ linux-2.6.23-rc2-mm2/kernel/sysctl.c
@@ -1238,6 +1238,14 @@ static struct ctl_table fs_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "inode_debug",
+ .data = &sysctl_inode_debug,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
{
.ctl_name = CTL_UNNUMBERED,
--
^ permalink raw reply [flat|nested] 7+ messages in thread* [PATCH 5/6] check dirty inode list
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Mike Waychison, Andrew Morton, linux-kernel
[-- Attachment #1: check_dirty_inode_list.patch --]
[-- Type: text/plain, Size: 6004 bytes --]
From: Andrew Morton <akpm@linux-foundation.org>
The per-superblock dirty-inode list super_block.s_dirty is supposed to be
sorted in reverse order of each inode's time-of-first-dirtying. This is so
that the kupdate function can avoid having to walk all the dirty inodes on the
list: it terminates the search as soon as it finds an inode which was dirtied
less than 30 seconds ago (dirty_expire_centisecs).
We have a bunch of several-year-old bugs which cause that list to not be in
the correct reverse-time-order. The result of this is that under certain
obscure circumstances, inodes get stuck and basically never get written back.
It has been reported a couple of times, but nobody really cared much because
most people use ordered-mode journalling filesystems, which take care of the
writeback independently. Plus we will _eventually_ get onto these inodes even
when the list is out of order, and a /bin/sync will still work OK.
However this is a pretty important data-integrity issue for filesystems such
as ext2.
As preparation for fixing these bugs, this patch adds a pile of fantastically
expensive debugging code which checks the sanity of the s_dirty list all over
the place, so we find out as soon as it goes bad.
The debugging code is controlled by /proc/sys/fs/inode_debug, which defaults
to off. The debugging will disable itself whenever it detects a misordering,
to avoid log spew.
We can remove all this code later.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/fs-writeback.c | 77 ++++++++++++++++++++++++++++++++++++
include/linux/writeback.h | 1
kernel/sysctl.c | 8 +++
3 files changed, 86 insertions(+)
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -24,6 +24,75 @@
#include <linux/buffer_head.h>
#include "internal.h"
+int sysctl_inode_debug __read_mostly;
+
+static int __check(struct list_head *head, int print_stuff)
+{
+ struct list_head *cursor = head;
+ unsigned long dirtied_when = 0;
+
+ while ((cursor = cursor->prev) != head) {
+ struct inode *inode = list_entry(cursor, struct inode, i_list);
+ if (print_stuff) {
+ printk("%p:%lu\n", inode, inode->dirtied_when);
+ } else {
+ if (dirtied_when &&
+ time_before(inode->dirtied_when, dirtied_when))
+ return 1;
+ dirtied_when = inode->dirtied_when;
+ }
+ }
+ return 0;
+}
+
+static void __check_dirty_inode_list(struct super_block *sb,
+ struct inode *inode, const char *file, int line)
+{
+ if (!sysctl_inode_debug)
+ return;
+
+ if (__check(&sb->s_dirty, 0)) {
+ sysctl_inode_debug = 0;
+ if (inode)
+ printk("%s:%d: s_dirty got screwed up. inode=%p:%lu\n",
+ file, line, inode, inode->dirtied_when);
+ else
+ printk("%s:%d: s_dirty got screwed up\n", file, line);
+ __check(&sb->s_dirty, 1);
+ }
+ if (__check(&sb->s_io, 0)) {
+ sysctl_inode_debug = 0;
+ if (inode)
+ printk("%s:%d: s_io got screwed up. inode=%p:%lu\n",
+ file, line, inode, inode->dirtied_when);
+ else
+ printk("%s:%d: s_io got screwed up\n", file, line);
+ __check(&sb->s_io, 1);
+ }
+ if (__check(&sb->s_more_io, 0)) {
+ sysctl_inode_debug = 0;
+ if (inode)
+ printk("%s:%d: s_more_io got screwed up. inode=%p:%lu\n",
+ file, line, inode, inode->dirtied_when);
+ else
+ printk("%s:%d: s_more_io got screwed up\n", file, line);
+ __check(&sb->s_more_io, 1);
+ }
+}
+
+#define check_dirty_inode_list(sb) \
+ do { \
+ if (unlikely(sysctl_inode_debug)) \
+ __check_dirty_inode_list(sb, NULL, __FILE__, __LINE__); \
+ } while (0)
+
+#define check_dirty_inode(inode) \
+ do { \
+ if (unlikely(sysctl_inode_debug)) \
+ __check_dirty_inode_list(inode->i_sb, inode, \
+ __FILE__, __LINE__); \
+ } while (0)
+
/**
* __mark_inode_dirty - internal function
* @inode: inode to mark
@@ -122,8 +191,10 @@ void __mark_inode_dirty(struct inode *in
* reposition it (that would break s_dirty time-ordering).
*/
if (!was_dirty) {
+ check_dirty_inode(inode);
inode->dirtied_when = jiffies;
list_move(&inode->i_list, &sb->s_dirty);
+ check_dirty_inode(inode);
}
}
out:
@@ -152,6 +223,7 @@ static void redirty_tail(struct inode *i
{
struct super_block *sb = inode->i_sb;
+ check_dirty_inode(inode);
if (!list_empty(&sb->s_dirty)) {
struct inode *tail_inode;
@@ -161,6 +233,7 @@ static void redirty_tail(struct inode *i
inode->dirtied_when = jiffies;
}
list_move(&inode->i_list, &sb->s_dirty);
+ check_dirty_inode(inode);
}
/*
@@ -168,7 +241,9 @@ static void redirty_tail(struct inode *i
*/
static void requeue_io(struct inode *inode)
{
+ check_dirty_inode(inode);
list_move(&inode->i_list, &inode->i_sb->s_more_io);
+ check_dirty_inode(inode);
}
static void inode_sync_complete(struct inode *inode)
@@ -463,8 +538,10 @@ int generic_sync_sb_inodes(struct super_
if (!ret)
ret = err;
if (wbc->sync_mode == WB_SYNC_HOLD) {
+ check_dirty_inode(inode);
inode->dirtied_when = jiffies;
list_move(&inode->i_list, &sb->s_dirty);
+ check_dirty_inode(inode);
}
if (current_is_pdflush())
writeback_release(bdi);
--- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
+++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
@@ -140,5 +140,6 @@ void writeback_set_ratelimit(void);
extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
read-only. */
+extern int sysctl_inode_debug;
#endif /* WRITEBACK_H */
--- linux-2.6.23-rc2-mm2.orig/kernel/sysctl.c
+++ linux-2.6.23-rc2-mm2/kernel/sysctl.c
@@ -1238,6 +1238,14 @@ static struct ctl_table fs_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "inode_debug",
+ .data = &sysctl_inode_debug,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
{
.ctl_name = CTL_UNNUMBERED,
--
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 6/6] prevent time-ordering warnings
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Andrew Morton, linux-kernel
[-- Attachment #1: dirty-order-fix.patch --]
[-- Type: text/plain, Size: 929 bytes --]
It's -mm staff.
Just to make the inode list time ordering check logic comfortable.
Otherwise the old behavior is preferred.
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -224,14 +224,7 @@ static void redirty_tail(struct inode *i
struct super_block *sb = inode->i_sb;
check_dirty_inode(inode);
- if (!list_empty(&sb->s_dirty)) {
- struct inode *tail_inode;
-
- tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
- if (!time_after_eq(inode->dirtied_when,
- tail_inode->dirtied_when))
- inode->dirtied_when = jiffies;
- }
+ inode->dirtied_when = jiffies;
list_move(&inode->i_list, &sb->s_dirty);
check_dirty_inode(inode);
}
--
^ permalink raw reply [flat|nested] 7+ messages in thread* [PATCH 6/6] prevent time-ordering warnings
@ 2007-08-19 6:53 ` Fengguang Wu
0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-19 6:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Andrew Morton, linux-kernel
[-- Attachment #1: dirty-order-fix.patch --]
[-- Type: text/plain, Size: 929 bytes --]
It's -mm staff.
Just to make the inode list time ordering check logic comfortable.
Otherwise the old behavior is preferred.
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -224,14 +224,7 @@ static void redirty_tail(struct inode *i
struct super_block *sb = inode->i_sb;
check_dirty_inode(inode);
- if (!list_empty(&sb->s_dirty)) {
- struct inode *tail_inode;
-
- tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
- if (!time_after_eq(inode->dirtied_when,
- tail_inode->dirtied_when))
- inode->dirtied_when = jiffies;
- }
+ inode->dirtied_when = jiffies;
list_move(&inode->i_list, &sb->s_dirty);
check_dirty_inode(inode);
}
--
^ permalink raw reply [flat|nested] 7+ messages in thread