All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Ken Chen <kenchen@google.com>
Cc: linux-mm@kvack.org
Subject: Re: [patch] fix periodic superblock dirty inode flushing
Date: Thu, 12 Jul 2007 12:05:19 -0700	[thread overview]
Message-ID: <20070712120519.8a7241dd.akpm@linux-foundation.org> (raw)
In-Reply-To: <b040c32a0707112121y21d08438u8ca7f138931827b0@mail.gmail.com>

On Wed, 11 Jul 2007 21:21:19 -0700
"Ken Chen" <kenchen@google.com> wrote:

> Current -mm tree has bucketful of bug fixes in periodic writeback path.
> However, we still hit a glitch where dirty pages on a given inode aren't
> completely flushed to the disk, and system will accumulate large amount
> of dirty pages pass beyond what dirty_expire_interval is designed for.
> 
> The problem is __sync_single_inode() will move inode to sb->s_dirty list
> even when there are more pending dirty pages on that inode.  If there is
> another inode with small amount of dirty pages, we hit a case where loop
> iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
> Thus leaving the inode that has large amount of dirty pages behind and it has
> to wait for another dirty_writeback_interval before we flush it again.  It
> effectively only writeout MAX_WRITEBACK_PAGES every dirty_writeback_interval.
> If the rate of dirtying is sufficiently high, system will start accumulate
> large amount of dirty pages.
> 
> So fix it by having another sb->s_more_io list to park the inode while we
> iterate through sb->s_io and allow each dirty inode resides on that sb has
> an equal chance of flushing some amount of dirty pages.
> 

Thanks.  Was this tested in combination with check_dirty_inode_list.patch,
to make sure that the time-orderedness is being retained?


From: Andrew Morton <akpm@linux-foundation.org>

The per-superblock dirty-inode list super_block.s_dirty is supposed to be
sorted in reverse order of each inode's time-of-first-dirtying.  This is so
that the kupdate function can avoid having to walk all the dirty inodes on the
list: it terminates the search as soon as it finds an inode which was dirtied
less than 30 seconds ago (dirty_expire_centisecs).

We have a bunch of several-year-old bugs which cause that list to not be in
the correct reverse-time-order.  The result of this is that under certain
obscure circumstances, inodes get stuck and basically never get written back. 
It has been reported a couple of times, but nobody really cared much because
most people use ordered-mode journalling filesystems, which take care of the
writeback independently.  Plus we will _eventually_ get onto these inodes even
when the list is out of order, and a /bin/sync will still work OK.

However this is a pretty important data-integrity issue for filesystems such
as ext2.


As preparation for fixing these bugs, this patch adds a pile of fantastically
expensive debugging code which checks the sanity of the s_dirty list all over
the place, so we find out as soon as it goes bad.

The debugging code is controlled by /proc/sys/fs/inode_debug, which defaults
to off.  The debugging will disable itself whenever it detects a misordering,
to avoid log spew.

We can remove all this code later.

Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/fs-writeback.c         |   62 +++++++++++++++++++++++++++++++++++-
 include/linux/writeback.h |    1 
 kernel/sysctl.c           |    8 ++++
 3 files changed, 70 insertions(+), 1 deletion(-)

diff -puN fs/fs-writeback.c~check_dirty_inode_list fs/fs-writeback.c
--- a/fs/fs-writeback.c~check_dirty_inode_list
+++ a/fs/fs-writeback.c
@@ -24,6 +24,57 @@
 #include <linux/buffer_head.h>
 #include "internal.h"
 
+int sysctl_inode_debug __read_mostly;
+
+static int __check(struct super_block *sb, int print_stuff)
+{
+	struct list_head *cursor = &sb->s_dirty;
+	unsigned long dirtied_when = 0;
+
+	while ((cursor = cursor->prev) != &sb->s_dirty) {
+		struct inode *inode = list_entry(cursor, struct inode, i_list);
+		if (print_stuff) {
+			printk("%p:%lu\n", inode, inode->dirtied_when);
+		} else {
+			if (dirtied_when &&
+			    time_before(inode->dirtied_when, dirtied_when))
+				return 1;
+			dirtied_when = inode->dirtied_when;
+		}
+	}
+	return 0;
+}
+
+static void __check_dirty_inode_list(struct super_block *sb,
+			struct inode *inode, const char *file, int line)
+{
+	if (!sysctl_inode_debug)
+		return;
+
+	if (__check(sb, 0)) {
+		sysctl_inode_debug = 0;
+		if (inode)
+			printk("%s:%d: s_dirty got screwed up.  inode=%p:%lu\n",
+					file, line, inode, inode->dirtied_when);
+		else
+			printk("%s:%d: s_dirty got screwed up\n", file, line);
+		__check(sb, 1);
+	}
+}
+
+#define check_dirty_inode_list(sb)					\
+	do {								\
+		if (unlikely(sysctl_inode_debug))			\
+		__check_dirty_inode_list(sb, NULL, __FILE__, __LINE__);	\
+	} while (0)
+
+#define check_dirty_inode(inode)					\
+	do {								\
+		if (unlikely(sysctl_inode_debug))			\
+			__check_dirty_inode_list(inode->i_sb, inode,	\
+						__FILE__, __LINE__);	\
+	} while (0)
+
 /**
  *	__mark_inode_dirty -	internal function
  *	@inode: inode to mark
@@ -122,8 +173,10 @@ void __mark_inode_dirty(struct inode *in
 		 * reposition it (that would break s_dirty time-ordering).
 		 */
 		if (!was_dirty) {
+			check_dirty_inode(inode);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_list, &sb->s_dirty);
+			check_dirty_inode(inode);
 		}
 	}
 out:
@@ -152,6 +205,7 @@ static void redirty_tail(struct inode *i
 {
 	struct super_block *sb = inode->i_sb;
 
+	check_dirty_inode(inode);
 	if (!list_empty(&sb->s_dirty)) {
 		struct inode *tail_inode;
 
@@ -161,6 +215,7 @@ static void redirty_tail(struct inode *i
 			inode->dirtied_when = jiffies;
 	}
 	list_move(&inode->i_list, &sb->s_dirty);
+	check_dirty_inode(inode);
 }
 
 /*
@@ -374,8 +429,11 @@ int generic_sync_sb_inodes(struct super_
 
 	spin_lock(&inode_lock);
 
-	if (!wbc->for_kupdate || list_empty(&sb->s_io))
+	if (!wbc->for_kupdate || list_empty(&sb->s_io)) {
+		check_dirty_inode_list(sb);
 		list_splice_init(&sb->s_dirty, &sb->s_io);
+		check_dirty_inode_list(sb);
+	}
 
 	while (!list_empty(&sb->s_io)) {
 		int err;
@@ -440,8 +498,10 @@ int generic_sync_sb_inodes(struct super_
 		if (!ret)
 			ret = err;
 		if (wbc->sync_mode == WB_SYNC_HOLD) {
+			check_dirty_inode(inode);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_list, &sb->s_dirty);
+			check_dirty_inode(inode);
 		}
 		if (current_is_pdflush())
 			writeback_release(bdi);
diff -puN include/linux/writeback.h~check_dirty_inode_list include/linux/writeback.h
--- a/include/linux/writeback.h~check_dirty_inode_list
+++ a/include/linux/writeback.h
@@ -140,5 +140,6 @@ void writeback_set_ratelimit(void);
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
 				   read-only. */
 
+extern int sysctl_inode_debug;
 
 #endif		/* WRITEBACK_H */
diff -puN kernel/sysctl.c~check_dirty_inode_list kernel/sysctl.c
--- a/kernel/sysctl.c~check_dirty_inode_list
+++ a/kernel/sysctl.c
@@ -1231,6 +1231,14 @@ static ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "inode_debug",
+		.data		= &sysctl_inode_debug,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.ctl_name	= CTL_UNNUMBERED,
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2007-07-12 19:05 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-07-12  4:21 [patch] fix periodic superblock dirty inode flushing Ken Chen
2007-07-12 19:05 ` Andrew Morton [this message]
2007-07-13 22:17   ` Ken Chen
2007-07-17  0:01     ` Ken Chen
2007-07-17  0:15       ` Andrew Morton
2007-07-19  2:59       ` Fengguang Wu
2007-07-19  2:59         ` Fengguang Wu
2007-07-19  3:10           ` Andrew Morton
2007-07-19  8:09             ` Fengguang Wu
2007-07-19  8:09               ` Fengguang Wu
2007-07-19  8:18                 ` Andrew Morton
2007-07-19 22:18                   ` David Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070712120519.8a7241dd.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=kenchen@google.com \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.