linux-fsdevel.vger.kernel.org archive mirror
* Re: Page Cache writeback too slow, SSD/noop scheduler/ext2
       [not found]   ` <200903250148.53644.nickpiggin-/E1597aS9LT0CCvOHzKKcA@public.gmane.org>
@ 2009-03-25  5:26     ` Wu Fengguang
  2009-03-27 16:59       ` Jos Houtman
  0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2009-03-25  5:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jos Houtman, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jeff Layton, Dave Chinner,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

[-- Attachment #1: Type: text/plain, Size: 4404 bytes --]

On Wed, Mar 25, 2009 at 01:48:53AM +1100, Nick Piggin wrote:
> On Monday 23 March 2009 03:53:29 Jos Houtman wrote:
> > On 3/21/09 11:53 AM, "Andrew Morton" <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > > On Fri, 20 Mar 2009 19:26:06 +0100 Jos Houtman <jos-vMeIAzyucXQ@public.gmane.org> wrote:
> > >> Hi,
> > >>
> > >> We have hit a problem where the page-cache writeback algorithm is
> > >> not keeping up.
> > >> When memory gets low, this results in very irregular performance
> > >> drops.
> > >>
> > >> Our setup is as follows:
> > >> 30 x Quad core machine with 64GB ram.
> > >> These are single purpose machines running MySQL.
> > >> Kernel version: 2.6.28.7
> > >> A dedicated SSD drive for the ext2 database partition
> > >> Noop scheduler for the ssd drive.
> > >>
> > >>
> > >> The current hypothesis is as follows:
> > >> The wk_update function does not write enough dirty pages, which allows
> > >> the number of dirty pages to grow to the dirty_background limit.
> > >> When memory is low, background_writeout() comes around and
> > >> _forcefully_ writes dirty pages to disk.
> > >> This forced write fills the disk queue and starves the read calls
> > >> that MySQL is trying to do, basically killing performance for a few
> > >> seconds. This pattern repeats as soon as the freed memory fills up
> > >> again.
> > >>
> > >> Decreasing dirty_writeback_centisecs to 100 doesn't help.
> > >>
> > >> I don't know why this is, but I did some preliminary tracing using
> > >> systemtap, and it seems that the majority of the wk_update calls
> > >> decide to do nothing.
> > >>
> > >> Doubling /sys/block/sdb/queue/nr_requests to 256 seems to help a
> > >> bit: nr_dirty grows more slowly.
> > >> But I am unsure of the side-effects and am afraid of worsening the
> > >> starvation problem for MySQL.
> > >>
> > >>
> > >> I'm very much willing to work on this issue and see it fixed, but
> > >> would like to tap into the knowledge of the people here.
> > >> So:
> > >> * Have more people seen this or similar issues?
> > >> * Is the hypothesis above a viable one?
> > >> * Suggestions/pointers for further research and statistics I should
> > >> measure to improve the understanding of this problem.
> > >
> > > I don't think that noop-iosched tries to do anything to prevent
> > > writes-starve-reads.  Do you get better behaviour from any of the other
> > > IO schedulers?
> >
> > I did a quick stress test, and cfq does not immediately seem to hurt
> > performance, although some of my colleagues have tested this in the
> > past with the opposite results (which is why we use noop).
> >
> > But regardless of the scheduler, the real problem is that the writeback
> > algorithm is not keeping up.
> > We can grow 600K dirty pages during the day, and only ~300K are flushed
> > to disk during the night hours.
> >
> > A quick look at the writeback algorithm led me to expect wk_update()
> > to flush ~1024 pages every 5 seconds, which is almost 3GB per hour.
> > It obviously does not manage to do this in our setup.
> >
> > I don't believe the speed of the SSD to be the problem: running sync
> > manually takes only a few minutes to flush 800K dirty pages to disk.
> 
> kupdate surely should keep trying to write back pages as long as
> there are more old pages to clean and the queue isn't congested.
> congested. That seems to be the intention anyway: MAX_WRITEBACK_PAGES
> is just the number to write back in a single call, but you see
> nr_to_write is set to the number of dirty pages in the system.
> 
> On your system, what must be happening is more_io is not being set.
> The logic in fs/fs-writeback.c might be busted.

Hi Jos,

I prepared a debugging patch for 2.6.28. (I cannot observe writeback
problems on my local ext2 mount.)

You can view the states of all dirty inodes by doing

        modprobe filecache
        echo ls dirty > /proc/filecache
        cat /proc/filecache

The 'age' field shows (jiffies - inode->dirtied_when), which may also be
useful for debugging Jeff and Ian's case (if it keeps growing, then
dirtied_when is stuck).

The detailed dirty writeback traces can be retrieved by doing

        echo 1 > /proc/sys/fs/dirty_debug        
        sleep 6s
        echo 0 > /proc/sys/fs/dirty_debug        
        dmesg

The dmesg trace should help identify the bug in periodic writeback.

Thanks,
Fengguang

[-- Attachment #2: filecache+writeback-debug-2.6.28.patch --]
[-- Type: text/x-diff, Size: 40831 bytes --]

--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -27,6 +27,7 @@ extern unsigned long max_mapnr;
 extern unsigned long num_physpages;
 extern void * high_memory;
 extern int page_cluster;
+extern char * const zone_names[];
 
 #ifdef CONFIG_SYSCTL
 extern int sysctl_legacy_va_layout;
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -104,7 +104,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
 
 EXPORT_SYMBOL(totalram_pages);
 
-static char * const zone_names[MAX_NR_ZONES] = {
+char * const zone_names[MAX_NR_ZONES] = {
 #ifdef CONFIG_ZONE_DMA
 	 "DMA",
 #endif
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1943,7 +1943,10 @@ char *__d_path(const struct path *path, 
 
 		if (dentry == root->dentry && vfsmnt == root->mnt)
 			break;
-		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
+		if (unlikely(!vfsmnt)) {
+			if (IS_ROOT(dentry))
+				break;
+		} else if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			/* Global root? */
 			if (vfsmnt->mnt_parent == vfsmnt) {
 				goto global_root;
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -564,7 +564,6 @@ out:
 }
 EXPORT_SYMBOL(radix_tree_tag_clear);
 
-#ifndef __KERNEL__	/* Only the test harness uses this at present */
 /**
  * radix_tree_tag_get - get a tag on a radix tree node
  * @root:		radix tree root
@@ -627,7 +626,6 @@ int radix_tree_tag_get(struct radix_tree
 	}
 }
 EXPORT_SYMBOL(radix_tree_tag_get);
-#endif
 
 /**
  *	radix_tree_next_hole    -    find the next hole (not-present entry)
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -82,6 +82,10 @@ static struct hlist_head *inode_hashtabl
  */
 DEFINE_SPINLOCK(inode_lock);
 
+EXPORT_SYMBOL(inode_in_use);
+EXPORT_SYMBOL(inode_unused);
+EXPORT_SYMBOL(inode_lock);
+
 /*
  * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
@@ -108,6 +112,14 @@ static void wake_up_inode(struct inode *
 	wake_up_bit(&inode->i_state, __I_LOCK);
 }
 
+static inline void inode_created_by(struct inode *inode, struct task_struct *task)
+{
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	inode->i_cuid = task->uid;
+	memcpy(inode->i_comm, task->comm, sizeof(task->comm));
+#endif
+}
+
 static struct inode *alloc_inode(struct super_block *sb)
 {
 	static const struct address_space_operations empty_aops;
@@ -142,7 +154,7 @@ static struct inode *alloc_inode(struct 
 		inode->i_bdev = NULL;
 		inode->i_cdev = NULL;
 		inode->i_rdev = 0;
-		inode->dirtied_when = 0;
+		inode->dirtied_when = jiffies;
 		if (security_inode_alloc(inode)) {
 			if (inode->i_sb->s_op->destroy_inode)
 				inode->i_sb->s_op->destroy_inode(inode);
@@ -183,6 +195,7 @@ static struct inode *alloc_inode(struct 
 		}
 		inode->i_private = NULL;
 		inode->i_mapping = mapping;
+		inode_created_by(inode, current);
 	}
 	return inode;
 }
@@ -247,6 +260,8 @@ void __iget(struct inode * inode)
 	inodes_stat.nr_unused--;
 }
 
+EXPORT_SYMBOL(__iget);
+
 /**
  * clear_inode - clear an inode
  * @inode: inode to clear
@@ -1353,6 +1368,16 @@ void inode_double_unlock(struct inode *i
 }
 EXPORT_SYMBOL(inode_double_unlock);
 
+
+struct hlist_head * get_inode_hash_budget(unsigned long index)
+{
+       if (index >= (1 << i_hash_shift))
+               return NULL;
+
+       return inode_hashtable + index;
+}
+EXPORT_SYMBOL_GPL(get_inode_hash_budget);
+
 static __initdata unsigned long ihash_entries;
 static int __init set_ihash_entries(char *str)
 {
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -45,6 +45,9 @@
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+EXPORT_SYMBOL(super_blocks);
+EXPORT_SYMBOL(sb_lock);
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -230,6 +230,7 @@ unsigned long shrink_slab(unsigned long 
 	up_read(&shrinker_rwsem);
 	return ret;
 }
+EXPORT_SYMBOL(shrink_slab);
 
 /* Called without lock on whether page is mapped, so answer is unstable */
 static inline int page_mapping_inuse(struct page *page)
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -44,6 +44,7 @@ struct address_space swapper_space = {
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
 };
+EXPORT_SYMBOL_GPL(swapper_space);
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
 
--- linux-2.6.orig/Documentation/filesystems/proc.txt
+++ linux-2.6/Documentation/filesystems/proc.txt
@@ -266,6 +266,7 @@ Table 1-4: Kernel info in /proc
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
  fb	     Frame Buffer devices				(2.4)
+ filecache   Query/drop in-memory file cache
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
  interrupts  Interrupt usage                                   
@@ -456,6 +457,88 @@ varies by architecture and compile optio
 
 > cat /proc/meminfo
 
+..............................................................................
+
+filecache:
+
+Provides access to the in-memory file cache.
+
+To list an index of all cached files:
+
+    echo ls > /proc/filecache
+    cat /proc/filecache
+
+The output looks like:
+
+    # filecache 1.0
+    #      ino       size   cached cached%  state   refcnt  dev             file
+       1026334         91       92    100   --      66      03:02(hda2)     /lib/ld-2.3.6.so
+        233608       1242      972     78   --      66      03:02(hda2)     /lib/tls/libc-2.3.6.so
+         65203        651      476     73   --      1       03:02(hda2)     /bin/bash
+       1026445        261      160     61   --      10      03:02(hda2)     /lib/libncurses.so.5.5
+        235427         10       12    100   --      44      03:02(hda2)     /lib/tls/libdl-2.3.6.so
+
+FIELD	INTRO
+---------------------------------------------------------------------------
+ino	inode number
+size	inode size in KB
+cached	cached size in KB
+cached%	percent of file data cached
+state1	'-' clean; 'd' metadata dirty; 'D' data dirty
+state2	'-' unlocked; 'L' locked, normally indicates file being written out
+refcnt	file reference count; this is an in-kernel count, not exactly the open count
+dev	major:minor numbers in hex, followed by a descriptive device name
+file	file path _inside_ the filesystem. There are several special names:
+	'(noname)':	the file name is not available
+	'(03:02)':	the file is a block device file of major:minor
+	'...(deleted)': the named file has been deleted from the disk
+
+To list the cached pages of a particular file:
+
+    echo /bin/bash > /proc/filecache
+    cat /proc/filecache
+
+    # file /bin/bash
+    # flags R:referenced A:active U:uptodate D:dirty W:writeback M:mmap
+    # idx   len     state   refcnt
+    0       36      RAU__M  3
+    36      1       RAU__M  2
+    37      8       RAU__M  3
+    45      2       RAU___  1
+    47      6       RAU__M  3
+    53      3       RAU__M  2
+    56      2       RAU__M  3
+
+FIELD	INTRO
+----------------------------------------------------------------------------
+idx	page index
+len	number of pages which are cached and share the same state
+state	page state of the flags listed in line two
+refcnt	page reference count
+
+Careful users may notice that the file name to be queried is remembered
+between commands. Internally, the module keeps the file name parameter in
+a global variable, so that it can be inherited by a newly opened
+/proc/filecache file. However, this can lead to interference between
+multiple queriers. The rule here is: only root may interactively change
+the file name parameter; normal users must use scripts to access the
+interface, following the code example below:
+
+    filecache = open("/proc/filecache", "rw");
+    # avoid polluting the global parameter filename
+    filecache.write("set private");
+
+To instruct the kernel to drop clean caches, dentries and inodes from memory,
+causing that memory to become free:
+
+    # drop clean file data cache (i.e. file backed pagecache)
+    echo drop pagecache > /proc/filecache
+
+    # drop clean file metadata cache (i.e. dentries and inodes)
+    echo drop slabcache > /proc/filecache
+
+Note that the drop commands are non-destructive operations. Since dirty
+objects are not freeable, the user should run `sync' first.
 
 MemTotal:     16344972 kB
 MemFree:      13634064 kB
--- /dev/null
+++ linux-2.6/fs/proc/filecache.c
@@ -0,0 +1,1046 @@
+/*
+ * fs/proc/filecache.c
+ *
+ * Copyright (C) 2006, 2007 Fengguang Wu <wfg-fOMaevN1BEbsJZF79Ady7g@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/radix-tree.h>
+#include <linux/page-flags.h>
+#include <linux/pagevec.h>
+#include <linux/pagemap.h>
+#include <linux/vmalloc.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/parser.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+
+/*
+ * Increase minor version when new columns are added;
+ * Increase major version when existing columns are changed.
+ */
+#define FILECACHE_VERSION	"1.0"
+
+/* Internal buffer sizes. The larger, the more efficient. */
+#define SBUF_SIZE	(128<<10)
+#define IWIN_PAGE_ORDER	3
+#define IWIN_SIZE	((PAGE_SIZE<<IWIN_PAGE_ORDER) / sizeof(struct inode *))
+
+/*
+ * Session management.
+ *
+ * Each opened /proc/filecache file is associated with a session object.
+ * Also there is a global_session that maintains status across open()/close()
+ * (i.e. the lifetime of an opened file), so that a casual user can query the
+ * filecache via _multiple_ simple shell commands like
+ * 'echo cat /bin/bash > /proc/filecache; cat /proc/filecache'.
+ *
+ * session.query_file is the file whose cache info is to be queried.
+ * Its value determines what we get on read():
+ * 	- NULL: ii_*() called to show the inode index
+ * 	- filp: pg_*() called to show the page groups of a filp
+ *
+ * session.query_file is
+ * 	- cloned from global_session.query_file on open();
+ * 	- updated on write("cat filename");
+ * 	  note that the new file will also be saved in global_session.query_file if
+ * 	  session.private_session is false.
+ */
+
+struct session {
+	/* options */
+	int		private_session;
+	unsigned long	ls_options;
+	dev_t		ls_dev;
+
+	/* parameters */
+	struct file	*query_file;
+
+	/* seqfile pos */
+	pgoff_t		start_offset;
+	pgoff_t		next_offset;
+
+	/* inode at last pos */
+	struct {
+		unsigned long pos;
+		unsigned long state;
+		struct inode *inode;
+		struct inode *pinned_inode;
+	} ipos;
+
+	/* inode window */
+	struct {
+		unsigned long cursor;
+		unsigned long origin;
+		unsigned long size;
+		struct inode **inodes;
+	} iwin;
+};
+
+static struct session global_session;
+
+/*
+ * Session address is stored in proc_file->f_ra.start:
+ * we assume that there will be no readahead for proc_file.
+ */
+static struct session *get_session(struct file *proc_file)
+{
+	return (struct session *)proc_file->f_ra.start;
+}
+
+static void set_session(struct file *proc_file, struct session *s)
+{
+	BUG_ON(proc_file->f_ra.start);
+	proc_file->f_ra.start = (unsigned long)s;
+}
+
+static void update_global_file(struct session *s)
+{
+	if (s->private_session)
+		return;
+
+	if (global_session.query_file)
+		fput(global_session.query_file);
+
+	global_session.query_file = s->query_file;
+
+	if (global_session.query_file)
+		get_file(global_session.query_file);
+}
+
+/*
+ * Cases of the name:
+ * 1) NULL                (new session)
+ * 	s->query_file = global_session.query_file = 0;
+ * 2) ""                  (ls/la)
+ * 	s->query_file = global_session.query_file;
+ * 3) a regular file name (cat newfile)
+ * 	s->query_file = global_session.query_file = newfile;
+ */
+static int session_update_file(struct session *s, char *name)
+{
+	static DEFINE_MUTEX(mutex); /* protects global_session.query_file */
+	int err = 0;
+
+	mutex_lock(&mutex);
+
+	/*
+	 * We are to quit, or to list the cached files.
+	 * Reset *.query_file.
+	 */
+	if (!name) {
+		if (s->query_file) {
+			fput(s->query_file);
+			s->query_file = NULL;
+		}
+		update_global_file(s);
+		goto out;
+	}
+
+	/*
+	 * This is a new session.
+	 * Inherit options/parameters from global ones.
+	 */
+	if (name[0] == '\0') {
+		*s = global_session;
+		if (s->query_file)
+			get_file(s->query_file);
+		goto out;
+	}
+
+	/*
+	 * Open the named file.
+	 */
+	if (s->query_file)
+		fput(s->query_file);
+	s->query_file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+	if (IS_ERR(s->query_file)) {
+		err = PTR_ERR(s->query_file);
+		s->query_file = NULL;
+	} else
+		update_global_file(s);
+
+out:
+	mutex_unlock(&mutex);
+
+	return err;
+}
+
+static struct session *session_create(void)
+{
+	struct session *s;
+	int err = 0;
+
+	s = kmalloc(sizeof(*s), GFP_KERNEL);
+	if (s)
+		err = session_update_file(s, "");
+	else
+		err = -ENOMEM;
+
+	return err ? ERR_PTR(err) : s;
+}
+
+static void session_release(struct session *s)
+{
+	if (s->ipos.pinned_inode)
+		iput(s->ipos.pinned_inode);
+	if (s->query_file)
+		fput(s->query_file);
+	kfree(s);
+}
+
+
+/*
+ * Listing of cached files.
+ *
+ * Usage:
+ * 		echo > /proc/filecache  # enter listing mode
+ * 		cat /proc/filecache     # get the file listing
+ */
+
+/* code style borrowed from ib_srp.c */
+enum {
+	LS_OPT_ERR	=	0,
+	LS_OPT_DIRTY	=	1 << 0,
+	LS_OPT_CLEAN	=	1 << 1,
+	LS_OPT_INUSE	=	1 << 2,
+	LS_OPT_EMPTY	=	1 << 3,
+	LS_OPT_ALL	=	1 << 4,
+	LS_OPT_DEV	=	1 << 5,
+};
+
+static match_table_t ls_opt_tokens = {
+	{ LS_OPT_DIRTY,		"dirty" 	},
+	{ LS_OPT_CLEAN,		"clean" 	},
+	{ LS_OPT_INUSE,		"inuse" 	},
+	{ LS_OPT_EMPTY,		"empty"		},
+	{ LS_OPT_ALL,		"all" 		},
+	{ LS_OPT_DEV,		"dev=%s"	},
+	{ LS_OPT_ERR,		NULL 		}
+};
+
+static int ls_parse_options(const char *buf, struct session *s)
+{
+	substring_t args[MAX_OPT_ARGS];
+	char *options, *sep_opt;
+	char *p;
+	int token;
+	int ret = 0;
+
+	if (!buf)
+		return 0;
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	s->ls_options = 0;
+	sep_opt = options;
+	while ((p = strsep(&sep_opt, " ")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, ls_opt_tokens, args);
+
+		switch (token) {
+		case LS_OPT_DIRTY:
+		case LS_OPT_CLEAN:
+		case LS_OPT_INUSE:
+		case LS_OPT_EMPTY:
+		case LS_OPT_ALL:
+			s->ls_options |= token;
+			break;
+		case LS_OPT_DEV:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (*p == '/') {
+				struct kstat stat;
+				struct nameidata nd;
+				ret = path_lookup(p, LOOKUP_FOLLOW, &nd);
+				if (!ret)
+					ret = vfs_getattr(nd.path.mnt,
+							  nd.path.dentry, &stat);
+				if (!ret)
+					s->ls_dev = stat.rdev;
+			} else
+				s->ls_dev = simple_strtoul(p, NULL, 0);
+			/* printk("%lx %s\n", (long)s->ls_dev, p); */
+			kfree(p);
+			break;
+
+		default:
+			printk(KERN_WARNING "unknown parameter or missing value "
+			       "'%s' in ls command\n", p);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+out:
+	kfree(options);
+	return ret;
+}
+
+/*
+ * Add possible filters here.
+ * No permission check: we cannot verify the path's permission anyway.
+ * We simply demand root privilege for accessing /proc/filecache.
+ */
+static int may_show_inode(struct session *s, struct inode *inode)
+{
+	if (!atomic_read(&inode->i_count))
+		return 0;
+	if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+		return 0;
+	if (!inode->i_mapping)
+		return 0;
+
+	if (s->ls_dev && s->ls_dev != inode->i_sb->s_dev)
+		return 0;
+
+	if (s->ls_options & LS_OPT_ALL)
+		return 1;
+
+	if (!(s->ls_options & LS_OPT_EMPTY) && !inode->i_mapping->nrpages)
+		return 0;
+
+	if ((s->ls_options & LS_OPT_DIRTY) && !(inode->i_state & I_DIRTY))
+		return 0;
+
+	if ((s->ls_options & LS_OPT_CLEAN) && (inode->i_state & I_DIRTY))
+		return 0;
+
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+	      S_ISLNK(inode->i_mode) || S_ISBLK(inode->i_mode)))
+		return 0;
+
+	return 1;
+}
+
+/*
+ * Full: there are more data following.
+ */
+static int iwin_full(struct session *s)
+{
+	return !s->iwin.cursor ||
+		s->iwin.cursor > s->iwin.origin + s->iwin.size;
+}
+
+static int iwin_push(struct session *s, struct inode *inode)
+{
+	if (!may_show_inode(s, inode))
+		return 0;
+
+	s->iwin.cursor++;
+
+	if (s->iwin.size >= IWIN_SIZE)
+		return 1;
+
+	if (s->iwin.cursor > s->iwin.origin)
+		s->iwin.inodes[s->iwin.size++] = inode;
+	return 0;
+}
+
+/*
+ * Traverse the inode lists in order - newest first.
+ * And fill @s->iwin.inodes with inodes positioned in [@pos, @pos+IWIN_SIZE).
+ */
+static int iwin_fill(struct session *s, unsigned long pos)
+{
+	struct inode *inode;
+	struct super_block *sb;
+
+	s->iwin.origin = pos;
+	s->iwin.cursor = 0;
+	s->iwin.size = 0;
+
+	/*
+	 * We have a cursor inode, clean and expected to be unchanged.
+	 */
+	if (s->ipos.inode && pos >= s->ipos.pos &&
+			!(s->ipos.state & I_DIRTY) &&
+			s->ipos.state == s->ipos.inode->i_state) {
+		inode = s->ipos.inode;
+		s->iwin.cursor = s->ipos.pos;
+		goto continue_from_saved;
+	}
+
+	if (s->ls_options & LS_OPT_CLEAN)
+		goto clean_inodes;
+
+	spin_lock(&sb_lock);
+	list_for_each_entry(sb, &super_blocks, s_list) {
+		if (s->ls_dev && s->ls_dev != sb->s_dev)
+			continue;
+
+		list_for_each_entry(inode, &sb->s_dirty, i_list) {
+			if (iwin_push(s, inode))
+				goto out_full_unlock;
+		}
+		list_for_each_entry(inode, &sb->s_io, i_list) {
+			if (iwin_push(s, inode))
+				goto out_full_unlock;
+		}
+	}
+	spin_unlock(&sb_lock);
+
+clean_inodes:
+	list_for_each_entry(inode, &inode_in_use, i_list) {
+		if (iwin_push(s, inode))
+			goto out_full;
+continue_from_saved:
+		;
+	}
+
+	if (s->ls_options & LS_OPT_INUSE)
+		return 0;
+
+	list_for_each_entry(inode, &inode_unused, i_list) {
+		if (iwin_push(s, inode))
+			goto out_full;
+	}
+
+	return 0;
+
+out_full_unlock:
+	spin_unlock(&sb_lock);
+out_full:
+	return 1;
+}
+
+static struct inode *iwin_inode(struct session *s, unsigned long pos)
+{
+	if ((iwin_full(s) && pos >= s->iwin.origin + s->iwin.size)
+			  || pos < s->iwin.origin)
+		iwin_fill(s, pos);
+
+	if (pos >= s->iwin.cursor)
+		return NULL;
+
+	s->ipos.pos = pos;
+	s->ipos.inode = s->iwin.inodes[pos - s->iwin.origin];
+	BUG_ON(!s->ipos.inode);
+	return s->ipos.inode;
+}
+
+static void show_inode(struct seq_file *m, struct inode *inode)
+{
+	char state[] = "--"; /* dirty, locked */
+	struct dentry *dentry;
+	loff_t size = i_size_read(inode);
+	unsigned long nrpages;
+	int percent;
+	int refcnt;
+	int shift;
+
+	if (!size)
+		size++;
+
+	if (inode->i_mapping)
+		nrpages = inode->i_mapping->nrpages;
+	else {
+		nrpages = 0;
+		WARN_ON(1);
+	}
+
+	for (shift = 0; (size >> shift) > ULONG_MAX / 128; shift += 12)
+		;
+	percent = min(100UL, (((100 * nrpages) >> shift) << PAGE_CACHE_SHIFT) /
+						(unsigned long)(size >> shift));
+
+	if (inode->i_state & (I_DIRTY_DATASYNC|I_DIRTY_PAGES))
+		state[0] = 'D';
+	else if (inode->i_state & I_DIRTY_SYNC)
+		state[0] = 'd';
+
+	if (inode->i_state & I_LOCK)
+		state[0] = 'L';
+
+	refcnt = 0;
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		refcnt += atomic_read(&dentry->d_count);
+	}
+
+	seq_printf(m, "%10lu %10llu %8lu %7d ",
+			inode->i_ino,
+			DIV_ROUND_UP(size, 1024),
+			nrpages << (PAGE_CACHE_SHIFT - 10),
+			percent);
+
+	seq_printf(m, "%6d %5s %9lu ",
+			refcnt,
+			state,
+			(jiffies - inode->dirtied_when) / HZ);
+
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	seq_printf(m, "%8u %5u %-16s",
+			inode->i_access_count,
+			inode->i_cuid,
+			inode->i_comm);
+#endif
+
+	seq_printf(m, "%02x:%02x(%s)\t",
+			MAJOR(inode->i_sb->s_dev),
+			MINOR(inode->i_sb->s_dev),
+			inode->i_sb->s_id);
+
+	if (list_empty(&inode->i_dentry)) {
+		if (!atomic_read(&inode->i_count))
+			seq_puts(m, "(noname)\n");
+		else
+			seq_printf(m, "(%02x:%02x)\n",
+					imajor(inode), iminor(inode));
+	} else {
+		struct path path = {
+			.mnt = NULL,
+			.dentry = list_entry(inode->i_dentry.next,
+					     struct dentry, d_alias)
+		};
+
+		seq_path(m, &path, " \t\n\\");
+		seq_putc(m, '\n');
+	}
+}
+
+static int ii_show(struct seq_file *m, void *v)
+{
+	unsigned long index = *(loff_t *) v;
+	struct session *s = m->private;
+	struct inode *inode;
+
+	if (index == 0) {
+		seq_puts(m, "# filecache " FILECACHE_VERSION "\n");
+		seq_puts(m, "#      ino       size   cached cached% "
+				"refcnt state       age "
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+				"accessed   uid process         "
+#endif
+				"dev\t\tfile\n");
+	}
+
+	inode = iwin_inode(s, index);
+	show_inode(m, inode);
+
+	return 0;
+}
+
+static void *ii_start(struct seq_file *m, loff_t *pos)
+{
+	struct session *s = m->private;
+
+	s->iwin.size = 0;
+	s->iwin.inodes = (struct inode **)
+				__get_free_pages( GFP_KERNEL, IWIN_PAGE_ORDER);
+	if (!s->iwin.inodes)
+		return NULL;
+
+	spin_lock(&inode_lock);
+
+	return iwin_inode(s, *pos) ? pos : NULL;
+}
+
+static void *ii_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct session *s = m->private;
+
+	(*pos)++;
+	return iwin_inode(s, *pos) ? pos : NULL;
+}
+
+static void ii_stop(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct inode *inode = s->ipos.inode;
+
+	if (!s->iwin.inodes)
+		return;
+
+	if (inode) {
+		__iget(inode);
+		s->ipos.state = inode->i_state;
+	}
+	spin_unlock(&inode_lock);
+
+	free_pages((unsigned long) s->iwin.inodes, IWIN_PAGE_ORDER);
+	if (s->ipos.pinned_inode)
+		iput(s->ipos.pinned_inode);
+	s->ipos.pinned_inode = inode;
+}
+
+/*
+ * Listing of cached page ranges of a file.
+ *
+ * Usage:
+ * 		echo 'file name' > /proc/filecache
+ * 		cat /proc/filecache
+ */
+
+unsigned long page_mask;
+#define PG_MMAP		PG_lru		/* reuse any non-relevant flag */
+#define PG_BUFFER	PG_swapcache	/* ditto */
+#define PG_DIRTY	PG_error	/* ditto */
+#define PG_WRITEBACK	PG_buddy	/* ditto */
+
+/*
+ * Page state names, prefixed by their abbreviations.
+ */
+struct {
+	unsigned long	mask;
+	const char     *name;
+	int		faked;
+} page_flag [] = {
+	{1 << PG_referenced,	"R:referenced",	0},
+	{1 << PG_active,	"A:active",	0},
+	{1 << PG_MMAP,		"M:mmap",	1},
+
+	{1 << PG_uptodate,	"U:uptodate",	0},
+	{1 << PG_dirty,		"D:dirty",	0},
+	{1 << PG_writeback,	"W:writeback",	0},
+	{1 << PG_reclaim,	"X:readahead",	0},
+
+	{1 << PG_private,	"P:private",	0},
+	{1 << PG_owner_priv_1,	"O:owner",	0},
+
+	{1 << PG_BUFFER,	"b:buffer",	1},
+	{1 << PG_DIRTY,		"d:dirty",	1},
+	{1 << PG_WRITEBACK,	"w:writeback",	1},
+};
+
+static unsigned long page_flags(struct page* page)
+{
+	unsigned long flags;
+	struct address_space *mapping = page_mapping(page);
+
+	flags = page->flags & page_mask;
+
+	if (page_mapped(page))
+		flags |= (1 << PG_MMAP);
+
+	if (page_has_buffers(page))
+		flags |= (1 << PG_BUFFER);
+
+	if (mapping) {
+		if (radix_tree_tag_get(&mapping->page_tree,
+					page_index(page),
+					PAGECACHE_TAG_WRITEBACK))
+			flags |= (1 << PG_WRITEBACK);
+
+		if (radix_tree_tag_get(&mapping->page_tree,
+					page_index(page),
+					PAGECACHE_TAG_DIRTY))
+			flags |= (1 << PG_DIRTY);
+	}
+
+	return flags;
+}
+
+static int pages_similiar(struct page* page0, struct page* page)
+{
+	if (page_count(page0) != page_count(page))
+		return 0;
+
+	if (page_flags(page0) != page_flags(page))
+		return 0;
+
+	return 1;
+}
+
+static void show_range(struct seq_file *m, struct page* page, unsigned long len)
+{
+	int i;
+	unsigned long flags;
+
+	if (!m || !page)
+		return;
+
+	seq_printf(m, "%lu\t%lu\t", page->index, len);
+
+	flags = page_flags(page);
+	for (i = 0; i < ARRAY_SIZE(page_flag); i++)
+		seq_putc(m, (flags & page_flag[i].mask) ?
+					page_flag[i].name[0] : '_');
+
+	seq_printf(m, "\t%d\n", page_count(page));
+}
+
+#define BATCH_LINES	100
+static pgoff_t show_file_cache(struct seq_file *m,
+				struct address_space *mapping, pgoff_t start)
+{
+	int i;
+	int lines = 0;
+	pgoff_t len = 0;
+	struct pagevec pvec;
+	struct page *page;
+	struct page *page0 = NULL;
+
+	for (;;) {
+		pagevec_init(&pvec, 0);
+		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pvec.pages, start + len, PAGEVEC_SIZE);
+
+		if (pvec.nr == 0) {
+			show_range(m, page0, len);
+			start = ULONG_MAX;
+			goto out;
+		}
+
+		if (!page0)
+			page0 = pvec.pages[0];
+
+		for (i = 0; i < pvec.nr; i++) {
+			page = pvec.pages[i];
+
+			if (page->index == start + len &&
+					pages_similiar(page0, page))
+				len++;
+			else {
+				show_range(m, page0, len);
+				page0 = page;
+				start = page->index;
+				len = 1;
+				if (++lines > BATCH_LINES)
+					goto out;
+			}
+		}
+	}
+
+out:
+	return start;
+}
+
+static int pg_show(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+	pgoff_t offset;
+
+	if (!file)
+		return ii_show(m, v);
+
+	offset = *(loff_t *) v;
+
+	if (!offset) { /* print header */
+		int i;
+
+		seq_puts(m, "# file ");
+		seq_path(m, &file->f_path, " \t\n\\");
+
+		seq_puts(m, "\n# flags");
+		for (i = 0; i < ARRAY_SIZE(page_flag); i++)
+			seq_printf(m, " %s", page_flag[i].name);
+
+		seq_puts(m, "\n# idx\tlen\tstate\t\trefcnt\n");
+	}
+
+	s->start_offset = offset;
+	s->next_offset = show_file_cache(m, file->f_mapping, offset);
+
+	return 0;
+}
+
+static void *file_pos(struct file *file, loff_t *pos)
+{
+	loff_t size = i_size_read(file->f_mapping->host);
+	pgoff_t end = DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
+	pgoff_t offset = *pos;
+
+	return offset < end ? pos : NULL;
+}
+
+static void *pg_start(struct seq_file *m, loff_t *pos)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+	pgoff_t offset = *pos;
+
+	if (!file)
+		return ii_start(m, pos);
+
+	rcu_read_lock();
+
+	if (offset - s->start_offset == 1)
+		*pos = s->next_offset;
+	return file_pos(file, pos);
+}
+
+static void *pg_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+
+	if (!file)
+		return ii_next(m, v, pos);
+
+	*pos = s->next_offset;
+	return file_pos(file, pos);
+}
+
+static void pg_stop(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+
+	if (!file)
+		return ii_stop(m, v);
+
+	rcu_read_unlock();
+}
+
+struct seq_operations seq_filecache_op = {
+	.start	= pg_start,
+	.next	= pg_next,
+	.stop	= pg_stop,
+	.show	= pg_show,
+};
+
+/*
+ * Implement the manual drop-all-pagecache function
+ */
+
+#define MAX_INODES	(PAGE_SIZE / sizeof(struct inode *))
+static int drop_pagecache(void)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct inode *inode;
+	struct inode **inodes;
+	unsigned long i, j, k;
+	int err = 0;
+
+	inodes = (struct inode **)__get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
+	if (!inodes)
+		return -ENOMEM;
+
+	for (i = 0; (head = get_inode_hash_budget(i)); i++) {
+		if (hlist_empty(head))
+			continue;
+
+		j = 0;
+		cond_resched();
+
+		/*
+		 * Grab some inodes.
+		 */
+		spin_lock(&inode_lock);
+		hlist_for_each (node, head) {
+			inode = hlist_entry(node, struct inode, i_hash);
+			if (!atomic_read(&inode->i_count))
+				continue;
+			if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+				continue;
+			if (!inode->i_mapping || !inode->i_mapping->nrpages)
+				continue;
+			__iget(inode);
+			inodes[j++] = inode;
+			if (j >= MAX_INODES)
+				break;
+		}
+		spin_unlock(&inode_lock);
+
+		/*
+		 * Free clean pages.
+		 */
+		for (k = 0; k < j; k++) {
+			inode = inodes[k];
+			invalidate_mapping_pages(inode->i_mapping, 0, ~1);
+			iput(inode);
+		}
+
+		/*
+		 * Simply ignore the remaining inodes.
+		 */
+		if (j >= MAX_INODES && !err) {
+			printk(KERN_WARNING
+				"Too many collisions in inode hash table.\n"
+				"Please boot with a larger ihash_entries=XXX.\n");
+			err = -EAGAIN;
+		}
+	}
+
+	free_pages((unsigned long) inodes, IWIN_PAGE_ORDER);
+	return err;
+}
+
+static void drop_slabcache(void)
+{
+	int nr_objects;
+
+	do {
+		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+	} while (nr_objects > 10);
+}
+
+/*
+ * Proc file operations.
+ */
+
+static int filecache_open(struct inode *inode, struct file *proc_file)
+{
+	struct seq_file *m;
+	struct session *s;
+	unsigned size;
+	char *buf = 0;
+	int ret;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENOENT;
+
+	s = session_create();
+	if (IS_ERR(s)) {
+		ret = PTR_ERR(s);
+		goto out;
+	}
+	set_session(proc_file, s);
+
+	size = SBUF_SIZE;
+	buf = kmalloc(size, GFP_KERNEL);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = seq_open(proc_file, &seq_filecache_op);
+	if (!ret) {
+		m = proc_file->private_data;
+		m->private = s;
+		m->buf = buf;
+		m->size = size;
+	}
+
+out:
+	if (ret) {
+		kfree(s);
+		kfree(buf);
+		module_put(THIS_MODULE);
+	}
+	return ret;
+}
+
+static int filecache_release(struct inode *inode, struct file *proc_file)
+{
+	struct session *s = get_session(proc_file);
+	int ret;
+
+	session_release(s);
+	ret = seq_release(inode, proc_file);
+	module_put(THIS_MODULE);
+	return ret;
+}
+
+ssize_t filecache_write(struct file *proc_file, const char __user * buffer,
+			size_t count, loff_t *ppos)
+{
+	struct session *s;
+	char *name;
+	int err = 0;
+
+	if (count >= PATH_MAX + 5)
+		return -ENAMETOOLONG;
+
+	name = kmalloc(count+1, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	if (copy_from_user(name, buffer, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	/* strip the optional newline */
+	if (count && name[count-1] == '\n')
+		name[count-1] = '\0';
+	else
+		name[count] = '\0';
+
+	s = get_session(proc_file);
+	if (!strcmp(name, "set private")) {
+		s->private_session = 1;
+		goto out;
+	}
+
+	if (!strncmp(name, "cat ", 4)) {
+		err = session_update_file(s, name+4);
+		goto out;
+	}
+
+	if (!strncmp(name, "ls", 2)) {
+		err = session_update_file(s, NULL);
+		if (!err)
+			err = ls_parse_options(name+2, s);
+		if (!err && !s->private_session) {
+			global_session.ls_dev = s->ls_dev;
+			global_session.ls_options = s->ls_options;
+		}
+		goto out;
+	}
+
+	if (!strncmp(name, "drop pagecache", 14)) {
+		err = drop_pagecache();
+		goto out;
+	}
+
+	if (!strncmp(name, "drop slabcache", 14)) {
+		drop_slabcache();
+		goto out;
+	}
+
+	/* err = -EINVAL; */
+	err = session_update_file(s, name);
+
+out:
+	kfree(name);
+
+	return err ? err : count;
+}
+
+static struct file_operations proc_filecache_fops = {
+	.owner		= THIS_MODULE,
+	.open		= filecache_open,
+	.release	= filecache_release,
+	.write		= filecache_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+};
+
+
+static __init int filecache_init(void)
+{
+	int i;
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("filecache", 0600, NULL);
+	if (entry)
+		entry->proc_fops = &proc_filecache_fops;
+
+	for (page_mask = i = 0; i < ARRAY_SIZE(page_flag); i++)
+		if (!page_flag[i].faked)
+			page_mask |= page_flag[i].mask;
+
+	return 0;
+}
+
+static void filecache_exit(void)
+{
+	remove_proc_entry("filecache", NULL);
+	if (global_session.query_file)
+		fput(global_session.query_file);
+}
+
+MODULE_AUTHOR("Fengguang Wu <wfg-fOMaevN1BEbsJZF79Ady7g@public.gmane.org>");
+MODULE_LICENSE("GPL");
+
+module_init(filecache_init);
+module_exit(filecache_exit);
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -685,6 +685,12 @@ struct inode {
 	void			*i_security;
 #endif
 	void			*i_private; /* fs or device private pointer */
+
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	unsigned int		i_access_count;	/* opened how many times? */
+	uid_t			i_cuid;		/* opened first by which user? */
+	char			i_comm[16];	/* opened first by which app? */
+#endif
 };
 
 /*
@@ -773,6 +779,13 @@ static inline unsigned imajor(const stru
 	return MAJOR(inode->i_rdev);
 }
 
+static inline void inode_accessed(struct inode *inode)
+{
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	inode->i_access_count++;
+#endif
+}
+
 extern struct block_device *I_BDEV(struct inode *inode);
 
 struct fown_struct {
@@ -1907,6 +1920,7 @@ extern void remove_inode_hash(struct ino
 static inline void insert_inode_hash(struct inode *inode) {
 	__insert_inode_hash(inode, inode->i_ino);
 }
+struct hlist_head * get_inode_hash_budget(unsigned long index);
 
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -828,6 +828,7 @@ static struct file *__dentry_open(struct
 			goto cleanup_all;
 	}
 
+	inode_accessed(inode);
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 
 	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -750,6 +750,36 @@ config CONFIGFS_FS
 	  Both sysfs and configfs can and should exist together on the
 	  same system. One is not a replacement for the other.
 
+config PROC_FILECACHE
+	tristate "/proc/filecache support"
+	default m
+	depends on PROC_FS
+	help
+	  This option creates a file /proc/filecache which enables one to
+	  query/drop the cached files in memory.
+
+	  A quick start guide:
+
+	  # echo 'ls' > /proc/filecache
+	  # head /proc/filecache
+
+	  # echo 'cat /bin/bash' > /proc/filecache
+	  # head /proc/filecache
+
+	  # echo 'drop pagecache' > /proc/filecache
+	  # echo 'drop slabcache' > /proc/filecache
+
+	  For more details, please check Documentation/filesystems/proc.txt .
+
+	  It can be a handy tool for sysadms and desktop users.
+
+config PROC_FILECACHE_EXTRAS
+	bool "track extra states"
+	default y
+	depends on PROC_FILECACHE
+	help
	  Track extra states that cost a little more time/space.
+
 endmenu
 
 menu "Miscellaneous filesystems"
--- linux-2.6.orig/fs/proc/Makefile
+++ linux-2.6/fs/proc/Makefile
@@ -2,7 +2,8 @@
 # Makefile for the Linux proc filesystem routines.
 #
 
-obj-$(CONFIG_PROC_FS) += proc.o
+obj-$(CONFIG_PROC_FS)		+= proc.o
+obj-$(CONFIG_PROC_FILECACHE)	+= filecache.o
 
 proc-y			:= nommu.o task_nommu.o
 proc-$(CONFIG_MMU)	:= mmu.o task_mmu.o
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -23,9 +23,66 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/pagevec.h>
 #include "internal.h"
 
 
+int sysctl_dirty_debug __read_mostly;
+
+void print_page(struct page *page)
+{
+	printk(KERN_DEBUG "%lu\t%u\t%u\t%c%c%c%c%c\n",
+			page->index,
+			page_count(page),
+			page_mapcount(page),
+			PageUptodate(page)      ? 'U' : '_',
+			PageDirty(page)         ? 'D' : '_',
+			PageWriteback(page)     ? 'W' : '_',
+			PagePrivate(page)       ? 'P' : '_',
+			PageLocked(page)        ? 'L' : '_');
+}
+
+void print_inode_pages(struct inode *inode)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct pagevec pvec;
+	int nr_pages;
+	int i;
+	struct dentry *dentry;
+	int dcount;
+	char *dname;
+
+	rcu_read_lock();
+	nr_pages = radix_tree_gang_lookup_tag(&mapping->page_tree,
+				(void **)pvec.pages, 0, PAGEVEC_SIZE,
+				PAGECACHE_TAG_DIRTY);
+	rcu_read_unlock();
+
+	if (list_empty(&inode->i_dentry)) {
+		dname = "";
+		dcount = 0;
+	} else {
+		dentry = list_entry(inode->i_dentry.next,
+					struct dentry, d_alias);
+		dname = dentry->d_iname;
+		dcount = atomic_read(&dentry->d_count);
+	}
+
+	printk(KERN_DEBUG "inode %lu(%s/%s) count %d,%d size %llu pages %lu\n",
+			inode->i_ino,
+			inode->i_sb->s_id,
+			dname,
+			atomic_read(&inode->i_count),
+			dcount,
+			i_size_read(inode),
+			mapping->nrpages
+	      );
+
+	for (i = 0; i < nr_pages; i++)
+		print_page(pvec.pages[i]);
+}
+
+
 /**
  * writeback_acquire - attempt to get exclusive writeback access to a device
  * @bdi: the device's backing_dev_info structure
@@ -179,6 +236,11 @@ static int write_inode(struct inode *ino
 	return 0;
 }
 
+#define redirty_tail(inode)						\
+	do {								\
+		__redirty_tail(inode, __LINE__);			\
+	} while (0)
+
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -188,10 +250,16 @@ static int write_inode(struct inode *ino
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void __redirty_tail(struct inode *inode, int line)
 {
 	struct super_block *sb = inode->i_sb;
 
+	if (sysctl_dirty_debug) {
+		printk(KERN_DEBUG "redirty_tail line %d: inode %lu\n",
+				line, inode->i_ino);
+		print_inode_pages(inode);
+	}
+
 	if (!list_empty(&sb->s_dirty)) {
 		struct inode *tail_inode;
 
@@ -203,12 +271,23 @@ static void redirty_tail(struct inode *i
 	list_move(&inode->i_list, &sb->s_dirty);
 }
 
+#define requeue_io(inode)						\
+	do {								\
+		__requeue_io(inode, __LINE__);				\
+	} while (0)
+
 /*
  * requeue inode for re-scanning after sb->s_io list is exhausted.
  */
-static void requeue_io(struct inode *inode)
+static void __requeue_io(struct inode *inode, int line)
 {
 	list_move(&inode->i_list, &inode->i_sb->s_more_io);
+
+	if (sysctl_dirty_debug) {
+		printk(KERN_DEBUG "requeue_io line %d: inode %lu\n",
+				line, inode->i_ino);
+		print_inode_pages(inode);
+	}
 }
 
 static void inode_sync_complete(struct inode *inode)
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -159,5 +159,6 @@ void writeback_set_ratelimit(void);
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
 				   read-only. */
 
+extern int sysctl_dirty_debug;
 
 #endif		/* WRITEBACK_H */
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -1344,6 +1344,14 @@ static struct ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "dirty_debug",
+		.data		= &sysctl_dirty_debug,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.ctl_name	= CTL_UNNUMBERED,
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -104,6 +104,35 @@ EXPORT_SYMBOL(laptop_mode);
 
 /* End of sysctl-exported parameters */
 
+#define writeback_debug_report(n, wbc) do {                             \
	if (sysctl_dirty_debug)						\
+		__writeback_debug_report(n, wbc,			\
+				__FILE__, __LINE__, __FUNCTION__);	\
+} while (0)
+
+void print_writeback_control(struct writeback_control *wbc)
+{
+	printk(KERN_DEBUG
+			"global dirty %lu writeback %lu nfs %lu "
+			"flags %c%c towrite %ld skipped %ld\n",
+			global_page_state(NR_FILE_DIRTY),
+			global_page_state(NR_WRITEBACK),
+			global_page_state(NR_UNSTABLE_NFS),
+			wbc->encountered_congestion ? 'C':'_',
+			wbc->more_io ? 'M':'_',
+			wbc->nr_to_write,
+			wbc->pages_skipped);
+}
+
+void __writeback_debug_report(long n, struct writeback_control *wbc,
+		const char *file, int line, const char *func)
+{
+	printk(KERN_DEBUG "%s %d %s: %s(%d) %ld\n",
+			file, line, func,
+			current->comm, current->pid,
+			n);
+	print_writeback_control(wbc);
+}
 
 static void background_writeout(unsigned long _min_pages);
 
@@ -476,6 +505,7 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			get_dirty_limits(&background_thresh, &dirty_thresh,
 				       &bdi_thresh, bdi);
+			writeback_debug_report(pages_written, &wbc);
 		}
 
 		/*
@@ -502,6 +532,7 @@ static void balance_dirty_pages(struct a
 			break;		/* We've done our duty */
 
 		congestion_wait(WRITE, HZ/10);
+		writeback_debug_report(-pages_written, &wbc);
 	}
 
 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -596,6 +627,11 @@ void throttle_vm_writeout(gfp_t gfp_mask
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
                 congestion_wait(WRITE, HZ/10);
+		printk(KERN_DEBUG "throttle_vm_writeout: "
+				"congestion_wait on %lu+%lu > %lu\n",
+				global_page_state(NR_UNSTABLE_NFS),
+				global_page_state(NR_WRITEBACK),
+				dirty_thresh);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -645,7 +681,9 @@ static void background_writeout(unsigned
 			else
 				break;
 		}
+		writeback_debug_report(min_pages, &wbc);
 	}
+	writeback_debug_report(min_pages, &wbc);
 }
 
 /*
@@ -718,7 +756,9 @@ static void wb_kupdate(unsigned long arg
 				break;	/* All the old data is written */
 		}
 		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		writeback_debug_report(nr_to_write, &wbc);
 	}
+	writeback_debug_report(nr_to_write, &wbc);
 	if (time_before(next_jif, jiffies + HZ))
 		next_jif = jiffies + HZ;
 	if (dirty_writeback_interval)


* Re: Page Cache writeback too slow,   SSD/noop scheduler/ext2
  2009-03-25  5:26     ` Page Cache writeback too slow, SSD/noop scheduler/ext2 Wu Fengguang
@ 2009-03-27 16:59       ` Jos Houtman
       [not found]         ` <C5F2C492.D4A8%jos-vMeIAzyucXQ@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Jos Houtman @ 2009-03-27 16:59 UTC (permalink / raw)
  To: Wu Fengguang, Nick Piggin
  Cc: linux-kernel, Jeff Layton, Dave Chinner, linux-fsdevel,
	jens.axboe, akpm, hch, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1659 bytes --]

Hi,

>> 
>> kupdate surely should just continue to keep trying to write back pages
>> so long as there are more old pages to clean, and the queue isn't
>> congested. That seems to be the intention anyway: MAX_WRITEBACK_PAGES
>> is just the number to write back in a single call, but you see
>> nr_to_write is set to the number of dirty pages in the system.

And when it's congested it should just wait a little bit before continuing.
 
>> On your system, what must be happening is more_io is not being set.
>> The logic in fs/fs-writeback.c might be busted.

I don't know about more_io, but I agree that the logic seems busted.

> 
> Hi Jos,
> 
> I prepared a debugging patch for 2.6.28. (I cannot observe writeback
> problems on my local ext2 mount.)

Thanx for the patch, but for next time: how should I apply it?
It seems to be context-aware (@@) but failed to apply on every kernel version I
tried: 2.6.28/2.6.28.7/2.6.29.

Because I only saw the patch a few hours ago and didn't want to block on your
reply, I applied it manually and in the process ported it to 2.6.29.

As for the information the patch provided: It is most helpful.

Attached you will find a list of files containing dirty pages and the count of
their dirty pages; there is also a dmesg output in which I trace the writeback
for 40 seconds.


I did some testing of my own using printks, and what I saw is that the inodes
located on sdb1 (the database) would often pass
http://lxr.linux.no/linux+v2.6.29/fs/fs-writeback.c#L335
and then redirty_tail would be called. I haven't had time to dig deeper, but
that is my primary suspect for the moment.


Thanx again, 

Jos



[-- Attachment #2: filecache-27-march.txt --]
[-- Type: application/octet-stream, Size: 4636 bytes --]

grep dirty /proc/vmstat; for i in $( echo ls > /proc/filecache; cat /proc/filecache | awk '{ if ($6 ~ /[D]/) print $0 }' | awk '{if ($12 ~ /database/) print $12}' ); do echo -n "$i  dirty: "; echo /var/lib/mysql/$i > /proc/filecache; cat /proc/filecache | grep D | wc -l; done; echo ls > /proc/filecache; cat /proc/filecache | awk '{ if ($6 ~ /[D]/) print $0 }' 
nr_dirty 494902
/database/tbltable3.MYD  dirty: 58101
/database/tbltable3.MYI  dirty: 146806
/database/tbltable4.MYD  dirty: 85737
/database/tbltable4.MYI  dirty: 101299
/database/tbltable1.MYI  dirty: 1727
/database/tbltable5.MYD  dirty: 27189
/database/tbltable5.MYI  dirty: 16847
/database/tbltable2.MYI  dirty: 3
/database/tbltable2.MYD  dirty: 3
/database/tbltable1.MYD  dirty: 4
         0    5242880     1072       0      0    D-         6        0     0 mount           00:02(bdev)        (fd:02)
         0    2097152      884       0      0    D-         6        0     0 mount           00:02(bdev)        (fd:03)
         0    5242880     5532       0      0    D-        23        0     0 mount           00:02(bdev)        (fd:01)
    196632          1        4     100      1    D-         5        1    60 mysqld          fd:01(dm-1)        /run/mysqld/relay-log.info
    196634     306501   306504     100      1    D-        15        1    60 mysqld          fd:01(dm-1)        /run/mysqld/mysqld-relay-bin.000002
     49163        466       48      10      1    D-        12        1     0 syslog-ng       fd:02(dm-2)        /messages
        23          1        4     100      0    D-        11      104     0 bash            fd:03(dm-3)        /differenceWithLastTime_dm-3
        22          1        4     100      0    D-        11      104     0 bash            fd:03(dm-3)        /differenceWithLastTime_dm-2
        21          1        4     100      0    D-        11      104     0 bash            fd:03(dm-3)        /differenceWithLastTime_dm-1
        20          1        4     100      0    D-        11      104     0 bash            fd:03(dm-3)        /differenceWithLastTime_dm-0
        19          1        4     100      0    D-        11      104     0 bash            fd:03(dm-3)        /differenceWithLastTime_sdb1
        18          1        4     100      0    D-        12      104     0 bash            fd:03(dm-3)        /differenceWithLastTime_sda3
        17          1        4     100      0    D-        12      104     0 bash            fd:03(dm-3)        /differenceWithLastTime_sda2
        16          1        4     100      0    D-        12      104     0 bash            fd:03(dm-3)        /differenceWithLastTime_sda1
        14          1        4     100      0    D-        12      104     0 sh              fd:03(dm-3)        /netdev_old
        15          1        4     100      0    D-        12      104     0 perl            fd:03(dm-3)        /.oldnetstat
        13          1        4     100      0    D-        12      520     0 sh              fd:03(dm-3)        /netdev_new
   5865478    3778125   251852       6      2    D-        10       12     0 du              08:11(sdb1)        /database/tbltable3.MYD
   5865487   12871243  1132552       8      1    D-        10       10     0 du              08:11(sdb1)        /database/tbltable3.MYI
   5865475   21707331   503688       2      2    D-        10       11     0 du              08:11(sdb1)        /database/tbltable4.MYD
   5865495   16414241   909488       5      1    D-        10       10     0 du              08:11(sdb1)        /database/tbltable4.MYI
        11          1        4     100      1    D-        15        1    60 mysqld          08:11(sdb1)        /master.info
   5865477     150818    51952      34      1    D-        15       10     0 du              08:11(sdb1)        /database/tbltable1.MYI
   5865479     495759   131276      26      1    D-        15       10     0 du              08:11(sdb1)        /database/tbltable5.MYD
   5865497     903791   276828      30      1    D-        15       10     0 du              08:11(sdb1)        /database/tbltable5.MYI
   5865499     118388       88       0      1    D-        17       10     0 du              08:11(sdb1)        /database/tbltable2.MYI
   5865480     325977      120       0      1    D-        17       10     0 du              08:11(sdb1)        /database/tbltable2.MYD
   5865489      78503     3300       4      2    D-        18       10     0 du              08:11(sdb1)        /database/tbltable1.MYD
       175          1        4     100      1    D-      3100     1161     0 udevd           00:0e(tmpfs)       /.udev/uevent_seqnum

[-- Attachment #3: dmesg-27-march.txt --]
[-- Type: application/octet-stream, Size: 17351 bytes --]

redirty_tail line 539: inode 10144
inode 10144(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 447069
global dirty 438821 writeback 0 nfs 0 flags __ towrite 981 skipped 0
redirty_tail line 417: inode 5865495
inode 5865495(sdb1/tbltable4.MYI) count 2,1 size 16807325696 pages 197538
0       2       0       UD_P_
1340    2       0       UD_P_
1367    2       0       UD_P_
1417    2       0       UD_P_
3578    2       0       UD_P_
3745    2       0       UD_P_
3928    2       0       UD_P_
4116    2       0       UD_P_
4154    2       0       UD_P_
4207    2       0       UD_P_
4838    2       0       UD_P_
4839    2       0       UD_P_
4840    2       0       UD_P_
5600    2       0       UD_P_
redirty_tail line 539: inode 1522
inode 1522(sysfs/0000:01:00.0) count 1,5 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 448366
global dirty 440133 writeback 91 nfs 0 flags C_ towrite 904 skipped 0
redirty_tail line 417: inode 5865475
inode 5865475(sdb1/tbltable4.MYD) count 2,2 size 22227138512 pages 106269
373     2       0       UD_P_
1041    2       0       UD_P_
2134    2       0       UD_P_
3075    2       0       UD_P_
4563    2       0       UD_P_
5306    2       0       UD_P_
5426    2       0       UD_P_
5473    2       0       UD_P_
5657    2       0       UD_P_
5770    2       0       UD_P_
5983    2       0       UD_P_
7883    2       0       UD_P_
9061    2       0       UD_P_
9241    2       0       UD_P_
redirty_tail line 539: inode 10297
inode 10297(sysfs/subsystem) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 448354
global dirty 440133 writeback 91 nfs 0 flags C_ towrite 1012 skipped 0
redirty_tail line 417: inode 5865487
inode 5865487(sdb1/tbltable3.MYI) count 2,1 size 13179606016 pages 250251
0       2       0       UD_P_
65      2       0       UD_P_
109     2       0       UD_P_
195     2       0       UD_P_
200     2       0       UD_P_
368     2       0       UD_P_
473     2       0       UD_P_
481     2       0       UD_P_
530     2       0       UD_P_
547     2       0       UD_P_
695     2       0       UD_P_
735     2       0       UD_P_
751     2       0       UD_P_
970     2       0       UD_P_
redirty_tail line 539: inode 10302
inode 10302(sysfs/driver) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 448342
global dirty 440133 writeback 91 nfs 0 flags C_ towrite 1012 skipped 0
redirty_tail line 417: inode 5865478
inode 5865478(sdb1/tbltable3.MYD) count 2,2 size 3868679145 pages 55009
159     2       0       UD_P_
195     2       0       UD_P_
207     2       0       UD_P_
211     2       0       UD_P_
220     2       0       UD_P_
282     2       0       UD_P_
283     2       0       UD_P_
324     2       0       UD_P_
436     2       0       UD_P_
439     2       0       UD_P_
548     2       0       UD_P_
553     2       0       UD_P_
682     2       0       UD_P_
934     2       0       UD_P_
redirty_tail line 539: inode 10066
inode 10066(sysfs/timeout) count 1,0 size 4096 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 448329
global dirty 440042 writeback 91 nfs 0 flags C_ towrite 1011 skipped 0
redirty_tail line 539: inode 10289
inode 10289(sysfs/timeout) count 1,0 size 4096 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 448329
global dirty 440042 writeback 91 nfs 0 flags __ towrite 1024 skipped 0
redirty_tail line 539: inode 8766
inode 8766(sysfs/ram0) count 1,14 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 448543
global dirty 440295 writeback 0 nfs 0 flags __ towrite 1024 skipped 0
redirty_tail line 539: inode 8778
inode 8778(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 448873
global dirty 440625 writeback 0 nfs 0 flags __ towrite 1016 skipped 0
redirty_tail line 539: inode 8780
inode 8780(sysfs/holders) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 449115
global dirty 440867 writeback 0 nfs 0 flags __ towrite 1021 skipped 0
redirty_tail line 539: inode 8781
inode 8781(sysfs/slaves) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 449441
global dirty 441193 writeback 0 nfs 0 flags __ towrite 984 skipped 0
redirty_tail line 417: inode 5865497
inode 5865497(sdb1/tbltable5.MYI) count 2,1 size 925447168 pages 63265
0       2       0       UD_P_
1172    2       0       UD_P_
1187    2       0       UD_P_
1522    2       0       UD_P_
1567    2       0       UD_P_
1720    2       0       UD_P_
1760    2       0       UD_P_
1807    2       0       UD_P_
1811    2       0       UD_P_
2272    2       0       UD_P_
2377    2       0       UD_P_
2534    2       0       UD_P_
2647    2       0       UD_P_
2777    2       0       UD_P_
redirty_tail line 539: inode 8791
inode 8791(sysfs/ram1) count 1,14 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449463
global dirty 441220 writeback 182 nfs 0 flags C_ towrite 564 skipped 0
redirty_tail line 417: inode 5865479
inode 5865479(sdb1/tbltable5.MYD) count 2,1 size 507638336 pages 29268
0       2       0       UD_P_
1       2       0       UD_P_
2       2       0       UD_P_
3       2       0       UD_P_
4       2       0       UD_P_
8       2       0       UD_P_
12      2       0       UD_P_
16      2       0       UD_P_
22      2       0       UD_P_
26      2       0       UD_P_
32      2       0       UD_P_
33      2       0       UD_P_
40      2       0       UD_P_
41      2       0       UD_P_
redirty_tail line 539: inode 8803
inode 8803(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449448
global dirty 441220 writeback 182 nfs 0 flags C_ towrite 1009 skipped 0
redirty_tail line 539: inode 8805
inode 8805(sysfs/holders) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449448
global dirty 441220 writeback 182 nfs 0 flags C_ towrite 1024 skipped 0
redirty_tail line 539: inode 8806
inode 8806(sysfs/slaves) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449448
global dirty 441220 writeback 182 nfs 0 flags C_ towrite 1024 skipped 0
redirty_tail line 539: inode 8816
inode 8816(sysfs/ram2) count 1,14 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449448
global dirty 441220 writeback 182 nfs 0 flags C_ towrite 1024 skipped 0
redirty_tail line 539: inode 8828
inode 8828(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449448
global dirty 441220 writeback 182 nfs 0 flags C_ towrite 1024 skipped 0
redirty_tail line 417: inode 5865477
inode 5865477(sdb1/tbltable1.MYI) count 2,1 size 154434560 pages 11790
0       2       0       UD_P_
68      2       0       UD_P_
106     2       0       UD_P_
131     2       0       UD_P_
185     2       0       UD_P_
244     2       0       UD_P_
252     2       0       UD_P_
331     2       0       UD_P_
393     2       0       UD_P_
394     2       0       UD_P_
395     2       0       UD_P_
400     2       0       UD_P_
482     2       0       UD_P_
781     2       0       UD_P_
redirty_tail line 539: inode 8830
inode 8830(sysfs/holders) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449436
global dirty 441220 writeback 182 nfs 0 flags C_ towrite 1012 skipped 0
redirty_tail line 539: inode 8831
inode 8831(sysfs/slaves) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 449436
global dirty 441220 writeback 182 nfs 0 flags __ towrite 1024 skipped 0
redirty_tail line 417: inode 5865495
inode 5865495(sdb1/tbltable4.MYI) count 2,1 size 16807348224 pages 197993
0       2       0       UD_P_
1340    2       0       UD_P_
1367    2       0       UD_P_
1417    2       0       UD_P_
3578    2       0       UD_P_
3745    2       0       UD_P_
3928    2       0       UD_P_
4116    2       0       UD_P_
4154    2       0       UD_P_
4207    2       0       UD_P_
4838    2       0       UD_P_
4839    2       0       UD_P_
4840    2       0       UD_P_
5600    2       0       UD_P_
redirty_tail line 539: inode 8841
inode 8841(sysfs/ram3) count 1,14 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449551
global dirty 441343 writeback 91 nfs 0 flags C_ towrite 893 skipped 0
redirty_tail line 417: inode 5865475
inode 5865475(sdb1/tbltable4.MYD) count 2,2 size 22227169324 pages 106507
373     2       0       UD_P_
1041    2       0       UD_P_
2134    2       0       UD_P_
3075    2       0       UD_P_
4563    2       0       UD_P_
5306    2       0       UD_P_
5426    2       0       UD_P_
5473    2       0       UD_P_
5657    2       0       UD_P_
5770    2       0       UD_P_
5983    2       0       UD_P_
7883    2       0       UD_P_
9061    2       0       UD_P_
9241    2       0       UD_P_
redirty_tail line 539: inode 8853
inode 8853(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449538
global dirty 441343 writeback 91 nfs 0 flags C_ towrite 1011 skipped 0
redirty_tail line 417: inode 5865487
inode 5865487(sdb1/tbltable3.MYI) count 2,1 size 13179631616 pages 251244
0       2       0       UD_P_
65      2       0       UD_P_
109     2       0       UD_P_
195     2       0       UD_P_
200     2       0       UD_P_
368     2       0       UD_P_
473     2       0       UD_P_
481     2       0       UD_P_
530     2       0       UD_P_
547     2       0       UD_P_
695     2       0       UD_P_
735     2       0       UD_P_
751     2       0       UD_P_
970     2       0       UD_P_
redirty_tail line 539: inode 8855
inode 8855(sysfs/holders) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449526
global dirty 441343 writeback 91 nfs 0 flags C_ towrite 1012 skipped 0
redirty_tail line 417: inode 5865478
inode 5865478(sdb1/tbltable3.MYD) count 2,2 size 3868689540 pages 55174
159     2       0       UD_P_
195     2       0       UD_P_
207     2       0       UD_P_
211     2       0       UD_P_
220     2       0       UD_P_
282     2       0       UD_P_
283     2       0       UD_P_
324     2       0       UD_P_
436     2       0       UD_P_
439     2       0       UD_P_
548     2       0       UD_P_
553     2       0       UD_P_
682     2       0       UD_P_
934     2       0       UD_P_
redirty_tail line 539: inode 8856
inode 8856(sysfs/slaves) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 449514
global dirty 441343 writeback 91 nfs 0 flags C_ towrite 1012 skipped 0
redirty_tail line 539: inode 8866
inode 8866(sysfs/ram4) count 1,14 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 449514
global dirty 441343 writeback 91 nfs 0 flags __ towrite 1024 skipped 0
redirty_tail line 539: inode 8878
inode 8878(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 449732
global dirty 441484 writeback 0 nfs 0 flags __ towrite 1000 skipped 0
redirty_tail line 539: inode 8880
inode 8880(sysfs/holders) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 450062
global dirty 441814 writeback 0 nfs 0 flags __ towrite 1024 skipped 0
redirty_tail line 539: inode 8881
inode 8881(sysfs/slaves) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 450496
global dirty 442248 writeback 0 nfs 0 flags __ towrite 1014 skipped 0
redirty_tail line 539: inode 8891
inode 8891(sysfs/ram5) count 1,14 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 450655
global dirty 442407 writeback 0 nfs 0 flags __ towrite 1018 skipped 0
redirty_tail line 417: inode 5865497
inode 5865497(sdb1/tbltable5.MYI) count 2,1 size 925447168 pages 63374
0       2       0       UD_P_
1172    2       0       UD_P_
1187    2       0       UD_P_
1522    2       0       UD_P_
1567    2       0       UD_P_
1720    2       0       UD_P_
1760    2       0       UD_P_
1807    2       0       UD_P_
1811    2       0       UD_P_
2272    2       0       UD_P_
2377    2       0       UD_P_
2534    2       0       UD_P_
2647    2       0       UD_P_
2777    2       0       UD_P_
redirty_tail line 539: inode 8903
inode 8903(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 450809
global dirty 442608 writeback 91 nfs 0 flags C_ towrite 886 skipped 0
redirty_tail line 417: inode 5865479
inode 5865479(sdb1/tbltable5.MYD) count 2,1 size 507638658 pages 29338
0       2       0       UD_P_
1       2       0       UD_P_
2       2       0       UD_P_
3       2       0       UD_P_
4       2       0       UD_P_
8       2       0       UD_P_
12      2       0       UD_P_
16      2       0       UD_P_
22      2       0       UD_P_
26      2       0       UD_P_
32      2       0       UD_P_
33      2       0       UD_P_
40      2       0       UD_P_
41      2       0       UD_P_
redirty_tail line 539: inode 8905
inode 8905(sysfs/holders) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 450793
global dirty 442517 writeback 91 nfs 0 flags C_ towrite 1008 skipped 0
redirty_tail line 417: inode 5865477
inode 5865477(sdb1/tbltable1.MYI) count 2,1 size 154434560 pages 11815
0       2       0       UD_P_
68      2       0       UD_P_
106     2       0       UD_P_
131     2       0       UD_P_
185     2       0       UD_P_
244     2       0       UD_P_
252     2       0       UD_P_
331     2       0       UD_P_
393     2       0       UD_P_
394     2       0       UD_P_
395     2       0       UD_P_
400     2       0       UD_P_
482     2       0       UD_P_
781     2       0       UD_P_
redirty_tail line 539: inode 8906
inode 8906(sysfs/slaves) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 450781
global dirty 442517 writeback 182 nfs 0 flags C_ towrite 1012 skipped 0
redirty_tail line 539: inode 8916
inode 8916(sysfs/ram6) count 1,14 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 450781
global dirty 442517 writeback 182 nfs 0 flags __ towrite 1024 skipped 0
redirty_tail line 417: inode 5865495
inode 5865495(sdb1/tbltable4.MYI) count 2,1 size 16807370752 pages 198676
0       2       0       UD_P_
1340    2       0       UD_P_
1367    2       0       UD_P_
1417    2       0       UD_P_
3578    2       0       UD_P_
3745    2       0       UD_P_
3928    2       0       UD_P_
4116    2       0       UD_P_
4154    2       0       UD_P_
4207    2       0       UD_P_
4838    2       0       UD_P_
4839    2       0       UD_P_
4840    2       0       UD_P_
5600    2       0       UD_P_
redirty_tail line 539: inode 8928
inode 8928(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 450654
global dirty 442401 writeback 91 nfs 0 flags C_ towrite 644 skipped 0
redirty_tail line 417: inode 5865475
inode 5865475(sdb1/tbltable4.MYD) count 2,2 size 22227197236 pages 106846
373     2       0       UD_P_
1041    2       0       UD_P_
2134    2       0       UD_P_
3075    2       0       UD_P_
4563    2       0       UD_P_
5306    2       0       UD_P_
5426    2       0       UD_P_
5473    2       0       UD_P_
5657    2       0       UD_P_
5770    2       0       UD_P_
5983    2       0       UD_P_
7883    2       0       UD_P_
9061    2       0       UD_P_
9241    2       0       UD_P_
redirty_tail line 539: inode 8930
inode 8930(sysfs/holders) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 450641
global dirty 442401 writeback 91 nfs 0 flags C_ towrite 1011 skipped 0
redirty_tail line 417: inode 5865487
inode 5865487(sdb1/tbltable3.MYI) count 2,1 size 13179643904 pages 251935
0       2       0       UD_P_
65      2       0       UD_P_
109     2       0       UD_P_
195     2       0       UD_P_
200     2       0       UD_P_
368     2       0       UD_P_
473     2       0       UD_P_
481     2       0       UD_P_
530     2       0       UD_P_
547     2       0       UD_P_
695     2       0       UD_P_
735     2       0       UD_P_
751     2       0       UD_P_
970     2       0       UD_P_
redirty_tail line 539: inode 8931
inode 8931(sysfs/slaves) count 1,0 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 450629
global dirty 442401 writeback 91 nfs 0 flags C_ towrite 1012 skipped 0
redirty_tail line 417: inode 5865478
inode 5865478(sdb1/tbltable3.MYD) count 2,2 size 3868692045 pages 55323
159     2       0       UD_P_
195     2       0       UD_P_
207     2       0       UD_P_
211     2       0       UD_P_
220     2       0       UD_P_
282     2       0       UD_P_
283     2       0       UD_P_
324     2       0       UD_P_
436     2       0       UD_P_
439     2       0       UD_P_
548     2       0       UD_P_
553     2       0       UD_P_
682     2       0       UD_P_
934     2       0       UD_P_
redirty_tail line 539: inode 8941
inode 8941(sysfs/ram7) count 1,14 size 0 pages 0
mm/page-writeback.c 829 wb_kupdate: pdflush(361) 450617
global dirty 442401 writeback 91 nfs 0 flags C_ towrite 1012 skipped 0
redirty_tail line 539: inode 8953
inode 8953(sysfs/power) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 450617
global dirty 442401 writeback 91 nfs 0 flags __ towrite 1024 skipped 0
redirty_tail line 539: inode 8955
inode 8955(sysfs/holders) count 1,0 size 0 pages 0
mm/page-writeback.c 831 wb_kupdate: pdflush(361) 450881
global dirty 442633 writeback 0 nfs 0 flags __ towrite 977 skipped 0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Page Cache writeback too slow,   SSD/noop scheduler/ext2
       [not found]         ` <C5F2C492.D4A8%jos-vMeIAzyucXQ@public.gmane.org>
@ 2009-03-29  2:32           ` Wu Fengguang
  2009-03-30 16:47             ` Jos Houtman
  0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2009-03-29  2:32 UTC (permalink / raw)
  To: Jos Houtman
  Cc: Nick Piggin, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jeff Layton, Dave Chinner,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

[-- Attachment #1: Type: text/plain, Size: 3171 bytes --]

On Sat, Mar 28, 2009 at 12:59:43AM +0800, Jos Houtman wrote:
> Hi,
> 
> >> 
> >> kupdate surely should just continue to keep trying to write back pages
> >> so long as there are more old pages to clean, and the queue isn't
> >> congested. That seems to be the intention anyway: MAX_WRITEBACK_PAGES
> >> is just the number to write back in a single call, but you see
> >> nr_to_write is set to the number of dirty pages in the system.
> 
> And when it's congested it should just wait a little bit before continuing.
>  
> >> On your system, what must be happening is more_io is not being set.
> >> The logic in fs/fs-writeback.c might be busted.
> 
> I don't know about more_io, but I agree that the logic seems busted.
> 
> > 
> > Hi Jos,
> > 
> > I prepared a debugging patch for 2.6.28. (I cannot observe writeback
> > problems on my local ext2 mount.)
> 
> Thanx for the patch, but for the next time: How should I apply it?
> it seems to be context aware (@@) and broke on all kernel versions I tried
> 2.6.28/2.6.28.7/2.6.29

Do you mean that the patch applies after removing " @@.*$"?

To be safe, I created the patch with quilt as well as git, for 2.6.29.

> Because I saw the patch only a few hour ago and didn't want to block on your
> reply I decided to patch it manually and in the process ported it to 2.6.29.
> 
> As for the information the patch provided: It is most helpful.
> 
> Attached you will find a list of files containing dirty pages and the count
> of their dirty pages; there is also a dmesg output where I trace the
> writeback for 40 seconds.

They helped, thank you!

> I did some testing on my own using printk's and what I saw is that for the
> inodes located on sdb1 (the database) a lot of times they would pass
> http://lxr.linux.no/linux+v2.6.29/fs/fs-writeback.c#L335
> And then redirty_tail would be called, I haven't had the time to dig deeper,
> but that is my primary suspect for the moment.

You are right. In your case, there are several big dirty files in sdb1,
and the sdb write queue is constantly (almost-)congested. The SSD write
speed is so slow, that in each round of sdb1 writeback, it begins with
an uncongested queue, but quickly fills up after writing some pages.
Hence all the inodes will get redirtied because of (nr_to_write > 0).

The following quick fix should solve the slow-writeback-on-congested-SSD
problem. However the writeback sequence is suboptimal: it syncs and requeues
each file until congested (in your case about 3~600 pages) instead of
until MAX_WRITEBACK_PAGES=1024 pages.

A more complete fix would be turning MAX_WRITEBACK_PAGES into an exact
per-file limit. It has been sitting in my todo list for quite a while...

Thanks,
Fengguang

---
 fs/fs-writeback.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-				if (wbc->nr_to_write <= 0) {
+				if (wbc->nr_to_write <= 0 ||
+				    wbc->encountered_congestion) {
 					/*
 					 * slice used up: queue for next turn
 					 */

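[Editorial note: the effect of the one-line change above can be sketched with
a toy model. This is illustrative Python, not kernel code; the 400-page
congestion point is a hypothetical stand-in for the "about 3~600 pages"
observed in this thread.]

```python
# Toy model: why a congested queue plus (nr_to_write > 0) sends a big
# dirty inode to redirty_tail(), and what the quick fix changes.
MAX_WRITEBACK_PAGES = 1024  # per-call writeback slice
CONGEST_AFTER = 400         # hypothetical: pages until the SSD queue congests

def sync_inode(dirty_pages, nr_to_write, congestion_fix):
    """Write pages until the slice is used up or the queue congests.
    Returns (pages_written, requeue_for_next_turn)."""
    written = min(dirty_pages, nr_to_write, CONGEST_AFTER)
    slice_used_up = (nr_to_write - written) <= 0
    congested = (written == CONGEST_AFTER) and (written < dirty_pages)
    if congestion_fix:
        # quick fix: requeue for the next turn on congestion, too
        requeue = slice_used_up or congested
    else:
        # original logic: congestion with nr_to_write > 0 falls through
        # to redirty_tail(), deferring the inode ~30s into the future
        requeue = slice_used_up
    return written, requeue

# One writeback round against a 251935-page file (cf. tbltable3.MYI above):
print(sync_inode(251935, MAX_WRITEBACK_PAGES, congestion_fix=False))
# -> (400, False): only 400 pages written, inode redirtied, writeback stalls
print(sync_inode(251935, MAX_WRITEBACK_PAGES, congestion_fix=True))
# -> (400, True): same 400 pages, but the inode is retried promptly
```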
[-- Attachment #2: writeback-requeue-congestion-quickfix.patch --]
[-- Type: text/x-diff, Size: 486 bytes --]

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e3fe991..da5f88d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-				if (wbc->nr_to_write <= 0) {
+				if (wbc->nr_to_write <= 0 ||
+				    wbc->encountered_congestion) {
 					/*
 					 * slice used up: queue for next turn
 					 */

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Page Cache writeback too slow,   SSD/noop scheduler/ext2
  2009-03-29  2:32           ` Wu Fengguang
@ 2009-03-30 16:47             ` Jos Houtman
       [not found]               ` <C5F6B627.D9D0%jos-vMeIAzyucXQ@public.gmane.org>
  2009-03-31 12:16               ` Jos Houtman
  0 siblings, 2 replies; 8+ messages in thread
From: Jos Houtman @ 2009-03-30 16:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nick Piggin, linux-kernel, Jeff Layton, Dave Chinner,
	linux-fsdevel, jens.axboe, akpm, hch, linux-nfs


>> Thanx for the patch, but for the next time: How should I apply it?
>> it seems to be context aware (@@) and broke on all kernel versions I tried
>> 2.6.28/2.6.28.7/2.6.29
> 
> Do you mean that the patch applies after removing " @@.*$"?

I didn't try that, but this time it worked. So it was probably my error.

> 
> You are right. In your case, there are several big dirty files in sdb1,
> and the sdb write queue is constantly (almost-)congested. The SSD write
> speed is so slow, that in each round of sdb1 writeback, it begins with
> an uncongested queue, but quickly fills up after writing some pages.
> Hence all the inodes will get redirtied because of (nr_to_write > 0).
> 
> The following quick fix should solve the slow-writeback-on-congested-SSD
> problem. However the writeback sequence is suboptimal: it sync-and-requeue
> each file until congested (in your case about 3~600 pages) instead of
> until MAX_WRITEBACK_PAGES=1024 pages.

Yeah that fixed it, but performance dropped due to the more constant
congestion. So I will need to try some different io schedulers.


Next to that I was wondering if there are any plans to make sure that not
all dirty-files are written back in the same interval.

In my case all database files are written back each 30 seconds, while I
would prefer them to be more divided over the interval.

Thanks,

Jos


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Page Cache writeback too slow,   SSD/noop scheduler/ext2
       [not found]               ` <C5F6B627.D9D0%jos-vMeIAzyucXQ@public.gmane.org>
@ 2009-03-31  0:28                 ` Wu Fengguang
  0 siblings, 0 replies; 8+ messages in thread
From: Wu Fengguang @ 2009-03-31  0:28 UTC (permalink / raw)
  To: Jos Houtman
  Cc: Nick Piggin, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jeff Layton, Dave Chinner,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, Mar 31, 2009 at 12:47:19AM +0800, Jos Houtman wrote:
> 
> >> Thanx for the patch, but for the next time: How should I apply it?
> >> it seems to be context aware (@@) and broke on all kernel versions I tried
> >> 2.6.28/2.6.28.7/2.6.29
> > 
> > Do you mean that the patch applies after removing " @@.*$"?
> 
> I didn't try that, but this time it worked. So it was probably my error.
> 
> > 
> > You are right. In your case, there are several big dirty files in sdb1,
> > and the sdb write queue is constantly (almost-)congested. The SSD write
> > speed is so slow, that in each round of sdb1 writeback, it begins with
> > an uncongested queue, but quickly fills up after writing some pages.
> > Hence all the inodes will get redirtied because of (nr_to_write > 0).
> > 
> > The following quick fix should solve the slow-writeback-on-congested-SSD
> > problem. However the writeback sequence is suboptimal: it sync-and-requeue
> > each file until congested (in your case about 3~600 pages) instead of
> > until MAX_WRITEBACK_PAGES=1024 pages.
> 
> Yeah that fixed it, but performance dropped due to the more constant
> congestion. So I will need to try some different io schedulers.

Read performance or write performance?

> Next to that I was wondering if there are any plans to make sure that not
> all dirty-files are written back in the same interval.
> 
> In my case all database files are written back each 30 seconds, while I
> would prefer them to be more divided over the interval.

pdflush will wake up every 5s to sync files dirtied more than 30s ago.
So the writeback of inodes should be distributed (somewhat randomly)
into these 5s-interval wakeups due to varied dirty times.

However the distribution may well be uneven in many cases. There seem to
be conflicting goals for HDD and SSD: one favors somewhat small bursty
writeback, the other favors smooth writeback streams. I guess the better
scheme would be bursty pdflush writebacks plus IO-scheduler-level QoS.
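
[Editorial note: the 5s-wakeup expiry behaviour described above can be
sketched as follows. Illustrative Python only; the intervals are the
default dirty_writeback/dirty_expire values discussed in the thread, not
read from a running kernel.]

```python
# Toy model: pdflush wakes every 5s and syncs inodes dirtied >30s ago,
# so each inode lands in a wakeup "bucket" determined by its dirty time.
WAKEUP_INTERVAL = 5   # seconds between pdflush wakeups
EXPIRE_AGE = 30       # age at which a dirty inode becomes eligible

def wakeup_slot(dirtied_at):
    """First wakeup time at which the inode is old enough to be synced."""
    earliest = dirtied_at + EXPIRE_AGE
    # round up to the next multiple of WAKEUP_INTERVAL
    return -(-earliest // WAKEUP_INTERVAL) * WAKEUP_INTERVAL

# Inodes dirtied at varied times spread over different wakeups:
print([wakeup_slot(t) for t in [0, 2, 7, 11, 23]])  # [30, 35, 40, 45, 55]

# ...but files all dirtied together (e.g. database tables touched by the
# same workload) expire in the same wakeup -- the clustering Jos observes:
print([wakeup_slot(t) for t in [0, 0, 0]])          # [30, 30, 30]
```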

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Page Cache writeback too slow,   SSD/noop scheduler/ext2
  2009-03-30 16:47             ` Jos Houtman
       [not found]               ` <C5F6B627.D9D0%jos-vMeIAzyucXQ@public.gmane.org>
@ 2009-03-31 12:16               ` Jos Houtman
       [not found]                 ` <C5F7D654.DE6F%jos-vMeIAzyucXQ@public.gmane.org>
  1 sibling, 1 reply; 8+ messages in thread
From: Jos Houtman @ 2009-03-31 12:16 UTC (permalink / raw)
  To: Jos Houtman, Wu Fengguang
  Cc: Nick Piggin, linux-kernel, Jeff Layton, Dave Chinner,
	linux-fsdevel, jens.axboe, akpm, hch, linux-nfs

> 
> Next to that I was wondering if there are any plans to make sure that not
> all dirty-files are written back in the same interval.
> 
> In my case all database files are written back each 30 seconds, while I
> would prefer them to be more divided over the interval.

> There's another question I have: does the writeback go through the io
scheduler? Because no matter the io scheduler or the tuning done, the
writeback algorithm totally starves the reads.

See the url below for an example with CFQ, but deadline or noop all show
this behaviour:
> http://94.100.113.33/535450001-535500000/535451701-535451800/535451800_6_L7gt.jpeg

Is there anything I can do about this behaviour by creating a better
interleaving of the reads and writes?

Jos


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Page Cache writeback too slow,   SSD/noop scheduler/ext2
       [not found]                 ` <C5F7D654.DE6F%jos-vMeIAzyucXQ@public.gmane.org>
@ 2009-03-31 12:31                   ` Wu Fengguang
  2009-03-31 14:10                     ` Jos Houtman
  0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2009-03-31 12:31 UTC (permalink / raw)
  To: Jos Houtman
  Cc: Nick Piggin, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jeff Layton, Dave Chinner,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, Mar 31, 2009 at 08:16:52PM +0800, Jos Houtman wrote:
> > 
> > Next to that I was wondering if there are any plans to make sure that not
> > all dirty-files are written back in the same interval.
> > 
> > In my case all database files are written back each 30 seconds, while I
> > would prefer them to be more divided over the interval.
> 
> There another question I have: does the writeback go through the io
> scheduler? Because no matter the io scheduler or the tuning done, the
> writeback algorithm totally starves the reads.

I noticed this annoying writes-starve-reads problem too. I'll look into it.

> See the url below for an example with CFQ, but deadline or noop all show
> this behaviour:
> http://94.100.113.33/535450001-535500000/535451701-535451800/535451800_6_L7gt.jpeg
> 
> Is there anything I can do about this behaviour by creating a better
> interleaving of the reads and writes?

I guess it should be handled in the generic block io layer.  Once we
solved the writes-starve-reads problem, the bursty-writeback behavior
becomes a no-problem for SSD.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Page Cache writeback too slow,   SSD/noop scheduler/ext2
  2009-03-31 12:31                   ` Wu Fengguang
@ 2009-03-31 14:10                     ` Jos Houtman
  0 siblings, 0 replies; 8+ messages in thread
From: Jos Houtman @ 2009-03-31 14:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nick Piggin, linux-kernel, Jeff Layton, Dave Chinner,
	linux-fsdevel, jens.axboe, akpm, hch, linux-nfs


>> There another question I have: does the writeback go through the io
>> scheduler? Because no matter the io scheduler or the tuning done, the
>> writeback algorithm totally starves the reads.
> 
> I noticed this annoying writes-starve-reads problem too. I'll look into it.

Thanks
 
> 
>> Is there anything I can do about this behaviour by creating a better
>> interleaving of the reads and writes?
> 
> I guess it should be handled in the generic block io layer.  Once we
> solved the writes-starve-reads problem, the bursty-writeback behavior
> becomes a no-problem for SSD.

Yeah this was the part where I figured the io-schedulers kicked in, but
obviously I was wrong :P.

If I can do anything more to help this along, let me know.


Thanks

Jos




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-03-31 14:10 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <C5EC2B99.C3B3%jos@hyves.nl>
     [not found] ` <200903250148.53644.nickpiggin@yahoo.com.au>
     [not found]   ` <200903250148.53644.nickpiggin-/E1597aS9LT0CCvOHzKKcA@public.gmane.org>
2009-03-25  5:26     ` Page Cache writeback too slow, SSD/noop scheduler/ext2 Wu Fengguang
2009-03-27 16:59       ` Jos Houtman
     [not found]         ` <C5F2C492.D4A8%jos-vMeIAzyucXQ@public.gmane.org>
2009-03-29  2:32           ` Wu Fengguang
2009-03-30 16:47             ` Jos Houtman
     [not found]               ` <C5F6B627.D9D0%jos-vMeIAzyucXQ@public.gmane.org>
2009-03-31  0:28                 ` Wu Fengguang
2009-03-31 12:16               ` Jos Houtman
     [not found]                 ` <C5F7D654.DE6F%jos-vMeIAzyucXQ@public.gmane.org>
2009-03-31 12:31                   ` Wu Fengguang
2009-03-31 14:10                     ` Jos Houtman
