* [PATCH 00/12] per device dirty throttling -v3
@ 2007-04-05 17:42 root
2007-04-05 17:42 ` [PATCH 01/12] nfs: remove congestion_end() root
` (12 more replies)
0 siblings, 13 replies; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
Against 2.6.21-rc5-mm4 without:
per-backing_dev-dirty-and-writeback-page-accounting.patch
This series implements BDI independent dirty limits and congestion control.
This should solve several problems we currently have in this area:
- mutual interference starvation (for any number of BDIs), and
- deadlocks with stacked BDIs (loop and FUSE).
All the fancy new congestion code has been compile- and boot-tested, but
not much more. I'm posting to get feedback on the ideas.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* [PATCH 01/12] nfs: remove congestion_end()
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
@ 2007-04-05 17:42 ` root
2007-04-05 17:42 ` [PATCH 02/12] mm: scalable bdi statistics counters root
` (11 subsequent siblings)
12 siblings, 0 replies; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: nfs_congestion_fixup.patch --]
[-- Type: text/plain, Size: 2439 bytes --]
It's redundant; clear_bdi_congested() already wakes the waiters.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/write.c | 4 +---
include/linux/backing-dev.h | 1 -
mm/backing-dev.c | 13 -------------
3 files changed, 1 insertion(+), 17 deletions(-)
Index: linux-2.6-mm/fs/nfs/write.c
===================================================================
--- linux-2.6-mm.orig/fs/nfs/write.c 2007-04-05 16:24:50.000000000 +0200
+++ linux-2.6-mm/fs/nfs/write.c 2007-04-05 16:25:04.000000000 +0200
@@ -235,10 +235,8 @@ static void nfs_end_page_writeback(struc
struct nfs_server *nfss = NFS_SERVER(inode);
end_page_writeback(page);
- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+ if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
clear_bdi_congested(&nfss->backing_dev_info, WRITE);
- congestion_end(WRITE);
- }
}
/*
Index: linux-2.6-mm/include/linux/backing-dev.h
===================================================================
--- linux-2.6-mm.orig/include/linux/backing-dev.h 2007-04-05 16:24:50.000000000 +0200
+++ linux-2.6-mm/include/linux/backing-dev.h 2007-04-05 16:25:08.000000000 +0200
@@ -96,7 +96,6 @@ void clear_bdi_congested(struct backing_
void set_bdi_congested(struct backing_dev_info *bdi, int rw);
long congestion_wait(int rw, long timeout);
long congestion_wait_interruptible(int rw, long timeout);
-void congestion_end(int rw);
#define bdi_cap_writeback_dirty(bdi) \
(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6-mm/mm/backing-dev.c
===================================================================
--- linux-2.6-mm.orig/mm/backing-dev.c 2007-04-05 16:24:50.000000000 +0200
+++ linux-2.6-mm/mm/backing-dev.c 2007-04-05 16:25:16.000000000 +0200
@@ -70,16 +70,3 @@ long congestion_wait_interruptible(int r
return ret;
}
EXPORT_SYMBOL(congestion_wait_interruptible);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
- wait_queue_head_t *wqh = &congestion_wqh[rw];
-
- if (waitqueue_active(wqh))
- wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);
--
* [PATCH 02/12] mm: scalable bdi statistics counters.
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
2007-04-05 17:42 ` [PATCH 01/12] nfs: remove congestion_end() root
@ 2007-04-05 17:42 ` root
2007-04-05 22:37 ` Andrew Morton
2007-04-05 17:42 ` [PATCH 03/12] mm: count dirty pages per BDI root
` (10 subsequent siblings)
12 siblings, 1 reply; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: bdi_stat.patch --]
[-- Type: text/plain, Size: 10066 bytes --]
Provide scalable per backing_dev_info statistics counters modeled on the ZVC
code.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
block/ll_rw_blk.c | 1
drivers/block/rd.c | 2
drivers/char/mem.c | 2
fs/char_dev.c | 1
fs/fuse/inode.c | 1
fs/nfs/client.c | 1
include/linux/backing-dev.h | 98 +++++++++++++++++++++++++++++++++++++++++
mm/backing-dev.c | 103 ++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 209 insertions(+)
Index: linux-2.6-mm/block/ll_rw_blk.c
===================================================================
--- linux-2.6-mm.orig/block/ll_rw_blk.c 2007-04-05 16:39:56.000000000 +0200
+++ linux-2.6-mm/block/ll_rw_blk.c 2007-04-05 16:40:45.000000000 +0200
@@ -208,6 +208,7 @@ void blk_queue_make_request(request_queu
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
q->make_request_fn = mfn;
+ bdi_init(&q->backing_dev_info);
blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
blk_queue_hardsect_size(q, 512);
blk_queue_dma_alignment(q, 511);
Index: linux-2.6-mm/include/linux/backing-dev.h
===================================================================
--- linux-2.6-mm.orig/include/linux/backing-dev.h 2007-04-05 16:40:41.000000000 +0200
+++ linux-2.6-mm/include/linux/backing-dev.h 2007-04-05 16:40:45.000000000 +0200
@@ -8,6 +8,7 @@
#ifndef _LINUX_BACKING_DEV_H
#define _LINUX_BACKING_DEV_H
+#include <linux/spinlock.h>
#include <asm/atomic.h>
struct page;
@@ -22,6 +23,17 @@ enum bdi_state {
BDI_unused, /* Available bits start here */
};
+enum bdi_stat_item {
+ NR_BDI_STAT_ITEMS
+};
+
+#ifdef CONFIG_SMP
+struct bdi_per_cpu_data {
+ s8 stat_threshold;
+ s8 bdi_stat_diff[NR_BDI_STAT_ITEMS];
+} ____cacheline_aligned_in_smp;
+#endif
+
typedef int (congested_fn)(void *, int);
struct backing_dev_info {
@@ -34,8 +46,94 @@ struct backing_dev_info {
void *congested_data; /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+
+ atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+#ifdef CONFIG_SMP
+ struct bdi_per_cpu_data pcd[NR_CPUS];
+#endif
};
+extern atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+
+static inline void bdi_stat_add(long x, struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ atomic_long_add(x, &bdi->bdi_stats[item]);
+ atomic_long_add(x, &bdi_stats[item]);
+}
+
+static inline unsigned long __global_bdi_stat(enum bdi_stat_item item)
+{
+ return atomic_long_read(&bdi_stats[item]);
+}
+
+static inline unsigned long __bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ return atomic_long_read(&bdi->bdi_stats[item]);
+}
+
+/*
+ * The sum can be transiently negative: read it signed and clip at 0.
+ */
+static inline unsigned long global_bdi_stat(enum bdi_stat_item item)
+{
+ long x = atomic_long_read(&bdi_stats[item]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
+static inline unsigned long bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ long x = atomic_long_read(&bdi->bdi_stats[item]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
+#ifdef CONFIG_SMP
+void __mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int delta);
+void __inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+void __dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+
+void mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int delta);
+void inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+
+#else /* CONFIG_SMP */
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, int delta)
+{
+ bdi_stat_add(delta, bdi, item);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ atomic_long_inc(&bdi->bdi_stats[item]);
+ atomic_long_inc(&bdi_stats[item]);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ atomic_long_dec(&bdi->bdi_stats[item]);
+ atomic_long_dec(&bdi_stats[item]);
+}
+
+#define mod_bdi_stat __mod_bdi_stat
+#define inc_bdi_stat __inc_bdi_stat
+#define dec_bdi_stat __dec_bdi_stat
+#endif
+
+void bdi_init(struct backing_dev_info *bdi);
/*
* Flags in backing_dev_info::capability
Index: linux-2.6-mm/mm/backing-dev.c
===================================================================
--- linux-2.6-mm.orig/mm/backing-dev.c 2007-04-05 16:40:41.000000000 +0200
+++ linux-2.6-mm/mm/backing-dev.c 2007-04-05 16:42:37.000000000 +0200
@@ -70,3 +70,106 @@ long congestion_wait_interruptible(int r
return ret;
}
EXPORT_SYMBOL(congestion_wait_interruptible);
+
+atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+EXPORT_SYMBOL(bdi_stats);
+
+void bdi_init(struct backing_dev_info *bdi)
+{
+ int i;
+
+ for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+ atomic_long_set(&bdi->bdi_stats[i], 0);
+
+#ifdef CONFIG_SMP
+ for (i = 0; i < NR_CPUS; i++) {
+ int j;
+ for (j = 0; j < NR_BDI_STAT_ITEMS; j++)
+ bdi->pcd[i].bdi_stat_diff[j] = 0;
+ bdi->pcd[i].stat_threshold = 8 * ilog2(num_online_cpus());
+ }
+#endif
+}
+EXPORT_SYMBOL(bdi_init);
+
+#ifdef CONFIG_SMP
+void __mod_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, int delta)
+{
+ struct bdi_per_cpu_data *pcd = &bdi->pcd[smp_processor_id()];
+ s8 *p = pcd->bdi_stat_diff + item;
+ long x;
+
+ x = delta + *p;
+
+ if (unlikely(x > pcd->stat_threshold || x < -pcd->stat_threshold)) {
+ bdi_stat_add(x, bdi, item);
+ x = 0;
+ }
+ *p = x;
+}
+EXPORT_SYMBOL(__mod_bdi_stat);
+
+void mod_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, int delta)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __mod_bdi_stat(bdi, item, delta);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(mod_bdi_stat);
+
+void __inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item)
+{
+ struct bdi_per_cpu_data *pcd = &bdi->pcd[smp_processor_id()];
+ s8 *p = pcd->bdi_stat_diff + item;
+
+ (*p)++;
+
+ if (unlikely(*p > pcd->stat_threshold)) {
+ int overstep = pcd->stat_threshold / 2;
+
+ bdi_stat_add(*p + overstep, bdi, item);
+ *p = -overstep;
+ }
+}
+EXPORT_SYMBOL(__inc_bdi_stat);
+
+void inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __inc_bdi_stat(bdi, item);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_bdi_stat);
+
+void __dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item)
+{
+ struct bdi_per_cpu_data *pcd = &bdi->pcd[smp_processor_id()];
+ s8 *p = pcd->bdi_stat_diff + item;
+
+ (*p)--;
+
+ if (unlikely(*p < -pcd->stat_threshold)) {
+ int overstep = pcd->stat_threshold / 2;
+
+ bdi_stat_add(*p - overstep, bdi, item);
+ *p = overstep;
+ }
+}
+EXPORT_SYMBOL(__dec_bdi_stat);
+
+void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __dec_bdi_stat(bdi, item);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(dec_bdi_stat);
+#endif
Index: linux-2.6-mm/drivers/block/rd.c
===================================================================
--- linux-2.6-mm.orig/drivers/block/rd.c 2007-04-05 16:39:56.000000000 +0200
+++ linux-2.6-mm/drivers/block/rd.c 2007-04-05 16:40:45.000000000 +0200
@@ -421,6 +421,8 @@ static int __init rd_init(void)
int i;
int err = -ENOMEM;
+ bdi_init(&rd_file_backing_dev_info);
+
if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
(rd_blocksize & (rd_blocksize-1))) {
printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
Index: linux-2.6-mm/drivers/char/mem.c
===================================================================
--- linux-2.6-mm.orig/drivers/char/mem.c 2007-04-05 16:39:56.000000000 +0200
+++ linux-2.6-mm/drivers/char/mem.c 2007-04-05 16:40:45.000000000 +0200
@@ -987,6 +987,8 @@ static int __init chr_dev_init(void)
MKDEV(MEM_MAJOR, devlist[i].minor),
devlist[i].name);
+ bdi_init(&zero_bdi);
+
return 0;
}
Index: linux-2.6-mm/fs/char_dev.c
===================================================================
--- linux-2.6-mm.orig/fs/char_dev.c 2007-04-05 16:39:56.000000000 +0200
+++ linux-2.6-mm/fs/char_dev.c 2007-04-05 16:40:45.000000000 +0200
@@ -548,6 +548,7 @@ static struct kobject *base_probe(dev_t
void __init chrdev_init(void)
{
cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+ bdi_init(&directly_mappable_cdev_bdi);
}
Index: linux-2.6-mm/fs/fuse/inode.c
===================================================================
--- linux-2.6-mm.orig/fs/fuse/inode.c 2007-04-05 16:39:56.000000000 +0200
+++ linux-2.6-mm/fs/fuse/inode.c 2007-04-05 16:40:45.000000000 +0200
@@ -413,6 +413,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(&fc->num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
+ bdi_init(&fc->bdi);
fc->reqctr = 0;
fc->blocked = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
Index: linux-2.6-mm/fs/nfs/client.c
===================================================================
--- linux-2.6-mm.orig/fs/nfs/client.c 2007-04-05 16:39:56.000000000 +0200
+++ linux-2.6-mm/fs/nfs/client.c 2007-04-05 16:40:45.000000000 +0200
@@ -661,6 +661,7 @@ static void nfs_server_set_fsinfo(struct
server->backing_dev_info.ra_pages0 = min_t(unsigned, server->rpages,
VM_MIN_READAHEAD >> (PAGE_CACHE_SHIFT - 10));
server->backing_dev_info.ra_thrash_bytes = server->rsize * NFS_MAX_READAHEAD;
+ bdi_init(&server->backing_dev_info);
if (server->wsize > max_rpc_payload)
server->wsize = max_rpc_payload;
--
* [PATCH 03/12] mm: count dirty pages per BDI
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
2007-04-05 17:42 ` [PATCH 01/12] nfs: remove congestion_end() root
2007-04-05 17:42 ` [PATCH 02/12] mm: scalable bdi statistics counters root
@ 2007-04-05 17:42 ` root
2007-04-05 17:42 ` [PATCH 04/12] mm: count writeback " root
` (9 subsequent siblings)
12 siblings, 0 replies; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: bdi_stat_dirty.patch --]
[-- Type: text/plain, Size: 2580 bytes --]
Count per BDI dirty pages.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/buffer.c | 1 +
include/linux/backing-dev.h | 1 +
mm/page-writeback.c | 2 ++
mm/truncate.c | 1 +
4 files changed, 5 insertions(+)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -740,6 +740,7 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -828,6 +828,7 @@ int __set_page_dirty_nobuffers(struct pa
BUG_ON(mapping2 != mapping);
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
@@ -961,6 +962,7 @@ int clear_page_dirty_for_io(struct page
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+ dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
return 1;
}
return 0;
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -71,6 +71,7 @@ void cancel_dirty_page(struct page *page
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+ dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
if (account_size)
task_io_account_cancelled_write(account_size);
}
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -24,6 +24,7 @@ enum bdi_state {
};
enum bdi_stat_item {
+ BDI_DIRTY,
NR_BDI_STAT_ITEMS
};
--
* [PATCH 04/12] mm: count writeback pages per BDI
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (2 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 03/12] mm: count dirty pages per BDI root
@ 2007-04-05 17:42 ` root
2007-04-05 17:42 ` [PATCH 05/12] mm: count unstable " root
` (8 subsequent siblings)
12 siblings, 0 replies; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: bdi_stat_writeback.patch --]
[-- Type: text/plain, Size: 1856 bytes --]
Count per BDI writeback pages.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/backing-dev.h | 1 +
mm/page-writeback.c | 8 ++++++--
2 files changed, 7 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -981,10 +981,12 @@ int test_clear_page_writeback(struct pag
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
- if (ret)
+ if (ret) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+ __dec_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ }
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
@@ -1004,10 +1006,12 @@ int test_set_page_writeback(struct page
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
- if (!ret)
+ if (!ret) {
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+ __inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ }
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -25,6 +25,7 @@ enum bdi_state {
enum bdi_stat_item {
BDI_DIRTY,
+ BDI_WRITEBACK,
NR_BDI_STAT_ITEMS
};
--
* [PATCH 05/12] mm: count unstable pages per BDI
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (3 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 04/12] mm: count writeback " root
@ 2007-04-05 17:42 ` root
2007-04-05 17:42 ` [PATCH 06/12] mm: expose BDI statistics in sysfs root
` (7 subsequent siblings)
12 siblings, 0 replies; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: bdi_stat_unstable.patch --]
[-- Type: text/plain, Size: 2244 bytes --]
Count per BDI unstable pages.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/write.c | 4 ++++
include/linux/backing-dev.h | 1 +
2 files changed, 5 insertions(+)
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -451,6 +451,7 @@ nfs_mark_request_commit(struct nfs_page
nfsi->ncommit++;
spin_unlock(&nfsi->req_lock);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+ inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}
#endif
@@ -511,6 +512,7 @@ static void nfs_cancel_commit_list(struc
while(!list_empty(head)) {
req = nfs_list_entry(head->next);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+ dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
nfs_list_remove_request(req);
nfs_inode_remove_request(req);
nfs_unlock_request(req);
@@ -1236,6 +1238,7 @@ nfs_commit_list(struct inode *inode, str
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+ dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
nfs_clear_page_writeback(req);
}
return -ENOMEM;
@@ -1260,6 +1263,7 @@ static void nfs_commit_done(struct rpc_t
req = nfs_list_entry(data->pages.next);
nfs_list_remove_request(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+ dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
dprintk("NFS: commit (%s/%Ld %d@%Ld)",
req->wb_context->dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -26,6 +26,7 @@ enum bdi_state {
enum bdi_stat_item {
BDI_DIRTY,
BDI_WRITEBACK,
+ BDI_UNSTABLE,
NR_BDI_STAT_ITEMS
};
--
* [PATCH 06/12] mm: expose BDI statistics in sysfs.
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (4 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 05/12] mm: count unstable " root
@ 2007-04-05 17:42 ` root
2007-04-05 17:42 ` [PATCH 07/12] mm: per device dirty threshold root
` (6 subsequent siblings)
12 siblings, 0 replies; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: bdi_stat_sysfs.patch --]
[-- Type: text/plain, Size: 2406 bytes --]
Expose the per BDI stats in /sys/block/<dev>/queue/*
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
block/ll_rw_blk.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 2 -
2 files changed, 82 insertions(+), 1 deletion(-)
Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -3975,6 +3975,20 @@ static ssize_t queue_max_hw_sectors_show
return queue_var_show(max_hw_sectors_kb, (page));
}
+static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_DIRTY));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_WRITEBACK));
+}
+
+static ssize_t queue_nr_unstable_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_UNSTABLE));
+}
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4005,6 +4019,21 @@ static struct queue_sysfs_entry queue_ma
.show = queue_max_hw_sectors_show,
};
+static struct queue_sysfs_entry queue_dirty_entry = {
+ .attr = {.name = "dirty_pages", .mode = S_IRUGO },
+ .show = queue_nr_dirty_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+ .attr = {.name = "writeback_pages", .mode = S_IRUGO },
+ .show = queue_nr_writeback_show,
+};
+
+static struct queue_sysfs_entry queue_unstable_entry = {
+ .attr = {.name = "unstable_pages", .mode = S_IRUGO },
+ .show = queue_nr_unstable_show,
+};
+
static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -4017,6 +4046,9 @@ static struct attribute *default_attrs[]
&queue_initial_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
+ &queue_dirty_entry.attr,
+ &queue_writeback_entry.attr,
+ &queue_unstable_entry.attr,
&queue_iosched_entry.attr,
NULL,
};
--
* [PATCH 07/12] mm: per device dirty threshold
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (5 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 06/12] mm: expose BDI statistics in sysfs root
@ 2007-04-05 17:42 ` root
2007-04-05 17:42 ` [PATCH 08/12] mm: fixup possible deadlock root
` (5 subsequent siblings)
12 siblings, 0 replies; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: writeback-balance-per-backing_dev.patch --]
[-- Type: text/plain, Size: 11069 bytes --]
Scale writeback cache per backing device, proportional to its writeout speed.
akpm sayeth:
> Which problem are we trying to solve here? afaik our two uppermost
> problems are:
>
> a) Heavy write to queue A causes light writer to queue B to block for a long
> time in balance_dirty_pages(). Even if the devices have the same speed.
This one; especially when the devices do not have the same speed: the
'my USB stick makes my computer suck' problem. But even at similar
speeds, the separation of devices should keep dev B from blocking when
dev A is being throttled.
The writeout speed is measured dynamically, so when a device has had
nothing to write out for a while, its writeback cache size drops to 0.
Conversely, a device starting up will initially behave almost
synchronously, but will quickly build up a 'fair' share of the
writeback cache.
> b) heavy write to device A causes light write to device A to block for a
> long time in balance_dirty_pages(), occasionally. Harder to fix.
This will indeed take more work. I've thought about it, though; one
quickly ends up with per-task state.
How it all works:
We pick a 2^n value, based on the total VM size, to act as a period:
vm_cycle_shift. This period measures 'time' in writeout events.
Each writeout increases time and adds to a per-BDI counter. This counter is
halved when a period expires. So per-BDI speed is:
0.5 * (previous cycle's speed) + this cycle's events.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/backing-dev.h | 8 ++
mm/backing-dev.c | 3
mm/page-writeback.c | 166 +++++++++++++++++++++++++++++++++++---------
3 files changed, 145 insertions(+), 32 deletions(-)
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -27,6 +27,8 @@ enum bdi_stat_item {
BDI_DIRTY,
BDI_WRITEBACK,
BDI_UNSTABLE,
+ BDI_WRITEOUT,
+ BDI_WRITEOUT_TOTAL,
NR_BDI_STAT_ITEMS
};
@@ -50,6 +52,12 @@ struct backing_dev_info {
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+ /*
+ * data used for scaling the writeback cache
+ */
+ spinlock_t lock; /* protect the cycle count */
+ unsigned long cycles; /* writeout cycles */
+
atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
#ifdef CONFIG_SMP
struct bdi_per_cpu_data pcd[NR_CPUS];
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -49,8 +49,6 @@
*/
static long ratelimit_pages = 32;
-static int dirty_exceeded __cacheline_aligned_in_smp; /* Dirty mem may be over limit */
-
/*
* When balance_dirty_pages decides that the caller needs to perform some
* non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +101,87 @@ EXPORT_SYMBOL(laptop_mode);
static void background_writeout(unsigned long _min_pages);
/*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by tracking a floating average per BDI and a global floating
+ * average. We optimize away the '/= 2' for the global average by noting that:
+ *
+ * if (++i > thresh) i /= 2:
+ *
+ * Can be approximated by:
+ *
+ * thresh/2 + (++i % thresh/2)
+ *
+ * Furthermore, when we choose thresh to be 2^n it can be written in terms of
+ * binary operations and wraparound artifacts disappear.
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ * i / thresh
+ *
+ * Its monotonically increasing property can be applied to mitigate the
+ * wraparound issue.
+ */
+static int vm_cycle_shift __read_mostly;
+
+/*
+ * Sync up the per BDI average to the global cycle.
+ */
+static void bdi_writeout_norm(struct backing_dev_info *bdi)
+{
+ int bits = vm_cycle_shift;
+ unsigned long cycle = 1UL << bits;
+ unsigned long mask = ~(cycle - 1);
+ unsigned long global_cycle =
+ (__global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1) & mask;
+ unsigned long flags;
+
+ if ((bdi->cycles & mask) == global_cycle)
+ return;
+
+ spin_lock_irqsave(&bdi->lock, flags);
+ while ((bdi->cycles & mask) != global_cycle) {
+ unsigned long val = __bdi_stat(bdi, BDI_WRITEOUT);
+ unsigned long half = (val + 1) >> 1;
+
+ if (!val)
+ break;
+
+ mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
+ bdi->cycles += cycle;
+ }
+ bdi->cycles = global_cycle;
+ spin_unlock_irqrestore(&bdi->lock, flags);
+}
+
+static void bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+ if (!bdi_cap_writeback_dirty(bdi))
+ return;
+
+ bdi_writeout_norm(bdi);
+
+ __inc_bdi_stat(bdi, BDI_WRITEOUT);
+ __inc_bdi_stat(bdi, BDI_WRITEOUT_TOTAL);
+}
+
+void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)
+{
+ int bits = vm_cycle_shift - 1;
+ unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL);
+ unsigned long cycle = 1UL << bits;
+ unsigned long mask = cycle - 1;
+
+ if (bdi_cap_writeback_dirty(bdi)) {
+ bdi_writeout_norm(bdi);
+ *scale = __bdi_stat(bdi, BDI_WRITEOUT);
+ } else
+ *scale = 0;
+
+ *div = cycle + (total & mask);
+}
+
+/*
* Work out the current dirty-memory clamping and background writeout
* thresholds.
*
@@ -158,8 +237,8 @@ static unsigned long determine_dirtyable
}
static void
-get_dirty_limits(long *pbackground, long *pdirty,
- struct address_space *mapping)
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+ struct backing_dev_info *bdi)
{
int background_ratio; /* Percentages */
int dirty_ratio;
@@ -193,6 +272,31 @@ get_dirty_limits(long *pbackground, long
}
*pbackground = background;
*pdirty = dirty;
+
+ if (bdi) {
+ long long tmp = dirty;
+ long reserve;
+ int scale, div;
+
+ get_writeout_scale(bdi, &scale, &div);
+
+ tmp *= scale;
+ do_div(tmp, div);
+
+ reserve = dirty -
+ (global_bdi_stat(BDI_DIRTY) +
+ global_bdi_stat(BDI_WRITEBACK) +
+ global_bdi_stat(BDI_UNSTABLE));
+
+ if (reserve < 0)
+ reserve = 0;
+
+ reserve += bdi_stat(bdi, BDI_DIRTY) +
+ bdi_stat(bdi, BDI_WRITEBACK) +
+ bdi_stat(bdi, BDI_UNSTABLE);
+
+ *pbdi_dirty = min((long)tmp, reserve);
+ }
}
/*
@@ -204,9 +308,10 @@ get_dirty_limits(long *pbackground, long
*/
static void balance_dirty_pages(struct address_space *mapping)
{
- long nr_reclaimable;
+ long bdi_nr_reclaimable;
long background_thresh;
long dirty_thresh;
+ long bdi_thresh;
unsigned long pages_written = 0;
unsigned long write_chunk = sync_writeback_pages();
@@ -221,32 +326,31 @@ static void balance_dirty_pages(struct a
.range_cyclic = 1,
};
- get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
- dirty_thresh)
+ get_dirty_limits(&background_thresh, &dirty_thresh,
+ &bdi_thresh, bdi);
+ bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
+ bdi_stat(bdi, BDI_UNSTABLE);
+ if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
+ bdi_thresh)
break;
- if (!dirty_exceeded)
- dirty_exceeded = 1;
-
/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
* Unstable writes are a feature of certain networked
* filesystems (i.e. NFS) in which data may have been
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
*/
- if (nr_reclaimable) {
+ if (bdi_nr_reclaimable) {
writeback_inodes(&wbc);
- get_dirty_limits(&background_thresh,
- &dirty_thresh, mapping);
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- if (nr_reclaimable +
- global_page_state(NR_WRITEBACK)
- <= dirty_thresh)
- break;
+
+ get_dirty_limits(&background_thresh, &dirty_thresh,
+ &bdi_thresh, bdi);
+ bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
+ bdi_stat(bdi, BDI_UNSTABLE);
+ if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
+ bdi_thresh)
+ break;
+
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break; /* We've done our duty */
@@ -254,10 +358,6 @@ static void balance_dirty_pages(struct a
congestion_wait(WRITE, HZ/10);
}
- if (nr_reclaimable + global_page_state(NR_WRITEBACK)
- <= dirty_thresh && dirty_exceeded)
- dirty_exceeded = 0;
-
if (writeback_in_progress(bdi))
return; /* pdflush is already working this queue */
@@ -270,7 +370,9 @@ static void balance_dirty_pages(struct a
* background_thresh, to keep the amount of dirty memory low.
*/
if ((laptop_mode && pages_written) ||
- (!laptop_mode && (nr_reclaimable > background_thresh)))
+ (!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+ + global_page_state(NR_UNSTABLE_NFS)
+ > background_thresh)))
pdflush_operation(background_writeout, 0);
}
@@ -305,9 +407,7 @@ void balance_dirty_pages_ratelimited_nr(
unsigned long ratelimit;
unsigned long *p;
- ratelimit = ratelimit_pages;
- if (dirty_exceeded)
- ratelimit = 8;
+ ratelimit = 8;
/*
* Check the rate limiting. Also, we do not want to throttle real-time
@@ -342,7 +442,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
}
for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+ get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
/*
* Boost the allowable dirty threshold a bit for page
@@ -377,7 +477,7 @@ static void background_writeout(unsigned
long background_thresh;
long dirty_thresh;
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+ get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
if (global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
@@ -585,6 +685,7 @@ void __init page_writeback_init(void)
mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
writeback_set_ratelimit();
register_cpu_notifier(&ratelimit_nb);
+ vm_cycle_shift = 1 + ilog2(vm_total_pages);
}
/**
@@ -986,6 +1087,7 @@ int test_clear_page_writeback(struct pag
page_index(page),
PAGECACHE_TAG_WRITEBACK);
__dec_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ bdi_writeout_inc(mapping->backing_dev_info);
}
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -91,6 +91,9 @@ void bdi_init(struct backing_dev_in
{
int i;
+ spin_lock_init(&bdi->lock);
+ bdi->cycles = 0;
+
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
atomic_long_set(&bdi->bdi_stats[i], 0);
--
--
* [PATCH 08/12] mm: fixup possible deadlock
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (6 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 07/12] mm: per device dirty threshold root
@ 2007-04-05 17:42 ` root
2007-04-05 22:43 ` Andrew Morton
2007-04-05 17:42 ` [PATCH 09/12] mm: remove throttle_vm_writeback root
` (4 subsequent siblings)
12 siblings, 1 reply; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: bdi_stat_accurate.patch --]
[-- Type: text/plain, Size: 5767 bytes --]
When the threshold is on the order of the per-CPU inaccuracies we can
deadlock by never seeing the updated count. Introduce a more expensive
but more accurate stat read function to use on low thresholds.
(TODO: roll into the bdi_stat patch)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/backing-dev.h | 13 ++++++++++++-
mm/backing-dev.c | 31 +++++++++++++++++++++++++------
mm/page-writeback.c | 19 +++++++++++++++----
3 files changed, 52 insertions(+), 11 deletions(-)
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -8,6 +8,7 @@
#ifndef _LINUX_BACKING_DEV_H
#define _LINUX_BACKING_DEV_H
+#include <linux/cpumask.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>
@@ -34,7 +35,6 @@ enum bdi_stat_item {
#ifdef CONFIG_SMP
struct bdi_per_cpu_data {
- s8 stat_threshold;
s8 bdi_stat_diff[NR_BDI_STAT_ITEMS];
} ____cacheline_aligned_in_smp;
#endif
@@ -60,6 +60,7 @@ struct backing_dev_info {
atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
#ifdef CONFIG_SMP
+ int stat_threshold;
struct bdi_per_cpu_data pcd[NR_CPUS];
#endif
};
@@ -109,6 +110,8 @@ static inline unsigned long bdi_stat(str
}
#ifdef CONFIG_SMP
+unsigned long bdi_stat_accurate(struct backing_dev_info *bdi, enum bdi_stat_item item);
+
void __mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int delta);
void __inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
void __dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
@@ -117,8 +120,14 @@ void mod_bdi_stat(struct backing_dev_inf
void inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+static inline unsigned long bdi_stat_delta(struct backing_dev_info *bdi)
+{
+ return num_online_cpus() * bdi->stat_threshold;
+}
#else /* CONFIG_SMP */
+#define bdi_stat_accurate bdi_stat
+
static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, int delta)
{
@@ -142,6 +151,8 @@ static inline void __dec_bdi_stat(struct
#define mod_bdi_stat __mod_bdi_stat
#define inc_bdi_stat __inc_bdi_stat
#define dec_bdi_stat __dec_bdi_stat
+
+#define bdi_stat_delta(bdi) 1UL
#endif
void bdi_init(struct backing_dev_info *bdi);
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -98,17 +98,36 @@ void bdi_init(struct backing_dev_in
atomic_long_set(&bdi->bdi_stats[i], 0);
#ifdef CONFIG_SMP
+ bdi->stat_threshold = 8 * ilog2(num_online_cpus());
for (i = 0; i < NR_CPUS; i++) {
int j;
for (j = 0; j < NR_BDI_STAT_ITEMS; j++)
bdi->pcd[i].bdi_stat_diff[j] = 0;
- bdi->pcd[i].stat_threshold = 8 * ilog2(num_online_cpus());
}
#endif
}
EXPORT_SYMBOL(bdi_init);
#ifdef CONFIG_SMP
+unsigned long bdi_stat_accurate(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ long x = atomic_long_read(&bdi->bdi_stats[item]);
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct bdi_per_cpu_data *pcd = &bdi->pcd[cpu];
+ s8 *p = pcd->bdi_stat_diff + item;
+
+ x += *p;
+ }
+
+ if (x < 0)
+ x = 0;
+
+ return x;
+}
+
void __mod_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, int delta)
{
@@ -118,7 +137,7 @@ void __mod_bdi_stat(struct backing_dev_i
x = delta + *p;
- if (unlikely(x > pcd->stat_threshold || x < -pcd->stat_threshold)) {
+ if (unlikely(x > bdi->stat_threshold || x < -bdi->stat_threshold)) {
bdi_stat_add(x, bdi, item);
x = 0;
}
@@ -144,8 +163,8 @@ void __inc_bdi_stat(struct backing_dev_i
(*p)++;
- if (unlikely(*p > pcd->stat_threshold)) {
- int overstep = pcd->stat_threshold / 2;
+ if (unlikely(*p > bdi->stat_threshold)) {
+ int overstep = bdi->stat_threshold / 2;
bdi_stat_add(*p + overstep, bdi, item);
*p = -overstep;
@@ -170,8 +189,8 @@ void __dec_bdi_stat(struct backing_dev_i
(*p)--;
- if (unlikely(*p < -pcd->stat_threshold)) {
- int overstep = pcd->stat_threshold / 2;
+ if (unlikely(*p < -bdi->stat_threshold)) {
+ int overstep = bdi->stat_threshold / 2;
bdi_stat_add(*p - overstep, bdi, item);
*p = overstep;
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -341,14 +341,25 @@ static void balance_dirty_pages(struct a
* been flushed to permanent storage.
*/
if (bdi_nr_reclaimable) {
+ unsigned long bdi_nr_writeback;
writeback_inodes(&wbc);
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
- bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
- bdi_stat(bdi, BDI_UNSTABLE);
- if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
- bdi_thresh)
+
+ if (bdi_thresh < bdi_stat_delta(bdi)) {
+ bdi_nr_reclaimable =
+ bdi_stat_accurate(bdi, BDI_DIRTY) +
+ bdi_stat_accurate(bdi, BDI_UNSTABLE);
+ bdi_nr_writeback =
+ bdi_stat_accurate(bdi, BDI_WRITEBACK);
+ } else {
+ bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
+ bdi_stat(bdi, BDI_UNSTABLE);
+ bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+ }
+
+ if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
break;
pages_written += write_chunk - wbc.nr_to_write;
--
--
* [PATCH 09/12] mm: remove throttle_vm_writeback
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (7 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 08/12] mm: fixup possible deadlock root
@ 2007-04-05 17:42 ` root
2007-04-05 22:44 ` Andrew Morton
2007-04-05 17:42 ` [PATCH 10/12] mm: page_alloc_wait root
` (3 subsequent siblings)
12 siblings, 1 reply; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: remove-throttle_vm_writeout.patch --]
[-- Type: text/plain, Size: 2842 bytes --]
Rely on accurate dirty page accounting to provide enough pushback.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6-mm/include/linux/writeback.h
===================================================================
--- linux-2.6-mm.orig/include/linux/writeback.h 2007-04-05 13:23:51.000000000 +0200
+++ linux-2.6-mm/include/linux/writeback.h 2007-04-05 13:24:11.000000000 +0200
@@ -84,7 +84,6 @@ static inline void wait_on_inode(struct
int wakeup_pdflush(long nr_pages);
void laptop_io_completion(void);
void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
extern struct timer_list laptop_mode_wb_timer;
static inline int laptop_spinned_down(void)
Index: linux-2.6-mm/mm/page-writeback.c
===================================================================
--- linux-2.6-mm.orig/mm/page-writeback.c 2007-04-05 13:23:51.000000000 +0200
+++ linux-2.6-mm/mm/page-writeback.c 2007-04-05 13:24:38.000000000 +0200
@@ -437,37 +437,6 @@ void balance_dirty_pages_ratelimited_nr(
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
-void throttle_vm_writeout(gfp_t gfp_mask)
-{
- long background_thresh;
- long dirty_thresh;
-
- if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
- /*
- * The caller might hold locks which can prevent IO completion
- * or progress in the filesystem. So we cannot just sit here
- * waiting for IO to complete.
- */
- congestion_wait(WRITE, HZ/10);
- return;
- }
-
- for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
-
- /*
- * Boost the allowable dirty threshold a bit for page
- * allocators so they don't get DoS'ed by heavy writers
- */
- dirty_thresh += dirty_thresh / 10; /* wheeee... */
-
- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
- congestion_wait(WRITE, HZ/10);
- }
-}
-
/*
* writeback at least _min_pages, and keep writing until the amount of dirty
* memory is less than the background threshold, or until we're all clean.
Index: linux-2.6-mm/mm/vmscan.c
===================================================================
--- linux-2.6-mm.orig/mm/vmscan.c 2007-04-03 12:17:57.000000000 +0200
+++ linux-2.6-mm/mm/vmscan.c 2007-04-05 13:24:03.000000000 +0200
@@ -1047,8 +1047,6 @@ static unsigned long shrink_zone(int pri
}
}
- throttle_vm_writeout(sc->gfp_mask);
-
atomic_dec(&zone->reclaim_in_progress);
return nr_reclaimed;
}
--
--
* [PATCH 10/12] mm: page_alloc_wait
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (8 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 09/12] mm: remove throttle_vm_writeback root
@ 2007-04-05 17:42 ` root
2007-04-05 22:57 ` Andrew Morton
2007-04-05 17:42 ` [PATCH 11/12] mm: accurate pageout congestion wait root
` (2 subsequent siblings)
12 siblings, 1 reply; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: page_alloc_wait.patch --]
[-- Type: text/plain, Size: 4914 bytes --]
Introduce a mechanism to wait on free memory.
Currently congestion_wait() is abused to do this.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/i386/lib/usercopy.c | 2 +-
fs/xfs/linux-2.6/kmem.c | 4 ++--
include/linux/mm.h | 3 +++
mm/page_alloc.c | 25 +++++++++++++++++++++++--
mm/shmem.c | 2 +-
mm/vmscan.c | 1 +
6 files changed, 31 insertions(+), 6 deletions(-)
Index: linux-2.6-mm/arch/i386/lib/usercopy.c
===================================================================
--- linux-2.6-mm.orig/arch/i386/lib/usercopy.c 2007-04-05 16:24:15.000000000 +0200
+++ linux-2.6-mm/arch/i386/lib/usercopy.c 2007-04-05 16:29:49.000000000 +0200
@@ -751,7 +751,7 @@ survive:
if (retval == -ENOMEM && is_init(current)) {
up_read(&current->mm->mmap_sem);
- congestion_wait(WRITE, HZ/50);
+ page_alloc_wait(HZ/50);
goto survive;
}
Index: linux-2.6-mm/fs/xfs/linux-2.6/kmem.c
===================================================================
--- linux-2.6-mm.orig/fs/xfs/linux-2.6/kmem.c 2007-04-05 16:24:15.000000000 +0200
+++ linux-2.6-mm/fs/xfs/linux-2.6/kmem.c 2007-04-05 16:29:49.000000000 +0200
@@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __n
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__FUNCTION__, lflags);
- congestion_wait(WRITE, HZ/50);
+ page_alloc_wait(HZ/50);
} while (1);
}
@@ -131,7 +131,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsig
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__FUNCTION__, lflags);
- congestion_wait(WRITE, HZ/50);
+ page_alloc_wait(HZ/50);
} while (1);
}
Index: linux-2.6-mm/include/linux/mm.h
===================================================================
--- linux-2.6-mm.orig/include/linux/mm.h 2007-04-05 16:24:15.000000000 +0200
+++ linux-2.6-mm/include/linux/mm.h 2007-04-05 16:29:49.000000000 +0200
@@ -1028,6 +1028,9 @@ extern void setup_per_cpu_pageset(void);
static inline void setup_per_cpu_pageset(void) {}
#endif
+void page_alloc_ok(void);
+long page_alloc_wait(long timeout);
+
/* prio_tree.c */
void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old);
void vma_prio_tree_insert(struct vm_area_struct *, struct prio_tree_root *);
Index: linux-2.6-mm/mm/page_alloc.c
===================================================================
--- linux-2.6-mm.orig/mm/page_alloc.c 2007-04-05 16:24:15.000000000 +0200
+++ linux-2.6-mm/mm/page_alloc.c 2007-04-05 16:35:04.000000000 +0200
@@ -107,6 +107,9 @@ unsigned long __meminitdata nr_kernel_pa
unsigned long __meminitdata nr_all_pages;
static unsigned long __initdata dma_reserve;
+static wait_queue_head_t page_alloc_wqh =
+ __WAIT_QUEUE_HEAD_INITIALIZER(page_alloc_wqh);
+
#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
/*
* MAX_ACTIVE_REGIONS determines the maxmimum number of distinct
@@ -1698,7 +1701,7 @@ nofail_alloc:
if (page)
goto got_pg;
if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
+ page_alloc_wait(HZ/50);
goto nofail_alloc;
}
}
@@ -1763,7 +1766,7 @@ nofail_alloc:
do_retry = 1;
}
if (do_retry) {
- congestion_wait(WRITE, HZ/50);
+ page_alloc_wait(HZ/50);
goto rebalance;
}
@@ -4217,3 +4220,21 @@ void set_pageblock_flags_group(struct pa
else
__clear_bit(bitidx + start_bitidx, bitmap);
}
+
+void page_alloc_ok(void)
+{
+ if (waitqueue_active(&page_alloc_wqh))
+ wake_up(&page_alloc_wqh);
+}
+
+long page_alloc_wait(long timeout)
+{
+ long ret;
+ DEFINE_WAIT(wait);
+
+ prepare_to_wait(&page_alloc_wqh, &wait, TASK_UNINTERRUPTIBLE);
+ ret = schedule_timeout(timeout);
+ finish_wait(&page_alloc_wqh, &wait);
+ return ret;
+}
+EXPORT_SYMBOL(page_alloc_wait);
Index: linux-2.6-mm/mm/shmem.c
===================================================================
--- linux-2.6-mm.orig/mm/shmem.c 2007-04-05 16:24:15.000000000 +0200
+++ linux-2.6-mm/mm/shmem.c 2007-04-05 16:30:31.000000000 +0200
@@ -1216,7 +1216,7 @@ repeat:
page_cache_release(swappage);
if (error == -ENOMEM) {
/* let kswapd refresh zone for GFP_ATOMICs */
- congestion_wait(WRITE, HZ/50);
+ page_alloc_wait(HZ/50);
}
goto repeat;
}
Index: linux-2.6-mm/mm/vmscan.c
===================================================================
--- linux-2.6-mm.orig/mm/vmscan.c 2007-04-05 16:29:46.000000000 +0200
+++ linux-2.6-mm/mm/vmscan.c 2007-04-05 16:29:49.000000000 +0200
@@ -1436,6 +1436,7 @@ static int kswapd(void *p)
finish_wait(&pgdat->kswapd_wait, &wait);
balance_pgdat(pgdat, order);
+ page_alloc_ok();
}
return 0;
}
--
--
* [PATCH 11/12] mm: accurate pageout congestion wait
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (9 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 10/12] mm: page_alloc_wait root
@ 2007-04-05 17:42 ` root
2007-04-05 23:17 ` Andrew Morton
2007-04-05 17:42 ` [PATCH 12/12] mm: per BDI congestion feedback root
2007-04-05 17:47 ` [PATCH 00/12] per device dirty throttling -v3 Peter Zijlstra
12 siblings, 1 reply; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: kswapd-writeout-wait.patch --]
[-- Type: text/plain, Size: 5122 bytes --]
Only do the congestion wait when we actually encountered congestion.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/swap.h | 1 +
mm/page_io.c | 9 +++++++++
mm/vmscan.c | 25 ++++++++++++++++++++-----
3 files changed, 30 insertions(+), 5 deletions(-)
Index: linux-2.6-mm/mm/vmscan.c
===================================================================
--- linux-2.6-mm.orig/mm/vmscan.c 2007-04-05 16:29:49.000000000 +0200
+++ linux-2.6-mm/mm/vmscan.c 2007-04-05 16:35:36.000000000 +0200
@@ -70,6 +70,8 @@ struct scan_control {
int all_unreclaimable;
int order;
+
+ int encountered_congestion;
};
/*
@@ -315,7 +317,8 @@ typedef enum {
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
*/
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping,
+ struct scan_control *sc)
{
/*
* If the page is dirty, only perform writeback if that write
@@ -357,6 +360,7 @@ static pageout_t pageout(struct page *pa
if (clear_page_dirty_for_io(page)) {
int res;
+ struct backing_dev_info *bdi;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
.nr_to_write = SWAP_CLUSTER_MAX,
@@ -366,6 +370,14 @@ static pageout_t pageout(struct page *pa
.for_reclaim = 1,
};
+ if (mapping == &swapper_space)
+ bdi = swap_bdi(page);
+ else
+ bdi = mapping->backing_dev_info;
+
+ if (bdi_congested(bdi, WRITE))
+ sc->encountered_congestion = 1;
+
SetPageReclaim(page);
res = mapping->a_ops->writepage(page, &wbc);
if (res < 0)
@@ -533,7 +545,7 @@ static unsigned long shrink_page_list(st
goto keep_locked;
/* Page is dirty, try to write it out here */
- switch(pageout(page, mapping)) {
+ switch(pageout(page, mapping, sc)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
@@ -1141,6 +1153,7 @@ unsigned long try_to_free_pages(struct z
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
sc.nr_scanned = 0;
+ sc.encountered_congestion = 0;
if (!priority)
disable_swap_token();
nr_reclaimed += shrink_zones(priority, zones, &sc);
@@ -1169,7 +1182,7 @@ unsigned long try_to_free_pages(struct z
}
/* Take a nap, wait for some writeback to complete */
- if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
+ if (sc.encountered_congestion)
congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_caches still had more to do? don't OOM, then */
@@ -1250,6 +1263,7 @@ loop_again:
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
+ sc.encountered_congestion = 0;
/* The swap token gets in the way of swapout... */
if (!priority)
disable_swap_token();
@@ -1337,7 +1351,7 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && priority < DEF_PRIORITY - 2)
+ if (sc.encountered_congestion)
congestion_wait(WRITE, HZ/10);
/*
@@ -1580,6 +1594,7 @@ unsigned long shrink_all_memory(unsigned
unsigned long nr_to_scan = nr_pages - ret;
sc.nr_scanned = 0;
+ sc.encountered_congestion = 0;
ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
if (ret >= nr_pages)
goto out;
@@ -1591,7 +1606,7 @@ unsigned long shrink_all_memory(unsigned
if (ret >= nr_pages)
goto out;
- if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
+ if (sc.encountered_congestion)
congestion_wait(WRITE, HZ / 10);
}
}
Index: linux-2.6-mm/include/linux/swap.h
===================================================================
--- linux-2.6-mm.orig/include/linux/swap.h 2007-04-05 16:24:02.000000000 +0200
+++ linux-2.6-mm/include/linux/swap.h 2007-04-05 16:35:36.000000000 +0200
@@ -220,6 +220,7 @@ extern void swap_unplug_io_fn(struct bac
#ifdef CONFIG_SWAP
/* linux/mm/page_io.c */
+extern struct backing_dev_info *swap_bdi(struct page *);
extern int swap_readpage(struct file *, struct page *);
extern int swap_writepage(struct page *page, struct writeback_control *wbc);
extern int end_swap_bio_read(struct bio *bio, unsigned int bytes_done, int err);
Index: linux-2.6-mm/mm/page_io.c
===================================================================
--- linux-2.6-mm.orig/mm/page_io.c 2007-04-05 16:24:02.000000000 +0200
+++ linux-2.6-mm/mm/page_io.c 2007-04-05 16:36:26.000000000 +0200
@@ -19,6 +19,15 @@
#include <linux/writeback.h>
#include <asm/pgtable.h>
+struct backing_dev_info *swap_bdi(struct page *page)
+{
+ struct swap_info_struct *sis;
+ swp_entry_t entry = { .val = page_private(page), };
+
+ sis = get_swap_info_struct(swp_type(entry));
+ return blk_get_backing_dev_info(sis->bdev);
+}
+
static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
struct page *page, bio_end_io_t end_io)
{
--
--
* [PATCH 12/12] mm: per BDI congestion feedback
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (10 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 11/12] mm: accurate pageout congestion wait root
@ 2007-04-05 17:42 ` root
2007-04-05 23:24 ` Andrew Morton
2007-04-05 17:47 ` [PATCH 00/12] per device dirty throttling -v3 Peter Zijlstra
12 siblings, 1 reply; 27+ messages in thread
From: root @ 2007-04-05 17:42 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita
[-- Attachment #1: bdi_congestion.patch --]
[-- Type: text/plain, Size: 16559 bytes --]
Now that we have per BDI dirty throttling it makes sense to also have
per BDI congestion feedback; why wait on another device if the current
one is not congested?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
drivers/block/pktcdvd.c | 2 -
drivers/md/dm-crypt.c | 7 +++--
fs/cifs/file.c | 2 -
fs/ext4/writeback.c | 2 -
fs/fat/file.c | 4 ++
fs/fs-writeback.c | 2 -
fs/nfs/write.c | 2 -
fs/reiser4/vfs_ops.c | 2 -
fs/reiserfs/journal.c | 7 +++--
fs/xfs/linux-2.6/xfs_aops.c | 2 -
include/linux/backing-dev.h | 7 +++--
include/linux/writeback.h | 3 +-
mm/backing-dev.c | 61 ++++++++++++++++++++++++++++----------------
mm/page-writeback.c | 19 +++++++------
mm/vmscan.c | 19 +++++++------
15 files changed, 88 insertions(+), 53 deletions(-)
Index: linux-2.6-mm/include/linux/backing-dev.h
===================================================================
--- linux-2.6-mm.orig/include/linux/backing-dev.h 2007-04-05 18:24:34.000000000 +0200
+++ linux-2.6-mm/include/linux/backing-dev.h 2007-04-05 19:26:24.000000000 +0200
@@ -10,6 +10,7 @@
#include <linux/cpumask.h>
#include <linux/spinlock.h>
+#include <linux/wait.h>
#include <asm/atomic.h>
struct page;
@@ -52,6 +53,8 @@ struct backing_dev_info {
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+ wait_queue_head_t congestion_wqh[2];
+
/*
* data used for scaling the writeback cache
*/
@@ -214,8 +217,8 @@ static inline int bdi_rw_congested(struc
void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
void set_bdi_congested(struct backing_dev_info *bdi, int rw);
-long congestion_wait(int rw, long timeout);
-long congestion_wait_interruptible(int rw, long timeout);
+long congestion_wait(struct backing_dev_info *bdi, int rw, long timeout);
+long congestion_wait_interruptible(struct backing_dev_info *bdi, int rw, long timeout);
#define bdi_cap_writeback_dirty(bdi) \
(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6-mm/mm/backing-dev.c
===================================================================
--- linux-2.6-mm.orig/mm/backing-dev.c 2007-04-05 18:24:34.000000000 +0200
+++ linux-2.6-mm/mm/backing-dev.c 2007-04-05 18:26:00.000000000 +0200
@@ -5,16 +5,10 @@
#include <linux/sched.h>
#include <linux/module.h>
-static wait_queue_head_t congestion_wqh[2] = {
- __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
- __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
- };
-
-
void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
{
enum bdi_state bit;
- wait_queue_head_t *wqh = &congestion_wqh[rw];
+ wait_queue_head_t *wqh = &bdi->congestion_wqh[rw];
bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
clear_bit(bit, &bdi->state);
@@ -42,31 +36,48 @@ EXPORT_SYMBOL(set_bdi_congested);
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
-long congestion_wait(int rw, long timeout)
+long congestion_wait(struct backing_dev_info *bdi, int rw, long timeout)
{
- long ret;
+ long ret = 0;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[rw];
+ wait_queue_head_t *wqh = &bdi->congestion_wqh[rw];
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
- ret = io_schedule_timeout(timeout);
- finish_wait(wqh, &wait);
+ if (bdi_congested(bdi, rw)) {
+ for (;;) {
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ if (!bdi_congested(bdi, rw))
+ break;
+ ret = io_schedule_timeout(timeout);
+ if (!ret)
+ break;
+ }
+ finish_wait(wqh, &wait);
+ }
return ret;
}
EXPORT_SYMBOL(congestion_wait);
-long congestion_wait_interruptible(int rw, long timeout)
+long congestion_wait_interruptible(struct backing_dev_info *bdi,
+ int rw, long timeout)
{
- long ret;
+ long ret = 0;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[rw];
+ wait_queue_head_t *wqh = &bdi->congestion_wqh[rw];
- prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
- if (signal_pending(current))
- ret = -ERESTARTSYS;
- else
- ret = io_schedule_timeout(timeout);
- finish_wait(wqh, &wait);
+ if (bdi_congested(bdi, rw)) {
+ for (;;) {
+ prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
+ if (!bdi_congested(bdi, rw))
+ break;
+ if (signal_pending(current))
+ ret = -ERESTARTSYS;
+ else
+ ret = io_schedule_timeout(timeout);
+ if (!ret)
+ break;
+ }
+ finish_wait(wqh, &wait);
+ }
return ret;
}
EXPORT_SYMBOL(congestion_wait_interruptible);
@@ -78,6 +89,10 @@ void bdi_init(struct backing_dev_info *b
{
int i;
+ for (i = 0; i < ARRAY_SIZE(bdi->congestion_wqh); i++)
+ bdi->congestion_wqh[i] = (wait_queue_head_t)
+ __WAIT_QUEUE_HEAD_INITIALIZER(bdi->congestion_wqh[i]);
+
spin_lock_init(&bdi->lock);
bdi->cycles = 0;
@@ -195,3 +210,5 @@ void dec_bdi_stat(struct backing_dev_inf
}
EXPORT_SYMBOL(dec_bdi_stat);
#endif
+
+
Index: linux-2.6-mm/mm/page-writeback.c
===================================================================
--- linux-2.6-mm.orig/mm/page-writeback.c 2007-04-05 18:24:34.000000000 +0200
+++ linux-2.6-mm/mm/page-writeback.c 2007-04-05 18:24:34.000000000 +0200
@@ -366,7 +366,7 @@ static void balance_dirty_pages(struct a
if (pages_written >= write_chunk)
break; /* We've done our duty */
}
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(bdi, WRITE, HZ/10);
}
if (writeback_in_progress(bdi))
@@ -462,15 +462,17 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
- wbc.encountered_congestion = 0;
+ wbc.encountered_congestion = NULL;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
writeback_inodes(&wbc);
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion)
+ congestion_wait(wbc.encountered_congestion,
+ WRITE, HZ/10);
+ else
break;
}
}
@@ -535,12 +537,13 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
- wbc.encountered_congestion = 0;
+ wbc.encountered_congestion = NULL;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
if (wbc.encountered_congestion)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(wbc.encountered_congestion,
+ WRITE, HZ/10);
else
break; /* All the old data is written */
}
@@ -698,7 +701,7 @@ int write_cache_pages(struct address_spa
int range_whole = 0;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
+ wbc->encountered_congestion = bdi;
return 0;
}
@@ -760,7 +763,7 @@ retry:
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
+ wbc->encountered_congestion = bdi;
done = 1;
}
}
Index: linux-2.6-mm/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6-mm.orig/drivers/block/pktcdvd.c 2007-04-05 18:24:29.000000000 +0200
+++ linux-2.6-mm/drivers/block/pktcdvd.c 2007-04-05 18:24:34.000000000 +0200
@@ -2589,7 +2589,7 @@ static int pkt_make_request(request_queu
set_bdi_congested(&q->backing_dev_info, WRITE);
do {
spin_unlock(&pd->lock);
- congestion_wait(WRITE, HZ);
+ congestion_wait(&q->backing_dev_info, WRITE, HZ);
spin_lock(&pd->lock);
} while(pd->bio_queue_size > pd->write_congestion_off);
}
Index: linux-2.6-mm/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6-mm.orig/drivers/md/dm-crypt.c 2007-04-05 18:24:29.000000000 +0200
+++ linux-2.6-mm/drivers/md/dm-crypt.c 2007-04-05 18:24:34.000000000 +0200
@@ -640,8 +640,11 @@ static void process_write(struct crypt_i
* may be gone already. */
/* out of memory -> run queues */
- if (remaining)
- congestion_wait(WRITE, HZ/100);
+ if (remaining) {
+ struct backing_dev_info *bdi =
+ &io->target->table->md->queue->backing_dev_info;
+ congestion_wait(bdi, WRITE, HZ/100);
+ }
}
}
Index: linux-2.6-mm/fs/fat/file.c
===================================================================
--- linux-2.6-mm.orig/fs/fat/file.c 2007-04-05 18:24:28.000000000 +0200
+++ linux-2.6-mm/fs/fat/file.c 2007-04-05 18:24:34.000000000 +0200
@@ -118,8 +118,10 @@ static int fat_file_release(struct inode
{
if ((filp->f_mode & FMODE_WRITE) &&
MSDOS_SB(inode->i_sb)->options.flush) {
+ struct backing_dev_info *bdi =
+ inode->i_mapping->backing_dev_info;
fat_flush_inodes(inode->i_sb, inode, NULL);
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(bdi, WRITE, HZ/10);
}
return 0;
}
Index: linux-2.6-mm/fs/reiserfs/journal.c
===================================================================
--- linux-2.6-mm.orig/fs/reiserfs/journal.c 2007-04-05 18:24:28.000000000 +0200
+++ linux-2.6-mm/fs/reiserfs/journal.c 2007-04-05 18:25:00.000000000 +0200
@@ -970,8 +970,11 @@ int reiserfs_async_progress_wait(struct
{
DEFINE_WAIT(wait);
struct reiserfs_journal *j = SB_JOURNAL(s);
- if (atomic_read(&j->j_async_throttle))
- congestion_wait(WRITE, HZ / 10);
+ if (atomic_read(&j->j_async_throttle)) {
+ struct backing_dev_info *bdi =
+ blk_get_backing_dev_info(j->j_dev_bd);
+ congestion_wait(bdi, WRITE, HZ / 10);
+ }
return 0;
}
Index: linux-2.6-mm/include/linux/writeback.h
===================================================================
--- linux-2.6-mm.orig/include/linux/writeback.h 2007-04-05 18:24:34.000000000 +0200
+++ linux-2.6-mm/include/linux/writeback.h 2007-04-05 18:24:34.000000000 +0200
@@ -54,11 +54,12 @@ struct writeback_control {
loff_t range_end;
unsigned nonblocking:1; /* Don't get stuck on request queues */
- unsigned encountered_congestion:1; /* An output: a queue is full */
unsigned for_kupdate:1; /* A kupdate writeback */
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+
+ struct backing_dev_info *encountered_congestion; /* An output: a queue is full */
};
/*
Index: linux-2.6-mm/fs/cifs/file.c
===================================================================
--- linux-2.6-mm.orig/fs/cifs/file.c 2007-04-05 18:24:28.000000000 +0200
+++ linux-2.6-mm/fs/cifs/file.c 2007-04-05 18:24:34.000000000 +0200
@@ -1143,7 +1143,7 @@ static int cifs_writepages(struct addres
* If it is, we should test it again after we do I/O
*/
if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
+ wbc->encountered_congestion = bdi;
kfree(iov);
return 0;
}
Index: linux-2.6-mm/fs/ext4/writeback.c
===================================================================
--- linux-2.6-mm.orig/fs/ext4/writeback.c 2007-04-05 18:24:28.000000000 +0200
+++ linux-2.6-mm/fs/ext4/writeback.c 2007-04-05 18:24:34.000000000 +0200
@@ -782,7 +782,7 @@ int ext4_wb_writepages(struct address_sp
#ifdef EXT4_WB_STATS
atomic_inc(&EXT4_SB(inode->i_sb)->s_wb_congested);
#endif
- wbc->encountered_congestion = 1;
+ wbc->encountered_congestion = bdi;
done = 1;
}
}
Index: linux-2.6-mm/fs/fs-writeback.c
===================================================================
--- linux-2.6-mm.orig/fs/fs-writeback.c 2007-04-05 18:24:29.000000000 +0200
+++ linux-2.6-mm/fs/fs-writeback.c 2007-04-05 18:24:34.000000000 +0200
@@ -349,7 +349,7 @@ int generic_sync_sb_inodes(struct super_
}
if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
+ wbc->encountered_congestion = bdi;
if (!sb_is_blkdev_sb(sb))
break; /* Skip a congested fs */
list_move(&inode->i_list, &sb->s_dirty);
Index: linux-2.6-mm/fs/reiser4/vfs_ops.c
===================================================================
--- linux-2.6-mm.orig/fs/reiser4/vfs_ops.c 2007-04-05 18:24:29.000000000 +0200
+++ linux-2.6-mm/fs/reiser4/vfs_ops.c 2007-04-05 18:24:34.000000000 +0200
@@ -169,7 +169,7 @@ void reiser4_writeout(struct super_block
if (wbc->nonblocking &&
bdi_write_congested(mapping->backing_dev_info)) {
blk_run_address_space(mapping);
- wbc->encountered_congestion = 1;
+ wbc->encountered_congestion = mapping->backing_dev_info;
break;
}
repeats++;
Index: linux-2.6-mm/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6-mm.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-04-05 18:24:28.000000000 +0200
+++ linux-2.6-mm/fs/xfs/linux-2.6/xfs_aops.c 2007-04-05 18:24:34.000000000 +0200
@@ -777,7 +777,7 @@ xfs_convert_page(
bdi = inode->i_mapping->backing_dev_info;
wbc->nr_to_write--;
if (bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
+ wbc->encountered_congestion = bdi;
done = 1;
} else if (wbc->nr_to_write <= 0) {
done = 1;
Index: linux-2.6-mm/mm/vmscan.c
===================================================================
--- linux-2.6-mm.orig/mm/vmscan.c 2007-04-05 18:24:34.000000000 +0200
+++ linux-2.6-mm/mm/vmscan.c 2007-04-05 18:24:34.000000000 +0200
@@ -71,7 +71,7 @@ struct scan_control {
int order;
- int encountered_congestion;
+ struct backing_dev_info *encountered_congestion;
};
/*
@@ -376,7 +376,7 @@ static pageout_t pageout(struct page *pa
bdi = mapping->backing_dev_info;
if (bdi_congested(bdi, WRITE))
- sc->encountered_congestion = 1;
+ sc->encountered_congestion = bdi;
SetPageReclaim(page);
res = mapping->a_ops->writepage(page, &wbc);
@@ -1153,7 +1153,7 @@ unsigned long try_to_free_pages(struct z
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
sc.nr_scanned = 0;
- sc.encountered_congestion = 0;
+ sc.encountered_congestion = NULL;
if (!priority)
disable_swap_token();
nr_reclaimed += shrink_zones(priority, zones, &sc);
@@ -1183,7 +1183,8 @@ unsigned long try_to_free_pages(struct z
/* Take a nap, wait for some writeback to complete */
if (sc.encountered_congestion)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(sc.encountered_congestion,
+ WRITE, HZ/10);
}
/* top priority shrink_caches still had more to do? don't OOM, then */
if (!sc.all_unreclaimable)
@@ -1263,7 +1264,7 @@ loop_again:
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
- sc.encountered_congestion = 0;
+ sc.encountered_congestion = NULL;
/* The swap token gets in the way of swapout... */
if (!priority)
disable_swap_token();
@@ -1352,7 +1353,8 @@ loop_again:
* another pass across the zones.
*/
if (sc.encountered_congestion)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(sc.encountered_congestion,
+ WRITE, HZ/10);
/*
* We do this so kswapd doesn't build up large priorities for
@@ -1594,7 +1596,7 @@ unsigned long shrink_all_memory(unsigned
unsigned long nr_to_scan = nr_pages - ret;
sc.nr_scanned = 0;
- sc.encountered_congestion = 0;
+ sc.encountered_congestion = NULL;
ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
if (ret >= nr_pages)
goto out;
@@ -1607,7 +1609,8 @@ unsigned long shrink_all_memory(unsigned
goto out;
if (sc.encountered_congestion)
- congestion_wait(WRITE, HZ / 10);
+ congestion_wait(sc.encountered_congestion,
+ WRITE, HZ / 10);
}
}
Index: linux-2.6-mm/fs/nfs/write.c
===================================================================
--- linux-2.6-mm.orig/fs/nfs/write.c 2007-04-05 18:24:33.000000000 +0200
+++ linux-2.6-mm/fs/nfs/write.c 2007-04-05 18:24:34.000000000 +0200
@@ -567,7 +567,7 @@ static int nfs_wait_on_write_congestion(
sigset_t oldset;
rpc_clnt_sigmask(clnt, &oldset);
- ret = congestion_wait_interruptible(WRITE, HZ/10);
+ ret = congestion_wait_interruptible(bdi, WRITE, HZ/10);
rpc_clnt_sigunmask(clnt, &oldset);
if (ret == -ERESTARTSYS)
break;
--
* Re: [PATCH 00/12] per device dirty throttling -v3
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
` (11 preceding siblings ...)
2007-04-05 17:42 ` [PATCH 12/12] mm: per BDI congestion feedback root
@ 2007-04-05 17:47 ` Peter Zijlstra
12 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2007-04-05 17:47 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, nikita
Don't worry, it's me!
Seems I forgot to edit the From field :-(
On Thu, 2007-04-05 at 19:42 +0200, root@programming.kicks-ass.net wrote:
> Against 2.6.21-rc5-mm4 without:
> per-backing_dev-dirty-and-writeback-page-accounting.patch
>
> This series implements BDI independent dirty limits and congestion control.
>
> This should solve several problems we currently have in this area:
>
> - mutual interference starvation (for any number of BDIs), and
> - deadlocks with stacked BDIs (loop and FUSE).
>
> All the fancy new congestion code has been compile and boot tested, but
> not much more. I'm posting to get feedback on the ideas.
>
>
* Re: [PATCH 02/12] mm: scalable bdi statistics counters.
2007-04-05 17:42 ` [PATCH 02/12] mm: scalable bdi statistics counters root
@ 2007-04-05 22:37 ` Andrew Morton
2007-04-06 7:22 ` Peter Zijlstra
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2007-04-05 22:37 UTC (permalink / raw)
To: root
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
a.p.zijlstra, nikita
On Thu, 05 Apr 2007 19:42:11 +0200
root@programming.kicks-ass.net wrote:
> Provide scalable per backing_dev_info statistics counters modeled on the ZVC
> code.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> block/ll_rw_blk.c | 1
> drivers/block/rd.c | 2
> drivers/char/mem.c | 2
> fs/char_dev.c | 1
> fs/fuse/inode.c | 1
> fs/nfs/client.c | 1
> include/linux/backing-dev.h | 98 +++++++++++++++++++++++++++++++++++++++++
> mm/backing-dev.c | 103 ++++++++++++++++++++++++++++++++++++++++++++
madness! Quite duplicative of vmstat.h, yet all this infrastructure
is still only usable in one specific application.
Can we please look at generalising the vmstat.h stuff?
Or, the API in percpu_counter.h appears suitable to this application.
(The comment at line 6 is a total lie).
* Re: [PATCH 08/12] mm: fixup possible deadlock
2007-04-05 17:42 ` [PATCH 08/12] mm: fixup possible deadlock root
@ 2007-04-05 22:43 ` Andrew Morton
0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2007-04-05 22:43 UTC (permalink / raw)
To: root
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
a.p.zijlstra, nikita
On Thu, 05 Apr 2007 19:42:17 +0200
root@programming.kicks-ass.net wrote:
> When the threshold is of the order of the per-cpu inaccuracies we can
> deadlock by not receiving the updated count,
That explanation is a bit, umm, terse.
> introduce a more expensive
> but more accurate stat read function to use on low thresholds.
Looks like percpu_counter_sum().
* Re: [PATCH 09/12] mm: remove throttle_vm_writeback
2007-04-05 17:42 ` [PATCH 09/12] mm: remove throttle_vm_writeback root
@ 2007-04-05 22:44 ` Andrew Morton
2007-09-26 20:42 ` Peter Zijlstra
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2007-04-05 22:44 UTC (permalink / raw)
To: root
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
a.p.zijlstra, nikita
On Thu, 05 Apr 2007 19:42:18 +0200
root@programming.kicks-ass.net wrote:
> rely on accurate dirty page accounting to provide enough push back
I think we'd like to see a bit more justification than that, please.
* Re: [PATCH 10/12] mm: page_alloc_wait
2007-04-05 17:42 ` [PATCH 10/12] mm: page_alloc_wait root
@ 2007-04-05 22:57 ` Andrew Morton
2007-04-06 6:37 ` Peter Zijlstra
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2007-04-05 22:57 UTC (permalink / raw)
To: root
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
a.p.zijlstra, nikita
On Thu, 05 Apr 2007 19:42:19 +0200
root@programming.kicks-ass.net wrote:
> Introduce a mechanism to wait on free memory.
>
> Currently congestion_wait() is abused to do this.
Such a very small explanation for such a terrifying change.
> ...
>
> --- linux-2.6-mm.orig/mm/vmscan.c 2007-04-05 16:29:46.000000000 +0200
> +++ linux-2.6-mm/mm/vmscan.c 2007-04-05 16:29:49.000000000 +0200
> @@ -1436,6 +1436,7 @@ static int kswapd(void *p)
> finish_wait(&pgdat->kswapd_wait, &wait);
>
> balance_pgdat(pgdat, order);
> + page_alloc_ok();
> }
> return 0;
> }
For a start, we don't know that kswapd freed pages which are in a suitable
zone. And we don't know that kswapd freed pages which are in a suitable
cpuset.
congestion_wait() is similarly ignorant of the suitability of the pages,
but the whole idea behind congestion_wait is that it will throttle page
allocators to some speed which is proportional to the speed at which the IO
systems can retire writes - view it as a variable-speed polling operation,
in which the polling frequency goes up when the IO system gets faster.
This patch changes that philosophy fundamentally. That's worth more than a
2-line changelog.
Also, there might be situations in which kswapd gets stuck in some dark
corner. Perhaps the process which is waiting in the page allocator holds
filesystem locks which kswapd is blocked on. Or kswapd might be blocked on
a particular request queue, or a dead NFS server or something. The timeout
will save us, but things will be slow.
There could be other problems too, dunno - this stuff is tricky. Why are
you changing it, what problems are being solved, etc?
* Re: [PATCH 11/12] mm: accurate pageout congestion wait
2007-04-05 17:42 ` [PATCH 11/12] mm: accurate pageout congestion wait root
@ 2007-04-05 23:17 ` Andrew Morton
2007-04-06 6:51 ` Peter Zijlstra
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2007-04-05 23:17 UTC (permalink / raw)
To: root
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
a.p.zijlstra, nikita
On Thu, 05 Apr 2007 19:42:20 +0200
root@programming.kicks-ass.net wrote:
> Only do the congestion wait when we actually encountered congestion.
The name congestion_wait() was accurate back in 2002, but it isn't accurate
any more, and you got misled. It does not only wait for a queue to become
uncongested.
See clear_bdi_congested()'s callers. As long as the queue is in an
uncongested state, we deliver wakeups to congestion_wait() blockers on
every IO completion. As I said before, it is so that the MM's polling
operations poll at a higher frequency when the IO system is working faster.
(It is also to synchronise with end_page_writeback()'s feeding of clean
pages to us via rotate_reclaimable_page()).
Page reclaim can get into trouble without any request queue having entered
a congested state. For example, think about a machine which has a single
disk, and the operator has increased that disk's request queue size to
100,000. With your patch all the VM's throttling would be bypassed and we
go into a busy loop and declare OOM instantly.
There are probably other situations in which page reclaim gets into trouble
without a request queue being congested.
Minor point: bdi_congested() can be arbitrarily expensive - for DM stackups
it is roughly proportional to the number of subdevices in the device. We
need to be careful about how frequently we call it.
* Re: [PATCH 12/12] mm: per BDI congestion feedback
2007-04-05 17:42 ` [PATCH 12/12] mm: per BDI congestion feedback root
@ 2007-04-05 23:24 ` Andrew Morton
2007-04-06 7:01 ` Peter Zijlstra
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2007-04-05 23:24 UTC (permalink / raw)
To: root
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
a.p.zijlstra, nikita
On Thu, 05 Apr 2007 19:42:21 +0200
root@programming.kicks-ass.net wrote:
> Now that we have per BDI dirty throttling it makes sense to also have per BDI
> congestion feedback; why wait on another device if the current one is not
> congested.
Similar comments apply. congestion_wait() should be called
throttle_at_a_rate_proportional_to_the_speed_of_presently_uncongested_queues().
If a process is throttled in the page allocator waiting for pages to become
reclaimable, that process absolutely does not care whether those pages were
previously dirty against /dev/sda or against /dev/sdb. It wants to be woken
up for writeout completion against any queue.
- wbc.encountered_congestion = 0;
+ wbc.encountered_congestion = NULL;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
writeback_inodes(&wbc);
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion)
+ congestion_wait(wbc.encountered_congestion,
+ WRITE, HZ/10);
+ else
Well that confused me. You'd be needing to rename
wbc.encountered_congestion to congested_bdi or something.
* Re: [PATCH 10/12] mm: page_alloc_wait
2007-04-05 22:57 ` Andrew Morton
@ 2007-04-06 6:37 ` Peter Zijlstra
0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2007-04-06 6:37 UTC (permalink / raw)
To: Andrew Morton
Cc: root, linux-mm, linux-kernel, miklos, neilb, dgc,
tomoki.sekiyama.qu, nikita
On Thu, 2007-04-05 at 15:57 -0700, Andrew Morton wrote:
> On Thu, 05 Apr 2007 19:42:19 +0200
> root@programming.kicks-ass.net wrote:
>
> > Introduce a mechanism to wait on free memory.
> >
> > Currently congestion_wait() is abused to do this.
>
> Such a very small explanation for such a terrifying change.
Yes, I suck at writing changelogs, bad me. Normally I would take a day
to write them, but I just wanted to get this code out there. Perhaps a
bad decision.
> > ...
> >
> > --- linux-2.6-mm.orig/mm/vmscan.c 2007-04-05 16:29:46.000000000 +0200
> > +++ linux-2.6-mm/mm/vmscan.c 2007-04-05 16:29:49.000000000 +0200
> > @@ -1436,6 +1436,7 @@ static int kswapd(void *p)
> > finish_wait(&pgdat->kswapd_wait, &wait);
> >
> > balance_pgdat(pgdat, order);
> > + page_alloc_ok();
> > }
> > return 0;
> > }
>
> For a start, we don't know that kswapd freed pages which are in a suitable
> zone. And we don't know that kswapd freed pages which are in a suitable
> cpuset.
>
> congestion_wait() is similarly ignorant of the suitability of the pages,
> but the whole idea behind congestion_wait is that it will throttle page
> allocators to some speed which is proportional to the speed at which the IO
> systems can retire writes - view it as a variable-speed polling operation,
> in which the polling frequency goes up when the IO system gets faster.
> This patch changes that philosophy fundamentally. That's worth more than a
> 2-line changelog.
>
> Also, there might be situations in which kswapd gets stuck in some dark
> corner. Perhaps the process which is waiting in the page allocator holds
> filesystem locks which kswapd is blocked on. Or kswapd might be blocked on
> a particular request queue, or a dead NFS server or something. The timeout
> will save us, but things will be slow.
>
> There could be other problems too, dunno - this stuff is tricky. Why are
> you changing it, what problems are being solved, etc?
Let's start with the why: because of 12/12; I wanted to introduce per BDI
congestion feedback, and hence needed a BDI context for
congestion_wait(). These specific callers weren't in the context of a
BDI but of a more global idea.
Perhaps I could call page_alloc_ok() from bdi_congestion_end()
irrespective of which BDI was uncongested? That would more or less give
the old semantics.
* Re: [PATCH 11/12] mm: accurate pageout congestion wait
2007-04-05 23:17 ` Andrew Morton
@ 2007-04-06 6:51 ` Peter Zijlstra
0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2007-04-06 6:51 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
nikita
On Thu, 2007-04-05 at 16:17 -0700, Andrew Morton wrote:
> On Thu, 05 Apr 2007 19:42:20 +0200
> root@programming.kicks-ass.net wrote:
>
> > Only do the congestion wait when we actually encountered congestion.
>
> The name congestion_wait() was accurate back in 2002, but it isn't accurate
> any more, and you got misled. It does not only wait for a queue to become
> uncongested.
Quite so indeed.
> See clear_bdi_congested()'s callers. As long as the queue is in an
> uncongested state, we deliver wakeups to congestion_wait() blockers on
> every IO completion. As I said before, it is so that the MM's polling
> operations poll at a higher frequency when the IO system is working faster.
> (It is also to synchronise with end_page_writeback()'s feeding of clean
> pages to us via rotate_reclaimable_page()).
Hmm, but the condition under which we did call congestion_wait() is a
bit magical.
> Page reclaim can get into trouble without any request queue having entered
> a congested state. For example, think about a machine which has a single
> disk, and the operator has increased that disk's request queue size to
> 100,000. With your patch all the VM's throttling would be bypassed and we
> go into a busy loop and declare OOM instantly.
>
> There are probably other situations in which page reclaim gets into trouble
> without a request queue being congested.
Ok, in the light of all this, I will think on this some more.
> Minor point: bdi_congested() can be arbitrarily expensive - for DM stackups
> it is roughly proportional to the number of subdevices in the device. We
> need to be careful about how frequently we call it.
Yuck, ok, good point.
* Re: [PATCH 12/12] mm: per BDI congestion feedback
2007-04-05 23:24 ` Andrew Morton
@ 2007-04-06 7:01 ` Peter Zijlstra
2007-04-06 11:00 ` Andrew Morton
0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2007-04-06 7:01 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
nikita
On Thu, 2007-04-05 at 16:24 -0700, Andrew Morton wrote:
> On Thu, 05 Apr 2007 19:42:21 +0200
> root@programming.kicks-ass.net wrote:
>
> > > Now that we have per BDI dirty throttling it makes sense to also have per BDI
> > congestion feedback; why wait on another device if the current one is not
> > congested.
>
> Similar comments apply. congestion_wait() should be called
> throttle_at_a_rate_proportional_to_the_speed_of_presently_uncongested_queues().
>
> If a process is throttled in the page allocator waiting for pages to become
> reclaimable, that process absolutely does not care whether those pages were
> previously dirty against /dev/sda or against /dev/sdb. It wants to be woken
> up for writeout completion against any queue.
OK, so you disagree with Miklos' 2nd point here:
http://lkml.org/lkml/2007/4/4/137
And in the light of clear_bdi_congested() being called for each
writeout completion under the threshold this does make sense.
So this whole 8-12/12 series is not needed and just served as a
learning experience :-/
* Re: [PATCH 02/12] mm: scalable bdi statistics counters.
2007-04-05 22:37 ` Andrew Morton
@ 2007-04-06 7:22 ` Peter Zijlstra
0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2007-04-06 7:22 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
nikita
On Thu, 2007-04-05 at 15:37 -0700, Andrew Morton wrote:
> On Thu, 05 Apr 2007 19:42:11 +0200
> root@programming.kicks-ass.net wrote:
>
> > Provide scalable per backing_dev_info statistics counters modeled on the ZVC
> > code.
> >
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> > block/ll_rw_blk.c | 1
> > drivers/block/rd.c | 2
> > drivers/char/mem.c | 2
> > fs/char_dev.c | 1
> > fs/fuse/inode.c | 1
> > fs/nfs/client.c | 1
> > include/linux/backing-dev.h | 98 +++++++++++++++++++++++++++++++++++++++++
> > mm/backing-dev.c | 103 ++++++++++++++++++++++++++++++++++++++++++++
>
> madness! Quite duplicative of vmstat.h, yet all this infrastructure
> is still only usable in one specific application.
>
> Can we please look at generalising the vmstat.h stuff?
>
> Or, the API in percpu_counter.h appears suitable to this application.
> (The comment at line 6 is a total lie).
Ok, I'll see what I can come up with.
* Re: [PATCH 12/12] mm: per BDI congestion feedback
2007-04-06 7:01 ` Peter Zijlstra
@ 2007-04-06 11:00 ` Andrew Morton
2007-04-06 11:10 ` Miklos Szeredi
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2007-04-06 11:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
nikita
On Fri, 06 Apr 2007 09:01:57 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2007-04-05 at 16:24 -0700, Andrew Morton wrote:
> > On Thu, 05 Apr 2007 19:42:21 +0200
> > root@programming.kicks-ass.net wrote:
> >
> > > Now that we have per BDI dirty throttling it makes sense to also have per BDI
> > > congestion feedback; why wait on another device if the current one is not
> > > congested.
> >
> > Similar comments apply. congestion_wait() should be called
> > throttle_at_a_rate_proportional_to_the_speed_of_presently_uncongested_queues().
> >
> > If a process is throttled in the page allocator waiting for pages to become
> > reclaimable, that process absolutely does not care whether those pages were
> > previously dirty against /dev/sda or against /dev/sdb. It wants to be woken
> > up for writeout completion against any queue.
>
> OK, so you disagree with Miklos' 2nd point here:
> http://lkml.org/lkml/2007/4/4/137
Yup, silly man thought that "congestion_wait" has something to do with
congestion ;) I think it sort-of used to, once.
Now it really means no more than "block until a batch of writes complete".
* Re: [PATCH 12/12] mm: per BDI congestion feedback
2007-04-06 11:00 ` Andrew Morton
@ 2007-04-06 11:10 ` Miklos Szeredi
0 siblings, 0 replies; 27+ messages in thread
From: Miklos Szeredi @ 2007-04-06 11:10 UTC (permalink / raw)
To: akpm
Cc: a.p.zijlstra, linux-mm, linux-kernel, neilb, dgc,
tomoki.sekiyama.qu, nikita
> > OK, so you disagree with Miklos' 2nd point here:
> > http://lkml.org/lkml/2007/4/4/137
>
> Yup, silly man thought that "congestion_wait" has something to do with
> congestion ;) I think it sort-of used to, once.
Oh well. I _usually_ do actually read the code, but this seemed so
obvious... I'll learn never to trust descriptive function names.
Miklos
* Re: [PATCH 09/12] mm: remove throttle_vm_writeback
2007-04-05 22:44 ` Andrew Morton
@ 2007-09-26 20:42 ` Peter Zijlstra
0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2007-09-26 20:42 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
nikita
On Thu, 2007-04-05 at 15:44 -0700, Andrew Morton wrote:
> On Thu, 05 Apr 2007 19:42:18 +0200
> root@programming.kicks-ass.net wrote:
>
> > rely on accurate dirty page accounting to provide enough push back
>
> I think we'd like to see a bit more justification than that, please.
It should read like this:

	for ( ; ; ) {
		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);

		/*
		 * Boost the allowable dirty threshold a bit for page
		 * allocators so they don't get DoS'ed by heavy writers
		 */
		dirty_thresh += dirty_thresh / 10;	/* wheeee... */

		if (global_page_state(NR_FILE_DIRTY) +
		    global_page_state(NR_UNSTABLE_NFS) +
		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
			break;
		congestion_wait(WRITE, HZ/10);
	}
[ note the extra NR_FILE_DIRTY ]
now, balance_dirty_pages() is there to ensure:
nr_dirty + nr_unstable + nr_writeback < dirty_thresh (1)
reclaim will (with the introduction of dirty page tracking) never
generate dirty pages, so the only disturbance of that equation is an
increase in nr_writeback.
[ pageout() sets wbc.for_reclaim=1, so NFS traffic will not generate
unstable pages ]
So, what throttle_vm_writeout() does is limit the number of added
writeback pages to 10% of the total limit.
pageout() seems to avoid stuffing pages down a congested bdi
(TODO: add details); together with the much smaller io-queues, this
means the initial purpose of this function - avoiding all memory
getting stuck in io-queues - seems to be handled.
Now the problems...
Trouble is that it currently does not take nr_dirty into account,
which in the worst case limits it to 110% of the limit.
Also, I'm seeing (2.6.23-rc8-mm1) live-locks in throttle_vm_writeout()
where nr_dirty + nr_unstable > thresh - which according to (1) should
not happen, and will not change without explicit action.
Hmm maybe the 10% is < nr_cpus * ratelimit_pages.
2 cpus, mem=128M -> ratelimit_pages ~ 512
threshold ~ 1500
so indeed: 150 < 1024.
Still not conclusive but at least getting somewhere.
Thread overview: 27+ messages (newest: 2007-09-26 20:42 UTC)
2007-04-05 17:42 [PATCH 00/12] per device dirty throttling -v3 root
2007-04-05 17:42 ` [PATCH 01/12] nfs: remove congestion_end() root
2007-04-05 17:42 ` [PATCH 02/12] mm: scalable bdi statistics counters root
2007-04-05 22:37 ` Andrew Morton
2007-04-06 7:22 ` Peter Zijlstra
2007-04-05 17:42 ` [PATCH 03/12] mm: count dirty pages per BDI root
2007-04-05 17:42 ` [PATCH 04/12] mm: count writeback " root
2007-04-05 17:42 ` [PATCH 05/12] mm: count unstable " root
2007-04-05 17:42 ` [PATCH 06/12] mm: expose BDI statistics in sysfs root
2007-04-05 17:42 ` [PATCH 07/12] mm: per device dirty threshold root
2007-04-05 17:42 ` [PATCH 08/12] mm: fixup possible deadlock root
2007-04-05 22:43 ` Andrew Morton
2007-04-05 17:42 ` [PATCH 09/12] mm: remove throttle_vm_writeback root
2007-04-05 22:44 ` Andrew Morton
2007-09-26 20:42 ` Peter Zijlstra
2007-04-05 17:42 ` [PATCH 10/12] mm: page_alloc_wait root
2007-04-05 22:57 ` Andrew Morton
2007-04-06 6:37 ` Peter Zijlstra
2007-04-05 17:42 ` [PATCH 11/12] mm: accurate pageout congestion wait root
2007-04-05 23:17 ` Andrew Morton
2007-04-06 6:51 ` Peter Zijlstra
2007-04-05 17:42 ` [PATCH 12/12] mm: per BDI congestion feedback root
2007-04-05 23:24 ` Andrew Morton
2007-04-06 7:01 ` Peter Zijlstra
2007-04-06 11:00 ` Andrew Morton
2007-04-06 11:10 ` Miklos Szeredi
2007-04-05 17:47 ` [PATCH 00/12] per device dirty throttling -v3 Peter Zijlstra