* [PATCH 0/5] Cpuset aware writeback V1
@ 2007-01-20  3:10 Christoph Lameter
  2007-01-20  3:10 ` [PATCH 1/5] Add a map to track dirty pages per node Christoph Lameter
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-20  3:10 UTC (permalink / raw)
  To: akpm
  Cc: Peter Zijlstra, Paul Menage, Nick Piggin, linux-mm,
	Christoph Lameter, Paul Jackson, Dave Chinner, Andi Kleen

Currently cpusets are not able to do proper writeback since dirty ratio
calculations and writeback are all done for the system as a whole. This
may result in a large percentage of the memory of a cpuset becoming dirty
without background writeout being triggered and without synchronous
writes occurring. Instead, writeout occurs during reclaim when memory
is tight, which may lead to dicey VM situations.

In order to fix the problem we first of all introduce a method to establish
a map of dirty nodes for each struct address_space.

Secondly, we modify the dirty limit calculation to be based on the current
state of memory on the nodes of the cpuset that the current task belongs to.

If the current task is part of a cpuset that is not allowed to allocate
from all nodes in the system then we select only inodes for writeback
that have pages on the nodes that we are allowed to allocate from.
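
Conceptually the selection reduces to a nodemask intersection test
(illustrative sketch only; the real check is added to sync_sb_inodes()
in patch 1):

	if (!cpuset_intersects_dirty_nodes(inode->i_mapping, wbc->nodes))
		continue;	/* no dirty pages on the nodes of interest */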

Tested on:
IA64 NUMA 128p, 12p

Compiles on:
i386 SMP
x86_64 UP

Changelog: RFC->V1
------------------

- Rework dirty_map logic to allocate it dynamically on larger
  NUMA systems. Move to struct address_space and address various minor issues.

- Dynamically allocate dirty maps only if an inode is dirtied.

- Clear the dirty map only when an inode is cleared (simplifies
  locking, and we need to keep the dirty state even after the dirty state of
  all pages has been cleared for NFS writeout to occur correctly).

- Drop nr_node_ids patches

- Drop the NR_UNRECLAIMABLE patch. There may be other ideas around on how
  to accomplish the same in a more elegant way.

- Drop mentioning the NFS issues since Peter is working on those.


* [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-20  3:10 [PATCH 0/5] Cpuset aware writeback V1 Christoph Lameter
@ 2007-01-20  3:10 ` Christoph Lameter
  2007-01-20  5:15   ` Paul Jackson
  2007-01-22  1:31   ` David Chinner
  2007-01-20  3:10 ` [PATCH 2/5] Add a nodemask to pdflush functions Christoph Lameter
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-20  3:10 UTC (permalink / raw)
  To: akpm
  Cc: Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Christoph Lameter, Paul Jackson, Dave Chinner, Andi Kleen

Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for applications
that are constrained to a portion of the memory of the system as required by
cpusets.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes

cpuset_update_dirty_nodes(mapping, page)	Record a node in the dirty nodes map

cpuset_init_dirty_nodes(mapping)	First time init of the map
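
A rough sketch of where these hooks sit in the page/inode life cycle
(the call sites as wired up by this patch; bodies are illustrative only):

	/* inode creation, in alloc_inode() */
	cpuset_init_dirty_nodes(mapping);

	/*
	 * page dirtying, under mapping->tree_lock, in
	 * __set_page_dirty_nobuffers() and __set_page_dirty_buffers()
	 */
	radix_tree_tag_set(&mapping->page_tree, page_index(page),
			PAGECACHE_TAG_DIRTY);
	cpuset_update_dirty_nodes(mapping, page);

	/* inode teardown, in clear_inode(); no pages left, no locking */
	cpuset_clear_dirty_nodes(mapping);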



The dirty map may be stored either directly in the mapping (for NUMA
systems with fewer than BITS_PER_LONG nodes) or separately allocated
if necessary for systems with a large number of nodes (e.g. IA64 with
1024 nodes).

Updating the dirty map may involve allocating it first for large
configurations. Therefore we protect the allocation and setting
of a node in the map through the tree_lock. The tree_lock is
already taken when a page is dirtied so there is no additional
locking overhead.

The dirty map is only cleared (or freed) when the inode is cleared.
At that point no dirty pages exist anymore and therefore it can
be done without any locking. The dirty map records all nodes that
have been used for dirty pages by that inode until the inode is
no longer used.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.20-rc5/fs/fs-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/fs-writeback.c	2007-01-18 13:48:29.899625484 -0600
+++ linux-2.6.20-rc5/fs/fs-writeback.c	2007-01-19 18:40:27.421969825 -0600
@@ -22,6 +22,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 /**
@@ -349,6 +350,12 @@ sync_sb_inodes(struct super_block *sb, s
 			continue;		/* blockdev has wrong queue */
 		}
 
+		if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
+			/* No pages on the nodes under writeback */
+			list_move(&inode->i_list, &sb->s_dirty);
+			continue;
+		}
+
 		/* Was this inode dirtied after sync_sb_inodes was called? */
 		if (time_after(inode->dirtied_when, start))
 			break;
Index: linux-2.6.20-rc5/fs/inode.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/inode.c	2007-01-18 13:48:29.908415315 -0600
+++ linux-2.6.20-rc5/fs/inode.c	2007-01-19 18:40:02.611062349 -0600
@@ -22,6 +22,7 @@
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
 #include <linux/mount.h>
+#include <linux/cpuset.h>
 
 /*
  * This is needed for the following functions:
@@ -148,6 +149,7 @@ static struct inode *alloc_inode(struct 
 		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
+		cpuset_init_dirty_nodes(mapping);
 
 		/*
 		 * If the block_device provides a backing_dev_info for client
@@ -257,6 +259,7 @@ void clear_inode(struct inode *inode)
 		bd_forget(inode);
 	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
 		cd_forget(inode);
+	cpuset_clear_dirty_nodes(inode->i_mapping);
 	inode->i_state = I_CLEAR;
 }
 
Index: linux-2.6.20-rc5/include/linux/fs.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/fs.h	2007-01-18 13:48:29.926971624 -0600
+++ linux-2.6.20-rc5/include/linux/fs.h	2007-01-19 12:42:11.572375552 -0600
@@ -447,6 +447,13 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+#ifdef CONFIG_CPUSETS
+#if MAX_NUMNODES <= BITS_PER_LONG
+	nodemask_t		dirty_nodes;	/* Map of nodes with dirty pages */
+#else
+	nodemask_t		*dirty_nodes;	/* pointer to map if dirty */
+#endif
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
Index: linux-2.6.20-rc5/mm/page-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-18 13:48:29.956271059 -0600
+++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-19 19:45:08.755650133 -0600
@@ -33,6 +33,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/cpuset.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -776,6 +777,7 @@ int __set_page_dirty_nobuffers(struct pa
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
+		cpuset_update_dirty_nodes(mapping, page);
 		write_unlock_irq(&mapping->tree_lock);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
@@ -940,10 +942,12 @@ int test_set_page_writeback(struct page 
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		if (!PageDirty(page))
-			radix_tree_tag_clear(&mapping->page_tree,
+		if (!PageDirty(page)) {
+			if (radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
-						PAGECACHE_TAG_DIRTY);
+						PAGECACHE_TAG_DIRTY))
+				cpuset_clear_dirty_nodes(mapping);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
Index: linux-2.6.20-rc5/fs/buffer.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/buffer.c	2007-01-18 13:48:29.918181793 -0600
+++ linux-2.6.20-rc5/fs/buffer.c	2007-01-19 19:45:08.784949542 -0600
@@ -42,6 +42,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/cpuset.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static void invalidate_bh_lrus(void);
@@ -736,6 +737,7 @@ int __set_page_dirty_buffers(struct page
 		}
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
+		cpuset_update_dirty_nodes(mapping, page);
 	}
 	write_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
Index: linux-2.6.20-rc5/include/linux/cpuset.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/cpuset.h	2007-01-18 13:48:29.935761454 -0600
+++ linux-2.6.20-rc5/include/linux/cpuset.h	2007-01-19 19:44:58.201026705 -0600
@@ -75,6 +75,44 @@ static inline int cpuset_do_slab_mem_spr
 
 extern void cpuset_track_online_nodes(void);
 
+/*
+ * We need macros since struct address_space is not defined yet
+ */
+#if MAX_NUMNODES <= BITS_PER_LONG
+#define cpuset_update_dirty_nodes(__mapping, __page) \
+	do { \
+		int __node = page_to_nid(__page); \
+		if (!node_isset(__node, (__mapping)->dirty_nodes)) \
+			node_set(__node, (__mapping)->dirty_nodes); \
+	} while (0)
+
+#define cpuset_clear_dirty_nodes(__mapping) \
+		(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_init_dirty_nodes(__mapping) \
+		(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr) \
+		(!(__nodemask_ptr) || nodes_intersects((__mapping)->dirty_nodes, \
+		*(__nodemask_ptr)))
+
+#define cpuset_dirty_node_set(__mapping, __node) \
+		node_isset((__node), (__mapping)->dirty_nodes)
+
+#else
+
+#define cpuset_init_dirty_nodes(__mapping) \
+	(__mapping)->dirty_nodes = NULL
+
+struct address_space;
+
+extern void cpuset_update_dirty_nodes(struct address_space *a,
+					struct page *p);
+
+extern void cpuset_clear_dirty_nodes(struct address_space *a);
+
+extern int cpuset_intersects_dirty_nodes(struct address_space *a,
+					nodemask_t *mask);
+
+#endif
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -146,6 +184,26 @@ static inline int cpuset_do_slab_mem_spr
 
 static inline void cpuset_track_online_nodes(void) {}
 
+struct address_space;
+
+static inline void cpuset_update_dirty_nodes(struct address_space *a,
+					struct page *p) {}
+
+static inline void cpuset_clear_dirty_nodes(struct address_space *a) {}
+
+static inline void cpuset_init_dirty_nodes(struct address_space *a) {}
+
+static inline int cpuset_dirty_node_set(struct address_space *a, int node)
+{
+	return 1;
+}
+
+static inline int cpuset_intersects_dirty_nodes(struct address_space *a,
+		nodemask_t *n)
+{
+	return 1;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
Index: linux-2.6.20-rc5/kernel/cpuset.c
===================================================================
--- linux-2.6.20-rc5.orig/kernel/cpuset.c	2007-01-18 13:48:29.967990834 -0600
+++ linux-2.6.20-rc5/kernel/cpuset.c	2007-01-19 19:45:37.552086499 -0600
@@ -2530,6 +2530,63 @@ int cpuset_mem_spread_node(void)
 }
 EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 
+#if MAX_NUMNODES > BITS_PER_LONG
+
+/*
+ * Special functions for NUMA systems with a large number of nodes.
+ * The nodemask is pointed to from the address space. Modifications
+ * to the dirty_nodes pointer are protected by the tree_lock.
+ */
+void cpuset_update_dirty_nodes(struct address_space *mapping,
+			struct page *page)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+	int node = page_to_nid(page);
+
+	if (!nodes) {
+		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
+		if (!nodes)
+			return;
+
+		*nodes = NODE_MASK_NONE;
+		mapping->dirty_nodes = nodes;
+	}
+
+	if (!node_isset(node, *nodes))
+		node_set(node, *nodes);
+}
+
+void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+
+	if (nodes) {
+		mapping->dirty_nodes = NULL;
+		kfree(nodes);
+	}
+}
+
+/*
+ * Called without the tree_lock! So we may on rare occasions (when we race with
+ * cpuset_clear_dirty_nodes()) follow the dirty_nodes pointer to random data.
+ * However, the potential false positive may only cause a needless writeout
+ * of an inode whose pages are not in the intended cpuset.
+ */
+int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+			nodemask_t *mask)
+{
+	nodemask_t *dirty_nodes = mapping->dirty_nodes;
+
+	if (!mask)
+		return 1;
+
+	if (!dirty_nodes)
+		return 0;
+
+	return nodes_intersects(*dirty_nodes, *mask);
+}
+#endif
+
 /**
  * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
  * @p: pointer to task_struct of some other task.
Index: linux-2.6.20-rc5/include/linux/writeback.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/writeback.h	2007-01-18 13:48:29.946504581 -0600
+++ linux-2.6.20-rc5/include/linux/writeback.h	2007-01-19 19:45:08.746860311 -0600
@@ -59,11 +59,12 @@ struct writeback_control {
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned for_writepages:1;	/* This is a writepages() call */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
+	nodemask_t *nodes;		/* Set of nodes of interest */
 };
 
 /*
  * fs/fs-writeback.c
- */	
+ */
 void writeback_inodes(struct writeback_control *wbc);
 void wake_up_inode(struct inode *inode);
 int inode_wait(void *);


* [PATCH 2/5] Add a nodemask to pdflush functions
  2007-01-20  3:10 [PATCH 0/5] Cpuset aware writeback V1 Christoph Lameter
  2007-01-20  3:10 ` [PATCH 1/5] Add a map to track dirty pages per node Christoph Lameter
@ 2007-01-20  3:10 ` Christoph Lameter
  2007-01-20  3:10 ` [PATCH 3/5] Per cpuset dirty ratio calculation Christoph Lameter
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-20  3:10 UTC (permalink / raw)
  To: akpm
  Cc: Peter Zijlstra, Paul Menage, Nick Piggin, linux-mm,
	Christoph Lameter, Paul Jackson, Dave Chinner, Andi Kleen

If we want to support nodeset-specific writeout then we need a way
to communicate the set of nodes that an operation should affect.

So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.
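
A caller that wants node-targeted writeout then passes the nodemask of
its cpuset, as patch 4 does from direct reclaim:

	wakeup_pdflush(nr_pages, &cpuset_current_mems_allowed);

Existing callers simply pass NULL to keep the current global behavior.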

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.20-rc5/include/linux/writeback.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/writeback.h	2007-01-18 13:48:35.514373979 -0600
+++ linux-2.6.20-rc5/include/linux/writeback.h	2007-01-18 13:48:37.326055740 -0600
@@ -82,7 +82,7 @@ static inline void wait_on_inode(struct 
 /*
  * mm/page-writeback.c
  */
-int wakeup_pdflush(long nr_pages);
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(void);
@@ -110,7 +110,8 @@ balance_dirty_pages_ratelimited(struct a
 	balance_dirty_pages_ratelimited_nr(mapping, 1);
 }
 
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes),
+		unsigned long arg0, nodemask_t *nodes);
 extern int generic_writepages(struct address_space *mapping,
 			      struct writeback_control *wbc);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
Index: linux-2.6.20-rc5/mm/page-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-18 13:48:35.448938573 -0600
+++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-18 13:48:37.342658753 -0600
@@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode);
 /* End of sysctl-exported parameters */
 
 
-static void background_writeout(unsigned long _min_pages);
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
 
 /*
  * Work out the current dirty-memory clamping and background writeout
@@ -244,7 +244,7 @@ static void balance_dirty_pages(struct a
 	 */
 	if ((laptop_mode && pages_written) ||
 	     (!laptop_mode && (nr_reclaimable > background_thresh)))
-		pdflush_operation(background_writeout, 0);
+		pdflush_operation(background_writeout, 0, NULL);
 }
 
 void set_page_dirty_balance(struct page *page)
@@ -325,7 +325,7 @@ void throttle_vm_writeout(void)
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
-static void background_writeout(unsigned long _min_pages)
+static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
 {
 	long min_pages = _min_pages;
 	struct writeback_control wbc = {
@@ -365,12 +365,12 @@ static void background_writeout(unsigned
  * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
  * -1 if all pdflush threads were busy.
  */
-int wakeup_pdflush(long nr_pages)
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes)
 {
 	if (nr_pages == 0)
 		nr_pages = global_page_state(NR_FILE_DIRTY) +
 				global_page_state(NR_UNSTABLE_NFS);
-	return pdflush_operation(background_writeout, nr_pages);
+	return pdflush_operation(background_writeout, nr_pages, nodes);
 }
 
 static void wb_timer_fn(unsigned long unused);
@@ -394,7 +394,7 @@ static DEFINE_TIMER(laptop_mode_wb_timer
  * older_than_this takes precedence over nr_to_write.  So we'll only write back
  * all dirty pages if they are all attached to "old" mappings.
  */
-static void wb_kupdate(unsigned long arg)
+static void wb_kupdate(unsigned long arg, nodemask_t *unused)
 {
 	unsigned long oldest_jif;
 	unsigned long start_jif;
@@ -454,18 +454,18 @@ int dirty_writeback_centisecs_handler(ct
 
 static void wb_timer_fn(unsigned long unused)
 {
-	if (pdflush_operation(wb_kupdate, 0) < 0)
+	if (pdflush_operation(wb_kupdate, 0, NULL) < 0)
 		mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
 }
 
-static void laptop_flush(unsigned long unused)
+static void laptop_flush(unsigned long unused, nodemask_t *unused2)
 {
 	sys_sync();
 }
 
 static void laptop_timer_fn(unsigned long unused)
 {
-	pdflush_operation(laptop_flush, 0);
+	pdflush_operation(laptop_flush, 0, NULL);
 }
 
 /*
Index: linux-2.6.20-rc5/mm/pdflush.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/pdflush.c	2007-01-17 22:06:10.073655594 -0600
+++ linux-2.6.20-rc5/mm/pdflush.c	2007-01-18 13:48:37.353401880 -0600
@@ -83,10 +83,12 @@ static unsigned long last_empty_jifs;
  */
 struct pdflush_work {
 	struct task_struct *who;	/* The thread */
-	void (*fn)(unsigned long);	/* A callback function */
+	void (*fn)(unsigned long, nodemask_t *); /* A callback function */
 	unsigned long arg0;		/* An argument to the callback */
 	struct list_head list;		/* On pdflush_list, when idle */
 	unsigned long when_i_went_to_sleep;
+	int have_nodes;			/* Nodes were specified */
+	nodemask_t nodes;		/* Nodes of interest */
 };
 
 static int __pdflush(struct pdflush_work *my_work)
@@ -123,7 +125,8 @@ static int __pdflush(struct pdflush_work
 		}
 		spin_unlock_irq(&pdflush_lock);
 
-		(*my_work->fn)(my_work->arg0);
+		(*my_work->fn)(my_work->arg0,
+			my_work->have_nodes ? &my_work->nodes : NULL);
 
 		/*
 		 * Thread creation: For how long have there been zero
@@ -197,7 +200,8 @@ static int pdflush(void *dummy)
  * Returns zero if it indeed managed to find a worker thread, and passed your
  * payload to it.
  */
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *),
+			unsigned long arg0, nodemask_t *nodes)
 {
 	unsigned long flags;
 	int ret = 0;
@@ -217,6 +221,11 @@ int pdflush_operation(void (*fn)(unsigne
 			last_empty_jifs = jiffies;
 		pdf->fn = fn;
 		pdf->arg0 = arg0;
+		if (nodes) {
+			pdf->nodes = *nodes;
+			pdf->have_nodes = 1;
+		} else
+			pdf->have_nodes = 0;
 		wake_up_process(pdf->who);
 		spin_unlock_irqrestore(&pdflush_lock, flags);
 	}
Index: linux-2.6.20-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/vmscan.c	2007-01-17 22:06:10.093188092 -0600
+++ linux-2.6.20-rc5/mm/vmscan.c	2007-01-18 13:48:37.366098302 -0600
@@ -1065,7 +1065,7 @@ unsigned long try_to_free_pages(struct z
 		 */
 		if (total_scanned > sc.swap_cluster_max +
 					sc.swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
 			sc.may_writepage = 1;
 		}
 
Index: linux-2.6.20-rc5/fs/buffer.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/buffer.c	2007-01-18 13:48:35.467494882 -0600
+++ linux-2.6.20-rc5/fs/buffer.c	2007-01-18 13:48:37.394421089 -0600
@@ -357,7 +357,7 @@ static void free_more_memory(void)
 	struct zone **zones;
 	pg_data_t *pgdat;
 
-	wakeup_pdflush(1024);
+	wakeup_pdflush(1024, NULL);
 	yield();
 
 	for_each_online_pgdat(pgdat) {
Index: linux-2.6.20-rc5/fs/super.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/super.c	2007-01-17 22:06:10.133229713 -0600
+++ linux-2.6.20-rc5/fs/super.c	2007-01-18 13:48:37.408094159 -0600
@@ -618,7 +618,7 @@ int do_remount_sb(struct super_block *sb
 	return 0;
 }
 
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
 {
 	struct super_block *sb;
 
@@ -646,7 +646,7 @@ static void do_emergency_remount(unsigne
 
 void emergency_remount(void)
 {
-	pdflush_operation(do_emergency_remount, 0);
+	pdflush_operation(do_emergency_remount, 0, NULL);
 }
 
 /*
Index: linux-2.6.20-rc5/fs/sync.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/sync.c	2007-01-17 22:06:10.165458335 -0600
+++ linux-2.6.20-rc5/fs/sync.c	2007-01-18 13:48:37.417860638 -0600
@@ -21,9 +21,9 @@
  * sync everything.  Start out by waking pdflush, because that writes back
  * all queues in parallel.
  */
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
 {
-	wakeup_pdflush(0);
+	wakeup_pdflush(0, NULL);
 	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
 	DQUOT_SYNC(NULL);
 	sync_supers();		/* Write the superblocks */
@@ -38,13 +38,13 @@ static void do_sync(unsigned long wait)
 
 asmlinkage long sys_sync(void)
 {
-	do_sync(1);
+	do_sync(1, NULL);
 	return 0;
 }
 
 void emergency_sync(void)
 {
-	pdflush_operation(do_sync, 0);
+	pdflush_operation(do_sync, 0, NULL);
 }
 
 /*


* [PATCH 3/5] Per cpuset dirty ratio calculation
  2007-01-20  3:10 [PATCH 0/5] Cpuset aware writeback V1 Christoph Lameter
  2007-01-20  3:10 ` [PATCH 1/5] Add a map to track dirty pages per node Christoph Lameter
  2007-01-20  3:10 ` [PATCH 2/5] Add a nodemask to pdflush functions Christoph Lameter
@ 2007-01-20  3:10 ` Christoph Lameter
  2007-01-20  3:10 ` [PATCH 4/5] Cpuset aware writeback during reclaim Christoph Lameter
  2007-01-20  3:10 ` [PATCH 5/5] Throttle vm writeout per cpuset Christoph Lameter
  4 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-20  3:10 UTC (permalink / raw)
  To: akpm
  Cc: Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Christoph Lameter, Paul Jackson, Dave Chinner, Andi Kleen

Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset.

If, e.g., a cpuset contains only 1/10th of available memory, then all of the
memory of that cpuset can be dirtied without any writes being triggered.
If all of the cpuset's memory is dirty then only 10% of total memory is dirty.
The background writeback threshold is usually set at 10% and the synchronous
threshold at 40%. So we are still below the global limits while the dirty
ratio in the cpuset is 100%! Writeback throttling and background writeout
do not work at all in such scenarios.
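
To put hypothetical numbers on it: on a machine with 100GB of memory and
a 10GB cpuset that is completely dirty, global dirty memory is 10GB, or
10% of total, just at the background threshold and far below the 40%
throttling threshold, even though the cpuset-relative dirty ratio is
already 100%.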

This patch makes dirty writeout cpuset aware. When determining the
dirty limits in get_dirty_limits() we calculate values based on the
nodes that are reachable from the current process (that has been
dirtying the page). Then we can trigger writeout based on the
dirty ratio of the memory in the cpuset.

We trigger writeout in a cpuset-specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have
a cpuset. However, please remember that the global counters were only
introduced recently. Before 2.6.18 we added up per-processor
counters for each processor on each invocation of get_dirty_limits().
We now add up per-node information, which I think is equal or less effort
since there are fewer nodes than processors.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.20-rc5/mm/page-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-18 13:48:37.000000000 -0600
+++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-18 13:48:50.838955335 -0600
@@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode);
 
 static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
 
+struct dirty_limits {
+	long thresh_background;
+	long thresh_dirty;
+	unsigned long nr_dirty;
+	unsigned long nr_unstable;
+	unsigned long nr_writeback;
+};
+
 /*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
@@ -120,31 +128,74 @@ static void background_writeout(unsigned
  * We make sure that the background writeout level is below the adjusted
  * clamping level.
  */
-static void
-get_dirty_limits(long *pbackground, long *pdirty,
-					struct address_space *mapping)
+static int
+get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping,
+		nodemask_t *nodes)
 {
 	int background_ratio;		/* Percentages */
 	int dirty_ratio;
 	int unmapped_ratio;
 	long background;
 	long dirty;
-	unsigned long available_memory = vm_total_pages;
+	unsigned long available_memory;
+	unsigned long high_memory;
+	unsigned long nr_mapped;
 	struct task_struct *tsk;
+	int is_subset = 0;
 
+#ifdef CONFIG_CPUSETS
+	/*
+	 * Calculate the limits relative to the current cpuset if necessary.
+	 */
+	if (unlikely(nodes &&
+			!nodes_subset(node_online_map, *nodes))) {
+		int node;
+
+		is_subset = 1;
+		memset(dl, 0, sizeof(struct dirty_limits));
+		available_memory = 0;
+		high_memory = 0;
+		nr_mapped = 0;
+		for_each_node_mask(node, *nodes) {
+			if (!node_online(node))
+				continue;
+			dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY);
+			dl->nr_unstable +=
+				node_page_state(node, NR_UNSTABLE_NFS);
+			dl->nr_writeback +=
+				node_page_state(node, NR_WRITEBACK);
+			available_memory +=
+				NODE_DATA(node)->node_present_pages;
+#ifdef CONFIG_HIGHMEM
+			high_memory += NODE_DATA(node)
+				->node_zones[ZONE_HIGHMEM].present_pages;
+#endif
+			nr_mapped += node_page_state(node, NR_FILE_MAPPED) +
+					node_page_state(node, NR_ANON_PAGES);
+		}
+	} else
+#endif
+	{
+		/* Global limits */
+		dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
+		dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+		dl->nr_writeback = global_page_state(NR_WRITEBACK);
+		available_memory = vm_total_pages;
+		high_memory = totalhigh_pages;
+		nr_mapped = global_page_state(NR_FILE_MAPPED) +
+				global_page_state(NR_ANON_PAGES);
+	}
 #ifdef CONFIG_HIGHMEM
 	/*
 	 * If this mapping can only allocate from low memory,
 	 * we exclude high memory from our count.
 	 */
 	if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM))
-		available_memory -= totalhigh_pages;
+		available_memory -= high_memory;
 #endif
 
 
-	unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
-				global_page_state(NR_ANON_PAGES)) * 100) /
-					vm_total_pages;
+	unmapped_ratio = 100 - (nr_mapped * 100) / available_memory;
 
 	dirty_ratio = vm_dirty_ratio;
 	if (dirty_ratio > unmapped_ratio / 2)
@@ -164,8 +215,9 @@ get_dirty_limits(long *pbackground, long
 		background += background / 4;
 		dirty += dirty / 4;
 	}
-	*pbackground = background;
-	*pdirty = dirty;
+	dl->thresh_background = background;
+	dl->thresh_dirty = dirty;
+	return is_subset;
 }
 
 /*
@@ -178,8 +230,7 @@ get_dirty_limits(long *pbackground, long
 static void balance_dirty_pages(struct address_space *mapping)
 {
 	long nr_reclaimable;
-	long background_thresh;
-	long dirty_thresh;
+	struct dirty_limits dl;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -194,11 +245,12 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+		if (get_dirty_limits(&dl, mapping,
+				&cpuset_current_mems_allowed))
+			wbc.nodes = &cpuset_current_mems_allowed;
+		nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+		if (nr_reclaimable + dl.nr_writeback <=
+			dl.thresh_dirty)
 				break;
 
 		if (!dirty_exceeded)
@@ -212,13 +264,10 @@ static void balance_dirty_pages(struct a
 		 */
 		if (nr_reclaimable) {
 			writeback_inodes(&wbc);
-			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
-			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-			if (nr_reclaimable +
-				global_page_state(NR_WRITEBACK)
-					<= dirty_thresh)
+			get_dirty_limits(&dl, mapping,
+				&cpuset_current_mems_allowed);
+			nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+			if (nr_reclaimable + dl.nr_writeback <= dl.thresh_dirty)
 						break;
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
@@ -227,8 +276,8 @@ static void balance_dirty_pages(struct a
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (nr_reclaimable + global_page_state(NR_WRITEBACK)
-		<= dirty_thresh && dirty_exceeded)
+	if (nr_reclaimable + dl.nr_writeback
+		<= dl.thresh_dirty && dirty_exceeded)
 			dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
@@ -243,8 +292,9 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	     (!laptop_mode && (nr_reclaimable > background_thresh)))
-		pdflush_operation(background_writeout, 0, NULL);
+	     (!laptop_mode && (nr_reclaimable > dl.thresh_background)))
+		pdflush_operation(background_writeout, 0,
+			&cpuset_current_mems_allowed);
 }
 
 void set_page_dirty_balance(struct page *page)
@@ -301,21 +351,19 @@ EXPORT_SYMBOL(balance_dirty_pages_rateli
 
 void throttle_vm_writeout(void)
 {
-	long background_thresh;
-	long dirty_thresh;
+	struct dirty_limits dl;
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&dl, NULL, &node_online_map);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
                  * allocators so they don't get DoS'ed by heavy writers
                  */
-                dirty_thresh += dirty_thresh / 10;      /* wheeee... */
+                dl.thresh_dirty += dl.thresh_dirty / 10; /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
+                if (dl.nr_unstable + dl.nr_writeback <= dl.thresh_dirty)
+                       break;
                 congestion_wait(WRITE, HZ/10);
         }
 }
@@ -325,7 +373,7 @@ void throttle_vm_writeout(void)
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
-static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes)
 {
 	long min_pages = _min_pages;
 	struct writeback_control wbc = {
@@ -338,12 +386,11 @@ static void background_writeout(unsigned
 	};
 
 	for ( ; ; ) {
-		long background_thresh;
-		long dirty_thresh;
+		struct dirty_limits dl;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
-		if (global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) < background_thresh
+		if (get_dirty_limits(&dl, NULL, nodes))
+			wbc.nodes = nodes;
+		if (dl.nr_dirty + dl.nr_unstable < dl.thresh_background
 				&& min_pages <= 0)
 			break;
 		wbc.encountered_congestion = 0;


* [PATCH 4/5] Cpuset aware writeback during reclaim
  2007-01-20  3:10 [PATCH 0/5] Cpuset aware writeback V1 Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-01-20  3:10 ` [PATCH 3/5] Per cpuset dirty ratio calculation Christoph Lameter
@ 2007-01-20  3:10 ` Christoph Lameter
  2007-01-20  3:10 ` [PATCH 5/5] Throttle vm writeout per cpuset Christoph Lameter
  4 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-20  3:10 UTC (permalink / raw)
  To: akpm
  Cc: Peter Zijlstra, Paul Menage, Nick Piggin, linux-mm,
	Christoph Lameter, Paul Jackson, Dave Chinner, Andi Kleen

Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist and are carefully
checking each zone if its a member of the active cpuset. But then we call
pdflush without enforcing the same restrictions. In a larger system this
may have the effect of a massive amount of pages being dirtied and then either

A. No writeout occurs because global dirty limits have not been reached

or

B. Writeout starts randomly for some dirty inode in the system. Pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.20-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/vmscan.c	2007-01-15 21:34:43.173887398 -0600
+++ linux-2.6.20-rc5/mm/vmscan.c	2007-01-15 21:37:26.605346439 -0600
@@ -1065,7 +1065,8 @@ unsigned long try_to_free_pages(struct z
 		 */
 		if (total_scanned > sc.swap_cluster_max +
 					sc.swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				&cpuset_current_mems_allowed);
 			sc.may_writepage = 1;
 		}
 


* [PATCH 5/5] Throttle vm writeout per cpuset
  2007-01-20  3:10 [PATCH 0/5] Cpuset aware writeback V1 Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-01-20  3:10 ` [PATCH 4/5] Cpuset aware writeback during reclaim Christoph Lameter
@ 2007-01-20  3:10 ` Christoph Lameter
  4 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-20  3:10 UTC (permalink / raw)
  To: akpm
  Cc: Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Christoph Lameter, Paul Jackson, Dave Chinner, Andi Kleen

Throttle VM writeout in a cpuset aware way

This bases the VM throttling from the reclaim path on the dirty ratio
of the cpuset. Note that the cpuset context is only effective if shrink_zone
is called from direct reclaim.

kswapd has a cpuset context that includes the whole machine, so VM throttling
will only work during synchronous reclaim and not from kswapd.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.20-rc5/include/linux/writeback.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/writeback.h	2007-01-15 21:37:05.209897874 -0600
+++ linux-2.6.20-rc5/include/linux/writeback.h	2007-01-15 21:37:33.283671963 -0600
@@ -85,7 +85,7 @@ static inline void wait_on_inode(struct 
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(void);
+void throttle_vm_writeout(nodemask_t *);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
Index: linux-2.6.20-rc5/mm/page-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-15 21:35:28.013794159 -0600
+++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-15 21:37:33.302228293 -0600
@@ -349,12 +349,12 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(void)
+void throttle_vm_writeout(nodemask_t *nodes)
 {
 	struct dirty_limits dl;
 
         for ( ; ; ) {
-		get_dirty_limits(&dl, NULL, &node_online_map);
+		get_dirty_limits(&dl, NULL, nodes);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
Index: linux-2.6.20-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/vmscan.c	2007-01-15 21:37:26.605346439 -0600
+++ linux-2.6.20-rc5/mm/vmscan.c	2007-01-15 21:37:33.316878027 -0600
@@ -949,7 +949,7 @@ static unsigned long shrink_zone(int pri
 		}
 	}
 
-	throttle_vm_writeout();
+	throttle_vm_writeout(&cpuset_current_mems_allowed);
 
 	atomic_dec(&zone->reclaim_in_progress);
 	return nr_reclaimed;


* Re: [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-20  3:10 ` [PATCH 1/5] Add a map to track dirty pages per node Christoph Lameter
@ 2007-01-20  5:15   ` Paul Jackson
  2007-01-22 17:41     ` Christoph Lameter
  2007-01-22  1:31   ` David Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Paul Jackson @ 2007-01-20  5:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, menage, a.p.zijlstra, nickpiggin, linux-mm, dgc, ak

Christoph wrote:
> + * Called without the tree_lock! So we may on rare occasions (when we race with
> + * cpuset_clear_dirty_nodes()) follow the dirty_nodes pointer to random data.

Random is ok, on rare occasion, as you note.

But is there any chance you could follow it to a non-existent memory location
and oops?  These long nodemasks are kmalloc/kfree'd, and I thought that once
kfree'd, there was no guarantee that the stale address would even point to
a mapped page of RAM.  This situation reminds me of the one that led to adding
some RCU dependent code to kernel/cpuset.c.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-20  3:10 ` [PATCH 1/5] Add a map to track dirty pages per node Christoph Lameter
  2007-01-20  5:15   ` Paul Jackson
@ 2007-01-22  1:31   ` David Chinner
  2007-01-22 19:30     ` Christoph Lameter
  1 sibling, 1 reply; 15+ messages in thread
From: David Chinner @ 2007-01-22  1:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Paul Jackson, Dave Chinner, Andi Kleen

On Fri, Jan 19, 2007 at 07:10:12PM -0800, Christoph Lameter wrote:
> Index: linux-2.6.20-rc5/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.20-rc5.orig/fs/fs-writeback.c	2007-01-18 13:48:29.899625484 -0600
> +++ linux-2.6.20-rc5/fs/fs-writeback.c	2007-01-19 18:40:27.421969825 -0600
> @@ -22,6 +22,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/backing-dev.h>
>  #include <linux/buffer_head.h>
> +#include <linux/cpuset.h>
>  #include "internal.h"
>  
>  /**
> @@ -349,6 +350,12 @@ sync_sb_inodes(struct super_block *sb, s
>  			continue;		/* blockdev has wrong queue */
>  		}
>  
> +		if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
> +			/* No pages on the nodes under writeback */
> +			list_move(&inode->i_list, &sb->s_dirty);
> +			continue;
> +		}

This breaks aging of dirty inodes, right? The s_dirty list is a time
ordered list, so this will mean that the inode doesn't get written out
by a background sync for (potentially) another 30s. IOWs, we can
delay writeback of inodes with data on other nodes by running
a single node out of memory. The normal background pdflush won't
help us either as it will see that there is an existing pdflush
working on the bdi and skip it....

> Index: linux-2.6.20-rc5/fs/inode.c
> ===================================================================
> --- linux-2.6.20-rc5.orig/fs/inode.c	2007-01-18 13:48:29.908415315 -0600
> +++ linux-2.6.20-rc5/fs/inode.c	2007-01-19 18:40:02.611062349 -0600
> @@ -22,6 +22,7 @@
>  #include <linux/bootmem.h>
>  #include <linux/inotify.h>
>  #include <linux/mount.h>
> +#include <linux/cpuset.h>
>  
>  /*
>   * This is needed for the following functions:
> @@ -148,6 +149,7 @@ static struct inode *alloc_inode(struct 
>  		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
>  		mapping->assoc_mapping = NULL;
>  		mapping->backing_dev_info = &default_backing_dev_info;
> +		cpuset_init_dirty_nodes(mapping);
>  
>  		/*
>  		 * If the block_device provides a backing_dev_info for client
> @@ -257,6 +259,7 @@ void clear_inode(struct inode *inode)
>  		bd_forget(inode);
>  	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
>  		cd_forget(inode);
> +	cpuset_clear_dirty_nodes(inode->i_mapping);
>  	inode->i_state = I_CLEAR;
>  }

This is rather late to be clearing this, right? At the start of clear_inode()
we:

	BUG_ON(inode->i_data.nrpages);

Which tends to indicate that we should have already freed the dirty
map, as there should be no pages (dirty or otherwise) attached to the
inode at this point. i.e. we should BUG here if we've still got
a dirty mask indicating dirty nodes on the inode, because it should
be clear at this point.
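
(Illustratively, and not proposed code, something like

	BUG_ON(!nodes_empty(mapping->dirty_nodes));

next to the nrpages assertion, for the embedded-nodemask case.)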

> ===================================================================
> --- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-18 13:48:29.956271059 -0600
> +++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-19 19:45:08.755650133 -0600
> @@ -33,6 +33,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/buffer_head.h>
>  #include <linux/pagevec.h>
> +#include <linux/cpuset.h>
>  
>  /*
>   * The maximum number of pages to writeout in a single bdflush/kupdate
> @@ -776,6 +777,7 @@ int __set_page_dirty_nobuffers(struct pa
>  			radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
>  		}
> +		cpuset_update_dirty_nodes(mapping, page);

Shouldn't this be done in the same context as setting the
PAGECACHE_TAG_DIRTY? i.e. we set the node dirty at the same time
we set the page dirty tag.

>  		write_unlock_irq(&mapping->tree_lock);
>  		if (mapping->host) {
>  			/* !PageAnon && !swapper_space */
> @@ -940,10 +942,12 @@ int test_set_page_writeback(struct page 
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
> -		if (!PageDirty(page))
> -			radix_tree_tag_clear(&mapping->page_tree,
> +		if (!PageDirty(page)) {
> +			if (radix_tree_tag_clear(&mapping->page_tree,
>  						page_index(page),
> -						PAGECACHE_TAG_DIRTY);
> +						PAGECACHE_TAG_DIRTY))
> +				cpuset_clear_dirty_nodes(mapping);
> +		}

Because you are clearing the dirty node state at the same time we clear
the PAGECACHE_TAG_DIRTY.

> +#if MAX_NUMNODES <= BITS_PER_LONG
> +#define cpuset_update_dirty_nodes(__mapping, __page) \
> +	do { \
> +		int __node = page_to_nid(__page); \
> +		if (!node_isset(__node, (__mapping)->dirty_nodes)) \
> +			node_set(__node, (__mapping)->dirty_nodes); \
> +	} while (0)
> +
> +#define cpuset_clear_dirty_nodes(__mapping) \
> +		(__mapping)->dirty_nodes = NODE_MASK_NONE

Hmmm - the above is going to lose dirty state - you're calling
cpuset_clear_dirty_nodes() in the case that a page is now under
writeback. cpuset_clear_dirty_nodes() clears the _entire_ dirty node mask
but all you want to do above is remove the dirty state from the
node mask if that is the only page on the node that is dirty.

So we set the dirty node mask on a page by page basis, but we shoot
it down as soon as _any_ page transitions from dirty to writeback.
Hence if you've got dirty pages on other nodes (or other dirty pages
on this node) you have now lost track of them because cleaning a
single page clears all dirty node state on the inode. This seems
badly broken to me.

Because you are not tracking pages-per-node dirty state, the only way
you can really clear the dirty node state is when the inode is
completely clean. e.g. in __sync_single_inode where (inode->i_state
& I_DIRTY) == 0. Otherwise I can't see how this would work at all....
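
(i.e., a rough sketch of that alternative, purely illustrative:

	if (!(inode->i_state & I_DIRTY))
		cpuset_clear_dirty_nodes(mapping);

placed in __sync_single_inode() once writeback of the inode has
completed.)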

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-20  5:15   ` Paul Jackson
@ 2007-01-22 17:41     ` Christoph Lameter
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-22 17:41 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, a.p.zijlstra, nickpiggin, linux-mm, dgc, ak

On Fri, 19 Jan 2007, Paul Jackson wrote:

> Christoph wrote:
> > + * Called without the tree_lock! So we may on rare occasions (when we race with
> > + * cpuset_clear_dirty_nodes()) follow the dirty_nodes pointer to random data.
> 
> Random is ok, on rare occasion, as you note.
> 
> But is there any chance you could follow it to a non-existent memory location
> and oops?  These long nodemasks are kmalloc/kfree'd, and I thought that once
> kfree'd, there was no guarantee that the stale address would even point to
> a mapped page of RAM.  This situation reminds me of the one that led to adding
> some RCU dependent code to kernel/cpuset.c.

This could become an issue if we implement memory unplug; then RCU
locking could help. But right now that situation is only possible with
memory mapped via page tables (vmalloc or user space pages). The slab 
allocator can currently only allocate from 1-1 mapped memory. So no danger 
there.
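
If that ever changes, the reader could be protected along these lines
(hypothetical sketch, not part of this patchset; the freeing side would
then have to defer the kfree() via call_rcu() or synchronize_rcu()):

	int cpuset_intersects_dirty_nodes(struct address_space *mapping,
				nodemask_t *mask)
	{
		nodemask_t *dirty;
		int ret;

		if (!mask)
			return 1;

		rcu_read_lock();
		dirty = rcu_dereference(mapping->dirty_nodes);
		ret = dirty ? nodes_intersects(*dirty, *mask) : 0;
		rcu_read_unlock();
		return ret;
	}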




* Re: [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-22  1:31   ` David Chinner
@ 2007-01-22 19:30     ` Christoph Lameter
  2007-01-28 21:38       ` David Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2007-01-22 19:30 UTC (permalink / raw)
  To: David Chinner
  Cc: akpm, Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Paul Jackson, Andi Kleen

On Mon, 22 Jan 2007, David Chinner wrote:

> > +			/* No pages on the nodes under writeback */
> > +			list_move(&inode->i_list, &sb->s_dirty);
> > +			continue;
> > +		}
> 
> This breaks aging of dirty inodes, right? The s_dirty list is a time
> ordered list, so this will mean that the inode doesn't get written out
> by a background sync for (potentially) another 30s. IOWs, we can
> delay writeback of inodes with data on other nodes by running
> a single node out of memory. The normal background pdflush won't
> help us either as it will see that there is an existing pdflush
> working on the bdi and skip it....

No. Aging is done without specifying a node mask and we move back to the 
dirty list without updating the dirty time.

> >  	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
> >  		cd_forget(inode);
> > +	cpuset_clear_dirty_nodes(inode->i_mapping);
> >  	inode->i_state = I_CLEAR;
> >  }
> 
> This is rather late to be clearing this, right? At the start of clear_inode()
> we:
> 
> 	BUG_ON(inode->i_data.nrpages);
> 
> Which tends to indicate that we should have already freed the dirty
> map, as there should be no pages (dirty or otherwise) attached to the
> inode at this point. i.e. we should BUG here if we've still got
> a dirty mask indicating dirty nodes on the inode, because it should
> be clear at this point.

The dirty map is needed even after all the dirty pages have become 
writeback/unstable pages. The dirty_map reflects the nodes that the inode 
has or had dirty pages on. If we wanted an accurate dirty map then we
would have to maintain an array of per-node counters of pages. Too
expensive.
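
(On a 1024-node configuration that would be on the order of 1024
counters, i.e. several kilobytes, per inode.)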

> 
> > ===================================================================
> > --- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-18 13:48:29.956271059 -0600
> > +++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-19 19:45:08.755650133 -0600
> > @@ -33,6 +33,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/buffer_head.h>
> >  #include <linux/pagevec.h>
> > +#include <linux/cpuset.h>
> >  
> >  /*
> >   * The maximum number of pages to writeout in a single bdflush/kupdate
> > @@ -776,6 +777,7 @@ int __set_page_dirty_nobuffers(struct pa
> >  			radix_tree_tag_set(&mapping->page_tree,
> >  				page_index(page), PAGECACHE_TAG_DIRTY);
> >  		}
> > +		cpuset_update_dirty_nodes(mapping, page);
> 
> Shouldn't this be done in the same context as setting the
> PAGECACHE_TAG_DIRTY? i.e. we set the node dirty at the same time
> we set the page dirty tag.

We could move that statement up one line without trouble.

> > +			if (radix_tree_tag_clear(&mapping->page_tree,
> >  						page_index(page),
> > -						PAGECACHE_TAG_DIRTY);
> > +						PAGECACHE_TAG_DIRTY))
> > +				cpuset_clear_dirty_nodes(mapping);
> > +		}
> 
> Because you are clearing the dirty node state at the same time we clear
> the PAGECACHE_TAG_DIRTY.

Yuck, this chunk should not be here. Dirty node state should only be
cleared when the inode is cleared, and radix_tree_tag_clear() does not
return a boolean in the context of this patchset.

> > +#if MAX_NUMNODES <= BITS_PER_LONG
> > +#define cpuset_update_dirty_nodes(__mapping, __page) \
> > +	do { \
> > +		int __node = page_to_nid(__page); \
> > +		if (!node_isset(__node, (__mapping)->dirty_nodes)) \
> > +			node_set(__node, (__mapping)->dirty_nodes); \
> > +	} while (0)
> > +
> > +#define cpuset_clear_dirty_nodes(__mapping) \
> > +		(__mapping)->dirty_nodes = NODE_MASK_NONE
> 
> Hmmm - the above is going to lose dirty state - you're calling
> cpuset_clear_dirty_nodes() in the case that a page is now under
> writeback. cpuset_clear_dirty_nodes() clears the _entire_ dirty node mask
> but all you want to do above is remove the dirty state from the
> node mask if that is the only page on the node that is dirty.
> 
> So we set the dirty node mask on a page by page basis, but we shoot
> it down as soon as _any_ page transitions from dirty to writeback.
> Hence if you've got dirty pages on other nodes (or other dirty pages
> on this node) you have now lost track of them because cleaning a
> single page clears all dirty node state on the inode. This seems
> badly broken to me.
> 
> Because you are not tracking pages-per-node dirty state, the only way
> you can really clear the dirty node state is when the inode is
> completely clean. e.g. in __sync_single_inode where (inode->i_state
> & I_DIRTY) == 0. Otherwise I can't see how this would work at all....

Correct. Remove the chunk above and everything is fine. I am going to post 
an updated version. Sigh.


* [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-23 18:52 [PATCH 0/5] Cpuset aware writeback V2 Christoph Lameter
@ 2007-01-23 18:52 ` Christoph Lameter
  2007-01-25  3:04   ` Ethan Solomita
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2007-01-23 18:52 UTC (permalink / raw)
  To: akpm
  Cc: Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Christoph Lameter, Paul Jackson, Dave Chinner, Andi Kleen

Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for applications
that are constrained to a portion of the memory of the system as required by
cpusets.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes

cpuset_update_dirty_nodes(mapping, page)	Record a node in the dirty nodes map

cpuset_init_dirty_nodes(mapping)	First time init of the map


The dirty map may be stored either directly in the mapping (for NUMA
systems with fewer than BITS_PER_LONG nodes) or separately allocated
for systems with a large number of nodes (e.g. IA64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large
configurations. Therefore we protect the allocation and setting
of a node in the map through the tree_lock. The tree_lock is
already taken when a page is dirtied so there is no additional
locking overhead if we insert the updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared.
At that point no pages are attached to the inode anymore and therefore it can
be done without any locking. The dirty map therefore records all nodes that
have been used for dirty pages by that inode until the inode is no longer
used.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.20-rc5/fs/fs-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/fs-writeback.c	2007-01-22 13:31:30.440219103 -0600
+++ linux-2.6.20-rc5/fs/fs-writeback.c	2007-01-23 12:21:44.669179863 -0600
@@ -22,6 +22,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 /**
@@ -349,6 +350,12 @@ sync_sb_inodes(struct super_block *sb, s
 			continue;		/* blockdev has wrong queue */
 		}
 
+		if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
+			/* No pages on the nodes under writeback */
+			list_move(&inode->i_list, &sb->s_dirty);
+			continue;
+		}
+
 		/* Was this inode dirtied after sync_sb_inodes was called? */
 		if (time_after(inode->dirtied_when, start))
 			break;
Index: linux-2.6.20-rc5/fs/inode.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/inode.c	2007-01-22 13:31:30.449985520 -0600
+++ linux-2.6.20-rc5/fs/inode.c	2007-01-22 13:34:22.397545917 -0600
@@ -22,6 +22,7 @@
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
 #include <linux/mount.h>
+#include <linux/cpuset.h>
 
 /*
  * This is needed for the following functions:
@@ -148,6 +149,7 @@ static struct inode *alloc_inode(struct 
 		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
+		cpuset_init_dirty_nodes(mapping);
 
 		/*
 		 * If the block_device provides a backing_dev_info for client
@@ -257,6 +259,7 @@ void clear_inode(struct inode *inode)
 		bd_forget(inode);
 	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
 		cd_forget(inode);
+	cpuset_clear_dirty_nodes(inode->i_mapping);
 	inode->i_state = I_CLEAR;
 }
 
Index: linux-2.6.20-rc5/include/linux/fs.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/fs.h	2007-01-22 13:31:30.468541713 -0600
+++ linux-2.6.20-rc5/include/linux/fs.h	2007-01-23 12:33:26.331272886 -0600
@@ -447,6 +447,13 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+#ifdef CONFIG_CPUSETS
+#if MAX_NUMNODES <= BITS_PER_LONG
+	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
+#else
+	nodemask_t		*dirty_nodes;	/* pointer to map if dirty */
+#endif
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
Index: linux-2.6.20-rc5/mm/page-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-22 13:31:30.498817606 -0600
+++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-23 12:22:15.322270609 -0600
@@ -33,6 +33,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/cpuset.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -776,6 +777,7 @@ int __set_page_dirty_nobuffers(struct pa
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
+		cpuset_update_dirty_nodes(mapping, page);
 		write_unlock_irq(&mapping->tree_lock);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
Index: linux-2.6.20-rc5/fs/buffer.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/buffer.c	2007-01-22 13:31:30.459751937 -0600
+++ linux-2.6.20-rc5/fs/buffer.c	2007-01-23 12:22:15.352546714 -0600
@@ -42,6 +42,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/cpuset.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static void invalidate_bh_lrus(void);
@@ -736,6 +737,7 @@ int __set_page_dirty_buffers(struct page
 		}
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
+		cpuset_update_dirty_nodes(mapping, page);
 	}
 	write_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
Index: linux-2.6.20-rc5/include/linux/cpuset.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/cpuset.h	2007-01-22 13:31:30.477331488 -0600
+++ linux-2.6.20-rc5/include/linux/cpuset.h	2007-01-23 12:32:53.783494005 -0600
@@ -75,6 +75,44 @@ static inline int cpuset_do_slab_mem_spr
 
 extern void cpuset_track_online_nodes(void);
 
+/*
+ * We need macros since struct address_space is not defined yet
+ */
+#if MAX_NUMNODES <= BITS_PER_LONG
+#define cpuset_update_dirty_nodes(__mapping, __page)			\
+	do {								\
+		if (!node_isset(page_to_nid(__page), (__mapping)->dirty_nodes)) \
+			node_set(page_to_nid(__page), (__mapping)->dirty_nodes); \
+	} while (0)
+
+#define cpuset_clear_dirty_nodes(__mapping)				\
+		(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_init_dirty_nodes(__mapping)				\
+		(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr)	\
+		(!(__nodemask_ptr) ||					\
+			nodes_intersects((__mapping)->dirty_nodes,	\
+				*(__nodemask_ptr)))
+
+#else
+
+#define cpuset_init_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NULL
+
+struct address_space;
+
+extern void cpuset_update_dirty_nodes(struct address_space *a,
+					struct page *p);
+
+extern void cpuset_clear_dirty_nodes(struct address_space *a);
+
+extern int cpuset_intersects_dirty_nodes(struct address_space *a,
+					nodemask_t *mask);
+
+#endif
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -146,6 +184,26 @@ static inline int cpuset_do_slab_mem_spr
 
 static inline void cpuset_track_online_nodes(void) {}
 
+struct address_space;
+
+static inline void cpuset_update_dirty_nodes(struct address_space *a,
+					struct page *p) {}
+
+static inline void cpuset_clear_dirty_nodes(struct address_space *a) {}
+
+static inline void cpuset_init_dirty_nodes(struct address_space *a) {}
+
+static inline int cpuset_dirty_node_set(struct inode *i, int node)
+{
+	return 1;
+}
+
+static inline int cpuset_intersects_dirty_nodes(struct address_space *a,
+		nodemask_t *n)
+{
+	return 1;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
Index: linux-2.6.20-rc5/kernel/cpuset.c
===================================================================
--- linux-2.6.20-rc5.orig/kernel/cpuset.c	2007-01-22 13:31:30.511513948 -0600
+++ linux-2.6.20-rc5/kernel/cpuset.c	2007-01-23 12:26:02.804117253 -0600
@@ -4,7 +4,7 @@
  *  Processor and Memory placement constraints for sets of tasks.
  *
  *  Copyright (C) 2003 BULL SA.
- *  Copyright (C) 2004-2006 Silicon Graphics, Inc.
+ *  Copyright (C) 2004-2007 Silicon Graphics, Inc.
  *
  *  Portions derived from Patrick Mochel's sysfs code.
  *  sysfs is Copyright (c) 2001-3 Patrick Mochel
@@ -12,6 +12,7 @@
  *  2003-10-10 Written by Simon Derr.
  *  2003-10-22 Updates by Stephen Hemminger.
  *  2004 May-July Rework by Paul Jackson.
+ *  2007 Cpuset writeback by Christoph Lameter.
  *
  *  This file is subject to the terms and conditions of the GNU General Public
  *  License.  See the file COPYING in the main directory of the Linux
@@ -2530,6 +2531,63 @@ int cpuset_mem_spread_node(void)
 }
 EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 
+#if MAX_NUMNODES > BITS_PER_LONG
+
+/*
+ * Special functions for NUMA systems with a large number of nodes.
+ * The nodemask is pointed to from the address space structures.
+ * The attachment of the dirty_node mask is protected by the
+ * tree_lock. The nodemask is freed only when the inode is cleared
+ * (and therefore unused, thus no locking necessary).
+ */
+void cpuset_update_dirty_nodes(struct address_space *mapping,
+			struct page *page)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+	int node = page_to_nid(page);
+
+	if (!nodes) {
+		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
+		if (!nodes)
+			return;
+
+		*nodes = NODE_MASK_NONE;
+		mapping->dirty_nodes = nodes;
+	}
+
+	if (!node_isset(node, *nodes))
+		node_set(node, *nodes);
+}
+
+void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+
+	if (nodes) {
+		mapping->dirty_nodes = NULL;
+		kfree(nodes);
+	}
+}
+
+/*
+ * Called without the tree_lock. The nodemask is only freed when the inode
+ * is cleared and therefore this is safe.
+ */
+int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+			nodemask_t *mask)
+{
+	nodemask_t *dirty_nodes = mapping->dirty_nodes;
+
+	if (!mask)
+		return 1;
+
+	if (!dirty_nodes)
+		return 0;
+
+	return nodes_intersects(*dirty_nodes, *mask);
+}
+#endif
+
 /**
  * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
  * @p: pointer to task_struct of some other task.
Index: linux-2.6.20-rc5/include/linux/writeback.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/writeback.h	2007-01-22 13:31:30.489051189 -0600
+++ linux-2.6.20-rc5/include/linux/writeback.h	2007-01-23 12:22:15.313480772 -0600
@@ -59,11 +59,12 @@ struct writeback_control {
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned for_writepages:1;	/* This is a writepages() call */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
+	nodemask_t *nodes;		/* Set of nodes of interest */
 };
 
 /*
  * fs/fs-writeback.c
- */	
+ */
 void writeback_inodes(struct writeback_control *wbc);
 void wake_up_inode(struct inode *inode);
 int inode_wait(void *);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-23 18:52 ` [PATCH 1/5] Add a map to to track dirty pages per node Christoph Lameter
@ 2007-01-25  3:04   ` Ethan Solomita
  2007-01-25  5:52     ` Christoph Lameter
  0 siblings, 1 reply; 15+ messages in thread
From: Ethan Solomita @ 2007-01-25  3:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Paul Jackson, Dave Chinner, Andi Kleen

    The below addition makes us skip inodes outside of our dirty nodes. Do
we want this even with WB_SYNC_ALL and WB_SYNC_HOLD? It seems that callers
from sync_inodes_sb(), which are the ones that pass in those options, may
want to know that everything is written.
    -- Ethan


Christoph Lameter wrote:
> Index: linux-2.6.20-rc5/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.20-rc5.orig/fs/fs-writeback.c	2007-01-22 13:31:30.440219103 -0600
> +++ linux-2.6.20-rc5/fs/fs-writeback.c	2007-01-23 12:21:44.669179863 -0600
> @@ -22,6 +22,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/backing-dev.h>
>  #include <linux/buffer_head.h>
> +#include <linux/cpuset.h>
>  #include "internal.h"
>  
>  /**
> @@ -349,6 +350,12 @@ sync_sb_inodes(struct super_block *sb, s
>  			continue;		/* blockdev has wrong queue */
>  		}
>  
> +		if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
> +			/* No pages on the nodes under writeback */
> +			list_move(&inode->i_list, &sb->s_dirty);
> +			continue;
> +		}
> +
>  		/* Was this inode dirtied after sync_sb_inodes was called? */
>  		if (time_after(inode->dirtied_when, start))
>  			break;
>   

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-25  3:04   ` Ethan Solomita
@ 2007-01-25  5:52     ` Christoph Lameter
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-25  5:52 UTC (permalink / raw)
  To: Ethan Solomita
  Cc: akpm, Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Paul Jackson, Dave Chinner, Andi Kleen

On Wed, 24 Jan 2007, Ethan Solomita wrote:

>    The below addition makes us skip inodes outside of our dirty nodes. Do we
> want this even with WB_SYNC_ALL and WB_SYNC_HOLD? It seems that callers from
> sync_inodes_sb(), which are the ones that pass in those options, may want to
> know that everything is written.

A constraint on the nodes requires setting wbc.nodes. I could not find a 
writeback_control setup that uses either of those flags and also sets 
wbc.nodes.

And WB_SYNC_ALL is already used with other constraints on a mapping to
sync only partially.

If a caller in the future sets wbc.nodes in combination with either of
those flags, then it is asking to perform these operations on only a
subset of nodes.
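
To illustrate (a sketch only; wbc.nodes is the field this patchset adds
to struct writeback_control, nr_to_write is an arbitrary example value,
and current->mems_allowed stands in for whatever node constraint a
cpuset-aware caller would pass):

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= 1024,
		.range_cyclic	= 1,
		/* restrict writeback to this task's allowed nodes */
		.nodes		= &current->mems_allowed,
	};
	writeback_inodes(&wbc);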

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-22 19:30     ` Christoph Lameter
@ 2007-01-28 21:38       ` David Chinner
  2007-01-29 16:50         ` Christoph Lameter
  0 siblings, 1 reply; 15+ messages in thread
From: David Chinner @ 2007-01-28 21:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Chinner, akpm, Paul Menage, Peter Zijlstra, Nick Piggin,
	linux-mm, Paul Jackson, Andi Kleen

On Mon, Jan 22, 2007 at 11:30:50AM -0800, Christoph Lameter wrote:
> On Mon, 22 Jan 2007, David Chinner wrote:
> > >  	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
> > >  		cd_forget(inode);
> > > +	cpuset_clear_dirty_nodes(inode->i_mapping);
> > >  	inode->i_state = I_CLEAR;
> > >  }
> > 
> > This is rather late to be clearing this, right? At the start of clear_inode()
> > we:
> > 
> > 	BUG_ON(inode->i_data.nrpages);
> > 
> > Which tends to imply that we should have already freed the dirty
> > map as there should be no pages (dirty or otherwise) attached to the
> > inode at this point. i.e. we should BUG here if we've still got
> > a dirty mask indicating dirty nodes on the inode because it should
> > be clear at this point.
> 
> The dirty map is needed even after all the dirty pages have become 
> writeback/unstable pages. The dirty_map reflects the nodes that the inode 
> has or had dirty pages on. If we wanted an accurate dirty map then we
> would have to maintain an array of per-node page counters. Too
> expensive.

I think you missed my point - when we call into this function, the
inode _must_ have already had all its data written back. That is,
by definition the inode mapping is clean if inode->i_data.nrpages ==
0. Hence if we have any dirty nodes, then we have a mismatch between
the dirty node mask and the inode dirty state.  That is BUG-worthy,
IMO.
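
Something like this is what I have in mind - an untested sketch, written
against the MAX_NUMNODES <= BITS_PER_LONG representation (the pointer
case would check for a NULL or empty map instead):

	/* at the top of clear_inode(), next to the existing check */
	BUG_ON(inode->i_data.nrpages);
	BUG_ON(!nodes_empty(inode->i_mapping->dirty_nodes));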

Cheers,

Dave.
> 
> > 
> > > ===================================================================
> > > --- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-18 13:48:29.956271059 -0600
> > > +++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-19 19:45:08.755650133 -0600
> > > @@ -33,6 +33,7 @@
> > >  #include <linux/syscalls.h>
> > >  #include <linux/buffer_head.h>
> > >  #include <linux/pagevec.h>
> > > +#include <linux/cpuset.h>
> > >  
> > >  /*
> > >   * The maximum number of pages to writeout in a single bdflush/kupdate
> > > @@ -776,6 +777,7 @@ int __set_page_dirty_nobuffers(struct pa
> > >  			radix_tree_tag_set(&mapping->page_tree,
> > >  				page_index(page), PAGECACHE_TAG_DIRTY);
> > >  		}
> > > +		cpuset_update_dirty_nodes(mapping, page);
> > 
> > Shouldn't this be done in the same context of setting the
> > PAGECACHE_TAG_DIRTY? i.e. we set the node dirty at the same time
> > we set the page dirty tag.
> 
> We could move that statement up one line without trouble.
> 
> > > +			if (radix_tree_tag_clear(&mapping->page_tree,
> > >  						page_index(page),
> > > -						PAGECACHE_TAG_DIRTY);
> > > +						PAGECACHE_TAG_DIRTY))
> > > +				cpuset_clear_dirty_nodes(mapping);
> > > +		}
> > 
> > Because you are clearing the dirty node state at the same time we clear
> > the PAGECACHE_TAG_DIRTY.
> 
> Yuck this chunk should not be here. dirty node state should only be 
> cleared when the inode is cleared and radix_tree_tag_clear does not 
> deliver a boolean in the context of this patchset.
> 
> > > +#if MAX_NUMNODES <= BITS_PER_LONG
> > > +#define cpuset_update_dirty_nodes(__mapping, __node) \
> > > +	if (!node_isset((__node, (__mapping)->dirty_nodes) \
> > > +		node_set((__node), (__mapping)->dirty_inodes)
> > > +
> > > +#define cpuset_clear_dirty_nodes(__mapping) \
> > > +		(__mapping)->dirty_nodes = NODE_MASK_NONE
> > 
> > Hmmm - the above is going to lose dirty state - you're calling
> > cpuset_clear_dirty_nodes() in the case that a page is now under
> > writeback. cpuset_clear_dirty_nodes() clears the _entire_ dirty node mask
> > but all you want to do above is remove the dirty state from the
> > node mask if that is the only page on the node that is dirty.
> > 
> > So we set the dirty node mask on a page by page basis, but we shoot
> > it down as soon as _any_ page transitions from dirty to writeback.
> > Hence if you've got dirty pages on other nodes (or other dirty pages
> > on this node) you have now lost track of them because cleaning a
> > single page clears all dirty node state on the inode. This seems
> > badly broken to me.
> > 
> > Because you are not tracking pages-per-node dirty state, the only way
> > you can really clear the dirty node state is when the inode is
> > completely clean. e.g. in __sync_single_inode where (inode->i_state
> > & I_DIRTY) == 0. Otherwise I can't see how this would work at all....
> 
> Correct. Remove the chunk above and everything is fine. I am going to post 
> an updated version. Sigh.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] Add a map to track dirty pages per node
  2007-01-28 21:38       ` David Chinner
@ 2007-01-29 16:50         ` Christoph Lameter
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2007-01-29 16:50 UTC (permalink / raw)
  To: David Chinner
  Cc: akpm, Paul Menage, Peter Zijlstra, Nick Piggin, linux-mm,
	Paul Jackson, Andi Kleen

On Mon, 29 Jan 2007, David Chinner wrote:

> I think you missed my point - when we call into this function, the
> inode _must_ have already had all it's data written back. That is,
> by definition the inode mapping is clean if inode->i_data.nrpages ==
> 0. Hence if we have any dirty nodes, then we have a mismatch between
> the dirty node mask and the inode dirty state.  That is BUG-worthy,
> IMO.

This is the way it is supposed to be. The dirty map is only reset in
clear_inode() when we know that all pages have been written back.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2007-01-29 16:50 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-01-20  3:10 [PATCH 0/5] Cpuset aware writeback V1 Christoph Lameter
2007-01-20  3:10 ` [PATCH 1/5] Add a map to track dirty pages per node Christoph Lameter
2007-01-20  5:15   ` Paul Jackson
2007-01-22 17:41     ` Christoph Lameter
2007-01-22  1:31   ` David Chinner
2007-01-22 19:30     ` Christoph Lameter
2007-01-28 21:38       ` David Chinner
2007-01-29 16:50         ` Christoph Lameter
2007-01-20  3:10 ` [PATCH 2/5] Add a nodemask to pdflush functions Christoph Lameter
2007-01-20  3:10 ` [PATCH 3/5] Per cpuset dirty ratio calculation Christoph Lameter
2007-01-20  3:10 ` [PATCH 4/5] Cpuset aware writeback during reclaim Christoph Lameter
2007-01-20  3:10 ` [PATCH 5/5] Throttle vm writeout per cpuset Christoph Lameter
  -- strict thread matches above, loose matches on Subject: below --
2007-01-23 18:52 [PATCH 0/5] Cpuset aware writeback V2 Christoph Lameter
2007-01-23 18:52 ` [PATCH 1/5] Add a map to track dirty pages per node Christoph Lameter
2007-01-25  3:04   ` Ethan Solomita
2007-01-25  5:52     ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).