* [PATCH 0/6] cpuset aware writeback
@ 2007-07-17 21:23 Ethan Solomita
2007-07-17 21:32 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
` (7 more replies)
0 siblings, 8 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-07-17 21:23 UTC (permalink / raw)
To: linux-mm, LKML, Andrew Morton, Christoph Lameter
Perform writeback and dirty throttling with awareness of cpuset mems_allowed.
The theory of operation has two primary elements:
1. Add a nodemask per mapping which indicates the nodes
which have set PageDirty on any page of the mapping.
2. Add a nodemask argument to wakeup_pdflush() which is
propagated down to sync_sb_inodes().
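To make this concrete, here is a rough sketch of the two hooks, excerpted
(and abbreviated) from patches 1/6 and 2/6 below:

    /* 1/6: struct address_space grows a map of nodes holding dirty pages */
    #if MAX_NUMNODES <= BITS_PER_LONG
        nodemask_t dirty_nodes;     /* nodes with dirty pages */
    #else
        nodemask_t *dirty_nodes;    /* allocated when first dirtied */
    #endif

    /* 1/6: __set_page_dirty() records the dirtying node under tree_lock */
    cpuset_update_dirty_nodes(mapping, page);

    /* 2/6: pdflush can be told which nodes it should write back */
    int wakeup_pdflush(long nr_pages, nodemask_t *nodes);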
This leaves sync_sb_inodes() with two nodemasks. One is passed to it and
specifies the nodes the caller is interested in syncing, and will either
be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the
caller's context.
The second nodemask is attached to the inode's mapping and shows who has
modified data in the inode. sync_sb_inodes() will then skip syncing of
inodes if the nodemask argument does not intersect with the mapping
nodemask.
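The skip test itself is small; roughly (as added to generic_sync_sb_inodes()
by patch 1/6):

    if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
        /* No dirty pages of this inode on the nodes of interest */
        list_move(&inode->i_list, &sb->s_dirty);
        continue;
    }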
cpuset_current_mems_allowed will be passed in to pdflush
background_writeout by try_to_free_pages and balance_dirty_pages.
balance_dirty_pages also passes the nodemask in to writeback_inodes
directly when doing active reclaim.
Other callers do not limit inode writeback, passing in a NULL nodemask
pointer.
A final change is to get_dirty_limits. It takes a nodemask argument, and
when it is null there is no change in behavior. If the nodemask is set,
page statistics are accumulated only for specified nodes, and the
background and throttle dirty ratios will be read from a new per-cpuset
ratio feature.
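As a rough sketch (the full version is in patch 3/6, and the per-cpuset
ratios come from patch 6/6), the per-node accumulation looks like:

    /* Only count pages on the nodes the caller cares about */
    for_each_node_mask(node, *nodes) {
        if (!node_online(node))
            continue;
        dl->nr_dirty     += node_page_state(node, NR_FILE_DIRTY);
        dl->nr_unstable  += node_page_state(node, NR_UNSTABLE_NFS);
        dl->nr_writeback += node_page_state(node, NR_WRITEBACK);
        available_memory += node_page_state(node, NR_ACTIVE) +
                            node_page_state(node, NR_INACTIVE) +
                            node_page_state(node, NR_FREE_PAGES);
    }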
These patches are mostly unchanged from Chris Lameter's original
changelist posted previously to linux-mm.
* [PATCH 1/6] cpuset write dirty map
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
@ 2007-07-17 21:32 ` Ethan Solomita
2007-07-17 21:33 ` [PATCH 2/6] cpuset write pdflush nodemask Ethan Solomita
` (6 subsequent siblings)
7 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-07-17 21:32 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Andrew Morton, Christoph Lameter
Add a dirty map to struct address_space
In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for applications
that are constrained to a portion of the memory of the system as required by
cpusets.
This patch implements the management of dirty node maps for an address
space through the following functions:
cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes
cpuset_update_dirty_nodes(mapping, page) Record a node in the dirty nodes map
cpuset_init_dirty_nodes(mapping) First time init of the map
The dirty map may be stored either directly in the mapping (for NUMA
systems with fewer than BITS_PER_LONG nodes) or separately allocated
for systems with a large number of nodes (e.g. IA64 with 1024 nodes).
Updating the dirty map may involve allocating it first for large
configurations. Therefore we protect the allocation and setting
of a node in the map through the tree_lock. The tree_lock is
already taken when a page is dirtied so there is no additional
locking overhead if we insert the updating of the nodemask there.
The dirty map is only cleared (or freed) when the inode is cleared.
At that point no pages are attached to the inode anymore and therefore it can
be done without any locking. The dirty map therefore records all nodes that
have been used for dirty pages by that inode until the inode is no longer
used.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c
--- 0/fs/buffer.c 2007-07-11 20:30:55.000000000 -0700
+++ 1/fs/buffer.c 2007-07-11 21:08:04.000000000 -0700
@@ -41,6 +41,7 @@
#include <linux/bitops.h>
#include <linux/mpage.h>
#include <linux/bit_spinlock.h>
+#include <linux/cpuset.h>
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
@@ -710,6 +711,7 @@ static int __set_page_dirty(struct page
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
+ cpuset_update_dirty_nodes(mapping, page);
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c
--- 0/fs/fs-writeback.c 2007-07-11 20:30:55.000000000 -0700
+++ 1/fs/fs-writeback.c 2007-07-11 21:08:04.000000000 -0700
@@ -22,6 +22,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
+#include <linux/cpuset.h>
#include "internal.h"
int sysctl_inode_debug __read_mostly;
@@ -492,6 +493,12 @@ int generic_sync_sb_inodes(struct super_
continue; /* blockdev has wrong queue */
}
+ if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
+ /* No pages on the nodes under writeback */
+ list_move(&inode->i_list, &sb->s_dirty);
+ continue;
+ }
+
/* Was this inode dirtied after sync_sb_inodes was called? */
if (time_after(inode->dirtied_when, start))
break;
diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c
--- 0/fs/inode.c 2007-07-11 20:30:55.000000000 -0700
+++ 1/fs/inode.c 2007-07-11 21:08:04.000000000 -0700
@@ -22,6 +22,7 @@
#include <linux/bootmem.h>
#include <linux/inotify.h>
#include <linux/mount.h>
+#include <linux/cpuset.h>
/*
* This is needed for the following functions:
@@ -157,6 +158,7 @@ static struct inode *alloc_inode(struct
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
+ cpuset_init_dirty_nodes(mapping);
/*
* If the block_device provides a backing_dev_info for client
@@ -264,6 +266,7 @@ void clear_inode(struct inode *inode)
bd_forget(inode);
if (S_ISCHR(inode->i_mode) && inode->i_cdev)
cd_forget(inode);
+ cpuset_clear_dirty_nodes(inode->i_mapping);
inode->i_state = I_CLEAR;
}
diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h
--- 0/include/linux/cpuset.h 2007-07-11 20:30:56.000000000 -0700
+++ 1/include/linux/cpuset.h 2007-07-11 21:08:04.000000000 -0700
@@ -76,6 +76,45 @@ extern void cpuset_track_online_nodes(vo
extern int current_cpuset_is_being_rebound(void);
+/*
+ * We need macros since struct address_space is not defined yet
+ */
+#if MAX_NUMNODES <= BITS_PER_LONG
+#define cpuset_update_dirty_nodes(__mapping, __page) \
+ do { \
+ int node = page_to_nid(__page); \
+ if (!node_isset(node, (__mapping)->dirty_nodes)) \
+ node_set(node, (__mapping)->dirty_nodes); \
+ } while (0)
+
+#define cpuset_clear_dirty_nodes(__mapping) \
+ (__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_init_dirty_nodes(__mapping) \
+ (__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr) \
+ (!(__nodemask_ptr) || \
+ nodes_intersects((__mapping)->dirty_nodes, \
+ *(__nodemask_ptr)))
+
+#else
+
+#define cpuset_init_dirty_nodes(__mapping) \
+ (__mapping)->dirty_nodes = NULL
+
+struct address_space;
+
+extern void cpuset_update_dirty_nodes(struct address_space *a,
+ struct page *p);
+
+extern void cpuset_clear_dirty_nodes(struct address_space *a);
+
+extern int cpuset_intersects_dirty_nodes(struct address_space *a,
+ nodemask_t *mask);
+
+#endif
+
#else /* !CONFIG_CPUSETS */
static inline int cpuset_init_early(void) { return 0; }
@@ -150,6 +189,26 @@ static inline int current_cpuset_is_bein
return 0;
}
+struct address_space;
+
+static inline void cpuset_update_dirty_nodes(struct address_space *a,
+ struct page *p) {}
+
+static inline void cpuset_clear_dirty_nodes(struct address_space *a) {}
+
+static inline void cpuset_init_dirty_nodes(struct address_space *a) {}
+
+static inline int cpuset_dirty_node_set(struct inode *i, int node)
+{
+ return 1;
+}
+
+static inline int cpuset_intersects_dirty_nodes(struct address_space *a,
+ nodemask_t *n)
+{
+ return 1;
+}
+
#endif /* !CONFIG_CPUSETS */
#endif /* _LINUX_CPUSET_H */
diff -uprN -X 0/Documentation/dontdiff 0/include/linux/fs.h 1/include/linux/fs.h
--- 0/include/linux/fs.h 2007-07-11 20:30:56.000000000 -0700
+++ 1/include/linux/fs.h 2007-07-11 21:08:04.000000000 -0700
@@ -527,6 +527,13 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+#ifdef CONFIG_CPUSETS
+#if MAX_NUMNODES <= BITS_PER_LONG
+ nodemask_t dirty_nodes; /* nodes with dirty pages */
+#else
+ nodemask_t *dirty_nodes; /* pointer to map if dirty */
+#endif
+#endif
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
--- 0/include/linux/writeback.h 2007-07-11 20:30:56.000000000 -0700
+++ 1/include/linux/writeback.h 2007-07-11 21:12:25.000000000 -0700
@@ -63,6 +63,7 @@ struct writeback_control {
unsigned range_cyclic:1; /* range_start is cyclic */
void *fs_private; /* For use by ->writepages() */
+ nodemask_t *nodes; /* Set of nodes of interest */
};
/*
diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
--- 0/kernel/cpuset.c 2007-07-11 20:30:56.000000000 -0700
+++ 1/kernel/cpuset.c 2007-07-12 12:13:53.000000000 -0700
@@ -4,7 +4,7 @@
* Processor and Memory placement constraints for sets of tasks.
*
* Copyright (C) 2003 BULL SA.
- * Copyright (C) 2004-2006 Silicon Graphics, Inc.
+ * Copyright (C) 2004-2007 Silicon Graphics, Inc.
* Copyright (C) 2006 Google, Inc
*
* Portions derived from Patrick Mochel's sysfs code.
@@ -14,6 +14,7 @@
* 2003-10-22 Updates by Stephen Hemminger.
* 2004 May-July Rework by Paul Jackson.
* 2006 Rework by Paul Menage to use generic containers
+ * 2007 Cpuset writeback by Christoph Lameter.
*
* This file is subject to the terms and conditions of the GNU General Public
* License. See the file COPYING in the main directory of the Linux
@@ -1728,6 +1729,63 @@ int cpuset_mem_spread_node(void)
}
EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
+#if MAX_NUMNODES > BITS_PER_LONG
+
+/*
+ * Special functions for NUMA systems with a large number of nodes.
+ * The nodemask is pointed to from the address space structures.
+ * The attachment of the dirty_node mask is protected by the
+ * tree_lock. The nodemask is freed only when the inode is cleared
+ * (and therefore unused, thus no locking necessary).
+ */
+void cpuset_update_dirty_nodes(struct address_space *mapping,
+ struct page *page)
+{
+ nodemask_t *nodes = mapping->dirty_nodes;
+ int node = page_to_nid(page);
+
+ if (!nodes) {
+ nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
+ if (!nodes)
+ return;
+
+ *nodes = NODE_MASK_NONE;
+ mapping->dirty_nodes = nodes;
+ }
+
+ if (!node_isset(node, *nodes))
+ node_set(node, *nodes);
+}
+
+void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+ nodemask_t *nodes = mapping->dirty_nodes;
+
+ if (nodes) {
+ mapping->dirty_nodes = NULL;
+ kfree(nodes);
+ }
+}
+
+/*
+ * Called without the tree_lock. The nodemask is only freed when the inode
+ * is cleared and therefore this is safe.
+ */
+int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+ nodemask_t *mask)
+{
+ nodemask_t *dirty_nodes = mapping->dirty_nodes;
+
+ if (!mask)
+ return 1;
+
+ if (!dirty_nodes)
+ return 0;
+
+ return nodes_intersects(*dirty_nodes, *mask);
+}
+#endif
+
/**
* cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
* @p: pointer to task_struct of some other task.
diff -uprN -X 0/Documentation/dontdiff 0/mm/page-writeback.c 1/mm/page-writeback.c
--- 0/mm/page-writeback.c 2007-07-11 20:30:56.000000000 -0700
+++ 1/mm/page-writeback.c 2007-07-11 21:08:04.000000000 -0700
@@ -33,6 +33,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <linux/cpuset.h>
/*
* The maximum number of pages to writeout in a single bdflush/kupdate
@@ -832,6 +833,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
+ cpuset_update_dirty_nodes(mapping, page);
write_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
* [PATCH 2/6] cpuset write pdflush nodemask
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
2007-07-17 21:32 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
@ 2007-07-17 21:33 ` Ethan Solomita
2007-07-17 21:34 ` [PATCH 3/6] cpuset write throttle Ethan Solomita
` (5 subsequent siblings)
7 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-07-17 21:33 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Andrew Morton, Christoph Lameter
If we want to support nodeset specific writeout then we need a way
to communicate the set of nodes that an operation should affect.
So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c
--- 1/fs/buffer.c 2007-07-11 21:08:04.000000000 -0700
+++ 2/fs/buffer.c 2007-07-11 21:15:47.000000000 -0700
@@ -359,7 +359,7 @@ static void free_more_memory(void)
struct zone **zones;
pg_data_t *pgdat;
- wakeup_pdflush(1024);
+ wakeup_pdflush(1024, NULL);
yield();
for_each_online_pgdat(pgdat) {
diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c
--- 1/fs/super.c 2007-07-11 21:07:41.000000000 -0700
+++ 2/fs/super.c 2007-07-11 21:15:47.000000000 -0700
@@ -615,7 +615,7 @@ int do_remount_sb(struct super_block *sb
return 0;
}
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
{
struct super_block *sb;
@@ -643,7 +643,7 @@ static void do_emergency_remount(unsigne
void emergency_remount(void)
{
- pdflush_operation(do_emergency_remount, 0);
+ pdflush_operation(do_emergency_remount, 0, NULL);
}
/*
diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c
--- 1/fs/sync.c 2007-07-11 21:07:41.000000000 -0700
+++ 2/fs/sync.c 2007-07-11 21:15:47.000000000 -0700
@@ -21,9 +21,9 @@
* sync everything. Start out by waking pdflush, because that writes back
* all queues in parallel.
*/
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
{
- wakeup_pdflush(0);
+ wakeup_pdflush(0, NULL);
sync_inodes(0); /* All mappings, inodes and their blockdevs */
DQUOT_SYNC(NULL);
sync_supers(); /* Write the superblocks */
@@ -38,13 +38,13 @@ static void do_sync(unsigned long wait)
asmlinkage long sys_sync(void)
{
- do_sync(1);
+ do_sync(1, NULL);
return 0;
}
void emergency_sync(void)
{
- pdflush_operation(do_sync, 0);
+ pdflush_operation(do_sync, 0, NULL);
}
/*
diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h
--- 1/include/linux/writeback.h 2007-07-11 21:12:25.000000000 -0700
+++ 2/include/linux/writeback.h 2007-07-11 21:15:47.000000000 -0700
@@ -92,7 +92,7 @@ static inline void inode_sync_wait(struc
/*
* mm/page-writeback.c
*/
-int wakeup_pdflush(long nr_pages);
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(gfp_t gfp_mask);
@@ -123,7 +123,8 @@ balance_dirty_pages_ratelimited(struct a
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
void *data);
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes),
+ unsigned long arg0, nodemask_t *nodes);
int generic_writepages(struct address_space *mapping,
struct writeback_control *wbc);
int write_cache_pages(struct address_space *mapping,
diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c
--- 1/mm/page-writeback.c 2007-07-11 21:08:04.000000000 -0700
+++ 2/mm/page-writeback.c 2007-07-11 21:15:47.000000000 -0700
@@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode);
/* End of sysctl-exported parameters */
-static void background_writeout(unsigned long _min_pages);
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
/*
* Work out the current dirty-memory clamping and background writeout
@@ -272,7 +272,7 @@ static void balance_dirty_pages(struct a
*/
if ((laptop_mode && pages_written) ||
(!laptop_mode && (nr_reclaimable > background_thresh)))
- pdflush_operation(background_writeout, 0);
+ pdflush_operation(background_writeout, 0, NULL);
}
void set_page_dirty_balance(struct page *page)
@@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
* writeback at least _min_pages, and keep writing until the amount of dirty
* memory is less than the background threshold, or until we're all clean.
*/
-static void background_writeout(unsigned long _min_pages)
+static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
{
long min_pages = _min_pages;
struct writeback_control wbc = {
@@ -402,12 +402,12 @@ static void background_writeout(unsigned
* the whole world. Returns 0 if a pdflush thread was dispatched. Returns
* -1 if all pdflush threads were busy.
*/
-int wakeup_pdflush(long nr_pages)
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes)
{
if (nr_pages == 0)
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- return pdflush_operation(background_writeout, nr_pages);
+ return pdflush_operation(background_writeout, nr_pages, nodes);
}
static void wb_timer_fn(unsigned long unused);
@@ -431,7 +431,7 @@ static DEFINE_TIMER(laptop_mode_wb_timer
* older_than_this takes precedence over nr_to_write. So we'll only write back
* all dirty pages if they are all attached to "old" mappings.
*/
-static void wb_kupdate(unsigned long arg)
+static void wb_kupdate(unsigned long arg, nodemask_t *unused)
{
unsigned long oldest_jif;
unsigned long start_jif;
@@ -489,18 +489,18 @@ int dirty_writeback_centisecs_handler(ct
static void wb_timer_fn(unsigned long unused)
{
- if (pdflush_operation(wb_kupdate, 0) < 0)
+ if (pdflush_operation(wb_kupdate, 0, NULL) < 0)
mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
}
-static void laptop_flush(unsigned long unused)
+static void laptop_flush(unsigned long unused, nodemask_t *unused2)
{
sys_sync();
}
static void laptop_timer_fn(unsigned long unused)
{
- pdflush_operation(laptop_flush, 0);
+ pdflush_operation(laptop_flush, 0, NULL);
}
/*
diff -uprN -X 0/Documentation/dontdiff 1/mm/pdflush.c 2/mm/pdflush.c
--- 1/mm/pdflush.c 2007-07-11 21:07:48.000000000 -0700
+++ 2/mm/pdflush.c 2007-07-11 21:15:47.000000000 -0700
@@ -83,10 +83,12 @@ static unsigned long last_empty_jifs;
*/
struct pdflush_work {
struct task_struct *who; /* The thread */
- void (*fn)(unsigned long); /* A callback function */
+ void (*fn)(unsigned long, nodemask_t *); /* A callback function */
unsigned long arg0; /* An argument to the callback */
struct list_head list; /* On pdflush_list, when idle */
unsigned long when_i_went_to_sleep;
+ int have_nodes; /* Nodes were specified */
+ nodemask_t nodes; /* Nodes of interest */
};
static int __pdflush(struct pdflush_work *my_work)
@@ -124,7 +126,8 @@ static int __pdflush(struct pdflush_work
}
spin_unlock_irq(&pdflush_lock);
- (*my_work->fn)(my_work->arg0);
+ (*my_work->fn)(my_work->arg0,
+ my_work->have_nodes ? &my_work->nodes : NULL);
/*
* Thread creation: For how long have there been zero
@@ -198,7 +201,8 @@ static int pdflush(void *dummy)
* Returns zero if it indeed managed to find a worker thread, and passed your
* payload to it.
*/
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *),
+ unsigned long arg0, nodemask_t *nodes)
{
unsigned long flags;
int ret = 0;
@@ -218,6 +222,11 @@ int pdflush_operation(void (*fn)(unsigne
last_empty_jifs = jiffies;
pdf->fn = fn;
pdf->arg0 = arg0;
+ if (nodes) {
+ pdf->nodes = *nodes;
+ pdf->have_nodes = 1;
+ } else
+ pdf->have_nodes = 0;
wake_up_process(pdf->who);
spin_unlock_irqrestore(&pdflush_lock, flags);
}
diff -uprN -X 0/Documentation/dontdiff 1/mm/vmscan.c 2/mm/vmscan.c
--- 1/mm/vmscan.c 2007-07-11 21:07:48.000000000 -0700
+++ 2/mm/vmscan.c 2007-07-11 21:15:47.000000000 -0700
@@ -1183,7 +1183,7 @@ unsigned long try_to_free_pages(struct z
*/
if (total_scanned > sc.swap_cluster_max +
sc.swap_cluster_max / 2) {
- wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+ wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
sc.may_writepage = 1;
}
* [PATCH 3/6] cpuset write throttle
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
2007-07-17 21:32 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
2007-07-17 21:33 ` [PATCH 2/6] cpuset write pdflush nodemask Ethan Solomita
@ 2007-07-17 21:34 ` Ethan Solomita
2007-07-17 21:35 ` [PATCH 4/6] cpuset write vmscan Ethan Solomita
` (4 subsequent siblings)
7 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-07-17 21:34 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Andrew Morton, Christoph Lameter
Make page writeback obey cpuset constraints
Currently dirty throttling does not work properly in a cpuset.
If, e.g., a cpuset contains only 1/10th of available memory then all of the
memory of the cpuset can be dirtied without any writes being triggered.
If all of the cpuset's memory is dirty then only 10% of total memory is dirty.
The background writeback threshold is usually set at 10% and the synchronous
threshold at 40%. So we are still below the global limits while the dirty
ratio in the cpuset is 100%! Writeback throttling and background writeout
do not work at all in such scenarios.
This patch makes dirty writeout cpuset aware. When determining the
dirty limits in get_dirty_limits() we calculate values based on the
nodes that are reachable from the current process (that has been
dirtying the page). Then we can trigger writeout based on the
dirty ratio of the memory in the cpuset.
We trigger writeout in a cpuset-specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.
Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have
a cpuset. However, please remember that the global counters were only
introduced recently. Before 2.6.18 we added up per-processor
counters for each processor on each invocation of get_dirty_limits().
We now add up per-node information, which I think is equal or less effort
since there are fewer nodes than processors.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c
--- 2/mm/page-writeback.c 2007-07-11 21:15:47.000000000 -0700
+++ 3/mm/page-writeback.c 2007-07-16 18:30:01.000000000 -0700
@@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode);
static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
+struct dirty_limits {
+ long thresh_background;
+ long thresh_dirty;
+ unsigned long nr_dirty;
+ unsigned long nr_unstable;
+ unsigned long nr_writeback;
+};
+
/*
* Work out the current dirty-memory clamping and background writeout
* thresholds.
@@ -121,13 +129,15 @@ static void background_writeout(unsigned
* clamping level.
*/
-static unsigned long highmem_dirtyable_memory(unsigned long total)
+static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total)
{
#ifdef CONFIG_HIGHMEM
int node;
unsigned long x = 0;
- for_each_online_node(node) {
+ if (nodes == NULL)
+ nodes = &node_online_map;
+ for_each_node_mask(node, *nodes) {
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
@@ -154,26 +164,74 @@ static unsigned long determine_dirtyable
x = global_page_state(NR_FREE_PAGES)
+ global_page_state(NR_INACTIVE)
+ global_page_state(NR_ACTIVE);
- x -= highmem_dirtyable_memory(x);
+ x -= highmem_dirtyable_memory(NULL, x);
return x + 1; /* Ensure that we never return 0 */
}
-static void
-get_dirty_limits(long *pbackground, long *pdirty,
- struct address_space *mapping)
+static int
+get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping,
+ nodemask_t *nodes)
{
int background_ratio; /* Percentages */
int dirty_ratio;
int unmapped_ratio;
long background;
long dirty;
- unsigned long available_memory = determine_dirtyable_memory();
+ unsigned long available_memory;
+ unsigned long nr_mapped;
struct task_struct *tsk;
+ int is_subset = 0;
- unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
- global_page_state(NR_ANON_PAGES)) * 100) /
- available_memory;
+#ifdef CONFIG_CPUSETS
+ if (unlikely(nodes &&
+ !nodes_subset(node_online_map, *nodes))) {
+ int node;
+ /*
+ * Calculate the limits relative to the current cpuset.
+ *
+ * We do not disregard highmem because all nodes (except
+ * maybe node 0) have either all memory in HIGHMEM (32 bit) or
+ * all memory in non HIGHMEM (64 bit). If we would disregard
+ * highmem then cpuset throttling would not work on 32 bit.
+ */
+ is_subset = 1;
+ memset(dl, 0, sizeof(struct dirty_limits));
+ available_memory = 0;
+ nr_mapped = 0;
+ for_each_node_mask(node, *nodes) {
+ if (!node_online(node))
+ continue;
+ dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY);
+ dl->nr_unstable +=
+ node_page_state(node, NR_UNSTABLE_NFS);
+ dl->nr_writeback +=
+ node_page_state(node, NR_WRITEBACK);
+ available_memory +=
+ node_page_state(node, NR_ACTIVE) +
+ node_page_state(node, NR_INACTIVE) +
+ node_page_state(node, NR_FREE_PAGES);
+ nr_mapped += node_page_state(node, NR_FILE_MAPPED) +
+ node_page_state(node, NR_ANON_PAGES);
+ }
+ available_memory -= highmem_dirtyable_memory(nodes,
+ available_memory);
+ /* Ensure that we return >= 0 */
+ if (available_memory <= 0)
+ available_memory = 1;
+ } else
+#endif
+ {
+ /* Global limits */
+ dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
+ dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+ dl->nr_writeback = global_page_state(NR_WRITEBACK);
+ available_memory = determine_dirtyable_memory();
+ nr_mapped = global_page_state(NR_FILE_MAPPED) +
+ global_page_state(NR_ANON_PAGES);
+ }
+
+ unmapped_ratio = 100 - (nr_mapped * 100 / available_memory);
dirty_ratio = vm_dirty_ratio;
if (dirty_ratio > unmapped_ratio / 2)
dirty_ratio = unmapped_ratio / 2;
@@ -192,8 +250,9 @@ get_dirty_limits(long *pbackground, long
background += background / 4;
dirty += dirty / 4;
}
- *pbackground = background;
- *pdirty = dirty;
+ dl->thresh_background = background;
+ dl->thresh_dirty = dirty;
+ return is_subset;
}
/*
@@ -206,8 +265,7 @@ get_dirty_limits(long *pbackground, long
static void balance_dirty_pages(struct address_space *mapping)
{
long nr_reclaimable;
- long background_thresh;
- long dirty_thresh;
+ struct dirty_limits dl;
unsigned long pages_written = 0;
unsigned long write_chunk = sync_writeback_pages();
@@ -222,11 +280,12 @@ static void balance_dirty_pages(struct a
.range_cyclic = 1,
};
- get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
- dirty_thresh)
+ if (get_dirty_limits(&dl, mapping,
+ &cpuset_current_mems_allowed))
+ wbc.nodes = &cpuset_current_mems_allowed;
+ nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+ if (nr_reclaimable + dl.nr_writeback <=
+ dl.thresh_dirty)
break;
if (!dirty_exceeded)
@@ -240,13 +299,10 @@ static void balance_dirty_pages(struct a
*/
if (nr_reclaimable) {
writeback_inodes(&wbc);
- get_dirty_limits(&background_thresh,
- &dirty_thresh, mapping);
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- if (nr_reclaimable +
- global_page_state(NR_WRITEBACK)
- <= dirty_thresh)
+ get_dirty_limits(&dl, mapping,
+ &cpuset_current_mems_allowed);
+ nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+ if (nr_reclaimable + dl.nr_writeback <= dl.thresh_dirty)
break;
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
@@ -255,8 +311,8 @@ static void balance_dirty_pages(struct a
congestion_wait(WRITE, HZ/10);
}
- if (nr_reclaimable + global_page_state(NR_WRITEBACK)
- <= dirty_thresh && dirty_exceeded)
+ if (nr_reclaimable + dl.nr_writeback
+ <= dl.thresh_dirty && dirty_exceeded)
dirty_exceeded = 0;
if (writeback_in_progress(bdi))
@@ -271,8 +327,9 @@ static void balance_dirty_pages(struct a
* background_thresh, to keep the amount of dirty memory low.
*/
if ((laptop_mode && pages_written) ||
- (!laptop_mode && (nr_reclaimable > background_thresh)))
- pdflush_operation(background_writeout, 0, NULL);
+ (!laptop_mode && (nr_reclaimable > dl.thresh_background)))
+ pdflush_operation(background_writeout, 0,
+ &cpuset_current_mems_allowed);
}
void set_page_dirty_balance(struct page *page)
@@ -329,8 +386,7 @@ EXPORT_SYMBOL(balance_dirty_pages_rateli
void throttle_vm_writeout(gfp_t gfp_mask)
{
- long background_thresh;
- long dirty_thresh;
+ struct dirty_limits dl;
if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
/*
@@ -342,27 +398,26 @@ void throttle_vm_writeout(gfp_t gfp_mask
return;
}
- for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+ for ( ; ; ) {
+ get_dirty_limits(&dl, NULL, &node_online_map);
+
+ /*
+ * Boost the allowable dirty threshold a bit for page
+ * allocators so they don't get DoS'ed by heavy writers
+ */
+ dl.thresh_dirty += dl.thresh_dirty / 10; /* wheeee... */
- /*
- * Boost the allowable dirty threshold a bit for page
- * allocators so they don't get DoS'ed by heavy writers
- */
- dirty_thresh += dirty_thresh / 10; /* wheeee... */
-
- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
- congestion_wait(WRITE, HZ/10);
- }
+ if (dl.nr_unstable + dl.nr_writeback <= dl.thresh_dirty)
+ break;
+ congestion_wait(WRITE, HZ/10);
+ }
}
/*
* writeback at least _min_pages, and keep writing until the amount of dirty
* memory is less than the background threshold, or until we're all clean.
*/
-static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes)
{
long min_pages = _min_pages;
struct writeback_control wbc = {
@@ -375,12 +430,11 @@ static void background_writeout(unsigned
};
for ( ; ; ) {
- long background_thresh;
- long dirty_thresh;
+ struct dirty_limits dl;
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
- if (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) < background_thresh
+ if (get_dirty_limits(&dl, NULL, nodes))
+ wbc.nodes = nodes;
+ if (dl.nr_dirty + dl.nr_unstable < dl.thresh_background
&& min_pages <= 0)
break;
wbc.encountered_congestion = 0;
* [PATCH 4/6] cpuset write vmscan
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
` (2 preceding siblings ...)
2007-07-17 21:34 ` [PATCH 3/6] cpuset write throttle Ethan Solomita
@ 2007-07-17 21:35 ` Ethan Solomita
2007-07-17 21:36 ` [PATCH 5/6] cpuset write vm writeout Ethan Solomita
` (3 subsequent siblings)
7 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-07-17 21:35 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Andrew Morton, Christoph Lameter
Direct reclaim: cpuset aware writeout
During direct reclaim we traverse down a zonelist and carefully
check whether each zone is a member of the active cpuset. But then we call
pdflush without enforcing the same restrictions. In a larger system this
may have the effect of a massive number of pages being dirtied, and then either
A. No writeout occurs because global dirty limits have not been reached
or
B. Writeout starts randomly for some dirty inode in the system. Pdflush
may just write out data for nodes in another cpuset and miss doing
proper dirty handling for the current cpuset.
In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.
Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c 2007-07-11 21:16:14.000000000 -0700
+++ 4/mm/vmscan.c 2007-07-11 21:16:26.000000000 -0700
@@ -1183,7 +1183,8 @@ unsigned long try_to_free_pages(struct z
*/
if (total_scanned > sc.swap_cluster_max +
sc.swap_cluster_max / 2) {
- wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+ wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+ &cpuset_current_mems_allowed);
sc.may_writepage = 1;
}
* [PATCH 5/6] cpuset write vm writeout
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
` (3 preceding siblings ...)
2007-07-17 21:35 ` [PATCH 4/6] cpuset write vmscan Ethan Solomita
@ 2007-07-17 21:36 ` Ethan Solomita
2007-07-17 21:37 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
` (2 subsequent siblings)
7 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-07-17 21:36 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Andrew Morton, Christoph Lameter
Throttle VM writeout in a cpuset aware way
This bases the VM throttling in the reclaim path on the dirty ratio
of the cpuset. Note that the cpuset restriction is only effective if
shrink_zone() is called from direct reclaim.
kswapd has a cpuset context that includes the whole machine, so VM throttling
will only work during synchronous reclaim and not from kswapd.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h 2007-07-11 21:16:25.000000000 -0700
+++ 5/include/linux/writeback.h 2007-07-11 21:16:50.000000000 -0700
@@ -95,7 +95,7 @@ static inline void inode_sync_wait(struc
int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
void laptop_io_completion(void);
void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask);
/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c 2007-07-16 18:31:13.000000000 -0700
+++ 5/mm/page-writeback.c 2007-07-16 18:32:08.000000000 -0700
@@ -384,7 +384,7 @@ void balance_dirty_pages_ratelimited_nr(
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
{
struct dirty_limits dl;
@@ -399,7 +399,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
}
for ( ; ; ) {
- get_dirty_limits(&dl, NULL, &node_online_map);
+ get_dirty_limits(&dl, NULL, nodes);
/*
* Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c 2007-07-11 21:16:26.000000000 -0700
+++ 5/mm/vmscan.c 2007-07-11 21:16:50.000000000 -0700
@@ -1064,7 +1064,7 @@ static unsigned long shrink_zone(int pri
}
}
- throttle_vm_writeout(sc->gfp_mask);
+ throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);
atomic_dec(&zone->reclaim_in_progress);
return nr_reclaimed;
* [PATCH 6/6] cpuset dirty limits
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
` (4 preceding siblings ...)
2007-07-17 21:36 ` [PATCH 5/6] cpuset write vm writeout Ethan Solomita
@ 2007-07-17 21:37 ` Ethan Solomita
2007-07-23 20:18 ` [PATCH 0/6] cpuset aware writeback Christoph Lameter
2007-09-12 1:32 ` Ethan Solomita
7 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-07-17 21:37 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Andrew Morton, Christoph Lameter
Per cpuset dirty ratios
This implements dirty ratios per cpuset. Two new files are added
to the cpuset directories:
background_dirty_ratio Percentage at which background writeback starts
throttle_dirty_ratio Percentage at which the application is throttled
and we start synchronous writeout.
Both variables are set to -1 by default, which means that the global
limits (/proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio)
are used for a cpuset.
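In other words the effective ratios resolve roughly as follows (this is what
the cpuset_get_current_ratios() helper added below does, locking omitted):

    background = task_cs(current)->background_dirty_ratio;
    throttle   = task_cs(current)->throttle_dirty_ratio;
    if (background == -1)
        background = dirty_background_ratio;
    if (throttle == -1)
        throttle = vm_dirty_ratio;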
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 6/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 6/include/linux/cpuset.h 2007-07-11 21:17:08.000000000 -0700
+++ 7/include/linux/cpuset.h 2007-07-11 21:17:41.000000000 -0700
@@ -76,6 +76,7 @@ extern void cpuset_track_online_nodes(vo
extern int current_cpuset_is_being_rebound(void);
+extern void cpuset_get_current_ratios(int *background, int *ratio);
/*
* We need macros since struct address_space is not defined yet
*/
diff -uprN -X 0/Documentation/dontdiff 6/kernel/cpuset.c 7/kernel/cpuset.c
--- 6/kernel/cpuset.c 2007-07-12 12:15:20.000000000 -0700
+++ 7/kernel/cpuset.c 2007-07-12 12:15:34.000000000 -0700
@@ -51,6 +51,7 @@
#include <linux/time.h>
#include <linux/backing-dev.h>
#include <linux/sort.h>
+#include <linux/writeback.h>
#include <asm/uaccess.h>
#include <asm/atomic.h>
@@ -92,6 +93,9 @@ struct cpuset {
int mems_generation;
struct fmeter fmeter; /* memory_pressure filter */
+
+ int background_dirty_ratio;
+ int throttle_dirty_ratio;
};
/* Update the cpuset for a container */
@@ -175,6 +179,8 @@ static struct cpuset top_cpuset = {
.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
.cpus_allowed = CPU_MASK_ALL,
.mems_allowed = NODE_MASK_ALL,
+ .background_dirty_ratio = -1,
+ .throttle_dirty_ratio = -1,
};
/*
@@ -776,6 +782,21 @@ static int update_flag(cpuset_flagbits_t
return 0;
}
+static int update_int(int *cs_int, char *buf, int min, int max)
+{
+ char *endp;
+ int val;
+
+ val = simple_strtol(buf, &endp, 10);
+ if (val < min || val > max)
+ return -EINVAL;
+
+ mutex_lock(&callback_mutex);
+ *cs_int = val;
+ mutex_unlock(&callback_mutex);
+ return 0;
+}
+
/*
* Frequency meter - How fast is some event occurring?
*
@@ -924,6 +945,8 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_THROTTLE_DIRTY_RATIO,
+ FILE_BACKGROUND_DIRTY_RATIO,
} cpuset_filetype_t;
static ssize_t cpuset_common_file_write(struct container *cont,
@@ -988,6 +1011,12 @@ static ssize_t cpuset_common_file_write(
retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
cs->mems_generation = cpuset_mems_generation++;
break;
+ case FILE_BACKGROUND_DIRTY_RATIO:
+ retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100);
+ break;
+ case FILE_THROTTLE_DIRTY_RATIO:
+ retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100);
+ break;
default:
retval = -EINVAL;
goto out2;
@@ -1081,6 +1110,12 @@ static ssize_t cpuset_common_file_read(s
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
+ case FILE_BACKGROUND_DIRTY_RATIO:
+ s += sprintf(s, "%d", cs->background_dirty_ratio);
+ break;
+ case FILE_THROTTLE_DIRTY_RATIO:
+ s += sprintf(s, "%d", cs->throttle_dirty_ratio);
+ break;
default:
retval = -EINVAL;
goto out;
@@ -1164,6 +1199,20 @@ static struct cftype cft_spread_slab = {
.private = FILE_SPREAD_SLAB,
};
+static struct cftype cft_background_dirty_ratio = {
+ .name = "background_dirty_ratio",
+ .read = cpuset_common_file_read,
+ .write = cpuset_common_file_write,
+ .private = FILE_BACKGROUND_DIRTY_RATIO,
+};
+
+static struct cftype cft_throttle_dirty_ratio = {
+ .name = "throttle_dirty_ratio",
+ .read = cpuset_common_file_read,
+ .write = cpuset_common_file_write,
+ .private = FILE_THROTTLE_DIRTY_RATIO,
+};
+
int cpuset_populate(struct container_subsys *ss, struct container *cont)
{
int err;
@@ -1184,6 +1233,10 @@ int cpuset_populate(struct container_sub
return err;
if ((err = container_add_file(cont, &cft_spread_slab)) < 0)
return err;
+ if ((err = container_add_file(cont, &cft_background_dirty_ratio)) < 0)
+ return err;
+ if ((err = container_add_file(cont, &cft_throttle_dirty_ratio)) < 0)
+ return err;
/* memory_pressure_enabled is in root cpuset only */
if (err == 0 && !cont->parent)
err = container_add_file(cont, &cft_memory_pressure_enabled);
@@ -1262,6 +1315,8 @@ int cpuset_create(struct container_subsy
cs->mems_allowed = NODE_MASK_NONE;
cs->mems_generation = cpuset_mems_generation++;
fmeter_init(&cs->fmeter);
+ cs->background_dirty_ratio = parent->background_dirty_ratio;
+ cs->throttle_dirty_ratio = parent->throttle_dirty_ratio;
cs->parent = parent;
set_container_cs(cont, cs);
@@ -1729,8 +1784,30 @@ int cpuset_mem_spread_node(void)
}
EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
-#if MAX_NUMNODES > BITS_PER_LONG
+/*
+ * Determine the dirty ratios for the currently active cpuset
+ */
+void cpuset_get_current_ratios(int *background_ratio, int *throttle_ratio)
+{
+ int background = -1;
+ int throttle = -1;
+ struct task_struct *tsk = current;
+
+ task_lock(tsk);
+ background = task_cs(tsk)->background_dirty_ratio;
+ throttle = task_cs(tsk)->throttle_dirty_ratio;
+ task_unlock(tsk);
+ if (background == -1)
+ background = dirty_background_ratio;
+ if (throttle == -1)
+ throttle = vm_dirty_ratio;
+
+ *background_ratio = background;
+ *throttle_ratio = throttle;
+}
+
+#if MAX_NUMNODES > BITS_PER_LONG
/*
* Special functions for NUMA systems with a large number of nodes.
* The nodemask is pointed to from the address space structures.
diff -uprN -X 0/Documentation/dontdiff 6/mm/page-writeback.c 7/mm/page-writeback.c
--- 6/mm/page-writeback.c 2007-07-16 18:32:20.000000000 -0700
+++ 7/mm/page-writeback.c 2007-07-17 13:17:31.000000000 -0700
@@ -219,6 +219,7 @@ get_dirty_limits(struct dirty_limits *dl
/* Ensure that we return >= 0 */
if (available_memory <= 0)
available_memory = 1;
+ cpuset_get_current_ratios(&background_ratio, &dirty_ratio);
} else
#endif
{
@@ -229,17 +230,17 @@ get_dirty_limits(struct dirty_limits *dl
available_memory = determine_dirtyable_memory();
nr_mapped = global_page_state(NR_FILE_MAPPED) +
global_page_state(NR_ANON_PAGES);
+ dirty_ratio = vm_dirty_ratio;
+ background_ratio = dirty_background_ratio;
}
unmapped_ratio = 100 - (nr_mapped * 100 / available_memory);
- dirty_ratio = vm_dirty_ratio;
if (dirty_ratio > unmapped_ratio / 2)
dirty_ratio = unmapped_ratio / 2;
if (dirty_ratio < 5)
dirty_ratio = 5;
- background_ratio = dirty_background_ratio;
if (background_ratio >= dirty_ratio)
background_ratio = dirty_ratio / 2;
* Re: [PATCH 0/6] cpuset aware writeback
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
` (5 preceding siblings ...)
2007-07-17 21:37 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
@ 2007-07-23 20:18 ` Christoph Lameter
2007-07-23 21:30 ` Ethan Solomita
2007-09-12 1:32 ` Ethan Solomita
7 siblings, 1 reply; 41+ messages in thread
From: Christoph Lameter @ 2007-07-23 20:18 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Andrew Morton
On Tue, 17 Jul 2007 14:23:14 -0700
Ethan Solomita <solo@google.com> wrote:
> These patches are mostly unchanged from Chris Lameter's original
> changelist posted previously to linux-mm.
Thanks for keeping these patches up to date. Add your signoff if you
did modifications to a patch. Also include the description of the tests
in the introduction to the patchset.
* Re: [PATCH 0/6] cpuset aware writeback
2007-07-23 20:18 ` [PATCH 0/6] cpuset aware writeback Christoph Lameter
@ 2007-07-23 21:30 ` Ethan Solomita
2007-07-23 21:53 ` Christoph Lameter
0 siblings, 1 reply; 41+ messages in thread
From: Ethan Solomita @ 2007-07-23 21:30 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, LKML, Andrew Morton
Christoph Lameter wrote:
> On Tue, 17 Jul 2007 14:23:14 -0700
> Ethan Solomita <solo@google.com> wrote:
>
>> These patches are mostly unchanged from Chris Lameter's original
>> changelist posted previously to linux-mm.
>
> Thanks for keeping these patches up to date. Add your signoff if you
> did modifications to a patch. Also include the description of the tests
> in the introduction to the patchset.
So switch from an Ack to a signed-off? OK, and I'll add descriptions of
testing. Everyone other than you has been silent on these patches. Does
silence equal consent?
-- Ethan
* Re: [PATCH 0/6] cpuset aware writeback
2007-07-23 21:30 ` Ethan Solomita
@ 2007-07-23 21:53 ` Christoph Lameter
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Lameter @ 2007-07-23 21:53 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Andrew Morton
On Mon, 23 Jul 2007, Ethan Solomita wrote:
> Christoph Lameter wrote:
> > On Tue, 17 Jul 2007 14:23:14 -0700
> > Ethan Solomita <solo@google.com> wrote:
> >
> >> These patches are mostly unchanged from Chris Lameter's original
> >> changelist posted previously to linux-mm.
> >
> > Thanks for keeping these patches up to date. Add your signoff if you
> > did modifications to a patch. Also include the description of the tests
> > in the introduction to the patchset.
>
> So switch from an Ack to a signed-off? OK, and I'll add descriptions of
No. Do a signed-off if you have modified the patch.
> testing. Everyone other than you has been silent on these patches. Does
> silence equal consent?
Sometimes. However, the audience for NUMA and cpusets is rather limited.
* Re: [PATCH 0/6] cpuset aware writeback
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
` (6 preceding siblings ...)
2007-07-23 20:18 ` [PATCH 0/6] cpuset aware writeback Christoph Lameter
@ 2007-09-12 1:32 ` Ethan Solomita
2007-09-12 1:36 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
` (5 more replies)
7 siblings, 6 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-09-12 1:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, LKML, Christoph Lameter
Perform writeback and dirty throttling with awareness of cpuset mems_allowed.
The theory of operation has two primary elements:
1. Add a nodemask per mapping which indicates the nodes
which have set PageDirty on any page of the mapping.
2. Add a nodemask argument to wakeup_pdflush() which is
propagated down to sync_sb_inodes().
This leaves sync_sb_inodes() with two nodemasks. One is passed to it and
specifies the nodes the caller is interested in syncing, and will either
be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the
caller's context.
The second nodemask is attached to the inode's mapping and shows who has
modified data in the inode. sync_sb_inodes() will then skip syncing of
inodes if the nodemask argument does not intersect with the mapping
nodemask.
cpuset_current_mems_allowed will be passed in to pdflush
background_writeout by try_to_free_pages and balance_dirty_pages.
balance_dirty_pages also passes the nodemask in to writeback_inodes
directly when doing active reclaim.
Other callers do not limit inode writeback, passing in a NULL nodemask
pointer.
A final change is to get_dirty_limits. It takes a nodemask argument, and
when it is null there is no change in behavior. If the nodemask is set,
page statistics are accumulated only for specified nodes, and the
background and throttle dirty ratios will be read from a new per-cpuset
ratio feature.
For testing I did a variety of basic tests, verifying individual
features of the patchset. To verify that it fixes the core problem, I
created a stress test which involved using cpusets and mems_allowed
to split memory so that all daemons had memory set aside for them, and
my memory stress test had a separate set of memory. The stress test was
mmapping 7GB of a very large file on disk. It then scans the entire 7GB
of memory reading and modifying each byte. 7GB is more than the amount
of physical memory made available to the stress test.
Using iostat I can see the initial period of reading from disk, followed
by a period of simultaneous reads and writes as dirty bytes are pushed
to make room for new reads.
In a separate log-in, in the other cpuset, I am running:
while `true`; do date | tee -a date.txt; sleep 5; done
date.txt resides on the same disk as the large file mentioned above. The
above while-loop serves the dual purpose of providing me visual cues of
progress along with the opportunity for the "tee" command to become
throttled writing to the disk.
The effect of this patchset is straightforward. Without it there are
long hangs between appearances of the date. With it the dates are all 5
(or sometimes 6) seconds apart.
I also added printks to the kernel to verify that, without these
patches, the tee was being throttled (along with lots of other things),
and with the patch only pdflush is being throttled.
These patches are mostly unchanged from Chris Lameter's original
changelist posted previously to linux-mm.
* [PATCH 1/6] cpuset write dirty map
2007-09-12 1:32 ` Ethan Solomita
@ 2007-09-12 1:36 ` Ethan Solomita
2007-09-14 23:15 ` Andrew Morton
2007-09-12 1:38 ` [PATCH 2/6] cpuset write pdflush nodemask Ethan Solomita
` (4 subsequent siblings)
5 siblings, 1 reply; 41+ messages in thread
From: Ethan Solomita @ 2007-09-12 1:36 UTC (permalink / raw)
To: Ethan Solomita; +Cc: Andrew Morton, linux-mm, LKML, Christoph Lameter
[-- Attachment #1: Type: text/plain, Size: 1 bytes --]
[-- Attachment #2: 1.txt --]
[-- Type: text/plain, Size: 10480 bytes --]
Add a dirty map to struct address_space
In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for applications
that are constrained to a portion of the memory of the system as required by
cpusets.
This patch implements the management of dirty node maps for an address
space through the following functions:
cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes
cpuset_update_dirty_nodes(mapping, page) Record a node in the dirty nodes map
cpuset_init_dirty_nodes(mapping) First time init of the map
The dirty map may be stored either directly in the mapping (for NUMA
systems with fewer than BITS_PER_LONG nodes) or separately allocated
for systems with a large number of nodes (e.g. IA64 with 1024 nodes).
Updating the dirty map may involve allocating it first for large
configurations. Therefore we protect the allocation and setting
of a node in the map through the tree_lock. The tree_lock is
already taken when a page is dirtied so there is no additional
locking overhead if we insert the updating of the nodemask there.
The dirty map is only cleared (or freed) when the inode is cleared.
At that point no pages are attached to the inode anymore and therefore it can
be done without any locking. The dirty map therefore records all nodes that
have been used for dirty pages by that inode until the inode is no longer
used.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c
--- 0/fs/buffer.c 2007-09-11 14:35:58.000000000 -0700
+++ 1/fs/buffer.c 2007-09-11 14:36:24.000000000 -0700
@@ -41,6 +41,7 @@
#include <linux/bitops.h>
#include <linux/mpage.h>
#include <linux/bit_spinlock.h>
+#include <linux/cpuset.h>
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
@@ -723,6 +724,7 @@ static int __set_page_dirty(struct page
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
+ cpuset_update_dirty_nodes(mapping, page);
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c
--- 0/fs/fs-writeback.c 2007-09-11 14:35:58.000000000 -0700
+++ 1/fs/fs-writeback.c 2007-09-11 14:36:24.000000000 -0700
@@ -22,6 +22,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
+#include <linux/cpuset.h>
#include "internal.h"
int sysctl_inode_debug __read_mostly;
@@ -476,6 +477,12 @@ int generic_sync_sb_inodes(struct super_
continue; /* blockdev has wrong queue */
}
+ if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
+ /* No pages on the nodes under writeback */
+ list_move(&inode->i_list, &sb->s_dirty);
+ continue;
+ }
+
/* Was this inode dirtied after sync_sb_inodes was called? */
if (time_after(inode->dirtied_when, start))
break;
diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c
--- 0/fs/inode.c 2007-09-11 14:35:58.000000000 -0700
+++ 1/fs/inode.c 2007-09-11 14:36:24.000000000 -0700
@@ -22,6 +22,7 @@
#include <linux/bootmem.h>
#include <linux/inotify.h>
#include <linux/mount.h>
+#include <linux/cpuset.h>
/*
* This is needed for the following functions:
@@ -157,6 +158,7 @@ static struct inode *alloc_inode(struct
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
+ cpuset_init_dirty_nodes(mapping);
/*
* If the block_device provides a backing_dev_info for client
@@ -264,6 +266,7 @@ void clear_inode(struct inode *inode)
bd_forget(inode);
if (S_ISCHR(inode->i_mode) && inode->i_cdev)
cd_forget(inode);
+ cpuset_clear_dirty_nodes(inode->i_mapping);
inode->i_state = I_CLEAR;
}
diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h
--- 0/include/linux/cpuset.h 2007-09-11 14:35:58.000000000 -0700
+++ 1/include/linux/cpuset.h 2007-09-11 14:36:24.000000000 -0700
@@ -77,6 +77,45 @@ extern void cpuset_track_online_nodes(vo
extern int current_cpuset_is_being_rebound(void);
+/*
+ * We need macros since struct address_space is not defined yet
+ */
+#if MAX_NUMNODES <= BITS_PER_LONG
+#define cpuset_update_dirty_nodes(__mapping, __page) \
+ do { \
+ int node = page_to_nid(__page); \
+ if (!node_isset(node, (__mapping)->dirty_nodes)) \
+ node_set(node, (__mapping)->dirty_nodes); \
+ } while (0)
+
+#define cpuset_clear_dirty_nodes(__mapping) \
+ (__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_init_dirty_nodes(__mapping) \
+ (__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr) \
+ (!(__nodemask_ptr) || \
+ nodes_intersects((__mapping)->dirty_nodes, \
+ *(__nodemask_ptr)))
+
+#else
+
+#define cpuset_init_dirty_nodes(__mapping) \
+ (__mapping)->dirty_nodes = NULL
+
+struct address_space;
+
+extern void cpuset_update_dirty_nodes(struct address_space *a,
+ struct page *p);
+
+extern void cpuset_clear_dirty_nodes(struct address_space *a);
+
+extern int cpuset_intersects_dirty_nodes(struct address_space *a,
+ nodemask_t *mask);
+
+#endif
+
#else /* !CONFIG_CPUSETS */
static inline int cpuset_init_early(void) { return 0; }
@@ -155,6 +194,26 @@ static inline int current_cpuset_is_bein
return 0;
}
+struct address_space;
+
+static inline void cpuset_update_dirty_nodes(struct address_space *a,
+ struct page *p) {}
+
+static inline void cpuset_clear_dirty_nodes(struct address_space *a) {}
+
+static inline void cpuset_init_dirty_nodes(struct address_space *a) {}
+
+static inline int cpuset_dirty_node_set(struct inode *i, int node)
+{
+ return 1;
+}
+
+static inline int cpuset_intersects_dirty_nodes(struct address_space *a,
+ nodemask_t *n)
+{
+ return 1;
+}
+
#endif /* !CONFIG_CPUSETS */
#endif /* _LINUX_CPUSET_H */
diff -uprN -X 0/Documentation/dontdiff 0/include/linux/fs.h 1/include/linux/fs.h
--- 0/include/linux/fs.h 2007-09-11 14:35:58.000000000 -0700
+++ 1/include/linux/fs.h 2007-09-11 14:36:24.000000000 -0700
@@ -516,6 +516,13 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+#ifdef CONFIG_CPUSETS
+#if MAX_NUMNODES <= BITS_PER_LONG
+ nodemask_t dirty_nodes; /* nodes with dirty pages */
+#else
+ nodemask_t *dirty_nodes; /* pointer to map if dirty */
+#endif
+#endif
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
--- 0/include/linux/writeback.h 2007-09-11 14:35:58.000000000 -0700
+++ 1/include/linux/writeback.h 2007-09-11 14:37:46.000000000 -0700
@@ -62,6 +62,7 @@ struct writeback_control {
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
void *fs_private; /* For use by ->writepages() */
+ nodemask_t *nodes; /* Set of nodes of interest */
};
/*
diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
--- 0/kernel/cpuset.c 2007-09-11 14:35:58.000000000 -0700
+++ 1/kernel/cpuset.c 2007-09-11 14:36:24.000000000 -0700
@@ -4,7 +4,7 @@
* Processor and Memory placement constraints for sets of tasks.
*
* Copyright (C) 2003 BULL SA.
- * Copyright (C) 2004-2006 Silicon Graphics, Inc.
+ * Copyright (C) 2004-2007 Silicon Graphics, Inc.
* Copyright (C) 2006 Google, Inc
*
* Portions derived from Patrick Mochel's sysfs code.
@@ -14,6 +14,7 @@
* 2003-10-22 Updates by Stephen Hemminger.
* 2004 May-July Rework by Paul Jackson.
* 2006 Rework by Paul Menage to use generic containers
+ * 2007 Cpuset writeback by Christoph Lameter.
*
* This file is subject to the terms and conditions of the GNU General Public
* License. See the file COPYING in the main directory of the Linux
@@ -1754,6 +1755,63 @@ int cpuset_mem_spread_node(void)
}
EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
+#if MAX_NUMNODES > BITS_PER_LONG
+
+/*
+ * Special functions for NUMA systems with a large number of nodes.
+ * The nodemask is pointed to from the address space structures.
+ * The attachment of the dirty_node mask is protected by the
+ * tree_lock. The nodemask is freed only when the inode is cleared
+ * (and therefore unused, thus no locking necessary).
+ */
+void cpuset_update_dirty_nodes(struct address_space *mapping,
+ struct page *page)
+{
+ nodemask_t *nodes = mapping->dirty_nodes;
+ int node = page_to_nid(page);
+
+ if (!nodes) {
+ nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
+ if (!nodes)
+ return;
+
+ *nodes = NODE_MASK_NONE;
+ mapping->dirty_nodes = nodes;
+ }
+
+ if (!node_isset(node, *nodes))
+ node_set(node, *nodes);
+}
+
+void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+ nodemask_t *nodes = mapping->dirty_nodes;
+
+ if (nodes) {
+ mapping->dirty_nodes = NULL;
+ kfree(nodes);
+ }
+}
+
+/*
+ * Called without the tree_lock. The nodemask is only freed when the inode
+ * is cleared and therefore this is safe.
+ */
+int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+ nodemask_t *mask)
+{
+ nodemask_t *dirty_nodes = mapping->dirty_nodes;
+
+ if (!mask)
+ return 1;
+
+ if (!dirty_nodes)
+ return 0;
+
+ return nodes_intersects(*dirty_nodes, *mask);
+}
+#endif
+
/**
* cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
* @p: pointer to task_struct of some other task.
diff -uprN -X 0/Documentation/dontdiff 0/mm/page-writeback.c 1/mm/page-writeback.c
--- 0/mm/page-writeback.c 2007-09-11 14:35:58.000000000 -0700
+++ 1/mm/page-writeback.c 2007-09-11 14:36:24.000000000 -0700
@@ -33,6 +33,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <linux/cpuset.h>
/*
* The maximum number of pages to writeout in a single bdflush/kupdate
@@ -832,6 +833,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
+ cpuset_update_dirty_nodes(mapping, page);
write_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 2/6] cpuset write pdflush nodemask
2007-09-12 1:32 ` Ethan Solomita
2007-09-12 1:36 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
@ 2007-09-12 1:38 ` Ethan Solomita
2007-09-12 1:39 ` [PATCH 3/6] cpuset write throttle Ethan Solomita
` (3 subsequent siblings)
5 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-09-12 1:38 UTC (permalink / raw)
To: Ethan Solomita; +Cc: Andrew Morton, linux-mm, LKML, Christoph Lameter
If we want to support nodeset-specific writeout then we need a way
to communicate the set of nodes that an operation should affect.
So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.
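For illustration (sketch only; the actual call-site conversions are in
the hunks below, and the cpuset-aware callers are added by the later
patches in this series):
	/* Callers that do not care about placement keep the old behaviour: */
	wakeup_pdflush(nr_pages, NULL);
	pdflush_operation(background_writeout, 0, NULL);
	/* A cpuset-aware caller restricts writeout to its own nodes: */
	wakeup_pdflush(nr_pages, &cpuset_current_mems_allowed);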
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c
--- 1/fs/buffer.c 2007-09-11 14:36:24.000000000 -0700
+++ 2/fs/buffer.c 2007-09-11 14:39:22.000000000 -0700
@@ -372,7 +372,7 @@ static void free_more_memory(void)
struct zone **zones;
pg_data_t *pgdat;
- wakeup_pdflush(1024);
+ wakeup_pdflush(1024, NULL);
yield();
for_each_online_pgdat(pgdat) {
diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c
--- 1/fs/super.c 2007-09-11 14:36:05.000000000 -0700
+++ 2/fs/super.c 2007-09-11 14:39:22.000000000 -0700
@@ -616,7 +616,7 @@ int do_remount_sb(struct super_block *sb
return 0;
}
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
{
struct super_block *sb;
@@ -644,7 +644,7 @@ static void do_emergency_remount(unsigne
void emergency_remount(void)
{
- pdflush_operation(do_emergency_remount, 0);
+ pdflush_operation(do_emergency_remount, 0, NULL);
}
/*
diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c
--- 1/fs/sync.c 2007-09-11 14:36:05.000000000 -0700
+++ 2/fs/sync.c 2007-09-11 14:39:22.000000000 -0700
@@ -21,9 +21,9 @@
* sync everything. Start out by waking pdflush, because that writes back
* all queues in parallel.
*/
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
{
- wakeup_pdflush(0);
+ wakeup_pdflush(0, NULL);
sync_inodes(0); /* All mappings, inodes and their blockdevs */
DQUOT_SYNC(NULL);
sync_supers(); /* Write the superblocks */
@@ -38,13 +38,13 @@ static void do_sync(unsigned long wait)
asmlinkage long sys_sync(void)
{
- do_sync(1);
+ do_sync(1, NULL);
return 0;
}
void emergency_sync(void)
{
- pdflush_operation(do_sync, 0);
+ pdflush_operation(do_sync, 0, NULL);
}
/*
diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h
--- 1/include/linux/writeback.h 2007-09-11 14:37:46.000000000 -0700
+++ 2/include/linux/writeback.h 2007-09-11 14:39:22.000000000 -0700
@@ -91,7 +91,7 @@ static inline void inode_sync_wait(struc
/*
* mm/page-writeback.c
*/
-int wakeup_pdflush(long nr_pages);
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(gfp_t gfp_mask);
@@ -122,7 +122,8 @@ balance_dirty_pages_ratelimited(struct a
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
void *data);
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes),
+ unsigned long arg0, nodemask_t *nodes);
int generic_writepages(struct address_space *mapping,
struct writeback_control *wbc);
int write_cache_pages(struct address_space *mapping,
diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c
--- 1/mm/page-writeback.c 2007-09-11 14:36:24.000000000 -0700
+++ 2/mm/page-writeback.c 2007-09-11 14:39:22.000000000 -0700
@@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode);
/* End of sysctl-exported parameters */
-static void background_writeout(unsigned long _min_pages);
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
/*
* Work out the current dirty-memory clamping and background writeout
@@ -272,7 +272,7 @@ static void balance_dirty_pages(struct a
*/
if ((laptop_mode && pages_written) ||
(!laptop_mode && (nr_reclaimable > background_thresh)))
- pdflush_operation(background_writeout, 0);
+ pdflush_operation(background_writeout, 0, NULL);
}
void set_page_dirty_balance(struct page *page)
@@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
* writeback at least _min_pages, and keep writing until the amount of dirty
* memory is less than the background threshold, or until we're all clean.
*/
-static void background_writeout(unsigned long _min_pages)
+static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
{
long min_pages = _min_pages;
struct writeback_control wbc = {
@@ -402,12 +402,12 @@ static void background_writeout(unsigned
* the whole world. Returns 0 if a pdflush thread was dispatched. Returns
* -1 if all pdflush threads were busy.
*/
-int wakeup_pdflush(long nr_pages)
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes)
{
if (nr_pages == 0)
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- return pdflush_operation(background_writeout, nr_pages);
+ return pdflush_operation(background_writeout, nr_pages, nodes);
}
static void wb_timer_fn(unsigned long unused);
@@ -431,7 +431,7 @@ static DEFINE_TIMER(laptop_mode_wb_timer
* older_than_this takes precedence over nr_to_write. So we'll only write back
* all dirty pages if they are all attached to "old" mappings.
*/
-static void wb_kupdate(unsigned long arg)
+static void wb_kupdate(unsigned long arg, nodemask_t *unused)
{
unsigned long oldest_jif;
unsigned long start_jif;
@@ -489,18 +489,18 @@ int dirty_writeback_centisecs_handler(ct
static void wb_timer_fn(unsigned long unused)
{
- if (pdflush_operation(wb_kupdate, 0) < 0)
+ if (pdflush_operation(wb_kupdate, 0, NULL) < 0)
mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
}
-static void laptop_flush(unsigned long unused)
+static void laptop_flush(unsigned long unused, nodemask_t *unused2)
{
sys_sync();
}
static void laptop_timer_fn(unsigned long unused)
{
- pdflush_operation(laptop_flush, 0);
+ pdflush_operation(laptop_flush, 0, NULL);
}
/*
diff -uprN -X 0/Documentation/dontdiff 1/mm/pdflush.c 2/mm/pdflush.c
--- 1/mm/pdflush.c 2007-09-11 14:36:06.000000000 -0700
+++ 2/mm/pdflush.c 2007-09-11 14:39:22.000000000 -0700
@@ -83,10 +83,12 @@ static unsigned long last_empty_jifs;
*/
struct pdflush_work {
struct task_struct *who; /* The thread */
- void (*fn)(unsigned long); /* A callback function */
+ void (*fn)(unsigned long, nodemask_t *); /* A callback function */
unsigned long arg0; /* An argument to the callback */
struct list_head list; /* On pdflush_list, when idle */
unsigned long when_i_went_to_sleep;
+ int have_nodes; /* Nodes were specified */
+ nodemask_t nodes; /* Nodes of interest */
};
static int __pdflush(struct pdflush_work *my_work)
@@ -124,7 +126,8 @@ static int __pdflush(struct pdflush_work
}
spin_unlock_irq(&pdflush_lock);
- (*my_work->fn)(my_work->arg0);
+ (*my_work->fn)(my_work->arg0,
+ my_work->have_nodes ? &my_work->nodes : NULL);
/*
* Thread creation: For how long have there been zero
@@ -198,7 +201,8 @@ static int pdflush(void *dummy)
* Returns zero if it indeed managed to find a worker thread, and passed your
* payload to it.
*/
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *),
+ unsigned long arg0, nodemask_t *nodes)
{
unsigned long flags;
int ret = 0;
@@ -218,6 +222,11 @@ int pdflush_operation(void (*fn)(unsigne
last_empty_jifs = jiffies;
pdf->fn = fn;
pdf->arg0 = arg0;
+ if (nodes) {
+ pdf->nodes = *nodes;
+ pdf->have_nodes = 1;
+ } else
+ pdf->have_nodes = 0;
wake_up_process(pdf->who);
spin_unlock_irqrestore(&pdflush_lock, flags);
}
diff -uprN -X 0/Documentation/dontdiff 1/mm/vmscan.c 2/mm/vmscan.c
--- 1/mm/vmscan.c 2007-09-11 14:36:06.000000000 -0700
+++ 2/mm/vmscan.c 2007-09-11 14:41:41.000000000 -0700
@@ -1301,7 +1301,7 @@ unsigned long do_try_to_free_pages(struc
*/
if (total_scanned > sc->swap_cluster_max +
sc->swap_cluster_max / 2) {
- wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+ wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
sc->may_writepage = 1;
}
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 3/6] cpuset write throttle
2007-09-12 1:32 ` Ethan Solomita
2007-09-12 1:36 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
2007-09-12 1:38 ` [PATCH 2/6] cpuset write pdflush nodemask Ethan Solomita
@ 2007-09-12 1:39 ` Ethan Solomita
[not found] ` <20070914161517.5ea3847f.akpm@linux-foundation.org>
2007-09-12 1:40 ` [PATCH 4/6] cpuset write vmscan Ethan Solomita
` (2 subsequent siblings)
5 siblings, 1 reply; 41+ messages in thread
From: Ethan Solomita @ 2007-09-12 1:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, LKML, Christoph Lameter
Make page writeback obey cpuset constraints
Currently dirty throttling does not work properly in a cpuset.
If e.g. a cpuset contains only 1/10th of available memory then all of the
memory of that cpuset can be dirtied without any writes being triggered.
If all of the cpuset's memory is dirty then only 10% of total memory is dirty.
The background writeback threshold is usually set at 10% and the synchronous
threshold at 40%, so we are still below the global limits while the dirty
ratio in the cpuset is 100%! Writeback throttling and background writeout
do not work at all in such scenarios.
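To make these numbers concrete (illustrative figures, not part of the
patch): on a machine with 100 GB of dirtyable memory, dirty_background_ratio
= 10 and vm_dirty_ratio = 40, a cpuset restricted to 10 GB can dirty its
entire 10 GB. Globally that is
	10 GB dirty / 100 GB dirtyable = 10%
which never exceeds the 10% background threshold and stays far below the
40% throttle threshold, so neither background writeout nor throttling is
triggered even though the cpuset itself is 100% dirty.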
This patch makes dirty writeout cpuset aware. When determining the
dirty limits in get_dirty_limits() we calculate values based on the
nodes that are reachable from the current process (that has been
dirtying the page). Then we can trigger writeout based on the
dirty ratio of the memory in the cpuset.
We trigger writeout in a cpuset-specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.
Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have
a cpuset. However, please remember that the global counters were only
introduced recently. Before 2.6.18 we added up per-processor
counters for each processor on every invocation of get_dirty_limits().
We now add up per-node information instead, which should be equal or
lesser effort since there are fewer nodes than processors.
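In sketch form, the additional work per call is a single pass over the
nodes of the cpuset (condensed from the get_dirty_limits() hunk below;
the highmem correction and the other counters are omitted):
	if (nodes && !nodes_subset(node_online_map, *nodes)) {
		/* Per-cpuset limits: sum the per-node counters. */
		for_each_node_mask(node, *nodes)
			if (node_online(node))
				dl->nr_dirty +=
					node_page_state(node, NR_FILE_DIRTY);
	} else {
		/* Global limits: use the pre-summed counters as before. */
		dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
	}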
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c
--- 2/mm/page-writeback.c 2007-09-11 14:39:22.000000000 -0700
+++ 3/mm/page-writeback.c 2007-09-11 14:49:35.000000000 -0700
@@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode);
static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
+struct dirty_limits {
+ long thresh_background;
+ long thresh_dirty;
+ unsigned long nr_dirty;
+ unsigned long nr_unstable;
+ unsigned long nr_writeback;
+};
+
/*
* Work out the current dirty-memory clamping and background writeout
* thresholds.
@@ -121,16 +129,20 @@ static void background_writeout(unsigned
* clamping level.
*/
-static unsigned long highmem_dirtyable_memory(unsigned long total)
+static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total)
{
#ifdef CONFIG_HIGHMEM
int node;
unsigned long x = 0;
+ if (nodes == NULL)
+ nodes = &node_online_map;
for_each_node_state(node, N_HIGH_MEMORY) {
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+ if (!node_isset(node, *nodes))
+ continue;
x += zone_page_state(z, NR_FREE_PAGES)
+ zone_page_state(z, NR_INACTIVE)
+ zone_page_state(z, NR_ACTIVE);
@@ -154,26 +166,74 @@ static unsigned long determine_dirtyable
x = global_page_state(NR_FREE_PAGES)
+ global_page_state(NR_INACTIVE)
+ global_page_state(NR_ACTIVE);
- x -= highmem_dirtyable_memory(x);
+ x -= highmem_dirtyable_memory(NULL, x);
return x + 1; /* Ensure that we never return 0 */
}
-static void
-get_dirty_limits(long *pbackground, long *pdirty,
- struct address_space *mapping)
+static int
+get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping,
+ nodemask_t *nodes)
{
int background_ratio; /* Percentages */
int dirty_ratio;
int unmapped_ratio;
long background;
long dirty;
- unsigned long available_memory = determine_dirtyable_memory();
+ unsigned long available_memory;
+ unsigned long nr_mapped;
struct task_struct *tsk;
+ int is_subset = 0;
- unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
- global_page_state(NR_ANON_PAGES)) * 100) /
- available_memory;
+#ifdef CONFIG_CPUSETS
+ if (unlikely(nodes &&
+ !nodes_subset(node_online_map, *nodes))) {
+ int node;
+ /*
+ * Calculate the limits relative to the current cpuset.
+ *
+ * We do not disregard highmem because all nodes (except
+ * maybe node 0) have either all memory in HIGHMEM (32 bit) or
+ * all memory in non HIGHMEM (64 bit). If we would disregard
+ * highmem then cpuset throttling would not work on 32 bit.
+ */
+ is_subset = 1;
+ memset(dl, 0, sizeof(struct dirty_limits));
+ available_memory = 0;
+ nr_mapped = 0;
+ for_each_node_mask(node, *nodes) {
+ if (!node_online(node))
+ continue;
+ dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY);
+ dl->nr_unstable +=
+ node_page_state(node, NR_UNSTABLE_NFS);
+ dl->nr_writeback +=
+ node_page_state(node, NR_WRITEBACK);
+ available_memory +=
+ node_page_state(node, NR_ACTIVE) +
+ node_page_state(node, NR_INACTIVE) +
+ node_page_state(node, NR_FREE_PAGES);
+ nr_mapped += node_page_state(node, NR_FILE_MAPPED) +
+ node_page_state(node, NR_ANON_PAGES);
+ }
+ available_memory -= highmem_dirtyable_memory(nodes,
+ available_memory);
+ /* Ensure that we return >= 0 */
+ if (available_memory <= 0)
+ available_memory = 1;
+ } else
+#endif
+ {
+ /* Global limits */
+ dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
+ dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+ dl->nr_writeback = global_page_state(NR_WRITEBACK);
+ available_memory = determine_dirtyable_memory();
+ nr_mapped = global_page_state(NR_FILE_MAPPED) +
+ global_page_state(NR_ANON_PAGES);
+ }
+
+ unmapped_ratio = 100 - (nr_mapped * 100 / available_memory);
dirty_ratio = vm_dirty_ratio;
if (dirty_ratio > unmapped_ratio / 2)
dirty_ratio = unmapped_ratio / 2;
@@ -192,8 +252,9 @@ get_dirty_limits(long *pbackground, long
background += background / 4;
dirty += dirty / 4;
}
- *pbackground = background;
- *pdirty = dirty;
+ dl->thresh_background = background;
+ dl->thresh_dirty = dirty;
+ return is_subset;
}
/*
@@ -206,8 +267,7 @@ get_dirty_limits(long *pbackground, long
static void balance_dirty_pages(struct address_space *mapping)
{
long nr_reclaimable;
- long background_thresh;
- long dirty_thresh;
+ struct dirty_limits dl;
unsigned long pages_written = 0;
unsigned long write_chunk = sync_writeback_pages();
@@ -222,11 +282,12 @@ static void balance_dirty_pages(struct a
.range_cyclic = 1,
};
- get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
- dirty_thresh)
+ if (get_dirty_limits(&dl, mapping,
+ &cpuset_current_mems_allowed))
+ wbc.nodes = &cpuset_current_mems_allowed;
+ nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+ if (nr_reclaimable + dl.nr_writeback <=
+ dl.thresh_dirty)
break;
if (!dirty_exceeded)
@@ -240,13 +301,10 @@ static void balance_dirty_pages(struct a
*/
if (nr_reclaimable) {
writeback_inodes(&wbc);
- get_dirty_limits(&background_thresh,
- &dirty_thresh, mapping);
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- if (nr_reclaimable +
- global_page_state(NR_WRITEBACK)
- <= dirty_thresh)
+ get_dirty_limits(&dl, mapping,
+ &cpuset_current_mems_allowed);
+ nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+ if (nr_reclaimable + dl.nr_writeback <= dl.thresh_dirty)
break;
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
@@ -255,8 +313,8 @@ static void balance_dirty_pages(struct a
congestion_wait(WRITE, HZ/10);
}
- if (nr_reclaimable + global_page_state(NR_WRITEBACK)
- <= dirty_thresh && dirty_exceeded)
+ if (nr_reclaimable + dl.nr_writeback
+ <= dl.thresh_dirty && dirty_exceeded)
dirty_exceeded = 0;
if (writeback_in_progress(bdi))
@@ -271,8 +329,9 @@ static void balance_dirty_pages(struct a
* background_thresh, to keep the amount of dirty memory low.
*/
if ((laptop_mode && pages_written) ||
- (!laptop_mode && (nr_reclaimable > background_thresh)))
- pdflush_operation(background_writeout, 0, NULL);
+ (!laptop_mode && (nr_reclaimable > dl.thresh_background)))
+ pdflush_operation(background_writeout, 0,
+ &cpuset_current_mems_allowed);
}
void set_page_dirty_balance(struct page *page)
@@ -329,8 +388,7 @@ EXPORT_SYMBOL(balance_dirty_pages_rateli
void throttle_vm_writeout(gfp_t gfp_mask)
{
- long background_thresh;
- long dirty_thresh;
+ struct dirty_limits dl;
if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
/*
@@ -342,27 +400,26 @@ void throttle_vm_writeout(gfp_t gfp_mask
return;
}
- for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+ for ( ; ; ) {
+ get_dirty_limits(&dl, NULL, &node_online_map);
+
+ /*
+ * Boost the allowable dirty threshold a bit for page
+ * allocators so they don't get DoS'ed by heavy writers
+ */
+ dl.thresh_dirty += dl.thresh_dirty / 10; /* wheeee... */
- /*
- * Boost the allowable dirty threshold a bit for page
- * allocators so they don't get DoS'ed by heavy writers
- */
- dirty_thresh += dirty_thresh / 10; /* wheeee... */
-
- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
- congestion_wait(WRITE, HZ/10);
- }
+ if (dl.nr_unstable + dl.nr_writeback <= dl.thresh_dirty)
+ break;
+ congestion_wait(WRITE, HZ/10);
+ }
}
/*
* writeback at least _min_pages, and keep writing until the amount of dirty
* memory is less than the background threshold, or until we're all clean.
*/
-static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes)
{
long min_pages = _min_pages;
struct writeback_control wbc = {
@@ -375,12 +432,11 @@ static void background_writeout(unsigned
};
for ( ; ; ) {
- long background_thresh;
- long dirty_thresh;
+ struct dirty_limits dl;
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
- if (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) < background_thresh
+ if (get_dirty_limits(&dl, NULL, nodes))
+ wbc.nodes = nodes;
+ if (dl.nr_dirty + dl.nr_unstable < dl.thresh_background
&& min_pages <= 0)
break;
wbc.encountered_congestion = 0;
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 4/6] cpuset write vmscan
2007-09-12 1:32 ` Ethan Solomita
` (2 preceding siblings ...)
2007-09-12 1:39 ` [PATCH 3/6] cpuset write throttle Ethan Solomita
@ 2007-09-12 1:40 ` Ethan Solomita
2007-09-12 1:41 ` [PATCH 5/6] cpuset write vm writeout Ethan Solomita
2007-09-12 1:42 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
5 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-09-12 1:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, LKML, Christoph Lameter
Direct reclaim: cpuset aware writeout
During direct reclaim we traverse down a zonelist and carefully
check each zone to see whether it is a member of the active cpuset. But then
we call pdflush without enforcing the same restrictions. On a larger system this
may have the effect of a massive number of pages being dirtied and then either
A. No writeout occurs because global dirty limits have not been reached
or
B. Writeout starts randomly for some dirty inode in the system. Pdflush
may just write out data for nodes in another cpuset and miss doing
proper dirty handling for the current cpuset.
In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.
Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c 2007-09-11 14:41:56.000000000 -0700
+++ 4/mm/vmscan.c 2007-09-11 14:50:41.000000000 -0700
@@ -1301,7 +1301,8 @@ unsigned long do_try_to_free_pages(struc
*/
if (total_scanned > sc->swap_cluster_max +
sc->swap_cluster_max / 2) {
- wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+ wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+ &cpuset_current_mems_allowed);
sc->may_writepage = 1;
}
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 5/6] cpuset write vm writeout
2007-09-12 1:32 ` Ethan Solomita
` (3 preceding siblings ...)
2007-09-12 1:40 ` [PATCH 4/6] cpuset write vmscan Ethan Solomita
@ 2007-09-12 1:41 ` Ethan Solomita
2007-09-12 1:42 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
5 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-09-12 1:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, LKML, Christoph Lameter
Throttle VM writeout in a cpuset aware way
This bases the VM throttling done from the reclaim path on the dirty ratio
of the cpuset. Note that the cpuset restriction is only effective if shrink_zone
is called from direct reclaim.
kswapd has a cpuset context that includes the whole machine, so VM throttling
will only work during synchronous reclaim and not from kswapd.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h 2007-09-11 14:49:47.000000000 -0700
+++ 5/include/linux/writeback.h 2007-09-11 14:50:52.000000000 -0700
@@ -94,7 +94,7 @@ static inline void inode_sync_wait(struc
int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
void laptop_io_completion(void);
void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask);
/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c 2007-09-11 14:49:47.000000000 -0700
+++ 5/mm/page-writeback.c 2007-09-11 14:50:52.000000000 -0700
@@ -386,7 +386,7 @@ void balance_dirty_pages_ratelimited_nr(
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
{
struct dirty_limits dl;
@@ -401,7 +401,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
}
for ( ; ; ) {
- get_dirty_limits(&dl, NULL, &node_online_map);
+ get_dirty_limits(&dl, NULL, nodes);
/*
* Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c 2007-09-11 14:50:41.000000000 -0700
+++ 5/mm/vmscan.c 2007-09-11 14:50:52.000000000 -0700
@@ -1185,7 +1185,7 @@ static unsigned long shrink_zone(int pri
}
}
- throttle_vm_writeout(sc->gfp_mask);
+ throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);
atomic_dec(&zone->reclaim_in_progress);
return nr_reclaimed;
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 6/6] cpuset dirty limits
2007-09-12 1:32 ` Ethan Solomita
` (4 preceding siblings ...)
2007-09-12 1:41 ` [PATCH 5/6] cpuset write vm writeout Ethan Solomita
@ 2007-09-12 1:42 ` Ethan Solomita
2007-09-14 23:15 ` Andrew Morton
5 siblings, 1 reply; 41+ messages in thread
From: Ethan Solomita @ 2007-09-12 1:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, LKML, Christoph Lameter
Per cpuset dirty ratios
This implements dirty ratios per cpuset. Two new files are added
to the cpuset directories:
background_dirty_ratio Percentage at which background writeback starts
throttle_dirty_ratio Percentage at which the application is throttled
and we start synchronous writeout.
Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio)
are used for a cpuset.
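As a usage illustration (the mount point and cpuset name below are
examples only, not part of the patch): with the cpuset filesystem
mounted at /dev/cpuset, writing "5" to
/dev/cpuset/batch/background_dirty_ratio and "20" to
/dev/cpuset/batch/throttle_dirty_ratio starts background writeback for
that cpuset once 5% of its memory is dirty and throttles its writers at
20% dirty, while writing "-1" to either file falls back to the
corresponding global limit.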
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 5/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 5/include/linux/cpuset.h 2007-09-11 14:50:48.000000000 -0700
+++ 7/include/linux/cpuset.h 2007-09-11 14:51:12.000000000 -0700
@@ -77,6 +77,7 @@ extern void cpuset_track_online_nodes(vo
extern int current_cpuset_is_being_rebound(void);
+extern void cpuset_get_current_ratios(int *background, int *ratio);
/*
* We need macros since struct address_space is not defined yet
*/
diff -uprN -X 0/Documentation/dontdiff 5/kernel/cpuset.c 7/kernel/cpuset.c
--- 5/kernel/cpuset.c 2007-09-11 14:50:49.000000000 -0700
+++ 7/kernel/cpuset.c 2007-09-11 14:56:18.000000000 -0700
@@ -51,6 +51,7 @@
#include <linux/time.h>
#include <linux/backing-dev.h>
#include <linux/sort.h>
+#include <linux/writeback.h>
#include <asm/uaccess.h>
#include <asm/atomic.h>
@@ -92,6 +93,9 @@ struct cpuset {
int mems_generation;
struct fmeter fmeter; /* memory_pressure filter */
+
+ int background_dirty_ratio;
+ int throttle_dirty_ratio;
};
/* Retrieve the cpuset for a container */
@@ -169,6 +173,8 @@ static struct cpuset top_cpuset = {
.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
.cpus_allowed = CPU_MASK_ALL,
.mems_allowed = NODE_MASK_ALL,
+ .background_dirty_ratio = -1,
+ .throttle_dirty_ratio = -1,
};
/*
@@ -785,6 +791,21 @@ static int update_flag(cpuset_flagbits_t
return 0;
}
+static int update_int(int *cs_int, char *buf, int min, int max)
+{
+ char *endp;
+ int val;
+
+ val = simple_strtol(buf, &endp, 10);
+ if (val < min || val > max)
+ return -EINVAL;
+
+ mutex_lock(&callback_mutex);
+ *cs_int = val;
+ mutex_unlock(&callback_mutex);
+ return 0;
+}
+
/*
* Frequency meter - How fast is some event occurring?
*
@@ -933,6 +954,8 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_THROTTLE_DIRTY_RATIO,
+ FILE_BACKGROUND_DIRTY_RATIO,
} cpuset_filetype_t;
static ssize_t cpuset_common_file_write(struct container *cont,
@@ -997,6 +1020,12 @@ static ssize_t cpuset_common_file_write(
retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
cs->mems_generation = cpuset_mems_generation++;
break;
+ case FILE_BACKGROUND_DIRTY_RATIO:
+ retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100);
+ break;
+ case FILE_THROTTLE_DIRTY_RATIO:
+ retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100);
+ break;
default:
retval = -EINVAL;
goto out2;
@@ -1090,6 +1119,12 @@ static ssize_t cpuset_common_file_read(s
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
+ case FILE_BACKGROUND_DIRTY_RATIO:
+ s += sprintf(s, "%d", cs->background_dirty_ratio);
+ break;
+ case FILE_THROTTLE_DIRTY_RATIO:
+ s += sprintf(s, "%d", cs->throttle_dirty_ratio);
+ break;
default:
retval = -EINVAL;
goto out;
@@ -1173,6 +1208,20 @@ static struct cftype cft_spread_slab = {
.private = FILE_SPREAD_SLAB,
};
+static struct cftype cft_background_dirty_ratio = {
+ .name = "background_dirty_ratio",
+ .read = cpuset_common_file_read,
+ .write = cpuset_common_file_write,
+ .private = FILE_BACKGROUND_DIRTY_RATIO,
+};
+
+static struct cftype cft_throttle_dirty_ratio = {
+ .name = "throttle_dirty_ratio",
+ .read = cpuset_common_file_read,
+ .write = cpuset_common_file_write,
+ .private = FILE_THROTTLE_DIRTY_RATIO,
+};
+
static int cpuset_populate(struct container_subsys *ss, struct container *cont)
{
int err;
@@ -1193,6 +1242,10 @@ static int cpuset_populate(struct contai
return err;
if ((err = container_add_file(cont, ss, &cft_spread_slab)) < 0)
return err;
+ if ((err = container_add_file(cont, ss, &cft_background_dirty_ratio)) < 0)
+ return err;
+ if ((err = container_add_file(cont, ss, &cft_throttle_dirty_ratio)) < 0)
+ return err;
/* memory_pressure_enabled is in root cpuset only */
if (err == 0 && !cont->parent)
err = container_add_file(cont, ss,
@@ -1272,6 +1325,8 @@ static struct container_subsys_state *cp
cs->mems_allowed = NODE_MASK_NONE;
cs->mems_generation = cpuset_mems_generation++;
fmeter_init(&cs->fmeter);
+ cs->background_dirty_ratio = parent->background_dirty_ratio;
+ cs->throttle_dirty_ratio = parent->throttle_dirty_ratio;
cs->parent = parent;
number_of_cpusets++;
@@ -1755,8 +1810,30 @@ int cpuset_mem_spread_node(void)
}
EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
-#if MAX_NUMNODES > BITS_PER_LONG
+/*
+ * Determine the dirty ratios for the currently active cpuset
+ */
+void cpuset_get_current_ratios(int *background_ratio, int *throttle_ratio)
+{
+ int background = -1;
+ int throttle = -1;
+ struct task_struct *tsk = current;
+
+ task_lock(tsk);
+ background = task_cs(tsk)->background_dirty_ratio;
+ throttle = task_cs(tsk)->throttle_dirty_ratio;
+ task_unlock(tsk);
+ if (background == -1)
+ background = dirty_background_ratio;
+ if (throttle == -1)
+ throttle = vm_dirty_ratio;
+
+ *background_ratio = background;
+ *throttle_ratio = throttle;
+}
+
+#if MAX_NUMNODES > BITS_PER_LONG
/*
* Special functions for NUMA systems with a large number of nodes.
* The nodemask is pointed to from the address space structures.
diff -uprN -X 0/Documentation/dontdiff 5/mm/page-writeback.c 7/mm/page-writeback.c
--- 5/mm/page-writeback.c 2007-09-11 14:50:52.000000000 -0700
+++ 7/mm/page-writeback.c 2007-09-11 14:51:12.000000000 -0700
@@ -221,6 +221,7 @@ get_dirty_limits(struct dirty_limits *dl
/* Ensure that we return >= 0 */
if (available_memory <= 0)
available_memory = 1;
+ cpuset_get_current_ratios(&background_ratio, &dirty_ratio);
} else
#endif
{
@@ -231,17 +232,17 @@ get_dirty_limits(struct dirty_limits *dl
available_memory = determine_dirtyable_memory();
nr_mapped = global_page_state(NR_FILE_MAPPED) +
global_page_state(NR_ANON_PAGES);
+ dirty_ratio = vm_dirty_ratio;
+ background_ratio = dirty_background_ratio;
}
unmapped_ratio = 100 - (nr_mapped * 100 / available_memory);
- dirty_ratio = vm_dirty_ratio;
if (dirty_ratio > unmapped_ratio / 2)
dirty_ratio = unmapped_ratio / 2;
if (dirty_ratio < 5)
dirty_ratio = 5;
- background_ratio = dirty_background_ratio;
if (background_ratio >= dirty_ratio)
background_ratio = dirty_ratio / 2;
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-12 1:36 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
@ 2007-09-14 23:15 ` Andrew Morton
2007-09-14 23:47 ` Satyam Sharma
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Andrew Morton @ 2007-09-14 23:15 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Christoph Lameter
On Tue, 11 Sep 2007 18:36:34 -0700
Ethan Solomita <solo@google.com> wrote:
> Add a dirty map to struct address_space
I get a tremendous number of rejects trying to wedge this stuff on top of
Peter's mm-dirty-balancing-for-tasks changes. More rejects than I am
prepared to partially-fix so that I can usefully look at these changes in
tkdiff, so this is all based on a quick peek at the diff itself..
> In a NUMA system it is helpful to know where the dirty pages of a mapping
> are located. That way we will be able to implement writeout for applications
> that are constrained to a portion of the memory of the system as required by
> cpusets.
>
> This patch implements the management of dirty node maps for an address
> space through the following functions:
>
> cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes
>
> cpuset_update_nodes(mapping, page) Record a node in the dirty nodes map
>
> cpuset_init_dirty_nodes(mapping) First time init of the map
>
>
> The dirty map may be stored either directly in the mapping (for NUMA
> systems with less than BITS_PER_LONG nodes) or separately allocated
> for systems with a large number of nodes (e.g. IA64 with 1024 nodes).
>
> Updating the dirty map may involve allocating it first for large
> configurations. Therefore we protect the allocation and setting
> of a node in the map through the tree_lock. The tree_lock is
> already taken when a page is dirtied so there is no additional
> locking overhead if we insert the updating of the nodemask there.
>
> The dirty map is only cleared (or freed) when the inode is cleared.
> At that point no pages are attached to the inode anymore and therefore it can
> be done without any locking. The dirty map therefore records all nodes that
> have been used for dirty pages by that inode until the inode is no longer
> used.
>
It'd be nice to see some discussion regarding the memory consumption of
this patch and the associated tradeoffs.
> ...
>
> +#if MAX_NUMNODES <= BITS_PER_LONG
The patch is sprinkled full of this conditional.
I don't understand why this is being done. afaict it isn't described
in a code comment (it should be) nor even in the changelogs?
Given its overall complexity and its likelihood to change in the
future, I'd suggest that this conditional be centralised in a single
place. Something like
/*
* nice comment goes here
*/
#if MAX_NUMNODES <= BITS_PER_LONG
#define CPUSET_DIRTY_LIMITS 1
#else
#define CPUSET_DIRTY_LIMITS 0
#endif
Then use #if CPUSET_DIRTY_LIMITS everywhere else.
(This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll get a
warning if someone typos '#if CPUSET_DITRY_LIMITS')
Even better would be to calculate CPUSET_DIRTY_LIMITS within Kconfig,
but I suspect you'll need to jump through unfeasible hoops to do that
sort of calculation within Kconfig.
> --- 0/include/linux/fs.h 2007-09-11 14:35:58.000000000 -0700
> +++ 1/include/linux/fs.h 2007-09-11 14:36:24.000000000 -0700
> @@ -516,6 +516,13 @@ struct address_space {
> spinlock_t private_lock; /* for use by the address_space */
> struct list_head private_list; /* ditto */
> struct address_space *assoc_mapping; /* ditto */
> +#ifdef CONFIG_CPUSETS
> +#if MAX_NUMNODES <= BITS_PER_LONG
> + nodemask_t dirty_nodes; /* nodes with dirty pages */
> +#else
> + nodemask_t *dirty_nodes; /* pointer to map if dirty */
> +#endif
> +#endif
afaict there is no code comment and no changelog text which explains the
above design decision? There should be, please.
There is talk of making cpusets available with CONFIG_SMP=n. Will this new
feature be available in that case? (it should be).
> } __attribute__((aligned(sizeof(long))));
> /*
> * On most architectures that alignment is already the case; but
> diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
> --- 0/include/linux/writeback.h 2007-09-11 14:35:58.000000000 -0700
> +++ 1/include/linux/writeback.h 2007-09-11 14:37:46.000000000 -0700
> @@ -62,6 +62,7 @@ struct writeback_control {
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> void *fs_private; /* For use by ->writepages() */
> + nodemask_t *nodes; /* Set of nodes of interest */
> };
That comment is a bit terse. It's always good to be lavish when commenting
data structures, for understanding those is key to understanding a design.
> /*
> diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
> --- 0/kernel/cpuset.c 2007-09-11 14:35:58.000000000 -0700
> +++ 1/kernel/cpuset.c 2007-09-11 14:36:24.000000000 -0700
> @@ -4,7 +4,7 @@
> * Processor and Memory placement constraints for sets of tasks.
> *
> * Copyright (C) 2003 BULL SA.
> - * Copyright (C) 2004-2006 Silicon Graphics, Inc.
> + * Copyright (C) 2004-2007 Silicon Graphics, Inc.
> * Copyright (C) 2006 Google, Inc
> *
> * Portions derived from Patrick Mochel's sysfs code.
> @@ -14,6 +14,7 @@
> * 2003-10-22 Updates by Stephen Hemminger.
> * 2004 May-July Rework by Paul Jackson.
> * 2006 Rework by Paul Menage to use generic containers
> + * 2007 Cpuset writeback by Christoph Lameter.
> *
> * This file is subject to the terms and conditions of the GNU General Public
> * License. See the file COPYING in the main directory of the Linux
> @@ -1754,6 +1755,63 @@ int cpuset_mem_spread_node(void)
> }
> EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
>
> +#if MAX_NUMNODES > BITS_PER_LONG
waah. In other places we do "MAX_NUMNODES <= BITS_PER_LONG"
> +
> +/*
> + * Special functions for NUMA systems with a large number of nodes.
> + * The nodemask is pointed to from the address space structures.
> + * The attachment of the dirty_node mask is protected by the
> + * tree_lock. The nodemask is freed only when the inode is cleared
> + * (and therefore unused, thus no locking necessary).
> + */
hmm, OK, there's a hint as to what's going on.
It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
we might want to tweak that in the future. Yet another argument for
centralising this comparison.
> +void cpuset_update_dirty_nodes(struct address_space *mapping,
> + struct page *page)
> +{
> + nodemask_t *nodes = mapping->dirty_nodes;
> + int node = page_to_nid(page);
> +
> + if (!nodes) {
> + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
Does it have to be atomic? atomic is weak and can fail.
If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
should at least pass the gfp_t into this function so it can do the stronger
allocation when possible.
> + if (!nodes)
> + return;
> +
> + *nodes = NODE_MASK_NONE;
> + mapping->dirty_nodes = nodes;
> + }
> +
> + if (!node_isset(node, *nodes))
> + node_set(node, *nodes);
> +}
> +
> +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> +{
> + nodemask_t *nodes = mapping->dirty_nodes;
> +
> + if (nodes) {
> + mapping->dirty_nodes = NULL;
> + kfree(nodes);
> + }
> +}
Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
a comment which describes the locking requirements would be good.
> +/*
> + * Called without the tree_lock. The nodemask is only freed when the inode
> + * is cleared and therefore this is safe.
> + */
> +int cpuset_intersects_dirty_nodes(struct address_space *mapping,
> + nodemask_t *mask)
> +{
> + nodemask_t *dirty_nodes = mapping->dirty_nodes;
> +
> + if (!mask)
> + return 1;
> +
> + if (!dirty_nodes)
> + return 0;
> +
> + return nodes_intersects(*dirty_nodes, *mask);
> +}
> +#endif
> +
> /**
> * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
> * @p: pointer to task_struct of some other task.
> diff -uprN -X 0/Documentation/dontdiff 0/mm/page-writeback.c 1/mm/page-writeback.c
> --- 0/mm/page-writeback.c 2007-09-11 14:35:58.000000000 -0700
> +++ 1/mm/page-writeback.c 2007-09-11 14:36:24.000000000 -0700
> @@ -33,6 +33,7 @@
> #include <linux/syscalls.h>
> #include <linux/buffer_head.h>
> #include <linux/pagevec.h>
> +#include <linux/cpuset.h>
>
> /*
> * The maximum number of pages to writeout in a single bdflush/kupdate
> @@ -832,6 +833,7 @@ int __set_page_dirty_nobuffers(struct pa
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> + cpuset_update_dirty_nodes(mapping, page);
> write_unlock_irq(&mapping->tree_lock);
> if (mapping->host) {
> /* !PageAnon && !swapper_space */
>
>
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 6/6] cpuset dirty limits
2007-09-12 1:42 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
@ 2007-09-14 23:15 ` Andrew Morton
2007-09-17 19:00 ` Christoph Lameter
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2007-09-14 23:15 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Christoph Lameter
On Tue, 11 Sep 2007 18:42:16 -0700
Ethan Solomita <solo@google.com> wrote:
> Per cpuset dirty ratios
>
> This implements dirty ratios per cpuset. Two new files are added
> to the cpuset directories:
>
> background_dirty_ratio Percentage at which background writeback starts
>
> throttle_dirty_ratio Percentage at which the application is throttled
> and we start synchronous writeout.
>
> Both variables are set to -1 by default which means that the global
> limits (/proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio)
> are used for a cpuset.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Acked-by: Ethan Solomita <solo@google.com>
>
> ---
>
> Patch against 2.6.23-rc4-mm1
>
> diff -uprN -X 0/Documentation/dontdiff 5/include/linux/cpuset.h 7/include/linux/cpuset.h
> --- 5/include/linux/cpuset.h 2007-09-11 14:50:48.000000000 -0700
> +++ 7/include/linux/cpuset.h 2007-09-11 14:51:12.000000000 -0700
> @@ -77,6 +77,7 @@ extern void cpuset_track_online_nodes(vo
>
> extern int current_cpuset_is_being_rebound(void);
>
> +extern void cpuset_get_current_ratios(int *background, int *ratio);
> /*
> * We need macros since struct address_space is not defined yet
> */
> diff -uprN -X 0/Documentation/dontdiff 5/kernel/cpuset.c 7/kernel/cpuset.c
> --- 5/kernel/cpuset.c 2007-09-11 14:50:49.000000000 -0700
> +++ 7/kernel/cpuset.c 2007-09-11 14:56:18.000000000 -0700
> @@ -51,6 +51,7 @@
> #include <linux/time.h>
> #include <linux/backing-dev.h>
> #include <linux/sort.h>
> +#include <linux/writeback.h>
>
> #include <asm/uaccess.h>
> #include <asm/atomic.h>
> @@ -92,6 +93,9 @@ struct cpuset {
> int mems_generation;
>
> struct fmeter fmeter; /* memory_pressure filter */
> +
> + int background_dirty_ratio;
> + int throttle_dirty_ratio;
> };
>
> /* Retrieve the cpuset for a container */
> @@ -169,6 +173,8 @@ static struct cpuset top_cpuset = {
> .flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
> .cpus_allowed = CPU_MASK_ALL,
> .mems_allowed = NODE_MASK_ALL,
> + .background_dirty_ratio = -1,
> + .throttle_dirty_ratio = -1,
> };
>
> /*
> @@ -785,6 +791,21 @@ static int update_flag(cpuset_flagbits_t
> return 0;
> }
>
> +static int update_int(int *cs_int, char *buf, int min, int max)
> +{
> + char *endp;
> + int val;
> +
> + val = simple_strtol(buf, &endp, 10);
> + if (val < min || val > max)
> + return -EINVAL;
> +
> + mutex_lock(&callback_mutex);
> + *cs_int = val;
> + mutex_unlock(&callback_mutex);
I don't think this locking does anything?
> + return 0;
> +}
> +
> /*
> * Frequency meter - How fast is some event occurring?
> *
> ...
> +void cpuset_get_current_ratios(int *background_ratio, int *throttle_ratio)
> +{
> + int background = -1;
> + int throttle = -1;
> + struct task_struct *tsk = current;
> +
> + task_lock(tsk);
> + background = task_cs(tsk)->background_dirty_ratio;
> + throttle = task_cs(tsk)->throttle_dirty_ratio;
> + task_unlock(tsk);
ditto?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-14 23:15 ` Andrew Morton
@ 2007-09-14 23:47 ` Satyam Sharma
2007-09-15 0:07 ` Andrew Morton
2007-09-17 19:10 ` Christoph Lameter
2007-09-19 0:51 ` Ethan Solomita
2 siblings, 1 reply; 41+ messages in thread
From: Satyam Sharma @ 2007-09-14 23:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ethan Solomita, linux-mm, LKML, Christoph Lameter
On 9/15/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 11 Sep 2007 18:36:34 -0700
> Ethan Solomita <solo@google.com> wrote:
> > The dirty map may be stored either directly in the mapping (for NUMA
> > systems with less than BITS_PER_LONG nodes) or separately allocated
> > for systems with a large number of nodes (e.g. IA64 with 1024 nodes).
> > --- 0/include/linux/fs.h 2007-09-11 14:35:58.000000000 -0700
> > +++ 1/include/linux/fs.h 2007-09-11 14:36:24.000000000 -0700
> > @@ -516,6 +516,13 @@ struct address_space {
> > spinlock_t private_lock; /* for use by the address_space */
> > struct list_head private_list; /* ditto */
> > struct address_space *assoc_mapping; /* ditto */
> > +#ifdef CONFIG_CPUSETS
> > +#if MAX_NUMNODES <= BITS_PER_LONG
> > + nodemask_t dirty_nodes; /* nodes with dirty pages */
> > +#else
> > + nodemask_t *dirty_nodes; /* pointer to map if dirty */
> > +#endif
> > +#endif
>
> afacit there is no code comment and no changelog text which explains the
> above design decision? There should be, please.
> > +/*
> > + * Special functions for NUMA systems with a large number of nodes.
> > + * The nodemask is pointed to from the address space structures.
> > + * The attachment of the dirty_node mask is protected by the
> > + * tree_lock. The nodemask is freed only when the inode is cleared
> > + * (and therefore unused, thus no locking necessary).
> > + */
>
> hmm, OK, there's a hint as to wghat's going on.
>
> It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
> we might want to tweak that in the future. Yet another argument for
> centralising this comparison.
Looks like just an optimization to me ... Ethan wants to economize and not bloat
struct address_space too much.
So, if sizeof(nodemask_t) == sizeof(long), i.e. when:
MAX_NUMNODES <= BITS_PER_LONG, then we'll be adding only sizeof(long)
extra bytes to the struct (by plonking the object itself into it).
But even when MAX_NUMNODES > BITS_PER_LONG, because we're storing
a pointer, and because sizeof(void *) == sizeof(long), so again the maximum
bloat addition to struct address_space would only be sizeof(long) bytes.
I didn't see the original mail, but if the #ifdeffery for this
conditional is too much
as a result of this optimization, Ethan should probably just do away
with all of it
entirely, and simply put a full nodemask_t object (irrespective of MAX_NUMNODES)
into the struct. After all, struct task_struct does the same unconditionally ...
but admittedly, there are several times more address_space struct's resident in
memory at any given time than there are task_struct's, so this optimization does
make sense too ...
> > + if (!nodes)
> > + return;
> > +
> > + *nodes = NODE_MASK_NONE;
> > + mapping->dirty_nodes = nodes;
> > + }
> > +
> > + if (!node_isset(node, *nodes))
> > + node_set(node, *nodes);
> > +}
> > +
> > +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> > +{
> > + nodemask_t *nodes = mapping->dirty_nodes;
> > +
> > + if (nodes) {
> > + mapping->dirty_nodes = NULL;
> > + kfree(nodes);
> > + }
> > +}
>
> Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
> a comment which describes the locking requirements would be good.
>
> > +/*
> > + * Called without the tree_lock. The nodemask is only freed when the inode
> > + * is cleared and therefore this is safe.
> > + */
> > +int cpuset_intersects_dirty_nodes(struct address_space *mapping,
> > + nodemask_t *mask)
> > +{
> > + nodemask_t *dirty_nodes = mapping->dirty_nodes;
> > +
> > + if (!mask)
> > + return 1;
> > +
> > + if (!dirty_nodes)
> > + return 0;
> > +
> > + return nodes_intersects(*dirty_nodes, *mask);
> > +}
> > +#endif
> > +
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-14 23:47 ` Satyam Sharma
@ 2007-09-15 0:07 ` Andrew Morton
2007-09-15 0:16 ` Satyam Sharma
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2007-09-15 0:07 UTC (permalink / raw)
To: Satyam Sharma; +Cc: Ethan Solomita, linux-mm, LKML, Christoph Lameter
On Sat, 15 Sep 2007 05:17:48 +0530
"Satyam Sharma" <satyam.sharma@gmail.com> wrote:
> > It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
> > we might want to tweak that in the future. Yet another argument for
> > centralising this comparison.
>
> Looks like just an optimization to me ... Ethan wants to economize and not bloat
> struct address_space too much.
>
> So, if sizeof(nodemask_t) == sizeof(long), i.e. when:
> MAX_NUMNODES <= BITS_PER_LONG, then we'll be adding only sizeof(long)
> extra bytes to the struct (by plonking the object itself into it).
>
> But even when MAX_NUMNODES > BITS_PER_LONG, because we're storing
> a pointer, and because sizeof(void *) == sizeof(long), so again the maximum
> bloat addition to struct address_space would only be sizeof(long) bytes.
yup.
Note that "It's unobvious" != "It's unobvious to me". I review code for
understandability-by-others, not for understandability-by-me.
> I didn't see the original mail, but if the #ifdeffery for this
> conditional is too much
> as a result of this optimization, Ethan should probably just do away
> with all of it
> entirely, and simply put a full nodemask_t object (irrespective of MAX_NUMNODES)
> into the struct. After all, struct task_struct does the same unconditionally ...
> but admittedly, there are several times more address_space struct's resident in
> memory at any given time than there are task_struct's, so this optimization does
> make sense too ...
I think the optimisation is (probably) desirable, but it would be best to
describe the tradeoff in the changelog and to add some suitable
code-commentary for those who read the code in a year's time and to avoid
sprinkling the logic all over the tree.
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-15 0:07 ` Andrew Morton
@ 2007-09-15 0:16 ` Satyam Sharma
2007-09-17 18:37 ` Mike Travis
0 siblings, 1 reply; 41+ messages in thread
From: Satyam Sharma @ 2007-09-15 0:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ethan Solomita, linux-mm, LKML, Christoph Lameter
On 9/15/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sat, 15 Sep 2007 05:17:48 +0530
> "Satyam Sharma" <satyam.sharma@gmail.com> wrote:
>
> > > It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
> > > we might want to tweak that in the future. Yet another argument for
> > > centralising this comparison.
> >
> > Looks like just an optimization to me ... Ethan wants to economize and not bloat
> > struct address_space too much.
> >
> > So, if sizeof(nodemask_t) == sizeof(long), i.e. when:
> > MAX_NUMNODES <= BITS_PER_LONG, then we'll be adding only sizeof(long)
> > extra bytes to the struct (by plonking the object itself into it).
> >
> > But even when MAX_NUMNODES > BITS_PER_LONG, because we're storing
> > a pointer, and because sizeof(void *) == sizeof(long), so again the maximum
> > bloat addition to struct address_space would only be sizeof(long) bytes.
>
> yup.
>
> Note that "It's unobvious" != "It's unobvious to me". I review code for
> understandability-by-others, not for understandability-by-me.
>
> > I didn't see the original mail, but if the #ifdeffery for this
> > conditional is too much
> > as a result of this optimization, Ethan should probably just do away
> > with all of it
> > entirely, and simply put a full nodemask_t object (irrespective of MAX_NUMNODES)
> > into the struct. After all, struct task_struct does the same unconditionally ...
> > but admittedly, there are several times more address_space struct's resident in
> > memory at any given time than there are task_struct's, so this optimization does
> > make sense too ...
>
> I think the optimisation is (probably) desirable, but it would be best to
> describe the tradeoff in the changelog and to add some suitable
> code-commentary for those who read the code in a year's time and to avoid
> sprinkling the logic all over the tree.
True, the other option could be to put the /pointer/ in there unconditionally,
but that would slow down the MAX_NUMNODES <= BITS_PER_LONG case,
which (after grepping through defconfigs) appears to be the common case on
all archs other than ia64. So I think your idea of making that conditional
centralized in the code with an accompanying comment is the way to go here ...
Satyam
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-15 0:16 ` Satyam Sharma
@ 2007-09-17 18:37 ` Mike Travis
0 siblings, 0 replies; 41+ messages in thread
From: Mike Travis @ 2007-09-17 18:37 UTC (permalink / raw)
To: Satyam Sharma
Cc: Andrew Morton, Ethan Solomita, linux-mm, LKML, Christoph Lameter
Satyam Sharma wrote:
> True, the other option could be to put the /pointer/ in there unconditionally,
> but that would slow down the MAX_NUMNODES <= BITS_PER_LONG case,
> which (after grepping through defconfigs) appears to be the common case on
> all archs other than ia64. So I think your idea of making that conditional
> centralized in the code with an accompanying comment is the way to go here ...
It won't be long before arches other than ia64 will have
MAX_NUMNODES > BITS_PER_LONG. While it won't be the norm,
we should account for it now.
Thanks,
Mike
* Re: [PATCH 6/6] cpuset dirty limits
2007-09-14 23:15 ` Andrew Morton
@ 2007-09-17 19:00 ` Christoph Lameter
2007-09-19 0:23 ` Ethan Solomita
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Lameter @ 2007-09-17 19:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ethan Solomita, linux-mm, pj, LKML
On Fri, 14 Sep 2007, Andrew Morton wrote:
> > + mutex_lock(&callback_mutex);
> > + *cs_int = val;
> > + mutex_unlock(&callback_mutex);
>
> I don't think this locking does anything?
Locking is wrong here. The lock needs to be taken before the cs pointer
is dereferenced from the caller.
> > + return 0;
> > +}
> > +
> > /*
> > * Frequency meter - How fast is some event occurring?
> > *
> > ...
> > +void cpuset_get_current_ratios(int *background_ratio, int *throttle_ratio)
> > +{
> > + int background = -1;
> > + int throttle = -1;
> > + struct task_struct *tsk = current;
> > +
> > + task_lock(tsk);
> > + background = task_cs(tsk)->background_dirty_ratio;
> > + throttle = task_cs(tsk)->throttle_dirty_ratio;
> > + task_unlock(tsk);
>
> ditto?
It is required to take the task lock while dereferencing the task's cpuset
pointer.
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-14 23:15 ` Andrew Morton
2007-09-14 23:47 ` Satyam Sharma
@ 2007-09-17 19:10 ` Christoph Lameter
2007-09-19 0:51 ` Ethan Solomita
2 siblings, 0 replies; 41+ messages in thread
From: Christoph Lameter @ 2007-09-17 19:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ethan Solomita, linux-mm, LKML
On Fri, 14 Sep 2007, Andrew Morton wrote:
> It'd be nice to see some discussion regarding the memory consumption of
> this patch and the associated tradeoffs.
On a small NUMA system with < 64 nodes you basically have one additional
word added to the address_space structure. On large NUMA systems we may
have to allocate a separate nodemask covering up to 1024 nodes, which
comes to 128 bytes per mapping. However, such very large systems are rare,
thus we want to do the allocation dynamically.
> afacit there is no code comment and no changelog text which explains the
> above design decision? There should be, please.
Hmmmm... There was an extensive discussion on that one early in the year
and I had it in the overview of the earlier releases. Ethan: Maybe get
that piece into the overview?
> There is talk of making cpusets available with CONFIG_SMP=n. Will this new
> feature be available in that case? (it should be).
Yes. Cpuset with CONFIG_SMP=n has been run for years on IA64 with Tony
Luck's special configurations.
> > EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
> >
> > +#if MAX_NUMNODES > BITS_PER_LONG
>
> waah. In other places we do "MAX_NUMNODES <= BITS_PER_LONG"
So
#if !(MAX_NUMNODES <= BITS_PER_LONG)
> > +/*
> > + * Special functions for NUMA systems with a large number of nodes.
> > + * The nodemask is pointed to from the address space structures.
> > + * The attachment of the dirty_node mask is protected by the
> > + * tree_lock. The nodemask is freed only when the inode is cleared
> > + * (and therefore unused, thus no locking necessary).
> > + */
>
> hmm, OK, there's a hint as to what's going on.
>
> It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
> we might want to tweak that in the future. Yet another argument for
> centralising this comparison.
A pointer is BITS_PER_LONG bits in size. If we have fewer nodes than that,
we just waste memory and processing by having a pointer there.
> > +void cpuset_update_dirty_nodes(struct address_space *mapping,
> > + struct page *page)
> > +{
> > + nodemask_t *nodes = mapping->dirty_nodes;
> > + int node = page_to_nid(page);
> > +
> > + if (!nodes) {
> > + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
>
> Does it have to be atomic? atomic is weak and can fail.
>
> If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
> should at least pass the gfp_t into this function so it can do the stronger
> allocation when possible.
This is called from __set_page_dirty_nobuffers. There is no allocation
context available at that point. If the allocation fails then we do not
allocate a tracking structure which means that no targeted writeout occurs.
> > +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> > +{
> > + nodemask_t *nodes = mapping->dirty_nodes;
> > +
> > + if (nodes) {
> > + mapping->dirty_nodes = NULL;
> > + kfree(nodes);
> > + }
> > +}
>
> Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
> a comment which describes the locking requirements would be good.
You just quoted the comment above that explains the locking requirements.
Here it is again:
/*
+ * Special functions for NUMA systems with a large number of nodes.
+ * The nodemask is pointed to from the address space structures.
+ * The attachment of the dirty_node mask is protected by the
+ * tree_lock. The nodemask is freed only when the inode is cleared
+ * (and therefore unused, thus no locking necessary).
+ */
* Re: [PATCH 6/6] cpuset dirty limits
2007-09-17 19:00 ` Christoph Lameter
@ 2007-09-19 0:23 ` Ethan Solomita
0 siblings, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-09-19 0:23 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, linux-mm, pj, LKML
Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Andrew Morton wrote:
>
>>> + mutex_lock(&callback_mutex);
>>> + *cs_int = val;
>>> + mutex_unlock(&callback_mutex);
>> I don't think this locking does anything?
>
> Locking is wrong here. The lock needs to be taken before the cs pointer
> is dereferenced from the caller.
I think we can just remove the callback_mutex lock. Since the change is
coming from an update to a cpuset filesystem file, the cpuset is not
going anywhere since the inode is open. And I don't see that any code
really cares whether the dirty ratios change out from under them.
>
>>> + return 0;
>>> +}
>>> +
>>> /*
>>> * Frequency meter - How fast is some event occurring?
>>> *
>>> ...
>>> +void cpuset_get_current_ratios(int *background_ratio, int *throttle_ratio)
>>> +{
>>> + int background = -1;
>>> + int throttle = -1;
>>> + struct task_struct *tsk = current;
>>> +
>>> + task_lock(tsk);
>>> + background = task_cs(tsk)->background_dirty_ratio;
>>> + throttle = task_cs(tsk)->throttle_dirty_ratio;
>>> + task_unlock(tsk);
>> ditto?
>
> It is required to take the task lock while dereferencing the task's cpuset
> pointer.
Agreed.
-- Ethan
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-14 23:15 ` Andrew Morton
2007-09-14 23:47 ` Satyam Sharma
2007-09-17 19:10 ` Christoph Lameter
@ 2007-09-19 0:51 ` Ethan Solomita
2007-09-19 2:14 ` Andrew Morton
2007-09-19 17:06 ` Christoph Lameter
2 siblings, 2 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-09-19 0:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, LKML, Christoph Lameter
Andrew Morton wrote:
> On Tue, 11 Sep 2007 18:36:34 -0700
> Ethan Solomita <solo@google.com> wrote:
>
>> Add a dirty map to struct address_space
>
> I get a tremendous number of rejects trying to wedge this stuff on top of
> Peter's mm-dirty-balancing-for-tasks changes. More rejects than I am
> prepared to partially-fix so that I can usefully look at these changes in
> tkdiff, so this is all based on a quick peek at the diff itself..
This isn't surprising. We're both changing the calculation of dirty
limits. If his code is already in your workspace, then I'll have to do
the merging after you release it.
>> +#if MAX_NUMNODES <= BITS_PER_LONG
>
> The patch is sprinkled full of this conditional.
>
> I don't understand why this is being done. afaict it isn't described
> in a code comment (it should be) nor even in the changelogs?
I can add comments.
> Given its overall complexity and its likelihood to change in the
> future, I'd suggest that this conditional be centralised in a single
> place. Something like
>
> /*
> * nice comment goes here
> */
> #if MAX_NUMNODES <= BITS_PER_LONG
> #define CPUSET_DIRTY_LIMITS 1
> #else
> #define CPUSET_DIRTY_LIMITS 0
> #endif
>
> Then use #if CPUSET_DIRTY_LIMITS everywhere else.
>
> (This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll get a
> warning if someone typos '#if CPUSET_DITRY_LIMITS')
I can add something like this. Probably something like:
CPUSET_DIRTY_LIMITS_USEPTR
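A rough sketch of how that centralised conditional might look, using the
CPUSET_DIRTY_LIMITS_USEPTR name just suggested (illustrative only, not the
actual patch):

/*
 * The dirty map costs exactly one long either way: a nodemask_t embedded
 * in struct address_space when it fits in a word, otherwise a pointer to
 * a separately allocated nodemask.
 */
#if MAX_NUMNODES > BITS_PER_LONG
#define CPUSET_DIRTY_LIMITS_USEPTR	1
#else
#define CPUSET_DIRTY_LIMITS_USEPTR	0
#endif

so the fs.h hunk below would become:

#ifdef CONFIG_CPUSETS
#if CPUSET_DIRTY_LIMITS_USEPTR
	nodemask_t *dirty_nodes;	/* pointer to map if dirty */
#else
	nodemask_t dirty_nodes;		/* nodes with dirty pages */
#endif
#endif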
>> --- 0/include/linux/fs.h 2007-09-11 14:35:58.000000000 -0700
>> +++ 1/include/linux/fs.h 2007-09-11 14:36:24.000000000 -0700
>> @@ -516,6 +516,13 @@ struct address_space {
>> spinlock_t private_lock; /* for use by the address_space */
>> struct list_head private_list; /* ditto */
>> struct address_space *assoc_mapping; /* ditto */
>> +#ifdef CONFIG_CPUSETS
>> +#if MAX_NUMNODES <= BITS_PER_LONG
>> + nodemask_t dirty_nodes; /* nodes with dirty pages */
>> +#else
>> + nodemask_t *dirty_nodes; /* pointer to map if dirty */
>> +#endif
>> +#endif
>
> afacit there is no code comment and no changelog text which explains the
> above design decision? There should be, please.
OK.
>
> There is talk of making cpusets available with CONFIG_SMP=n. Will this new
> feature be available in that case? (it should be).
I'm not sure how useful it would be in that scenario, but for
consistency we should still be able to specify varying dirty ratios
(from patch 6/6). The above code wouldn't mean anything with SMP=n since
there's only the one node. We'd just be indicating whether the inode has
any dirty pages, which we already know.
>
>> } __attribute__((aligned(sizeof(long))));
>> /*
>> * On most architectures that alignment is already the case; but
>> diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
>> --- 0/include/linux/writeback.h 2007-09-11 14:35:58.000000000 -0700
>> +++ 1/include/linux/writeback.h 2007-09-11 14:37:46.000000000 -0700
>> @@ -62,6 +62,7 @@ struct writeback_control {
>> unsigned for_writepages:1; /* This is a writepages() call */
>> unsigned range_cyclic:1; /* range_start is cyclic */
>> void *fs_private; /* For use by ->writepages() */
>> + nodemask_t *nodes; /* Set of nodes of interest */
>> };
>
> That comment is a bit terse. It's always good to be lavish when commenting
> data structures, for understanding those is key to understanding a design.
>
OK
>> /*
>> diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
>> --- 0/kernel/cpuset.c 2007-09-11 14:35:58.000000000 -0700
>> +++ 1/kernel/cpuset.c 2007-09-11 14:36:24.000000000 -0700
>> @@ -4,7 +4,7 @@
>> * Processor and Memory placement constraints for sets of tasks.
>> *
>> * Copyright (C) 2003 BULL SA.
>> - * Copyright (C) 2004-2006 Silicon Graphics, Inc.
>> + * Copyright (C) 2004-2007 Silicon Graphics, Inc.
>> * Copyright (C) 2006 Google, Inc
>> *
>> * Portions derived from Patrick Mochel's sysfs code.
>> @@ -14,6 +14,7 @@
>> * 2003-10-22 Updates by Stephen Hemminger.
>> * 2004 May-July Rework by Paul Jackson.
>> * 2006 Rework by Paul Menage to use generic containers
>> + * 2007 Cpuset writeback by Christoph Lameter.
>> *
>> * This file is subject to the terms and conditions of the GNU General Public
>> * License. See the file COPYING in the main directory of the Linux
>> @@ -1754,6 +1755,63 @@ int cpuset_mem_spread_node(void)
>> }
>> EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
>>
>> +#if MAX_NUMNODES > BITS_PER_LONG
>
> waah. In other places we do "MAX_NUMNODES <= BITS_PER_LONG"
Your sanity is important to me. Will fix.
>
>> +
>> +/*
>> + * Special functions for NUMA systems with a large number of nodes.
>> + * The nodemask is pointed to from the address space structures.
>> + * The attachment of the dirty_node mask is protected by the
>> + * tree_lock. The nodemask is freed only when the inode is cleared
>> + * (and therefore unused, thus no locking necessary).
>> + */
>
> hmm, OK, there's a hint as to what's going on.
>
> It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
> we might want to tweak that in the future. Yet another argument for
> centralising this comparison.
I'll add a comment to make it more obvious. The point is to add only one
long to the data structure regardless of sizeof(nodemask_t): either the
nodemask_t itself, or a pointer to a nodemask_t when the nodemask is too large.
>
>> +void cpuset_update_dirty_nodes(struct address_space *mapping,
>> + struct page *page)
>> +{
>> + nodemask_t *nodes = mapping->dirty_nodes;
>> + int node = page_to_nid(page);
>> +
>> + if (!nodes) {
>> + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
>
> Does it have to be atomic? atomic is weak and can fail.
>
> If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
> should at least pass the gfp_t into this function so it can do the stronger
> allocation when possible.
I was going to say that sanity would be improved by just allocing the
nodemask at inode alloc time. A failure here could be a problem because
below cpuset_intersects_dirty_nodes() assumes that a NULL nodemask
pointer means that there are no dirty nodes, thus preventing dirty pages
from getting written to disk. i.e. This must never fail.
Given that we allocate it always at the beginning, I'm leaning towards
just allocating it within mapping no matter its size. It will make the
code much much simpler, and save me writing all the comments we've been
discussing. 8-)
How disastrous would this be? Is the need to support a 1024 node system
with 1,000,000 open mostly-read-only files thus needing to spend 120MB
of extra memory on my nodemasks a real scenario and a showstopper?
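For scale, a back-of-the-envelope check of that figure, assuming
MAX_NUMNODES = 1024 so that sizeof(nodemask_t) = 1024/8 = 128 bytes:

	1,000,000 inodes * 128 bytes per nodemask = 128,000,000 bytes,
	i.e. about 122 MiB

which is where the roughly 120MB above comes from.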
>
>
>> + if (!nodes)
>> + return;
>> +
>> + *nodes = NODE_MASK_NONE;
>> + mapping->dirty_nodes = nodes;
>> + }
>> +
>> + if (!node_isset(node, *nodes))
>> + node_set(node, *nodes);
>> +}
>> +
>> +void cpuset_clear_dirty_nodes(struct address_space *mapping)
>> +{
>> + nodemask_t *nodes = mapping->dirty_nodes;
>> +
>> + if (nodes) {
>> + mapping->dirty_nodes = NULL;
>> + kfree(nodes);
>> + }
>> +}
>
> Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
> a comment which describes the locking requirements would be good.
I'll add a comment. Such a race should not be possible. It is called
only from clear_inode() which is used when the inode is being freed
"with extreme prejudice" (from its comments). I can add a check that
i_state I_FREEING is set. Would that do?
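A minimal sketch of that check, assuming the function keeps its current
single caller and that a WARN_ON is an acceptable way to flag a violation
(illustrative only):

void cpuset_clear_dirty_nodes(struct address_space *mapping)
{
	nodemask_t *nodes = mapping->dirty_nodes;

	/*
	 * Only clear_inode() should get here, so the inode must already be
	 * marked I_FREEING and nothing can race with the kfree() below.
	 */
	WARN_ON(mapping->host && !(mapping->host->i_state & I_FREEING));

	if (nodes) {
		mapping->dirty_nodes = NULL;
		kfree(nodes);
	}
}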
>
>> +/*
>> + * Called without the tree_lock. The nodemask is only freed when the inode
>> + * is cleared and therefore this is safe.
>> + */
>> +int cpuset_intersects_dirty_nodes(struct address_space *mapping,
>> + nodemask_t *mask)
>> +{
>> + nodemask_t *dirty_nodes = mapping->dirty_nodes;
>> +
>> + if (!mask)
>> + return 1;
>> +
>> + if (!dirty_nodes)
>> + return 0;
>> +
>> + return nodes_intersects(*dirty_nodes, *mask);
>> +}
>> +#endif
>> +
>> /**
>> * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
>> * @p: pointer to task_struct of some other task.
>> diff -uprN -X 0/Documentation/dontdiff 0/mm/page-writeback.c 1/mm/page-writeback.c
>> --- 0/mm/page-writeback.c 2007-09-11 14:35:58.000000000 -0700
>> +++ 1/mm/page-writeback.c 2007-09-11 14:36:24.000000000 -0700
>> @@ -33,6 +33,7 @@
>> #include <linux/syscalls.h>
>> #include <linux/buffer_head.h>
>> #include <linux/pagevec.h>
>> +#include <linux/cpuset.h>
>>
>> /*
>> * The maximum number of pages to writeout in a single bdflush/kupdate
>> @@ -832,6 +833,7 @@ int __set_page_dirty_nobuffers(struct pa
>> radix_tree_tag_set(&mapping->page_tree,
>> page_index(page), PAGECACHE_TAG_DIRTY);
>> }
>> + cpuset_update_dirty_nodes(mapping, page);
>> write_unlock_irq(&mapping->tree_lock);
>> if (mapping->host) {
>> /* !PageAnon && !swapper_space */
>>
>>
>>
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-19 0:51 ` Ethan Solomita
@ 2007-09-19 2:14 ` Andrew Morton
2007-09-19 17:08 ` Christoph Lameter
2007-09-19 17:06 ` Christoph Lameter
1 sibling, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2007-09-19 2:14 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, LKML, Christoph Lameter
On Tue, 18 Sep 2007 17:51:49 -0700 Ethan Solomita <solo@google.com> wrote:
> >
> >> +void cpuset_update_dirty_nodes(struct address_space *mapping,
> >> + struct page *page)
> >> +{
> >> + nodemask_t *nodes = mapping->dirty_nodes;
> >> + int node = page_to_nid(page);
> >> +
> >> + if (!nodes) {
> >> + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
> >
> > Does it have to be atomic? atomic is weak and can fail.
> >
> > If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
> > should at least pass the gfp_t into this function so it can do the stronger
> > allocation when possible.
>
> I was going to say that sanity would be improved by just allocing the
> nodemask at inode alloc time. A failure here could be a problem because
> below cpuset_intersects_dirty_nodes() assumes that a NULL nodemask
> pointer means that there are no dirty nodes, thus preventing dirty pages
> from getting written to disk. i.e. This must never fail.
>
> Given that we allocate it always at the beginning, I'm leaning towards
> just allocating it within mapping no matter its size. It will make the
> code much much simpler, and save me writing all the comments we've been
> discussing. 8-)
>
> How disastrous would this be? Is the need to support a 1024 node system
> with 1,000,000 open mostly-read-only files thus needing to spend 120MB
> of extra memory on my nodemasks a real scenario and a showstopper?
None of this is very nice. Yes, it would be good to save all that memory
and yes, I_DIRTY_PAGES inodes are very much the uncommon case.
But if a failed GFP_ATOMIC allocation results in data loss then that's a
showstopper.
How hard would it be to handle the allocation failure in a more friendly
manner? Say, if the allocation failed then point mapping->dirty_nodes at
some global all-ones nodemask, and then special-case that nodemask in the
freeing code?
> >
> >
> >> + if (!nodes)
> >> + return;
> >> +
> >> + *nodes = NODE_MASK_NONE;
> >> + mapping->dirty_nodes = nodes;
> >> + }
> >> +
> >> + if (!node_isset(node, *nodes))
> >> + node_set(node, *nodes);
> >> +}
> >> +
> >> +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> >> +{
> >> + nodemask_t *nodes = mapping->dirty_nodes;
> >> +
> >> + if (nodes) {
> >> + mapping->dirty_nodes = NULL;
> >> + kfree(nodes);
> >> + }
> >> +}
> >
> > Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
> > a comment which describes the locking requirements would be good.
>
> I'll add a comment. Such a race should not be possible. It is called
> only from clear_inode() which is used when the inode is being freed
> "with extreme prejudice" (from its comments). I can add a check that
> i_state I_FREEING is set. Would that do?
Sounds sane.
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-19 0:51 ` Ethan Solomita
2007-09-19 2:14 ` Andrew Morton
@ 2007-09-19 17:06 ` Christoph Lameter
1 sibling, 0 replies; 41+ messages in thread
From: Christoph Lameter @ 2007-09-19 17:06 UTC (permalink / raw)
To: Ethan Solomita; +Cc: Andrew Morton, linux-mm, LKML
On Tue, 18 Sep 2007, Ethan Solomita wrote:
> > Does it have to be atomic? atomic is weak and can fail.
> >
> > If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
> > should at least pass the gfp_t into this function so it can do the stronger
> > allocation when possible.
>
> I was going to say that sanity would be improved by just allocing the
> nodemask at inode alloc time. A failure here could be a problem because
> below cpuset_intersects_dirty_nodes() assumes that a NULL nodemask
> pointer means that there are no dirty nodes, thus preventing dirty pages
> from getting written to disk. i.e. This must never fail.
Hmmm. It should assume that there is no tracking thus any node can be
dirty? Match by default?
> Given that we allocate it always at the beginning, I'm leaning towards
> just allocating it within mapping no matter its size. It will make the
> code much much simpler, and save me writing all the comments we've been
> discussing. 8-)
>
> How disastrous would this be? Is the need to support a 1024 node system
> with 1,000,000 open mostly-read-only files thus needing to spend 120MB
> of extra memory on my nodemasks a real scenario and a showstopper?
Consider that a 1024 node system has more than 4TB of memory. If that
system is running as a fileserver then you get into some issues. But then
120MB is not that big of a deal. It's more the cache footprint issue I
would think. Having a NULL there avoids touching a 128 byte nodemask. I
think your approach should be fine.
> >> +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> >> +{
> >> + nodemask_t *nodes = mapping->dirty_nodes;
> >> +
> >> + if (nodes) {
> >> + mapping->dirty_nodes = NULL;
> >> + kfree(nodes);
> >> + }
> >> +}
> >
> > Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
> > a comment which describes the locking requirements would be good.
>
> I'll add a comment. Such a race should not be possible. It is called
> only from clear_inode() which is used when the inode is being freed
> "with extreme prejudice" (from its comments). I can add a check that
> i_state I_FREEING is set. Would that do?
There is already a comment saying that it cannot happen.
* Re: [PATCH 1/6] cpuset write dirty map
2007-09-19 2:14 ` Andrew Morton
@ 2007-09-19 17:08 ` Christoph Lameter
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Lameter @ 2007-09-19 17:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ethan Solomita, linux-mm, LKML
On Tue, 18 Sep 2007, Andrew Morton wrote:
> How hard would it be to handle the allocation failure in a more friendly
> manner? Say, if the allocation failed then point mapping->dirty_nodes at
> some global all-ones nodemask, and then special-case that nodemask in the
> freeing code?
Ack. However, the situation dirty_nodes == NULL && inode dirty then means
that unknown nodes are dirty. If we are later successful with the alloc
and we know that pages in the mapping are already dirty, then the initial
dirty_nodes must be all ones. If this is the first page to be dirtied then
we can start with a dirty_nodes mask of all zeros like now.
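A rough sketch of Andrew's fallback, with cpuset_dirty_all_nodes as a
hypothetical global mask (illustrative only; once an inode is pointed at
the fallback no retry is attempted, which sidesteps the "later alloc must
start from all ones" case because the fallback already intersects every
cpuset):

static nodemask_t cpuset_dirty_all_nodes = NODE_MASK_ALL;

void cpuset_update_dirty_nodes(struct address_space *mapping,
				struct page *page)
{
	nodemask_t *nodes = mapping->dirty_nodes;
	int node = page_to_nid(page);

	if (!nodes) {
		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
		if (!nodes) {
			/*
			 * No memory: assume every node may hold dirty pages
			 * so that writeback never skips this inode.
			 */
			mapping->dirty_nodes = &cpuset_dirty_all_nodes;
			return;
		}
		*nodes = NODE_MASK_NONE;
		mapping->dirty_nodes = nodes;
	} else if (nodes == &cpuset_dirty_all_nodes)
		return;		/* already "all nodes dirty" */

	if (!node_isset(node, *nodes))
		node_set(node, *nodes);
}

void cpuset_clear_dirty_nodes(struct address_space *mapping)
{
	nodemask_t *nodes = mapping->dirty_nodes;

	if (nodes) {
		mapping->dirty_nodes = NULL;
		if (nodes != &cpuset_dirty_all_nodes)
			kfree(nodes);	/* never free the shared fallback */
	}
}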
* Re: [PATCH 3/6] cpuset write throttle
[not found] ` <20070914161517.5ea3847f.akpm@linux-foundation.org>
@ 2007-10-03 0:38 ` Ethan Solomita
2007-10-03 17:46 ` Christoph Lameter
0 siblings, 1 reply; 41+ messages in thread
From: Ethan Solomita @ 2007-10-03 0:38 UTC (permalink / raw)
To: Christoph Lameter, a.p.zijlstra, linux-mm; +Cc: Andrew Morton
Andrew Morton wrote:
> On Tue, 11 Sep 2007 18:39:27 -0700
> Ethan Solomita <solo@google.com> wrote:
>
>> Make page writeback obey cpuset constraints
>
> akpm:/usr/src/25> pushpatch
> patching file mm/page-writeback.c
> Hunk #1 FAILED at 103.
> Hunk #2 FAILED at 129.
> Hunk #3 FAILED at 166.
> Hunk #4 FAILED at 252.
> Hunk #5 FAILED at 267.
> Hunk #6 FAILED at 282.
> Hunk #7 FAILED at 301.
> Hunk #8 FAILED at 313.
> Hunk #9 FAILED at 329.
> Hunk #10 succeeded at 563 (offset 175 lines).
> Hunk #11 FAILED at 575.
> Hunk #12 FAILED at 607.
> 11 out of 12 hunks FAILED -- saving rejects to file mm/page-writeback.c.rej
> Failed to apply cpuset-write-throttle
>
> :(
>
>
> Huge number of rejects against Peter's stuff. Please redo for next -mm.
I've been looking at how to merge my cpuset write throttling changes
with Peter's per-BDI write throttling changes that have just been taken
by akpm. (Quick simplifying summary: my proposed patchset will only
throttle a process based upon dirty pages in the nodes to which the
process has access, as limited by cpuset's mems_allowed, thus protecting
one cpuset from the dirtying tendencies of other cpusets)
This is essential if cpusets are to isolate tasks from each other, so
we need to find a way to make it work with per-BDI. Theoretically we
could keep track of the per-BDI information within per-node structures
instead of globally, but that could lead to a scary increase in code
complexity and CPU time spent in get_dirty_limits. We don't want the
disk to finish flushing all its pages to disk before we calculate the
dirty limit. 8)
We could keep my changes and Peter's changes as completely independent
calculations, and make it an "or", i.e. the caller of get_dirty_limits
will decide to throttle if either the per-BDI stats signal a throttling
or if the per-cpuset stats signal a throttling.
Unfortunately this eliminates one of the main reasons for the
per-cpuset throttling. If one cpuset is responsible for pushing one
disk/BDI to its dirty limit, someone in another cpuset can get throttled.
If we used "and" instead of "or", i.e. the caller is only throttled if
get_dirty_limits would throttle because of both per-BDI stats and
per-cpuset stats we make my cpusets case happy, but in the above
scenario the other cpuset can continue to dirty an overly-dirtied BDI
until they hit their own cpuset limits.
I'm hoping to get some input from Christoph, Peter, and anyone else who
can think of a good way to bring this all together.
-- Ethan
* Re: [PATCH 3/6] cpuset write throttle
2007-10-03 0:38 ` Ethan Solomita
@ 2007-10-03 17:46 ` Christoph Lameter
2007-10-03 20:46 ` Ethan Solomita
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Lameter @ 2007-10-03 17:46 UTC (permalink / raw)
To: Ethan Solomita; +Cc: a.p.zijlstra, linux-mm, Andrew Morton
On Tue, 2 Oct 2007, Ethan Solomita wrote:
> Unfortunately this eliminates one of the main reasons for the
> per-cpuset throttling. If one cpuset is responsible for pushing one
> disk/BDI to its dirty limit, someone in another cpuset can get throttled.
I think that is acceptable. All processes that write to one disk/BDI must
be affected by congestion on that device. We may have to deal with
fairness issues later if it indeed becomes a problem.
* Re: [PATCH 3/6] cpuset write throttle
2007-10-03 17:46 ` Christoph Lameter
@ 2007-10-03 20:46 ` Ethan Solomita
2007-10-04 3:56 ` Christoph Lameter
0 siblings, 1 reply; 41+ messages in thread
From: Ethan Solomita @ 2007-10-03 20:46 UTC (permalink / raw)
To: Christoph Lameter; +Cc: a.p.zijlstra, linux-mm, Andrew Morton
Christoph Lameter wrote:
> On Tue, 2 Oct 2007, Ethan Solomita wrote:
>
>> Unfortunately this eliminates one of the main reasons for the
>> per-cpuset throttling. If one cpuset is responsible for pushing one
>> disk/BDI to its dirty limit, someone in another cpuset can get throttled.
>
> I think that is acceptable. All processes that write to one disk/BDI must
> be affected by congestion on that device. We may have to deal with
> fairness issues later if it indeed becomes a problem.
We do see a fairness issue. We've seen delays on the order of 100
seconds for just a few writes to disk, and latency is important to us.
Perhaps we can detect that the bdi already has a long queue of pending
writes and not force more writes at this time so long as the per-cpuset
dirty threshold is not too high.
On a side note, get_dirty_limits() now returns two dirty counts, both
the dirty and bdi_dirty, yet its callers only ever want one of those
results. Could we change get_dirty_limits to only calculate one dirty
value based upon whether bdi is non-NULL? This would save calculation of
regular dirty when a bdi is passed.
-- Ethan
* Re: [PATCH 3/6] cpuset write throttle
2007-10-03 20:46 ` Ethan Solomita
@ 2007-10-04 3:56 ` Christoph Lameter
2007-10-04 7:37 ` Peter Zijlstra
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Lameter @ 2007-10-04 3:56 UTC (permalink / raw)
To: Ethan Solomita; +Cc: a.p.zijlstra, linux-mm, Andrew Morton
On Wed, 3 Oct 2007, Ethan Solomita wrote:
> >> Unfortunately this eliminates one of the main reasons for the
> >> per-cpuset throttling. If one cpuset is responsible for pushing one
> >> disk/BDI to its dirty limit, someone in another cpuset can get throttled.
> >
> > I think that is acceptable. All processes that write to one disk/BDI must
> > be affected by congestion on that device. We may have to deal with
> > fairness issues later if it indeed becomes a problem.
>
> We do see a fairness issue. We've seen delays on the order of 100
> seconds for just a few writes to disk, and latency is important to us.
> Perhaps we can detect that the bdi already has a long queue of pending
> writes and not force more writes at this time so long as the per-cpuset
> dirty threshold is not too high.
Arghy.
> On a side note, get_dirty_limits() now returns two dirty counts, both
> the dirty and bdi_dirty, yet its callers only ever want one of those
> results. Could we change get_dirty_limits to only calculate one dirty
> value based upon whether bdi is non-NULL? This would save calculation of
> regular dirty when a bdi is passed.
Hmmmm.... I think Peter needs to consider this.
* Re: [PATCH 3/6] cpuset write throttle
2007-10-04 3:56 ` Christoph Lameter
@ 2007-10-04 7:37 ` Peter Zijlstra
2007-10-04 7:56 ` Paul Jackson
2007-10-05 19:34 ` Ethan Solomita
0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2007-10-04 7:37 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Ethan Solomita, linux-mm, Andrew Morton
On Wed, 2007-10-03 at 20:56 -0700, Christoph Lameter wrote:
> On Wed, 3 Oct 2007, Ethan Solomita wrote:
>
> > >> Unfortunately this eliminates one of the main reasons for the
> > >> per-cpuset throttling. If one cpuset is responsible for pushing one
> > >> disk/BDI to its dirty limit, someone in another cpuset can get throttled.
> > >
> > > I think that is acceptable. All processes that write to one disk/BDI must
> > > be affected by congestion on that device. We may have to deal with
> > > fairness issues later if it indeed becomes a problem.
> >
> > We do see a fairness issue. We've seen delays on the order of 100
> > seconds for just a few writes to disk, and latency is important to us.
> > Perhaps we can detect that the bdi already has a long queue of pending
> > writes and not force more writes at this time so long as the per-cpuset
> > dirty threshold is not too high.
>
> Arghy.
clameter gone pirate. It's just that you're a few weeks late :-)
Perhaps you can keep a proportion in the cpu-set, and do a similar trick
that the process proportions do.
currently:
limit = total_limit * p_bdi * (1 - p_task/8)
suggestion:
limit = total_limit * p_bdi * (1 - p_task/8) * (1 - p_cpuset/4)
That would give a very busy cpuset a limit 1/4 lower than an idle
cpu-set, so that the idle cpu-set can do light traffic before getting
throttled.
p_bdi is ratio of writeout completions
p_task is ratio of dirtiers
p_cpuset would also be a ratio of dirtiers
Another option would be:
limit = cpuset_limit * p_bdi * (1 - p_task/8)
Each cpuset gets a pre-proportioned part of the total limit. Overlapping
cpusets would get into some arguments though.
Hmm, maybe combine the two:
limit = cpuset_limit * p_bdi * (1 - p_task/8) * (1 - p_cpuset/4)
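To make the scaling concrete with made-up numbers: with total_limit = 1000
pages, p_bdi = 0.5 and p_task = 0.4, the current formula gives
1000 * 0.5 * (1 - 0.4/8) = 475 pages. A busy cpuset with p_cpuset = 0.8
would scale that by (1 - 0.8/4) = 0.8, down to 380 pages, while a nearly
idle cpuset (p_cpuset ~ 0) keeps the full 475.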
> > On a side note, get_dirty_limits() now returns two dirty counts, both
> > the dirty and bdi_dirty, yet its callers only ever want one of those
> > results. Could we change get_dirty_limits to only calculate one dirty
> > value based upon whether bdi is non-NULL? This would save calculation of
> > regular dirty when a bdi is passed.
>
> Hmmmm.... I think Peter needs to consider this.
we need the total anyway, it's where we start calculating the bdi thing
from.
* Re: [PATCH 3/6] cpuset write throttle
2007-10-04 7:37 ` Peter Zijlstra
@ 2007-10-04 7:56 ` Paul Jackson
2007-10-04 8:15 ` Peter Zijlstra
2007-10-05 19:34 ` Ethan Solomita
1 sibling, 1 reply; 41+ messages in thread
From: Paul Jackson @ 2007-10-04 7:56 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: clameter, solo, linux-mm, akpm
Peter wrote:
> Perhaps you can keep a proportion in the cpu-set, and do a similar trick
> that the process proportions do.
Beware -- the following comment is made by someone who has been
paying basically zero attention to this thread, so could be --way-- off
the mark.
Be that as it may, avoid putting anything in the cpuset that you need
to get to frequently. Access to all cpusets in the system is guarded
by a single global mutex. The current performance assumption is that
about the only things that need to access the contents of a cpuset are:
1) explicit user file operations on the special files in the cpuset
file system, and
2) some exceptional situations from slow code paths, such as memory
shortage or cpu hotplug events.
Well ... almost. If you don't mind the occasional access to the wrong
cpuset, then just taking the task_lock on the current task will guarantee
you that the cpuset pointer in the task struct points to --some-- cpuset,
usually the right one. This can be (and is) used for some statistic
gathering purposes (look for 'fmeter' in kernel/cpuset.c), where exact
counts are not required. Perhaps that applies here as well.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH 3/6] cpuset write throttle
2007-10-04 7:56 ` Paul Jackson
@ 2007-10-04 8:15 ` Peter Zijlstra
2007-10-04 8:25 ` Peter Zijlstra
2007-10-04 9:04 ` Paul Jackson
0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2007-10-04 8:15 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, solo, linux-mm, akpm
On Thu, 2007-10-04 at 00:56 -0700, Paul Jackson wrote:
> Peter wrote:
> > Perhaps you can keep a proportion in the cpu-set, and do a similar trick
> > that the process proportions do.
>
> Beware -- the following comment is made by someone who has been
> > paying basically zero attention to this thread, so could be --way-- off
> the mark.
>
> Be that as it may, avoid putting anything in the cpuset that you need
> to get to frequently. Access to all cpusets in the system is guarded
> by a single global mutex. The current performance assumption is that
> about the only things that need to access the contents of a cpuset are:
> 1) explicit user file operations on the special files in the cpuset
> file system, and
> 2) some exceptional situations from slow code paths, such as memory
> shortage or cpu hotplug events.
>
> > Well ... almost. If you don't mind the occasional access to the wrong
> cpuset, then just taking the task_lock on the current task will guarantee
> you that the cpuset pointer in the task struct points to --some-- cpuset,
> usually the right one. This can be (and is) used for some statistic
> gathering purposes (look for 'fmeter' in kernel/cpuset.c), where exact
> counts are not required. Perhaps that applies here as well.
Ugh, yeah. It's a statistical thing, but task_lock is quite a big lock to
take. Are cpusets RCU freed? In which case we could just rcu_deref the
cpuset pointer and do whatever needs done to whatever we find.
* Re: [PATCH 3/6] cpuset write throttle
2007-10-04 8:15 ` Peter Zijlstra
@ 2007-10-04 8:25 ` Peter Zijlstra
2007-10-04 9:06 ` Paul Jackson
2007-10-04 9:04 ` Paul Jackson
1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2007-10-04 8:25 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, solo, linux-mm, akpm
On Thu, 2007-10-04 at 10:15 +0200, Peter Zijlstra wrote:
> On Thu, 2007-10-04 at 00:56 -0700, Paul Jackson wrote:
> > Peter wrote:
> > > Perhaps you can keep a proportion in the cpu-set, and do a similar trick
> > > that the process proportions do.
> >
> > Beware -- the following comment is made by someone who has been
> > paying basically zero attention to this thread, so could be --way-- off
> > the mark.
> >
> > Be that as it may, avoid putting anything in the cpuset that you need
> > to get to frequently. Access to all cpusets in the system is guarded
> > by a single global mutex. The current performance assumption is that
> > about the only things that need to access the contents of a cpuset are:
> > 1) explicit user file operations on the special files in the cpuset
> > file system, and
> > 2) some exceptional situations from slow code paths, such as memory
> > shortage or cpu hotplug events.
> >
> > Well ... almost. If you don't mind the occasional access to the wrong
> > cpuset, then just taking the task_lock on the current task will guarantee
> > you that the cpuset pointer in the task struct points to --some-- cpuset,
> > usually the right one. This can be (and is) used for some statistic
> > gathering purposes (look for 'fmeter' in kernel/cpuset.c), where exact
> > counts are not required. Perhaps that applies here as well.
>
> Ugh, yeah. It's a statistical thing, but task_lock is quite a big lock to
> take. Are cpusets RCU freed? In which case we could just rcu_deref the
> cpuset pointer and do whatever needs done to whatever we find.
Ok, still need my morning juice. I read tasklist_lock.
task_lock() should be fine.
* Re: [PATCH 3/6] cpuset write throttle
2007-10-04 8:15 ` Peter Zijlstra
2007-10-04 8:25 ` Peter Zijlstra
@ 2007-10-04 9:04 ` Paul Jackson
1 sibling, 0 replies; 41+ messages in thread
From: Paul Jackson @ 2007-10-04 9:04 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: clameter, solo, linux-mm, akpm
Peter wrote:
> Ugh, yeah. Its a statistical thing, but task_lock is quite a big lock to
> take. Are cpusets RCU freed?
Yes, current->cpuset is rcu freed. And on the one hottest code path
that wants to look inside a cpuset, once per memory page allocation,
we look inside the current task's cpuset to see if it has been modified
since last we looked (if it has, we go into slow path code to update
the current->mems_allowed nodemask.) The code that does this, from
kernel/cpuset.c, is:
	rcu_read_lock();
	my_cpusets_mem_gen = task_cs(current)->mems_generation;
	rcu_read_unlock();
Sadly, I just noticed that with the new cgroup (aka container)
code in *-mm (and soon to be in 2.6.24), the 'task_cs' macro got
added, to deal with the fact that what used to be a single pointer in
the task struct directly to the task's cpuset is now perhaps two more
dereferences and a second, buried, rcu guarded access away:
static inline struct cpuset *task_cs(struct task_struct *task)
{
	return container_of(task_subsys_state(task, cpuset_subsys_id),
			    struct cpuset, css);
}

static inline struct cgroup_subsys_state *task_subsys_state(
	struct task_struct *task, int subsys_id)
{
	return rcu_dereference(task->cgroups->subsys[subsys_id]);
}
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH 3/6] cpuset write throttle
2007-10-04 8:25 ` Peter Zijlstra
@ 2007-10-04 9:06 ` Paul Jackson
0 siblings, 0 replies; 41+ messages in thread
From: Paul Jackson @ 2007-10-04 9:06 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: clameter, solo, linux-mm, akpm
Peter wrote:
> Ok, still need my morning juice. I read tasklist_lock.
> task_lock() should be fine.
Ah - ok. I was a little surprised you found task_lock to be
unacceptably expensive here ... but didn't know enough to
be sure. This makes more sense.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH 3/6] cpuset write throttle
2007-10-04 7:37 ` Peter Zijlstra
2007-10-04 7:56 ` Paul Jackson
@ 2007-10-05 19:34 ` Ethan Solomita
1 sibling, 0 replies; 41+ messages in thread
From: Ethan Solomita @ 2007-10-05 19:34 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Christoph Lameter, linux-mm, Andrew Morton
Peter Zijlstra wrote:
>
> currently:
>
> limit = total_limit * p_bdi * (1 - p_task/8)
>
> suggestion:
>
> limit = total_limit * p_bdi * (1 - p_task/8) * (1 - p_cpuset/4)
>
> Another option would be:
>
> limit = cpuset_limit * p_bdi * (1 - p_task/8)
>
A cpuset's relationship with memory is typically rather different than
a process's relationship with a bdi. A bdi is typically shared between
independent processes, making a proportion the right choice. A cpuset is
often set up with exclusive memory nodes. i.e. the only processes which
can allocate from a node of memory are those within this one cpuset.
In that context, we already know the proportion. It's the size of the
nodes in mems_allowed. And we also know the number of dirty pages. Do
you agree that a formal proportion is unneeded?
i.e. the cpuset_limit would be the sum of available memory across all
of mems_allowed nodes, times the ratio (e.g. 40%). This seems to fit
best into your second suggestion. My main concern is the scenario where
the bdi is highly utilized, but by other cpusets. Preferably, that high
p_bdi should not prevent this cpuset from dirtying a few pages.
What if the bdi held an array ala numdirty[MAX_NUMNODES] and then
avoided throttling if numdirty[thisnode] / bdi_totdirty is below a
threshold? Ideally we'd keep track of it per-cpuset, not per-node, but
cpusets are volatile so that could become complicated.
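A very rough sketch of that per-node bdi accounting, purely to make the
brainstorm concrete (all names are hypothetical, and real counters would
presumably want the same percpu treatment as the existing bdi stats):

struct bdi_node_dirty {
	unsigned long numdirty[MAX_NUMNODES];	/* dirty pages per node */
	unsigned long totdirty;			/* sum over all nodes */
};

/*
 * Only throttle on this bdi if the caller's node accounts for more than
 * a small share of the bdi's dirty pages.
 */
static int bdi_node_over_threshold(struct bdi_node_dirty *bn, int node,
				   unsigned int threshold_pct)
{
	if (!bn->totdirty)
		return 0;
	return bn->numdirty[node] * 100 / bn->totdirty >= threshold_pct;
}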
I'm just brainstorming here, so the above is just a suggestion.
-- Ethan
Thread overview: 41+ messages
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
2007-07-17 21:32 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
2007-07-17 21:33 ` [PATCH 2/6] cpuset write pdflush nodemask Ethan Solomita
2007-07-17 21:34 ` [PATCH 3/6] cpuset write throttle Ethan Solomita
2007-07-17 21:35 ` [PATCH 4/6] cpuset write vmscan Ethan Solomita
2007-07-17 21:36 ` [PATCH 5/6] cpuset write vm writeout Ethan Solomita
2007-07-17 21:37 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
2007-07-23 20:18 ` [PATCH 0/6] cpuset aware writeback Christoph Lameter
2007-07-23 21:30 ` Ethan Solomita
2007-07-23 21:53 ` Christoph Lameter
2007-09-12 1:32 ` Ethan Solomita
2007-09-12 1:36 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
2007-09-14 23:15 ` Andrew Morton
2007-09-14 23:47 ` Satyam Sharma
2007-09-15 0:07 ` Andrew Morton
2007-09-15 0:16 ` Satyam Sharma
2007-09-17 18:37 ` Mike Travis
2007-09-17 19:10 ` Christoph Lameter
2007-09-19 0:51 ` Ethan Solomita
2007-09-19 2:14 ` Andrew Morton
2007-09-19 17:08 ` Christoph Lameter
2007-09-19 17:06 ` Christoph Lameter
2007-09-12 1:38 ` [PATCH 2/6] cpuset write pdflush nodemask Ethan Solomita
2007-09-12 1:39 ` [PATCH 3/6] cpuset write throttle Ethan Solomita
[not found] ` <20070914161517.5ea3847f.akpm@linux-foundation.org>
2007-10-03 0:38 ` Ethan Solomita
2007-10-03 17:46 ` Christoph Lameter
2007-10-03 20:46 ` Ethan Solomita
2007-10-04 3:56 ` Christoph Lameter
2007-10-04 7:37 ` Peter Zijlstra
2007-10-04 7:56 ` Paul Jackson
2007-10-04 8:15 ` Peter Zijlstra
2007-10-04 8:25 ` Peter Zijlstra
2007-10-04 9:06 ` Paul Jackson
2007-10-04 9:04 ` Paul Jackson
2007-10-05 19:34 ` Ethan Solomita
2007-09-12 1:40 ` [PATCH 4/6] cpuset write vmscan Ethan Solomita
2007-09-12 1:41 ` [PATCH 5/6] cpuset write vm writeout Ethan Solomita
2007-09-12 1:42 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
2007-09-14 23:15 ` Andrew Morton
2007-09-17 19:00 ` Christoph Lameter
2007-09-19 0:23 ` Ethan Solomita