From: Andrew Morton <akpm@linux-foundation.org>
To: Ethan Solomita <solo@google.com>
Cc: linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
Christoph Lameter <clameter@sgi.com>
Subject: Re: [PATCH 1/6] cpuset write dirty map
Date: Fri, 14 Sep 2007 16:15:36 -0700 [thread overview]
Message-ID: <20070914161536.3ec5c533.akpm@linux-foundation.org> (raw)
In-Reply-To: <46E742A2.9040006@google.com>
On Tue, 11 Sep 2007 18:36:34 -0700
Ethan Solomita <solo@google.com> wrote:
> Add a dirty map to struct address_space
I get a tremendous number of rejects trying to wedge this stuff on top of
Peter's mm-dirty-balancing-for-tasks changes. More rejects than I am
prepared to partially-fix so that I can usefully look at these changes in
tkdiff, so this is all based on a quick peek at the diff itself..
> In a NUMA system it is helpful to know where the dirty pages of a mapping
> are located. That way we will be able to implement writeout for applications
> that are constrained to a portion of the memory of the system as required by
> cpusets.
>
> This patch implements the management of dirty node maps for an address
> space through the following functions:
>
> cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes
>
> cpuset_update_nodes(mapping, page) Record a node in the dirty nodes map
>
> cpuset_init_dirty_nodes(mapping) First time init of the map
>
>
> The dirty map may be stored either directly in the mapping (for NUMA
> systems with less then BITS_PER_LONG nodes) or separately allocated
> for systems with a large number of nodes (f.e. IA64 with 1024 nodes).
>
> Updating the dirty map may involve allocating it first for large
> configurations. Therefore we protect the allocation and setting
> of a node in the map through the tree_lock. The tree_lock is
> already taken when a page is dirtied so there is no additional
> locking overhead if we insert the updating of the nodemask there.
>
> The dirty map is only cleared (or freed) when the inode is cleared.
> At that point no pages are attached to the inode anymore and therefore it can
> be done without any locking. The dirty map therefore records all nodes that
> have been used for dirty pages by that inode until the inode is no longer
> used.
>
It'd be nice to see some discussion regarding the memory consumption of
this patch and the associated tradeoffs.
> ...
>
> +#if MAX_NUMNODES <= BITS_PER_LONG
The patch is sprinkled full of this conditional.
I don't understand why this is being done. afaict it isn't described
in a code comment (it should be) nor even in the changelogs?
Given its overall complexity and its likelihood to change in the
future, I'd suggest that this conditional be centralised in a single
place. Something like
/*
* nice comment goes here
*/
#if MAX_NUMNODES <= BITS_PER_LONG
#define CPUSET_DIRTY_LIMITS 1
#else
#define CPUSET_DIRTY_LIMITS 0
#endif
Then use #if CPUSET_DIRTY_LIMITS everywhere else.
(This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll et a
warning if someone typos '#if CPUSET_DITRY_LIMITS')
Even better would be to calculate CPUSET_DIRTY_LIMITS within Kconfig,
but I suspect you'll need to jump through unfeasible hoops to do that
sort of calculation within Kconfig.
> --- 0/include/linux/fs.h 2007-09-11 14:35:58.000000000 -0700
> +++ 1/include/linux/fs.h 2007-09-11 14:36:24.000000000 -0700
> @@ -516,6 +516,13 @@ struct address_space {
> spinlock_t private_lock; /* for use by the address_space */
> struct list_head private_list; /* ditto */
> struct address_space *assoc_mapping; /* ditto */
> +#ifdef CONFIG_CPUSETS
> +#if MAX_NUMNODES <= BITS_PER_LONG
> + nodemask_t dirty_nodes; /* nodes with dirty pages */
> +#else
> + nodemask_t *dirty_nodes; /* pointer to map if dirty */
> +#endif
> +#endif
afacit there is no code comment and no changelog text which explains the
above design decision? There should be, please.
There is talk of making cpusets available with CONFIG_SMP=n. Will this new
feature be available in that case? (it should be).
> } __attribute__((aligned(sizeof(long))));
> /*
> * On most architectures that alignment is already the case; but
> diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
> --- 0/include/linux/writeback.h 2007-09-11 14:35:58.000000000 -0700
> +++ 1/include/linux/writeback.h 2007-09-11 14:37:46.000000000 -0700
> @@ -62,6 +62,7 @@ struct writeback_control {
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> void *fs_private; /* For use by ->writepages() */
> + nodemask_t *nodes; /* Set of nodes of interest */
> };
That comment is a bit terse. It's always good to be lavish when commenting
data structures, for understanding those is key to understanding a design.
> /*
> diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
> --- 0/kernel/cpuset.c 2007-09-11 14:35:58.000000000 -0700
> +++ 1/kernel/cpuset.c 2007-09-11 14:36:24.000000000 -0700
> @@ -4,7 +4,7 @@
> * Processor and Memory placement constraints for sets of tasks.
> *
> * Copyright (C) 2003 BULL SA.
> - * Copyright (C) 2004-2006 Silicon Graphics, Inc.
> + * Copyright (C) 2004-2007 Silicon Graphics, Inc.
> * Copyright (C) 2006 Google, Inc
> *
> * Portions derived from Patrick Mochel's sysfs code.
> @@ -14,6 +14,7 @@
> * 2003-10-22 Updates by Stephen Hemminger.
> * 2004 May-July Rework by Paul Jackson.
> * 2006 Rework by Paul Menage to use generic containers
> + * 2007 Cpuset writeback by Christoph Lameter.
> *
> * This file is subject to the terms and conditions of the GNU General Public
> * License. See the file COPYING in the main directory of the Linux
> @@ -1754,6 +1755,63 @@ int cpuset_mem_spread_node(void)
> }
> EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
>
> +#if MAX_NUMNODES > BITS_PER_LONG
waah. In other places we do "MAX_NUMNODES <= BITS_PER_LONG"
> +
> +/*
> + * Special functions for NUMA systems with a large number of nodes.
> + * The nodemask is pointed to from the address space structures.
> + * The attachment of the dirty_node mask is protected by the
> + * tree_lock. The nodemask is freed only when the inode is cleared
> + * (and therefore unused, thus no locking necessary).
> + */
hmm, OK, there's a hint as to wghat's going on.
It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
we might want to tweak that in the future. Yet another argument for
centralising this comparison.
> +void cpuset_update_dirty_nodes(struct address_space *mapping,
> + struct page *page)
> +{
> + nodemask_t *nodes = mapping->dirty_nodes;
> + int node = page_to_nid(page);
> +
> + if (!nodes) {
> + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
Does it have to be atomic? atomic is weak and can fail.
If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
should at least pass the gfp_t into this function so it can do the stronger
allocation when possible.
> + if (!nodes)
> + return;
> +
> + *nodes = NODE_MASK_NONE;
> + mapping->dirty_nodes = nodes;
> + }
> +
> + if (!node_isset(node, *nodes))
> + node_set(node, *nodes);
> +}
> +
> +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> +{
> + nodemask_t *nodes = mapping->dirty_nodes;
> +
> + if (nodes) {
> + mapping->dirty_nodes = NULL;
> + kfree(nodes);
> + }
> +}
Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
a comment which describes the locking requirements would be good.
> +/*
> + * Called without the tree_lock. The nodemask is only freed when the inode
> + * is cleared and therefore this is safe.
> + */
> +int cpuset_intersects_dirty_nodes(struct address_space *mapping,
> + nodemask_t *mask)
> +{
> + nodemask_t *dirty_nodes = mapping->dirty_nodes;
> +
> + if (!mask)
> + return 1;
> +
> + if (!dirty_nodes)
> + return 0;
> +
> + return nodes_intersects(*dirty_nodes, *mask);
> +}
> +#endif
> +
> /**
> * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
> * @p: pointer to task_struct of some other task.
> diff -uprN -X 0/Documentation/dontdiff 0/mm/page-writeback.c 1/mm/page-writeback.c
> --- 0/mm/page-writeback.c 2007-09-11 14:35:58.000000000 -0700
> +++ 1/mm/page-writeback.c 2007-09-11 14:36:24.000000000 -0700
> @@ -33,6 +33,7 @@
> #include <linux/syscalls.h>
> #include <linux/buffer_head.h>
> #include <linux/pagevec.h>
> +#include <linux/cpuset.h>
>
> /*
> * The maximum number of pages to writeout in a single bdflush/kupdate
> @@ -832,6 +833,7 @@ int __set_page_dirty_nobuffers(struct pa
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> + cpuset_update_dirty_nodes(mapping, page);
> write_unlock_irq(&mapping->tree_lock);
> if (mapping->host) {
> /* !PageAnon && !swapper_space */
>
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-09-14 23:15 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-07-17 21:23 [PATCH 0/6] cpuset aware writeback Ethan Solomita
2007-07-17 21:32 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
2007-07-17 21:33 ` [PATCH 2/6] cpuset write pdflush nodemask Ethan Solomita
2007-07-17 21:34 ` [PATCH 3/6] cpuset write throttle Ethan Solomita
2007-07-17 21:35 ` [PATCH 4/6] cpuset write vmscan Ethan Solomita
2007-07-17 21:36 ` [PATCH 5/6] cpuset write vm writeout Ethan Solomita
2007-07-17 21:37 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
2007-07-23 20:18 ` [PATCH 0/6] cpuset aware writeback Christoph Lameter
2007-07-23 21:30 ` Ethan Solomita
2007-07-23 21:53 ` Christoph Lameter
2007-09-12 1:32 ` Ethan Solomita
2007-09-12 1:36 ` [PATCH 1/6] cpuset write dirty map Ethan Solomita
2007-09-14 23:15 ` Andrew Morton [this message]
2007-09-14 23:47 ` Satyam Sharma
2007-09-15 0:07 ` Andrew Morton
2007-09-15 0:16 ` Satyam Sharma
2007-09-17 18:37 ` Mike Travis
2007-09-17 19:10 ` Christoph Lameter
2007-09-19 0:51 ` Ethan Solomita
2007-09-19 2:14 ` Andrew Morton
2007-09-19 17:08 ` Christoph Lameter
2007-09-19 17:06 ` Christoph Lameter
2007-09-12 1:38 ` [PATCH 2/6] cpuset write pdflush nodemask Ethan Solomita
2007-09-12 1:39 ` [PATCH 3/6] cpuset write throttle Ethan Solomita
[not found] ` <20070914161517.5ea3847f.akpm@linux-foundation.org>
2007-10-03 0:38 ` Ethan Solomita
2007-10-03 17:46 ` Christoph Lameter
2007-10-03 20:46 ` Ethan Solomita
2007-10-04 3:56 ` Christoph Lameter
2007-10-04 7:37 ` Peter Zijlstra
2007-10-04 7:56 ` Paul Jackson
2007-10-04 8:15 ` Peter Zijlstra
2007-10-04 8:25 ` Peter Zijlstra
2007-10-04 9:06 ` Paul Jackson
2007-10-04 9:04 ` Paul Jackson
2007-10-05 19:34 ` Ethan Solomita
2007-09-12 1:40 ` [PATCH 4/6] cpuset write vmscan Ethan Solomita
2007-09-12 1:41 ` [PATCH 5/6] cpuset write vm writeout Ethan Solomita
2007-09-12 1:42 ` [PATCH 6/6] cpuset dirty limits Ethan Solomita
2007-09-14 23:15 ` Andrew Morton
2007-09-17 19:00 ` Christoph Lameter
2007-09-19 0:23 ` Ethan Solomita
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070914161536.3ec5c533.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=clameter@sgi.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=solo@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).