From: Robin Holt <holt@sgi.com>
To: Robin Holt <holt@sgi.com>, Jack Steiner <steiner@sgi.com>,
Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
x86@kernel.org, Yinghai Lu <yinghai@kernel.org>,
Linus Torvalds <torvalds@ppc970.osdl.org>,
Joerg Roedel <joerg.roedel@amd.com>, Andi Kleen <ak@suse.de>,
Linux Kernel <linux-kernel@vger.kernel.org>,
Stable Maintainers <stable@kernel.org>
Subject: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive.
Date: Wed, 18 Aug 2010 11:56:53 -0500
Message-ID: <20100818165653.GX3043@sgi.com>
While testing on a 256-node, 4096-CPU system, Jack Steiner noticed
that vmstat_update consumed between 0.08% (average) and 0.8% (max) of
every second. This could be tuned using the stat_interval sysctl,
but that merely reduced the impact of the problem.

When I investigated, I noticed that all the node_data[] structures are
allocated precisely at the beginning of each node's physical memory,
so they alias to the same cachelines. By simply staggering the
allocations based upon nodeid, I reduced the average down to 0.0006%
of every second.
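
To illustrate the aliasing, here is a minimal user-space sketch (not
kernel code). The cache geometry (64-byte lines, 4096 sets) and the
64 GiB of physical memory per node are assumptions picked for the
example, not measurements from the test system; set_index() and
distinct_sets() are hypothetical helper names.

```c
#include <stdint.h>

#define LINE_BYTES 64u          /* assumed cache line size */
#define NUM_SETS   4096u        /* assumed number of cache sets */
#define NUM_NODES  256u         /* node count from the report above */
#define NODE_SPAN  (1ull << 36) /* assumed 64 GiB of memory per node */

/* Cache set selected by a physical address in this simple model. */
static unsigned set_index(uint64_t phys)
{
	return (unsigned)((phys / LINE_BYTES) % NUM_SETS);
}

/*
 * Count how many distinct cache sets the first line of each node's
 * structure lands in, with or without a per-node stagger.
 */
static unsigned distinct_sets(int stagger)
{
	unsigned char seen[NUM_SETS] = {0};
	unsigned count = 0;

	for (unsigned nid = 0; nid < NUM_NODES; nid++) {
		/* Structure sits at the very start of the node's memory. */
		uint64_t phys = (uint64_t)nid * NODE_SPAN;

		if (stagger)
			phys += (uint64_t)nid * LINE_BYTES;

		unsigned set = set_index(phys);
		if (!seen[set]) {
			seen[set] = 1;
			count++;
		}
	}
	return count;
}
```

Under these assumptions, distinct_sets(0) reports a single set shared
by all 256 nodes, while distinct_sets(1) spreads them across 256 sets.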

With this patch, the max value did not change. I believe that is a
combination of cacheline contention when updating the zone's vmstat
information and round_jiffies_common spattering unrelated cpus onto the
same jiffy for their next update. I will investigate those issues
separately.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@ppc970.osdl.org>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Cc: Stable Maintainers <stable@kernel.org>
---
This patch applies cleanly to v2.6.34 and later. It can be applied
manually to earlier kernels, but the x86-bootmem fixes changed the
surrounding code.

I had no idea whether to ask stable@kernel.org to pull this back to the
stable releases. My reading of the stable_kernel_rules.txt criteria is
only fuzzy as to whether this meets the "oh, that's not good" standard.
I personally think this meets that criterion, but I am unwilling to
defend that position too stridently. In the end, I punted and added them
to the Cc list. We will be asking both SuSE and RedHat to add this to
their upcoming update releases as we expect it to affect their customers.

arch/x86/mm/numa_64.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
Index: round_jiffies/arch/x86/mm/numa_64.c
===================================================================
--- round_jiffies.orig/arch/x86/mm/numa_64.c 2010-08-18 11:39:20.495141178 -0500
+++ round_jiffies/arch/x86/mm/numa_64.c 2010-08-18 11:47:18.391210989 -0500
@@ -198,6 +198,7 @@ setup_node_bootmem(int nodeid, unsigned
 	unsigned long start_pfn, last_pfn, nodedata_phys;
 	const int pgdat_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
 	int nid;
+	int cache_alias_offset;
 #ifndef CONFIG_NO_BOOTMEM
 	unsigned long bootmap_start, bootmap_pages, bootmap_size;
 	void *bootmap;
@@ -221,9 +222,16 @@ setup_node_bootmem(int nodeid, unsigned
 	start_pfn = start >> PAGE_SHIFT;
 	last_pfn = end >> PAGE_SHIFT;
 
-	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
+	/*
+	 * Allocate an extra cacheline per node to reduce cacheline
+	 * aliasing when scanning all node's node_data.
+	 */
+	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
+	node_data[nodeid] = cache_alias_offset +
+			    early_node_mem(nodeid, start, end,
+					   pgdat_size + cache_alias_offset,
 					   SMP_CACHE_BYTES);
-	if (node_data[nodeid] == NULL)
+	if (node_data[nodeid] == cache_alias_offset)
 		return;
 	nodedata_phys = __pa(node_data[nodeid]);
 	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");
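
The failure check in the hunk above compares a pointer against an
integer offset. One possible alternative, sketched here in user space
and not taken from this patch, is to test the allocator's result for
NULL before adding the stagger. alloc_staggered() is a hypothetical
helper, CACHE_LINE is an assumed stand-in for SMP_CACHE_BYTES, and
aligned_alloc() merely substitutes for early_node_mem().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u /* assumed stand-in for SMP_CACHE_BYTES */

/*
 * Hypothetical helper: allocate `size` bytes for node `nid`, shifted
 * by nid cache lines. The unshifted pointer is handed back through
 * `raw` so the caller can free() it later.
 */
static void *alloc_staggered(int nid, size_t size, void **raw)
{
	size_t offset = (size_t)nid * CACHE_LINE;
	/* aligned_alloc() wants a size that is a multiple of the alignment. */
	size_t total = (size + offset + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
	void *base = aligned_alloc(CACHE_LINE, total);

	*raw = base;
	if (base == NULL)	/* check before the offset is applied */
		return NULL;
	return (char *)base + offset;
}
```

Because the NULL test happens on the unshifted pointer, an allocation
failure is reported as NULL for every nid, including nid 0.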