* [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive.
@ 2010-08-18 16:56 Robin Holt
From: Robin Holt @ 2010-08-18 16:56 UTC
  To: Robin Holt, Jack Steiner, Thomas Gleixner, Ingo Molnar
  Cc: H. Peter Anvin, x86, Yinghai Lu, Linus Torvalds, Joerg Roedel,
	Andi Kleen, Linux Kernel, Stable Maintainers


Subject: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive.

While testing on a 256-node, 4096-cpu system, Jack Steiner noticed
that vmstat_update consumed 0.08% of every second on average and 0.8%
at the maximum.  This could be tuned using the stat_interval sysctl,
but that merely reduces the impact of the problem.

When I investigated, I noticed that all the node_data[] structures are
allocated precisely at the beginning of the individual node's physical
memory.  By simply staggering them by one cacheline per nodeid, I reduced
the average down to 0.0006% of every second.

With this patch, the max value did not change.  I believe that is due to
a combination of cacheline contention when updating the zone's vmstat
information and round_jiffies_common spattering unrelated cpus onto the
same jiffy for their next update.  I will investigate those issues
separately.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@ppc970.osdl.org>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Cc: Stable Maintainers <stable@kernel.org>

---

This patch applies cleanly to v2.6.34 and later.  It can be applied
manually to earlier kernels, but the x86-bootmem fixes introduce
differences in the surrounding areas.

I had no idea whether to ask stable@kernel.org to pull this back to the
stable releases.  My reading of the stable_kernel_rules.txt criteria is
only fuzzy as to whether this meets the "oh, that's not good" standard.
I personally think this meets that criterion, but I am unwilling to defend
that position too stridently.  In the end, I punted and added them to
the Cc list.  We will be asking both SuSE and RedHat to add this to
their upcoming update releases as we expect it to affect their customers.

 arch/x86/mm/numa_64.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Index: round_jiffies/arch/x86/mm/numa_64.c
===================================================================
--- round_jiffies.orig/arch/x86/mm/numa_64.c	2010-08-18 11:39:20.495141178 -0500
+++ round_jiffies/arch/x86/mm/numa_64.c	2010-08-18 11:47:18.391210989 -0500
@@ -198,6 +198,7 @@ setup_node_bootmem(int nodeid, unsigned
 	unsigned long start_pfn, last_pfn, nodedata_phys;
 	const int pgdat_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
 	int nid;
+	int cache_alias_offset;
 #ifndef CONFIG_NO_BOOTMEM
 	unsigned long bootmap_start, bootmap_pages, bootmap_size;
 	void *bootmap;
@@ -221,9 +222,16 @@ setup_node_bootmem(int nodeid, unsigned
 	start_pfn = start >> PAGE_SHIFT;
 	last_pfn = end >> PAGE_SHIFT;
 
-	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
+	/*
+	 * Allocate an extra cacheline per node to reduce cacheline
+	 * aliasing when scanning all node's node_data.
+	 */
+	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
+	node_data[nodeid] = cache_alias_offset +
+			    early_node_mem(nodeid, start, end,
+					   pgdat_size + cache_alias_offset,
 					   SMP_CACHE_BYTES);
-	if (node_data[nodeid] == NULL)
+	if (node_data[nodeid] == cache_alias_offset)
 		return;
 	nodedata_phys = __pa(node_data[nodeid]);
 	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");

Thread overview: 14+ messages
2010-08-18 16:56 [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive Robin Holt
2010-08-18 18:30 ` [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2 Robin Holt
2010-08-19 17:30   ` Roedel, Joerg
2010-08-19 20:42     ` Robin Holt
2010-08-19 22:02       ` Robin Holt
2010-08-19 22:54   ` H. Peter Anvin
2010-08-20 13:58     ` Robin Holt
2010-08-20 15:03       ` Robin Holt
2010-08-20 16:16         ` H. Peter Anvin
2010-08-21 13:07           ` Robin Holt
2010-08-23 21:42             ` H. Peter Anvin
2010-08-25 11:08               ` Robin Holt
2010-08-25 18:56                 ` H. Peter Anvin
2010-08-25 21:49                   ` Yinghai Lu
