[Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive.

From: Robin Holt <holt@sgi.com>
To: Robin Holt <holt@sgi.com>, Jack Steiner <steiner@sgi.com>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Yinghai Lu <yinghai@kernel.org>,
	Linus Torvalds <torvalds@ppc970.osdl.org>,
	Joerg Roedel <joerg.roedel@amd.com>, Andi Kleen <ak@suse.de>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Stable Maintainers <stable@kernel.org>
Subject: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive.
Date: Wed, 18 Aug 2010 11:56:53 -0500
Message-ID: <20100818165653.GX3043@sgi.com>

While testing on a 256-node, 4096-CPU system, Jack Steiner noticed
that vmstat_update consumed between 0.08% (average) and 0.8% (max) of
every second.  This could be tuned using the stat_interval sysctl,
but that would merely reduce the impact of the problem.

When I investigated, I noticed that all the node_data[] structures are
allocated precisely at the beginning of each node's physical memory, so
they all alias to the same cachelines.  By simply staggering them based
upon nodeid, I reduced the average to 0.0006% of every second.

With this patch, the max value did not change.  I believe that is due to
a combination of cacheline contention while updating the zone's vmstat
information and round_jiffies_common spattering unrelated cpus onto the
same jiffy for their next update.  I will investigate those issues
separately.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@ppc970.osdl.org>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Cc: Stable Maintainers <stable@kernel.org>

---

This patch applies cleanly to v2.6.34 and later.  It manually applies
to previous kernels but the x86-bootmem fixes introduce differences in
the surrounding areas.

I had no idea whether to ask stable@kernel.org to pull this back to the
stable releases.  My reading of the stable_kernel_rules.txt criteria is
only fuzzy as to whether this meets the "oh, that's not good" standard.
I personally think this meets that criterion, but I am unwilling to defend
that position too stridently.  In the end, I punted and added them to
the Cc list.  We will be asking both SuSE and RedHat to add this to
their upcoming update releases as we expect it to affect their customers.

 arch/x86/mm/numa_64.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Index: round_jiffies/arch/x86/mm/numa_64.c
===================================================================
--- round_jiffies.orig/arch/x86/mm/numa_64.c	2010-08-18 11:39:20.495141178 -0500
+++ round_jiffies/arch/x86/mm/numa_64.c	2010-08-18 11:47:18.391210989 -0500
@@ -198,6 +198,7 @@ setup_node_bootmem(int nodeid, unsigned
 	unsigned long start_pfn, last_pfn, nodedata_phys;
 	const int pgdat_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
 	int nid;
+	int cache_alias_offset;
 #ifndef CONFIG_NO_BOOTMEM
 	unsigned long bootmap_start, bootmap_pages, bootmap_size;
 	void *bootmap;
@@ -221,9 +222,16 @@ setup_node_bootmem(int nodeid, unsigned
 	start_pfn = start >> PAGE_SHIFT;
 	last_pfn = end >> PAGE_SHIFT;

-	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
+	/*
+	 * Allocate an extra cacheline per node to reduce cacheline
+	 * aliasing when scanning all nodes' node_data.
+	 */
+	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
+	node_data[nodeid] = cache_alias_offset +
+			    early_node_mem(nodeid, start, end,
+					   pgdat_size + cache_alias_offset,
 					   SMP_CACHE_BYTES);
-	if (node_data[nodeid] == NULL)
+	if (node_data[nodeid] == cache_alias_offset)
 		return;
 	nodedata_phys = __pa(node_data[nodeid]);
 	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");

Thread overview: 14+ messages
2010-08-18 16:56 Robin Holt [this message]
2010-08-18 18:30 ` [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2 Robin Holt
2010-08-19 17:30   ` Roedel, Joerg
2010-08-19 20:42     ` Robin Holt
2010-08-19 22:02       ` Robin Holt
2010-08-19 22:54   ` H. Peter Anvin
2010-08-20 13:58     ` Robin Holt
2010-08-20 15:03       ` Robin Holt
2010-08-20 16:16         ` H. Peter Anvin
2010-08-21 13:07           ` Robin Holt
2010-08-23 21:42             ` H. Peter Anvin
2010-08-25 11:08               ` Robin Holt
2010-08-25 18:56                 ` H. Peter Anvin
2010-08-25 21:49                   ` Yinghai Lu
