From: Yinghai Lu <yinghai@kernel.org>
To: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>, Tejun Heo <tj@kernel.org>,
	tglx@linutronix.de, "H. Peter Anvin" <hpa@zytor.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [GIT PULL tip:x86/mm]
Date: Tue, 01 Mar 2011 14:19:13 -0800
Message-ID: <4D6D70E1.40808@kernel.org>
In-Reply-To: <alpine.DEB.2.00.1103010844430.19743@chino.kir.corp.google.com>

On 03/01/2011 09:18 AM, David Rientjes wrote:
> On Thu, 24 Feb 2011, Yinghai Lu wrote:
> 
>> DavidR reported that x86/mm broke his NUMA emulation with 128M etc.
>>
>> So I wonder whether that would hold you back from pushing the whole
>> tip/x86/mm to Linus for .39, or whether it needs a rebase that takes
>> tip/x86/numa-emulation-unify out.
>>
> 
> Ok, so 1f565a896ee1 (x86-64, NUMA: Fix size of numa_distance array) fixes 
> the boot failure when using numa=fake, but there's still another issue 
> that was introduced with regard to emulated distances between fake nodes 
> sitting on hardware using a SLIT.
> 
> This is important because we want to ensure that the physical topology of 
> the machine is still represented in an emulated environment to 
> appropriately describe the expected latencies between the nodes.  It also 
> allows users who are using numa=fake purely as a debugging tool to test 
> more interesting configurations and benchmark memory accesses between 
> emulated nodes as though they were real.
> 
> For example, on my four-node system with a custom SLIT, this is the 
> distance when booting without numa=fake:
> 
> 	$ cat /sys/devices/system/node/node*/distance 
> 	10 20 20 30
> 	20 10 20 20
> 	20 20 10 20
> 	30 20 20 10
> 
> These physical nodes are all symmetric in size.
> 
> With numa=fake=16, we expect to see the fake nodes interleaved (as the 
> default) over the set of physical nodes.  This would suggest distance 
> files for these nodes to be:
> 
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 	20 20 10 20 20 20 10 20 20 20 10 20 20 20 10 20
> 	30 20 20 10 30 20 20 10 30 20 20 10 30 20 20 10
> 	10 20 20 30 10 20 20 30 10 20 20 30 10 20 20 30
> 	20 10 20 20 20 10 20 20 20 10 20 20 20 10 20 20
> 
> (And that is what we see with 2.6.37.)
> 
> However, x86/mm describes these distances differently:
> 
> 	node0/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node1/distance:10 10 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node2/distance:10 20 10 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node3/distance:10 20 20 10 10 20 20 20 10 20 20 20 10 20 20 20
> 	node4/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node5/distance:10 20 20 20 10 10 20 20 10 20 20 20 10 20 20 20
> 	node6/distance:10 20 20 20 10 20 10 20 10 20 20 20 10 20 20 20
> 	node7/distance:10 20 20 20 10 20 20 10 10 20 20 20 10 20 20 20
> 	node8/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node9/distance:10 20 20 20 10 20 20 20 10 10 20 20 10 20 20 20
> 	node10/distance:10 20 20 20 10 20 20 20 10 20 10 20 10 20 20 20
> 	node11/distance:10 20 20 20 10 20 20 20 10 20 20 10 10 20 20 20
> 	node12/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 20
> 	node13/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 10 20 20
> 	node14/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 10 20
> 	node15/distance:10 20 20 20 10 20 20 20 10 20 20 20 10 20 20 10
> 
> It looks as though the emulation changes sitting in x86/mm have dropped 
> the SLIT and are merely describing the emulated nodes as either having 
> physical affinity or not.
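
For reference, the table David expects can be derived mechanically from
the physical SLIT.  Below is a minimal sketch in plain userspace C, not
the kernel code.  It assumes fake node i is backed by physical node
i % 4 (the interleaved default described above); emu_to_phys() and
build_emu_dist() are names made up for this illustration.

```c
#define NR_PHYS 4	/* physical nodes in the example above */
#define NR_EMU  16	/* numa=fake=16 */

/* Physical SLIT from the four-node example quoted above. */
static const unsigned char phys_dist[NR_PHYS * NR_PHYS] = {
	10, 20, 20, 30,
	20, 10, 20, 20,
	20, 20, 10, 20,
	30, 20, 20, 10,
};

/* Assumption: emulated node i sits on physical node i % NR_PHYS. */
static int emu_to_phys(int emu_nid)
{
	return emu_nid % NR_PHYS;
}

/*
 * Fill a row-major NR_EMU x NR_EMU table by looking up the distance
 * between the physical nodes backing each pair of emulated nodes.
 */
static void build_emu_dist(unsigned char *emu_dist)
{
	int i, j;

	for (i = 0; i < NR_EMU; i++)
		for (j = 0; j < NR_EMU; j++)
			emu_dist[i * NR_EMU + j] =
				phys_dist[emu_to_phys(i) * NR_PHYS +
					  emu_to_phys(j)];
}
```

With this mapping, fake nodes 0 and 4 both sit on physical node 0 and so
get distance 10, while fake nodes 0 and 3 get the physical 0-to-3
distance of 30.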

Please check:

[PATCH] x86, numa, emu: Fix slit ignoring.

David reported that after the numa_emu cleanup, the SLIT is no longer
honored.

After looking at the code, it seems the cleanup has several problems:
1. The temporary NUMA distance table needs to be reserved.
	We can only use the find_...without_reserve trick when we are done
	with the old area before getting a new one.
2. Copying should only cover the NEW numa_dist_cnt size,
	so numa_alloc_distance needs to be called first, before the copy.
3. phys_dist should be numa_dist_cnt squared in size.
4. numa_reset_distance should free numa_distance_cnt squared in size.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
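
Note on points 3 and 4: the distance table is a flat cnt x cnt array, so
both allocating and freeing it must use cnt * cnt entries, not cnt.  A
small userspace sketch of the arithmetic follows.  It assumes one-byte
entries, as with the u8 phys_dist in this patch; dist_table_bytes() and
dist_index() are hypothetical helpers, not kernel functions.

```c
#include <stddef.h>

/* Bytes needed for a cnt x cnt distance table of one-byte entries. */
static size_t dist_table_bytes(size_t cnt)
{
	return cnt * cnt * sizeof(unsigned char);
}

/* Flat index of the (from, to) entry in a row-major cnt x cnt table. */
static size_t dist_index(size_t cnt, size_t from, size_t to)
{
	return from * cnt + to;
}
```

For cnt = 4 the last entry is at index 15, so a buffer sized as only
cnt * sizeof(entry) = 4 bytes covers just the first row.
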
 arch/x86/mm/numa_64.c        |    6 ++---
 arch/x86/mm/numa_emulation.c |   50 ++++++++++++++++++++++++++++++-------------
 arch/x86/mm/numa_internal.h  |    1 
 3 files changed, 40 insertions(+), 17 deletions(-)

Index: linux-2.6/arch/x86/mm/numa_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_64.c
+++ linux-2.6/arch/x86/mm/numa_64.c
@@ -393,7 +393,7 @@ void __init numa_reset_distance(void)
 	size_t size;
 
 	if (numa_distance_cnt) {
-		size = numa_distance_cnt * sizeof(numa_distance[0]);
+		size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
 		memblock_x86_free_range(__pa(numa_distance),
 					__pa(numa_distance) + size);
 		numa_distance_cnt = 0;
@@ -401,7 +401,7 @@ void __init numa_reset_distance(void)
 	numa_distance = NULL;
 }
 
-static int __init numa_alloc_distance(void)
+int __init numa_alloc_distance(void)
 {
 	nodemask_t nodes_parsed;
 	size_t size;
@@ -437,7 +437,7 @@ static int __init numa_alloc_distance(vo
 				LOCAL_DISTANCE : REMOTE_DISTANCE;
 	printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
 
-	return 0;
+	return cnt;
 }
 
 /**
Index: linux-2.6/arch/x86/mm/numa_emulation.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_emulation.c
+++ linux-2.6/arch/x86/mm/numa_emulation.c
@@ -300,7 +300,9 @@ void __init numa_emulation(struct numa_m
 	static struct numa_meminfo pi __initdata;
 	const u64 max_addr = max_pfn << PAGE_SHIFT;
 	u8 *phys_dist = NULL;
+	int phys_size = 0;
 	int i, j, ret;
+	int new_nr;
 
 	if (!emu_cmdline)
 		goto no_emu;
@@ -341,16 +343,17 @@ void __init numa_emulation(struct numa_m
 	 * reserve it.
 	 */
 	if (numa_dist_cnt) {
-		size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
 		u64 phys;
 
+		phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
 		phys = memblock_find_in_range(0,
 					      (u64)max_pfn_mapped << PAGE_SHIFT,
-					      size, PAGE_SIZE);
+					      phys_size, PAGE_SIZE);
 		if (phys == MEMBLOCK_ERROR) {
 			pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
 			goto no_emu;
 		}
+		memblock_x86_reserve_range(phys, phys + phys_size, "TMP NUMA DIST");
 		phys_dist = __va(phys);
 
 		for (i = 0; i < numa_dist_cnt; i++)
@@ -383,21 +386,40 @@ void __init numa_emulation(struct numa_m
 
 	/* transform distance table */
 	numa_reset_distance();
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		for (j = 0; j < MAX_NUMNODES; j++) {
-			int physi = emu_nid_to_phys[i];
-			int physj = emu_nid_to_phys[j];
-			int dist;
-
-			if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
-				dist = physi == physj ?
-					LOCAL_DISTANCE : REMOTE_DISTANCE;
-			else
+	/* allocate numa_distance at first, it will set new numa_dist_cnt */
+	new_nr = numa_alloc_distance();
+	if (new_nr < 0)
+		goto free_temp_phys;
+
+	/*
+	 * only set it when we have old phys_dist,
+	 * numa_alloc_distance already set default values
+	 */
+	if (phys_dist)
+		for (i = 0; i < new_nr; i++) {
+			for (j = 0; j < new_nr; j++) {
+				int physi = emu_nid_to_phys[i];
+				int physj = emu_nid_to_phys[j];
+				int dist;
+
+				/* really need this check ? */
+				if (physi >= numa_dist_cnt ||
+				    physj >= numa_dist_cnt)
+					continue;
+
 				dist = phys_dist[physi * numa_dist_cnt + physj];
 
-			numa_set_distance(i, j, dist);
+				numa_set_distance(i, j, dist);
+			}
 		}
-	}
+
+free_temp_phys:
+
+	/* Free the temp storage for phys */
+	if (phys_dist)
+		memblock_x86_free_range(__pa(phys_dist),
+					__pa(phys_dist) + phys_size);
+
 	return;
 
 no_emu:
Index: linux-2.6/arch/x86/mm/numa_internal.h
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_internal.h
+++ linux-2.6/arch/x86/mm/numa_internal.h
@@ -18,6 +18,7 @@ struct numa_meminfo {
 void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
 int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
 void __init numa_reset_distance(void);
+int numa_alloc_distance(void);
 
 #ifdef CONFIG_NUMA_EMU
 void __init numa_emulation(struct numa_meminfo *numa_meminfo,
