* [PATCH 1/3 tip:x86/mm] x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file
@ 2011-02-18 13:58 Tejun Heo
2011-02-18 13:59 ` [PATCH 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c Tejun Heo
2011-02-18 14:00 ` [PATCH 3/3 tip:x86/mm] x86-64, NUMA: Add proper function comments to global functions Tejun Heo
0 siblings, 2 replies; 9+ messages in thread
From: Tejun Heo @ 2011-02-18 13:58 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner, Yinghai Lu, H. Peter Anvin
Cc: David Rientjes, linux-kernel, Cyrill Gorcunov
Update numa_emulation() such that it
- takes @numa_meminfo and @numa_dist_cnt instead of directly
  referencing the global variables.
- copies the distance table by iterating each distance with
  node_distance() instead of memcpy'ing the distance table.
- tests emu_cmdline to determine whether emulation is requested and
  fills emu_nid_to_phys[] with an identity mapping if emulation is not
  used. This allows the caller to call numa_emulation()
  unconditionally and makes the return value unnecessary.
- defines a dummy version if CONFIG_NUMA_EMU is disabled.
This patch doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
These three patches move NUMA emulation into a separate file. Once
reviewed, I'll push them through the misc#x86-numa branch.
Thanks.
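For readers following along, here is a minimal userspace sketch of two
points from the description above: copying the distance table through
an accessor rather than memcpy(), and falling back to an identity
mapping on any failure so the caller can invoke the function
unconditionally. It is not kernel code. demo_table,
demo_node_distance(), demo_emulation(), nid_to_phys[] and MAX_NODES are
hypothetical stand-ins, and malloc() replaces the memblock allocation.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 4	/* hypothetical stand-in for MAX_NUMNODES */

/* Hypothetical 2-node physical distance table, stored row-major. */
static const unsigned char demo_table[2 * 2] = { 10, 20, 20, 10 };

/* Stand-in for node_distance(): the copy loop only reads through this. */
static int demo_node_distance(int from, int to)
{
	return demo_table[from * 2 + to];
}

/* Stand-in for emu_nid_to_phys[]. */
static int nid_to_phys[MAX_NODES];

/*
 * Returns a heap copy of the cnt x cnt distance table when emulation is
 * requested (cmdline != NULL).  On every other path it builds the
 * identity mapping and returns NULL, so the caller never has to check
 * whether emulation was requested before calling.
 */
unsigned char *demo_emulation(const char *cmdline, int cnt)
{
	unsigned char *copy;
	int i, j;

	if (!cmdline)
		goto no_emu;

	assert(cnt > 0);

	/* The loop below writes cnt * cnt entries, so allocate that many. */
	copy = malloc((size_t)cnt * cnt * sizeof(copy[0]));
	if (!copy)
		goto no_emu;

	for (i = 0; i < cnt; i++)
		for (j = 0; j < cnt; j++)
			copy[i * cnt + j] =
				(unsigned char)demo_node_distance(i, j);
	return copy;

no_emu:
	/* No emulation: every node id maps to itself. */
	for (i = 0; i < MAX_NODES; i++)
		nid_to_phys[i] = i;
	return NULL;
}
```

Calling demo_emulation("4", 2) yields a copy of the four demo
distances, while demo_emulation(NULL, 2) only fills nid_to_phys[] with
0..MAX_NODES-1. Going through the accessor means the function no longer
needs to know how the table is laid out in memory.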
arch/x86/mm/numa_64.c | 56 +++++++++++++++++++++++++++++---------------------
1 file changed, 33 insertions(+), 23 deletions(-)
Index: work/arch/x86/mm/numa_64.c
===================================================================
--- work.orig/arch/x86/mm/numa_64.c
+++ work/arch/x86/mm/numa_64.c
@@ -797,17 +797,20 @@ static int __init split_nodes_size_inter
* Sets up the system RAM area from start_pfn to last_pfn according to the
* numa=fake command-line option.
*/
-static bool __init numa_emulation(void)
+static void __init numa_emulation(struct numa_meminfo *numa_meminfo,
+ int numa_dist_cnt)
{
static struct numa_meminfo ei __initdata;
static struct numa_meminfo pi __initdata;
const u64 max_addr = max_pfn << PAGE_SHIFT;
- int phys_dist_cnt = numa_distance_cnt;
u8 *phys_dist = NULL;
int i, j, ret;
+ if (!emu_cmdline)
+ goto no_emu;
+
memset(&ei, 0, sizeof(ei));
- pi = numa_meminfo;
+ pi = *numa_meminfo;
for (i = 0; i < MAX_NUMNODES; i++)
emu_nid_to_phys[i] = NUMA_NO_NODE;
@@ -830,19 +833,19 @@ static bool __init numa_emulation(void)
}
if (ret < 0)
- return false;
+ goto no_emu;
if (numa_cleanup_meminfo(&ei) < 0) {
pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
- return false;
+ goto no_emu;
}
/*
* Copy the original distance table. It's temporary so no need to
* reserve it.
*/
- if (phys_dist_cnt) {
- size_t size = phys_dist_cnt * sizeof(numa_distance[0]);
+ if (numa_dist_cnt) {
+ size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
u64 phys;
phys = memblock_find_in_range(0,
@@ -850,14 +853,18 @@ static bool __init numa_emulation(void)
size, PAGE_SIZE);
if (phys == MEMBLOCK_ERROR) {
pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
- return false;
+ goto no_emu;
}
phys_dist = __va(phys);
- memcpy(phys_dist, numa_distance, size);
+
+ for (i = 0; i < numa_dist_cnt; i++)
+ for (j = 0; j < numa_dist_cnt; j++)
+ phys_dist[i * numa_dist_cnt + j] =
+ node_distance(i, j);
}
/* commit */
- numa_meminfo = ei;
+ *numa_meminfo = ei;
/*
* Transform __apicid_to_node table to use emulated nids by
@@ -886,18 +893,27 @@ static bool __init numa_emulation(void)
int physj = emu_nid_to_phys[j];
int dist;
- if (physi >= phys_dist_cnt || physj >= phys_dist_cnt)
+ if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
dist = physi == physj ?
LOCAL_DISTANCE : REMOTE_DISTANCE;
else
- dist = phys_dist[physi * phys_dist_cnt + physj];
+ dist = phys_dist[physi * numa_dist_cnt + physj];
numa_set_distance(i, j, dist);
}
}
- return true;
+ return;
+
+no_emu:
+ /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */
+ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+ emu_nid_to_phys[i] = i;
}
-#endif /* CONFIG_NUMA_EMU */
+#else /* CONFIG_NUMA_EMU */
+static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
+ int numa_dist_cnt)
+{ }
+#endif /* CONFIG_NUMA_EMU */
static int __init dummy_numa_init(void)
{
@@ -945,15 +961,9 @@ void __init initmem_init(void)
if (numa_cleanup_meminfo(&numa_meminfo) < 0)
continue;
-#ifdef CONFIG_NUMA_EMU
- /*
- * If requested, try emulation. If emulation is not used,
- * build identity emu_nid_to_phys[] for numa_add_cpu()
- */
- if (!emu_cmdline || !numa_emulation())
- for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
- emu_nid_to_phys[j] = j;
-#endif
+
+ numa_emulation(&numa_meminfo, numa_distance_cnt);
+
if (numa_register_memblks(&numa_meminfo) < 0)
continue;
* [PATCH 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c
2011-02-18 13:58 [PATCH 1/3 tip:x86/mm] x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file Tejun Heo
@ 2011-02-18 13:59 ` Tejun Heo
2011-02-18 17:58 ` Yinghai Lu
2011-02-21 8:26 ` [PATCH UPDATED " Tejun Heo
2011-02-18 14:00 ` [PATCH 3/3 tip:x86/mm] x86-64, NUMA: Add proper function comments to global functions Tejun Heo
1 sibling, 2 replies; 9+ messages in thread
From: Tejun Heo @ 2011-02-18 13:59 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner, Yinghai Lu, H. Peter Anvin
Cc: David Rientjes, linux-kernel, Cyrill Gorcunov
Create numa_emulation.c and move all NUMA emulation code there. The
definitions of struct numa_memblk and numa_meminfo are moved to
numa_64.h. Also, numa_remove_memblk_from(), numa_cleanup_meminfo(),
numa_reset_distance() along with numa_emulation() are made global.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
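The numa_64.h hunk below pairs a real declaration with an empty static
inline stub, so initmem_init() can call numa_emulation() without an
#ifdef of its own. A minimal userspace sketch of that pattern follows.
DEMO_NUMA_EMU, struct demo_meminfo, demo_emulation() and demo_init()
are hypothetical names chosen for illustration, and the "enabled" body
is a placeholder rather than the real emulation logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch standing in for CONFIG_NUMA_EMU (default: off). */
#ifndef DEMO_NUMA_EMU
#define DEMO_NUMA_EMU 0
#endif

/* Hypothetical, trimmed-down stand-in for struct numa_meminfo. */
struct demo_meminfo {
	int nr_blks;
};

#if DEMO_NUMA_EMU
/*
 * Enabled: in a real tree this body would live in its own .c file that
 * the Makefile builds only when the option is set.  Doubling nr_blks is
 * a placeholder effect so the two configurations are distinguishable.
 */
static void demo_emulation(struct demo_meminfo *mi, int dist_cnt)
{
	assert(mi != NULL);
	(void)dist_cnt;
	mi->nr_blks *= 2;
}
#else
/* Disabled: an empty inline stub that the compiler can drop entirely. */
static inline void demo_emulation(struct demo_meminfo *mi, int dist_cnt)
{
	(void)mi;
	(void)dist_cnt;
}
#endif

/* The caller is identical in both configurations. */
int demo_init(int nr_blks)
{
	struct demo_meminfo mi = { .nr_blks = nr_blks };

	demo_emulation(&mi, 0);
	return mi.nr_blks;
}
```

With the default DEMO_NUMA_EMU=0, demo_init(3) returns 3 because the
stub does nothing. Building with -DDEMO_NUMA_EMU=1 selects the
placeholder body instead, and the call site does not change.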
arch/x86/include/asm/numa_64.h | 29 ++
arch/x86/mm/Makefile | 1
arch/x86/mm/numa_64.c | 479 -----------------------------------------
arch/x86/mm/numa_emulation.c | 451 ++++++++++++++++++++++++++++++++++++++
4 files changed, 482 insertions(+), 478 deletions(-)
Index: work/arch/x86/include/asm/numa_64.h
===================================================================
--- work.orig/arch/x86/include/asm/numa_64.h
+++ work/arch/x86/include/asm/numa_64.h
@@ -18,6 +18,21 @@ extern void setup_node_bootmem(int nodei
#ifdef CONFIG_NUMA
/*
+ * Data structures to describe memory configuration during NUMA
+ * initialization. Use only in NUMA init and emulation paths.
+ */
+struct numa_memblk {
+ u64 start;
+ u64 end;
+ int nid;
+};
+
+struct numa_meminfo {
+ int nr_blks;
+ struct numa_memblk blk[NR_NODE_MEMBLKS];
+};
+
+/*
* Too small node sizes may confuse the VM badly. Usually they
* result from BIOS bugs. So dont recognize nodes as standalone
* NUMA entities that have less than this amount of RAM listed:
@@ -28,15 +43,25 @@ extern nodemask_t numa_nodes_parsed __in
extern int __cpuinit numa_cpu_node(int cpu);
extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
+extern void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
+extern int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
+extern void __init numa_reset_distance(void);
extern void __init numa_set_distance(int from, int to, int distance);
#ifdef CONFIG_NUMA_EMU
#define FAKE_NODE_MIN_SIZE ((u64)32 << 20)
#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL))
void numa_emu_cmdline(char *);
+void __init numa_emulation(struct numa_meminfo *numa_meminfo,
+ int numa_dist_cnt);
+#else /* CONFIG_NUMA_EMU */
+static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
+ int numa_dist_cnt)
+{ }
#endif /* CONFIG_NUMA_EMU */
-#else
+
+#else /* CONFIG_NUMA */
static inline int numa_cpu_node(int cpu) { return NUMA_NO_NODE; }
-#endif
+#endif /* CONFIG_NUMA */
#endif /* _ASM_X86_NUMA_64_H */
Index: work/arch/x86/mm/Makefile
===================================================================
--- work.orig/arch/x86/mm/Makefile
+++ work/arch/x86/mm/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmio
obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o
obj-$(CONFIG_AMD_NUMA) += amdtopology_64.o
obj-$(CONFIG_ACPI_NUMA) += srat_$(BITS).o
+obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
Index: work/arch/x86/mm/numa_64.c
===================================================================
--- work.orig/arch/x86/mm/numa_64.c
+++ work/arch/x86/mm/numa_64.c
@@ -22,17 +22,6 @@
#include <asm/acpi.h>
#include <asm/amd_nb.h>
-struct numa_memblk {
- u64 start;
- u64 end;
- int nid;
-};
-
-struct numa_meminfo {
- int nr_blks;
- struct numa_memblk blk[NR_NODE_MEMBLKS];
-};
-
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);
@@ -215,7 +204,7 @@ static int __init numa_add_memblk_to(int
return 0;
}
-static void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
+void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
{
mi->nr_blks--;
memmove(&mi->blk[idx], &mi->blk[idx + 1],
@@ -273,7 +262,7 @@ setup_node_bootmem(int nodeid, unsigned
node_set_online(nodeid);
}
-static int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
{
const u64 low = 0;
const u64 high = (u64)max_pfn << PAGE_SHIFT;
@@ -367,7 +356,7 @@ static void __init numa_nodemask_from_me
* Reset distance table. The current table is freed. The next
* numa_set_distance() call will create a new one.
*/
-static void __init numa_reset_distance(void)
+void __init numa_reset_distance(void)
{
size_t size;
@@ -533,388 +522,6 @@ static int __init numa_register_memblks(
return 0;
}
-#ifdef CONFIG_NUMA_EMU
-/* Numa emulation */
-static int emu_nid_to_phys[MAX_NUMNODES] __cpuinitdata;
-static char *emu_cmdline __initdata;
-
-void __init numa_emu_cmdline(char *str)
-{
- emu_cmdline = str;
-}
-
-static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
-{
- int i;
-
- for (i = 0; i < mi->nr_blks; i++)
- if (mi->blk[i].nid == nid)
- return i;
- return -ENOENT;
-}
-
-/*
- * Sets up nid to range from @start to @end. The return value is -errno if
- * something went wrong, 0 otherwise.
- */
-static int __init emu_setup_memblk(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- int nid, int phys_blk, u64 size)
-{
- struct numa_memblk *eb = &ei->blk[ei->nr_blks];
- struct numa_memblk *pb = &pi->blk[phys_blk];
-
- if (ei->nr_blks >= NR_NODE_MEMBLKS) {
- pr_err("NUMA: Too many emulated memblks, failing emulation\n");
- return -EINVAL;
- }
-
- ei->nr_blks++;
- eb->start = pb->start;
- eb->end = pb->start + size;
- eb->nid = nid;
-
- if (emu_nid_to_phys[nid] == NUMA_NO_NODE)
- emu_nid_to_phys[nid] = pb->nid;
-
- pb->start += size;
- if (pb->start >= pb->end) {
- WARN_ON_ONCE(pb->start > pb->end);
- numa_remove_memblk_from(phys_blk, pi);
- }
-
- printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n", nid,
- eb->start, eb->end, (eb->end - eb->start) >> 20);
- return 0;
-}
-
-/*
- * Sets up nr_nodes fake nodes interleaved over physical nodes ranging from addr
- * to max_addr. The return value is the number of nodes allocated.
- */
-static int __init split_nodes_interleave(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- u64 addr, u64 max_addr, int nr_nodes)
-{
- nodemask_t physnode_mask = NODE_MASK_NONE;
- u64 size;
- int big;
- int nid = 0;
- int i, ret;
-
- if (nr_nodes <= 0)
- return -1;
- if (nr_nodes > MAX_NUMNODES) {
- pr_info("numa=fake=%d too large, reducing to %d\n",
- nr_nodes, MAX_NUMNODES);
- nr_nodes = MAX_NUMNODES;
- }
-
- size = (max_addr - addr - memblock_x86_hole_size(addr, max_addr)) / nr_nodes;
- /*
- * Calculate the number of big nodes that can be allocated as a result
- * of consolidating the remainder.
- */
- big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * nr_nodes) /
- FAKE_NODE_MIN_SIZE;
-
- size &= FAKE_NODE_MIN_HASH_MASK;
- if (!size) {
- pr_err("Not enough memory for each node. "
- "NUMA emulation disabled.\n");
- return -1;
- }
-
- for (i = 0; i < pi->nr_blks; i++)
- node_set(pi->blk[i].nid, physnode_mask);
-
- /*
- * Continue to fill physical nodes with fake nodes until there is no
- * memory left on any of them.
- */
- while (nodes_weight(physnode_mask)) {
- for_each_node_mask(i, physnode_mask) {
- u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
- u64 start, limit, end;
- int phys_blk;
-
- phys_blk = emu_find_memblk_by_nid(i, pi);
- if (phys_blk < 0) {
- node_clear(i, physnode_mask);
- continue;
- }
- start = pi->blk[phys_blk].start;
- limit = pi->blk[phys_blk].end;
- end = start + size;
-
- if (nid < big)
- end += FAKE_NODE_MIN_SIZE;
-
- /*
- * Continue to add memory to this fake node if its
- * non-reserved memory is less than the per-node size.
- */
- while (end - start -
- memblock_x86_hole_size(start, end) < size) {
- end += FAKE_NODE_MIN_SIZE;
- if (end > limit) {
- end = limit;
- break;
- }
- }
-
- /*
- * If there won't be at least FAKE_NODE_MIN_SIZE of
- * non-reserved memory in ZONE_DMA32 for the next node,
- * this one must extend to the boundary.
- */
- if (end < dma32_end && dma32_end - end -
- memblock_x86_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
- end = dma32_end;
-
- /*
- * If there won't be enough non-reserved memory for the
- * next node, this one must extend to the end of the
- * physical node.
- */
- if (limit - end -
- memblock_x86_hole_size(end, limit) < size)
- end = limit;
-
- ret = emu_setup_memblk(ei, pi, nid++ % nr_nodes,
- phys_blk,
- min(end, limit) - start);
- if (ret < 0)
- return ret;
- }
- }
- return 0;
-}
-
-/*
- * Returns the end address of a node so that there is at least `size' amount of
- * non-reserved memory or `max_addr' is reached.
- */
-static u64 __init find_end_of_node(u64 start, u64 max_addr, u64 size)
-{
- u64 end = start + size;
-
- while (end - start - memblock_x86_hole_size(start, end) < size) {
- end += FAKE_NODE_MIN_SIZE;
- if (end > max_addr) {
- end = max_addr;
- break;
- }
- }
- return end;
-}
-
-/*
- * Sets up fake nodes of `size' interleaved over physical nodes ranging from
- * `addr' to `max_addr'. The return value is the number of nodes allocated.
- */
-static int __init split_nodes_size_interleave(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- u64 addr, u64 max_addr, u64 size)
-{
- nodemask_t physnode_mask = NODE_MASK_NONE;
- u64 min_size;
- int nid = 0;
- int i, ret;
-
- if (!size)
- return -1;
- /*
- * The limit on emulated nodes is MAX_NUMNODES, so the size per node is
- * increased accordingly if the requested size is too small. This
- * creates a uniform distribution of node sizes across the entire
- * machine (but not necessarily over physical nodes).
- */
- min_size = (max_addr - addr - memblock_x86_hole_size(addr, max_addr)) /
- MAX_NUMNODES;
- min_size = max(min_size, FAKE_NODE_MIN_SIZE);
- if ((min_size & FAKE_NODE_MIN_HASH_MASK) < min_size)
- min_size = (min_size + FAKE_NODE_MIN_SIZE) &
- FAKE_NODE_MIN_HASH_MASK;
- if (size < min_size) {
- pr_err("Fake node size %LuMB too small, increasing to %LuMB\n",
- size >> 20, min_size >> 20);
- size = min_size;
- }
- size &= FAKE_NODE_MIN_HASH_MASK;
-
- for (i = 0; i < pi->nr_blks; i++)
- node_set(pi->blk[i].nid, physnode_mask);
-
- /*
- * Fill physical nodes with fake nodes of size until there is no memory
- * left on any of them.
- */
- while (nodes_weight(physnode_mask)) {
- for_each_node_mask(i, physnode_mask) {
- u64 dma32_end = MAX_DMA32_PFN << PAGE_SHIFT;
- u64 start, limit, end;
- int phys_blk;
-
- phys_blk = emu_find_memblk_by_nid(i, pi);
- if (phys_blk < 0) {
- node_clear(i, physnode_mask);
- continue;
- }
- start = pi->blk[phys_blk].start;
- limit = pi->blk[phys_blk].end;
-
- end = find_end_of_node(start, limit, size);
- /*
- * If there won't be at least FAKE_NODE_MIN_SIZE of
- * non-reserved memory in ZONE_DMA32 for the next node,
- * this one must extend to the boundary.
- */
- if (end < dma32_end && dma32_end - end -
- memblock_x86_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
- end = dma32_end;
-
- /*
- * If there won't be enough non-reserved memory for the
- * next node, this one must extend to the end of the
- * physical node.
- */
- if (limit - end -
- memblock_x86_hole_size(end, limit) < size)
- end = limit;
-
- ret = emu_setup_memblk(ei, pi, nid++ % MAX_NUMNODES,
- phys_blk,
- min(end, limit) - start);
- if (ret < 0)
- return ret;
- }
- }
- return 0;
-}
-
-/*
- * Sets up the system RAM area from start_pfn to last_pfn according to the
- * numa=fake command-line option.
- */
-static void __init numa_emulation(struct numa_meminfo *numa_meminfo,
- int numa_dist_cnt)
-{
- static struct numa_meminfo ei __initdata;
- static struct numa_meminfo pi __initdata;
- const u64 max_addr = max_pfn << PAGE_SHIFT;
- u8 *phys_dist = NULL;
- int i, j, ret;
-
- if (!emu_cmdline)
- goto no_emu;
-
- memset(&ei, 0, sizeof(ei));
- pi = *numa_meminfo;
-
- for (i = 0; i < MAX_NUMNODES; i++)
- emu_nid_to_phys[i] = NUMA_NO_NODE;
-
- /*
- * If the numa=fake command-line contains a 'M' or 'G', it represents
- * the fixed node size. Otherwise, if it is just a single number N,
- * split the system RAM into N fake nodes.
- */
- if (strchr(emu_cmdline, 'M') || strchr(emu_cmdline, 'G')) {
- u64 size;
-
- size = memparse(emu_cmdline, &emu_cmdline);
- ret = split_nodes_size_interleave(&ei, &pi, 0, max_addr, size);
- } else {
- unsigned long n;
-
- n = simple_strtoul(emu_cmdline, NULL, 0);
- ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n);
- }
-
- if (ret < 0)
- goto no_emu;
-
- if (numa_cleanup_meminfo(&ei) < 0) {
- pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
- goto no_emu;
- }
-
- /*
- * Copy the original distance table. It's temporary so no need to
- * reserve it.
- */
- if (numa_dist_cnt) {
- size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
- u64 phys;
-
- phys = memblock_find_in_range(0,
- (u64)max_pfn_mapped << PAGE_SHIFT,
- size, PAGE_SIZE);
- if (phys == MEMBLOCK_ERROR) {
- pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
- goto no_emu;
- }
- phys_dist = __va(phys);
-
- for (i = 0; i < numa_dist_cnt; i++)
- for (j = 0; j < numa_dist_cnt; j++)
- phys_dist[i * numa_dist_cnt + j] =
- node_distance(i, j);
- }
-
- /* commit */
- *numa_meminfo = ei;
-
- /*
- * Transform __apicid_to_node table to use emulated nids by
- * reverse-mapping phys_nid. The maps should always exist but fall
- * back to zero just in case.
- */
- for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
- if (__apicid_to_node[i] == NUMA_NO_NODE)
- continue;
- for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
- if (__apicid_to_node[i] == emu_nid_to_phys[j])
- break;
- __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
- }
-
- /* make sure all emulated nodes are mapped to a physical node */
- for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
- if (emu_nid_to_phys[i] == NUMA_NO_NODE)
- emu_nid_to_phys[i] = 0;
-
- /* transform distance table */
- numa_reset_distance();
- for (i = 0; i < MAX_NUMNODES; i++) {
- for (j = 0; j < MAX_NUMNODES; j++) {
- int physi = emu_nid_to_phys[i];
- int physj = emu_nid_to_phys[j];
- int dist;
-
- if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
- dist = physi == physj ?
- LOCAL_DISTANCE : REMOTE_DISTANCE;
- else
- dist = phys_dist[physi * numa_dist_cnt + physj];
-
- numa_set_distance(i, j, dist);
- }
- }
- return;
-
-no_emu:
- /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */
- for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
- emu_nid_to_phys[i] = i;
-}
-#else /* CONFIG_NUMA_EMU */
-static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
- int numa_dist_cnt)
-{ }
-#endif /* CONFIG_NUMA_EMU */
-
static int __init dummy_numa_init(void)
{
printk(KERN_INFO "%s\n",
@@ -1002,83 +609,3 @@ int __cpuinit numa_cpu_node(int cpu)
return __apicid_to_node[apicid];
return NUMA_NO_NODE;
}
-
-/*
- * UGLINESS AHEAD: Currently, CONFIG_NUMA_EMU is 64bit only and makes use
- * of 64bit specific data structures. The distinction is artificial and
- * should be removed. numa_{add|remove}_cpu() are implemented in numa.c
- * for both 32 and 64bit when CONFIG_NUMA_EMU is disabled but here when
- * enabled.
- *
- * NUMA emulation is planned to be made generic and the following and other
- * related code should be moved to numa.c.
- */
-#ifdef CONFIG_NUMA_EMU
-# ifndef CONFIG_DEBUG_PER_CPU_MAPS
-void __cpuinit numa_add_cpu(int cpu)
-{
- int physnid, nid;
-
- nid = numa_cpu_node(cpu);
- if (nid == NUMA_NO_NODE)
- nid = early_cpu_to_node(cpu);
- BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));
-
- physnid = emu_nid_to_phys[nid];
-
- /*
- * Map the cpu to each emulated node that is allocated on the physical
- * node of the cpu's apic id.
- */
- for_each_online_node(nid)
- if (emu_nid_to_phys[nid] == physnid)
- cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
-}
-
-void __cpuinit numa_remove_cpu(int cpu)
-{
- int i;
-
- for_each_online_node(i)
- cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
-}
-# else /* !CONFIG_DEBUG_PER_CPU_MAPS */
-static void __cpuinit numa_set_cpumask(int cpu, int enable)
-{
- struct cpumask *mask;
- int nid, physnid, i;
-
- nid = early_cpu_to_node(cpu);
- if (nid == NUMA_NO_NODE) {
- /* early_cpu_to_node() already emits a warning and trace */
- return;
- }
-
- physnid = emu_nid_to_phys[nid];
-
- for_each_online_node(i) {
- if (emu_nid_to_phys[nid] != physnid)
- continue;
-
- mask = debug_cpumask_set_cpu(cpu, enable);
- if (!mask)
- return;
-
- if (enable)
- cpumask_set_cpu(cpu, mask);
- else
- cpumask_clear_cpu(cpu, mask);
- }
-}
-
-void __cpuinit numa_add_cpu(int cpu)
-{
- numa_set_cpumask(cpu, 1);
-}
-
-void __cpuinit numa_remove_cpu(int cpu)
-{
- numa_set_cpumask(cpu, 0);
-}
-# endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
-#endif /* CONFIG_NUMA_EMU */
Index: work/arch/x86/mm/numa_emulation.c
===================================================================
--- /dev/null
+++ work/arch/x86/mm/numa_emulation.c
@@ -0,0 +1,451 @@
+/*
+ * NUMA emulation
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/topology.h>
+#include <linux/memblock.h>
+#include <asm/numa.h>
+#include <asm/dma.h>
+
+static int emu_nid_to_phys[MAX_NUMNODES] __cpuinitdata;
+static char *emu_cmdline __initdata;
+
+void __init numa_emu_cmdline(char *str)
+{
+ emu_cmdline = str;
+}
+
+static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
+{
+ int i;
+
+ for (i = 0; i < mi->nr_blks; i++)
+ if (mi->blk[i].nid == nid)
+ return i;
+ return -ENOENT;
+}
+
+/*
+ * Sets up nid to range from @start to @end. The return value is -errno if
+ * something went wrong, 0 otherwise.
+ */
+static int __init emu_setup_memblk(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ int nid, int phys_blk, u64 size)
+{
+ struct numa_memblk *eb = &ei->blk[ei->nr_blks];
+ struct numa_memblk *pb = &pi->blk[phys_blk];
+
+ if (ei->nr_blks >= NR_NODE_MEMBLKS) {
+ pr_err("NUMA: Too many emulated memblks, failing emulation\n");
+ return -EINVAL;
+ }
+
+ ei->nr_blks++;
+ eb->start = pb->start;
+ eb->end = pb->start + size;
+ eb->nid = nid;
+
+ if (emu_nid_to_phys[nid] == NUMA_NO_NODE)
+ emu_nid_to_phys[nid] = pb->nid;
+
+ pb->start += size;
+ if (pb->start >= pb->end) {
+ WARN_ON_ONCE(pb->start > pb->end);
+ numa_remove_memblk_from(phys_blk, pi);
+ }
+
+ printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n", nid,
+ eb->start, eb->end, (eb->end - eb->start) >> 20);
+ return 0;
+}
+
+/*
+ * Sets up nr_nodes fake nodes interleaved over physical nodes ranging from addr
+ * to max_addr. The return value is the number of nodes allocated.
+ */
+static int __init split_nodes_interleave(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ u64 addr, u64 max_addr, int nr_nodes)
+{
+ nodemask_t physnode_mask = NODE_MASK_NONE;
+ u64 size;
+ int big;
+ int nid = 0;
+ int i, ret;
+
+ if (nr_nodes <= 0)
+ return -1;
+ if (nr_nodes > MAX_NUMNODES) {
+ pr_info("numa=fake=%d too large, reducing to %d\n",
+ nr_nodes, MAX_NUMNODES);
+ nr_nodes = MAX_NUMNODES;
+ }
+
+ size = (max_addr - addr - memblock_x86_hole_size(addr, max_addr)) / nr_nodes;
+ /*
+ * Calculate the number of big nodes that can be allocated as a result
+ * of consolidating the remainder.
+ */
+ big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * nr_nodes) /
+ FAKE_NODE_MIN_SIZE;
+
+ size &= FAKE_NODE_MIN_HASH_MASK;
+ if (!size) {
+ pr_err("Not enough memory for each node. "
+ "NUMA emulation disabled.\n");
+ return -1;
+ }
+
+ for (i = 0; i < pi->nr_blks; i++)
+ node_set(pi->blk[i].nid, physnode_mask);
+
+ /*
+ * Continue to fill physical nodes with fake nodes until there is no
+ * memory left on any of them.
+ */
+ while (nodes_weight(physnode_mask)) {
+ for_each_node_mask(i, physnode_mask) {
+ u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
+ u64 start, limit, end;
+ int phys_blk;
+
+ phys_blk = emu_find_memblk_by_nid(i, pi);
+ if (phys_blk < 0) {
+ node_clear(i, physnode_mask);
+ continue;
+ }
+ start = pi->blk[phys_blk].start;
+ limit = pi->blk[phys_blk].end;
+ end = start + size;
+
+ if (nid < big)
+ end += FAKE_NODE_MIN_SIZE;
+
+ /*
+ * Continue to add memory to this fake node if its
+ * non-reserved memory is less than the per-node size.
+ */
+ while (end - start -
+ memblock_x86_hole_size(start, end) < size) {
+ end += FAKE_NODE_MIN_SIZE;
+ if (end > limit) {
+ end = limit;
+ break;
+ }
+ }
+
+ /*
+ * If there won't be at least FAKE_NODE_MIN_SIZE of
+ * non-reserved memory in ZONE_DMA32 for the next node,
+ * this one must extend to the boundary.
+ */
+ if (end < dma32_end && dma32_end - end -
+ memblock_x86_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
+ end = dma32_end;
+
+ /*
+ * If there won't be enough non-reserved memory for the
+ * next node, this one must extend to the end of the
+ * physical node.
+ */
+ if (limit - end -
+ memblock_x86_hole_size(end, limit) < size)
+ end = limit;
+
+ ret = emu_setup_memblk(ei, pi, nid++ % nr_nodes,
+ phys_blk,
+ min(end, limit) - start);
+ if (ret < 0)
+ return ret;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Returns the end address of a node so that there is at least `size' amount of
+ * non-reserved memory or `max_addr' is reached.
+ */
+static u64 __init find_end_of_node(u64 start, u64 max_addr, u64 size)
+{
+ u64 end = start + size;
+
+ while (end - start - memblock_x86_hole_size(start, end) < size) {
+ end += FAKE_NODE_MIN_SIZE;
+ if (end > max_addr) {
+ end = max_addr;
+ break;
+ }
+ }
+ return end;
+}
+
+/*
+ * Sets up fake nodes of `size' interleaved over physical nodes ranging from
+ * `addr' to `max_addr'. The return value is the number of nodes allocated.
+ */
+static int __init split_nodes_size_interleave(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ u64 addr, u64 max_addr, u64 size)
+{
+ nodemask_t physnode_mask = NODE_MASK_NONE;
+ u64 min_size;
+ int nid = 0;
+ int i, ret;
+
+ if (!size)
+ return -1;
+ /*
+ * The limit on emulated nodes is MAX_NUMNODES, so the size per node is
+ * increased accordingly if the requested size is too small. This
+ * creates a uniform distribution of node sizes across the entire
+ * machine (but not necessarily over physical nodes).
+ */
+ min_size = (max_addr - addr - memblock_x86_hole_size(addr, max_addr)) /
+ MAX_NUMNODES;
+ min_size = max(min_size, FAKE_NODE_MIN_SIZE);
+ if ((min_size & FAKE_NODE_MIN_HASH_MASK) < min_size)
+ min_size = (min_size + FAKE_NODE_MIN_SIZE) &
+ FAKE_NODE_MIN_HASH_MASK;
+ if (size < min_size) {
+ pr_err("Fake node size %LuMB too small, increasing to %LuMB\n",
+ size >> 20, min_size >> 20);
+ size = min_size;
+ }
+ size &= FAKE_NODE_MIN_HASH_MASK;
+
+ for (i = 0; i < pi->nr_blks; i++)
+ node_set(pi->blk[i].nid, physnode_mask);
+
+ /*
+ * Fill physical nodes with fake nodes of size until there is no memory
+ * left on any of them.
+ */
+ while (nodes_weight(physnode_mask)) {
+ for_each_node_mask(i, physnode_mask) {
+ u64 dma32_end = MAX_DMA32_PFN << PAGE_SHIFT;
+ u64 start, limit, end;
+ int phys_blk;
+
+ phys_blk = emu_find_memblk_by_nid(i, pi);
+ if (phys_blk < 0) {
+ node_clear(i, physnode_mask);
+ continue;
+ }
+ start = pi->blk[phys_blk].start;
+ limit = pi->blk[phys_blk].end;
+
+ end = find_end_of_node(start, limit, size);
+ /*
+ * If there won't be at least FAKE_NODE_MIN_SIZE of
+ * non-reserved memory in ZONE_DMA32 for the next node,
+ * this one must extend to the boundary.
+ */
+ if (end < dma32_end && dma32_end - end -
+ memblock_x86_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
+ end = dma32_end;
+
+ /*
+ * If there won't be enough non-reserved memory for the
+ * next node, this one must extend to the end of the
+ * physical node.
+ */
+ if (limit - end -
+ memblock_x86_hole_size(end, limit) < size)
+ end = limit;
+
+ ret = emu_setup_memblk(ei, pi, nid++ % MAX_NUMNODES,
+ phys_blk,
+ min(end, limit) - start);
+ if (ret < 0)
+ return ret;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Sets up the system RAM area from start_pfn to last_pfn according to the
+ * numa=fake command-line option.
+ */
+void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
+{
+ static struct numa_meminfo ei __initdata;
+ static struct numa_meminfo pi __initdata;
+ const u64 max_addr = max_pfn << PAGE_SHIFT;
+ u8 *phys_dist = NULL;
+ int i, j, ret;
+
+ if (!emu_cmdline)
+ goto no_emu;
+
+ memset(&ei, 0, sizeof(ei));
+ pi = *numa_meminfo;
+
+ for (i = 0; i < MAX_NUMNODES; i++)
+ emu_nid_to_phys[i] = NUMA_NO_NODE;
+
+ /*
+ * If the numa=fake command-line contains a 'M' or 'G', it represents
+ * the fixed node size. Otherwise, if it is just a single number N,
+ * split the system RAM into N fake nodes.
+ */
+ if (strchr(emu_cmdline, 'M') || strchr(emu_cmdline, 'G')) {
+ u64 size;
+
+ size = memparse(emu_cmdline, &emu_cmdline);
+ ret = split_nodes_size_interleave(&ei, &pi, 0, max_addr, size);
+ } else {
+ unsigned long n;
+
+ n = simple_strtoul(emu_cmdline, NULL, 0);
+ ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n);
+ }
+
+ if (ret < 0)
+ goto no_emu;
+
+ if (numa_cleanup_meminfo(&ei) < 0) {
+ pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
+ goto no_emu;
+ }
+
+ /*
+ * Copy the original distance table. It's temporary so no need to
+ * reserve it.
+ */
+ if (numa_dist_cnt) {
+ size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
+ u64 phys;
+
+ phys = memblock_find_in_range(0,
+ (u64)max_pfn_mapped << PAGE_SHIFT,
+ size, PAGE_SIZE);
+ if (phys == MEMBLOCK_ERROR) {
+ pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
+ goto no_emu;
+ }
+ phys_dist = __va(phys);
+
+ for (i = 0; i < numa_dist_cnt; i++)
+ for (j = 0; j < numa_dist_cnt; j++)
+ phys_dist[i * numa_dist_cnt + j] =
+ node_distance(i, j);
+ }
+
+ /* commit */
+ *numa_meminfo = ei;
+
+ /*
+ * Transform __apicid_to_node table to use emulated nids by
+ * reverse-mapping phys_nid. The maps should always exist but fall
+ * back to zero just in case.
+ */
+ for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
+ if (__apicid_to_node[i] == NUMA_NO_NODE)
+ continue;
+ for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
+ if (__apicid_to_node[i] == emu_nid_to_phys[j])
+ break;
+ __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
+ }
+
+ /* make sure all emulated nodes are mapped to a physical node */
+ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+ if (emu_nid_to_phys[i] == NUMA_NO_NODE)
+ emu_nid_to_phys[i] = 0;
+
+ /* transform distance table */
+ numa_reset_distance();
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ for (j = 0; j < MAX_NUMNODES; j++) {
+ int physi = emu_nid_to_phys[i];
+ int physj = emu_nid_to_phys[j];
+ int dist;
+
+ if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
+ dist = physi == physj ?
+ LOCAL_DISTANCE : REMOTE_DISTANCE;
+ else
+ dist = phys_dist[physi * numa_dist_cnt + physj];
+
+ numa_set_distance(i, j, dist);
+ }
+ }
+ return;
+
+no_emu:
+ /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */
+ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+ emu_nid_to_phys[i] = i;
+}
+
+#ifndef CONFIG_DEBUG_PER_CPU_MAPS
+void __cpuinit numa_add_cpu(int cpu)
+{
+ int physnid, nid;
+
+ nid = numa_cpu_node(cpu);
+ if (nid == NUMA_NO_NODE)
+ nid = early_cpu_to_node(cpu);
+ BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));
+
+ physnid = emu_nid_to_phys[nid];
+
+ /*
+ * Map the cpu to each emulated node that is allocated on the physical
+ * node of the cpu's apic id.
+ */
+ for_each_online_node(nid)
+ if (emu_nid_to_phys[nid] == physnid)
+ cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
+}
+
+void __cpuinit numa_remove_cpu(int cpu)
+{
+ int i;
+
+ for_each_online_node(i)
+ cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
+}
+#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
+static void __cpuinit numa_set_cpumask(int cpu, int enable)
+{
+ struct cpumask *mask;
+ int nid, physnid, i;
+
+ nid = early_cpu_to_node(cpu);
+ if (nid == NUMA_NO_NODE) {
+ /* early_cpu_to_node() already emits a warning and trace */
+ return;
+ }
+
+ physnid = emu_nid_to_phys[nid];
+
+ for_each_online_node(i) {
+ if (emu_nid_to_phys[nid] != physnid)
+ continue;
+
+ mask = debug_cpumask_set_cpu(cpu, enable);
+ if (!mask)
+ return;
+
+ if (enable)
+ cpumask_set_cpu(cpu, mask);
+ else
+ cpumask_clear_cpu(cpu, mask);
+ }
+}
+
+void __cpuinit numa_add_cpu(int cpu)
+{
+ numa_set_cpumask(cpu, 1);
+}
+
+void __cpuinit numa_remove_cpu(int cpu)
+{
+ numa_set_cpumask(cpu, 0);
+}
+#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
^ permalink raw reply [flat|nested] 9+ messages in thread
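The distance-table step at the end of numa_emulation() can be sketched in userspace C as follows. This is a minimal illustration, not kernel code: emu_distance(), its parameters, and the flattened table with phys_cnt * phys_cnt entries are assumptions made for the example. LOCAL_DISTANCE and REMOTE_DISTANCE use the usual kernel values of 10 and 20.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_DISTANCE  10
#define REMOTE_DISTANCE 20

/*
 * Hypothetical helper: distance between emulated nodes @i and @j.
 *
 * emu_to_phys[] maps an emulated nid to the physical nid backing it.
 * phys_dist is a flattened phys_cnt x phys_cnt table of physical distances.
 */
static int emu_distance(const int *emu_to_phys, const unsigned char *phys_dist,
                        int phys_cnt, int i, int j)
{
    int physi = emu_to_phys[i];
    int physj = emu_to_phys[j];

    assert(physi >= 0 && physj >= 0);

    /* Physical node not covered by the table: fall back to the defaults. */
    if (physi >= phys_cnt || physj >= phys_cnt)
        return physi == physj ? LOCAL_DISTANCE : REMOTE_DISTANCE;

    /* Otherwise the emulated pair inherits the physical distance. */
    return phys_dist[physi * phys_cnt + physj];
}
```

With emu_to_phys = {0, 1, 0, 1}, emulated nodes 0 and 2 share physical node 0 and so see each other at the local distance, while nodes 0 and 1 inherit whatever the physical table records between physical nodes 0 and 1.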
* [PATCH 3/3 tip:x86/mm] x86-64, NUMA: Add proper function comments to global functions
2011-02-18 13:58 [PATCH 1/3 tip:x86/mm] x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file Tejun Heo
2011-02-18 13:59 ` [PATCH 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c Tejun Heo
@ 2011-02-18 14:00 ` Tejun Heo
1 sibling, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2011-02-18 14:00 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner, Yinghai Lu, H. Peter Anvin
Cc: David Rientjes, linux-kernel, Cyrill Gorcunov
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
arch/x86/mm/numa_64.c | 50 ++++++++++++++++++++++++++++++++++++-------
arch/x86/mm/numa_emulation.c | 29 ++++++++++++++++++++++--
2 files changed, 69 insertions(+), 10 deletions(-)
Index: work/arch/x86/mm/numa_64.c
===================================================================
--- work.orig/arch/x86/mm/numa_64.c
+++ work/arch/x86/mm/numa_64.c
@@ -204,6 +204,14 @@ static int __init numa_add_memblk_to(int
return 0;
}
+/**
+ * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo
+ * @idx: Index of memblk to remove
+ * @mi: numa_meminfo to remove memblk from
+ *
+ * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and
+ * decrementing @mi->nr_blks.
+ */
void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
{
mi->nr_blks--;
@@ -211,6 +219,17 @@ void __init numa_remove_memblk_from(int
(mi->nr_blks - idx) * sizeof(mi->blk[0]));
}
+/**
+ * numa_add_memblk - Add one numa_memblk to numa_meminfo
+ * @nid: NUMA node ID of the new memblk
+ * @start: Start address of the new memblk
+ * @end: End address of the new memblk
+ *
+ * Add a new memblk to the default numa_meminfo.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
int __init numa_add_memblk(int nid, u64 start, u64 end)
{
return numa_add_memblk_to(nid, start, end, &numa_meminfo);
@@ -262,6 +281,16 @@ setup_node_bootmem(int nodeid, unsigned
node_set_online(nodeid);
}
+/**
+ * numa_cleanup_meminfo - Cleanup a numa_meminfo
+ * @mi: numa_meminfo to clean up
+ *
+ * Sanitize @mi by merging and removing unnecessary memblks. Also check for
+ * conflicts and clear unused memblks.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
{
const u64 low = 0;
@@ -352,9 +381,11 @@ static void __init numa_nodemask_from_me
node_set(mi->blk[i].nid, *nodemask);
}
-/*
- * Reset distance table. The current table is freed. The next
- * numa_set_distance() call will create a new one.
+/**
+ * numa_reset_distance - Reset NUMA distance table
+ *
+ * The current table is freed. The next numa_set_distance() call will
+ * create a new one.
*/
void __init numa_reset_distance(void)
{
@@ -369,10 +400,15 @@ void __init numa_reset_distance(void)
numa_distance = NULL;
}
-/*
- * Set the distance between node @from to @to to @distance. If distance
- * table doesn't exist, one which is large enough to accomodate all the
- * currently known nodes will be created.
+/**
+ * numa_set_distance - Set NUMA distance from one NUMA node to another
+ * @from: the 'from' node to set distance
+ * @to: the 'to' node to set distance
+ * @distance: NUMA distance
+ *
+ * Set the distance from node @from to @to to @distance. If distance table
+ * doesn't exist, one which is large enough to accommodate all the currently
+ * known nodes will be created.
*/
void __init numa_set_distance(int from, int to, int distance)
{
Index: work/arch/x86/mm/numa_emulation.c
===================================================================
--- work.orig/arch/x86/mm/numa_emulation.c
+++ work/arch/x86/mm/numa_emulation.c
@@ -266,9 +266,32 @@ static int __init split_nodes_size_inter
return 0;
}
-/*
- * Sets up the system RAM area from start_pfn to last_pfn according to the
- * numa=fake command-line option.
+/**
+ * numa_emulation - Emulate NUMA nodes
+ * @numa_meminfo: NUMA configuration to massage
+ * @numa_dist_cnt: The size of the physical NUMA distance table
+ *
+ * Emulate NUMA nodes according to the numa=fake kernel parameter.
+ * @numa_meminfo contains the physical memory configuration and is modified
+ * to reflect the emulated configuration on success. @numa_dist_cnt is
+ * used to determine the size of the physical distance table.
+ *
+ * On success, the following modifications are made.
+ *
+ * - @numa_meminfo is updated to reflect the emulated nodes.
+ *
+ * - __apicid_to_node[] is updated such that APIC IDs are mapped to the
+ * emulated nodes.
+ *
+ * - NUMA distance table is rebuilt to represent distances between emulated
+ * nodes. The distances are determined considering how emulated nodes
+ * are mapped to physical nodes and match the actual distances.
+ *
+ * - emu_nid_to_phys[] reflects how emulated nodes are mapped to physical
+ * nodes. This is used by numa_add_cpu() and numa_remove_cpu().
+ *
+ * If emulation is not enabled or fails, emu_nid_to_phys[] is filled with
+ * identity mapping and no other modification is made.
*/
void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
{
^ permalink raw reply [flat|nested] 9+ messages in thread
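The kernel-doc for numa_emulation() above mentions two uses of emu_nid_to_phys[]: the identity mapping installed when emulation is off, and the reverse lookup that rewrites __apicid_to_node[]. A minimal userspace sketch of both, assuming hypothetical helper names (fill_identity(), phys_to_first_emu()) and NO_NODE as a stand-in for NUMA_NO_NODE:

```c
#include <assert.h>

#define NO_NODE (-1)    /* stand-in for NUMA_NO_NODE */

/* Identity mapping: emulated nid N is backed by physical nid N. */
static void fill_identity(int *emu_to_phys, int nr)
{
    int i;

    for (i = 0; i < nr; i++)
        emu_to_phys[i] = i;
}

/*
 * Reverse lookup: return the first emulated nid backed by @phys_nid.
 * Falls back to 0 when no emulated node maps to it, mirroring the
 * "fall back to zero just in case" behaviour described in the patch.
 */
static int phys_to_first_emu(const int *emu_to_phys, int nr, int phys_nid)
{
    int j;

    assert(phys_nid != NO_NODE);

    for (j = 0; j < nr; j++)
        if (emu_to_phys[j] == phys_nid)
            return j;
    return 0;
}
```

Because the lookup returns the first match, a CPU whose physical node backs several emulated nodes is assigned the lowest-numbered one here; numa_add_cpu() in the patch then adds the CPU to the cpumask of every emulated node on that physical node.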
* Re: [PATCH 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c
2011-02-18 13:59 ` [PATCH 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c Tejun Heo
@ 2011-02-18 17:58 ` Yinghai Lu
2011-02-20 21:58 ` David Rientjes
2011-02-21 8:26 ` [PATCH UPDATED " Tejun Heo
1 sibling, 1 reply; 9+ messages in thread
From: Yinghai Lu @ 2011-02-18 17:58 UTC (permalink / raw)
To: Tejun Heo, David Rientjes
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
Cyrill Gorcunov
On 02/18/2011 05:59 AM, Tejun Heo wrote:
> Create numa_emulation.c and move all NUMA emulation code there. The
> definitions of struct numa_memblk and numa_meminfo are moved to
> numa_64.h. Also, numa_remove_memblk_from(), numa_cleanup_meminfo(),
> numa_reset_distance() along with numa_emulation() are made global.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> ---
> arch/x86/include/asm/numa_64.h | 29 ++
> arch/x86/mm/Makefile | 1
> arch/x86/mm/numa_64.c | 479 -----------------------------------------
> arch/x86/mm/numa_emulation.c | 451 ++++++++++++++++++++++++++++++++++++++
> 4 files changed, 482 insertions(+), 478 deletions(-)
>
> Index: work/arch/x86/include/asm/numa_64.h
> ===================================================================
> --- work.orig/arch/x86/include/asm/numa_64.h
> +++ work/arch/x86/include/asm/numa_64.h
> @@ -18,6 +18,21 @@ extern void setup_node_bootmem(int nodei
>
> #ifdef CONFIG_NUMA
> /*
> + * Data structures to describe memory configuration during NUMA
> + * initialization. Use only in NUMA init and emulation paths.
> + */
Could those internal struct definitions and function declarations go in
arch/x86/mm/numa_internal.h?
David, can you please check tip/x86/mm ?
Thanks
Yinghai
> +struct numa_memblk {
> + u64 start;
> + u64 end;
> + int nid;
> +};
> +
> +struct numa_meminfo {
> + int nr_blks;
> + struct numa_memblk blk[NR_NODE_MEMBLKS];
> +};
> +
> +/*
> * Too small node sizes may confuse the VM badly. Usually they
> * result from BIOS bugs. So dont recognize nodes as standalone
> * NUMA entities that have less than this amount of RAM listed:
> @@ -28,15 +43,25 @@ extern nodemask_t numa_nodes_parsed __in
>
> extern int __cpuinit numa_cpu_node(int cpu);
> extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
> +extern void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
> +extern int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
> +extern void __init numa_reset_distance(void);
> extern void __init numa_set_distance(int from, int to, int distance);
>
> #ifdef CONFIG_NUMA_EMU
> #define FAKE_NODE_MIN_SIZE ((u64)32 << 20)
> #define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL))
> void numa_emu_cmdline(char *);
> +void __init numa_emulation(struct numa_meminfo *numa_meminfo,
> + int numa_dist_cnt);
> +#else /* CONFIG_NUMA_EMU */
> +static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
> + int numa_dist_cnt)
> +{ }
> #endif /* CONFIG_NUMA_EMU */
> -#else
> +
> +#else /* CONFIG_NUMA */
> static inline int numa_cpu_node(int cpu) { return NUMA_NO_NODE; }
> -#endif
> +#endif /* CONFIG_NUMA */
>
> #endif /* _ASM_X86_NUMA_64_H */
^ permalink raw reply [flat|nested] 9+ messages in thread
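The numa_remove_memblk_from() declaration quoted above operates on the blk[] array of struct numa_meminfo by shifting its tail down one slot. A minimal userspace sketch, assuming a hypothetical capacity NR_BLKS_MAX in place of NR_NODE_MEMBLKS and unsigned long long in place of u64:

```c
#include <assert.h>
#include <string.h>

#define NR_BLKS_MAX 8   /* hypothetical capacity, stands in for NR_NODE_MEMBLKS */

struct memblk {
    unsigned long long start;
    unsigned long long end;
    int nid;
};

struct meminfo {
    int nr_blks;
    struct memblk blk[NR_BLKS_MAX];
};

/* Remove the @idx'th block from @mi, keeping the remaining blocks in order. */
static void remove_memblk_from(int idx, struct meminfo *mi)
{
    assert(idx >= 0 && idx < mi->nr_blks);

    mi->nr_blks--;
    /* nr_blks is already decremented, so this counts the blocks after @idx. */
    memmove(&mi->blk[idx], &mi->blk[idx + 1],
            (mi->nr_blks - idx) * sizeof(mi->blk[0]));
}
```

Removing the last block moves zero bytes, so the call reduces to the decrement of nr_blks.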
* Re: [PATCH 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c
2011-02-18 17:58 ` Yinghai Lu
@ 2011-02-20 21:58 ` David Rientjes
2011-02-24 19:38 ` Yinghai Lu
0 siblings, 1 reply; 9+ messages in thread
From: David Rientjes @ 2011-02-20 21:58 UTC (permalink / raw)
To: Yinghai Lu
Cc: Tejun Heo, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
linux-kernel, Cyrill Gorcunov
On Fri, 18 Feb 2011, Yinghai Lu wrote:
> David, can you please check tip/x86/mm ?
>
Yeah, I'm in the process of doing so, without this set applied yet. I'm getting
boot failures for numa=fake=128M right now, so I'll need to investigate
and diagnose that issue before adding more patches.
^ permalink raw reply [flat|nested] 9+ messages in thread
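numa=fake=128M is the size-based form of the option: the numa_emulation() code in this thread treats an argument containing 'M' or 'G' as a fixed per-node size and anything else as a node count. A minimal userspace sketch of that decision, assuming a hypothetical parse_fake() helper and using strtoull() with manual suffix handling as a simplified stand-in for the kernel's memparse() and simple_strtoul():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum fake_mode { FAKE_BY_SIZE, FAKE_BY_COUNT };

struct fake_request {
    enum fake_mode mode;
    unsigned long long value;   /* bytes for FAKE_BY_SIZE, node count otherwise */
};

/* Hypothetical helper: classify and parse the string following "numa=fake=". */
static struct fake_request parse_fake(const char *arg)
{
    struct fake_request req;
    char *end;

    assert(arg != NULL);

    if (strchr(arg, 'M') || strchr(arg, 'G')) {
        /* Size form, e.g. "128M": every fake node gets this much memory. */
        req.mode = FAKE_BY_SIZE;
        req.value = strtoull(arg, &end, 0);
        if (*end == 'G')
            req.value <<= 30;
        else if (*end == 'M')
            req.value <<= 20;
    } else {
        /* Count form, e.g. "4": split system RAM into this many nodes. */
        req.mode = FAKE_BY_COUNT;
        req.value = strtoull(arg, NULL, 0);
    }
    return req;
}
```

The sketch only recognises the uppercase 'M' and 'G' suffixes, matching the strchr() test in the patch; other suffixes are left out for brevity.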
* [PATCH UPDATED 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c
2011-02-18 13:59 ` [PATCH 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c Tejun Heo
2011-02-18 17:58 ` Yinghai Lu
@ 2011-02-21 8:26 ` Tejun Heo
2011-02-21 17:10 ` Yinghai Lu
1 sibling, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2011-02-21 8:26 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner, Yinghai Lu, H. Peter Anvin
Cc: David Rientjes, linux-kernel, Cyrill Gorcunov
Create numa_emulation.c and move all NUMA emulation code there. The
definitions of struct numa_memblk and numa_meminfo are moved to
numa_internal.h. Also, numa_remove_memblk_from(), numa_cleanup_meminfo(),
numa_reset_distance() along with numa_emulation() are made global.
- v2: Internal declarations moved to numa_internal.h as suggested by
Yinghai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
Yinghai, does it look okay to you?
Thanks.
arch/x86/mm/Makefile | 1
arch/x86/mm/numa_64.c | 480 -------------------------------------------
arch/x86/mm/numa_emulation.c | 452 ++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/numa_internal.h | 31 ++
4 files changed, 488 insertions(+), 476 deletions(-)
Index: work/arch/x86/mm/Makefile
===================================================================
--- work.orig/arch/x86/mm/Makefile
+++ work/arch/x86/mm/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmio
obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o
obj-$(CONFIG_AMD_NUMA) += amdtopology_64.o
obj-$(CONFIG_ACPI_NUMA) += srat_$(BITS).o
+obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
Index: work/arch/x86/mm/numa_64.c
===================================================================
--- work.orig/arch/x86/mm/numa_64.c
+++ work/arch/x86/mm/numa_64.c
@@ -18,20 +18,10 @@
#include <asm/e820.h>
#include <asm/proto.h>
#include <asm/dma.h>
-#include <asm/numa.h>
#include <asm/acpi.h>
#include <asm/amd_nb.h>
-struct numa_memblk {
- u64 start;
- u64 end;
- int nid;
-};
-
-struct numa_meminfo {
- int nr_blks;
- struct numa_memblk blk[NR_NODE_MEMBLKS];
-};
+#include "numa_internal.h"
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);
@@ -215,7 +205,7 @@ static int __init numa_add_memblk_to(int
return 0;
}
-static void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
+void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
{
mi->nr_blks--;
memmove(&mi->blk[idx], &mi->blk[idx + 1],
@@ -273,7 +263,7 @@ setup_node_bootmem(int nodeid, unsigned
node_set_online(nodeid);
}
-static int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
{
const u64 low = 0;
const u64 high = (u64)max_pfn << PAGE_SHIFT;
@@ -367,7 +357,7 @@ static void __init numa_nodemask_from_me
* Reset distance table. The current table is freed. The next
* numa_set_distance() call will create a new one.
*/
-static void __init numa_reset_distance(void)
+void __init numa_reset_distance(void)
{
size_t size;
@@ -533,388 +523,6 @@ static int __init numa_register_memblks(
return 0;
}
-#ifdef CONFIG_NUMA_EMU
-/* Numa emulation */
-static int emu_nid_to_phys[MAX_NUMNODES] __cpuinitdata;
-static char *emu_cmdline __initdata;
-
-void __init numa_emu_cmdline(char *str)
-{
- emu_cmdline = str;
-}
-
-static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
-{
- int i;
-
- for (i = 0; i < mi->nr_blks; i++)
- if (mi->blk[i].nid == nid)
- return i;
- return -ENOENT;
-}
-
-/*
- * Sets up nid to range from @start to @end. The return value is -errno if
- * something went wrong, 0 otherwise.
- */
-static int __init emu_setup_memblk(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- int nid, int phys_blk, u64 size)
-{
- struct numa_memblk *eb = &ei->blk[ei->nr_blks];
- struct numa_memblk *pb = &pi->blk[phys_blk];
-
- if (ei->nr_blks >= NR_NODE_MEMBLKS) {
- pr_err("NUMA: Too many emulated memblks, failing emulation\n");
- return -EINVAL;
- }
-
- ei->nr_blks++;
- eb->start = pb->start;
- eb->end = pb->start + size;
- eb->nid = nid;
-
- if (emu_nid_to_phys[nid] == NUMA_NO_NODE)
- emu_nid_to_phys[nid] = pb->nid;
-
- pb->start += size;
- if (pb->start >= pb->end) {
- WARN_ON_ONCE(pb->start > pb->end);
- numa_remove_memblk_from(phys_blk, pi);
- }
-
- printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n", nid,
- eb->start, eb->end, (eb->end - eb->start) >> 20);
- return 0;
-}
-
-/*
- * Sets up nr_nodes fake nodes interleaved over physical nodes ranging from addr
- * to max_addr. The return value is the number of nodes allocated.
- */
-static int __init split_nodes_interleave(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- u64 addr, u64 max_addr, int nr_nodes)
-{
- nodemask_t physnode_mask = NODE_MASK_NONE;
- u64 size;
- int big;
- int nid = 0;
- int i, ret;
-
- if (nr_nodes <= 0)
- return -1;
- if (nr_nodes > MAX_NUMNODES) {
- pr_info("numa=fake=%d too large, reducing to %d\n",
- nr_nodes, MAX_NUMNODES);
- nr_nodes = MAX_NUMNODES;
- }
-
- size = (max_addr - addr - memblock_x86_hole_size(addr, max_addr)) / nr_nodes;
- /*
- * Calculate the number of big nodes that can be allocated as a result
- * of consolidating the remainder.
- */
- big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * nr_nodes) /
- FAKE_NODE_MIN_SIZE;
-
- size &= FAKE_NODE_MIN_HASH_MASK;
- if (!size) {
- pr_err("Not enough memory for each node. "
- "NUMA emulation disabled.\n");
- return -1;
- }
-
- for (i = 0; i < pi->nr_blks; i++)
- node_set(pi->blk[i].nid, physnode_mask);
-
- /*
- * Continue to fill physical nodes with fake nodes until there is no
- * memory left on any of them.
- */
- while (nodes_weight(physnode_mask)) {
- for_each_node_mask(i, physnode_mask) {
- u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
- u64 start, limit, end;
- int phys_blk;
-
- phys_blk = emu_find_memblk_by_nid(i, pi);
- if (phys_blk < 0) {
- node_clear(i, physnode_mask);
- continue;
- }
- start = pi->blk[phys_blk].start;
- limit = pi->blk[phys_blk].end;
- end = start + size;
-
- if (nid < big)
- end += FAKE_NODE_MIN_SIZE;
-
- /*
- * Continue to add memory to this fake node if its
- * non-reserved memory is less than the per-node size.
- */
- while (end - start -
- memblock_x86_hole_size(start, end) < size) {
- end += FAKE_NODE_MIN_SIZE;
- if (end > limit) {
- end = limit;
- break;
- }
- }
-
- /*
- * If there won't be at least FAKE_NODE_MIN_SIZE of
- * non-reserved memory in ZONE_DMA32 for the next node,
- * this one must extend to the boundary.
- */
- if (end < dma32_end && dma32_end - end -
- memblock_x86_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
- end = dma32_end;
-
- /*
- * If there won't be enough non-reserved memory for the
- * next node, this one must extend to the end of the
- * physical node.
- */
- if (limit - end -
- memblock_x86_hole_size(end, limit) < size)
- end = limit;
-
- ret = emu_setup_memblk(ei, pi, nid++ % nr_nodes,
- phys_blk,
- min(end, limit) - start);
- if (ret < 0)
- return ret;
- }
- }
- return 0;
-}
-
-/*
- * Returns the end address of a node so that there is at least `size' amount of
- * non-reserved memory or `max_addr' is reached.
- */
-static u64 __init find_end_of_node(u64 start, u64 max_addr, u64 size)
-{
- u64 end = start + size;
-
- while (end - start - memblock_x86_hole_size(start, end) < size) {
- end += FAKE_NODE_MIN_SIZE;
- if (end > max_addr) {
- end = max_addr;
- break;
- }
- }
- return end;
-}
-
-/*
- * Sets up fake nodes of `size' interleaved over physical nodes ranging from
- * `addr' to `max_addr'. The return value is the number of nodes allocated.
- */
-static int __init split_nodes_size_interleave(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- u64 addr, u64 max_addr, u64 size)
-{
- nodemask_t physnode_mask = NODE_MASK_NONE;
- u64 min_size;
- int nid = 0;
- int i, ret;
-
- if (!size)
- return -1;
- /*
- * The limit on emulated nodes is MAX_NUMNODES, so the size per node is
- * increased accordingly if the requested size is too small. This
- * creates a uniform distribution of node sizes across the entire
- * machine (but not necessarily over physical nodes).
- */
- min_size = (max_addr - addr - memblock_x86_hole_size(addr, max_addr)) /
- MAX_NUMNODES;
- min_size = max(min_size, FAKE_NODE_MIN_SIZE);
- if ((min_size & FAKE_NODE_MIN_HASH_MASK) < min_size)
- min_size = (min_size + FAKE_NODE_MIN_SIZE) &
- FAKE_NODE_MIN_HASH_MASK;
- if (size < min_size) {
- pr_err("Fake node size %LuMB too small, increasing to %LuMB\n",
- size >> 20, min_size >> 20);
- size = min_size;
- }
- size &= FAKE_NODE_MIN_HASH_MASK;
-
- for (i = 0; i < pi->nr_blks; i++)
- node_set(pi->blk[i].nid, physnode_mask);
-
- /*
- * Fill physical nodes with fake nodes of size until there is no memory
- * left on any of them.
- */
- while (nodes_weight(physnode_mask)) {
- for_each_node_mask(i, physnode_mask) {
- u64 dma32_end = MAX_DMA32_PFN << PAGE_SHIFT;
- u64 start, limit, end;
- int phys_blk;
-
- phys_blk = emu_find_memblk_by_nid(i, pi);
- if (phys_blk < 0) {
- node_clear(i, physnode_mask);
- continue;
- }
- start = pi->blk[phys_blk].start;
- limit = pi->blk[phys_blk].end;
-
- end = find_end_of_node(start, limit, size);
- /*
- * If there won't be at least FAKE_NODE_MIN_SIZE of
- * non-reserved memory in ZONE_DMA32 for the next node,
- * this one must extend to the boundary.
- */
- if (end < dma32_end && dma32_end - end -
- memblock_x86_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
- end = dma32_end;
-
- /*
- * If there won't be enough non-reserved memory for the
- * next node, this one must extend to the end of the
- * physical node.
- */
- if (limit - end -
- memblock_x86_hole_size(end, limit) < size)
- end = limit;
-
- ret = emu_setup_memblk(ei, pi, nid++ % MAX_NUMNODES,
- phys_blk,
- min(end, limit) - start);
- if (ret < 0)
- return ret;
- }
- }
- return 0;
-}
-
-/*
- * Sets up the system RAM area from start_pfn to last_pfn according to the
- * numa=fake command-line option.
- */
-static void __init numa_emulation(struct numa_meminfo *numa_meminfo,
- int numa_dist_cnt)
-{
- static struct numa_meminfo ei __initdata;
- static struct numa_meminfo pi __initdata;
- const u64 max_addr = max_pfn << PAGE_SHIFT;
- u8 *phys_dist = NULL;
- int i, j, ret;
-
- if (!emu_cmdline)
- goto no_emu;
-
- memset(&ei, 0, sizeof(ei));
- pi = *numa_meminfo;
-
- for (i = 0; i < MAX_NUMNODES; i++)
- emu_nid_to_phys[i] = NUMA_NO_NODE;
-
- /*
- * If the numa=fake command-line contains a 'M' or 'G', it represents
- * the fixed node size. Otherwise, if it is just a single number N,
- * split the system RAM into N fake nodes.
- */
- if (strchr(emu_cmdline, 'M') || strchr(emu_cmdline, 'G')) {
- u64 size;
-
- size = memparse(emu_cmdline, &emu_cmdline);
- ret = split_nodes_size_interleave(&ei, &pi, 0, max_addr, size);
- } else {
- unsigned long n;
-
- n = simple_strtoul(emu_cmdline, NULL, 0);
- ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n);
- }
-
- if (ret < 0)
- goto no_emu;
-
- if (numa_cleanup_meminfo(&ei) < 0) {
- pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
- goto no_emu;
- }
-
- /*
- * Copy the original distance table. It's temporary so no need to
- * reserve it.
- */
- if (numa_dist_cnt) {
- size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
- u64 phys;
-
- phys = memblock_find_in_range(0,
- (u64)max_pfn_mapped << PAGE_SHIFT,
- size, PAGE_SIZE);
- if (phys == MEMBLOCK_ERROR) {
- pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
- goto no_emu;
- }
- phys_dist = __va(phys);
-
- for (i = 0; i < numa_dist_cnt; i++)
- for (j = 0; j < numa_dist_cnt; j++)
- phys_dist[i * numa_dist_cnt + j] =
- node_distance(i, j);
- }
-
- /* commit */
- *numa_meminfo = ei;
-
- /*
- * Transform __apicid_to_node table to use emulated nids by
- * reverse-mapping phys_nid. The maps should always exist but fall
- * back to zero just in case.
- */
- for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
- if (__apicid_to_node[i] == NUMA_NO_NODE)
- continue;
- for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
- if (__apicid_to_node[i] == emu_nid_to_phys[j])
- break;
- __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
- }
-
- /* make sure all emulated nodes are mapped to a physical node */
- for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
- if (emu_nid_to_phys[i] == NUMA_NO_NODE)
- emu_nid_to_phys[i] = 0;
-
- /* transform distance table */
- numa_reset_distance();
- for (i = 0; i < MAX_NUMNODES; i++) {
- for (j = 0; j < MAX_NUMNODES; j++) {
- int physi = emu_nid_to_phys[i];
- int physj = emu_nid_to_phys[j];
- int dist;
-
- if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
- dist = physi == physj ?
- LOCAL_DISTANCE : REMOTE_DISTANCE;
- else
- dist = phys_dist[physi * numa_dist_cnt + physj];
-
- numa_set_distance(i, j, dist);
- }
- }
- return;
-
-no_emu:
- /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */
- for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
- emu_nid_to_phys[i] = i;
-}
-#else /* CONFIG_NUMA_EMU */
-static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
- int numa_dist_cnt)
-{ }
-#endif /* CONFIG_NUMA_EMU */
-
static int __init dummy_numa_init(void)
{
printk(KERN_INFO "%s\n",
@@ -1002,83 +610,3 @@ int __cpuinit numa_cpu_node(int cpu)
return __apicid_to_node[apicid];
return NUMA_NO_NODE;
}
-
-/*
- * UGLINESS AHEAD: Currently, CONFIG_NUMA_EMU is 64bit only and makes use
- * of 64bit specific data structures. The distinction is artificial and
- * should be removed. numa_{add|remove}_cpu() are implemented in numa.c
- * for both 32 and 64bit when CONFIG_NUMA_EMU is disabled but here when
- * enabled.
- *
- * NUMA emulation is planned to be made generic and the following and other
- * related code should be moved to numa.c.
- */
-#ifdef CONFIG_NUMA_EMU
-# ifndef CONFIG_DEBUG_PER_CPU_MAPS
-void __cpuinit numa_add_cpu(int cpu)
-{
- int physnid, nid;
-
- nid = numa_cpu_node(cpu);
- if (nid == NUMA_NO_NODE)
- nid = early_cpu_to_node(cpu);
- BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));
-
- physnid = emu_nid_to_phys[nid];
-
- /*
- * Map the cpu to each emulated node that is allocated on the physical
- * node of the cpu's apic id.
- */
- for_each_online_node(nid)
- if (emu_nid_to_phys[nid] == physnid)
- cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
-}
-
-void __cpuinit numa_remove_cpu(int cpu)
-{
- int i;
-
- for_each_online_node(i)
- cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
-}
-# else /* !CONFIG_DEBUG_PER_CPU_MAPS */
-static void __cpuinit numa_set_cpumask(int cpu, int enable)
-{
- struct cpumask *mask;
- int nid, physnid, i;
-
- nid = early_cpu_to_node(cpu);
- if (nid == NUMA_NO_NODE) {
- /* early_cpu_to_node() already emits a warning and trace */
- return;
- }
-
- physnid = emu_nid_to_phys[nid];
-
- for_each_online_node(i) {
- if (emu_nid_to_phys[nid] != physnid)
- continue;
-
- mask = debug_cpumask_set_cpu(cpu, enable);
- if (!mask)
- return;
-
- if (enable)
- cpumask_set_cpu(cpu, mask);
- else
- cpumask_clear_cpu(cpu, mask);
- }
-}
-
-void __cpuinit numa_add_cpu(int cpu)
-{
- numa_set_cpumask(cpu, 1);
-}
-
-void __cpuinit numa_remove_cpu(int cpu)
-{
- numa_set_cpumask(cpu, 0);
-}
-# endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
-#endif /* CONFIG_NUMA_EMU */
Index: work/arch/x86/mm/numa_emulation.c
===================================================================
--- /dev/null
+++ work/arch/x86/mm/numa_emulation.c
@@ -0,0 +1,452 @@
+/*
+ * NUMA emulation
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/topology.h>
+#include <linux/memblock.h>
+#include <asm/dma.h>
+
+#include "numa_internal.h"
+
+static int emu_nid_to_phys[MAX_NUMNODES] __cpuinitdata;
+static char *emu_cmdline __initdata;
+
+void __init numa_emu_cmdline(char *str)
+{
+ emu_cmdline = str;
+}
+
+static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
+{
+ int i;
+
+ for (i = 0; i < mi->nr_blks; i++)
+ if (mi->blk[i].nid == nid)
+ return i;
+ return -ENOENT;
+}
+
+/*
+ * Sets up nid to range from @start to @end. The return value is -errno if
+ * something went wrong, 0 otherwise.
+ */
+static int __init emu_setup_memblk(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ int nid, int phys_blk, u64 size)
+{
+ struct numa_memblk *eb = &ei->blk[ei->nr_blks];
+ struct numa_memblk *pb = &pi->blk[phys_blk];
+
+ if (ei->nr_blks >= NR_NODE_MEMBLKS) {
+ pr_err("NUMA: Too many emulated memblks, failing emulation\n");
+ return -EINVAL;
+ }
+
+ ei->nr_blks++;
+ eb->start = pb->start;
+ eb->end = pb->start + size;
+ eb->nid = nid;
+
+ if (emu_nid_to_phys[nid] == NUMA_NO_NODE)
+ emu_nid_to_phys[nid] = pb->nid;
+
+ pb->start += size;
+ if (pb->start >= pb->end) {
+ WARN_ON_ONCE(pb->start > pb->end);
+ numa_remove_memblk_from(phys_blk, pi);
+ }
+
+ printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n", nid,
+ eb->start, eb->end, (eb->end - eb->start) >> 20);
+ return 0;
+}
+
+/*
+ * Sets up nr_nodes fake nodes interleaved over physical nodes ranging from addr
+ * to max_addr. The return value is the number of nodes allocated.
+ */
+static int __init split_nodes_interleave(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ u64 addr, u64 max_addr, int nr_nodes)
+{
+ nodemask_t physnode_mask = NODE_MASK_NONE;
+ u64 size;
+ int big;
+ int nid = 0;
+ int i, ret;
+
+ if (nr_nodes <= 0)
+ return -1;
+ if (nr_nodes > MAX_NUMNODES) {
+ pr_info("numa=fake=%d too large, reducing to %d\n",
+ nr_nodes, MAX_NUMNODES);
+ nr_nodes = MAX_NUMNODES;
+ }
+
+ size = (max_addr - addr - memblock_x86_hole_size(addr, max_addr)) / nr_nodes;
+ /*
+ * Calculate the number of big nodes that can be allocated as a result
+ * of consolidating the remainder.
+ */
+ big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * nr_nodes) /
+ FAKE_NODE_MIN_SIZE;
+
+ size &= FAKE_NODE_MIN_HASH_MASK;
+ if (!size) {
+ pr_err("Not enough memory for each node. "
+ "NUMA emulation disabled.\n");
+ return -1;
+ }
+
+ for (i = 0; i < pi->nr_blks; i++)
+ node_set(pi->blk[i].nid, physnode_mask);
+
+ /*
+ * Continue to fill physical nodes with fake nodes until there is no
+ * memory left on any of them.
+ */
+ while (nodes_weight(physnode_mask)) {
+ for_each_node_mask(i, physnode_mask) {
+ u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
+ u64 start, limit, end;
+ int phys_blk;
+
+ phys_blk = emu_find_memblk_by_nid(i, pi);
+ if (phys_blk < 0) {
+ node_clear(i, physnode_mask);
+ continue;
+ }
+ start = pi->blk[phys_blk].start;
+ limit = pi->blk[phys_blk].end;
+ end = start + size;
+
+ if (nid < big)
+ end += FAKE_NODE_MIN_SIZE;
+
+ /*
+ * Continue to add memory to this fake node if its
+ * non-reserved memory is less than the per-node size.
+ */
+ while (end - start -
+ memblock_x86_hole_size(start, end) < size) {
+ end += FAKE_NODE_MIN_SIZE;
+ if (end > limit) {
+ end = limit;
+ break;
+ }
+ }
+
+ /*
+ * If there won't be at least FAKE_NODE_MIN_SIZE of
+ * non-reserved memory in ZONE_DMA32 for the next node,
+ * this one must extend to the boundary.
+ */
+ if (end < dma32_end && dma32_end - end -
+ memblock_x86_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
+ end = dma32_end;
+
+ /*
+ * If there won't be enough non-reserved memory for the
+ * next node, this one must extend to the end of the
+ * physical node.
+ */
+ if (limit - end -
+ memblock_x86_hole_size(end, limit) < size)
+ end = limit;
+
+ ret = emu_setup_memblk(ei, pi, nid++ % nr_nodes,
+ phys_blk,
+ min(end, limit) - start);
+ if (ret < 0)
+ return ret;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Returns the end address of a node so that there is at least `size' amount of
+ * non-reserved memory or `max_addr' is reached.
+ */
+static u64 __init find_end_of_node(u64 start, u64 max_addr, u64 size)
+{
+ u64 end = start + size;
+
+ while (end - start - memblock_x86_hole_size(start, end) < size) {
+ end += FAKE_NODE_MIN_SIZE;
+ if (end > max_addr) {
+ end = max_addr;
+ break;
+ }
+ }
+ return end;
+}
+
+/*
+ * Sets up fake nodes of `size' interleaved over physical nodes ranging from
+ * `addr' to `max_addr'. The return value is the number of nodes allocated.
+ */
+static int __init split_nodes_size_interleave(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ u64 addr, u64 max_addr, u64 size)
+{
+ nodemask_t physnode_mask = NODE_MASK_NONE;
+ u64 min_size;
+ int nid = 0;
+ int i, ret;
+
+ if (!size)
+ return -1;
+ /*
+ * The limit on emulated nodes is MAX_NUMNODES, so the size per node is
+ * increased accordingly if the requested size is too small. This
+ * creates a uniform distribution of node sizes across the entire
+ * machine (but not necessarily over physical nodes).
+ */
+ min_size = (max_addr - addr - memblock_x86_hole_size(addr, max_addr)) /
+ MAX_NUMNODES;
+ min_size = max(min_size, FAKE_NODE_MIN_SIZE);
+ if ((min_size & FAKE_NODE_MIN_HASH_MASK) < min_size)
+ min_size = (min_size + FAKE_NODE_MIN_SIZE) &
+ FAKE_NODE_MIN_HASH_MASK;
+ if (size < min_size) {
+ pr_err("Fake node size %LuMB too small, increasing to %LuMB\n",
+ size >> 20, min_size >> 20);
+ size = min_size;
+ }
+ size &= FAKE_NODE_MIN_HASH_MASK;
+
+ for (i = 0; i < pi->nr_blks; i++)
+ node_set(pi->blk[i].nid, physnode_mask);
+
+ /*
+ * Fill physical nodes with fake nodes of size until there is no memory
+ * left on any of them.
+ */
+ while (nodes_weight(physnode_mask)) {
+ for_each_node_mask(i, physnode_mask) {
+ u64 dma32_end = MAX_DMA32_PFN << PAGE_SHIFT;
+ u64 start, limit, end;
+ int phys_blk;
+
+ phys_blk = emu_find_memblk_by_nid(i, pi);
+ if (phys_blk < 0) {
+ node_clear(i, physnode_mask);
+ continue;
+ }
+ start = pi->blk[phys_blk].start;
+ limit = pi->blk[phys_blk].end;
+
+ end = find_end_of_node(start, limit, size);
+ /*
+ * If there won't be at least FAKE_NODE_MIN_SIZE of
+ * non-reserved memory in ZONE_DMA32 for the next node,
+ * this one must extend to the boundary.
+ */
+ if (end < dma32_end && dma32_end - end -
+ memblock_x86_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
+ end = dma32_end;
+
+ /*
+ * If there won't be enough non-reserved memory for the
+ * next node, this one must extend to the end of the
+ * physical node.
+ */
+ if (limit - end -
+ memblock_x86_hole_size(end, limit) < size)
+ end = limit;
+
+ ret = emu_setup_memblk(ei, pi, nid++ % MAX_NUMNODES,
+ phys_blk,
+ min(end, limit) - start);
+ if (ret < 0)
+ return ret;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Sets up the system RAM area from start_pfn to last_pfn according to the
+ * numa=fake command-line option.
+ */
+void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
+{
+ static struct numa_meminfo ei __initdata;
+ static struct numa_meminfo pi __initdata;
+ const u64 max_addr = max_pfn << PAGE_SHIFT;
+ u8 *phys_dist = NULL;
+ int i, j, ret;
+
+ if (!emu_cmdline)
+ goto no_emu;
+
+ memset(&ei, 0, sizeof(ei));
+ pi = *numa_meminfo;
+
+ for (i = 0; i < MAX_NUMNODES; i++)
+ emu_nid_to_phys[i] = NUMA_NO_NODE;
+
+ /*
+ * If the numa=fake command-line contains an 'M' or 'G', it represents
+ * the fixed node size. Otherwise, if it is just a single number N,
+ * split the system RAM into N fake nodes.
+ */
+ if (strchr(emu_cmdline, 'M') || strchr(emu_cmdline, 'G')) {
+ u64 size;
+
+ size = memparse(emu_cmdline, &emu_cmdline);
+ ret = split_nodes_size_interleave(&ei, &pi, 0, max_addr, size);
+ } else {
+ unsigned long n;
+
+ n = simple_strtoul(emu_cmdline, NULL, 0);
+ ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n);
+ }
+
+ if (ret < 0)
+ goto no_emu;
+
+ if (numa_cleanup_meminfo(&ei) < 0) {
+ pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
+ goto no_emu;
+ }
+
+ /*
+ * Copy the original distance table. It's temporary so no need to
+ * reserve it.
+ */
+ if (numa_dist_cnt) {
+ size_t size = numa_dist_cnt * sizeof(phys_dist[0]);
+ u64 phys;
+
+ phys = memblock_find_in_range(0,
+ (u64)max_pfn_mapped << PAGE_SHIFT,
+ size, PAGE_SIZE);
+ if (phys == MEMBLOCK_ERROR) {
+ pr_warning("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
+ goto no_emu;
+ }
+ phys_dist = __va(phys);
+
+ for (i = 0; i < numa_dist_cnt; i++)
+ for (j = 0; j < numa_dist_cnt; j++)
+ phys_dist[i * numa_dist_cnt + j] =
+ node_distance(i, j);
+ }
+
+ /* commit */
+ *numa_meminfo = ei;
+
+ /*
+ * Transform __apicid_to_node table to use emulated nids by
+ * reverse-mapping phys_nid. The maps should always exist but fall
+ * back to zero just in case.
+ */
+ for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
+ if (__apicid_to_node[i] == NUMA_NO_NODE)
+ continue;
+ for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
+ if (__apicid_to_node[i] == emu_nid_to_phys[j])
+ break;
+ __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
+ }
+
+ /* make sure all emulated nodes are mapped to a physical node */
+ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+ if (emu_nid_to_phys[i] == NUMA_NO_NODE)
+ emu_nid_to_phys[i] = 0;
+
+ /* transform distance table */
+ numa_reset_distance();
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ for (j = 0; j < MAX_NUMNODES; j++) {
+ int physi = emu_nid_to_phys[i];
+ int physj = emu_nid_to_phys[j];
+ int dist;
+
+ if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
+ dist = physi == physj ?
+ LOCAL_DISTANCE : REMOTE_DISTANCE;
+ else
+ dist = phys_dist[physi * numa_dist_cnt + physj];
+
+ numa_set_distance(i, j, dist);
+ }
+ }
+ return;
+
+no_emu:
+ /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */
+ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+ emu_nid_to_phys[i] = i;
+}
+
+#ifndef CONFIG_DEBUG_PER_CPU_MAPS
+void __cpuinit numa_add_cpu(int cpu)
+{
+ int physnid, nid;
+
+ nid = numa_cpu_node(cpu);
+ if (nid == NUMA_NO_NODE)
+ nid = early_cpu_to_node(cpu);
+ BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));
+
+ physnid = emu_nid_to_phys[nid];
+
+ /*
+ * Map the cpu to each emulated node that is allocated on the physical
+ * node of the cpu's apic id.
+ */
+ for_each_online_node(nid)
+ if (emu_nid_to_phys[nid] == physnid)
+ cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
+}
+
+void __cpuinit numa_remove_cpu(int cpu)
+{
+ int i;
+
+ for_each_online_node(i)
+ cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
+}
+#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
+static void __cpuinit numa_set_cpumask(int cpu, int enable)
+{
+ struct cpumask *mask;
+ int nid, physnid, i;
+
+ nid = early_cpu_to_node(cpu);
+ if (nid == NUMA_NO_NODE) {
+ /* early_cpu_to_node() already emits a warning and trace */
+ return;
+ }
+
+ physnid = emu_nid_to_phys[nid];
+
+ for_each_online_node(i) {
+ if (emu_nid_to_phys[i] != physnid)
+ continue;
+
+ mask = debug_cpumask_set_cpu(cpu, enable);
+ if (!mask)
+ return;
+
+ if (enable)
+ cpumask_set_cpu(cpu, mask);
+ else
+ cpumask_clear_cpu(cpu, mask);
+ }
+}
+
+void __cpuinit numa_add_cpu(int cpu)
+{
+ numa_set_cpumask(cpu, 1);
+}
+
+void __cpuinit numa_remove_cpu(int cpu)
+{
+ numa_set_cpumask(cpu, 0);
+}
+#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
Index: work/arch/x86/mm/numa_internal.h
===================================================================
--- /dev/null
+++ work/arch/x86/mm/numa_internal.h
@@ -0,0 +1,31 @@
+#ifndef __X86_MM_NUMA_INTERNAL_H
+#define __X86_MM_NUMA_INTERNAL_H
+
+#include <linux/types.h>
+#include <asm/numa.h>
+
+struct numa_memblk {
+ u64 start;
+ u64 end;
+ int nid;
+};
+
+struct numa_meminfo {
+ int nr_blks;
+ struct numa_memblk blk[NR_NODE_MEMBLKS];
+};
+
+void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
+void __init numa_reset_distance(void);
+
+#ifdef CONFIG_NUMA_EMU
+void __init numa_emulation(struct numa_meminfo *numa_meminfo,
+ int numa_dist_cnt);
+#else
+static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
+ int numa_dist_cnt)
+{ }
+#endif
+
+#endif /* __X86_MM_NUMA_INTERNAL_H */
* Re: [PATCH UPDATED 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c
2011-02-21 8:26 ` [PATCH UPDATED " Tejun Heo
@ 2011-02-21 17:10 ` Yinghai Lu
2011-02-22 10:14 ` Tejun Heo
0 siblings, 1 reply; 9+ messages in thread
From: Yinghai Lu @ 2011-02-21 17:10 UTC (permalink / raw)
To: Tejun Heo
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, David Rientjes,
linux-kernel, Cyrill Gorcunov
On 02/21/2011 12:26 AM, Tejun Heo wrote:
> Create numa_emulation.c and move all NUMA emulation code there. The
> definitions of struct numa_memblk and numa_meminfo are moved to
> numa_64.h. Also, numa_remove_memblk_from(), numa_cleanup_meminfo(),
> numa_reset_distance() along with numa_emulation() are made global.
>
> - v2: Internal declarations moved to numa_internal.h as suggested by
> Yinghai.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> ---
> Yinghai, does it look okay to you?
>
Yes.
Acked-by: Yinghai Lu <yinghai@kernel.org>
* Re: [PATCH UPDATED 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c
2011-02-21 17:10 ` Yinghai Lu
@ 2011-02-22 10:14 ` Tejun Heo
0 siblings, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2011-02-22 10:14 UTC (permalink / raw)
To: Yinghai Lu
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, David Rientjes,
linux-kernel, Cyrill Gorcunov
On Mon, Feb 21, 2011 at 09:10:48AM -0800, Yinghai Lu wrote:
> Acked-by: Yinghai Lu <yinghai@kernel.org>
Thanks. Patches 1-3 applied to x86-numa.
--
tejun
* Re: [PATCH 2/3 tip:x86/mm] x86-64, NUMA: Move NUMA emulation into numa_emulation.c
2011-02-20 21:58 ` David Rientjes
@ 2011-02-24 19:38 ` Yinghai Lu
0 siblings, 0 replies; 9+ messages in thread
From: Yinghai Lu @ 2011-02-24 19:38 UTC (permalink / raw)
To: David Rientjes, Ingo Molnar
Cc: Tejun Heo, Thomas Gleixner, H. Peter Anvin, linux-kernel,
Cyrill Gorcunov
On 02/20/2011 01:58 PM, David Rientjes wrote:
> On Fri, 18 Feb 2011, Yinghai Lu wrote:
>
>> David, can you please check tip/x86/mm ?
>>
>
> Yeah, I'm in the process of doing so without this set yet. I'm getting
> boot failures for numa=fake=128M right now, so I'll need to investigate
> and diagnose that issue before adding more patches.
Any info about the boot failures?
Thanks
Yinghai