* Re: [patch 2/2] mm: add node hotplug emulation
       [not found] <A24AE1FFE7AEC5489F83450EE98351BF28723FC4A7@shsmsx502.ccr.corp.intel.com>
@ 2010-11-22  1:47 ` Shaohui Zheng
  2010-11-24  6:45   ` Shaohui Zheng
  2010-11-28  2:00   ` David Rientjes
  0 siblings, 2 replies; 7+ messages in thread
From: Shaohui Zheng @ 2010-11-22  1:47 UTC
To: akpm, gregkh, rientjes
Cc: mingo, hpa, tglx, lethal, ak, yinghai, randy.dunlap, linux-kernel, linux-mm, x86, haicheng.li, haicheng.li, shaohui.zheng, shaohui.zheng

On Mon, Nov 22, 2010 at 09:47:02AM +0800, Zheng, Shaohui wrote:
> Add an interface to allow new nodes to be added when performing memory
> hot-add. This provides a convenient interface to test memory hotplug
> notifier callbacks and surrounding hotplug code when new nodes are
> onlined without actually having a machine with such hotpluggable SRAT
> entries.
>
> This adds a new interface at /sys/devices/system/memory/add_node that
> behaves in a similar way to the memory hot-add "probe" interface. Its
> format is size@start, where "size" is the size of the new node to be
> added and "start" is the physical address of the new memory.
>
> The new node id is a currently offline, but possible, node. The bit must
> be set in node_possible_map so that nr_node_ids is sized appropriately.
>
> For emulation on x86, for example, it would be possible to set aside
> memory for hotplugged nodes (say, anything above 2G) and to add an
> additional three nodes as being possible on boot with
>
> 	mem=2G numa=possible=3
>
> and then creating a new 128M node at runtime:
>
> 	# echo 128M@0x80000000 > /sys/devices/system/memory/add_node
> 	On node 1 totalpages: 0
> 	init_memory_mapping: 0000000080000000-0000000088000000
> 	 0080000000 - 0088000000 page 2M

For physical cpu/memory hotplug we have a single interface, probe/release.
It is the _standard_ interface, and it is not only for x86; ppc uses the
interface as well. Node hotplug should follow the same rule.

You are creating a new interface, /sys/devices/system/memory/add_node, to add
both memory and a node. That is a DUPLICATE of the memory probe interface, and
it breaks the rule.

I did NOT see any feature difference from our emulator patch,
http://lkml.org/lkml/2010/11/16/740. You picked one piece of the emulator and
started another thread. You are trying to replace the interface with a new
one, which is not recommended. The memory probe interface is already powerful
and flexible enough once our patch is applied. More importantly, it keeps the
old directives and maintains backwards compatibility.

Add a memory section (128M) to node 3 (booted with mem=1024m):

	echo 0x40000000,3 > memory/probe

We also made it friendlier; it is possible to add memory with

	echo 3g > memory/probe
	echo 1024m,3 > memory/probe

It maintains backwards compatibility.

Another format, suggested by Dave Hansen:

	echo physical_address=0x40000000 numa_node=3 > memory/probe

We should not need a duplicate interface,
/sys/devices/system/memory/add_node, here.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
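To make the probe formats discussed above concrete, here is a minimal userspace sketch, in C, of how a string such as "0x40000000,3" or "1024m,3" could be split into an address and an optional node id. It is not the code from the emulator patch: parse_size() and parse_probe() are hypothetical names, parse_size() is only a stand-in for a memparse()-style helper, and the "physical_address=... numa_node=..." form is not handled.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a memparse()-style helper: a number in any
 * base accepted by strtoull(), optionally followed by a k, m or g suffix.
 * *end is left pointing just past the consumed text.
 */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	if (*end == s)		/* no digits consumed */
		return 0;

	switch (tolower((unsigned char)**end)) {
	case 'g':
		v <<= 10;
		/* fall through */
	case 'm':
		v <<= 10;
		/* fall through */
	case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Hypothetical parser for "addr[,nid]". On success returns 0 and fills
 * in *addr and *nid; *nid is -1 when the caller did not name a node,
 * which corresponds to the old single-value probe format.
 */
static int parse_probe(const char *s, unsigned long long *addr, int *nid)
{
	char *end;

	*addr = parse_size(s, &end);
	if (end == s)
		return -1;

	*nid = -1;
	if (*end == ',') {
		const char *p = end + 1;
		long n = strtol(p, &end, 10);

		if (end == p || n < 0)
			return -1;
		*nid = (int)n;
	}
	return (*end == '\0' || *end == '\n') ? 0 : -1;
}
```

With this sketch, "0x40000000,3" and "1024m,3" describe the same address and node, and "3g" leaves the node unset, which is how the old format can keep working alongside the extended one.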
* Re: [patch 2/2] mm: add node hotplug emulation
  2010-11-22  1:47 ` [patch 2/2] mm: add node hotplug emulation Shaohui Zheng
@ 2010-11-24  6:45   ` Shaohui Zheng
  2010-11-28  2:01     ` David Rientjes
  2010-11-28  2:00   ` David Rientjes
  1 sibling, 1 reply; 7+ messages in thread
From: Shaohui Zheng @ 2010-11-24  6:45 UTC
To: akpm, gregkh, rientjes
Cc: mingo, hpa, tglx, lethal, ak, yinghai, randy.dunlap, linux-kernel, linux-mm, x86, haicheng.li, haicheng.li, shaohui.zheng

On Mon, Nov 22, 2010 at 09:47:06AM +0800, Shaohui Zheng wrote:
> On Mon, Nov 22, 2010 at 09:47:02AM +0800, Zheng, Shaohui wrote:
>
> For cpu/memory physical hotplug, we have the unique interface probe/release,
> it is the _standard_ interface, it is not only for x86, ppc use the interface
> as well. For node hotplug, it should follow the rule.
>
> You are creating a new interface /sys/devices/system/memory/add_node to add both
> memory and node, you are just trying to create DUPLICATED feature with the
> memory probe interface, it breaks the rule.
>
> I did NOT see the feature difference with our emulator patch http://lkml.org/lkml/2010/11/16/740,
> you pick up a piece of feature from emulator, and create an other thread. You
> are trying to replace the interface with a new one, which is not recommended.
> the memory probe interface is already powerful and flexible enough after apply
> our patch. What's more important, it keeps the old directives, and it maintains
> backwards compatibility.
>
> Add a memory section(128M) to node 3(boots with mem=1024m)
>
> 	echo 0x40000000,3 > memory/probe
>
> And more we make it friendly, it is possible to add memory to do
>
> 	echo 3g > memory/probe
> 	echo 1024m,3 > memory/probe
>
> It maintains backwards compatibility.
>
> Another format suggested by Dave Hansen:
>
> 	echo physical_address=0x40000000 numa_node=3 > memory/probe
>
> we should not need duplicated interface /sys/devices/system/memory/add_node here.

Ah, a long silence.

Does anybody know the status of this patch? Has it been accepted by the
maintainer? I am not on the patch's CC list, so I will not get a mail notice
when the patch is accepted.

The other hotplug emulator patches depend on this patch, so I cannot rework
my patchset while this patch is still pending.

Thanks.

--
Thanks & Regards,
Shaohui
* Re: [patch 2/2] mm: add node hotplug emulation
  2010-11-24  6:45   ` Shaohui Zheng
@ 2010-11-28  2:01     ` David Rientjes
  0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2010-11-28  2:01 UTC
To: Shaohui Zheng
Cc: akpm, gregkh, mingo, hpa, tglx, lethal, ak, yinghai, randy.dunlap, linux-kernel, linux-mm, x86, haicheng.li, haicheng.li, shaohui.zheng

On Wed, 24 Nov 2010, Shaohui Zheng wrote:

> ah, a long time silence.
>

Sorry, last week included a holiday in the USA.

> Does somebody know the status of this patch, is it accepted by the maintainer?
> I am not in patch's CC list, so I will not get mail notice when the patch was
> accepted by the maintainer.
>

Neither of these patches have been merged anywhere yet, you're not missing
anything :)  If/when Andrew picks it up, I'm quite certain he'll cc you on
it.
* Re: [patch 2/2] mm: add node hotplug emulation
  2010-11-22  1:47 ` [patch 2/2] mm: add node hotplug emulation Shaohui Zheng
  2010-11-24  6:45   ` Shaohui Zheng
@ 2010-11-28  2:00   ` David Rientjes
  1 sibling, 0 replies; 7+ messages in thread
From: David Rientjes @ 2010-11-28  2:00 UTC
To: Shaohui Zheng
Cc: akpm, gregkh, mingo, hpa, tglx, lethal, ak, yinghai, randy.dunlap, linux-kernel, linux-mm, x86, haicheng.li, haicheng.li, shaohui.zheng

On Mon, 22 Nov 2010, Shaohui Zheng wrote:

> > and then creating a new 128M node at runtime:
> >
> > 	# echo 128M@0x80000000 > /sys/devices/system/memory/add_node
> > 	On node 1 totalpages: 0
> > 	init_memory_mapping: 0000000080000000-0000000088000000
> > 	 0080000000 - 0088000000 page 2M
>
> For cpu/memory physical hotplug, we have the unique interface probe/release,
> it is the _standard_ interface, it is not only for x86, ppc use the interface
> as well. For node hotplug, it should follow the rule.
>
> You are creating a new interface /sys/devices/system/memory/add_node to add both
> memory and node, you are just trying to create DUPLICATED feature with the
> memory probe interface, it breaks the rule.
>

It's not duplicated, the function of add_node is distinct since it maps
the added memory to a node that wasn't previously defined (for the x86
case, defined by the SRAT).  I think this is better than an additional
abstraction layer that remaps memory to nodes above what the BIOS has
defined, and there's nothing architecture specific about add_node; if an
arch can do probe then it can use this new interface.

> I did NOT see the feature difference with our emulator patch http://lkml.org/lkml/2010/11/16/740,
> you pick up a piece of feature from emulator, and create an other thread. You
> are trying to replace the interface with a new one, which is not recommended.
> the memory probe interface is already powerful and flexible enough after apply
> our patch. What's more important, it keeps the old directives, and it maintains
> backwards compatibility.
>

This achieves the same goal in a much cleaner and generic way.  It doesn't
replace anything that currently sits in the kernel, instead it competes
directly with your model for node hotplug emulation.

> Add a memory section(128M) to node 3(boots with mem=1024m)
>
> 	echo 0x40000000,3 > memory/probe
>
> And more we make it friendly, it is possible to add memory to do
>
> 	echo 3g > memory/probe
> 	echo 1024m,3 > memory/probe
>
> It maintains backwards compatibility.
>

My patch doesn't break backwards compatibility, it adds a new debugfs file
that allows you to test node hotplug.

> Another format suggested by Dave Hansen:
>
> 	echo physical_address=0x40000000 numa_node=3 > memory/probe
>
> we should not need duplicated interface /sys/devices/system/memory/add_node here.
>

We don't need to define a node id, we only need to ensure that a possible
node is not yet online and use it; we don't gain anything by trying to
hotplug node ids in a sparse or interleaved way (although it is certainly
possible with a combination of my patch and CONFIG_MEMORY_HOTREMOVE).
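The add_node behaviour described in this reply (a "size@start" string, plus picking a node that is possible but not yet online) can be illustrated with a small userspace sketch in C. This is an illustration only, not the code from either patch: the function names are hypothetical, the node maps are modelled as plain 64-bit masks rather than the kernel's nodemask_t, and parse_size() stands in for a memparse()-style helper.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical memparse()-like helper: number plus optional k/m/g suffix. */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	if (*end == s)		/* no digits consumed */
		return 0;

	switch (tolower((unsigned char)**end)) {
	case 'g':
		v <<= 10;
		/* fall through */
	case 'm':
		v <<= 10;
		/* fall through */
	case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Parse "size@start", e.g. "128M@0x80000000". Returns 0 on success. */
static int parse_add_node(const char *s, uint64_t *size, uint64_t *start)
{
	char *end;

	*size = parse_size(s, &end);
	if (end == s || *size == 0 || *end != '@')
		return -1;

	s = end + 1;
	*start = strtoull(s, &end, 0);
	if (end == s)
		return -1;
	return (*end == '\0' || *end == '\n') ? 0 : -1;
}

/*
 * Return the lowest node id that is set in 'possible' but not in
 * 'online', or -1 if every possible node is already online. Bit N of
 * each mask represents node N (an assumption of this sketch).
 */
static int first_offline_possible_node(uint64_t possible, uint64_t online)
{
	uint64_t candidates = possible & ~online;
	int nid;

	for (nid = 0; nid < 64; nid++)
		if (candidates & (UINT64_C(1) << nid))
			return nid;
	return -1;
}
```

Applied to the example from the patch description, "128M@0x80000000" yields a 128M region ending at 0x88000000, and with only node 0 online out of four possible nodes the chosen id is 1, which is consistent with the "On node 1" line in the quoted log.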
* [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks
@ 2010-11-17  2:07 shaohui.zheng
  2010-11-17  2:08 ` [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation shaohui.zheng
  0 siblings, 1 reply; 7+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:07 UTC
To: akpm, linux-mm; +Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng

* PATCHSET INTRODUCTION

patch 1: Add a function to hide a memory region via the e820 table. The
         emulator then uses these memory regions to fake offlined NUMA nodes.
patch 2: Infrastructure of NUMA hotplug emulation; introduces "hide node".
patch 3: Provide a userland interface to hotplug-add fake offlined nodes.
patch 4: Abstract the cpu register functions, making these interfaces
         friendly to cpu hotplug emulation.
patch 5: Support cpu probe/release on x86; it provides a software method to
         hot add/remove cpus through a sysfs interface.
patch 6: Fake a CPU socket with a logical CPU on x86, to prevent the
         scheduling domain from building an incorrect hierarchy.
patch 7: Extend the memory probe interface to support NUMA; memory can be
         added to a specified node through the interface.
patch 8: Documentation.

* FEEDBACKS & RESPONSES

1) Patch 0
Balbir & Greg: Suggest using git/quilt to manage and send the patchset.
	Response: Thanks for the recommendation. With help from Fengguang, I got
	quilt working; it is a great tool.

2) Patch 2
Jaswinder Singh: if (hidden_num) is not required in patch 2.
	Response: Good catch, it was removed in v2.

3) Patch 3
Dave Hansen: Suggest creating a dedicated sysfs file for each possible node.
Greg: How big would this "list" be? What will it look like exactly?
Haicheng: It should follow "one value per file". It intends to show the
	acceptable parameters.

	For example, if we have 4 fake offlined nodes, like node 2-5, then:
		$ cat /sys/devices/system/node/probe
		2-5

	Then the user hot-adds node 3 to the system:
		$ echo 3 > /sys/devices/system/node/probe
		$ cat /sys/devices/system/node/probe
		2,4-5

Greg: As you are trying to add a new sysfs file, please create the matching
	Documentation/ABI/ file as well.
	Response: We missed it; it was added in v2.

4) Patch 4 & 5
Paul Mundt: This looks like an incredibly painful interface. How about scrapping all
	of this _emu() mess and just reworking the register_cpu() interface?
	Response: Accepted Paul's suggestion and removed the cpu _emu functions.

5) Patch 7
Dave Hansen: If we're going to put multiple values into the file now and
	add to the ABI, can we be more explicit about it?
		echo "physical_address=0x40000000 numa_node=3" > memory/probe
	Response: Dave's new interface was accepted, and we still keep the old
	format for compatibility. We documented these interfaces in
	Documentation/ABI in v2.
Greg: Suggest using configfs in place of the memory probe interface.
Andi: This is a debugging interface. It doesn't need to have the
	most pretty interface in the world, because it will be only used for
	QA by a few people. it's just a QA interface, not the next generation
	of POSIX.
	Response: We still keep it as a sysfs interface since the node/cpu/memory
	probe interfaces are all in sysfs. We can create another group of patches
	to support configfs if there is a strong requirement in the future.

* WHAT IS HOTPLUG EMULATOR

"NUMA hotplug emulator" is the collective name for the hotplug emulation. It
is able to emulate NUMA node hotplug purely in software. It is intended to
help people easily debug and test node/cpu/memory hotplug related code on a
machine without NUMA hotplug support, even a UMA machine.

The emulator provides a mechanism to emulate the process of physical cpu/mem
hot-add, which makes it possible for kernel developers to debug CPU and
memory hotplug on machines without NUMA support. It offers an interface for
cpu and memory hotplug testing.

* WHY DO WE USE HOTPLUG EMULATOR

We have been focusing on hotplug emulation for a few months. The emulator
helps the team reproduce all the major hotplug bugs, and it plays an important
role in hotplug code quality assurance. Because of the hotplug emulator, we
have already moved most of the debugging work to a virtual environment.

* Principles & Usages

The NUMA hotplug emulator includes 3 different parts. We add a menu item to
menuconfig to enable/disable them.

1) Node hotplug emulation:

The emulator first hides RAM via the E820 table, and then it can
fake offlined nodes with the hidden RAM.

After system bootup, the user is able to hotplug-add these offlined
nodes, which is similar to real hotplug hardware behavior.

Use the boot option "numa=hide=N*size" to fake offlined nodes:
	- N is the number of hidden nodes
	- size is the memory size (in MB) per hidden node.

There is a sysfs entry "probe" under /sys/devices/system/node/ for the user
to hotplug the fake offlined nodes:

 - to show all fake offlined nodes:
    $ cat /sys/devices/system/node/probe

 - to hot-add a fake offlined node, e.g. nodeid is N:
    $ echo N > /sys/devices/system/node/probe

2) CPU hotplug emulation:

The emulator reserves CPUs through a grub parameter; the reserved CPUs can be
hot-added/hot-removed by a software method.

When hotplugging a CPU with the emulator, we use a logical CPU to emulate the
CPU hotplug process. For CPUs that support SMT, some logical CPUs are in the
same socket, but with the emulator they may be located in different NUMA
nodes. We put the logical CPU into a fake CPU socket and assign it a unique
phys_proc_id. Each fake socket holds only one logical CPU.

 - to hide CPUs
	- Use the boot option "maxcpus=N" to hide CPUs;
	  N is the number of CPUs to initialize
	- Use the boot option "cpu_hpe=on" to enable cpu hotplug emulation;
	  when cpu_hpe is enabled, the rest of the CPUs will not be initialized

 - to hot-add a CPU to a node
	$ echo nid > cpu/probe

 - to hot-remove a CPU
	$ echo nid > cpu/release

3) Memory hotplug emulation:

The emulator reserves memory before the OS boots. The reserved memory region
is removed from the e820 table, and it can be hot-added via the probe
interface. This interface was extended to support adding memory to a
specified node, and it maintains backwards compatibility.

The difficulty of memory release is well-known; we have no plan for it so far.

 - reserve memory through a grub parameter
	mem=1024m

 - add a memory section to node 3
	$ echo 0x40000000,3 > memory/probe
	OR
	$ echo 1024m,3 > memory/probe

* ACKNOWLEDGMENT

The hotplug emulator is a team effort; thanks to all of them.
They are:
Andi Kleen, Haicheng Li, Shaohui Zheng, Fengguang Wu and Yongkang You

--
Thanks & Regards,
Shaohui
* [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17  2:07 [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks shaohui.zheng
@ 2010-11-17  2:08 ` shaohui.zheng
  2010-11-17  8:16   ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC
To: akpm, linux-mm
Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng, Yinghai Lu, Haicheng Li, Shaohui Zheng

[-- Attachment #1: 002-hotplug-emulator-x86-infrastructure-of-node-hotplug-emulation.patch --]
[-- Type: text/plain, Size: 7348 bytes --]

From: Haicheng Li <haicheng.li@intel.com>

NUMA hotplug emulator introduces a new node state N_HIDDEN to
identify the fake offlined node. It firstly hides RAM via E820
table and then emulates fake offlined nodes with the hidden RAM.

After system bootup, user is able to hotplug-add these offlined
nodes, which is just similar to a real hardware hotplug behavior.

Using boot option "numa=hide=N*size" to fake offlined nodes:
	- N is the number of hidden nodes
	- size is the memory size (in MB) per hidden node.

OPEN: Kernel might use part of hidden memory region as RAM buffer,
      now emulator directly hide 128M extra space to workaround
      this issue.  Any better way to avoid this conflict?

CC: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/arch/x86/include/asm/numa_64.h
===================================================================
--- linux-hpe4.orig/arch/x86/include/asm/numa_64.h	2010-11-15 17:13:02.453461462 +0800
+++ linux-hpe4/arch/x86/include/asm/numa_64.h	2010-11-15 17:13:07.093461818 +0800
@@ -37,7 +37,7 @@
 extern void __cpuinit numa_add_cpu(int cpu);
 extern void __cpuinit numa_remove_cpu(int cpu);
 
-#ifdef CONFIG_NUMA_EMU
+#if defined(CONFIG_NUMA_EMU) || defined(CONFIG_NODE_HOTPLUG_EMU)
 #define FAKE_NODE_MIN_SIZE	((u64)64 << 20)
 #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
 #endif /* CONFIG_NUMA_EMU */
Index: linux-hpe4/arch/x86/mm/numa_64.c
===================================================================
--- linux-hpe4.orig/arch/x86/mm/numa_64.c	2010-11-15 17:13:02.463461371 +0800
+++ linux-hpe4/arch/x86/mm/numa_64.c	2010-11-15 17:21:05.510961676 +0800
@@ -304,6 +304,123 @@
 	}
 }
 
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+static char *hp_cmdline __initdata;
+static struct bootnode *hidden_nodes;
+static u64 hp_start;
+static long hidden_num, hp_size;
+static u64 nodes_size[MAX_NUMNODES] __initdata;
+
+int hotadd_hidden_nodes(int nid)
+{
+	int ret;
+
+	if (!node_hidden(nid))
+		return -EINVAL;
+
+	ret = add_memory(nid, hidden_nodes[nid].start,
+			 hidden_nodes[nid].end - hidden_nodes[nid].start);
+	if (!ret) {
+		node_clear_hidden(nid);
+		return 0;
+	} else {
+		return -EEXIST;
+	}
+}
+
+/* parse the comand line for numa=hide */
+static long __init parse_hide_nodes(char *hp_cmdline)
+{
+	int coef = 1, nid = 0;
+	u64 size = 0;
+	long total = 0;
+	char buf[512], *p;
+
+	/* parse numa=hide command-line */
+	hidden_num = 0;
+	p = buf;
+	while (1) {
+		if (*hp_cmdline == ',' || *hp_cmdline == '\0') {
+			*p = '\0';
+			size = simple_strtoul(buf, NULL, 0);
+			printk(KERN_ERR "size: %dM buf:%s coef: %d.\n", (int)size, buf, coef);
+			if (!((size<<20) & FAKE_NODE_MIN_HASH_MASK))
+				printk(KERN_ERR "%d M is less than minimum node size, ignore it.\n", (int)size);
+
+			size <<= 20;
+			/* Round down to nearest FAKE_NODE_MIN_SIZE. */
+			size &= FAKE_NODE_MIN_HASH_MASK;
+
+			if (size) {
+				int i;
+				total += size * coef;
+				for (i = 0; i < coef; i++)
+					nodes_size[nid++] = size;
+				hidden_num += coef;
+			}
+
+			coef = 1;
+			p = buf;
+			if (*hp_cmdline == '\0')
+				break;
+			hp_cmdline++;
+		} else if (*hp_cmdline == '*') {
+			*p++ = '\0';
+			coef = simple_strtoul(buf, NULL, 0);
+			p = buf;
+			hp_cmdline++;
+		} else if (!isdigit(*hp_cmdline)) {
+			break;
+		}
+
+		*p++ = *hp_cmdline++;
+	}
+
+	return total;
+}
+
+static void __init numa_hide_nodes(void)
+{
+	hp_size = parse_hide_nodes(hp_cmdline);
+
+	hp_start = e820_hide_mem(hp_size);
+	if (hp_start <= 0) {
+		printk(KERN_ERR "Hide too much memory, disable node hotplug emualtion.");
+		hidden_num = 0;
+		return;
+	}
+
+	/* leave 128M space for possible RAM buffer usage later
+	   any other better way to avoid this conflict?*/
+
+	e820_hide_mem(128*1024*1024);
+}
+
+static void __init numa_hotplug_emulation(void)
+{
+	int i, num_nodes = 0, nid;
+
+	for_each_online_node(i)
+		if (i > num_nodes)
+			num_nodes = i;
+
+	i = num_nodes + hidden_num;
+	if (!hidden_nodes) {
+		hidden_nodes = alloc_bootmem(sizeof(struct bootnode) * i);
+		memset(hidden_nodes, 0, sizeof(struct bootnode) * i);
+	}
+
+	nid = num_nodes + 1;
+	for (i = 0; i < hidden_num; i++) {
+		node_set(nid, node_possible_map);
+		hidden_nodes[nid].start = hp_start;
+		hidden_nodes[nid].end = hp_start + (nodes_size[i]);
+		hp_start = hidden_nodes[nid].end;
+		node_set_hidden(nid++);
+	}
+}
+#endif /* CONFIG_NODE_HOTPLUG_EMU */
+
 #ifdef CONFIG_NUMA_EMU
 /* Numa emulation */
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
@@ -658,7 +775,7 @@
 
 #ifdef CONFIG_NUMA_EMU
 	if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -666,14 +783,14 @@
 #ifdef CONFIG_ACPI_NUMA
 	if (!numa_off && acpi && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
 						  last_pfn << PAGE_SHIFT))
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
 
 #ifdef CONFIG_K8_NUMA
 	if (!numa_off && k8 && !k8_scan_nodes())
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -693,6 +810,13 @@
 		numa_set_node(i, 0);
 	e820_register_active_regions(0, start_pfn, last_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, last_pfn << PAGE_SHIFT);
+
+done:
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	if (hidden_num)
+		numa_hotplug_emulation();
+#endif
+	return;
 }
 
 unsigned long __init numa_free_all_bootmem(void)
@@ -720,6 +844,12 @@
 	if (!strncmp(opt, "fake=", 5))
 		cmdline = opt + 5;
 #endif
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	if (!strncmp(opt, "hide=", 5)) {
+		hp_cmdline = opt + 5;
+		numa_hide_nodes();
+	}
+#endif
 #ifdef CONFIG_ACPI_NUMA
 	if (!strncmp(opt, "noacpi", 6))
 		acpi_numa = -1;
Index: linux-hpe4/include/linux/nodemask.h
===================================================================
--- linux-hpe4.orig/include/linux/nodemask.h	2010-11-15 17:13:02.463461371 +0800
+++ linux-hpe4/include/linux/nodemask.h	2010-11-15 17:13:07.093461818 +0800
@@ -371,6 +371,10 @@
  */
 enum node_states {
 	N_POSSIBLE,		/* The node could become online at some point */
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	N_HIDDEN,		/* The node is hidden at booting time, could be
+				 * onlined in run time */
+#endif
 	N_ONLINE,		/* The node is online */
 	N_NORMAL_MEMORY,	/* The node has regular memory */
 #ifdef CONFIG_HIGHMEM
@@ -470,6 +474,13 @@
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+#define node_set_hidden(node)	   node_set_state((node), N_HIDDEN)
+#define node_clear_hidden(node)	   node_clear_state((node), N_HIDDEN)
+#define node_hidden(node)	node_state((node), N_HIDDEN)
+extern int hotadd_hidden_nodes(int nid);
+#endif
+
 #define for_each_node(node)	   for_each_node_state(node, N_POSSIBLE)
 #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
 
--
Thanks & Regards,
Shaohui
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17  2:08 ` [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation shaohui.zheng
@ 2010-11-17  8:16   ` David Rientjes
  2010-11-17  7:51     ` Shaohui Zheng
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2010-11-17  8:16 UTC
To: Shaohui Zheng
Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, 17 Nov 2010, shaohui.zheng@intel.com wrote:

> From: Haicheng Li <haicheng.li@intel.com>
>
> NUMA hotplug emulator introduces a new node state N_HIDDEN to
> identify the fake offlined node. It firstly hides RAM via E820
> table and then emulates fake offlined nodes with the hidden RAM.
>

Hmm, why can't you use numa=hide to hide a specified quantity of memory
from the kernel and then use the add_memory() interface to hot-add the
offlined memory in the desired quantity?  In other words, why do you need
to track the offlined nodes with a state?

The userspace interface would take a desired size of hidden memory to
hot-add and the node id would be the first_unset_node(node_online_map).

> After system bootup, user is able to hotplug-add these offlined
> nodes, which is just similar to a real hardware hotplug behavior.
>
> Using boot option "numa=hide=N*size" to fake offlined nodes:
> 	- N is the number of hidden nodes
> 	- size is the memory size (in MB) per hidden node.
>

size should be parsed with memparse() so users can specify 'M' or 'G', it
would even make your parsing code simpler.
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17  8:16   ` David Rientjes
@ 2010-11-17  7:51     ` Shaohui Zheng
  2010-11-17 21:10       ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Shaohui Zheng @ 2010-11-17  7:51 UTC
To: David Rientjes
Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, Nov 17, 2010 at 12:16:47AM -0800, David Rientjes wrote:
> On Wed, 17 Nov 2010, shaohui.zheng@intel.com wrote:
>
> > From: Haicheng Li <haicheng.li@intel.com>
> >
> > NUMA hotplug emulator introduces a new node state N_HIDDEN to
> > identify the fake offlined node. It firstly hides RAM via E820
> > table and then emulates fake offlined nodes with the hidden RAM.
> >
>
> Hmm, why can't you use numa=hide to hide a specified quantity of memory
> from the kernel and then use the add_memory() interface to hot-add the
> offlined memory in the desired quantity?  In other words, why do you need
> to track the offlined nodes with a state?
>
> The userspace interface would take a desired size of hidden memory to
> hot-add and the node id would be the first_unset_node(node_online_map).

Yes, it is a good idea; your solution is indeed what we did in our first 2
versions. We used mem=memsize to hide memory, and we called the add_memory
interface to hot-add offlined memory in the desired quantity. We could also
add it to desired nodes (even though the nodes did not exist). It is a very
flexible solution.

However, this solution was dropped once we noticed NUMA emulation, which we
felt we should reuse.

Currently, our solution creates static nodes when the OS boots; only a node
with state N_HIDDEN can be hot-added with the node/probe interface, and we can query

>
> > After system bootup, user is able to hotplug-add these offlined
> > nodes, which is just similar to a real hardware hotplug behavior.
> >
> > Using boot option "numa=hide=N*size" to fake offlined nodes:
> > 	- N is the number of hidden nodes
> > 	- size is the memory size (in MB) per hidden node.
> >
>
> size should be parsed with memparse() so users can specify 'M' or 'G', it
> would even make your parsing code simpler.

Agreed. If we use memparse(), users can specify 'M' or 'G'; we will add it
when we send the next version.

--
Thanks & Regards,
Shaohui
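As a rough illustration of the memparse() suggestion, the "N*size" list could be parsed along the following lines. This is a userspace sketch in C, not the follow-up patch: parse_size() is a hypothetical stand-in for the kernel's memparse(), parse_hide() and its array interface are invented for the example, and one behavioural difference is assumed, namely that a bare number is taken as bytes, so sizes need an explicit suffix ("2*128M") instead of the implicit MB used by the posted patch.

```c
#include <ctype.h>
#include <stdlib.h>

/* Same minimum node size as FAKE_NODE_MIN_SIZE in the patch (64M). */
#define FAKE_NODE_MIN_SIZE	(64ULL << 20)

/* Hypothetical memparse()-like helper: number plus optional k/m/g suffix. */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	if (*end == s)		/* no digits consumed */
		return 0;

	switch (tolower((unsigned char)**end)) {
	case 'g':
		v <<= 10;
		/* fall through */
	case 'm':
		v <<= 10;
		/* fall through */
	case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Parse a comma-separated list of "[N*]size" items into sizes[], one
 * entry per hidden node. Each size is rounded down to a multiple of
 * FAKE_NODE_MIN_SIZE, and items that round down to zero are ignored,
 * as in the posted patch. Returns the number of nodes, or -1 on a
 * malformed string or when more than 'max' nodes are requested.
 */
static int parse_hide(const char *s, unsigned long long sizes[], int max)
{
	int n = 0;

	while (*s) {
		char *end;
		unsigned long long size, i;
		unsigned long long coef = strtoull(s, &end, 10);

		if (end != s && *end == '*')
			s = end + 1;	/* "N*" prefix present */
		else
			coef = 1;	/* plain size, single node */

		size = parse_size(s, &end);
		if (end == s)
			return -1;
		size &= ~(FAKE_NODE_MIN_SIZE - 1);

		for (i = 0; size && i < coef; i++) {
			if (n == max)
				return -1;
			sizes[n++] = size;
		}

		if (*end == ',')
			s = end + 1;
		else if (*end == '\0')
			break;
		else
			return -1;
	}
	return n;
}
```

Compared with the character-by-character loop in the posted patch, the suffix handling and number parsing collapse into two library calls per item, which is roughly the simplification the review comment points at.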
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17  7:51     ` Shaohui Zheng
@ 2010-11-17 21:10       ` David Rientjes
  2010-11-18  4:14         ` Shaohui Zheng
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2010-11-17 21:10 UTC
To: Shaohui Zheng
Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, 17 Nov 2010, Shaohui Zheng wrote:

> > Hmm, why can't you use numa=hide to hide a specified quantity of memory
> > from the kernel and then use the add_memory() interface to hot-add the
> > offlined memory in the desired quantity?  In other words, why do you need
> > to track the offlined nodes with a state?
> >
> > The userspace interface would take a desired size of hidden memory to
> > hot-add and the node id would be the first_unset_node(node_online_map).
> Yes, it is a good idea, your solution is what we indeed do in our first 2
> versions. We use mem=memsize to hide memory, and we call add_memory interface
> to hot-add offlined memory with desired quantity, and we can also add to
> desired nodes(even through the nodes does not exists). it is very flexible
> solution.
>
> However, this solution was denied since we notice NUMA emulation, we should
> reuse it.
>

I don't understand why that's a requirement, NUMA emulation is a separate
feature.  Although both are primarily used to test and instrument other VM
and kernel code, NUMA emulation is restricted to only being used at boot
to fake nodes on smaller machines and can be used to test things like the
slab allocator.  The NUMA hotplug emulator that you're developing here is
primarily used to test the hotplug callbacks; for that use-case, it seems
particularly helpful if nodes can be hotplugged of various sizes and node
ids rather than having static characteristics that cannot be changed
without a reboot.

> Currently, our solution creates static nodes when OS boots, only the node with
> state N_HIDDEN can be hot-added with node/probe interface, and we can query
>

The idea that I've proposed (and you've apparently thought about and even
implemented at one point) is much more powerful than that.  We need not
query the state of hidden nodes that we've setup at boot but can rather
use the amount of hidden memory to setup the nodes in any way that we want
at runtime (various sizes, interleaved node ids, etc).
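The difference between the two models can be shown with a very small sketch: rather than fixing node boundaries at boot, keep only the extent of the hidden memory and carve nodes out of it on request. The C code below is a userspace illustration under stated assumptions, not anything from the thread's patches: struct hidden_pool and carve_node() are hypothetical names, requests are served in address order only, and SECTION_SIZE is assumed to be 128M to match the section size mentioned elsewhere in the thread.

```c
#include <stdint.h>

/* Assumed memory section size; requests must be a multiple of it. */
#define SECTION_SIZE	(UINT64_C(128) << 20)

/* Hidden memory that has not been handed out yet: [next, end). */
struct hidden_pool {
	uint64_t next;
	uint64_t end;
};

/*
 * Carve 'size' bytes off the front of the pool for a new node.
 * On success stores the base address in *base and returns 0; returns
 * -1 if the size is not section-aligned or the pool is too small.
 * The caller would then hot-add [*base, *base + size) to whichever
 * offline node id it picked.
 */
static int carve_node(struct hidden_pool *pool, uint64_t size, uint64_t *base)
{
	if (size == 0 || size % SECTION_SIZE != 0)
		return -1;
	if (size > pool->end - pool->next)
		return -1;

	*base = pool->next;
	pool->next += size;
	return 0;
}
```

With 512M hidden at 0x80000000, a 128M request followed by a 256M request would yield nodes based at 0x80000000 and 0x88000000, with 128M left for a later request; none of those sizes had to be chosen at boot.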
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17 21:10   ` David Rientjes
@ 2010-11-18  4:14     ` Shaohui Zheng
  2010-11-18  6:27       ` Paul Mundt
  0 siblings, 1 reply; 7+ messages in thread
From: Shaohui Zheng @ 2010-11-18  4:14 UTC
To: David Rientjes
Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak,
    shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, Nov 17, 2010 at 01:10:50PM -0800, David Rientjes wrote:
> On Wed, 17 Nov 2010, Shaohui Zheng wrote:
>
> > > Hmm, why can't you use numa=hide to hide a specified quantity of memory
> > > from the kernel and then use the add_memory() interface to hot-add the
> > > offlined memory in the desired quantity?  In other words, why do you need
> > > to track the offlined nodes with a state?
> > >
> > > The userspace interface would take a desired size of hidden memory to
> > > hot-add and the node id would be the first_unset_node(node_online_map).
> >
> > Yes, it is a good idea, your solution is what we indeed do in our first 2
> > versions. We use mem=memsize to hide memory, and we call add_memory interface
> > to hot-add offlined memory with desired quantity, and we can also add to
> > desired nodes(even though the nodes do not exist). it is very flexible
> > solution.
> >
> > However, this solution was denied since we notice NUMA emulation, we should
> > reuse it.
> >
>
> I don't understand why that's a requirement, NUMA emulation is a separate
> feature.  Although both are primarily used to test and instrument other VM
> and kernel code, NUMA emulation is restricted to only being used at boot
> to fake nodes on smaller machines and can be used to test things like the
> slab allocator.  The NUMA hotplug emulator that you're developing here is
> primarily used to test the hotplug callbacks; for that use-case, it seems
> particularly helpful if nodes can be hotplugged of various sizes and node
> ids rather than having static characteristics that cannot be changed
> without a reboot.
>

I agree with you. The early emulator did the same thing as you describe, but
NUMA emulation already exists to create fake nodes, and our emulator also
creates fake nodes. We worried that we would face criticism from the
community, so we dropped the original design.

I do not know whether other engineers share your view. I think I can publish
both versions of the code and let the community decide which one is preferred.

Personally, both methods are acceptable to me.

> > Currently, our solution creates static nodes when OS boots, only the node with
> > state N_HIDDEN can be hot-added with node/probe interface, and we can query
> >
>
> The idea that I've proposed (and you've apparently thought about and even
> implemented at one point) is much more powerful than that.  We need not
> query the state of hidden nodes that we've setup at boot but can rather
> use the amount of hidden memory to setup the nodes in any way that we want
> at runtime (various sizes, interleaved node ids, etc).

Yes, if we select your proposal, we just mark all the nodes as POSSIBLE nodes.
There are no hidden nodes any more. A node is created the first time memory
is added to it.

This is the early patch (not very formal, it is just an internal version):

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 454997c..9dc6a02 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -73,6 +73,7 @@
  *
  * node_set_online(node)		set bit 'node' in node_online_map
  * node_set_offline(node)		clear bit 'node' in node_online_map
+ * node_set_possible(node)		set bit 'node' in node_possible_map
  *
  * for_each_node(node)			for-loop node over node_possible_map
  * for_each_online_node(node)		for-loop node over node_online_map
@@ -432,6 +433,11 @@ static inline void node_set_offline(int nid)
 	node_clear_state(nid, N_ONLINE);
 	nr_online_nodes = num_node_state(N_ONLINE);
 }
+
+static inline void node_set_possible(int nid)
+{
+	node_set_state(nid, N_POSSIBLE);
+}
 #else
 
 static inline int node_state(int node, enum node_states state)
@@ -462,6 +468,7 @@ static inline int num_node_state(enum node_states state)
 
 #define node_set_online(node)	   node_set_state((node), N_ONLINE)
 #define node_set_offline(node)	   node_clear_state((node), N_ONLINE)
+#define node_set_possible(node)	   node_set_state((node), N_POSSIBLE)
 #endif
 
 #define node_online_map 	node_states[N_ONLINE]
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..059ebf0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1602,6 +1602,9 @@ config HOTPLUG_CPU
 	    ( Note: power management support will enable this option
 	      automatically on SMP systems. )
 	  Say N if you want to disable CPU hotplug.
 
+config ARCH_CPU_PROBE_RELEASE
+	def_bool y
+	depends on HOTPLUG_CPU
 config COMPAT_VDSO
 	def_bool y
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 550df48..52094bc 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -26,12 +26,11 @@ void __init setup_node_to_cpumask_map(void)
 {
 	unsigned int node, num = 0;
 
-	/* setup nr_node_ids if not done yet */
-	if (nr_node_ids == MAX_NUMNODES) {
-		for_each_node_mask(node, node_possible_map)
-			num = node;
-		nr_node_ids = num + 1;
-	}
+	/* re-setup nr_node_ids, when CONFIG_ARCH_MEMORY_PROBE enabled and mem=XXX
+	   specified, nr_node_ids will be set as the maximum value */
+	for_each_node_mask(node, node_possible_map)
+		num = node;
+	nr_node_ids = num + 1;
 
 	/* allocate the map */
 	for (node = 0; node < nr_node_ids; node++)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index bd02505..3d0e37c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -327,6 +327,8 @@ static int block_size_init(void)
  * will not need to do it from userspace.  The fake hot-add code
  * as well as ppc64 will do all of their discovery in userspace
  * and will require this interface.
+ *
+ * Parameter format: start_addr, nid
  */
 #ifdef CONFIG_ARCH_MEMORY_PROBE
 static ssize_t
@@ -336,10 +338,26 @@ memory_probe_store(struct class *class, const char *buf, size_t count)
 	int nid;
 	int ret;
 
-	phys_addr = simple_strtoull(buf, NULL, 0);
+	char *p = strchr(buf, ',');
+
+	if (p != NULL && strlen(p+1) > 0) {
+		/* nid specified */
+		*p++ = '\0';
+		nid = simple_strtoul(p, NULL, 0);
+		phys_addr = simple_strtoull(buf, NULL, 0);
+	} else {
+		phys_addr = simple_strtoull(buf, NULL, 0);
+		nid = memory_add_physaddr_to_nid(phys_addr);
+	}
 
-	nid = memory_add_physaddr_to_nid(phys_addr);
-	ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+	if (nid < 0 || nid > nr_node_ids - 1) {
+		printk(KERN_ERR "Invalid node id %d(0<=nid<%d).\n", nid, nr_node_ids);
+	} else {
+		printk(KERN_INFO "Add a memory section to node: %d.\n", nid);
+		ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+		if (ret)
+			count = ret;
+	}
 
 	if (ret)
 		count = ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..0d7eeea 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3946,9 +3946,19 @@ static void __init setup_nr_node_ids(void)
 	unsigned int node;
 	unsigned int highest = 0;
 
+	#ifdef CONFIG_ARCH_MEMORY_PROBE
+	/* grub parameter mem=XXX specified */
+	if (1){
+		int cnt;
+		for (cnt = 0; cnt < MAX_NUMNODES; cnt++)
+			node_set_possible(cnt);
+	}
+	#endif
+
 	for_each_node_mask(node, node_possible_map)
 		highest = node;
 	nr_node_ids = highest + 1;
+	printk(KERN_INFO "setup_nr_node_ids: nr_node_ids : %d.\n", nr_node_ids);
 }
 #else
 static inline void setup_nr_node_ids(void)

-- 
Thanks & Regards,
Shaohui
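The "start_addr, nid" format that the draft patch adds to memory/probe can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: struct probe_req and parse_probe() are hypothetical names, max_nodes stands in for nr_node_ids, and strtoull()/strtol() replace the kernel's simple_strtoull()/simple_strtoul(). Unlike the draft, the sketch never writes into the const input buffer, and it rejects a trailing comma with no node id instead of falling back to the automatic node lookup.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct probe_req {
	unsigned long long phys_addr;
	int nid;	/* -1 means "no node given, let the arch pick one" */
};

/*
 * Parse "start_addr" or "start_addr,nid" without modifying the input.
 * max_nodes stands in for nr_node_ids.  Returns 0 or -EINVAL.
 */
static int parse_probe(const char *buf, int max_nodes, struct probe_req *req)
{
	char *end;
	long nid;

	req->phys_addr = strtoull(buf, &end, 0);
	if (end == buf)
		return -EINVAL;		/* no address at all */

	req->nid = -1;
	if (*end == ',') {
		const char *p = end + 1;

		nid = strtol(p, &end, 0);
		if (end == p || nid < 0 || nid >= max_nodes)
			return -EINVAL;	/* missing or out-of-range node id */
		req->nid = (int)nid;
	}

	/* sysfs writes usually end in a newline; allow only that. */
	if (*end != '\0' && *end != '\n')
		return -EINVAL;
	return 0;
}
```

A caller would pass the string written to the probe file, for example parse_probe("0x40000000,3\n", 8, &req), and fall back to an address-based node lookup when req.nid comes back as -1.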
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18  4:14     ` Shaohui Zheng
@ 2010-11-18  6:27       ` Paul Mundt
  2010-11-18  5:27         ` Shaohui Zheng
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Mundt @ 2010-11-18  6:27 UTC
To: Shaohui Zheng
Cc: David Rientjes, Andrew Morton, linux-mm, linux-kernel, haicheng.li,
    ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, Nov 18, 2010 at 12:14:07PM +0800, Shaohui Zheng wrote:
> On Wed, Nov 17, 2010 at 01:10:50PM -0800, David Rientjes wrote:
> > The idea that I've proposed (and you've apparently thought about and even
> > implemented at one point) is much more powerful than that.  We need not
> > query the state of hidden nodes that we've setup at boot but can rather
> > use the amount of hidden memory to setup the nodes in any way that we want
> > at runtime (various sizes, interleaved node ids, etc).
>
> Yes, if we select your proposal, we just mark all the nodes as POSSIBLE nodes.
> There are no hidden nodes any more. A node is created the first time memory
> is added to it.
>
This is roughly what I had in mind in my N_HIDDEN review, so I quite
favour this approach.
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18  6:27       ` Paul Mundt
@ 2010-11-18  5:27         ` Shaohui Zheng
  2010-11-18 21:24           ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Shaohui Zheng @ 2010-11-18  5:27 UTC
To: Paul Mundt
Cc: David Rientjes, Andrew Morton, linux-mm, linux-kernel, haicheng.li,
    ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, Nov 18, 2010 at 03:27:15PM +0900, Paul Mundt wrote:
> On Thu, Nov 18, 2010 at 12:14:07PM +0800, Shaohui Zheng wrote:
> > On Wed, Nov 17, 2010 at 01:10:50PM -0800, David Rientjes wrote:
> > > The idea that I've proposed (and you've apparently thought about and even
> > > implemented at one point) is much more powerful than that.  We need not
> > > query the state of hidden nodes that we've setup at boot but can rather
> > > use the amount of hidden memory to setup the nodes in any way that we want
> > > at runtime (various sizes, interleaved node ids, etc).
> >
> > Yes, if we select your proposal, we just mark all the nodes as POSSIBLE nodes.
> > There are no hidden nodes any more. A node is created the first time memory
> > is added to it.
> >
> This is roughly what I had in mind in my N_HIDDEN review, so I quite
> favour this approach.

Our testing shows that it is a feasible approach, and it works well. However,
there is still a problem we should worry about.

In our draft patch, we re-setup nr_node_ids when CONFIG_ARCH_MEMORY_PROBE is
enabled and mem=XXX is specified in grub. We set nr_node_ids to
MAX_NUMNODES + 1, because we do not know how many nodes will be hot-added
through the memory/probe interface. It might waste a little memory.

-- 
Thanks & Regards,
Shaohui
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18  5:27         ` Shaohui Zheng
@ 2010-11-18 21:24           ` David Rientjes
  2010-11-19  0:32             ` Shaohui Zheng
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2010-11-18 21:24 UTC
To: Shaohui Zheng
Cc: Paul Mundt, Andrew Morton, linux-mm, linux-kernel, haicheng.li, ak,
    shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, 18 Nov 2010, Shaohui Zheng wrote:

> In our draft patch, we re-setup nr_node_ids when CONFIG_ARCH_MEMORY_PROBE is
> enabled and mem=XXX is specified in grub. We set nr_node_ids to
> MAX_NUMNODES + 1, because we do not know how many nodes will be hot-added
> through the memory/probe interface. It might waste a little memory.
>

nr_node_ids need not be set to anything different at boot, the
MEM_GOING_ONLINE callback should be used for anything (like the slab
allocators) where a new node is introduced and needs to be dealt with
accordingly; this is how regular memory hotplug works, we need no
additional code in this regard because it's emulated.  If a subsystem
needs to change in response to a new node going online and doesn't as a
result of using your emulator, that's a bug and either needs to be fixed
or prohibited from use with CONFIG_MEMORY_HOTPLUG.

(See the MEM_GOING_ONLINE callback in mm/slub.c, for instance, which deals
only with the case of node hotplug.)
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18 21:24           ` David Rientjes
@ 2010-11-19  0:32             ` Shaohui Zheng
  2010-11-21  0:48               ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Shaohui Zheng @ 2010-11-19  0:32 UTC
To: David Rientjes
Cc: Paul Mundt, Andrew Morton, linux-mm, linux-kernel, haicheng.li, ak,
    shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, Nov 18, 2010 at 01:24:52PM -0800, David Rientjes wrote:
> On Thu, 18 Nov 2010, Shaohui Zheng wrote:
>
> > In our draft patch, we re-setup nr_node_ids when CONFIG_ARCH_MEMORY_PROBE is
> > enabled and mem=XXX is specified in grub. We set nr_node_ids to
> > MAX_NUMNODES + 1, because we do not know how many nodes will be hot-added
> > through the memory/probe interface. It might waste a little memory.
> >
>
> nr_node_ids need not be set to anything different at boot, the
> MEM_GOING_ONLINE callback should be used for anything (like the slab
> allocators) where a new node is introduced and needs to be dealt with
> accordingly; this is how regular memory hotplug works, we need no
> additional code in this regard because it's emulated.  If a subsystem
> needs to change in response to a new node going online and doesn't as a
> result of using your emulator, that's a bug and either needs to be fixed
> or prohibited from use with CONFIG_MEMORY_HOTPLUG.
>
> (See the MEM_GOING_ONLINE callback in mm/slub.c, for instance, which deals
> only with the case of node hotplug.)

nr_node_ids is the number of possible node ids. When we do a regular memory
online, the memory is onlined to a possible node, which is already counted in
nr_node_ids.

If you increment nr_node_ids dynamically when a node comes online, it causes a
lot of problems. Many data structures are sized according to nr_node_ids. That
is our experience from debugging the emulator.

mm/page_alloc.c:

/*
 * Figure out the number of possible node ids.
 */
static void __init setup_nr_node_ids(void)
{
	unsigned int node;
	unsigned int highest = 0;

	for_each_node_mask(node, node_possible_map)
		highest = node;
	nr_node_ids = highest + 1;
}

There is no conflict between the emulator and CONFIG_MEMORY_HOTPLUG. A real
node can be onlined because it is already set as _possible_; if the emulator
is enabled, all nodes are marked as _possible_, real nodes included.

-- 
Thanks & Regards,
Shaohui
* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-19  0:32             ` Shaohui Zheng
@ 2010-11-21  0:48               ` David Rientjes
  2010-11-21  2:28                 ` [patch 1/2] x86: add numa=possible command line option David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2010-11-21  0:48 UTC
To: Shaohui Zheng
Cc: Paul Mundt, Andrew Morton, linux-mm, linux-kernel, haicheng.li, ak,
    shaohui.zheng, Yinghai Lu, Haicheng Li

On Fri, 19 Nov 2010, Shaohui Zheng wrote:

> nr_node_ids is the number of possible node ids. When we do a regular memory
> online, the memory is onlined to a possible node, which is already counted in
> nr_node_ids.
>
> If you increment nr_node_ids dynamically when a node comes online, it causes a
> lot of problems. Many data structures are sized according to nr_node_ids. That
> is our experience from debugging the emulator.
>

I think what we'll end up wanting to do is something like this, which adds
a numa=possible=<N> parameter for x86; this will add an additional N
possible nodes to node_possible_map that we can use to online later.  It
also adds a new /sys/devices/system/memory/add_node file which takes a
typical "size@start" value to hot-add an emulated node.  For example,
using "mem=2G numa=possible=1" on the command line and doing

	echo 128M@0x80000000 > /sys/devices/system/memory/add_node

would hot-add a node of 128M.

Comments?
---
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -33,6 +33,7 @@ s16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
 int numa_off __initdata;
 static unsigned long __initdata nodemap_addr;
 static unsigned long __initdata nodemap_size;
+static unsigned long __initdata numa_possible_nodes;
 
 /*
  * Map cpu index to node index
@@ -611,7 +612,7 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 
 #ifdef CONFIG_NUMA_EMU
 	if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -619,14 +620,14 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 #ifdef CONFIG_ACPI_NUMA
 	if (!numa_off && acpi && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
 						  last_pfn << PAGE_SHIFT))
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
 
 #ifdef CONFIG_K8_NUMA
 	if (!numa_off && k8 && !k8_scan_nodes())
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -646,6 +647,15 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 		numa_set_node(i, 0);
 	memblock_x86_register_active_regions(0, start_pfn, last_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, last_pfn << PAGE_SHIFT);
+out: __maybe_unused
+	for (i = 0; i < numa_possible_nodes; i++) {
+		int nid;
+
+		nid = first_unset_node(node_possible_map);
+		if (nid == MAX_NUMNODES)
+			break;
+		node_set(nid, node_possible_map);
+	}
 }
 
 unsigned long __init numa_free_all_bootmem(void)
@@ -675,6 +685,8 @@ static __init int numa_setup(char *opt)
 	if (!strncmp(opt, "noacpi", 6))
 		acpi_numa = -1;
 #endif
+	if (!strncmp(opt, "possible=", 9))
+		numa_possible_nodes = simple_strtoul(opt + 9, NULL, 0);
 	return 0;
 }
 early_param("numa", numa_setup);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -353,10 +353,44 @@ memory_probe_store(struct class *class, struct class_attribute *attr,
 }
 static CLASS_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
 
+static ssize_t
+memory_add_node_store(struct class *class, struct class_attribute *attr,
+		      const char *buf, size_t count)
+{
+	nodemask_t mask;
+	u64 start, size;
+	char *p;
+	int nid;
+	int ret;
+
+	size = memparse(buf, &p);
+	if (size < (PAGES_PER_SECTION << PAGE_SHIFT))
+		return -EINVAL;
+	if (*p != '@')
+		return -EINVAL;
+
+	start = simple_strtoull(p + 1, NULL, 0);
+
+	nodes_andnot(mask, node_possible_map, node_online_map);
+	nid = first_node(mask);
+	if (nid == MAX_NUMNODES)
+		return -EINVAL;
+
+	ret = add_memory(nid, start, size);
+	return ret ? ret : count;
+}
+static CLASS_ATTR(add_node, S_IWUSR, NULL, memory_add_node_store);
+
 static int memory_probe_init(void)
 {
-	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+	int err;
+
+	err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
 				&class_attr_probe.attr);
+	if (err)
+		return err;
+	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_add_node.attr);
 }
 #else
 static inline int memory_probe_init(void)
* [patch 1/2] x86: add numa=possible command line option
  2010-11-21  0:48               ` David Rientjes
@ 2010-11-21  2:28                 ` David Rientjes
  2010-11-21  2:28                   ` [patch 2/2] mm: add node hotplug emulation David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2010-11-21  2:28 UTC
To: Ingo Molnar, H. Peter Anvin, Thomas Gleixner
Cc: Greg Kroah-Hartman, Shaohui Zheng, Paul Mundt, Andrew Morton,
    Andi Kleen, Yinghai Lu, Haicheng Li, Randy Dunlap, linux-kernel,
    linux-mm, x86

Adds a numa=possible=<N> command line option to set an additional N nodes
as being possible for memory hotplug.  This set of possible nodes
controls nr_node_ids and the sizes of several dynamically allocated node
arrays.

This allows memory hotplug to create new nodes for newly added memory
rather than binding it to existing nodes.

The first use-case for this will be node hotplug emulation which will use
these possible nodes to create new nodes to test the memory hotplug
callbacks and surrounding memory hotplug code.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/x86/x86_64/boot-options.txt |    4 ++++
 arch/x86/mm/numa_64.c                     |   18 +++++++++++++++---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -174,6 +174,10 @@ NUMA
 		If given as an integer, fills all system RAM with N fake nodes
 		interleaved over physical nodes.
 
+  numa=possible=<N>
+		Sets an additional N nodes as being possible for memory
+		hotplug.
+
 ACPI
 
   acpi=off	Don't enable ACPI
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -33,6 +33,7 @@ s16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
 int numa_off __initdata;
 static unsigned long __initdata nodemap_addr;
 static unsigned long __initdata nodemap_size;
+static unsigned long __initdata numa_possible_nodes;
 
 /*
  * Map cpu index to node index
@@ -611,7 +612,7 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 
 #ifdef CONFIG_NUMA_EMU
 	if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -619,14 +620,14 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 #ifdef CONFIG_ACPI_NUMA
 	if (!numa_off && acpi && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
 						  last_pfn << PAGE_SHIFT))
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
 
 #ifdef CONFIG_K8_NUMA
 	if (!numa_off && k8 && !k8_scan_nodes())
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -646,6 +647,15 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 		numa_set_node(i, 0);
 	memblock_x86_register_active_regions(0, start_pfn, last_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, last_pfn << PAGE_SHIFT);
+out: __maybe_unused
+	for (i = 0; i < numa_possible_nodes; i++) {
+		int nid;
+
+		nid = first_unset_node(node_possible_map);
+		if (nid == MAX_NUMNODES)
+			break;
+		node_set(nid, node_possible_map);
+	}
 }
 
 unsigned long __init numa_free_all_bootmem(void)
@@ -675,6 +685,8 @@ static __init int numa_setup(char *opt)
 	if (!strncmp(opt, "noacpi", 6))
 		acpi_numa = -1;
 #endif
+	if (!strncmp(opt, "possible=", 9))
+		numa_possible_nodes = simple_strtoul(opt + 9, NULL, 0);
 	return 0;
 }
 early_param("numa", numa_setup);
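The effect of numa=possible=<N> on the possible-node mask, and through it on nr_node_ids, can be illustrated with a small user-space sketch. The assumptions here are deliberate simplifications: the node mask is a single 64-bit word rather than a nodemask_t, MAX_NUMNODES is fixed at 64, and first_unset_node(), add_possible_nodes() and nr_node_ids_for() are stand-in functions written for this example (only the first shares its name with the kernel helper the patch uses).

```c
#include <stdint.h>

#define MAX_NUMNODES 64	/* assumption: one 64-bit word is enough here */

/* Lowest node id whose bit is clear, or MAX_NUMNODES if the mask is full. */
static int first_unset_node(uint64_t mask)
{
	int nid;

	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (!(mask & (UINT64_C(1) << nid)))
			return nid;
	return MAX_NUMNODES;
}

/* Mark up to 'extra' additional nodes possible, as numa=possible=<N> does. */
static uint64_t add_possible_nodes(uint64_t possible, unsigned long extra)
{
	unsigned long i;

	for (i = 0; i < extra; i++) {
		int nid = first_unset_node(possible);

		if (nid == MAX_NUMNODES)
			break;
		possible |= UINT64_C(1) << nid;
	}
	return possible;
}

/* Highest possible node id plus one, mirroring setup_nr_node_ids(). */
static int nr_node_ids_for(uint64_t possible)
{
	int nid, highest = 0;

	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (possible & (UINT64_C(1) << nid))
			highest = nid;
	return highest + 1;
}
```

On a machine that boots with only node 0 possible, add_possible_nodes(0x1, 3) marks nodes 1 to 3 as possible, so nr_node_ids_for() then reports 4 and per-node arrays sized from it have room for the nodes that are hot-added later.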
* [patch 2/2] mm: add node hotplug emulation
  2010-11-21  2:28                 ` [patch 1/2] x86: add numa=possible command line option David Rientjes
@ 2010-11-21  2:28                   ` David Rientjes
  2010-11-21 17:34                     ` Greg KH
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2010-11-21  2:28 UTC
To: Andrew Morton, Greg Kroah-Hartman
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Shaohui Zheng,
    Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li, Randy Dunlap,
    linux-kernel, linux-mm, x86

Add an interface to allow new nodes to be added when performing memory
hot-add.  This provides a convenient interface to test memory hotplug
notifier callbacks and surrounding hotplug code when new nodes are
onlined without actually having a machine with such hotpluggable SRAT
entries.

This adds a new interface at /sys/devices/system/memory/add_node that
behaves in a similar way to the memory hot-add "probe" interface.  Its
format is size@start, where "size" is the size of the new node to be
added and "start" is the physical address of the new memory.

The new node id is a currently offline, but possible, node.  The bit must
be set in node_possible_map so that nr_node_ids is sized appropriately.

For emulation on x86, for example, it would be possible to set aside
memory for hotplugged nodes (say, anything above 2G) and to add an
additional three nodes as being possible on boot with

	mem=2G numa=possible=3

and then creating a new 128M node at runtime:

	# echo 128M@0x80000000 > /sys/devices/system/memory/add_node
	On node 1 totalpages: 0
	init_memory_mapping: 0000000080000000-0000000088000000
	 0080000000 - 0088000000 page 2M

Once the new node has been added, its memory can be onlined.  If this
memory represents memory section 16, for example:

	# echo online > /sys/devices/system/memory/memory16/state
	Built 2 zonelists in Node order, mobility grouping on.  Total pages: 514846
	Policy zone: Normal

 [ The memory section(s) mapped to a particular node are visible via
   /sys/devices/system/node/node1, in this example. ]

The new node is now hotplugged and ready for testing.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/memory-hotplug.txt |   24 ++++++++++++++++++++++++
 drivers/base/memory.c            |   36 +++++++++++++++++++++++++++++++++++-
 2 files changed, 59 insertions(+), 1 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -18,6 +18,7 @@ be changed often.
 4. Physical memory hot-add phase
   4.1 Hardware(Firmware) Support
   4.2 Notify memory hot-add event by hand
+  4.3 Node hotplug emulation
 5. Logical Memory hot-add phase
   5.1. State of memory
   5.2. How to online memory
@@ -215,6 +216,29 @@ current implementation). You'll have to online memory by yourself.
 Please see "How to online memory" in this text.
 
 
 
+4.3 Node hotplug emulation
+------------
+It is possible to test node hotplug by assigning the newly added memory to a
+new node id when using a different interface with a similar behavior to
+"probe" described in section 4.2.  If a node id is possible (there are bits
+in /sys/devices/system/node/possible that are not online), then it may be
+used to emulate a newly added node as the result of memory hotplug by using
+the "add_node" interface.
+
+The add_node interface is located at
+/sys/devices/system/memory/add_node
+
+You can create a new node of a specified size starting at the physical
+address of new memory by
+
+% echo size@start_address_of_new_memory > /sys/devices/system/memory/add_node
+
+Where "size" can be represented in megabytes or gigabytes (for example,
+"128M" or "1G").  The minimum size is that of a memory section.
+
+Once the new node has been added, it is possible to online the memory by
+toggling the "state" of its memory section(s) as described in section 5.1.
+
 ------------------------------
 5. Logical Memory hot-add phase
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -353,10 +353,44 @@ memory_probe_store(struct class *class, struct class_attribute *attr,
 }
 static CLASS_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
 
+static ssize_t
+memory_add_node_store(struct class *class, struct class_attribute *attr,
+		      const char *buf, size_t count)
+{
+	nodemask_t mask;
+	u64 start, size;
+	char *p;
+	int nid;
+	int ret;
+
+	size = memparse(buf, &p);
+	if (size < (PAGES_PER_SECTION << PAGE_SHIFT))
+		return -EINVAL;
+	if (*p != '@')
+		return -EINVAL;
+
+	start = simple_strtoull(p + 1, NULL, 0);
+
+	nodes_andnot(mask, node_possible_map, node_online_map);
+	nid = first_node(mask);
+	if (nid == MAX_NUMNODES)
+		return -EINVAL;
+
+	ret = add_memory(nid, start, size);
+	return ret ? ret : count;
+}
+static CLASS_ATTR(add_node, S_IWUSR, NULL, memory_add_node_store);
+
 static int memory_probe_init(void)
 {
-	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+	int err;
+
+	err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
 				&class_attr_probe.attr);
+	if (err)
+		return err;
+	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_add_node.attr);
 }
 #else
 static inline int memory_probe_init(void)
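The size@start format accepted by add_node can be tried out in user space with the sketch below. It is an approximation under stated assumptions rather than the kernel implementation: parse_size() is a hypothetical stand-in for the kernel's memparse() that handles only the K, M and G suffixes, parse_add_node() is a name invented for this example, and min_size plays the role of the memory section size (PAGES_PER_SECTION << PAGE_SHIFT in the patch, taken here to be 128M as in the x86 example).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* User-space stand-in for memparse(): a number plus an optional K/M/G suffix. */
static uint64_t parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;	/* consume the suffix character */
		break;
	default:
		break;
	}
	return v;
}

/*
 * Split "size@start" into its two values.  min_size stands in for the
 * size of one memory section.  Returns 0 or -EINVAL.
 */
static int parse_add_node(const char *buf, uint64_t min_size,
			  uint64_t *size, uint64_t *start)
{
	char *p;

	*size = parse_size(buf, &p);
	if (*size < min_size || *p != '@')
		return -EINVAL;
	*start = strtoull(p + 1, NULL, 0);
	return 0;
}
```

For the string "128M@0x80000000" this yields a size of 0x8000000 and a start of 0x80000000, so the new node covers 0x80000000 to 0x88000000, the same range that init_memory_mapping reports in the changelog above.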
* Re: [patch 2/2] mm: add node hotplug emulation
  2010-11-21  2:28                   ` [patch 2/2] mm: add node hotplug emulation David Rientjes
@ 2010-11-21 17:34                     ` Greg KH
  2010-11-21 21:48                       ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Greg KH @ 2010-11-21 17:34 UTC
To: David Rientjes
Cc: Andrew Morton, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
    Shaohui Zheng, Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li,
    Randy Dunlap, linux-kernel, linux-mm, x86

On Sat, Nov 20, 2010 at 06:28:38PM -0800, David Rientjes wrote:
>
> Add an interface to allow new nodes to be added when performing memory
> hot-add.  This provides a convenient interface to test memory hotplug
> notifier callbacks and surrounding hotplug code when new nodes are
> onlined without actually having a machine with such hotpluggable SRAT
> entries.
>
> This adds a new interface at /sys/devices/system/memory/add_node that
> behaves in a similar way to the memory hot-add "probe" interface.  Its
> format is size@start, where "size" is the size of the new node to be
> added and "start" is the physical address of the new memory.

Ick, we are trying to clean up the system devices right now which would
prevent this type of tree being added.

> The new node id is a currently offline, but possible, node.  The bit must
> be set in node_possible_map so that nr_node_ids is sized appropriately.
>
> For emulation on x86, for example, it would be possible to set aside
> memory for hotplugged nodes (say, anything above 2G) and to add an
> additional three nodes as being possible on boot with
>
> 	mem=2G numa=possible=3
>
> and then creating a new 128M node at runtime:
>
> 	# echo 128M@0x80000000 > /sys/devices/system/memory/add_node
> 	On node 1 totalpages: 0
> 	init_memory_mapping: 0000000080000000-0000000088000000
> 	 0080000000 - 0088000000 page 2M
>
> Once the new node has been added, its memory can be onlined.  If this
> memory represents memory section 16, for example:
>
> 	# echo online > /sys/devices/system/memory/memory16/state
> 	Built 2 zonelists in Node order, mobility grouping on.  Total pages: 514846
> 	Policy zone: Normal
>
>  [ The memory section(s) mapped to a particular node are visible via
>    /sys/devices/system/node/node1, in this example. ]
>
> The new node is now hotplugged and ready for testing.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/memory-hotplug.txt |   24 ++++++++++++++++++++++++
>  drivers/base/memory.c            |   36 +++++++++++++++++++++++++++++++++++-
>  2 files changed, 59 insertions(+), 1 deletions(-)

When adding sysfs files you need to document it in Documentation/ABI
instead.

But as this is a debugging thing, why not just put it in debugfs
instead?

thanks,

greg k-h
* Re: [patch 2/2] mm: add node hotplug emulation
  2010-11-21 17:34 ` Greg KH
@ 2010-11-21 21:48 ` David Rientjes
  0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2010-11-21 21:48 UTC
To: Greg KH
Cc: Andrew Morton, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Shaohui Zheng, Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li,
	Randy Dunlap, linux-kernel, linux-mm, x86

On Sun, 21 Nov 2010, Greg KH wrote:

> But as this is a debugging thing, why not just put it in debugfs
> instead?
>

Ok, I think Paul had a similar suggestion during the discussion of
Shaohui's patchset.  I'll move it, thanks.
end of thread, other threads:[~2010-11-28 2:01 UTC | newest]

Thread overview: 7+ messages
[not found] <A24AE1FFE7AEC5489F83450EE98351BF28723FC4A7@shsmsx502.ccr.corp.intel.com>
2010-11-22  1:47 ` [patch 2/2] mm: add node hotplug emulation Shaohui Zheng
2010-11-24  6:45 ` Shaohui Zheng
2010-11-28  2:01 ` David Rientjes
2010-11-28  2:00 ` David Rientjes
2010-11-17  2:07 [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks shaohui.zheng
2010-11-17  2:08 ` [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation shaohui.zheng
2010-11-17  8:16 ` David Rientjes
2010-11-17  7:51 ` Shaohui Zheng
2010-11-17 21:10 ` David Rientjes
2010-11-18  4:14 ` Shaohui Zheng
2010-11-18  6:27 ` Paul Mundt
2010-11-18  5:27 ` Shaohui Zheng
2010-11-18 21:24 ` David Rientjes
2010-11-19  0:32 ` Shaohui Zheng
2010-11-21  0:48 ` David Rientjes
2010-11-21  2:28 ` [patch 1/2] x86: add numa=possible command line option David Rientjes
2010-11-21  2:28 ` [patch 2/2] mm: add node hotplug emulation David Rientjes
2010-11-21 17:34 ` Greg KH
2010-11-21 21:48 ` David Rientjes