* [Bug fix PATCH 0/2] Make whatever node kernel resides in un-hotpluggable.
@ 2013-02-20 11:00 Tang Chen
  2013-02-20 11:00 ` [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT Tang Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Tang Chen @ 2013-02-20 11:00 UTC
To: akpm, jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, rob, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer
Cc: linux-kernel, linux-mm

As mentioned by HPA before, when we are using movablemem_map=acpi, if all the
memory in SRAT is hotpluggable, then the kernel will have no memory to use, and
will fail to boot.

Before parsing SRAT, memblock has already reserved some memory in
memblock.reserved, which is used by the kernel, for example to store the kernel
image. We are not able to prevent the kernel from using this memory. So, these
2 patches make the node which the kernel resides in un-hotpluggable.

patch1: Do not add the memory reserved by memblock into movablemem_map.map[].
patch2: Do not add any other memory ranges in the same node into
        movablemem_map.map[], so that the node which the kernel resides in
        is un-hotpluggable.

Tang Chen (2):
  acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  acpi, movablemem_map: Make whatever nodes the kernel resides in un-hotpluggable.

 Documentation/kernel-parameters.txt |    6 ++++++
 arch/x86/mm/srat.c                  |   35 ++++++++++++++++++++++++++++++++++-
 include/linux/mm.h                  |    1 +
 3 files changed, 41 insertions(+), 1 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-20 11:00 [Bug fix PATCH 0/2] Make whatever node kernel resides in un-hotpluggable Tang Chen
@ 2013-02-20 11:00 ` Tang Chen
  2013-02-20 12:31   ` Tang Chen
  2013-02-20 11:00 ` [Bug fix PATCH 2/2] acpi, movablemem_map: Make whatever nodes the kernel resides in un-hotpluggable Tang Chen
  2013-02-20 21:36 ` [Bug fix PATCH 0/2] Make whatever node " Andrew Morton
  2 siblings, 1 reply; 19+ messages in thread
From: Tang Chen @ 2013-02-20 11:00 UTC
To: akpm, jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, rob, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer
Cc: linux-kernel, linux-mm

As mentioned by HPA before, when we are using movablemem_map=acpi, if all the
memory ranges in SRAT is hotpluggable, then no memory can be used by kernel.

Before parsing SRAT, memblock has already reserve some memory ranges for other
purposes, such as for kernel image, and so on. We cannot prevent kernel from
using these memory. So we need to exclude these ranges even if these memory is
hotpluggable.

This patch changes the movablemem_map=acpi option's behavior. The memory ranges
reserved by memblock will not be added into movablemem_map.map[]. So even if
all the memory is hotpluggable, there will always be memory that could be used
by the kernel.

Reported-by: H Peter Anvin <hpa@zytor.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/srat.c |   18 +++++++++++++++++-
 1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 62ba97b..b8028b2 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -145,7 +145,7 @@ static inline int save_add_info(void) {return 0;}
 static void __init
 handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
 {
-	int overlap;
+	int overlap, i;
 	unsigned long start_pfn, end_pfn;
 
 	start_pfn = PFN_DOWN(start);
@@ -161,8 +161,24 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
 	 *
 	 * Using movablemem_map, we can prevent memblock from allocating memory
 	 * on ZONE_MOVABLE at boot time.
+	 *
+	 * Before parsing SRAT, memblock has already reserve some memory ranges
+	 * for other purposes, such as for kernel image. We cannot prevent
+	 * kernel from using these memory, so we need to exclude these memory
+	 * even if it is hotpluggable.
 	 */
 	if (hotpluggable && movablemem_map.acpi) {
+		/* Exclude ranges reserved by memblock. */
+		struct memblock_type *rgn = &memblock.reserved;
+
+		for (i = 0; i < rgn->cnt; i++) {
+			if (end <= rgn->regions[i].base ||
+			    start >= rgn->regions[i].base +
+			    rgn->regions[i].size)
+				continue;
+			goto out;
+		}
+
 		insert_movablemem_map(start_pfn, end_pfn);
 
 		/*
-- 
1.7.1
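
The loop added by this patch is a half-open interval overlap test: an SRAT range [start, end) is kept out of movablemem_map.map[] as soon as it touches any reserved region. A minimal userspace sketch of that test follows. The struct reserved_region type and the overlaps_reserved() helper are hypothetical stand-ins for illustration, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of memblock.reserved. */
struct reserved_region {
	uint64_t base;
	uint64_t size;
};

/*
 * Return true if the half-open range [start, end) overlaps any reserved
 * region. This mirrors the condition in the patch: a region is skipped
 * when the range ends at or before the region's base, or starts at or
 * after the region's end.
 */
bool overlaps_reserved(uint64_t start, uint64_t end,
		       const struct reserved_region *rgn, size_t cnt)
{
	assert(start <= end);

	for (size_t i = 0; i < cnt; i++) {
		if (end <= rgn[i].base ||
		    start >= rgn[i].base + rgn[i].size)
			continue;	/* no overlap with this region */
		return true;		/* the patch does "goto out" here */
	}
	return false;
}
```

Because both ranges are half-open, a range that merely touches a reserved region at its boundary does not count as overlapping.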

* Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-20 11:00 ` [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT Tang Chen
@ 2013-02-20 12:31   ` Tang Chen
  2013-02-20 12:35     ` Will Huck
  0 siblings, 1 reply; 19+ messages in thread
From: Tang Chen @ 2013-02-20 12:31 UTC
To: akpm, jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, rob, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer
Cc: linux-kernel, linux-mm

On 02/20/2013 07:00 PM, Tang Chen wrote:
> As mentioned by HPA before, when we are using movablemem_map=acpi, if all the
> memory ranges in SRAT is hotpluggable, then no memory can be used by kernel.
>
> Before parsing SRAT, memblock has already reserve some memory ranges for other
> purposes, such as for kernel image, and so on. We cannot prevent kernel from
> using these memory. So we need to exclude these ranges even if these memory is
> hotpluggable.
>
> This patch changes the movablemem_map=acpi option's behavior. The memory ranges
> reserved by memblock will not be added into movablemem_map.map[]. So even if
> all the memory is hotpluggable, there will always be memory that could be used
> by the kernel.
>
> Reported-by: H Peter Anvin <hpa@zytor.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/mm/srat.c |   18 +++++++++++++++++-
>  1 files changed, 17 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
> index 62ba97b..b8028b2 100644
> --- a/arch/x86/mm/srat.c
> +++ b/arch/x86/mm/srat.c
> @@ -145,7 +145,7 @@ static inline int save_add_info(void) {return 0;}
>  static void __init
>  handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
>  {
> -	int overlap;
> +	int overlap, i;
>  	unsigned long start_pfn, end_pfn;
>
>  	start_pfn = PFN_DOWN(start);
> @@ -161,8 +161,24 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
>  	 *
>  	 * Using movablemem_map, we can prevent memblock from allocating memory
>  	 * on ZONE_MOVABLE at boot time.
> +	 *
> +	 * Before parsing SRAT, memblock has already reserve some memory ranges
> +	 * for other purposes, such as for kernel image. We cannot prevent
> +	 * kernel from using these memory, so we need to exclude these memory
> +	 * even if it is hotpluggable.
>  	 */
>  	if (hotpluggable && movablemem_map.acpi) {
> +		/* Exclude ranges reserved by memblock. */
> +		struct memblock_type *rgn = &memblock.reserved;
> +
> +		for (i = 0; i < rgn->cnt; i++) {
> +			if (end <= rgn->regions[i].base ||
> +			    start >= rgn->regions[i].base +
> +			    rgn->regions[i].size)

Hi all,

Here, I scan the memblock.reserved each time we parse an entry because the
rgn->regions[i].nid is set to MAX_NUMNODES in memblock_reserve(). So I cannot
obtain the nid which the kernel resides in directly from memblock.reserved.

I think there could be some problems if the memory ranges in SRAT are not in
increasing order, since if [3,4) [1,2) are all on node0, and kernel is not
using [3,4), but using [1,2), then I cannot remove [3,4) because I don't know
on which node [3,4) is.

Any idea for this ?

And by the way, I think this approach works well when the memory entries in
SRAT are arranged in increasing order.

Thanks. :)

> +				continue;
> +			goto out;
> +		}
> +
>  		insert_movablemem_map(start_pfn, end_pfn);
>
>  		/*

* Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-20 12:31   ` Tang Chen
@ 2013-02-20 12:35     ` Will Huck
  2013-02-20 22:41       ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Will Huck @ 2013-02-20 12:35 UTC
To: Tang Chen
Cc: akpm, jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, rob, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer, linux-kernel, linux-mm

On 02/20/2013 08:31 PM, Tang Chen wrote:
> On 02/20/2013 07:00 PM, Tang Chen wrote:
>> As mentioned by HPA before, when we are using movablemem_map=acpi, if all the
>> memory ranges in SRAT is hotpluggable, then no memory can be used by kernel.
>>
>> Before parsing SRAT, memblock has already reserve some memory ranges for other
>> purposes, such as for kernel image, and so on. We cannot prevent kernel from
>> using these memory. So we need to exclude these ranges even if these memory is
>> hotpluggable.
>>
>> This patch changes the movablemem_map=acpi option's behavior. The memory ranges
>> reserved by memblock will not be added into movablemem_map.map[]. So even if
>> all the memory is hotpluggable, there will always be memory that could be used
>> by the kernel.
>>

What's the relationship between e820 map and SRAT?

>> Reported-by: H Peter Anvin <hpa@zytor.com>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> ---
>>  arch/x86/mm/srat.c |   18 +++++++++++++++++-
>>  1 files changed, 17 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
>> index 62ba97b..b8028b2 100644
>> --- a/arch/x86/mm/srat.c
>> +++ b/arch/x86/mm/srat.c
>> @@ -145,7 +145,7 @@ static inline int save_add_info(void) {return 0;}
>>  static void __init
>>  handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
>>  {
>> -	int overlap;
>> +	int overlap, i;
>>  	unsigned long start_pfn, end_pfn;
>>
>>  	start_pfn = PFN_DOWN(start);
>> @@ -161,8 +161,24 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
>>  	 *
>>  	 * Using movablemem_map, we can prevent memblock from allocating memory
>>  	 * on ZONE_MOVABLE at boot time.
>> +	 *
>> +	 * Before parsing SRAT, memblock has already reserve some memory ranges
>> +	 * for other purposes, such as for kernel image. We cannot prevent
>> +	 * kernel from using these memory, so we need to exclude these memory
>> +	 * even if it is hotpluggable.
>>  	 */
>>  	if (hotpluggable && movablemem_map.acpi) {
>> +		/* Exclude ranges reserved by memblock. */
>> +		struct memblock_type *rgn = &memblock.reserved;
>> +
>> +		for (i = 0; i < rgn->cnt; i++) {
>> +			if (end <= rgn->regions[i].base ||
>> +			    start >= rgn->regions[i].base +
>> +			    rgn->regions[i].size)
>
> Hi all,
>
> Here, I scan the memblock.reserved each time we parse an entry because the
> rgn->regions[i].nid is set to MAX_NUMNODES in memblock_reserve(). So I cannot
> obtain the nid which the kernel resides in directly from memblock.reserved.
>
> I think there could be some problems if the memory ranges in SRAT are not in
> increasing order, since if [3,4) [1,2) are all on node0, and kernel is not
> using [3,4), but using [1,2), then I cannot remove [3,4) because I don't know
> on which node [3,4) is.
>
> Any idea for this ?
>
> And by the way, I think this approach works well when the memory entries in
> SRAT are arranged in increasing order.
>
> Thanks. :)
>
>> +				continue;
>> +			goto out;
>> +		}
>> +
>>  		insert_movablemem_map(start_pfn, end_pfn);
>>
>>  		/*

* RE: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-20 12:35     ` Will Huck
@ 2013-02-20 22:41       ` Luck, Tony
  2013-02-21 0:05         ` Will Huck
  2013-02-25 1:35         ` Will Huck
  0 siblings, 2 replies; 19+ messages in thread
From: Luck, Tony @ 2013-02-20 22:41 UTC
To: Will Huck, Tang Chen
Cc: akpm@linux-foundation.org, jiang.liu@huawei.com, wujianguo@huawei.com, hpa@zytor.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org, isimatu.yasuaki@jp.fujitsu.com, rob@landley.net, kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com, mgorman@suse.de, rientjes@google.com, guz.fnst@cn.fujitsu.com, rusty@rustcorp.com.au, lliubbo@gmail.com, jaegeuk.hanse@gmail.com, glommer@parallels.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org

> What's the relationship between e820 map and SRAT?

The e820 map (or EFI memory map on some recent systems) provides
a list of memory ranges together with usage information (e.g. reserved
for BIOS, or available) and attributes (WB cacheable, uncacheable).

The SRAT table provides topology information for address ranges. It
tells the OS which memory is close to each cpu, and which is more
distant. If there are multiple degrees of "distant" then the SLIT table
provides a matrix of relative latencies between nodes.

-Tony

* Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-20 22:41       ` Luck, Tony
@ 2013-02-21 0:05         ` Will Huck
  2013-02-21 0:23           ` Luck, Tony
  2013-02-25 1:35         ` Will Huck
  1 sibling, 1 reply; 19+ messages in thread
From: Will Huck @ 2013-02-21 0:05 UTC
To: Luck, Tony
Cc: Tang Chen, akpm@linux-foundation.org, jiang.liu@huawei.com, wujianguo@huawei.com, hpa@zytor.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org, isimatu.yasuaki@jp.fujitsu.com, rob@landley.net, kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com, mgorman@suse.de, rientjes@google.com, guz.fnst@cn.fujitsu.com, rusty@rustcorp.com.au, lliubbo@gmail.com, jaegeuk.hanse@gmail.com, glommer@parallels.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org

Hi Tony,

On 02/21/2013 06:41 AM, Luck, Tony wrote:
>> What's the relationship between e820 map and SRAT?
> The e820 map (or EFI memory map on some recent systems) provides
> a list of memory ranges together with usage information (e.g. reserved
> for BIOS, or available) and attributes (WB cacheable, uncacheable).
>
> The SRAT table provides topology information for address ranges. It
> tells the OS which memory is close to each cpu, and which is more
> distant. If there are multiple degrees of "distant" then the SLIT table
> provides a matrix of relative latencies between nodes.

Thanks for your clarify. What's the relationship between memory ranges
and address ranges here? What's the relationship between memory/address
ranges and /proc/iomem?

>
> -Tony

* RE: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-21 0:05         ` Will Huck
@ 2013-02-21 0:23           ` Luck, Tony
  2013-02-25 7:07             ` Will Huck
  2013-02-25 9:01             ` Will Huck
  0 siblings, 2 replies; 19+ messages in thread
From: Luck, Tony @ 2013-02-21 0:23 UTC
To: Will Huck
Cc: Tang Chen, akpm@linux-foundation.org, jiang.liu@huawei.com, wujianguo@huawei.com, hpa@zytor.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org, isimatu.yasuaki@jp.fujitsu.com, rob@landley.net, kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com, mgorman@suse.de, rientjes@google.com, guz.fnst@cn.fujitsu.com, rusty@rustcorp.com.au, lliubbo@gmail.com, jaegeuk.hanse@gmail.com, glommer@parallels.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org

> Thanks for your clarify. What's the relationship between memory ranges
> and address ranges here?

The ranges in the SRAT table might cover more memory than is present on
the system. E.g. on some large Itanium systems the SRAT table would say
that 0-1TB was on node0, 1-2TB on node1, etc.

The EFI memory map described the memory actually present (perhaps just
a handful of GB on each node).

X86 systems tend not to have such radically sparse layouts, so this may be less
of a distinction.

> What's the relationship between memory/address ranges and /proc/iomem?

I *think* that /proc/iomem just shows what is in e820 (for the memory entries,
it also adds in I/O ranges that come from other ACPI sources).

-Tony
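
The distinction drawn above (SRAT says which node an address range belongs to, the firmware memory map says which memory is actually present) can be illustrated by intersecting the two. The sketch below is a userspace illustration only. The types and the present_bytes_on_node() helper are hypothetical, and it assumes that SRAT entries do not overlap each other and that present ranges do not overlap each other.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open physical address range [start, end). */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical simplified SRAT memory entry: a range and its node. */
struct srat_entry {
	int node;
	struct addr_range range;
};

/* Number of bytes shared by two half-open ranges. */
uint64_t range_intersection(struct addr_range a, struct addr_range b)
{
	uint64_t lo = a.start > b.start ? a.start : b.start;
	uint64_t hi = a.end < b.end ? a.end : b.end;

	return hi > lo ? hi - lo : 0;
}

/*
 * Sum the present memory (as a memory map would report it) that falls
 * inside the SRAT ranges assigned to 'node'.
 */
uint64_t present_bytes_on_node(int node,
			       const struct srat_entry *srat, size_t nr_srat,
			       const struct addr_range *present,
			       size_t nr_present)
{
	uint64_t total = 0;

	for (size_t i = 0; i < nr_srat; i++) {
		if (srat[i].node != node)
			continue;
		for (size_t j = 0; j < nr_present; j++)
			total += range_intersection(srat[i].range, present[j]);
	}
	return total;
}
```

With the sparse layout from the example (0-1TB on node0, 1-2TB on node1) and only a few GB present at the start of each range, the per-node totals come out as those few GB, not 1TB.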

* Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-21 0:23           ` Luck, Tony
@ 2013-02-25 7:07             ` Will Huck
  2013-02-25 9:01             ` Will Huck
  1 sibling, 0 replies; 19+ messages in thread
From: Will Huck @ 2013-02-25 7:07 UTC
To: Luck, Tony
Cc: Tang Chen, akpm@linux-foundation.org, jiang.liu@huawei.com, wujianguo@huawei.com, hpa@zytor.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org, isimatu.yasuaki@jp.fujitsu.com, rob@landley.net, kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com, mgorman@suse.de, rientjes@google.com, guz.fnst@cn.fujitsu.com, rusty@rustcorp.com.au, lliubbo@gmail.com, jaegeuk.hanse@gmail.com, glommer@parallels.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On 02/21/2013 08:23 AM, Luck, Tony wrote:
>> Thanks for your clarify. What's the relationship between memory ranges
>> and address ranges here?
> The ranges in the SRAT table might cover more memory than is present on
> the system. E.g. on some large Itanium systems the SRAT table would say
> that 0-1TB was on node0, 1-2TB on node1, etc.
>
> The EFI memory map described the memory actually present (perhaps just
> a handful of GB on each node).
>
> X86 systems tend not to have such radically sparse layouts, so this may be less
> of a distinction.
>
>> What's the relationship between memory/address ranges and /proc/iomem?
> I *think* that /proc/iomem just shows what is in e820 (for the memory entries,
> it also adds in I/O ranges that come from other ACPI sources).

The function detect_memory uses int 0x15 to get the e820 memory map
information, but why is the address range not contiguous, and why is it
separated into several ranges?

>
> -Tony

* Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-21 0:23           ` Luck, Tony
  2013-02-25 7:07             ` Will Huck
@ 2013-02-25 9:01             ` Will Huck
  1 sibling, 0 replies; 19+ messages in thread
From: Will Huck @ 2013-02-25 9:01 UTC
To: Luck, Tony
Cc: Tang Chen, akpm@linux-foundation.org, jiang.liu@huawei.com, wujianguo@huawei.com, hpa@zytor.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org, isimatu.yasuaki@jp.fujitsu.com, rob@landley.net, kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com, mgorman@suse.de, rientjes@google.com, guz.fnst@cn.fujitsu.com, rusty@rustcorp.com.au, lliubbo@gmail.com, jaegeuk.hanse@gmail.com, glommer@parallels.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On 02/21/2013 08:23 AM, Luck, Tony wrote:
>> Thanks for your clarify. What's the relationship between memory ranges
>> and address ranges here?
> The ranges in the SRAT table might cover more memory than is present on
> the system. E.g. on some large Itanium systems the SRAT table would say
> that 0-1TB was on node0, 1-2TB on node1, etc.
>
> The EFI memory map described the memory actually present (perhaps just
> a handful of GB on each node).
>
> X86 systems tend not to have such radically sparse layouts, so this may be less
> of a distinction.
>
>> What's the relationship between memory/address ranges and /proc/iomem?
> I *think* that /proc/iomem just shows what is in e820 (for the memory entries,
> it also adds in I/O ranges that come from other ACPI sources).

When I set up a new e820 range through memmap, the system hung during boot.
Is there a limit when setting up memmap?

>
> -Tony

* Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-20 22:41       ` Luck, Tony
  2013-02-21 0:05         ` Will Huck
@ 2013-02-25 1:35         ` Will Huck
  2013-02-25 3:32           ` Tang Chen
  2013-02-25 19:06           ` Luck, Tony
  1 sibling, 2 replies; 19+ messages in thread
From: Will Huck @ 2013-02-25 1:35 UTC
To: Luck, Tony
Cc: Tang Chen, akpm@linux-foundation.org, jiang.liu@huawei.com, wujianguo@huawei.com, hpa@zytor.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org, isimatu.yasuaki@jp.fujitsu.com, rob@landley.net, kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com, mgorman@suse.de, rientjes@google.com, guz.fnst@cn.fujitsu.com, rusty@rustcorp.com.au, lliubbo@gmail.com, jaegeuk.hanse@gmail.com, glommer@parallels.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On 02/21/2013 06:41 AM, Luck, Tony wrote:
>> What's the relationship between e820 map and SRAT?
> The e820 map (or EFI memory map on some recent systems) provides
> a list of memory ranges together with usage information (e.g. reserved
> for BIOS, or available) and attributes (WB cacheable, uncacheable).
>
> The SRAT table provides topology information for address ranges. It
> tells the OS which memory is close to each cpu, and which is more
> distant. If there are multiple degrees of "distant" then the SLIT table
> provides a matrix of relative latencies between nodes.

What's the meaning of multiple degrees of "distant" here? Eg, there are
ten nodes, can SRAT tell each node which memory on other node is more
close or distant? If the answer is yes, why need SLIT since processes
can use memory close to their nodes.

SRAT and SLIT are get from firmware or UEFI?

>
> -Tony

* Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-25 1:35         ` Will Huck
@ 2013-02-25 3:32           ` Tang Chen
  2013-02-25 19:06           ` Luck, Tony
  1 sibling, 0 replies; 19+ messages in thread
From: Tang Chen @ 2013-02-25 3:32 UTC
To: Will Huck
Cc: Luck, Tony, akpm@linux-foundation.org, jiang.liu@huawei.com, wujianguo@huawei.com, hpa@zytor.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org, isimatu.yasuaki@jp.fujitsu.com, rob@landley.net, kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com, mgorman@suse.de, rientjes@google.com, guz.fnst@cn.fujitsu.com, rusty@rustcorp.com.au, lliubbo@gmail.com, jaegeuk.hanse@gmail.com, glommer@parallels.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On 02/25/2013 09:35 AM, Will Huck wrote:
> On 02/21/2013 06:41 AM, Luck, Tony wrote:
>>> What's the relationship between e820 map and SRAT?
>> The e820 map (or EFI memory map on some recent systems) provides
>> a list of memory ranges together with usage information (e.g. reserved
>> for BIOS, or available) and attributes (WB cacheable, uncacheable).
>>
>> The SRAT table provides topology information for address ranges. It
>> tells the OS which memory is close to each cpu, and which is more
>> distant. If there are multiple degrees of "distant" then the SLIT table
>> provides a matrix of relative latencies between nodes.
>
> What's the meaning of multiple degrees of "distant" here? Eg, there are
> ten nodes, can SRAT tell each node which memory on other node is more
> close or distant? If the answer is yes, why need SLIT since processes
> can use memory close to their nodes.

Hi Will

Referring to the ACPI spec, SRAT provides info of each node, and SLIT
provides info between nodes and nodes, I think.

SRAT provides number of CPUs and memory of node i, memory range, the PXM id
which will be mapped to node id, and hotplug info, and so on.

SLIT provides a matrix describing the distances between node i and node j.

>
>
> SRAT and SLIT are get from firmware or UEFI?
>

I think we can get this info from ACPI BIOS.

Thanks. :)

* RE: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
  2013-02-25 1:35         ` Will Huck
  2013-02-25 3:32           ` Tang Chen
@ 2013-02-25 19:06           ` Luck, Tony
  1 sibling, 0 replies; 19+ messages in thread
From: Luck, Tony @ 2013-02-25 19:06 UTC
To: Will Huck
Cc: Tang Chen, akpm@linux-foundation.org, jiang.liu@huawei.com, wujianguo@huawei.com, hpa@zytor.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org, isimatu.yasuaki@jp.fujitsu.com, rob@landley.net, kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com, mgorman@suse.de, rientjes@google.com, guz.fnst@cn.fujitsu.com, rusty@rustcorp.com.au, lliubbo@gmail.com, jaegeuk.hanse@gmail.com, glommer@parallels.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org

> What's the meaning of multiple degrees of "distant" here? Eg, there are
> ten nodes, can SRAT tell each node which memory on other node is more
> close or distant? If the answer is yes, why need SLIT since processes
> can use memory close to their nodes.

Small systems can have point to point links between every pair of nodes. E.g.
a four node system where each node supports 3 links looks like a square
with both diagonals drawn in. The SLIT matrix for such a machine might
look like this:

	10 20 20 20
	20 10 20 20
	20 20 10 20
	20 20 20 10

Now imagine building an eight node system from these same processors. We
still only have three links available on each node. So we arrange them like
the corners on a cube (with no diagonal lines at all). Now the latency from
one node to another may just be one hop along a side, or perhaps two hops.
Worst case is getting from any corner to the diagonally opposite one which
will take three hops. So the SLIT might look like (where 10 is no hops,
20 = 1 hop, 30 = 2 hops and 40 = 3 hops):

	10 20 30 20 30 20 30 40
	20 10 20 30 20 30 40 20
	30 20 10 20 30 40 30 20
	20 30 20 10 40 30 20 30
	30 20 30 40 10 20 30 20
	20 30 40 20 20 10 30 20
	30 40 30 20 30 30 10 20
	40 30 20 30 20 30 20 10

> SRAT and SLIT are get from firmware or UEFI?

SRAT and SLIT are part of ACPI - so constructed by firmware. See http://acpi.info

-Tony
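
The hop-count encoding described above (10 for no hops, plus 10 per hop) can be generated mechanically for the cube topology. The sketch below assumes a hypothetical node numbering in which each node's 3-bit id gives its corner coordinates, so the hop count between two nodes is the number of differing bits. That numbering is an assumption made for illustration; it orders the rows differently from the matrix in the message above.

```c
#include <stdint.h>

#define CUBE_NODES 8

/* Count the set bits among the low three bits of v. */
static unsigned int popcount3(unsigned int v)
{
	return (v & 1) + ((v >> 1) & 1) + ((v >> 2) & 1);
}

/*
 * Hops between two cube corners, assuming node ids 0..7 encode the
 * corner coordinates bit by bit (an assumption of this sketch).
 */
unsigned int cube_hops(unsigned int a, unsigned int b)
{
	return popcount3((a ^ b) & 7);
}

/* Fill a SLIT-style matrix: 10 for local access, plus 10 per hop. */
void build_cube_slit(uint8_t slit[CUBE_NODES][CUBE_NODES])
{
	for (unsigned int i = 0; i < CUBE_NODES; i++)
		for (unsigned int j = 0; j < CUBE_NODES; j++)
			slit[i][j] = (uint8_t)(10 + 10 * cube_hops(i, j));
}
```

Every row produced this way contains one 10, three 20s, three 30s and one 40, which matches the description of three direct links per node and a single diagonally opposite corner.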
* [Bug fix PATCH 2/2] acpi, movablemem_map: Make whatever nodes the kernel resides in un-hotpluggable. 2013-02-20 11:00 [Bug fix PATCH 0/2] Make whatever node kernel resides in un-hotpluggable Tang Chen 2013-02-20 11:00 ` [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT Tang Chen @ 2013-02-20 11:00 ` Tang Chen 2013-02-23 19:26 ` Rob Landley 2013-02-20 21:36 ` [Bug fix PATCH 0/2] Make whatever node " Andrew Morton 2 siblings, 1 reply; 19+ messages in thread From: Tang Chen @ 2013-02-20 11:00 UTC (permalink / raw) To: akpm, jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, rob, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer Cc: linux-kernel, linux-mm There could be several memory ranges in the node in which the kernel resides. When using movablemem_map=acpi, we may skip one range that have memory reserved by memblock. But if it is too small, then the kernel will fail to boot. So, make the whole node which the kernel resides in un-hotpluggable. Then the kernel has enough memory to use. Reported-by: H Peter Anvin <hpa@zytor.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> --- Documentation/kernel-parameters.txt | 6 ++++++ arch/x86/mm/srat.c | 17 +++++++++++++++++ include/linux/mm.h | 1 + 3 files changed, 24 insertions(+), 0 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 0b94b98..b9a3f9f 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1652,6 +1652,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. in flags from SRAT from ACPI BIOS to determine which memory devices could be hotplugged. The corresponding memory ranges will be set as ZONE_MOVABLE. + NOTE: Whatever node the kernel resides in will always + be un-hotpluggable. 
movablemem_map=nn[KMG]@ss[KMG] [KNL,X86,IA-64,PPC] This parameter is similar to @@ -1673,6 +1675,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. satisfied. So the administrator should be careful that the amount of movablemem_map areas are not too large. Otherwise kernel won't have enough memory to start. + NOTE: We don't stop users specifying the node the + kernel resides in as hotpluggable so that this + option can be used as a workaround of firmware + bugs. MTD_Partition= [MTD] Format: <name>,<region-number>,<size>,<offset> diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c index b8028b2..79836d0 100644 --- a/arch/x86/mm/srat.c +++ b/arch/x86/mm/srat.c @@ -166,6 +166,9 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable) * for other purposes, such as for kernel image. We cannot prevent * kernel from using these memory, so we need to exclude these memory * even if it is hotpluggable. + * Furthermore, to ensure the kernel has enough memory to boot, we make + * all the memory on the node which the kernel resides in + * un-hotpluggable. */ if (hotpluggable && movablemem_map.acpi) { /* Exclude ranges reserved by memblock. */ @@ -176,9 +179,23 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable) start >= rgn->regions[i].base + rgn->regions[i].size) continue; + + /* + * If the memory range overlaps the memory reserved by + * memblock, then the kernel resides in this node. + */ + node_set(node, movablemem_map.numa_nodes_kernel); + goto out; } + /* + * If the kernel resides in this node, then the whole node + * should not be hotpluggable. 
+ */ + if (node_isset(node, movablemem_map.numa_nodes_kernel)) + goto out; + insert_movablemem_map(start_pfn, end_pfn); /* diff --git a/include/linux/mm.h b/include/linux/mm.h index 107c288..00d2d85 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1345,6 +1345,7 @@ struct movablemem_map { int nr_map; struct movablemem_entry map[MOVABLEMEM_MAP_MAX]; nodemask_t numa_nodes_hotplug; /* on which nodes we specify memory */ + nodemask_t numa_nodes_kernel; /* on which nodes kernel resides in */ }; extern void __init insert_movablemem_map(unsigned long start_pfn, -- 1.7.1
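For readers following the srat.c hunk, the decision it adds can be modelled outside the kernel. What follows is a minimal user-space sketch under stated assumptions, not the kernel code: `struct range`, `struct srat_entry`, `select_movable()` and `MAX_NODES` are hypothetical names, a plain `bool` array stands in for the `numa_nodes_kernel` nodemask, and a caller-supplied array stands in for `memblock.reserved`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64	/* hypothetical limit for this sketch */

/* Hypothetical stand-ins for memblock.reserved regions and SRAT entries. */
struct range { uint64_t start, end; };	/* half-open: [start, end) */
struct srat_entry { int node; struct range r; bool hotpluggable; };

static bool overlaps(struct range a, struct range b)
{
	return a.start < b.end && b.start < a.end;
}

/*
 * Copy into out[] the SRAT ranges that may become ZONE_MOVABLE.
 * A range is skipped when it is not hotpluggable, when it overlaps a
 * reserved range (its node is then marked as a kernel node), or when
 * its node has already been marked as a kernel node.
 * out[] must have room for nr_srat entries. Returns the number written.
 */
size_t select_movable(const struct srat_entry *srat, size_t nr_srat,
		      const struct range *reserved, size_t nr_reserved,
		      struct range *out, bool kernel_node[MAX_NODES])
{
	size_t n = 0;

	for (size_t i = 0; i < nr_srat; i++) {
		const struct srat_entry *e = &srat[i];
		bool hit = false;

		if (!e->hotpluggable || e->node < 0 || e->node >= MAX_NODES)
			continue;

		for (size_t j = 0; j < nr_reserved; j++) {
			if (overlaps(e->r, reserved[j])) {
				hit = true;
				break;
			}
		}
		if (hit) {
			/* Reserved memory lives here: the kernel is on this node. */
			kernel_node[e->node] = true;
			continue;
		}
		if (kernel_node[e->node])
			continue;	/* rest of a kernel node stays non-movable */

		out[n++] = e->r;
	}
	return n;
}

#ifdef SKETCH_MAIN
int main(void)
{
	struct srat_entry srat[] = { { 0, { 0x0, 0x1000 }, true } };
	struct range reserved[] = { { 0x100, 0x200 } };
	struct range out[1];
	bool kernel_node[MAX_NODES] = { false };

	/* The only range holds reserved memory, so nothing is movable. */
	assert(select_movable(srat, 1, reserved, 1, out, kernel_node) == 0);
	return 0;
}
#endif
```

The sketch makes its decision in a single pass, in table order, so a range that appears before the first reserved overlap on its node is still selected. Whether the kernel handles that ordering elsewhere is not visible in the hunk.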
* Re: [Bug fix PATCH 2/2] acpi, movablemem_map: Make whatever nodes the kernel resides in un-hotpluggable.
From: Rob Landley @ 2013-02-23 19:26 UTC
To: Tang Chen
Cc: akpm, jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer, linux-kernel, linux-mm

On 02/20/2013 05:00:56 AM, Tang Chen wrote:
> There could be several memory ranges in the node in which the kernel resides.
> When using movablemem_map=acpi, we may skip one range that have memory reserved
> by memblock. But if it is too small, then the kernel will fail to boot. So, make
> the whole node which the kernel resides in un-hotpluggable. Then the kernel has
> enough memory to use.
>
> Reported-by: H Peter Anvin <hpa@zytor.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>

Docs part Acked-by: Rob Landley <rob@landley.net> (with minor non-blocking snark).

> @@ -1673,6 +1675,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			satisfied. So the administrator should be careful that
>  			the amount of movablemem_map areas are not too large.
>  			Otherwise kernel won't have enough memory to start.
> +			NOTE: We don't stop users specifying the node the
> +			kernel resides in as hotpluggable so that this
> +			option can be used as a workaround of firmware
> +			bugs.

I usually see workaround "for", not "of". And your whitespace is inconsistent on that last line.

And I'm now kind of curious what such a workaround would accomplish, but I suspect it's obvious to people who wind up needing it.

>  	MTD_Partition=	[MTD]
>  			Format: <name>,<region-number>,<size>,<offset>
> diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
> index b8028b2..79836d0 100644
> --- a/arch/x86/mm/srat.c
> +++ b/arch/x86/mm/srat.c
> @@ -166,6 +166,9 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
>  	 * for other purposes, such as for kernel image. We cannot prevent
>  	 * kernel from using these memory, so we need to exclude these memory
>  	 * even if it is hotpluggable.
> +	 * Furthermore, to ensure the kernel has enough memory to boot, we make
> +	 * all the memory on the node which the kernel resides in
> +	 * un-hotpluggable.
>  	 */

Can you hot-unplug half a node? (Do you have a choice with the granularity here?)

Rob
* Re: [Bug fix PATCH 2/2] acpi, movablemem_map: Make whatever nodes the kernel resides in un-hotpluggable.
From: Tang Chen @ 2013-02-25 2:54 UTC
To: Rob Landley
Cc: akpm, jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer, linux-kernel, linux-mm

On 02/24/2013 03:26 AM, Rob Landley wrote:
> On 02/20/2013 05:00:56 AM, Tang Chen wrote:
>> There could be several memory ranges in the node in which the kernel resides.
>> When using movablemem_map=acpi, we may skip one range that have memory reserved
>> by memblock. But if it is too small, then the kernel will fail to boot. So, make
>> the whole node which the kernel resides in un-hotpluggable. Then the kernel has
>> enough memory to use.
>>
>> Reported-by: H Peter Anvin <hpa@zytor.com>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>
> Docs part Acked-by: Rob Landley <rob@landley.net> (with minor
> non-blocking snark).

Hi Rob,

Thanks for the ack. :)

>
>> @@ -1673,6 +1675,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>  			satisfied. So the administrator should be careful that
>>  			the amount of movablemem_map areas are not too large.
>>  			Otherwise kernel won't have enough memory to start.
>> +			NOTE: We don't stop users specifying the node the
>> +			kernel resides in as hotpluggable so that this
>> +			option can be used as a workaround of firmware
>> +			bugs.
>
> I usually see workaround "for", not "of". And your whitespace is
> inconsistent on that last line.
>
> And I'm now kind of curious what such a workaround would accomplish, but
> I'm suspect it's obvious to people who wind up needing it.

SFAIK, this is more useful when debugging.

>
>>  	MTD_Partition=	[MTD]
>>  			Format: <name>,<region-number>,<size>,<offset>
>> diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
>> index b8028b2..79836d0 100644
>> --- a/arch/x86/mm/srat.c
>> +++ b/arch/x86/mm/srat.c
>> @@ -166,6 +166,9 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
>>  	 * for other purposes, such as for kernel image. We cannot prevent
>>  	 * kernel from using these memory, so we need to exclude these memory
>>  	 * even if it is hotpluggable.
>> +	 * Furthermore, to ensure the kernel has enough memory to boot, we make
>> +	 * all the memory on the node which the kernel resides in
>> +	 * un-hotpluggable.
>>  	 */
>
> Can you hot-unplug half a node? (Do you have a choice with the
> granularity here?)

No, we cannot hot-plug or hot-unplug half a node. But we can offline some of the memory on a node without offlining all of it. :)

Here, hotplug means that you will eventually physically remove the hardware device from the system while the system is running. So there is no such thing as hotplugging half a node, I think. :)

Thanks. :)
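To make the distinction between offlining and hot-removing concrete: on kernels built with memory hotplug support, individual memory blocks are exposed under `/sys/devices/system/memory/memoryN/`, and writing `offline` or `online` to a block's `state` file requests the change. The following is a minimal sketch assuming that layout. The helper names `memory_block_state_path()` and `set_memory_block_state()` are hypothetical, and the sysfs root is passed in as a parameter so the code can be tried against a scratch directory instead of a live system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<root>/memory<block>/state" in buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
int memory_block_state_path(char *buf, size_t len, const char *root,
			    unsigned int block)
{
	int n = snprintf(buf, len, "%s/memory%u/state", root, block);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Request "online" or "offline" for one memory block.
 * Returns 0 if the write was accepted, -1 otherwise (bad argument,
 * missing file, no permission, or the kernel refusing the request).
 */
int set_memory_block_state(const char *root, unsigned int block,
			   const char *state)
{
	char path[256];
	FILE *f;
	int ok;

	if (strcmp(state, "online") != 0 && strcmp(state, "offline") != 0)
		return -1;
	if (memory_block_state_path(path, sizeof(path), root, block) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(state, f) >= 0;
	/* stdio buffers the write, so a refusal may only show up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

#ifdef SKETCH_MAIN
int main(void)
{
	char path[128];

	assert(memory_block_state_path(path, sizeof(path),
				       "/sys/devices/system/memory", 32) == 0);
	puts(path);
	return 0;
}
#endif
```

Offlining works one memory block at a time, which is why part of a node can be offlined. Physically removing the hardware is a separate step that only makes sense for the whole device.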
* Re: [Bug fix PATCH 0/2] Make whatever node kernel resides in un-hotpluggable.
From: Andrew Morton @ 2013-02-20 21:36 UTC
To: Tang Chen
Cc: jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, rob, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer, linux-kernel, linux-mm

On Wed, 20 Feb 2013 19:00:54 +0800 Tang Chen <tangchen@cn.fujitsu.com> wrote:

> As mentioned by HPA before, when we are using movablemem_map=acpi, if all the
> memory in SRAT is hotpluggable, then the kernel will have no memory to use, and
> will fail to boot.
>
> Before parsing SRAT, memblock has already reserved some memory in memblock.reserve,
> which is used by the kernel, such as storing the kernel image. We are not able to
> prevent the kernel from using these memory. So, these 2 patches make the node which
> the kernel resides in un-hotpluggable.

I'm planning to roll all these into a single commit:

acpi-memory-hotplug-support-getting-hotplug-info-from-srat.patch
acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix.patch
acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix.patch
acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix-fix.patch
acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix-fix-fix.patch
acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix-fix-fix-fix.patch

for reasons of tree-cleanliness and to avoid bisection holes. They're
at http://ozlabs.org/~akpm/mmots/broken-out/.

Can you please check the changelog for
acpi-memory-hotplug-support-getting-hotplug-info-from-srat.patch to see
if it needs any updates due to all the fixup patches? If so, please
send me the new changelog, thanks.

Also, please review the changelogging for these:

page_alloc-add-movable_memmap-kernel-parameter.patch
page_alloc-add-movable_memmap-kernel-parameter-fix.patch
page_alloc-add-movable_memmap-kernel-parameter-fix-fix.patch
page_alloc-add-movable_memmap-kernel-parameter-fix-fix-checkpatch-fixes.patch
page_alloc-add-movable_memmap-kernel-parameter-fix-fix-fix.patch
page_alloc-add-movable_memmap-kernel-parameter-rename-movablecore_map-to-movablemem_map.patch

memory-hotplug-remove-sys-firmware-memmap-x-sysfs.patch
memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix.patch
memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix.patch
memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix.patch
memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix-fix.patch
memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix-fix-fix.patch

memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap.patch
memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix.patch
memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix.patch
memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix-fix.patch
memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix-fix-fix.patch

memory-hotplug-common-apis-to-support-page-tables-hot-remove.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix-fix.patch

acpi-memory-hotplug-parse-srat-before-memblock-is-ready.patch
acpi-memory-hotplug-parse-srat-before-memblock-is-ready-fix.patch
acpi-memory-hotplug-parse-srat-before-memblock-is-ready-fix-fix.patch

and while we're there, let's pause to admire how prescient I was in
refusing to merge all this into 3.8-rc1 :)
* Re: [Bug fix PATCH 0/2] Make whatever node kernel resides in un-hotpluggable.
From: Tang Chen @ 2013-02-21 3:03 UTC
To: Andrew Morton
Cc: jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, rob, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer, linux-kernel, linux-mm

On 02/21/2013 05:36 AM, Andrew Morton wrote:
> On Wed, 20 Feb 2013 19:00:54 +0800
> Tang Chen<tangchen@cn.fujitsu.com> wrote:
>
>> As mentioned by HPA before, when we are using movablemem_map=acpi, if all the
>> memory in SRAT is hotpluggable, then the kernel will have no memory to use, and
>> will fail to boot.
>>
>> Before parsing SRAT, memblock has already reserved some memory in memblock.reserve,
>> which is used by the kernel, such as storing the kernel image. We are not able to
>> prevent the kernel from using these memory. So, these 2 patches make the node which
>> the kernel resides in un-hotpluggable.
>
> I'm planning to roll all these into a single commit:
>
> acpi-memory-hotplug-support-getting-hotplug-info-from-srat.patch
> acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix.patch
> acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix.patch
> acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix-fix.patch
> acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix-fix-fix.patch
> acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix-fix-fix-fix.patch
>
> for reasons of tree-cleanliness and to avoid bisection holes. They're
> at http://ozlabs.org/~akpm/mmots/broken-out/.
>
> Can you please check the changelog for
> acpi-memory-hotplug-support-getting-hotplug-info-from-srat.patch to see
> if it needs any updates due to all the fixup patches? If so, please
> send me the new changelog, thanks.

Hi Andrew,

Please use the following changelog for
acpi-memory-hotplug-support-getting-hotplug-info-from-srat.patch

**********
We now provide an option for users who don't want to specify physical
memory addresses on the kernel command line.

/*
 * For movablemem_map=acpi:
 *
 * SRAT:                |_____| |_____| |_________| |_________| ......
 * node id:                0       1         1           2
 * hotpluggable:           n       y         y           n
 * movablemem_map:              |_____| |_________|
 *
 * Using movablemem_map, we can prevent memblock from allocating memory
 * on ZONE_MOVABLE at boot time.
 */

So the user just specifies movablemem_map=acpi, and the kernel will use the
hotpluggable info in SRAT to determine which memory ranges should be set as
ZONE_MOVABLE.

If all the memory ranges in SRAT are hotpluggable, then no memory can be
used by the kernel. But before parsing SRAT, memblock has already reserved
some memory ranges for other purposes, such as for the kernel image. We
cannot prevent the kernel from using this memory, so we need to exclude
these ranges even if the memory is hotpluggable.

Furthermore, there could be several memory ranges in the single node which
the kernel resides in. We may skip one range that has memory reserved by
memblock, but if the rest of the memory is too small, then the kernel will
fail to boot. So, make the whole node which the kernel resides in
un-hotpluggable. Then the kernel has enough memory to use.

NOTE: Using this option will degrade NUMA performance, because the whole
node will be set as ZONE_MOVABLE and the kernel cannot use memory on it.
Users who don't want to lose NUMA performance should not use it.
**********

>
> Also, please review the changelogging for these:

The following xxx-fix-... patches will also be rolled up, right?
I'll post the changelogs later.

Thanks. :)

>
> page_alloc-add-movable_memmap-kernel-parameter.patch
> page_alloc-add-movable_memmap-kernel-parameter-fix.patch
> page_alloc-add-movable_memmap-kernel-parameter-fix-fix.patch
> page_alloc-add-movable_memmap-kernel-parameter-fix-fix-checkpatch-fixes.patch
> page_alloc-add-movable_memmap-kernel-parameter-fix-fix-fix.patch
> page_alloc-add-movable_memmap-kernel-parameter-rename-movablecore_map-to-movablemem_map.patch
>
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix-fix.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix-fix-fix.patch
>
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap.patch
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix.patch
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix.patch
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix-fix.patch
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix-fix-fix.patch
>
> memory-hotplug-common-apis-to-support-page-tables-hot-remove.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix-fix.patch
>
> acpi-memory-hotplug-parse-srat-before-memblock-is-ready.patch
> acpi-memory-hotplug-parse-srat-before-memblock-is-ready-fix.patch
> acpi-memory-hotplug-parse-srat-before-memblock-is-ready-fix-fix.patch
>
>
> and while we're there, let's pause to admire how prescient I was in
> refusing to merge all this into 3.8-rc1 :)
>
* Re: [Bug fix PATCH 0/2] Make whatever node kernel resides in un-hotpluggable.
From: Tang Chen @ 2013-02-21 7:03 UTC
To: Andrew Morton
Cc: jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, rob, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer, linux-kernel, linux-mm

Hi Andrew,

Please see below. :)

On 02/21/2013 05:36 AM, Andrew Morton wrote:
>
> Also, please review the changelogging for these:
>
> page_alloc-add-movable_memmap-kernel-parameter.patch
> page_alloc-add-movable_memmap-kernel-parameter-fix.patch
> page_alloc-add-movable_memmap-kernel-parameter-fix-fix.patch
> page_alloc-add-movable_memmap-kernel-parameter-fix-fix-checkpatch-fixes.patch
> page_alloc-add-movable_memmap-kernel-parameter-fix-fix-fix.patch
> page_alloc-add-movable_memmap-kernel-parameter-rename-movablecore_map-to-movablemem_map.patch

**********
Add functions to parse the movablemem_map boot option. Since the option
can be specified more than once, all the maps are stored in the global
movablemem_map.map array.

We also keep the array in monotonically increasing order by start_pfn,
and merge all overlapping ranges.
**********

>
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix-fix.patch
> memory-hotplug-remove-sys-firmware-memmap-x-sysfs-fix-fix-fix-fix-fix.patch

**********
When (hot)adding memory into the system, the
/sys/firmware/memmap/X/{end, start, type} sysfs files are created. But
there is no code to remove these files. This patch implements the function
to remove them.

We cannot free a firmware_map_entry which was allocated by bootmem, because
there is no way to do so when the system is up. But we can at least
remember the address of that memory and reuse the storage when the memory
is added next time.

This patch also introduces a new list, map_entries_bootmem, to link the map
entries allocated by bootmem when they are removed, and a lock to protect
it. These entries will be reused when the memory is hot-added again.

The idea was suggested by Andrew Morton <akpm@linux-foundation.org>

NOTE: It is unsafe to return an entry pointer and release the
map_entries_lock. So we should not take the map_entries_lock separately in
firmware_map_find_entry() and firmware_map_remove_entry(). Hold the
map_entries_lock across the find and remove /sys/firmware/memmap/X
operation.

Users of these two functions also need to be careful to hold the lock when
calling them.
**********

>
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap.patch
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix.patch
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix.patch
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix-fix.patch
> memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix-fix-fix-fix.patch

**********
To remove a memmap region of sparse-vmemmap which was allocated from
bootmem, the memmap region needs to be registered by get_page_bootmem().
So the patch searches the pages of the virtual mapping and registers them
by get_page_bootmem().

NOTE: register_page_bootmem_memmap() is not implemented for ia64, ppc,
s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert
register_page_bootmem_info_node() when the platform doesn't support it.

It is implemented by adding a new Kconfig option named
CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected by
archs that fully support the memory-hotplug feature (currently only
x86_64).

Since we have 2 config options, MEMORY_HOTPLUG and MEMORY_HOTREMOVE, used
for memory hot-add and hot-remove respectively, and the code in
register_page_bootmem_info_node() is only used for collecting information
for hot-remove, place it under MEMORY_HOTREMOVE.

page_isolation.c, selected by MEMORY_ISOLATION under MEMORY_HOTPLUG, is a
similar case, so move it too.
**********

>
> memory-hotplug-common-apis-to-support-page-tables-hot-remove.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix.patch
> memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix-fix.patch

**********
When memory is removed, the corresponding pagetables should also be
removed. This patch introduces some common APIs to support removing the
vmemmap pagetable and the x86_64 architecture direct mapping pagetable.

The pages of the virtual mapping for removed memory cannot be freed if some
pages used as PGD/PUD cover not only the removed memory but also other
memory. So this patch uses the following way to check whether a page can be
freed or not.

1) When removing memory, the page structs of the removed memory are filled
   with 0xFD.
2) When all page structs on a PT/PMD are filled with 0xFD, the PT/PMD can
   be cleared. In this case, the page used as PT/PMD can be freed.

For direct mapping pages, update direct_pages_count[level] when we free
their pagetables. And do not free the pages again, because they were freed
when offlining.

For vmemmap pages, free the pages and their pagetables.

For larger pages, do not split them into smaller ones, because there is no
way to know if the larger page has been split. As a result, there is no way
to decide when to split. We deal with the larger pages in the following
way:

1) For direct mapped pages, all the pages were freed when they were
   offlined. And since memory offline is done section by section, all the
   memory ranges being removed are aligned to PAGE_SIZE. So we only need to
   deal with unaligned pages when freeing vmemmap pages.
2) For vmemmap pages being used to store page structs, if part of the
   larger page is still in use, just fill the unused part with 0xFD. And
   when the whole page is filled with 0xFD, free the larger page.
**********

>
> acpi-memory-hotplug-parse-srat-before-memblock-is-ready.patch
> acpi-memory-hotplug-parse-srat-before-memblock-is-ready-fix.patch
> acpi-memory-hotplug-parse-srat-before-memblock-is-ready-fix-fix.patch

**********
On Linux, the pages used by the kernel cannot be migrated. As a result, if
a memory range is used by the kernel, it cannot be hot-removed. So if we
want to hot-remove memory, we should prevent the kernel from using it.

The way currently used to prevent this is to specify a memory range with
the movablemem_map boot option and set it as ZONE_MOVABLE.

But when the system is booting, memblock will allocate memory and reserve
it for the kernel. Memblock is already working before we parse SRAT and
know the node memory ranges, and it may allocate memory in ranges that are
to be set as ZONE_MOVABLE. This memory can be used by the kernel and will
never be freed.

So, let's parse SRAT before memblock is first called, which is early
enough.

The first call of memblock_find_in_range_node() is in:

setup_arch()
 |-->setup_real_mode()

So this patch adds a function early_parse_srat() to parse SRAT, and calls
it before setup_real_mode() is called.

NOTE:

1) early_parse_srat() is called before numa_init(), and has already
   initialized numa_meminfo. So DO NOT clear numa_nodes_parsed in
   numa_init() and DO NOT zero numa_meminfo in numa_init(), otherwise we
   will lose the memory NUMA info.

2) I don't know why the original acpi_numa_init() uses the count of memory
   affinities parsed from SRAT as its return value. So I add a static
   variable srat_mem_cnt to remember this count and use it as the return
   value of the new acpi_numa_init().
**********
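The first changelog in the message above describes keeping movablemem_map.map[] sorted by start_pfn and merging overlapping ranges. The following is a minimal user-space sketch of that idea under stated assumptions, not the kernel implementation: `struct entry`, `struct map`, `map_insert()` and `MAP_MAX` are hypothetical names, ranges are treated as half-open `[start_pfn, end_pfn)`, and ranges that merely touch are merged as well, which is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_MAX 32	/* hypothetical capacity for this sketch */

struct entry { unsigned long start_pfn, end_pfn; };	/* [start_pfn, end_pfn) */
struct map { size_t nr; struct entry e[MAP_MAX]; };

/*
 * Insert [start, end) so that the array stays sorted by start_pfn and
 * contains no overlapping (or touching) entries.
 * Returns 0 on success, -1 if the range is empty or the map is full.
 */
int map_insert(struct map *m, unsigned long start, unsigned long end)
{
	size_t first, last, i;

	if (start >= end)
		return -1;

	/* Skip entries that end strictly before the new range begins. */
	for (first = 0; first < m->nr && m->e[first].end_pfn < start; first++)
		;

	/* Absorb every following entry that begins at or before 'end'. */
	for (last = first; last < m->nr && m->e[last].start_pfn <= end; last++) {
		if (m->e[last].start_pfn < start)
			start = m->e[last].start_pfn;
		if (m->e[last].end_pfn > end)
			end = m->e[last].end_pfn;
	}

	if (last == first) {
		/* Nothing to merge: open a slot at 'first'. */
		if (m->nr == MAP_MAX)
			return -1;
		for (i = m->nr; i > first; i--)
			m->e[i] = m->e[i - 1];
		m->nr++;
	} else {
		/* Entries first..last-1 collapse into one: close the gap. */
		size_t removed = last - first - 1;

		for (i = last; i < m->nr; i++)
			m->e[i - removed] = m->e[i];
		m->nr -= removed;
	}

	m->e[first].start_pfn = start;
	m->e[first].end_pfn = end;
	return 0;
}

#ifdef SKETCH_MAIN
int main(void)
{
	struct map m = { 0 };

	assert(map_insert(&m, 100, 200) == 0);
	assert(map_insert(&m, 150, 300) == 0);	/* overlaps, so it merges */
	assert(m.nr == 1);
	return 0;
}
#endif
```

Because the array is kept sorted and free of overlaps after every insertion, the entries affected by a new range always form one contiguous run. That is why a single forward scan is enough here.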
* Re: [Bug fix PATCH 0/2] Make whatever node kernel resides in un-hotpluggable.
From: Rob Landley @ 2013-02-23 19:40 UTC
To: Andrew Morton
Cc: Tang Chen, jiang.liu, wujianguo, hpa, wency, laijs, linfeng, yinghai, isimatu.yasuaki, kosaki.motohiro, minchan.kim, mgorman, rientjes, guz.fnst, rusty, lliubbo, jaegeuk.hanse, tony.luck, glommer, linux-kernel, linux-mm

On 02/20/2013 03:36:50 PM, Andrew Morton wrote:
> and while we're there, let's pause to admire how prescient I was in
> refusing to merge all this into 3.8-rc1 :)

I'm on a plane, which is why I am not digging out the Dr. Who episode "planet of the spiders", digitizing the "All praise to the great one" chant, and attaching it to this email.

(So, consider yourself lucky, I guess.)

Rob
end of thread (newest message: 2013-02-25 19:06 UTC)

Thread overview: 19+ messages
2013-02-20 11:00 [Bug fix PATCH 0/2] Make whatever node kernel resides in un-hotpluggable Tang Chen
2013-02-20 11:00 ` [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT Tang Chen
2013-02-20 12:31 ` Tang Chen
2013-02-20 12:35 ` Will Huck
2013-02-20 22:41 ` Luck, Tony
2013-02-21 0:05 ` Will Huck
2013-02-21 0:23 ` Luck, Tony
2013-02-25 7:07 ` Will Huck
2013-02-25 9:01 ` Will Huck
2013-02-25 1:35 ` Will Huck
2013-02-25 3:32 ` Tang Chen
2013-02-25 19:06 ` Luck, Tony
2013-02-20 11:00 ` [Bug fix PATCH 2/2] acpi, movablemem_map: Make whatever nodes the kernel resides in un-hotpluggable Tang Chen
2013-02-23 19:26 ` Rob Landley
2013-02-25 2:54 ` Tang Chen
2013-02-20 21:36 ` [Bug fix PATCH 0/2] Make whatever node " Andrew Morton
2013-02-21 3:03 ` Tang Chen
2013-02-21 7:03 ` Tang Chen
2013-02-23 19:40 ` Rob Landley