* [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
@ 2015-07-01  3:16 Tang Chen
  2015-07-01  6:25 ` Xishi Qiu
  2015-07-15 21:20 ` Tejun Heo
  0 siblings, 2 replies; 11+ messages in thread
From: Tang Chen @ 2015-07-01  3:16 UTC
  To: tglx, mingo, hpa, akpm, tj, dyoung, isimatu.yasuaki, yasu.isimatu,
	lcapitulino, qiuxishi, will.deacon, tony.luck, vladimir.murzin,
	fabf, kuleshovmail, bhe
  Cc: x86, tangchen, linux-kernel, linux-mm

When parsing SRAT, all memory ranges are added into numa_meminfo.
In numa_init(), before entering numa_cleanup_meminfo(), all possible
memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
all ranges over max_pfn or empty.

But, this only works if the nodes are continuous. Let's have a look
at the following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug

On boot, only node 0,1,2,3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 60000000]
2. on node 0: [100000000, 20000000000]
3. on node 1: [20000000000, 40000000000]
4. on node 4: [40000000000, 60000000000]
5. on node 5: [60000000000, 80000000000]
6. on node 2: [80000000000, a0000000000]
7. on node 3: [a0000000000, a0800000000]
8. on node 6: [c0000000000, a0800000000]
9. on node 7: [e0000000000, a0800000000]

And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
the end address is over max_pfn, which is a0800000000. But 4 and 5
are not removed because their end addresses are less than max_pfn.
But in fact, node 4 and 5 don't exist.

In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.

Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(),
node 4 and 5 will be mistakenly set to online.

In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block. Since memory_block contains
all available memory at boot time, if they overlap, it means the ranges
exist. If not, then remove them from numa_meminfo.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c       | 6 ++++--
 include/linux/memblock.h | 2 ++
 mm/memblock.c            | 2 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 		bi->start = max(bi->start, low);
 		bi->end = min(bi->end, high);
 
-		/* and there's no empty block */
-		if (bi->start >= bi->end)
+		/* and there's no empty or non-exist block */
+		if (bi->start >= bi->end ||
+		    memblock_overlaps_region(&memblock.memory,
+			bi->start, bi->end - bi->start) == -1)
 			numa_remove_memblk_from(i--, mi);
 	}
 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 0215ffd..3bf6cc1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
+long memblock_overlaps_region(struct memblock_type *type,
+			      phys_addr_t base, phys_addr_t size);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 1b444c7..55b5f9f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
 	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
 }
 
-static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
+long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 					phys_addr_t base, phys_addr_t size)
 {
 	unsigned long i;
-- 
1.8.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
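The effect of the added overlap check can be sketched in user space. What follows is a simplified model under stated assumptions, not the kernel code: `struct blk`, `struct region`, `overlaps_region()`, `cleanup_blocks()` and `demo_count()` are hypothetical names, the list of present memory is assumed from the statement that only nodes 0,1,2,3 exist at boot, block ends are treated as exclusive, and the merge step of numa_cleanup_meminfo() is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct blk    { uint64_t start, end; int nid; };  /* [start, end) on node nid */
struct region { uint64_t base, size; };           /* memory present at boot   */

/* Index of the first region overlapping [base, base + size), or -1. */
long overlaps_region(const struct region *r, size_t cnt,
                     uint64_t base, uint64_t size)
{
    for (size_t i = 0; i < cnt; i++)
        if (base < r[i].base + r[i].size && r[i].base < base + size)
            return (long)i;
    return -1;
}

/*
 * Clamp each block to [0, high), then drop blocks that are empty or,
 * when check_present is set, that overlap no present memory.
 * Returns the number of blocks kept (compacted to the front of b).
 */
size_t cleanup_blocks(struct blk *b, size_t n, uint64_t high,
                      const struct region *mem, size_t mem_cnt,
                      int check_present)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        struct blk cur = b[i];

        if (cur.end > high)
            cur.end = high;
        if (cur.start >= cur.end)
            continue;                     /* empty after clamping */
        if (check_present &&
            overlaps_region(mem, mem_cnt, cur.start,
                            cur.end - cur.start) == -1)
            continue;                     /* falls into a hole    */
        b[out++] = cur;
    }
    return out;
}

/* Runs the SRAT example from the patch description. */
size_t demo_count(int check_present)
{
    struct blk blks[] = {
        { 0x0ULL,           0x60000000ULL,      0 },
        { 0x100000000ULL,   0x20000000000ULL,   0 },
        { 0x20000000000ULL, 0x40000000000ULL,   1 },
        { 0x40000000000ULL, 0x60000000000ULL,   4 },
        { 0x60000000000ULL, 0x80000000000ULL,   5 },
        { 0x80000000000ULL, 0xa0000000000ULL,   2 },
        { 0xa0000000000ULL, 0xc0000000000ULL,   3 },
        { 0xc0000000000ULL, 0xe0000000000ULL,   6 },
        { 0xe0000000000ULL, 0x100000000000ULL,  7 },
    };
    /* Assumed present memory: nodes 0, 1, 2 and the start of node 3. */
    const struct region mem[] = {
        { 0x0ULL,           0x60000000ULL },
        { 0x100000000ULL,   0x20000000000ULL - 0x100000000ULL },
        { 0x20000000000ULL, 0x20000000000ULL },
        { 0x80000000000ULL, 0x20000000000ULL },
        { 0xa0000000000ULL, 0x800000000ULL },
    };

    return cleanup_blocks(blks, sizeof(blks) / sizeof(blks[0]),
                          0xa0800000000ULL,
                          mem, sizeof(mem) / sizeof(mem[0]),
                          check_present);
}
```

With the check disabled, the blocks for nodes 4 and 5 survive because they lie below the upper limit; with it enabled they are dropped because they overlap no present region.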
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
  2015-07-01  3:16 [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo Tang Chen
@ 2015-07-01  6:25 ` Xishi Qiu
  2015-07-01  7:55   ` Tang Chen
  2015-07-15 21:20 ` Tejun Heo
  1 sibling, 1 reply; 11+ messages in thread
From: Xishi Qiu @ 2015-07-01  6:25 UTC
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, tj, dyoung, isimatu.yasuaki, yasu.isimatu,
	lcapitulino, will.deacon, tony.luck, vladimir.murzin, fabf,
	kuleshovmail, bhe, x86, linux-kernel, linux-mm

On 2015/7/1 11:16, Tang Chen wrote:

> [...]
>
> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
>
> Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(),
> node 4 and 5 will be mistakenly set to online.
>
> In this patch, we use memblock_overlaps_region() to check if ranges in
> numa_meminfo overlap with ranges in memory_block. Since memory_block contains
> all available memory at boot time, if they overlap, it means the ranges
> exist. If not, then remove them from numa_meminfo.
>

Hi Tang Chen,

What's the impact of this problem?

Command "numactl --hard" will show an empty node(no cpu and no memory,
but pgdat is created), right?

Thanks,
Xishi Qiu

> [...]
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
  2015-07-01  6:25 ` Xishi Qiu
@ 2015-07-01  7:55   ` Tang Chen
  2015-07-01  8:55     ` Xishi Qiu
  2015-07-02 15:02     ` Yasuaki Ishimatsu
  0 siblings, 2 replies; 11+ messages in thread
From: Tang Chen @ 2015-07-01  7:55 UTC
  To: Xishi Qiu
  Cc: tglx, mingo, hpa, akpm, tj, dyoung, isimatu.yasuaki, yasu.isimatu,
	lcapitulino, will.deacon, tony.luck, vladimir.murzin, fabf,
	kuleshovmail, bhe, x86, linux-kernel, linux-mm

On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> On 2015/7/1 11:16, Tang Chen wrote:
>
>> [...]
>
> Hi Tang Chen,
>
> What's the impact of this problem?
>
> Command "numactl --hard" will show an empty node(no cpu and no memory,
> but pgdat is created), right?

On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s):     0-14,128-142
NUMA node1 CPU(s):     15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s):     62-76,190-204
NUMA node5 CPU(s):     78-92,206-220

Nodes 2 and 3 do not exist, but they are online.

Thanks.
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
  2015-07-01  7:55   ` Tang Chen
@ 2015-07-01  8:55     ` Xishi Qiu
  2015-07-02 15:02     ` Yasuaki Ishimatsu
  1 sibling, 0 replies; 11+ messages in thread
From: Xishi Qiu @ 2015-07-01  8:55 UTC
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, tj, dyoung, isimatu.yasuaki, yasu.isimatu,
	lcapitulino, will.deacon, tony.luck, vladimir.murzin, fabf,
	kuleshovmail, bhe, x86, linux-kernel, linux-mm

On 2015/7/1 15:55, Tang Chen wrote:

> [...]
>
> On my box, if I run lscpu, the output looks like this:
>
> NUMA node0 CPU(s):     0-14,128-142
> NUMA node1 CPU(s):     15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s):     62-76,190-204
> NUMA node5 CPU(s):     78-92,206-220
>
> Nodes 2 and 3 do not exist, but they are online.
>

Yes, because srat->numa_meminfo->alloc pgdat.

Thanks,
Xishi Qiu
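The chain "srat->numa_meminfo->alloc pgdat" can be illustrated with a small user-space model. This is only a sketch under assumptions: `struct blk`, `nodes_from_blocks()` and the two `mask_*` helpers are hypothetical, a 32-bit mask stands in for the kernel's node mask, and the node ids follow the patch description. The point it shows is that any node still owning a non-empty range after cleanup ends up in the set of nodes that get a pgdat and are set online, whether or not memory is present there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct blk { uint64_t start, end; int nid; };  /* [start, end) on node nid */

/* Set bit nid for every node that owns at least one non-empty block. */
uint32_t nodes_from_blocks(const struct blk *b, size_t n)
{
    uint32_t mask = 0;

    for (size_t i = 0; i < n; i++)
        if (b[i].start < b[i].end && b[i].nid >= 0 && b[i].nid < 32)
            mask |= UINT32_C(1) << b[i].nid;
    return mask;
}

/* Blocks left when ranges in the hole (nodes 4 and 5) are not removed. */
uint32_t mask_with_stale_blocks(void)
{
    const struct blk blks[] = {
        { 0x0ULL,           0x20000000000ULL, 0 },
        { 0x20000000000ULL, 0x40000000000ULL, 1 },
        { 0x40000000000ULL, 0x60000000000ULL, 4 },  /* no memory present */
        { 0x60000000000ULL, 0x80000000000ULL, 5 },  /* no memory present */
        { 0x80000000000ULL, 0xa0000000000ULL, 2 },
        { 0xa0000000000ULL, 0xa0800000000ULL, 3 },
    };

    return nodes_from_blocks(blks, sizeof(blks) / sizeof(blks[0]));
}

/* Blocks left once ranges without present memory have been removed. */
uint32_t mask_after_filtering(void)
{
    const struct blk blks[] = {
        { 0x0ULL,           0x20000000000ULL, 0 },
        { 0x20000000000ULL, 0x40000000000ULL, 1 },
        { 0x80000000000ULL, 0xa0000000000ULL, 2 },
        { 0xa0000000000ULL, 0xa0800000000ULL, 3 },
    };

    return nodes_from_blocks(blks, sizeof(blks) / sizeof(blks[0]));
}
```

In this model the stale ranges add two extra bits to the mask, which corresponds to the empty but online nodes visible from user space.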
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
  2015-07-01  7:55   ` Tang Chen
  2015-07-01  8:55     ` Xishi Qiu
@ 2015-07-02 15:02     ` Yasuaki Ishimatsu
  2015-07-03  1:26       ` Tang Chen
  1 sibling, 1 reply; 11+ messages in thread
From: Yasuaki Ishimatsu @ 2015-07-02 15:02 UTC
  To: Tang Chen
  Cc: Xishi Qiu, tglx, mingo, hpa, akpm, tj, dyoung, isimatu.yasuaki,
	lcapitulino, will.deacon, tony.luck, vladimir.murzin, fabf,
	kuleshovmail, bhe, x86, linux-kernel, linux-mm

Hi Tang,

> On my box, if I run lscpu, the output looks like this:
>
> NUMA node0 CPU(s):     0-14,128-142
> NUMA node1 CPU(s):     15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s):     62-76,190-204
> NUMA node5 CPU(s):     78-92,206-220
>
> Nodes 2 and 3 do not exist, but they are online.

According to your description of the patch, node 4 and 5 are mistakenly
set to online. Why does lscpu show the above result?

Thanks,
Yasuaki Ishimatsu
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
  2015-07-02 15:02     ` Yasuaki Ishimatsu
@ 2015-07-03  1:26       ` Tang Chen
  2015-07-06 16:42         ` Yasuaki Ishimatsu
  0 siblings, 1 reply; 11+ messages in thread
From: Tang Chen @ 2015-07-03  1:26 UTC
  To: Yasuaki Ishimatsu
  Cc: Xishi Qiu, tglx, mingo, hpa, akpm, tj, dyoung, isimatu.yasuaki,
	lcapitulino, will.deacon, tony.luck, vladimir.murzin, fabf,
	kuleshovmail, bhe, x86, linux-kernel, linux-mm

On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> Hi Tang,
>
>> On my box, if I run lscpu, the output looks like this:
>>
>> NUMA node0 CPU(s):     0-14,128-142
>> NUMA node1 CPU(s):     15-29,143-157
>> NUMA node2 CPU(s):
>> NUMA node3 CPU(s):
>> NUMA node4 CPU(s):     62-76,190-204
>> NUMA node5 CPU(s):     78-92,206-220
>>
>> Nodes 2 and 3 do not exist, but they are online.
> According to your description of the patch, node 4 and 5 are mistakenly

Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.

> set to online. Why does lscpu show the above result?

Well, actually not only lscpu gives the strange result: under
/sys/devices/system/node, interfaces for node 2 and 3 are also created.

I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
But obviously, node 2 and 3 are set online, which is incorrect.

For now, I have only found that in numa_cleanup_meminfo(), memory above
max_pfn is removed, but holes between nodes are not removed.

I think libraries are not able to handle this problem since nodes are
set online in the kernel. Seen from user space, there is no hole.

Thanks.
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
From: Yasuaki Ishimatsu @ 2015-07-06 16:42 UTC
To: Tang Chen
Cc: Xishi Qiu, tglx, mingo, hpa, akpm, tj, dyoung, isimatu.yasuaki, lcapitulino, will.deacon, tony.luck, vladimir.murzin, fabf, kuleshovmail, bhe, x86, linux-kernel, linux-mm, Yasuaki Ishimatsu

On Fri, 3 Jul 2015 09:26:05 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

>
> On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> > Hi Tang,
> >
> >> On my box, if I run lscpu, the output looks like this:
> >>
> >> NUMA node0 CPU(s): 0-14,128-142
> >> NUMA node1 CPU(s): 15-29,143-157
> >> NUMA node2 CPU(s):
> >> NUMA node3 CPU(s):
> >> NUMA node4 CPU(s): 62-76,190-204
> >> NUMA node5 CPU(s): 78-92,206-220
> >>
> >> Node 2 and 3 are not exist, but they are online.
> > According your description of patch, node 4 and 5 are mistakenly
>
> Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.

Please add the results of lscpu before/after applying the patch into
description of your patch.

Feel free to add my
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Thanks,
Yasuaki Ishimatsu

> > set to online. Why does lscpu show the above result?
>
> Well, actually not only lscpu gives the strange result, under
> /sys/devices/system/node,
> interfaces for node 2 and 3 are also created.
>
> I haven't read lscpu code, so I'm not sure how lscpu handles nodes. But
> obviously,
> node 2 and 3 are set online, which is incorrect.
>
> For now, I only found that in numa_cleanup_meminfo(), memory above
> max_pfn is removed,
> but holes between nodes are not removed.
>
> I think libraries are not able to handle this problem since nodes are
> set online in kernel.
> Seeing from user space, there is no hole.
>
> Thanks.
> > > > > Thanks, > > Yasuaki Ishimatsu > > > > On Wed, 1 Jul 2015 15:55:30 +0800 > > Tang Chen <tangchen@cn.fujitsu.com> wrote: > > > >> On 07/01/2015 02:25 PM, Xishi Qiu wrote: > >>> On 2015/7/1 11:16, Tang Chen wrote: > >>> > >>>> When parsing SRAT, all memory ranges are added into numa_meminfo. > >>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible > >>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes > >>>> all ranges over max_pfn or empty. > >>>> > >>>> But, this only works if the nodes are continuous. Let's have a look > >>>> at the following example: > >>>> > >>>> We have an SRAT like this: > >>>> SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff] > >>>> SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff] > >>>> SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff] > >>>> SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug > >>>> SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug > >>>> SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug > >>>> SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug > >>>> SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug > >>>> SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug > >>>> > >>>> On boot, only node 0,1,2,3 exist. > >>>> > >>>> And the numa_meminfo will look like this: > >>>> numa_meminfo.nr_blks = 9 > >>>> 1. on node 0: [0, 60000000] > >>>> 2. on node 0: [100000000, 20000000000] > >>>> 3. on node 1: [20000000000, 40000000000] > >>>> 4. on node 4: [40000000000, 60000000000] > >>>> 5. on node 5: [60000000000, 80000000000] > >>>> 6. on node 2: [80000000000, a0000000000] > >>>> 7. on node 3: [a0000000000, a0800000000] > >>>> 8. on node 6: [c0000000000, a0800000000] > >>>> 9. on node 7: [e0000000000, a0800000000] > >>>> > >>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because > >>>> the end address is over max_pfn, which is a0800000000. 
But 4 and 5 > >>>> are not removed because their end addresses are less then max_pfn. > >>>> But in fact, node 4 and 5 don't exist. > >>>> > >>>> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes. > >>>> > >>>> Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(), > >>>> node 4 and 5 will be mistakenly set to online. > >>>> > >>>> In this patch, we use memblock_overlaps_region() to check if ranges in > >>>> numa_meminfo overlap with ranges in memory_block. Since memory_block contains > >>>> all available memory at boot time, if they overlap, it means the ranges > >>>> exist. If not, then remove them from numa_meminfo. > >>>> > >>> Hi Tang Chen, > >>> > >>> What's the impact of this problem? > >>> > >>> Command "numactl --hard" will show an empty node(no cpu and no memory, > >>> but pgdat is created), right? > >> On my box, if I run lscpu, the output looks like this: > >> > >> NUMA node0 CPU(s): 0-14,128-142 > >> NUMA node1 CPU(s): 15-29,143-157 > >> NUMA node2 CPU(s): > >> NUMA node3 CPU(s): > >> NUMA node4 CPU(s): 62-76,190-204 > >> NUMA node5 CPU(s): 78-92,206-220 > >> > >> Node 2 and 3 are not exist, but they are online. > >> > >> Thanks. 
> >> > >>> Thanks, > >>> Xishi Qiu > >>> > >>>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> > >>>> --- > >>>> arch/x86/mm/numa.c | 6 ++++-- > >>>> include/linux/memblock.h | 2 ++ > >>>> mm/memblock.c | 2 +- > >>>> 3 files changed, 7 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c > >>>> index 4053bb5..0c55cc5 100644 > >>>> --- a/arch/x86/mm/numa.c > >>>> +++ b/arch/x86/mm/numa.c > >>>> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi) > >>>> bi->start = max(bi->start, low); > >>>> bi->end = min(bi->end, high); > >>>> > >>>> - /* and there's no empty block */ > >>>> - if (bi->start >= bi->end) > >>>> + /* and there's no empty or non-exist block */ > >>>> + if (bi->start >= bi->end || > >>>> + memblock_overlaps_region(&memblock.memory, > >>>> + bi->start, bi->end - bi->start) == -1) > >>>> numa_remove_memblk_from(i--, mi); > >>>> } > >>>> > >>>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h > >>>> index 0215ffd..3bf6cc1 100644 > >>>> --- a/include/linux/memblock.h > >>>> +++ b/include/linux/memblock.h > >>>> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size); > >>>> int memblock_free(phys_addr_t base, phys_addr_t size); > >>>> int memblock_reserve(phys_addr_t base, phys_addr_t size); > >>>> void memblock_trim_memory(phys_addr_t align); > >>>> +long memblock_overlaps_region(struct memblock_type *type, > >>>> + phys_addr_t base, phys_addr_t size); > >>>> int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size); > >>>> int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size); > >>>> int memblock_mark_mirror(phys_addr_t base, phys_addr_t size); > >>>> diff --git a/mm/memblock.c b/mm/memblock.c > >>>> index 1b444c7..55b5f9f 100644 > >>>> --- a/mm/memblock.c > >>>> +++ b/mm/memblock.c > >>>> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p > >>>> return ((base1 < (base2 + size2)) && (base2 
< (base1 + size1))); > >>>> } > >>>> > >>>> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type, > >>>> +long __init_memblock memblock_overlaps_region(struct memblock_type *type, > >>>> phys_addr_t base, phys_addr_t size) > >>>> { > >>>> unsigned long i; > >>> > >>> . > >>> > > . > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
From: Tang Chen @ 2015-07-07 8:57 UTC
To: Yasuaki Ishimatsu
Cc: Xishi Qiu, tglx, mingo, hpa, akpm, tj, dyoung, isimatu.yasuaki, lcapitulino, will.deacon, tony.luck, vladimir.murzin, fabf, kuleshovmail, bhe, x86, linux-kernel, linux-mm

On 07/07/2015 12:42 AM, Yasuaki Ishimatsu wrote:
> On Fri, 3 Jul 2015 09:26:05 +0800
> Tang Chen <tangchen@cn.fujitsu.com> wrote:
>
>> On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
>>> Hi Tang,
>>>
>>>> On my box, if I run lscpu, the output looks like this:
>>>>
>>>> NUMA node0 CPU(s): 0-14,128-142
>>>> NUMA node1 CPU(s): 15-29,143-157
>>>> NUMA node2 CPU(s):
>>>> NUMA node3 CPU(s):
>>>> NUMA node4 CPU(s): 62-76,190-204
>>>> NUMA node5 CPU(s): 78-92,206-220
>>>>
>>>> Node 2 and 3 are not exist, but they are online.
>>> According your description of patch, node 4 and 5 are mistakenly
>> Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.
> Please add the results of lscpu before/after applying the patch into
> description of your patch.
>
> Feel free to add my
> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Thanks for reviewing. Will update the patch soon.

Thanks.

>
> Thanks,
> Yasuaki Ishimatsu
>
>>> set to online. Why does lscpu show the above result?
>> Well, actually not only lscpu gives the strange result, under
>> /sys/devices/system/node,
>> interfaces for node 2 and 3 are also created.
>>
>> I haven't read lscpu code, so I'm not sure how lscpu handles nodes. But
>> obviously,
>> node 2 and 3 are set online, which is incorrect.
>>
>> For now, I only found that in numa_cleanup_meminfo(), memory above
>> max_pfn is removed,
>> but holes between nodes are not removed.
>>
>> I think libraries are not able to handle this problem since nodes are
>> set online in kernel.
>> Seeing from user space, there is no hole. >> >> Thanks. >> >>> Thanks, >>> Yasuaki Ishimatsu >>> >>> On Wed, 1 Jul 2015 15:55:30 +0800 >>> Tang Chen <tangchen@cn.fujitsu.com> wrote: >>> >>>> On 07/01/2015 02:25 PM, Xishi Qiu wrote: >>>>> On 2015/7/1 11:16, Tang Chen wrote: >>>>> >>>>>> When parsing SRAT, all memory ranges are added into numa_meminfo. >>>>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible >>>>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes >>>>>> all ranges over max_pfn or empty. >>>>>> >>>>>> But, this only works if the nodes are continuous. Let's have a look >>>>>> at the following example: >>>>>> >>>>>> We have an SRAT like this: >>>>>> SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff] >>>>>> SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff] >>>>>> SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff] >>>>>> SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug >>>>>> SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug >>>>>> SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug >>>>>> SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug >>>>>> SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug >>>>>> SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug >>>>>> >>>>>> On boot, only node 0,1,2,3 exist. >>>>>> >>>>>> And the numa_meminfo will look like this: >>>>>> numa_meminfo.nr_blks = 9 >>>>>> 1. on node 0: [0, 60000000] >>>>>> 2. on node 0: [100000000, 20000000000] >>>>>> 3. on node 1: [20000000000, 40000000000] >>>>>> 4. on node 4: [40000000000, 60000000000] >>>>>> 5. on node 5: [60000000000, 80000000000] >>>>>> 6. on node 2: [80000000000, a0000000000] >>>>>> 7. on node 3: [a0000000000, a0800000000] >>>>>> 8. on node 6: [c0000000000, a0800000000] >>>>>> 9. on node 7: [e0000000000, a0800000000] >>>>>> >>>>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because >>>>>> the end address is over max_pfn, which is a0800000000. 
But 4 and 5 >>>>>> are not removed because their end addresses are less then max_pfn. >>>>>> But in fact, node 4 and 5 don't exist. >>>>>> >>>>>> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes. >>>>>> >>>>>> Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(), >>>>>> node 4 and 5 will be mistakenly set to online. >>>>>> >>>>>> In this patch, we use memblock_overlaps_region() to check if ranges in >>>>>> numa_meminfo overlap with ranges in memory_block. Since memory_block contains >>>>>> all available memory at boot time, if they overlap, it means the ranges >>>>>> exist. If not, then remove them from numa_meminfo. >>>>>> >>>>> Hi Tang Chen, >>>>> >>>>> What's the impact of this problem? >>>>> >>>>> Command "numactl --hard" will show an empty node(no cpu and no memory, >>>>> but pgdat is created), right? >>>> On my box, if I run lscpu, the output looks like this: >>>> >>>> NUMA node0 CPU(s): 0-14,128-142 >>>> NUMA node1 CPU(s): 15-29,143-157 >>>> NUMA node2 CPU(s): >>>> NUMA node3 CPU(s): >>>> NUMA node4 CPU(s): 62-76,190-204 >>>> NUMA node5 CPU(s): 78-92,206-220 >>>> >>>> Node 2 and 3 are not exist, but they are online. >>>> >>>> Thanks. 
>>>> >>>>> Thanks, >>>>> Xishi Qiu >>>>> >>>>>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> >>>>>> --- >>>>>> arch/x86/mm/numa.c | 6 ++++-- >>>>>> include/linux/memblock.h | 2 ++ >>>>>> mm/memblock.c | 2 +- >>>>>> 3 files changed, 7 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c >>>>>> index 4053bb5..0c55cc5 100644 >>>>>> --- a/arch/x86/mm/numa.c >>>>>> +++ b/arch/x86/mm/numa.c >>>>>> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi) >>>>>> bi->start = max(bi->start, low); >>>>>> bi->end = min(bi->end, high); >>>>>> >>>>>> - /* and there's no empty block */ >>>>>> - if (bi->start >= bi->end) >>>>>> + /* and there's no empty or non-exist block */ >>>>>> + if (bi->start >= bi->end || >>>>>> + memblock_overlaps_region(&memblock.memory, >>>>>> + bi->start, bi->end - bi->start) == -1) >>>>>> numa_remove_memblk_from(i--, mi); >>>>>> } >>>>>> >>>>>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h >>>>>> index 0215ffd..3bf6cc1 100644 >>>>>> --- a/include/linux/memblock.h >>>>>> +++ b/include/linux/memblock.h >>>>>> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size); >>>>>> int memblock_free(phys_addr_t base, phys_addr_t size); >>>>>> int memblock_reserve(phys_addr_t base, phys_addr_t size); >>>>>> void memblock_trim_memory(phys_addr_t align); >>>>>> +long memblock_overlaps_region(struct memblock_type *type, >>>>>> + phys_addr_t base, phys_addr_t size); >>>>>> int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size); >>>>>> int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size); >>>>>> int memblock_mark_mirror(phys_addr_t base, phys_addr_t size); >>>>>> diff --git a/mm/memblock.c b/mm/memblock.c >>>>>> index 1b444c7..55b5f9f 100644 >>>>>> --- a/mm/memblock.c >>>>>> +++ b/mm/memblock.c >>>>>> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p >>>>>> return ((base1 < (base2 + size2)) && (base2 
< (base1 + size1))); >>>>>> } >>>>>> >>>>>> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type, >>>>>> +long __init_memblock memblock_overlaps_region(struct memblock_type *type, >>>>>> phys_addr_t base, phys_addr_t size) >>>>>> { >>>>>> unsigned long i; >>>>> . >>>>> >>> . >>> > . > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
From: Tejun Heo @ 2015-07-15 21:20 UTC
To: Tang Chen
Cc: tglx, mingo, hpa, akpm, dyoung, isimatu.yasuaki, yasu.isimatu, lcapitulino, qiuxishi, will.deacon, tony.luck, vladimir.murzin, fabf, kuleshovmail, bhe, x86, linux-kernel, linux-mm

On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
...
> -		/* and there's no empty block */
> -		if (bi->start >= bi->end)
> +		/* and there's no empty or non-exist block */
> +		if (bi->start >= bi->end ||
> +		    memblock_overlaps_region(&memblock.memory,
> +			bi->start, bi->end - bi->start) == -1)

Ugh.... can you please change memblock_overlaps_region() to return
bool instead?

Thanks.

--
tejun
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
From: Tang Chen @ 2015-07-16 5:30 UTC
To: Tejun Heo
Cc: tglx, mingo, hpa, akpm, dyoung, isimatu.yasuaki, yasu.isimatu, lcapitulino, qiuxishi, will.deacon, tony.luck, vladimir.murzin, fabf, kuleshovmail, bhe, x86, linux-kernel, linux-mm

On 07/16/2015 05:20 AM, Tejun Heo wrote:
> On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
> ...
>> -		/* and there's no empty block */
>> -		if (bi->start >= bi->end)
>> +		/* and there's no empty or non-exist block */
>> +		if (bi->start >= bi->end ||
>> +		    memblock_overlaps_region(&memblock.memory,
>> +			bi->start, bi->end - bi->start) == -1)
> Ugh.... can you please change memblock_overlaps_region() to return
> bool instead?

Well, I think memblock_overlaps_region() is designed to return
the index of the region overlapping with the given region.
Maybe it had some users before.

Of course for now, it is only called by memblock_is_region_reserved().

It is OK to change the return value of memblock_overlaps_region() to bool.
But any caller of memblock_is_region_reserved() should also be changed.

I think it is OK to leave it there.

Thanks.

>
> Thanks.
>
* Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
From: Tang Chen @ 2015-07-16 7:21 UTC
To: Tejun Heo
Cc: tglx, mingo, hpa, akpm, dyoung, isimatu.yasuaki, yasu.isimatu, lcapitulino, qiuxishi, will.deacon, tony.luck, vladimir.murzin, fabf, kuleshovmail, bhe, x86, linux-kernel, linux-mm

On 07/16/2015 05:20 AM, Tejun Heo wrote:
> On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
> ...
>> -		/* and there's no empty block */
>> -		if (bi->start >= bi->end)
>> +		/* and there's no empty or non-exist block */
>> +		if (bi->start >= bi->end ||
>> +		    memblock_overlaps_region(&memblock.memory,
>> +			bi->start, bi->end - bi->start) == -1)
> Ugh.... can you please change memblock_overlaps_region() to return
> bool instead?

Well, I think memblock_overlaps_region() is designed to return
the index of the region overlapping with the given region.
Of course for now, it is only called by memblock_is_region_reserved().

Will post a patch to do this.

Thanks.

>
> Thanks.
>
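The difference between the index-returning form and the bool form requested above can be illustrated in userspace. This is a sketch under stated assumptions and not the follow-up patch: `struct region`, `struct region_list`, `addrs_overlap()`, `overlaps_region_index()` and `overlaps_region()` are hypothetical stand-ins for the kernel's memblock types and helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the kernel's memblock types. */
struct region {
	uint64_t base;
	uint64_t size;
};

struct region_list {
	size_t cnt;
	const struct region *regions;
};

/* Same comparison as memblock_addrs_overlap() in the quoted diff. */
static bool addrs_overlap(uint64_t base1, uint64_t size1,
			  uint64_t base2, uint64_t size2)
{
	return base1 < base2 + size2 && base2 < base1 + size1;
}

/* Index-returning form: index of the first overlapping region, or -1. */
static long overlaps_region_index(const struct region_list *type,
				  uint64_t base, uint64_t size)
{
	assert(type != NULL);

	for (size_t i = 0; i < type->cnt; i++)
		if (addrs_overlap(base, size, type->regions[i].base,
				  type->regions[i].size))
			return (long)i;
	return -1;
}

/* Boolean form: callers no longer compare the result against -1. */
static bool overlaps_region(const struct region_list *type,
			    uint64_t base, uint64_t size)
{
	return overlaps_region_index(type, base, size) >= 0;
}
```

With the boolean form, a call site like the one in the quoted hunk would read `!overlaps_region(...)` instead of `... == -1`.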
Thread overview: 11+ messages
2015-07-01 3:16 [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo Tang Chen
2015-07-01 6:25 ` Xishi Qiu
2015-07-01 7:55 ` Tang Chen
2015-07-01 8:55 ` Xishi Qiu
2015-07-02 15:02 ` Yasuaki Ishimatsu
2015-07-03 1:26 ` Tang Chen
2015-07-06 16:42 ` Yasuaki Ishimatsu
2015-07-07 8:57 ` Tang Chen
2015-07-15 21:20 ` Tejun Heo
2015-07-16 5:30 ` Tang Chen
2015-07-16 7:21 ` Tang Chen