From: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
To: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>,
	tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	akpm@linux-foundation.org, tj@kernel.org, dyoung@redhat.com,
	isimatu.yasuaki@jp.fujitsu.com, lcapitulino@redhat.com,
	will.deacon@arm.com, tony.luck@intel.com,
	vladimir.murzin@arm.com, fabf@skynet.be, kuleshovmail@gmail.com,
	bhe@redhat.com, x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
Date: Thu, 02 Jul 2015 08:02:18 -0700 (PDT)
Message-ID: <5595527a.0b32370a.6c7e.01ee@mx.google.com>
In-Reply-To: <55939CF2.6080108@cn.fujitsu.com>

Hi Tang,

> On my box, if I run lscpu, the output looks like this:
> 
> NUMA node0 CPU(s):     0-14,128-142
> NUMA node1 CPU(s):     15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s):     62-76,190-204
> NUMA node5 CPU(s):     78-92,206-220
> 
> Nodes 2 and 3 do not exist, but they are online.

According to your patch description, nodes 4 and 5 are mistakenly
set online. Why, then, does lscpu show the result above?

Thanks,
Yasuaki Ishimatsu

On Wed, 1 Jul 2015 15:55:30 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> 
> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> > On 2015/7/1 11:16, Tang Chen wrote:
> >
> >> When parsing SRAT, all memory ranges are added to numa_meminfo.
> >> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> >> memory ranges are in numa_meminfo. numa_cleanup_meminfo() then removes
> >> all ranges that are empty or lie above max_pfn.
> >>
> >> But this only works if the nodes are contiguous. Consider the
> >> following example:
> >>
> >> We have an SRAT like this:
> >> SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
> >> SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
> >> SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
> >> SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
> >> SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
> >> SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
> >> SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
> >> SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
> >> SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
> >>
> >> At boot, only nodes 0, 1, 2 and 3 exist.
> >>
> >> And the numa_meminfo will look like this:
> >> numa_meminfo.nr_blks = 9
> >> 1. on node 0: [0, 60000000]
> >> 2. on node 0: [100000000, 20000000000]
> >> 3. on node 1: [20000000000, 40000000000]
> >> 4. on node 4: [40000000000, 60000000000]
> >> 5. on node 5: [60000000000, 80000000000]
> >> 6. on node 2: [80000000000, a0000000000]
> >> 7. on node 3: [a0000000000, a0800000000]
> >> 8. on node 6: [c0000000000, a0800000000]
> >> 9. on node 7: [e0000000000, a0800000000]
> >>
> >> numa_cleanup_meminfo() will merge 1 and 2, and remove 8 and 9 because
> >> their end addresses are over max_pfn, which is a0800000000. But 4 and 5
> >> are not removed because their end addresses are less than max_pfn,
> >> even though nodes 4 and 5 don't actually exist.
> >>
> >> In short, numa_cleanup_meminfo() is not able to handle holes between nodes.
> >>
> >> Since the memory ranges of nodes 4 and 5 are in numa_meminfo, nodes 4 and 5
> >> will be mistakenly set online in numa_register_memblks().
> >>
> >> In this patch, we use memblock_overlaps_region() to check whether ranges in
> >> numa_meminfo overlap with ranges in memblock.memory. Since memblock.memory
> >> contains all available memory at boot time, a range that overlaps it
> >> exists. A range that does not overlap is removed from numa_meminfo.
> >>
> > Hi Tang Chen,
> >
> > What's the impact of this problem?
> >
> > The command "numactl --hardware" will show an empty node (no CPUs and
> > no memory, but a pgdat is created), right?
> 
> On my box, if I run lscpu, the output looks like this:
> 
> NUMA node0 CPU(s):     0-14,128-142
> NUMA node1 CPU(s):     15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s):     62-76,190-204
> NUMA node5 CPU(s):     78-92,206-220
> 
> Nodes 2 and 3 do not exist, but they are online.
> 
> Thanks.
> 
> >
> > Thanks,
> > Xishi Qiu
> >
> >> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> >> ---
> >>   arch/x86/mm/numa.c       | 6 ++++--
> >>   include/linux/memblock.h | 2 ++
> >>   mm/memblock.c            | 2 +-
> >>   3 files changed, 7 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >> index 4053bb5..0c55cc5 100644
> >> --- a/arch/x86/mm/numa.c
> >> +++ b/arch/x86/mm/numa.c
> >> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
> >>   		bi->start = max(bi->start, low);
> >>   		bi->end = min(bi->end, high);
> >>   
> >> -		/* and there's no empty block */
> >> -		if (bi->start >= bi->end)
> >> +		/* and there's no empty or non-exist block */
> >> +		if (bi->start >= bi->end ||
> >> +		    memblock_overlaps_region(&memblock.memory,
> >> +			bi->start, bi->end - bi->start) == -1)
> >>   			numa_remove_memblk_from(i--, mi);
> >>   	}
> >>   
> >> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >> index 0215ffd..3bf6cc1 100644
> >> --- a/include/linux/memblock.h
> >> +++ b/include/linux/memblock.h
> >> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
> >>   int memblock_free(phys_addr_t base, phys_addr_t size);
> >>   int memblock_reserve(phys_addr_t base, phys_addr_t size);
> >>   void memblock_trim_memory(phys_addr_t align);
> >> +long memblock_overlaps_region(struct memblock_type *type,
> >> +			      phys_addr_t base, phys_addr_t size);
> >>   int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
> >>   int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
> >>   int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> >> diff --git a/mm/memblock.c b/mm/memblock.c
> >> index 1b444c7..55b5f9f 100644
> >> --- a/mm/memblock.c
> >> +++ b/mm/memblock.c
> >> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
> >>   	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
> >>   }
> >>   
> >> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
> >> +long __init_memblock memblock_overlaps_region(struct memblock_type *type,
> >>   					phys_addr_t base, phys_addr_t size)
> >>   {
> >>   	unsigned long i;
> >
> >
> > .
> >
> 
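
For readers following the quoted patch description, here is a minimal
userspace sketch of the idea, not the kernel code. It replays the SRAT
example through a clamp-and-drop pass, with and without an extra
"does this range overlap memory that is actually present" check. The names
struct range, cleanup_blocks() and node_survives() are hypothetical, the
merge step of numa_cleanup_meminfo() is left out, and the present_example
layout is an assumption (memory present only on nodes 0-3), since the
thread does not list the real contents of memblock.memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-in for a numa_meminfo block: [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* The max_pfn boundary quoted in the thread, expressed as an address. */
#define MAX_ADDR 0xa0800000000ULL

/* SRAT ranges from the example, rewritten with exclusive end addresses. */
static const struct range srat_example[9] = {
	{ 0x0ULL,           0x60000000ULL,      0 },
	{ 0x100000000ULL,   0x20000000000ULL,   0 },
	{ 0x20000000000ULL, 0x40000000000ULL,   1 },
	{ 0x40000000000ULL, 0x60000000000ULL,   4 },
	{ 0x60000000000ULL, 0x80000000000ULL,   5 },
	{ 0x80000000000ULL, 0xa0000000000ULL,   2 },
	{ 0xa0000000000ULL, 0xc0000000000ULL,   3 },
	{ 0xc0000000000ULL, 0xe0000000000ULL,   6 },
	{ 0xe0000000000ULL, 0x100000000000ULL,  7 },
};

/* ASSUMPTION: boot-time memory exists only on nodes 0-3 (nid unused here). */
static const struct range present_example[3] = {
	{ 0x0ULL,           0x60000000ULL,    -1 },
	{ 0x100000000ULL,   0x40000000000ULL, -1 },
	{ 0x80000000000ULL, MAX_ADDR,         -1 },
};

/* Half-open interval test, same shape as memblock_addrs_overlap(). */
static int ranges_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
{
	return s1 < e2 && s2 < e1;
}

/* Returns 1 if [start, end) overlaps any entry of present[]. */
static int overlaps_present(const struct range *present, size_t np,
			    uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < np; i++)
		if (ranges_overlap(start, end, present[i].start, present[i].end))
			return 1;
	return 0;
}

/*
 * Clamp each block to [0, high) and drop the empty ones. If np is non-zero,
 * also drop blocks that overlap no present range. Returns the new count.
 */
static size_t cleanup_blocks(struct range *blk, size_t nr, uint64_t high,
			     const struct range *present, size_t np)
{
	size_t out = 0;

	assert(np == 0 || present != NULL);
	for (size_t i = 0; i < nr; i++) {
		struct range b = blk[i];

		if (b.end > high)
			b.end = high;
		if (b.start >= b.end)
			continue;	/* empty after clamping */
		if (np && !overlaps_present(present, np, b.start, b.end))
			continue;	/* node hole: no memory behind it */
		blk[out++] = b;
	}
	return out;
}

/* Runs the pass on a copy of the example; returns 1 if node nid is kept. */
int node_survives(int nid, int check_present)
{
	struct range blk[9];
	size_t n;

	memcpy(blk, srat_example, sizeof(blk));
	n = cleanup_blocks(blk, 9, MAX_ADDR,
			   check_present ? present_example : NULL,
			   check_present ? 3 : 0);
	for (size_t i = 0; i < n; i++)
		if (blk[i].nid == nid)
			return 1;
	return 0;
}
```

Under these assumptions, node_survives(4, 0) reports that node 4 is kept
when only the max_pfn clamp is applied, while node_survives(4, 1) reports
that it is dropped once the overlap check is added. Nodes 2 and 3 are kept
in both cases.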
