linux-mm.kvack.org archive mirror
From: Tang Chen <tangchen@cn.fujitsu.com>
To: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>,
	tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	akpm@linux-foundation.org, tj@kernel.org, dyoung@redhat.com,
	isimatu.yasuaki@jp.fujitsu.com, lcapitulino@redhat.com,
	will.deacon@arm.com, tony.luck@intel.com,
	vladimir.murzin@arm.com, fabf@skynet.be, kuleshovmail@gmail.com,
	bhe@redhat.com, x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
Date: Fri, 3 Jul 2015 09:26:05 +0800
Message-ID: <5595E4AD.8020502@cn.fujitsu.com>
In-Reply-To: <5595527a.0b32370a.6c7e.01ee@mx.google.com>


On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> Hi Tang,
>
>> On my box, if I run lscpu, the output looks like this:
>>
>> NUMA node0 CPU(s):     0-14,128-142
>> NUMA node1 CPU(s):     15-29,143-157
>> NUMA node2 CPU(s):
>> NUMA node3 CPU(s):
>> NUMA node4 CPU(s):     62-76,190-204
>> NUMA node5 CPU(s):     78-92,206-220
>>
>> Nodes 2 and 3 do not exist, but they are online.
> According to your description of the patch, nodes 4 and 5 are mistakenly
> set to online. Why does lscpu show the above result?

Not nodes 4 and 5; it is nodes 2 and 3 that are mistakenly set online.

Well, actually it is not only lscpu that gives the strange result. Under
/sys/devices/system/node, interfaces for nodes 2 and 3 are also created.
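
One way to look at this from user space without going through lscpu is to
read the node list files under /sys/devices/system/node directly. The
following is only a minimal sketch: parse_node_list() and read_node_mask()
are hypothetical helper names, the 64-node limit is an arbitrary
simplification, and it assumes the kernel exposes files such as "online"
and "has_memory" in that directory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a sysfs-style node list such as "0-1,4-5\n" into a bitmask
 * (bit n set means node n is listed). Returns 0 on success, -1 on
 * malformed input or a node id >= 64 (illustrative limit only).
 * On failure *mask may hold a partial result.
 */
int parse_node_list(const char *s, uint64_t *mask)
{
	assert(s && mask);
	*mask = 0;

	while (*s && *s != '\n') {
		char *end;
		long lo, hi;

		lo = hi = strtol(s, &end, 10);
		if (end == s)
			return -1;
		if (*end == '-') {		/* range "lo-hi" */
			const char *p = end + 1;

			hi = strtol(p, &end, 10);
			if (end == p)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= 64)
			return -1;
		if (*end != ',' && *end != '\0' && *end != '\n')
			return -1;
		for (long n = lo; n <= hi; n++)
			*mask |= UINT64_C(1) << n;
		s = (*end == ',') ? end + 1 : end;
	}
	return 0;
}

/* Read one node list file, e.g. /sys/devices/system/node/online. */
int read_node_mask(const char *path, uint64_t *mask)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_node_list(buf, mask);
}
```

Calling read_node_mask() on "online" and on "has_memory" and comparing
the two masks would list the nodes that are online but have no memory
behind them.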

I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
But obviously, nodes 2 and 3 are set online, which is incorrect.

So far, I have only found that numa_cleanup_meminfo() removes memory
above max_pfn, but does not remove holes between nodes.
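
To make the hole case concrete, here is a small user-space sketch (not
kernel code) of the clamp-and-drop loop, using the SRAT layout from the
patch description. The struct and function names are hypothetical, the
real function also merges adjacent blocks, and the boot_mem[] table is
an assumption derived from "on boot, only node 0,1,2,3 exist". Passing
NULL for avail models the current behaviour; passing boot_mem models
the extra overlap check the patch proposes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLKS 16

struct blk { uint64_t start, end; int nid; };	/* [start, end) */
struct meminfo { int nr_blks; struct blk blk[MAX_BLKS]; };

/* Assumed memory present at boot: nodes 0, 1, 2 and part of node 3. */
const struct blk boot_mem[3] = {
	{ 0x0ULL,           0x60000000ULL,    -1 },
	{ 0x100000000ULL,   0x40000000000ULL, -1 },
	{ 0x80000000000ULL, 0xa0800000000ULL, -1 },
};

/* True if [start, end) intersects any of the n ranges in avail[]. */
static bool overlaps_any(const struct blk *avail, int n,
			 uint64_t start, uint64_t end)
{
	for (int i = 0; i < n; i++)
		if (start < avail[i].end && avail[i].start < end)
			return true;
	return false;
}

static void remove_blk(struct meminfo *mi, int idx)
{
	assert(idx >= 0 && idx < mi->nr_blks);
	mi->nr_blks--;
	memmove(&mi->blk[idx], &mi->blk[idx + 1],
		(size_t)(mi->nr_blks - idx) * sizeof(mi->blk[0]));
}

/*
 * Clamp every block to [0, high) and drop empty ones. If avail is not
 * NULL, also drop blocks that overlap none of the available ranges.
 */
void cleanup(struct meminfo *mi, uint64_t high,
	     const struct blk *avail, int n_avail)
{
	for (int i = 0; i < mi->nr_blks; i++) {
		struct blk *bi = &mi->blk[i];

		if (bi->end > high)
			bi->end = high;
		if (bi->start >= bi->end ||
		    (avail && !overlaps_any(avail, n_avail,
					    bi->start, bi->end)))
			remove_blk(mi, i--);
	}
}

/* Bitmask of node ids that still own at least one block. */
unsigned int node_mask(const struct meminfo *mi)
{
	unsigned int mask = 0;

	for (int i = 0; i < mi->nr_blks; i++)
		mask |= 1u << mi->blk[i].nid;
	return mask;
}

/* The nine SRAT ranges from the patch description, end-exclusive. */
struct meminfo srat_example(void)
{
	struct meminfo mi = { .nr_blks = 9, .blk = {
		{ 0x0ULL,           0x60000000ULL,     0 },
		{ 0x100000000ULL,   0x20000000000ULL,  0 },
		{ 0x20000000000ULL, 0x40000000000ULL,  1 },
		{ 0x40000000000ULL, 0x60000000000ULL,  4 },
		{ 0x60000000000ULL, 0x80000000000ULL,  5 },
		{ 0x80000000000ULL, 0xa0000000000ULL,  2 },
		{ 0xa0000000000ULL, 0xc0000000000ULL,  3 },
		{ 0xc0000000000ULL, 0xe0000000000ULL,  6 },
		{ 0xe0000000000ULL, 0x100000000000ULL, 7 },
	} };
	return mi;
}
```

With high = 0xa0800000000, cleanup(&mi, high, NULL, 0) leaves blocks for
nodes 0-5, so the two nodes sitting in the hole survive. With
cleanup(&mi, high, boot_mem, 3), only nodes 0-3 remain.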

I think user-space libraries are not able to handle this problem, since
the nodes are set online in the kernel. Seen from user space, there is
no hole.

Thanks.

>
> Thanks,
> Yasuaki Ishimatsu
>
> On Wed, 1 Jul 2015 15:55:30 +0800
> Tang Chen <tangchen@cn.fujitsu.com> wrote:
>
>> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
>>> On 2015/7/1 11:16, Tang Chen wrote:
>>>
>>>> When parsing SRAT, all memory ranges are added into numa_meminfo.
>>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
>>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
>>>> all ranges over max_pfn or empty.
>>>>
>>>> But, this only works if the nodes are contiguous. Let's have a look
>>>> at the following example:
>>>>
>>>> We have an SRAT like this:
>>>> SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
>>>> SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
>>>> SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
>>>> SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
>>>> SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
>>>> SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
>>>> SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
>>>> SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
>>>> SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
>>>>
>>>> On boot, only node 0,1,2,3 exist.
>>>>
>>>> And the numa_meminfo will look like this:
>>>> numa_meminfo.nr_blks = 9
>>>> 1. on node 0: [0, 60000000]
>>>> 2. on node 0: [100000000, 20000000000]
>>>> 3. on node 1: [20000000000, 40000000000]
>>>> 4. on node 4: [40000000000, 60000000000]
>>>> 5. on node 5: [60000000000, 80000000000]
>>>> 6. on node 2: [80000000000, a0000000000]
>>>> 7. on node 3: [a0000000000, a0800000000]
>>>> 8. on node 6: [c0000000000, a0800000000]
>>>> 9. on node 7: [e0000000000, a0800000000]
>>>>
>>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
>>>> the end address is over max_pfn, which is a0800000000. But 4 and 5
>>>> are not removed because their end addresses are less than max_pfn.
>>>> But in fact, node 4 and 5 don't exist.
>>>>
>>>> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
>>>>
>>>> Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(),
>>>> node 4 and 5 will be mistakenly set to online.
>>>>
>>>> In this patch, we use memblock_overlaps_region() to check if ranges in
>>>> numa_meminfo overlap with ranges in memblock.memory. Since memblock.memory
>>>> contains all available memory at boot time, if they overlap, it means the
>>>> ranges exist. If not, then remove them from numa_meminfo.
>>>>
>>> Hi Tang Chen,
>>>
>>> What's the impact of this problem?
>>>
>>> Command "numactl --hardware" will show an empty node (no cpu and no
>>> memory, but pgdat is created), right?
>> On my box, if I run lscpu, the output looks like this:
>>
>> NUMA node0 CPU(s):     0-14,128-142
>> NUMA node1 CPU(s):     15-29,143-157
>> NUMA node2 CPU(s):
>> NUMA node3 CPU(s):
>> NUMA node4 CPU(s):     62-76,190-204
>> NUMA node5 CPU(s):     78-92,206-220
>>
>> Nodes 2 and 3 do not exist, but they are online.
>>
>> Thanks.
>>
>>> Thanks,
>>> Xishi Qiu
>>>
>>>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>>>> ---
>>>>    arch/x86/mm/numa.c       | 6 ++++--
>>>>    include/linux/memblock.h | 2 ++
>>>>    mm/memblock.c            | 2 +-
>>>>    3 files changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>>>> index 4053bb5..0c55cc5 100644
>>>> --- a/arch/x86/mm/numa.c
>>>> +++ b/arch/x86/mm/numa.c
>>>> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>>>>    		bi->start = max(bi->start, low);
>>>>    		bi->end = min(bi->end, high);
>>>>    
>>>> -		/* and there's no empty block */
>>>> -		if (bi->start >= bi->end)
>>>> +		/* and there's no empty or non-exist block */
>>>> +		if (bi->start >= bi->end ||
>>>> +		    memblock_overlaps_region(&memblock.memory,
>>>> +			bi->start, bi->end - bi->start) == -1)
>>>>    			numa_remove_memblk_from(i--, mi);
>>>>    	}
>>>>    
>>>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>>>> index 0215ffd..3bf6cc1 100644
>>>> --- a/include/linux/memblock.h
>>>> +++ b/include/linux/memblock.h
>>>> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
>>>>    int memblock_free(phys_addr_t base, phys_addr_t size);
>>>>    int memblock_reserve(phys_addr_t base, phys_addr_t size);
>>>>    void memblock_trim_memory(phys_addr_t align);
>>>> +long memblock_overlaps_region(struct memblock_type *type,
>>>> +			      phys_addr_t base, phys_addr_t size);
>>>>    int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>>>>    int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>>>>    int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>>>> diff --git a/mm/memblock.c b/mm/memblock.c
>>>> index 1b444c7..55b5f9f 100644
>>>> --- a/mm/memblock.c
>>>> +++ b/mm/memblock.c
>>>> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
>>>>    	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
>>>>    }
>>>>    
>>>> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>>>> +long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>>>>    					phys_addr_t base, phys_addr_t size)
>>>>    {
>>>>    	unsigned long i;
>>>

