From: Tang Chen <tangchen@cn.fujitsu.com>
To: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: mingo@redhat.com, hpa@zytor.com, akpm@linux-foundation.org,
yinghai@kernel.org, jiang.liu@huawei.com, wency@cn.fujitsu.com,
laijs@cn.fujitsu.com, isimatu.yasuaki@jp.fujitsu.com,
tj@kernel.org, mgorman@suse.de, minchan@kernel.org,
mina86@mina86.com, gong.chen@linux.intel.com,
lwoodman@redhat.com, riel@redhat.com, jweiner@redhat.com,
prarit@redhat.com, x86@kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 07/13] x86, numa, mem-hotplug: Mark nodes which the kernel resides in.
Date: Mon, 03 Jun 2013 15:35:53 +0800 [thread overview]
Message-ID: <51AC4759.6090101@cn.fujitsu.com> (raw)
In-Reply-To: <20130531162401.GA31139@dhcp-192-168-178-175.profitbricks.localdomain>
Hi Vasilis,
On 06/01/2013 12:24 AM, Vasilis Liaskovitis wrote:
......
>> +void __init_memblock memblock_mark_kernel_nodes()
>> +{
>> + int i, nid;
>> + struct memblock_type *reserved =&memblock.reserved;
>> +
>> + for (i = 0; i< reserved->cnt; i++)
>> + if (reserved->regions[i].flags == MEMBLK_FLAGS_DEFAULT) {
>> + nid = memblock_get_region_node(&reserved->regions[i]);
>> + node_set(nid, memblock_kernel_nodemask);
>> + }
>> +}
>
> I think there is a problem here because memblock_set_region_node is sometimes
> called with nid == MAX_NUMNODES. This means the correct node is not properly
> masked in the memblock_kernel_nodemask bitmap.
> E.g. in a VM test, memblock_mark_kernel_nodes with extra pr_warn calls iterates
> over the following memblocks (ranges below are memblks base-(base+size)):
>
> [ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000000000000-0x00000000010000
> [ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000000098000-0x00000000100000
> [ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000001000000-0x00000001a5a000
> [ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000037000000-0x000000377f8000
>
> where MAX_NUMNODES is 64 because CONFIG_NODES_SHIFT=6.
> The ranges above belong to node 0, but the node's bit is never marked.
>
> With a buggy bios that marks all memory as hotpluggable, this results in a
> panic, because both checks against hotpluggable bit and memblock_kernel_bitmask
> (in early_mem_hotplug_init) fail, the numa regions have all been merged together
> and memblock_reserve_hotpluggable is called for all memory.
>
> With a correct bios (some part of initial memory is not hotplug-able) the kernel
> can boot since the hotpluggable bit check works ok, but extra dimms on node 0
> will still be allowed to be in MOVABLE_ZONE.
>
OK, I see the problem. But would you please give me a call trace that
can show
how this could happen. I think the memory block info should be the same as
numa_meminfo. Can we fix the caller to make it set nid correctly ?
> Actually this behaviour (being able to have MOVABLE memory on nodes with kernel
> reserved memblocks) sort of matches the policy I requested in v2 :). But i
> suspect that is not your intent i.e. you want memblock_kernel_nodemask_bitmap to
> prevent movable reservations for the whole node where kernel has reserved
> memblocks.
I intended to set the whole node which the kernel resides in as
un-hotpluggable.
>
> Is there a way to get accurate nid information for memblocks at early boot? I
> suspect pfn_to_nid doesn't work yet at this stage (i got a panic when I
> attempted iirc)
In such an early time, I think we can only get nid from numa_meminfo. So
as I
said above, I'd like to fix this problem by making memblock has correct nid.
And I read the patch below. I think if we get nid from numa_meminfo,
than we
don't need to call memblock_get_region_node().
Thanks. :)
>
> I used the hack below but it depends on CONFIG_NUMA, hopefully there is a
> cleaner general way:
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index cfd8c2f..af8ad2a 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -133,6 +133,19 @@ void __init setup_node_to_cpumask_map(void)
> pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
> }
>
> +int __init numa_find_range_nid(u64 start, u64 size)
> +{
> + unsigned int i;
> + struct numa_meminfo *mi =&numa_meminfo;
> +
> + for (i = 0; i< mi->nr_blks; i++) {
> + if (start>= mi->blk[i].start&& start + size -1<= mi->blk[i].end)
> + return mi->blk[i].nid;
> + }
> + return -1;
> +}
> +EXPORT_SYMBOL(numa_find_range_nid);
> +
> static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
> bool hotpluggable,
> struct numa_meminfo *mi)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77a71fb..194b7c7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1600,6 +1600,9 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
> unsigned long start, unsigned long end);
> #endif
>
> +#ifdef CONFIG_NUMA
> +int __init numa_find_range_nid(u64 start, u64 size);
> +#endif
> struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
> int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index a6b7845..284aced 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -834,15 +834,26 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
>
> void __init_memblock memblock_mark_kernel_nodes()
> {
> - int i, nid;
> + int i, nid, tmpnid;
> struct memblock_type *reserved =&memblock.reserved;
>
> for (i = 0; i< reserved->cnt; i++)
> if (reserved->regions[i].flags == MEMBLK_FLAGS_DEFAULT) {
> nid = memblock_get_region_node(&reserved->regions[i]);
> + if (nid == MAX_NUMNODES) {
> + tmpnid = numa_find_range_nid(reserved->regions[i].base,
> + reserved->regions[i].size);
> + if (tmpnid>= 0)
> + nid = tmpnid;
> + }
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index e862311..84d6e64 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -667,11 +667,7 @@ static bool srat_used __initdata;
> */
> static void __init early_x86_numa_init(void)
> {
> - /*
> - * Need to find out which nodes the kernel resides in, and arrange
> - * them as un-hotpluggable when parsing SRAT.
> - */
> - memblock_mark_kernel_nodes();
>
> if (!numa_off) {
> #ifdef CONFIG_X86_NUMAQ
> @@ -779,6 +775,12 @@ void __init early_initmem_init(void)
> load_cr3(swapper_pg_dir);
> __flush_tlb_all();
>
> + /*
> + * Need to find out which nodes the kernel resides in, and arrange
> + * them as un-hotpluggable when parsing SRAT.
> + */
> +
> + memblock_mark_kernel_nodes();
> early_mem_hotplug_init();
>
> early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Tang Chen <tangchen@cn.fujitsu.com>
To: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: mingo@redhat.com, hpa@zytor.com, akpm@linux-foundation.org,
yinghai@kernel.org, jiang.liu@huawei.com, wency@cn.fujitsu.com,
laijs@cn.fujitsu.com, isimatu.yasuaki@jp.fujitsu.com,
tj@kernel.org, mgorman@suse.de, minchan@kernel.org,
mina86@mina86.com, gong.chen@linux.intel.com,
lwoodman@redhat.com, riel@redhat.com, jweiner@redhat.com,
prarit@redhat.com, x86@kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 07/13] x86, numa, mem-hotplug: Mark nodes which the kernel resides in.
Date: Mon, 03 Jun 2013 15:35:53 +0800 [thread overview]
Message-ID: <51AC4759.6090101@cn.fujitsu.com> (raw)
In-Reply-To: <20130531162401.GA31139@dhcp-192-168-178-175.profitbricks.localdomain>
Hi Vasilis,
On 06/01/2013 12:24 AM, Vasilis Liaskovitis wrote:
......
>> +void __init_memblock memblock_mark_kernel_nodes()
>> +{
>> + int i, nid;
>> + struct memblock_type *reserved =&memblock.reserved;
>> +
>> + for (i = 0; i< reserved->cnt; i++)
>> + if (reserved->regions[i].flags == MEMBLK_FLAGS_DEFAULT) {
>> + nid = memblock_get_region_node(&reserved->regions[i]);
>> + node_set(nid, memblock_kernel_nodemask);
>> + }
>> +}
>
> I think there is a problem here because memblock_set_region_node is sometimes
> called with nid == MAX_NUMNODES. This means the correct node is not properly
> masked in the memblock_kernel_nodemask bitmap.
> E.g. in a VM test, memblock_mark_kernel_nodes with extra pr_warn calls iterates
> over the following memblocks (ranges below are memblks base-(base+size)):
>
> [ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000000000000-0x00000000010000
> [ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000000098000-0x00000000100000
> [ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000001000000-0x00000001a5a000
> [ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000037000000-0x000000377f8000
>
> where MAX_NUMNODES is 64 because CONFIG_NODES_SHIFT=6.
> The ranges above belong to node 0, but the node's bit is never marked.
>
> With a buggy bios that marks all memory as hotpluggable, this results in a
> panic, because both checks against hotpluggable bit and memblock_kernel_bitmask
> (in early_mem_hotplug_init) fail, the numa regions have all been merged together
> and memblock_reserve_hotpluggable is called for all memory.
>
> With a correct bios (some part of initial memory is not hotplug-able) the kernel
> can boot since the hotpluggable bit check works ok, but extra dimms on node 0
> will still be allowed to be in MOVABLE_ZONE.
>
OK, I see the problem. But would you please give me a call trace that
can show
how this could happen. I think the memory block info should be the same as
numa_meminfo. Can we fix the caller to make it set nid correctly ?
> Actually this behaviour (being able to have MOVABLE memory on nodes with kernel
> reserved memblocks) sort of matches the policy I requested in v2 :). But i
> suspect that is not your intent i.e. you want memblock_kernel_nodemask_bitmap to
> prevent movable reservations for the whole node where kernel has reserved
> memblocks.
I intended to set the whole node which the kernel resides in as
un-hotpluggable.
>
> Is there a way to get accurate nid information for memblocks at early boot? I
> suspect pfn_to_nid doesn't work yet at this stage (i got a panic when I
> attempted iirc)
In such an early time, I think we can only get nid from numa_meminfo. So
as I
said above, I'd like to fix this problem by making memblock has correct nid.
And I read the patch below. I think if we get nid from numa_meminfo,
than we
don't need to call memblock_get_region_node().
Thanks. :)
>
> I used the hack below but it depends on CONFIG_NUMA, hopefully there is a
> cleaner general way:
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index cfd8c2f..af8ad2a 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -133,6 +133,19 @@ void __init setup_node_to_cpumask_map(void)
> pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
> }
>
> +int __init numa_find_range_nid(u64 start, u64 size)
> +{
> + unsigned int i;
> + struct numa_meminfo *mi =&numa_meminfo;
> +
> + for (i = 0; i< mi->nr_blks; i++) {
> + if (start>= mi->blk[i].start&& start + size -1<= mi->blk[i].end)
> + return mi->blk[i].nid;
> + }
> + return -1;
> +}
> +EXPORT_SYMBOL(numa_find_range_nid);
> +
> static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
> bool hotpluggable,
> struct numa_meminfo *mi)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77a71fb..194b7c7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1600,6 +1600,9 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
> unsigned long start, unsigned long end);
> #endif
>
> +#ifdef CONFIG_NUMA
> +int __init numa_find_range_nid(u64 start, u64 size);
> +#endif
> struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
> int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index a6b7845..284aced 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -834,15 +834,26 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
>
> void __init_memblock memblock_mark_kernel_nodes()
> {
> - int i, nid;
> + int i, nid, tmpnid;
> struct memblock_type *reserved =&memblock.reserved;
>
> for (i = 0; i< reserved->cnt; i++)
> if (reserved->regions[i].flags == MEMBLK_FLAGS_DEFAULT) {
> nid = memblock_get_region_node(&reserved->regions[i]);
> + if (nid == MAX_NUMNODES) {
> + tmpnid = numa_find_range_nid(reserved->regions[i].base,
> + reserved->regions[i].size);
> + if (tmpnid>= 0)
> + nid = tmpnid;
> + }
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index e862311..84d6e64 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -667,11 +667,7 @@ static bool srat_used __initdata;
> */
> static void __init early_x86_numa_init(void)
> {
> - /*
> - * Need to find out which nodes the kernel resides in, and arrange
> - * them as un-hotpluggable when parsing SRAT.
> - */
> - memblock_mark_kernel_nodes();
>
> if (!numa_off) {
> #ifdef CONFIG_X86_NUMAQ
> @@ -779,6 +775,12 @@ void __init early_initmem_init(void)
> load_cr3(swapper_pg_dir);
> __flush_tlb_all();
>
> + /*
> + * Need to find out which nodes the kernel resides in, and arrange
> + * them as un-hotpluggable when parsing SRAT.
> + */
> +
> + memblock_mark_kernel_nodes();
> early_mem_hotplug_init();
>
> early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
next prev parent reply other threads:[~2013-06-03 7:33 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-24 9:29 [PATCH v3 00/13] Arrange hotpluggable memory in SRAT as ZONE_MOVABLE Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-05-24 9:29 ` [PATCH v3 01/13] x86: get pg_data_t's memory from other node Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-06-03 0:31 ` Wanpeng Li
2013-06-03 0:31 ` Wanpeng Li
2013-05-24 9:29 ` [PATCH v3 02/13] acpi: Print Hot-Pluggable Field in SRAT Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-06-03 0:50 ` Wanpeng Li
2013-06-03 0:50 ` Wanpeng Li
2013-05-24 9:29 ` [PATCH v3 03/13] page_alloc, mem-hotplug: Improve movablecore to {en|dis}able using SRAT Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-06-03 0:52 ` Wanpeng Li
2013-06-03 0:52 ` Wanpeng Li
2013-05-24 9:29 ` [PATCH v3 04/13] x86, numa, acpi, memory-hotplug: Introduce hotplug info into struct numa_meminfo Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-05-24 9:29 ` [PATCH v3 05/13] x86, numa, acpi, memory-hotplug: Consider hotplug info when cleanup numa_meminfo Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-05-24 9:29 ` [PATCH v3 06/13] memblock, numa: Introduce flag into memblock Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-06-03 1:30 ` Wanpeng Li
2013-06-03 1:30 ` Wanpeng Li
2013-06-03 1:59 ` Tang Chen
2013-06-03 1:59 ` Tang Chen
2013-05-24 9:29 ` [PATCH v3 07/13] x86, numa, mem-hotplug: Mark nodes which the kernel resides in Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-05-31 16:24 ` Vasilis Liaskovitis
2013-05-31 16:24 ` Vasilis Liaskovitis
2013-06-03 7:35 ` Tang Chen [this message]
2013-06-03 7:35 ` Tang Chen
2013-06-03 13:18 ` Vasilis Liaskovitis
2013-06-03 13:18 ` Vasilis Liaskovitis
2013-06-06 9:42 ` Tang Chen
2013-06-06 9:42 ` Tang Chen
2013-05-24 9:29 ` [PATCH v3 08/13] x86, numa: Move memory_add_physaddr_to_nid() to CONFIG_NUMA Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-05-24 9:29 ` [PATCH v3 09/13] x86, numa, memblock: Introduce MEMBLK_LOCAL_NODE to mark and reserve node-life-cycle data Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-05-24 9:29 ` [PATCH v3 10/13] x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark and reserve hotpluggable memory Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-05-31 16:15 ` Vasilis Liaskovitis
2013-05-31 16:15 ` Vasilis Liaskovitis
2013-05-24 9:29 ` [PATCH v3 11/13] x86, memblock, mem-hotplug: Free hotpluggable memory reserved by memblock Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-06-03 2:57 ` Wanpeng Li
2013-06-03 2:57 ` Wanpeng Li
2013-05-24 9:29 ` [PATCH v3 12/13] x86, numa, acpi, memory-hotplug: Make movablecore=acpi have higher priority Tang Chen
2013-05-24 9:29 ` Tang Chen
2013-06-03 2:59 ` Wanpeng Li
2013-06-03 7:37 ` Tang Chen
2013-06-03 7:37 ` Tang Chen
2013-06-03 2:59 ` Wanpeng Li
2013-05-24 9:29 ` [PATCH v3 13/13] doc, page_alloc, acpi, mem-hotplug: Add doc for movablecore=acpi boot option Tang Chen
2013-05-24 9:29 ` Tang Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=51AC4759.6090101@cn.fujitsu.com \
--to=tangchen@cn.fujitsu.com \
--cc=akpm@linux-foundation.org \
--cc=gong.chen@linux.intel.com \
--cc=hpa@zytor.com \
--cc=isimatu.yasuaki@jp.fujitsu.com \
--cc=jiang.liu@huawei.com \
--cc=jweiner@redhat.com \
--cc=laijs@cn.fujitsu.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lwoodman@redhat.com \
--cc=mgorman@suse.de \
--cc=mina86@mina86.com \
--cc=minchan@kernel.org \
--cc=mingo@redhat.com \
--cc=prarit@redhat.com \
--cc=riel@redhat.com \
--cc=tj@kernel.org \
--cc=vasilis.liaskovitis@profitbricks.com \
--cc=wency@cn.fujitsu.com \
--cc=x86@kernel.org \
--cc=yinghai@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.