Re: [PATCH 1/2] mm/memblock: prepare a capability to support memblock near alloc

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Leizhen (ThunderTown)" <thunder.leizhen@huawei.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>, Zefan Li <lizefan@huawei.com>,
	Xinwei Hu <huxinwei@huawei.com>,
	Hanjun Guo <guohanjun@huawei.com>
Subject: Re: [PATCH 1/2] mm/memblock: prepare a capability to support memblock near alloc
Date: Thu, 27 Oct 2016 10:41:24 +0800	[thread overview]
Message-ID: <58116954.8080908@huawei.com> (raw)
In-Reply-To: <20161026093152.GE18382@dhcp22.suse.cz>



On 2016/10/26 17:31, Michal Hocko wrote:
> On Wed 26-10-16 11:10:44, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2016/10/25 21:23, Michal Hocko wrote:
>>> On Tue 25-10-16 10:59:17, Zhen Lei wrote:
>>>> If HAVE_MEMORYLESS_NODES is selected, and some memoryless numa nodes are
>>>> actually exist. The percpu variable areas and numa control blocks of that
>>>> memoryless numa nodes need to be allocated from the nearest available
>>>> node to improve performance.
>>>>
>>>> Although memblock_alloc_try_nid and memblock_virt_alloc_try_nid try the
>>>> specified nid at the first time, but if that allocation failed it will
>>>> directly drop to use NUMA_NO_NODE. This mean any nodes maybe possible at
>>>> the second time.
>>>>
>>>> To compatible the above old scene, I use a marco node_distance_ready to
>>>> control it. By default, the marco node_distance_ready is not defined in
>>>> any platforms, the above mentioned functions will work as normal as
>>>> before. Otherwise, they will try the nearest node first.
>>>
>>> I am sorry but it is absolutely unclear to me _what_ is the motivation
>>> of the patch. Is this a performance optimization, correctness issue or
>>> something else? Could you please restate what is the problem, why do you
>>> think it has to be fixed at memblock layer and describe what the actual
>>> fix is please?
>>
>> This is a performance optimization.
> 
> Do you have any numbers to back the improvements?
I have not collected any performance data, but at least in theory, it's beneficial and harmless,
except make code looks a bit urly. Because all related functions are actually defined as __init,
for example:
phys_addr_t __init memblock_alloc_try_nid(
void * __init memblock_virt_alloc_try_nid(

And all related memory(percpu variables and NODE_DATA) is mostly referred at running time.

> 
>> The problem is if some memoryless numa nodes are
>> actually exist, for example: there are total 4 nodes, 0,1,2,3, node 1 has no memory,
>> and the node distances is as below:
>>                     ---------board-------
>> 		    |                   |
>>                     |                   |
>>                  socket0             socket1
>>                    / \                 / \
>>                   /   \               /   \
>>                node0 node1         node2 node3
>> distance[1][0] is nearer than distance[1][2] and distance[1][3]. CPUs on node1 access
>> the memory of node0 is faster than node2 or node3.
>>
>> Linux defines a lot of percpu variables, each cpu has a copy of it and most of the time
>> only to access their own percpu area. In this example, we hope the percpu area of CPUs
>> on node1 allocated from node0. But without these patches, it's not sure that.
> 
> I am not familiar with the percpu allocator much so I might be
> completely missig a point but why cannot this be solved in the percpu
> allocator directly e.g. by using cpu_to_mem which should already be
> memoryless aware.
My test result told me that it can not:
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x00000011ffffffff]
[    0.000000] Could not find start_pfn for node 1
[    0.000000] Initmem setup node 1 [mem 0x0000000000000000-0x0000000000000000]
[    0.000000] Initmem setup node 2 [mem 0x0000001200000000-0x00000013ffffffff]
[    0.000000] Initmem setup node 3 [mem 0x0000001400000000-0x00000017ffffffff]


[   14.801895] NODE_DATA(0) = 0x11ffffe500
[   14.805749] NODE_DATA(1) = 0x11ffffca00	//(1), see below
[   14.809602] NODE_DATA(2) = 0x13ffffe500
[   14.813455] NODE_DATA(3) = 0x17fffe5480
[   14.817316] cpu 0 on node0: 11fff87638
[   14.821083] cpu 1 on node0: 11fff9c638
[   14.824850] cpu 2 on node0: 11fffb1638
[   14.828616] cpu 3 on node0: 11fffc6638
[   14.832383] cpu 4 on node1: 17fff8a638	//(2), see below
[   14.836149] cpu 5 on node1: 17fff9f638
[   14.839912] cpu 6 on node1: 17fffb4638
[   14.843677] cpu 7 on node1: 17fffc9638
[   14.847444] cpu 8 on node2: 13fffa4638
[   14.851210] cpu 9 on node2: 13fffb9638
[   14.854976] cpu10 on node2: 13fffce638
[   14.858742] cpu11 on node2: 13fffe3638
[   14.862510] cpu12 on node3: 17fff36638
[   14.866276] cpu13 on node3: 17fff4b638
[   14.870042] cpu14 on node3: 17fff60638
[   14.873809] cpu15 on node3: 17fff75638

(1) memblock_alloc_try_nid and with these patches, memory was allocated from node0
(2) do the same implementation as X86 and PowerPC, memory was allocated from node3:
    	return  __alloc_bootmem_node(NODE_DATA(nid), size, align, __pa(MAX_DMA_ADDRESS));

I'm not sure how about on X86 and PowerPC, here is my test cases. Is anybody interested and
have testing environment, can you help me to execute it?

static int tst_numa_002(void)
{
        int i;

        for (i = 0; i < nr_node_ids; i++)
                pr_info("NODE_DATA(%d) = 0x%llx\n", i, virt_to_phys(NODE_DATA(i)));

        return 0;
}

static int tst_numa_003(void)
{
        int cpu;
        void __percpu *p;

        p = __alloc_percpu(0x100, 1);

        for_each_possible_cpu(cpu)
                pr_info("cpu%2d on node%d: %llx\n", cpu, cpu_to_node(cpu), per_cpu_ptr_to_phys(per_cpu_ptr(p, cpu)));

        free_percpu(p);

        return 0;
}

> 
> Generating a new API while we have means to use an existing one sounds
> just not right to me.
Yes, so I gave up to create two new functions and selected this implementation.

> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2016-10-27  2:43 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-25  2:59 [PATCH 0/2] to support memblock near alloc and memoryless on arm64 Zhen Lei
2016-10-25  2:59 ` [PATCH 1/2] mm/memblock: prepare a capability to support memblock near alloc Zhen Lei
2016-10-25 13:23   ` Michal Hocko
2016-10-26  3:10     ` Leizhen (ThunderTown)
2016-10-26  9:31       ` Michal Hocko
2016-10-27  2:41         ` Leizhen (ThunderTown) [this message]
2016-10-27  7:22           ` Michal Hocko
2016-10-27  8:23             ` Leizhen (ThunderTown)
2016-10-25  2:59 ` [PATCH 2/2] arm64/numa: support HAVE_MEMORYLESS_NODES Zhen Lei
2016-10-26 18:36   ` Will Deacon
2016-10-27  3:54     ` Leizhen (ThunderTown)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=58116954.8080908@huawei.com \
    --to=thunder.leizhen@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=guohanjun@huawei.com \
    --cc=huxinwei@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizefan@huawei.com \
    --cc=mhocko@kernel.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).