From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sergio Gonzalez Monroy
Subject: Re: [PATCH] mem: balanced allocation of hugepages
Date: Mon, 27 Mar 2017 14:01:59 +0100
Message-ID: <077682cf-8534-7890-9453-7c9e822bd3e6@intel.com>
References: <1487250070-13973-1-git-send-email-i.maximets@samsung.com>
 <50517d4c-5174-f4b2-e77e-143f7aac2c00@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Heetae Ahn, Yuanhan Liu, Jianfeng Tan, Neil Horman, Yulong Pei,
 stable@dpdk.org, Thomas Monjalon, Bruce Richardson
To: Ilya Maximets, dev@dpdk.org, David Marchand
List-Id: DPDK patches and discussions

On 09/03/2017 12:57, Ilya Maximets wrote:
> On 08.03.2017 16:46, Sergio Gonzalez Monroy wrote:
>> Hi Ilya,
>>
>> I have done similar tests and, as you already pointed out, 'numactl --interleave' does not seem to work as expected.
>> I have also checked that the issue can be reproduced with a quota limit on the hugetlbfs mount point.
>>
>> I would be inclined towards *adding libnuma as a dependency* to DPDK to make memory allocation a bit more reliable.
>>
>> Currently, at a high level, hugepages per NUMA node are handled as follows:
>> 1) Try to map all free hugepages. The total number of mapped hugepages depends on whether there are any limits, such as cgroups or a quota on the mount point.
>> 2) Find out the NUMA node of each hugepage.
>> 3) Check whether we have enough hugepages for the requested memory in each NUMA socket/node.
>>
>> Using libnuma we could instead try to allocate hugepages per NUMA node:
>> 1) Try to map as many hugepages as possible from NUMA node 0.
>> 2) Check whether we have enough hugepages for the requested memory on NUMA node 0.
>> 3) Try to map as many hugepages as possible from NUMA node 1.
>> 4) Check whether we have enough hugepages for the requested memory on NUMA node 1.
>>
>> This approach would improve the failing scenarios caused by limits, but it would still not fix the issues regarding non-contiguous hugepages (worst case: each hugepage is a separate memseg).
>> The non-contiguous hugepage issue is not as critical now that mempools can span multiple memsegs/hugepages, but it is still a problem for any other library requiring big chunks of memory.
>>
>> Potentially, if we were to add an option such as 'iommu-only' when all devices are bound to vfio-pci, we could have a reliable way to allocate hugepages by just requesting the number of pages from each NUMA node.
>>
>> Thoughts?
> Hi Sergio,
>
> Thanks for your attention to this.
>
> For now, as we have some issues with non-contiguous
> hugepages, I'm thinking about the following hybrid schema:
> 1) Allocate essential hugepages:
>    1.1) Allocate only as many hugepages from NUMA node N
>         as are needed to fit the memory requested for that node.
>    1.2) Repeat 1.1 for all NUMA nodes.
> 2) Try to map all remaining free hugepages in a round-robin
>    fashion, like in this patch.
> 3) Sort the pages and choose the most suitable ones.
>
> This solution should decrease the number of issues caused by
> non-contiguous memory.

Sorry for the late reply; I was hoping for more comments from the community.

IMHO this should be the default behavior, which means no config option and libnuma as an EAL dependency.

I think your proposal is good. Could you consider implementing such an approach for the next release?

Regards.

> Best regards, Ilya Maximets.
>
>> On 06/03/2017 09:34, Ilya Maximets wrote:
>>> Hi all.
>>>
>>> So, what about this change?
>>>
>>> Best regards, Ilya Maximets.
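[Editor's illustration] A minimal sketch of the per-node mapping idea discussed above (the per-node proposal and step 1 of the hybrid schema) could look like the code below. This is not the EAL implementation: map_node_hugepages(), the 1GB HUGEPAGE_SZ constant and the file naming are assumptions made up for the example. Build with -lnuma.

/* Illustrative only: map hugepages for one NUMA node at a time by making
 * that node the preferred one before each mmap() on the hugetlbfs mount. */
#include <fcntl.h>
#include <numa.h>          /* numa_available(), numa_set_preferred(), ... */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SZ (1ULL << 30)    /* assume 1GB pages, as in the example */

/* Map up to 'needed' hugepages, preferring physical allocation from 'node'.
 * Returns the number of pages actually mapped (fewer if a limit is hit). */
static int map_node_hugepages(const char *hugedir, int node, int needed)
{
    int mapped = 0;

    if (numa_available() < 0)
        return -1;                  /* kernel without NUMA support */

    numa_set_preferred(node);       /* MPOL_PREFERRED for this task */

    for (int i = 0; i < needed; i++) {
        char path[256];
        snprintf(path, sizeof(path), "%s/node%d_map_%d", hugedir, node, i);

        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            break;

        void *va = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        close(fd);
        if (va == MAP_FAILED) {
            unlink(path);
            break;                  /* cgroup/quota limit or no pages left */
        }

        /* Touch the page so it is physically allocated right now,
         * while the preferred policy above is still in effect. */
        *(volatile char *)va = 0;
        mapped++;
    }

    numa_set_localalloc();          /* restore the default policy */
    return mapped;
}

With --socket-mem=4096,4096 and 1GB pages, calling this once per node with needed=4 would try to satisfy each socket before any remaining free pages are mapped, which is essentially step 1 of the hybrid schema.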
>>>
>>> On 16.02.2017 16:01, Ilya Maximets wrote:
>>>> Currently EAL allocates hugepages one by one, not paying
>>>> attention to which NUMA node the allocation was made from.
>>>>
>>>> Such behaviour leads to allocation failures if the number of
>>>> hugepages available to the application is limited by cgroups
>>>> or hugetlbfs and memory is requested not only from the first
>>>> socket.
>>>>
>>>> Example:
>>>> # 90 x 1GB hugepages available in the system
>>>>
>>>> cgcreate -g hugetlb:/test
>>>> # Limit to 32GB of hugepages
>>>> cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
>>>> # Request 4GB from each of 2 sockets
>>>> cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
>>>>
>>>> EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
>>>> EAL: 32 not 90 hugepages of size 1024 MB allocated
>>>> EAL: Not enough memory available on socket 1!
>>>>      Requested: 4096MB, available: 0MB
>>>> PANIC in rte_eal_init():
>>>> Cannot init memory
>>>>
>>>> This happens because all allocated pages are on socket 0.
>>>>
>>>> Fix this issue by setting the mempolicy MPOL_PREFERRED for each
>>>> hugepage to one of the requested nodes in a round-robin fashion.
>>>> In this case all allocated pages will be fairly distributed
>>>> between all requested nodes.
>>>>
>>>> A new config option, RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES, is
>>>> introduced and disabled by default because of the external
>>>> dependency on libnuma.
>>>>
>>>> Cc:
>>>> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
>>>>
>>>> Signed-off-by: Ilya Maximets
>>>> ---
>>>>  config/common_base                       |  1 +
>>>>  lib/librte_eal/Makefile                  |  4 ++
>>>>  lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
>>>>  mk/rte.app.mk                            |  3 ++
>>>>  4 files changed, 74 insertions(+)

Acked-by: Sergio Gonzalez Monroy
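[Editor's illustration] The round-robin MPOL_PREFERRED approach described in the quoted commit message can be sketched roughly as follows. This is not the patch itself: the anonymous MAP_HUGETLB mapping and the 2MB page size are stand-ins for the EAL's per-page hugetlbfs files, and map_pages_round_robin() is a name made up for the example. Build with -lnuma.

/* Rough sketch of the round-robin MPOL_PREFERRED idea, not the actual patch. */
#define _GNU_SOURCE
#include <numaif.h>        /* set_mempolicy(), MPOL_PREFERRED, MPOL_DEFAULT */
#include <stdio.h>
#include <sys/mman.h>

#define HUGEPAGE_SZ (2UL * 1024 * 1024)   /* assume 2MB default hugepages */

/* 'nodes' holds the sockets requested via --socket-mem, e.g. {0, 1}. */
static int map_pages_round_robin(const int *nodes, int nb_nodes, int nb_pages)
{
    int mapped = 0;

    for (int i = 0; i < nb_pages; i++) {
        /* Cycle the preferred node over the requested sockets so the
         * allocated pages end up fairly distributed between them. */
        unsigned long mask = 1UL << nodes[i % nb_nodes];

        if (set_mempolicy(MPOL_PREFERRED, &mask, sizeof(mask) * 8) < 0) {
            perror("set_mempolicy");
            break;
        }

        void *va = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (va == MAP_FAILED)
            break;                        /* pool empty or cgroup/quota limit */

        *(volatile char *)va = 0;         /* fault the page in under the policy */
        mapped++;
    }

    /* Drop the preference once mapping is done. */
    set_mempolicy(MPOL_DEFAULT, NULL, 0);
    return mapped;
}

In the failing example above (--socket-mem=4096,4096 with a 32GB cgroup limit), cycling the preference between sockets 0 and 1 is what keeps socket 1 from ending up with "available: 0MB".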