From: Olivier Matz <olivier.matz@6wind.com>
To: Bao-Long Tran <tranbaolong@niometrics.com>
Cc: anatoly.burakov@intel.com, arybchenko@solarflare.com,
	dev@dpdk.org, users@dpdk.org, ricudis@niometrics.com
Subject: Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
Date: Thu, 26 Dec 2019 16:45:24 +0100
Message-ID: <20191226154524.GG22738@platinum>
In-Reply-To: <AEEF393A-B56D-4F06-B54F-5AF4022B1F2D@niometrics.com>

Hi Bao-Long,

On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> Hi,
> 
> I'm not sure if this is a bug, but I've seen an inconsistency in the behavior 
> of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
> same mempool size, the number of hugepages allocated changes from run to run.
> 
> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> 
> 1. Reserve 16x1G hugepages on socket 0 
> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> make && ./build/basicfwd 
> 3. At the same time, watch the number of hugepages allocated 
> "watch -n.1 ls /dev/hugepages"
> 4. Repeat step 2
> 
> If you can reproduce, you should see that for some runs, DPDK allocates 5
> hugepages, other times it allocates 6. When it allocates 6, if you watch the
> output from step 3, you will see that DPDK first tries to allocate 5 hugepages,
> then unmaps all 5, retries, and gets 6.

I cannot reproduce it under the same conditions as yours (with 16
hugepages on socket 0), but I think I can see a similar issue:

If I reserve at least 6 hugepages, the result seems reproducible (6
hugepages are used). If I reserve only 5 hugepages, it takes more time:
hugepages are mapped and released several times, and the allocation
finally succeeds with 5 hugepages.

> For our use case, it's important that DPDK allocates the same number of
> hugepages on every run so we can get reproducible results.

One possibility is to use the --legacy-mem EAL option. It will try to
reserve all hugepages first.

> Studying the code, this seems to be the behavior of
> rte_mempool_populate_default(). If I understand correctly, if the first try fail
> to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
> condition, and eventually wound up with 6 hugepages.

No, I don't think the IOVA-contiguous constraint applies in your
case. This is what I see:

a- reserve 5 hugepages on socket 0, and start your patched basicfwd
b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
c- the total element size (with header) is 2304 + 64 = 2368
d- in rte_mempool_op_calc_mem_size_helper(), it calculates
   obj_per_page = 453438    (453438 * 2368 = 1073741184)
   mem_size = 4966058495
e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
   rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
     mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
     align=64)
   For some reason, it fails: we can see the number of mapped hugepages
   increase in /dev/hugepages, then return to its original value.
   I don't think it should fail here.
f- then, it tries to allocate the biggest available contiguous zone. In
   my case, it is 1055291776 bytes (almost all of the only mapped
   hugepage). This is a second problem: if we call it again, it returns
   NULL, because it won't map another hugepage.
g- by luck, calling rte_mempool_populate_virt() allocates a small area
   (mempool header), which triggers the mapping of a new hugepage that
   will be used in the next loop iteration, back at step d with a
   smaller mem_size.

> Questions: 
> 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory 
> is abundant? 

In my case, it looks like a bit less than 1G is free at the end of the
heap when we call rte_memzone_reserve_aligned(size=5G). The allocator
ends up mapping 5 pages (and fails), while only 4 are needed.

Anatoly, do you have any idea? Shouldn't we take into account the amount
of free space at the end of the heap when expanding?

> 2. Why does the 2nd retry need N+1 hugepages?

When the first alloc fails, the mempool code tries to allocate in
several chunks which are not virtually contiguous. This is needed in
case the memory is fragmented.

> Some insights for Q1: From my experiments, seems like the IOVA of the first
> hugepage is not guaranteed to be at the start of the IOVA space (understandably).
> It could explain the retry when the IOVA of the first hugepage is near the end of 
> the IOVA space. But I have also seen situation where the 1st hugepage is near
> the beginning of the IOVA space and it still failed the 1st time.
> 
> Here's the code:
> #include <stdio.h>
> #include <stdlib.h>
> 
> #include <rte_eal.h>
> #include <rte_mbuf.h>
> 
> int
> main(int argc, char *argv[])
> {
> 	struct rte_mempool *mbuf_pool;
> 	unsigned mbuf_pool_size = 2097151;
> 
> 	int ret = rte_eal_init(argc, argv);
> 	if (ret < 0)
> 		rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> 
> 	printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> 	mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> 		256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
> 
> 	printf("mbuf_pool %p\n", mbuf_pool);
> 
> 	return 0;
> }
> 
> Best regards,
> BL

Regards,
Olivier
