qemu-devel.nongnu.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Michal Privoznik <mprivozn@redhat.com>, qemu-devel@nongnu.org
Cc: imammedo@redhat.com, marcandre.lureau@redhat.com, berrange@redhat.com
Subject: Re: [PATCH] hostmem: Honour multiple preferred nodes if possible
Date: Wed, 14 Dec 2022 11:54:31 +0100
Message-ID: <da19b9fa-3032-b355-e0f4-7ae5e27e09ab@redhat.com>
In-Reply-To: <ba02465fc48807eddea9ad646fca7cc92f929ae7.1670603308.git.mprivozn@redhat.com>

On 09.12.22 17:29, Michal Privoznik wrote:
> If a memory-backend is configured with mode
> HOST_MEM_POLICY_PREFERRED then
> host_memory_backend_memory_complete() calls mbind() as:
> 
>    mbind(..., MPOL_PREFERRED, nodemask, ...);
> 
> Here, 'nodemask' is a bitmap of host NUMA nodes and corresponds
> to the .host-nodes attribute. Therefore, there can be multiple
> nodes specified. However, the documentation to MPOL_PREFERRED
> says:
> 
>    MPOL_PREFERRED
>      This mode sets the preferred node for allocation. ...
>      If nodemask specifies more than one node ID, the first node
>      in the mask will be selected as the preferred node.
> 
> Therefore, only the first node is honoured and the rest is

s/honoured/honored/

> silently ignored. Well, with recent changes to the kernel and
> numactl we can do better.

Yeah, I think this "silent" ignoring was part of the design of both the 
kernel feature and the QEMU feature. Yes, we can do better now.

> 
> Firstly, new mode - MPOL_PREFERRED_MANY - was introduced to
> kernel (v5.15-rc1~107^2~21) which now accepts multiple NUMA
> nodes.

Maybe give the kernel commit instead:

"The Linux kernel added support for MPOL_PREFERRED_MANY in v5.15 via 
commit cfcaa66f8032 (""), which accepts multiple preferred NUMA nodes 
instead of a single one."

> 
> Then, numa_has_preferred_many() API was introduced to numactl
> (v2.0.15~26) allowing applications to query kernel support.
> 
> Wiring this all together, we can pass MPOL_PREFERRED_MANY to the
> mbind() call instead and stop ignoring multiple nodes, silently.
> 
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>   backends/hostmem.c | 28 ++++++++++++++++++++++++++++
>   meson.build        |  5 +++++
>   2 files changed, 33 insertions(+)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index 8640294c10..e0d6cb6c8a 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -23,10 +23,22 @@
>   
>   #ifdef CONFIG_NUMA
>   #include <numaif.h>
> +#include <numa.h>
>   QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_DEFAULT != MPOL_DEFAULT);
> +/*
> + * HOST_MEM_POLICY_PREFERRED may some time also by MPOL_PREFERRED_MANY, see
> + * below.

I failed to parse that sentence. :)

"
HOST_MEM_POLICY_PREFERRED may translate to either MPOL_PREFERRED or 
MPOL_PREFERRED_MANY, see comments further below.
"

?

> + */
>   QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_PREFERRED != MPOL_PREFERRED);
>   QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
>   QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
> +
> +/*
> + * -1 for uninitialized,
> + *  0 for MPOL_PREFERRED_MANY unsupported,
> + *  1 for supported.
> + */
> +static int has_preferred_many = -1;

maybe "has_mpol_preferred_many" or "supports_mpol_preferred_many" instead.

... but why do we have to cache that value at all? ...

>   #endif
>   
>   char *
> @@ -346,6 +358,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>            * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
>            * this doesn't catch hugepage case. */
>           unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
> +        int mode = backend->policy;
>   
>           /* check for invalid host-nodes and policies and give more verbose
>            * error messages than mbind(). */
> @@ -369,6 +382,21 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>                  BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
>           assert(maxnode <= MAX_NODES);
>   
> +#ifdef HAVE_NUMA_SET_PREFERRED_MANY
> +        if (has_preferred_many < 0) {
> +            /* Check, whether kernel supports MPOL_PREFERRED_MANY. */
> +            has_preferred_many = numa_has_preferred_many() > 0 ? 1 : 0;
> +        }
> +
> +        if (mode == MPOL_PREFERRED && has_preferred_many > 0) {
> +            /*
> +             * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
> +             * silently picks the first node.
> +             */
> +            mode = MPOL_PREFERRED_MANY;
> +        }
> +#endif

... maybe simply not cache the value?


#ifdef HAVE_NUMA_SET_PREFERRED_MANY
	if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
		/* ... */
		mode = MPOL_PREFERRED_MANY;
	}
#endif
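
To spell out the uncached variant a bit more, here is a minimal 
standalone sketch of the mode selection. It is only an illustration, not 
the actual hostmem.c code: the SKETCH_ constants and 
sketch_pick_mbind_mode() are hypothetical names, the constants mirror the 
Linux uapi values so the snippet builds without <numaif.h>, and the 
has_preferred_many argument stands in for the return value of 
numa_has_preferred_many().

```c
#include <assert.h>

/*
 * Local copies of the Linux mempolicy mode values (assumption: they
 * mirror the uapi header), prefixed with SKETCH_ so this example does
 * not depend on <numaif.h> or libnuma.
 */
enum {
    SKETCH_MPOL_PREFERRED = 1,
    SKETCH_MPOL_PREFERRED_MANY = 5,
};

/*
 * Hypothetical helper: pick the mode to pass to mbind().
 *
 * 'has_preferred_many' stands in for numa_has_preferred_many(); the
 * caller queries it on each call, so no static tri-state cache is
 * needed.
 */
static int sketch_pick_mbind_mode(int policy, int has_preferred_many)
{
    /* mbind() modes are non-negative. */
    assert(policy >= 0);

    if (policy == SKETCH_MPOL_PREFERRED && has_preferred_many > 0) {
        /* Upgrade so that all preferred nodes are honored. */
        return SKETCH_MPOL_PREFERRED_MANY;
    }

    /* Every other policy, or an unsupported kernel: leave unchanged. */
    return policy;
}
```

With something along these lines, the selected value only takes effect 
if it is what gets passed as the mode argument of mbind().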

> +
>           if (maxnode &&
>               mbind(ptr, sz, backend->policy, backend->host_nodes, maxnode + 1,
>                     flags)) {
> diff --git a/meson.build b/meson.build
> index 5c6b5a1c75..ebbff7a8ea 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1858,6 +1858,11 @@ config_host_data.set('CONFIG_LINUX_AIO', libaio.found())
>   config_host_data.set('CONFIG_LINUX_IO_URING', linux_io_uring.found())
>   config_host_data.set('CONFIG_LIBPMEM', libpmem.found())
>   config_host_data.set('CONFIG_NUMA', numa.found())
> +if numa.found()
> +  config_host_data.set('HAVE_NUMA_SET_PREFERRED_MANY',
> +                       cc.has_function('numa_set_preferred_many',
> +                                       dependencies: numa))

You're using numa_has_preferred_many(), so better check for that and use 
HAVE_NUMA_HAS_PREFERRED_MANY?

Thanks!

-- 
Thanks,

David / dhildenb


