From: David Hildenbrand <david@redhat.com>
To: Michal Privoznik <mprivozn@redhat.com>, qemu-devel@nongnu.org
Cc: imammedo@redhat.com, marcandre.lureau@redhat.com, berrange@redhat.com
Subject: Re: [PATCH] hostmem: Honour multiple preferred nodes if possible
Date: Wed, 14 Dec 2022 11:54:31 +0100 [thread overview]
Message-ID: <da19b9fa-3032-b355-e0f4-7ae5e27e09ab@redhat.com> (raw)
In-Reply-To: <ba02465fc48807eddea9ad646fca7cc92f929ae7.1670603308.git.mprivozn@redhat.com>
On 09.12.22 17:29, Michal Privoznik wrote:
> If a memory-backend is configured with mode
> HOST_MEM_POLICY_PREFERRED then
> host_memory_backend_memory_complete() calls mbind() as:
>
> mbind(..., MPOL_PREFERRED, nodemask, ...);
>
> Here, 'nodemask' is a bitmap of host NUMA nodes and corresponds
> to the .host-nodes attribute. Therefore, there can be multiple
> nodes specified. However, the documentation to MPOL_PREFERRED
> says:
>
> MPOL_PREFERRED
> This mode sets the preferred node for allocation. ...
> If nodemask specifies more than one node ID, the first node
> in the mask will be selected as the preferred node.
>
> Therefore, only the first node is honoured and the rest is
s/honoured/honored/
> silently ignored. Well, with recent changes to the kernel and
> numactl we can do better.
Yeah, I think this "silent" ignoring was part of the design for both,
the kernel feature and the QEMU feature. Yes, we can do better now.
>
> Firstly, new mode - MPOL_PREFERRED_MANY - was introduced to
> kernel (v5.15-rc1~107^2~21) which now accepts multiple NUMA
> nodes.
Maybe give the kernel commit instead
"The Linux kernel added in v5.15 via commit cfcaa66f8032 ("") support
for MPOL_PREFERRED_MANY, which accepts multiple preferred NUMA nodes
instead.
>
> Then, numa_has_preferred_many() API was introduced to numactl
> (v2.0.15~26) allowing applications to query kernel support.
>
> Wiring this all together, we can pass MPOL_PREFERRED_MANY to the
> mbind() call instead and stop ignoring multiple nodes, silently.
>
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
> backends/hostmem.c | 28 ++++++++++++++++++++++++++++
> meson.build | 5 +++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index 8640294c10..e0d6cb6c8a 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -23,10 +23,22 @@
>
> #ifdef CONFIG_NUMA
> #include <numaif.h>
> +#include <numa.h>
> QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_DEFAULT != MPOL_DEFAULT);
> +/*
> + * HOST_MEM_POLICY_PREFERRED may some time also by MPOL_PREFERRED_MANY, see
> + * below.
I failed to parse that sentence. :)
"
HOST_MEM_POLICY_PREFERRED may either transalte to MPOL_PREFERRED or
MPOL_PREFERRED_MANY, see comments further below.
"
?
> + */
> QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_PREFERRED != MPOL_PREFERRED);
> QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
> QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
> +
> +/*
> + * -1 for uninitialized,
> + * 0 for MPOL_PREFERRED_MANY unsupported,
> + * 1 for supported.
> + */
> +static int has_preferred_many = -1;
maybe "has_mpol_preferred_many" or "supports_mpol_preferred_many" instead.
... but why do we have to cache that value at all? ...
> #endif
>
> char *
> @@ -346,6 +358,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
> * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
> * this doesn't catch hugepage case. */
> unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
> + int mode = backend->policy;
>
> /* check for invalid host-nodes and policies and give more verbose
> * error messages than mbind(). */
> @@ -369,6 +382,21 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
> BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
> assert(maxnode <= MAX_NODES);
>
> +#ifdef HAVE_NUMA_SET_PREFERRED_MANY
> + if (has_preferred_many < 0) {
> + /* Check, whether kernel supports MPOL_PREFERRED_MANY. */
> + has_preferred_many = numa_has_preferred_many() > 0 ? 1 : 0;
> + }
> +
> + if (mode == MPOL_PREFERRED && has_preferred_many > 0) {
> + /*
> + * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
> + * silently picks the first node.
> + */
> + mode = MPOL_PREFERRED_MANY;
> + }
> +#endif
... maybe simply not cache the value?
#ifdef HAVE_NUMA_SET_PREFERRED_MANY
if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
/* ... */
mode = MPOL_PREFERRED_MANY;
}
#endif
> +
> if (maxnode &&
> mbind(ptr, sz, backend->policy, backend->host_nodes, maxnode + 1,
> flags)) {
> diff --git a/meson.build b/meson.build
> index 5c6b5a1c75..ebbff7a8ea 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1858,6 +1858,11 @@ config_host_data.set('CONFIG_LINUX_AIO', libaio.found())
> config_host_data.set('CONFIG_LINUX_IO_URING', linux_io_uring.found())
> config_host_data.set('CONFIG_LIBPMEM', libpmem.found())
> config_host_data.set('CONFIG_NUMA', numa.found())
> +if numa.found()
> + config_host_data.set('HAVE_NUMA_SET_PREFERRED_MANY',
> + cc.has_function('numa_set_preferred_many',
> + dependencies: numa))
You're using numa_has_preferred_many(), so better check for that and use
HAVE_NUMA_HAS_PREFERRED_MANY?
Thanks!
--
Thanks,
David / dhildenb
prev parent reply other threads:[~2022-12-14 10:55 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-12-09 16:29 [PATCH] hostmem: Honour multiple preferred nodes if possible Michal Privoznik
2022-12-14 10:54 ` David Hildenbrand [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=da19b9fa-3032-b355-e0f4-7ae5e27e09ab@redhat.com \
--to=david@redhat.com \
--cc=berrange@redhat.com \
--cc=imammedo@redhat.com \
--cc=marcandre.lureau@redhat.com \
--cc=mprivozn@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).