From: Mark Kanda <mark.kanda@oracle.com>
To: David Hildenbrand <david@redhat.com>, qemu-devel@nongnu.org
Cc: pbonzini@redhat.com, berrange@redhat.com
Subject: Re: [PATCH v2 0/2] Initialize backend memory objects in parallel
Date: Mon, 29 Jan 2024 16:59:10 -0600 [thread overview]
Message-ID: <93a62006-328c-40f4-a18e-5fbf89cba49f@oracle.com> (raw)
In-Reply-To: <c15161eb-f52c-4a82-8b4b-0ba03842188c@redhat.com>
On 1/29/24 1:11 PM, David Hildenbrand wrote:
> On 22.01.24 16:32, Mark Kanda wrote:
>> v2:
>> - require MADV_POPULATE_WRITE (simplify the implementation)
>> - require prealloc context threads to ensure optimal thread placement
>> - use machine phase 'initialized' to detremine when to allow parallel
>> init
>>
>> QEMU initializes preallocated backend memory when parsing the
>> corresponding
>> objects from the command line. In certain scenarios, such as memory
>> being
>> preallocated across multiple numa nodes, this approach is not optimal
>> due to
>> the unnecessary serialization.
>>
>> This series addresses this issue by initializing the backend memory
>> objects in
>> parallel.
>
> I just played with the code, some comments:
>
> * I suggest squashing both patches. It doesn't make things clearer if we
> factor out unconditionally adding contexts to a global list.
>
> * Keep the functions MT-capable, at least as long as async=false. That
> is, don't involve the global list if async=false. virtio-mem will
> perform preallocation from other threads at some point, where we could
> see concurrent preallocations for different devices. I made sure that
> qemu_mem_prealloc() can handle that.
>
> * Rename wait_mem_prealloc() to qemu_finish_async_mem_prealloc() and let
> it report the error / return true/false like qemu_prealloc_mem().
> Especially, don't change the existing
> "qemu_prealloc_mem: preallocating memory failed" error message.
>
> * Do the conditional async=false fixup in touch_all_pages(). That means,
> in qemu_prealloc_mem(), only route the async parameter through.
>
>
> One thing I don't quite like is what happens when multiple threads
> would try
> issuing "async=true". It will currently not happen, but we should catch
> whenever that happens and require that only one thread at a time can
> perform async preallocs. Maybe we can assert in qemu_prealloc_mem()/
> qemu_finish_async_mem_prealloc() that we hold the BQL. Hopefully, that
> is the case when we start creating memory backends, before starting the
> main loop. If not, maybe we should just document that async limitation.
>
> Ideally, we'd have some async_start(), prealloc(), prealloc(),
> async_finish() interface, where async_start() would block until
> another thread called async_finish(), so we never have a mixture.
> But that would currently be over-engineering.
>
>
> I'll attach the untested, likely broken, code I played with to see
> what it could look like. Observe how I only conditionally add the
> context to the list at the end of touch_all_pages().
>
Thank you very much for the feedback David. I'll take a close look at
this for v3.
Best regards,
-Mark
>
> From fe26cc5252f1284efa8e667310609a22c6166324 Mon Sep 17 00:00:00 2001
> From: Mark Kanda <mark.kanda@oracle.com>
> Date: Mon, 22 Jan 2024 09:32:18 -0600
> Subject: [PATCH] oslib-posix: initialize selected backend memory
> objects in
> parallel
>
> QEMU initializes preallocated backend memory as the objects are parsed
> from
> the command line. This is not optimal in some cases (e.g. memory spanning
> multiple NUMA nodes) because the memory objects are initialized in
> series.
>
> Allow the initialization to occur in parallel. In order to ensure optimal
> thread placement, parallel initialization requires prealloc context
> threads
> to be in use.
>
> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> backends/hostmem.c | 8 ++-
> hw/virtio/virtio-mem.c | 4 +-
> include/qemu/osdep.h | 19 +++++-
> system/vl.c | 8 +++
> util/oslib-posix.c | 130 +++++++++++++++++++++++++++++++----------
> util/oslib-win32.c | 8 ++-
> 6 files changed, 140 insertions(+), 37 deletions(-)
>
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index 30f69b2cb5..8f602dc86f 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -20,6 +20,7 @@
> #include "qom/object_interfaces.h"
> #include "qemu/mmap-alloc.h"
> #include "qemu/madvise.h"
> +#include "hw/qdev-core.h"
>
> #ifdef CONFIG_NUMA
> #include <numaif.h>
> @@ -235,9 +236,10 @@ static void
> host_memory_backend_set_prealloc(Object *obj, bool value,
> int fd = memory_region_get_fd(&backend->mr);
> void *ptr = memory_region_get_ram_ptr(&backend->mr);
> uint64_t sz = memory_region_size(&backend->mr);
> + bool async = !phase_check(PHASE_MACHINE_INITIALIZED);
>
> if (!qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,
> - backend->prealloc_context, errp)) {
> + backend->prealloc_context, async,
> errp)) {
> return;
> }
> backend->prealloc = true;
> @@ -323,6 +325,7 @@ host_memory_backend_memory_complete(UserCreatable
> *uc, Error **errp)
> HostMemoryBackendClass *bc = MEMORY_BACKEND_GET_CLASS(uc);
> void *ptr;
> uint64_t sz;
> + bool async = !phase_check(PHASE_MACHINE_INITIALIZED);
>
> if (!bc->alloc) {
> return;
> @@ -398,7 +401,8 @@ host_memory_backend_memory_complete(UserCreatable
> *uc, Error **errp)
> if (backend->prealloc &&
> !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
> ptr, sz,
> backend->prealloc_threads,
> - backend->prealloc_context, errp)) {
> + backend->prealloc_context,
> + async, errp)) {
> return;
> }
> }
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 99ab989852..ffd119ebac 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -605,7 +605,7 @@ static int virtio_mem_set_block_state(VirtIOMEM
> *vmem, uint64_t start_gpa,
> int fd = memory_region_get_fd(&vmem->memdev->mr);
> Error *local_err = NULL;
>
> - if (!qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err)) {
> + if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false,
> &local_err)) {
> static bool warned;
>
> /*
> @@ -1248,7 +1248,7 @@ static int
> virtio_mem_prealloc_range_cb(VirtIOMEM *vmem, void *arg,
> int fd = memory_region_get_fd(&vmem->memdev->mr);
> Error *local_err = NULL;
>
> - if (!qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err)) {
> + if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false,
> &local_err)) {
> error_report_err(local_err);
> return -ENOMEM;
> }
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index c9692cc314..ed48f3d028 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -680,6 +680,8 @@ typedef struct ThreadContext ThreadContext;
> * @area: start address of the are to preallocate
> * @sz: the size of the area to preallocate
> * @max_threads: maximum number of threads to use
> + * @tc: prealloc context threads pointer, NULL if not in use
> + * @async: request asynchronous preallocation, requires @tc
> * @errp: returns an error if this function fails
> *
> * Preallocate memory (populate/prefault page tables writable) for
> the virtual
> @@ -687,10 +689,25 @@ typedef struct ThreadContext ThreadContext;
> * each page in the area was faulted in writable at least once, for
> example,
> * after allocating file blocks for mapped files.
> *
> + * When setting @async, allocation might be performed asynchronously.
> + * qemu_finish_async_mem_prealloc() must be called to finish any
> asyncronous
> + * preallocation, reporting any preallocation error.
> + *
> * Return: true on success, else false setting @errp with error.
> */
> bool qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
> - ThreadContext *tc, Error **errp);
> + ThreadContext *tc, bool async, Error **errp);
> +
> +/**
> + * qemu_finish_async_mem_prealloc:
> + * @errp: returns an error if this function fails
> + *
> + * Finish any outstanding memory prealloc to complete, reporting any
> error
> + * like qemu_prealloc_mem() would.
> + *
> + * Return: true on success, else false setting @errp with error.
> + */
> +bool qemu_finish_async_mem_prealloc(Error **errp);
>
> /**
> * qemu_get_pid_name:
> diff --git a/system/vl.c b/system/vl.c
> index 788d88ea03..290bb3232b 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -2009,6 +2009,14 @@ static void qemu_create_late_backends(void)
>
> object_option_foreach_add(object_create_late);
>
> + /*
> + * Wait for any outstanding memory prealloc from created memory
> + * backends to complete.
> + */
> + if (!qemu_finish_async_mem_prealloc(&error_fatal)) {
> + exit(1);
> + }
> +
> if (tpm_init() < 0) {
> exit(1);
> }
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 7c297003b9..c37548abdc 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -42,6 +42,7 @@
> #include "qemu/cutils.h"
> #include "qemu/units.h"
> #include "qemu/thread-context.h"
> +#include "qemu/main-loop.h"
>
> #ifdef CONFIG_LINUX
> #include <sys/syscall.h>
> @@ -63,11 +64,15 @@
>
> struct MemsetThread;
>
> +static QLIST_HEAD(, MemsetContext) memset_contexts =
> + QLIST_HEAD_INITIALIZER(memset_contexts);
> +
> typedef struct MemsetContext {
> bool all_threads_created;
> bool any_thread_failed;
> struct MemsetThread *threads;
> int num_threads;
> + QLIST_ENTRY(MemsetContext) next;
> } MemsetContext;
>
> struct MemsetThread {
> @@ -412,19 +417,44 @@ static inline int get_memset_num_threads(size_t
> hpagesize, size_t numpages,
> return ret;
> }
>
> +static int wait_and_free_mem_prealloc_context(MemsetContext *context)
> +{
> + int i, ret = 0, tmp;
> +
> + for (i = 0; i < context->num_threads; i++) {
> + tmp =
> (uintptr_t)qemu_thread_join(&context->threads[i].pgthread);
> +
> + if (tmp) {
> + ret = tmp;
> + }
> + }
> + g_free(context->threads);
> + g_free(context);
> + return ret;
> +}
> +
> static int touch_all_pages(char *area, size_t hpagesize, size_t
> numpages,
> - int max_threads, ThreadContext *tc,
> + int max_threads, ThreadContext *tc, bool
> async,
> bool use_madv_populate_write)
> {
> static gsize initialized = 0;
> - MemsetContext context = {
> - .num_threads = get_memset_num_threads(hpagesize, numpages,
> max_threads),
> - };
> + MemsetContext *context = g_new0(MemsetContext, 1);
> size_t numpages_per_thread, leftover;
> void *(*touch_fn)(void *);
> - int ret = 0, i = 0;
> + int ret, i = 0;
> char *addr = area;
>
> + /*
> + * Async prealloc is only allowed when using MADV_POPULATE_WRITE and
> + * prealloc context (to ensure optimal thread placement).
> + */
> + if (!use_madv_populate_write || !tc) {
> + async = false;
> + }
> +
> + context->num_threads = get_memset_num_threads(hpagesize, numpages,
> + max_threads);
> +
> if (g_once_init_enter(&initialized)) {
> qemu_mutex_init(&page_mutex);
> qemu_cond_init(&page_cond);
> @@ -432,8 +462,11 @@ static int touch_all_pages(char *area, size_t
> hpagesize, size_t numpages,
> }
>
> if (use_madv_populate_write) {
> - /* Avoid creating a single thread for MADV_POPULATE_WRITE */
> - if (context.num_threads == 1) {
> + /*
> + * Avoid creating a single thread for MADV_POPULATE_WRITE when
> + * preallocating synchronously.
> + */
> + if (context->num_threads == 1 && !async) {
> if (qemu_madvise(area, hpagesize * numpages,
> QEMU_MADV_POPULATE_WRITE)) {
> return -errno;
> @@ -445,50 +478,85 @@ static int touch_all_pages(char *area, size_t
> hpagesize, size_t numpages,
> touch_fn = do_touch_pages;
> }
>
> - context.threads = g_new0(MemsetThread, context.num_threads);
> - numpages_per_thread = numpages / context.num_threads;
> - leftover = numpages % context.num_threads;
> - for (i = 0; i < context.num_threads; i++) {
> - context.threads[i].addr = addr;
> - context.threads[i].numpages = numpages_per_thread + (i <
> leftover);
> - context.threads[i].hpagesize = hpagesize;
> - context.threads[i].context = &context;
> + context->threads = g_new0(MemsetThread, context->num_threads);
> + numpages_per_thread = numpages / context->num_threads;
> + leftover = numpages % context->num_threads;
> + for (i = 0; i < context->num_threads; i++) {
> + context->threads[i].addr = addr;
> + context->threads[i].numpages = numpages_per_thread + (i <
> leftover);
> + context->threads[i].hpagesize = hpagesize;
> + context->threads[i].context = context;
> if (tc) {
> - thread_context_create_thread(tc,
> &context.threads[i].pgthread,
> + thread_context_create_thread(tc,
> &context->threads[i].pgthread,
> "touch_pages",
> - touch_fn, &context.threads[i],
> + touch_fn, &context->threads[i],
> QEMU_THREAD_JOINABLE);
> } else {
> - qemu_thread_create(&context.threads[i].pgthread,
> "touch_pages",
> - touch_fn, &context.threads[i],
> + qemu_thread_create(&context->threads[i].pgthread, "touch_pages",
> + touch_fn, &context->threads[i],
> QEMU_THREAD_JOINABLE);
> }
> - addr += context.threads[i].numpages * hpagesize;
> + addr += context->threads[i].numpages * hpagesize;
> + }
> +
> + if (async) {
> + /*
> + * async requests currently require the BQL. Add it to the
> list and kick
> + * preallocation off during qemu_finish_async_mem_prealloc().
> + */
> + assert(bql_locked());
> + QLIST_INSERT_HEAD(&memset_contexts, context, next);
> + return 0;
> }
>
> if (!use_madv_populate_write) {
> - sigbus_memset_context = &context;
> + sigbus_memset_context = context;
> }
>
> qemu_mutex_lock(&page_mutex);
> - context.all_threads_created = true;
> + context->all_threads_created = true;
> qemu_cond_broadcast(&page_cond);
> qemu_mutex_unlock(&page_mutex);
> + ret = wait_and_free_mem_prealloc_context(context);
>
> - for (i = 0; i < context.num_threads; i++) {
> - int tmp =
> (uintptr_t)qemu_thread_join(&context.threads[i].pgthread);
> + if (!use_madv_populate_write) {
> + sigbus_memset_context = NULL;
> + }
> + return ret;
> +}
> +
> +bool qemu_finish_async_mem_prealloc(Error **errp)
> +{
> + int ret, tmp;
> + MemsetContext *context, *next_context;
> +
> + /* Waiting for preallocation requires the BQL. */
> + assert(bql_locked());
> + if (QLIST_EMPTY(&memset_contexts)) {
> + return 0;
> + }
> +
> + qemu_mutex_lock(&page_mutex);
> + QLIST_FOREACH(context, &memset_contexts, next) {
> + context->all_threads_created = true;
> + }
> + qemu_cond_broadcast(&page_cond);
> + qemu_mutex_unlock(&page_mutex);
>
> + QLIST_FOREACH_SAFE(context, &memset_contexts, next, next_context) {
> + QLIST_REMOVE(context, next);
> + tmp = wait_and_free_mem_prealloc_context(context);
> if (tmp) {
> ret = tmp;
> }
> }
>
> - if (!use_madv_populate_write) {
> - sigbus_memset_context = NULL;
> + if (ret) {
> + error_setg_errno(errp, -ret,
> + "qemu_prealloc_mem: preallocating memory
> failed");
> + return false;
> }
> - g_free(context.threads);
> -
> - return ret;
> + return true;
> }
>
> static bool madv_populate_write_possible(char *area, size_t pagesize)
> @@ -498,7 +566,7 @@ static bool madv_populate_write_possible(char
> *area, size_t pagesize)
> }
>
> bool qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
> - ThreadContext *tc, Error **errp)
> + ThreadContext *tc, bool async, Error **errp)
> {
> static gsize initialized;
> int ret;
> @@ -540,7 +608,7 @@ bool qemu_prealloc_mem(int fd, char *area, size_t
> sz, int max_threads,
> }
>
> /* touch pages simultaneously */
> - ret = touch_all_pages(area, hpagesize, numpages, max_threads, tc,
> + ret = touch_all_pages(area, hpagesize, numpages, max_threads, tc,
> async,
> use_madv_populate_write);
> if (ret) {
> error_setg_errno(errp, -ret,
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index c4a5f05a49..107f0efe37 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -265,7 +265,7 @@ int getpagesize(void)
> }
>
> bool qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
> - ThreadContext *tc, Error **errp)
> + ThreadContext *tc, bool async, Error **errp)
> {
> int i;
> size_t pagesize = qemu_real_host_page_size();
> @@ -278,6 +278,12 @@ bool qemu_prealloc_mem(int fd, char *area, size_t
> sz, int max_threads,
> return true;
> }
>
> +bool qemu_finish_async_mem_prealloc(Error **errp)
> +{
> + /* async prealloc not supported, there is nothing to finish */
> + return true;
> +}
> +
> char *qemu_get_pid_name(pid_t pid)
> {
> /* XXX Implement me */
prev parent reply other threads:[~2024-01-29 23:00 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-22 15:32 [PATCH v2 0/2] Initialize backend memory objects in parallel Mark Kanda
2024-01-22 15:32 ` [PATCH v2 1/2] oslib-posix: refactor memory prealloc threads Mark Kanda
2024-01-22 15:32 ` [PATCH v2 2/2] oslib-posix: initialize backend memory objects in parallel Mark Kanda
2024-01-29 13:39 ` [PATCH v2 0/2] Initialize " Mark Kanda
2024-01-29 13:41 ` David Hildenbrand
2024-01-29 19:11 ` David Hildenbrand
2024-01-29 22:59 ` Mark Kanda [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=93a62006-328c-40f4-a18e-5fbf89cba49f@oracle.com \
--to=mark.kanda@oracle.com \
--cc=berrange@redhat.com \
--cc=david@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).