From: Mark Kanda <mark.kanda@oracle.com>
To: David Hildenbrand <david@redhat.com>, qemu-devel@nongnu.org
Cc: pbonzini@redhat.com, berrange@redhat.com
Subject: Re: [PATCH v2 0/2] Initialize backend memory objects in parallel
Date: Mon, 29 Jan 2024 16:59:10 -0600
Message-ID: <93a62006-328c-40f4-a18e-5fbf89cba49f@oracle.com>
In-Reply-To: <c15161eb-f52c-4a82-8b4b-0ba03842188c@redhat.com>

On 1/29/24 1:11 PM, David Hildenbrand wrote:
> On 22.01.24 16:32, Mark Kanda wrote:
>> v2:
>> - require MADV_POPULATE_WRITE (simplify the implementation)
>> - require prealloc context threads to ensure optimal thread placement
>> - use machine phase 'initialized' to determine when to allow parallel init
>>
>> QEMU initializes preallocated backend memory when parsing the corresponding
>> objects from the command line. In certain scenarios, such as memory being
>> preallocated across multiple NUMA nodes, this approach is not optimal due to
>> the unnecessary serialization.
>>
>> This series addresses this issue by initializing the backend memory objects in
>> parallel.
>
> I just played with the code, some comments:
>
> * I suggest squashing both patches. It doesn't make things clearer if we
>   factor out unconditionally adding contexts to a global list.
>
> * Keep the functions MT-capable, at least as long as async=false. That
>   is, don't involve the global list if async=false. virtio-mem will
>   perform preallocation from other threads at some point, where we could
>   see concurrent preallocations for different devices. I made sure that
>   qemu_mem_prealloc() can handle that.
>
> * Rename wait_mem_prealloc() to qemu_finish_async_mem_prealloc() and let
>   it report the error / return true/false like qemu_prealloc_mem().
>   Especially, don't change the existing
>   "qemu_prealloc_mem: preallocating memory failed" error message.
>
> * Do the conditional async=false fixup in touch_all_pages(). That means,
>   in qemu_prealloc_mem(), only route the async parameter through.
>
>
> One thing I don't quite like is what happens when multiple threads would try
> issuing "async=true". It will currently not happen, but we should catch
> whenever that happens and require that only one thread at a time can
> perform async preallocs. Maybe we can assert in qemu_prealloc_mem()/
> qemu_finish_async_mem_prealloc() that we hold the BQL. Hopefully, that
> is the case when we start creating memory backends, before starting the
> main loop. If not, maybe we should just document that async limitation.
>
> Ideally, we'd have some async_start(), prealloc(), prealloc(),
> async_finish() interface, where async_start() would block until
> another thread called async_finish(), so we never have a mixture.
> But that would currently be over-engineering.
>
>
> I'll attach the untested, likely broken, code I played with to see
> what it could look like. Observe how I only conditionally add the
> context to the list at the end of touch_all_pages().
>
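
For illustration only, the async_start() / prealloc() / async_finish()
interface described above could be sketched with a mutex and a condition
variable, roughly as follows. This is a standalone pthreads sketch and not
QEMU code: AsyncGate, ASYNC_GATE_INIT, async_prealloc() and the queued
counter are hypothetical names, and the sketch does not check that the
thread calling async_finish() is the one that called async_start().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical gate that admits one "async session" at a time. */
typedef struct AsyncGate {
    pthread_mutex_t lock;
    pthread_cond_t idle;   /* signalled when a session ends */
    bool busy;             /* true while some thread owns the session */
    int queued;            /* requests queued in the current session */
} AsyncGate;

#define ASYNC_GATE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0 }

/* Block until no other session is active, then claim the gate. */
static void async_start(AsyncGate *g)
{
    pthread_mutex_lock(&g->lock);
    while (g->busy) {
        pthread_cond_wait(&g->idle, &g->lock);
    }
    g->busy = true;
    g->queued = 0;
    pthread_mutex_unlock(&g->lock);
}

/* Stand-in for queueing one async preallocation within a session. */
static void async_prealloc(AsyncGate *g)
{
    pthread_mutex_lock(&g->lock);
    assert(g->busy);       /* only valid between start and finish */
    g->queued++;
    pthread_mutex_unlock(&g->lock);
}

/* End the session, wake one waiter, return how many requests were queued. */
static int async_finish(AsyncGate *g)
{
    int n;

    pthread_mutex_lock(&g->lock);
    assert(g->busy);
    n = g->queued;
    g->busy = false;
    pthread_cond_signal(&g->idle);
    pthread_mutex_unlock(&g->lock);
    return n;
}
```

A caller would bracket its requests with async_start(&gate) and
async_finish(&gate); a second thread calling async_start() in between
simply waits, so sessions never mix.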

Thank you very much for the feedback, David. I'll take a close look at
this for v3.

Best regards,
-Mark

>
> From fe26cc5252f1284efa8e667310609a22c6166324 Mon Sep 17 00:00:00 2001
> From: Mark Kanda <mark.kanda@oracle.com>
> Date: Mon, 22 Jan 2024 09:32:18 -0600
> Subject: [PATCH] oslib-posix: initialize selected backend memory objects in
>  parallel
>
> QEMU initializes preallocated backend memory as the objects are parsed from
> the command line. This is not optimal in some cases (e.g. memory spanning
> multiple NUMA nodes) because the memory objects are initialized in series.
>
> Allow the initialization to occur in parallel. In order to ensure optimal
> thread placement, parallel initialization requires prealloc context threads
> to be in use.
>
> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  backends/hostmem.c     |   8 ++-
>  hw/virtio/virtio-mem.c |   4 +-
>  include/qemu/osdep.h   |  19 +++++-
>  system/vl.c            |   8 +++
>  util/oslib-posix.c     | 130 +++++++++++++++++++++++++++++++----------
>  util/oslib-win32.c     |   8 ++-
>  6 files changed, 140 insertions(+), 37 deletions(-)
>
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index 30f69b2cb5..8f602dc86f 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -20,6 +20,7 @@
>  #include "qom/object_interfaces.h"
>  #include "qemu/mmap-alloc.h"
>  #include "qemu/madvise.h"
> +#include "hw/qdev-core.h"
>
>  #ifdef CONFIG_NUMA
>  #include <numaif.h>
> @@ -235,9 +236,10 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
>          int fd = memory_region_get_fd(&backend->mr);
>          void *ptr = memory_region_get_ram_ptr(&backend->mr);
>          uint64_t sz = memory_region_size(&backend->mr);
> +        bool async = !phase_check(PHASE_MACHINE_INITIALIZED);
>
>          if (!qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,
> -                               backend->prealloc_context, errp)) {
> +                               backend->prealloc_context, async, errp)) {
>              return;
>          }
>          backend->prealloc = true;
> @@ -323,6 +325,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>      HostMemoryBackendClass *bc = MEMORY_BACKEND_GET_CLASS(uc);
>      void *ptr;
>      uint64_t sz;
> +    bool async = !phase_check(PHASE_MACHINE_INITIALIZED);
>
>      if (!bc->alloc) {
>          return;
> @@ -398,7 +401,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>      if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
>                                                  ptr, sz,
>                                                  backend->prealloc_threads,
> -                                                backend->prealloc_context, errp)) {
> +                                                backend->prealloc_context,
> +                                                async, errp)) {
>          return;
>      }
>  }
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 99ab989852..ffd119ebac 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -605,7 +605,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
>          int fd = memory_region_get_fd(&vmem->memdev->mr);
>          Error *local_err = NULL;
>
> -        if (!qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err)) {
> +        if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, &local_err)) {
>              static bool warned;
>
>              /*
> @@ -1248,7 +1248,7 @@ static int virtio_mem_prealloc_range_cb(VirtIOMEM *vmem, void *arg,
>      int fd = memory_region_get_fd(&vmem->memdev->mr);
>      Error *local_err = NULL;
>
> -    if (!qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err)) {
> +    if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, &local_err)) {
>          error_report_err(local_err);
>          return -ENOMEM;
>      }
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index c9692cc314..ed48f3d028 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -680,6 +680,8 @@ typedef struct ThreadContext ThreadContext;
>   * @area: start address of the are to preallocate
>   * @sz: the size of the area to preallocate
>   * @max_threads: maximum number of threads to use
> + * @tc: prealloc context threads pointer, NULL if not in use
> + * @async: request asynchronous preallocation, requires @tc
>   * @errp: returns an error if this function fails
>   *
>   * Preallocate memory (populate/prefault page tables writable) for the virtual
> @@ -687,10 +689,25 @@ typedef struct ThreadContext ThreadContext;
>   * each page in the area was faulted in writable at least once, for example,
>   * after allocating file blocks for mapped files.
>   *
> + * When setting @async, allocation might be performed asynchronously.
> + * qemu_finish_async_mem_prealloc() must be called to finish any asynchronous
> + * preallocation, reporting any preallocation error.
> + *
>   * Return: true on success, else false setting @errp with error.
>   */
>  bool qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
> -                       ThreadContext *tc, Error **errp);
> +                       ThreadContext *tc, bool async, Error **errp);
> +
> +/**
> + * qemu_finish_async_mem_prealloc:
> + * @errp: returns an error if this function fails
> + *
> + * Wait for any outstanding memory prealloc to complete, reporting any error
> + * like qemu_prealloc_mem() would.
> + *
> + * Return: true on success, else false setting @errp with error.
> + */
> +bool qemu_finish_async_mem_prealloc(Error **errp);
>
>  /**
>   * qemu_get_pid_name:
> diff --git a/system/vl.c b/system/vl.c
> index 788d88ea03..290bb3232b 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -2009,6 +2009,14 @@ static void qemu_create_late_backends(void)
>
>      object_option_foreach_add(object_create_late);
>
> +    /*
> +     * Wait for any outstanding memory prealloc from created memory
> +     * backends to complete.
> +     */
> +    if (!qemu_finish_async_mem_prealloc(&error_fatal)) {
> +        exit(1);
> +    }
> +
>      if (tpm_init() < 0) {
>          exit(1);
>      }
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 7c297003b9..c37548abdc 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -42,6 +42,7 @@
>  #include "qemu/cutils.h"
>  #include "qemu/units.h"
>  #include "qemu/thread-context.h"
> +#include "qemu/main-loop.h"
>
>  #ifdef CONFIG_LINUX
>  #include <sys/syscall.h>
> @@ -63,11 +64,15 @@
>
>  struct MemsetThread;
>
> +static QLIST_HEAD(, MemsetContext) memset_contexts =
> +    QLIST_HEAD_INITIALIZER(memset_contexts);
> +
>  typedef struct MemsetContext {
>      bool all_threads_created;
>      bool any_thread_failed;
>      struct MemsetThread *threads;
>      int num_threads;
> +    QLIST_ENTRY(MemsetContext) next;
>  } MemsetContext;
>
>  struct MemsetThread {
> @@ -412,19 +417,44 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
>      return ret;
>  }
>
> +static int wait_and_free_mem_prealloc_context(MemsetContext *context)
> +{
> +    int i, ret = 0, tmp;
> +
> +    for (i = 0; i < context->num_threads; i++) {
> +        tmp = (uintptr_t)qemu_thread_join(&context->threads[i].pgthread);
> +
> +        if (tmp) {
> +            ret = tmp;
> +        }
> +    }
> +    g_free(context->threads);
> +    g_free(context);
> +    return ret;
> +}
> +
>  static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> -                           int max_threads, ThreadContext *tc,
> +                           int max_threads, ThreadContext *tc, bool async,
>                             bool use_madv_populate_write)
>  {
>      static gsize initialized = 0;
> -    MemsetContext context = {
> -        .num_threads = get_memset_num_threads(hpagesize, numpages, max_threads),
> -    };
> +    MemsetContext *context = g_new0(MemsetContext, 1);
>      size_t numpages_per_thread, leftover;
>      void *(*touch_fn)(void *);
> -    int ret = 0, i = 0;
> +    int ret, i = 0;
>      char *addr = area;
>
> +    /*
> +     * Async prealloc is only allowed when using MADV_POPULATE_WRITE and
> +     * prealloc context (to ensure optimal thread placement).
> +     */
> +    if (!use_madv_populate_write || !tc) {
> +        async = false;
> +    }
> +
> +    context->num_threads = get_memset_num_threads(hpagesize, numpages,
> +                                                  max_threads);
> +
>      if (g_once_init_enter(&initialized)) {
>          qemu_mutex_init(&page_mutex);
>          qemu_cond_init(&page_cond);
> @@ -432,8 +462,11 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      }
>
>      if (use_madv_populate_write) {
> -        /* Avoid creating a single thread for MADV_POPULATE_WRITE */
> -        if (context.num_threads == 1) {
> +        /*
> +         * Avoid creating a single thread for MADV_POPULATE_WRITE when
> +         * preallocating synchronously.
> +         */
> +        if (context->num_threads == 1 && !async) {
>              if (qemu_madvise(area, hpagesize * numpages,
>                               QEMU_MADV_POPULATE_WRITE)) {
>                  return -errno;
> @@ -445,50 +478,85 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>          touch_fn = do_touch_pages;
>      }
>
> -    context.threads = g_new0(MemsetThread, context.num_threads);
> -    numpages_per_thread = numpages / context.num_threads;
> -    leftover = numpages % context.num_threads;
> -    for (i = 0; i < context.num_threads; i++) {
> -        context.threads[i].addr = addr;
> -        context.threads[i].numpages = numpages_per_thread + (i < leftover);
> -        context.threads[i].hpagesize = hpagesize;
> -        context.threads[i].context = &context;
> +    context->threads = g_new0(MemsetThread, context->num_threads);
> +    numpages_per_thread = numpages / context->num_threads;
> +    leftover = numpages % context->num_threads;
> +    for (i = 0; i < context->num_threads; i++) {
> +        context->threads[i].addr = addr;
> +        context->threads[i].numpages = numpages_per_thread + (i < leftover);
> +        context->threads[i].hpagesize = hpagesize;
> +        context->threads[i].context = context;
>          if (tc) {
> -            thread_context_create_thread(tc, &context.threads[i].pgthread,
> +            thread_context_create_thread(tc, &context->threads[i].pgthread,
>                                           "touch_pages",
> -                                         touch_fn, &context.threads[i],
> +                                         touch_fn, &context->threads[i],
>                                           QEMU_THREAD_JOINABLE);
>          } else {
> -            qemu_thread_create(&context.threads[i].pgthread, "touch_pages",
> -                               touch_fn, &context.threads[i],
> +            qemu_thread_create(&context->threads[i].pgthread, "touch_pages",
> +                               touch_fn, &context->threads[i],
>                                 QEMU_THREAD_JOINABLE);
>          }
> -        addr += context.threads[i].numpages * hpagesize;
> +        addr += context->threads[i].numpages * hpagesize;
> +    }
> +
> +    if (async) {
> +        /*
> +         * async requests currently require the BQL. Add it to the list and kick
> +         * preallocation off during qemu_finish_async_mem_prealloc().
> +         */
> +        assert(bql_locked());
> +        QLIST_INSERT_HEAD(&memset_contexts, context, next);
> +        return 0;
>      }
>
>      if (!use_madv_populate_write) {
> -        sigbus_memset_context = &context;
> +        sigbus_memset_context = context;
>      }
>
>      qemu_mutex_lock(&page_mutex);
> -    context.all_threads_created = true;
> +    context->all_threads_created = true;
>      qemu_cond_broadcast(&page_cond);
>      qemu_mutex_unlock(&page_mutex);
> +    ret = wait_and_free_mem_prealloc_context(context);
>
> -    for (i = 0; i < context.num_threads; i++) {
> -        int tmp = (uintptr_t)qemu_thread_join(&context.threads[i].pgthread);
> +    if (!use_madv_populate_write) {
> +        sigbus_memset_context = NULL;
> +    }
> +    return ret;
> +}
> +
> +bool qemu_finish_async_mem_prealloc(Error **errp)
> +{
> +    int ret = 0, tmp;
> +    MemsetContext *context, *next_context;
> +
> +    /* Waiting for preallocation requires the BQL. */
> +    assert(bql_locked());
> +    if (QLIST_EMPTY(&memset_contexts)) {
> +        return true;
> +    }
> +
> +    qemu_mutex_lock(&page_mutex);
> +    QLIST_FOREACH(context, &memset_contexts, next) {
> +        context->all_threads_created = true;
> +    }
> +    qemu_cond_broadcast(&page_cond);
> +    qemu_mutex_unlock(&page_mutex);
>
> +    QLIST_FOREACH_SAFE(context, &memset_contexts, next, next_context) {
> +        QLIST_REMOVE(context, next);
> +        tmp = wait_and_free_mem_prealloc_context(context);
>          if (tmp) {
>              ret = tmp;
>          }
>      }
>
> -    if (!use_madv_populate_write) {
> -        sigbus_memset_context = NULL;
> +    if (ret) {
> +        error_setg_errno(errp, -ret,
> +                         "qemu_prealloc_mem: preallocating memory failed");
> +        return false;
>      }
> -    g_free(context.threads);
> -
> -    return ret;
> +    return true;
>  }
>
>  static bool madv_populate_write_possible(char *area, size_t pagesize)
> @@ -498,7 +566,7 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
>  }
>
>  bool qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
> -                       ThreadContext *tc, Error **errp)
> +                       ThreadContext *tc, bool async, Error **errp)
>  {
>      static gsize initialized;
>      int ret;
> @@ -540,7 +608,7 @@ bool qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
>      }
>
>      /* touch pages simultaneously */
> -    ret = touch_all_pages(area, hpagesize, numpages, max_threads, tc,
> +    ret = touch_all_pages(area, hpagesize, numpages, max_threads, tc, async,
>                            use_madv_populate_write);
>      if (ret) {
>          error_setg_errno(errp, -ret,
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index c4a5f05a49..107f0efe37 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -265,7 +265,7 @@ int getpagesize(void)
>  }
>
>  bool qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
> -                       ThreadContext *tc, Error **errp)
> +                       ThreadContext *tc, bool async, Error **errp)
>  {
>      int i;
>      size_t pagesize = qemu_real_host_page_size();
> @@ -278,6 +278,12 @@ bool qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
>      return true;
>  }
>
> +bool qemu_finish_async_mem_prealloc(Error **errp)
> +{
> +    /* async prealloc not supported, there is nothing to finish */
> +    return true;
> +}
> +
>  char *qemu_get_pid_name(pid_t pid)
>  {
>      /* XXX Implement me */
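
As a small worked example of how touch_all_pages() in the patch above
divides the area between threads: each thread gets numpages / num_threads
pages, and the first numpages % num_threads threads get one extra. The
helpers below are hypothetical standalone functions written only to show
that arithmetic; they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Pages handled by thread i, mirroring numpages_per_thread + (i < leftover). */
static size_t pages_for_thread(size_t numpages, size_t num_threads, size_t i)
{
    assert(num_threads > 0 && i < num_threads);
    return numpages / num_threads + (i < numpages % num_threads);
}

/* Byte offset from the start of the area at which thread i begins. */
static size_t offset_for_thread(size_t hpagesize, size_t numpages,
                                size_t num_threads, size_t i)
{
    size_t off = 0;

    /* Sum the sizes of all earlier threads' chunks. */
    for (size_t j = 0; j < i; j++) {
        off += pages_for_thread(numpages, num_threads, j) * hpagesize;
    }
    return off;
}
```

With 10 pages and 4 threads the chunks are 3, 3, 2 and 2 pages, so with
4096-byte pages thread 2 starts at byte offset 24576.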

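The deferred-start pattern used by the draft patch above (create the worker
threads, park the context on a list, then release and join everything from
one "finish" call) can be sketched outside QEMU with plain pthreads. This is
only a sketch under stated assumptions: Batch, Worker, batch_queue() and
batch_finish_all() are hypothetical names, the workers do no real page
touching and just return a preset status, allocation failures are not
handled, and the single-caller requirement that the patch enforces through
the BQL is assumed rather than checked.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Batch Batch;

typedef struct Worker {
    pthread_t thread;
    Batch *batch;
    int result;          /* preset status: 0 or a negative errno value */
} Worker;

struct Batch {
    bool released;       /* plays the role of all_threads_created */
    Worker *workers;
    int num_workers;
    Batch *next;         /* pending list, like memset_contexts */
};

static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gate_cond = PTHREAD_COND_INITIALIZER;
static Batch *pending;   /* only touched by the single controlling thread */

static void *worker_fn(void *opaque)
{
    Worker *w = opaque;

    /* Wait until the controlling thread releases this batch. */
    pthread_mutex_lock(&gate_lock);
    while (!w->batch->released) {
        pthread_cond_wait(&gate_cond, &gate_lock);
    }
    pthread_mutex_unlock(&gate_lock);

    /* Real code would populate its share of the pages here. */
    return (void *)(intptr_t)w->result;
}

/* Create n parked workers and add their batch to the pending list. */
static void batch_queue(const int *results, int n)
{
    Batch *b = calloc(1, sizeof(*b));

    assert(n > 0);
    b->workers = calloc(n, sizeof(*b->workers));
    b->num_workers = n;
    for (int i = 0; i < n; i++) {
        b->workers[i].batch = b;
        b->workers[i].result = results[i];
        if (pthread_create(&b->workers[i].thread, NULL, worker_fn,
                           &b->workers[i]) != 0) {
            abort();
        }
    }
    b->next = pending;
    pending = b;
}

/* Release all pending batches, join every worker, free the batches. */
static int batch_finish_all(void)
{
    int ret = 0;         /* stays 0 when nothing is pending or nothing fails */

    pthread_mutex_lock(&gate_lock);
    for (Batch *b = pending; b; b = b->next) {
        b->released = true;
    }
    pthread_cond_broadcast(&gate_cond);
    pthread_mutex_unlock(&gate_lock);

    while (pending) {
        Batch *b = pending;

        pending = b->next;
        for (int i = 0; i < b->num_workers; i++) {
            void *status;

            pthread_join(b->workers[i].thread, &status);
            if (status) {
                ret = (int)(intptr_t)status;
            }
        }
        free(b->workers);
        free(b);
    }
    return ret;
}
```

Note that the status accumulator starts at zero and the empty-list case
reports success; the same two details matter for the finish function in the
patch, which reports failure to its caller through its return value.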

